MATLAB: generate a normal random sample with outliers

Hi! I need help with a problem.
I want to:
  1. create a random normal sample
  2. choose one random index from 1 to 20, so that exactly one element in every sample is increased by a factor of 10-12
  3. find the element at this index and multiply it by 10-12
  4. then use the bootstrap function in MATLAB to evaluate the mean and median of the sample
  5. store every sample in one cell
  6. repeat steps 1-5 for every cell, and finally get all_y and all_stats
clear
clc
clf
close all
format long
warning('off','all')
location = 17;
scale = 1;
num_samples = 20;
num_bootstraps = 1;
y = cell(1, num_samples);
stats = cell(1, num_samples);
for i = 1:num_samples
% Generate a random normal sample
sample = normrnd(location, scale, [1, num_samples]);
% Choose a random index
idx = randi([1, num_samples]);
% Increase the element at the chosen index by 10-12 times
sample(idx) = sample(idx) * randi([10, 12]);
y{i} = sample;
% Perform bootstrap resampling to calculate mean and median
bootstrap_means = zeros(num_bootstraps, 1);
bootstrap_medians = zeros(num_bootstraps, 1);
for j = 1:num_bootstraps
% Resample with replacement
resampled_data = randsample(sample, num_samples, true);
% Calculate mean and median for the resampled data
bootstrap_means(j) = mean(resampled_data);
bootstrap_medians(j) = median(resampled_data);
end
stats{i} = [bootstrap_means, bootstrap_medians];
end
% Combine all samples into one array
all_y = cat(1, y{:});
% Combine all statistics (mean and median) into one array
all_stats = cat(1, stats{:});
% Combine the original samples, means, and medians into a single dataset
data_3 = [all_y, all_stats];
This code does not work correctly: in some samples no outlier is found, and some samples contain more than one outlier. How can I solve this problem?
The code for my attempt at solving the problem is given above.

Accepted Answer

Steven Lord on 9 Nov 2023
warning('off','all')
Seeing this in the code smells bad.
If you want to select a different element each iteration (and have enough elements in your vector where that's possible), generate a list of which element will be replaced before entering your loop using the randperm function and then index into that list to determine which element to replace at each iteration.
n = 10;
r = randperm(20, n);
for k = 1:n
fprintf("At iteration k = %d, replace element %d.\n", k, r(k))
end
At iteration k = 1, replace element 8.
At iteration k = 2, replace element 2.
At iteration k = 3, replace element 9.
At iteration k = 4, replace element 15.
At iteration k = 5, replace element 7.
At iteration k = 6, replace element 1.
At iteration k = 7, replace element 11.
At iteration k = 8, replace element 13.
At iteration k = 9, replace element 10.
At iteration k = 10, replace element 4.
If you ask for more elements of the random permutation than are available, MATLAB will throw an error.
v = randperm(20, 21);
Error using randperm
K must be less than or equal to N.
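Applied to the setup in the question, the randperm idea might be sketched as follows. This is only an illustration under some assumptions: each sample gets its outlier at a distinct position, randn is used in place of normrnd (equivalent for this purpose), and the mean and median are computed directly on each contaminated sample rather than on a resample. The names outlier_idx and num_large are illustrative, not from the original code.

```matlab
location = 17;
scale = 1;
num_samples = 20;   % number of samples, and also the length of each sample

% Draw the outlier position for every sample before the loop. randperm
% uses each position 1..num_samples exactly once (assumption: distinct
% positions are wanted).
outlier_idx = randperm(num_samples, num_samples);

all_y = location + scale * randn(num_samples, num_samples);  % one sample per row
all_stats = zeros(num_samples, 2);                           % columns: [mean, median]

for k = 1:num_samples
    % Multiply exactly one element of sample k by 10, 11 or 12.
    all_y(k, outlier_idx(k)) = all_y(k, outlier_idx(k)) * randi([10, 12]);
    all_stats(k, :) = [mean(all_y(k, :)), median(all_y(k, :))];
end

% Count the elements above 100 in each row; with these parameters only
% the inflated element can be that large.
num_large = sum(all_y > 100, 2);
```

With location = 17 and scale = 1, every row of all_y should contain exactly one value above 100, so num_large is a quick check that each sample has one and only one outlier.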

More Answers (1)

dpb on 9 Nov 2023
The funny thing about random sampling is that it is, well, random.
You're resampling with replacement so it is possible to pick the same sample more than once. The likelihood of that is going to be heavily dependent upon the size of the sample population; a sample size of 20 is not very many so it's almost guaranteed you will have such occur. On the other hand, it's also possible the particular index of the outlier isn't going to be in the resampled index vector.
num_samples = 20;
N = 10;
x = repmat(1:num_samples, N, 1); % N index vectors, each 1:num_samples
r = cell2mat(arrayfun(@(i) randsample(x(i,:), num_samples, true), (1:N).', 'uni', 0)); % resampled with replacement
nnz(arrayfun(@(i) numel(unique(r(i,:))), 1:N).' < num_samples) % how many resamples had duplicated indices
ans = 10
They all had duplicated indices, so every case had at least one sample left out. On the other hand, if N total samples are returned and at least one isn't chosen, then at least one other must have been duplicated. There's a 1-in-N chance that any given draw picks your outlier; by the time you draw N times, the odds are pretty good that it shows up at least once. I'll let you calculate that... :)
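For anyone who wants the numbers, here is a rough sketch of that calculation. It assumes a resample of n draws with replacement from n elements, exactly one of which is the outlier, so the number of times the outlier appears is binomial with n trials and success probability 1/n. The Monte Carlo check and its trial count are arbitrary additions for illustration.

```matlab
n = 20;                                    % sample size, as in the question

% Each of the n draws misses the outlier with probability 1 - 1/n.
p_missing  = (1 - 1/n)^n;                  % outlier absent from the resample
p_once     = n * (1/n) * (1 - 1/n)^(n-1);  % outlier appears exactly once
p_multiple = 1 - p_missing - p_once;       % outlier appears two or more times

fprintf("absent: %.3f, once: %.3f, more than once: %.3f\n", ...
    p_missing, p_once, p_multiple)

% Monte Carlo check (the number of trials is an arbitrary choice).
num_trials = 1e5;
draws = randi(n, num_trials, n);   % each row is one resample of the indices 1..n
counts = sum(draws == 1, 2);       % treat index 1 as the outlier
sim_missing = mean(counts == 0);   % simulated fraction of resamples without it
```

For n = 20 this gives roughly 0.36 for "absent", 0.38 for "exactly once" and 0.26 for "more than once", which matches the symptoms described in the question: some resamples with no outlier and some with several copies of it.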
dpb on 9 Nov 2023
You already generated a sample with an extreme value; it's not clear why you are resampling if that was the intent.
With the magnitude of the offset you're introducing in comparison to the variance, it is virtually guaranteed that the element will be identified as an outlier by any common test statistic.
It's also possible some other extreme element could fall outside such test statistic and be indicated as such, but the likelihood would be pretty small with those parameters. As the sample variance increases in relation to the introduced bias magnitude, that probability of more than one element being identified as an outlier will increase.
MATLAB has a builtin isoutlier function which, by default, flags elements more than three scaled median absolute deviations (MAD) from the sample median; I'd recommend perusing the NIST site for a discussion of testing for outliers.
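As a small illustration, isoutlier can be run on one contaminated sample built the same way as in the question. This is a sketch, with randn standing in for normrnd; the names flagged and flagged_idx are illustrative.

```matlab
location = 17;
scale = 1;
n = 20;

% One normal sample with a single element multiplied by 10, 11 or 12.
sample = location + scale * randn(1, n);
idx = randi(n);
sample(idx) = sample(idx) * randi([10, 12]);

% Default method is 'median': flag values more than three scaled MAD
% from the median.
flagged = isoutlier(sample);
flagged_idx = find(flagged);   % positions reported as outliers
```

With these parameters flagged(idx) should be true. Occasionally flagged_idx may contain an additional position, because an ordinary draw can also land more than three scaled MAD from the median.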

Release

R2023a
