How to put zero or NaN instead of rejecting my data in a Chauvenet script

Hello, my task is to detect outliers in a large dataset using Chauvenet's criterion. The Chauvenet test says: a reading may be rejected if the probability of obtaining its particular deviation from the mean is less than 1/(2n). In other words, it computes the probability of each data point's deviation and rejects the point from the list if the deviation is too large. My question is how to replace the bad data with 0 or NaN rather than reject it.
I have the following script:
function [ data_bio2, data_percent_rejected, data_cv ] = chauvenet( x )
% remove zero entries
data_zeros=find(x==0.0);
data_nonzeros=find(x>0.0);
data_bio2 = x(data_nonzeros);
% compute length, mean, std, min max of non-zero data
data_length2=length(data_bio2); %
data_mean2 =mean(data_bio2); %
data_standard2 = std(data_bio2); %
data_max2 = max(data_bio2); %
data_min2 = min(data_bio2); %
% Part three - Identify outliers using Chauvenet's criterion
% Z-score data and compute two-sided Z-score for Chauvenet's criterion
data_probability = 1/(2*length(data_nonzeros)); %
data_zscore = (data_bio2 - data_mean2)/(data_standard2);
data_ptest = 1 - data_probability/2;
zc=norminv(data_ptest, 0, 1);
% Hence, reject data with |biomass - mean| > std*zc
data_limit = zc * data_standard2;
data_cv = data_bio2( data_zscore >= -zc & data_zscore <= zc );
data_cvlength = length(data_cv);
index_rejected = find(data_zscore > zc | data_zscore < -zc);
%!!! index_rejected: these are the indices of the rejected values in your data vector
data_rejected = data_bio2(data_zscore > zc | data_zscore < -zc)
index_rejected_original = data_nonzeros(index_rejected); %!!!FLAG THOSE LINES!!!
biomass_rejected_original = x(index_rejected_original); % was data_bio, which is undefined here; x is the original input
%!!!index/biomass_rejected_original: these are the lines/biomasses
%of your original data file that need to be flagged
% percent of data rejected by Chauvenet's criterion
data_percent_rejected = (1- data_cvlength/length(data_bio2))* 100
% compute histogram using linear bin-size
[M,Y]=hist(data_bio2,1000);
[M_cv]=hist(data_cv,Y);
end
So, how can I change the script to put zero or NaN in place of my bad data instead of removing it from the list? Thank you in advance!

 Accepted Answer

If I understand your code correctly, this will replace your ‘data_rejected’ selections with NaN:
data_bio2(index_rejected) = NaN;
I would replace them with NaN instead of zero because zero could enter into your calculations and be considered a valid number. NaN will not be considered a valid number.
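One caveat: data_bio2 has already had the zero entries removed, so it is shorter than the input x. If the goal is to keep the original length, the rejected indices can be mapped back through data_nonzeros, the same way the script builds index_rejected_original. A minimal sketch, assuming x is a numeric vector; the function name chauvenet_nan is hypothetical, and erfc is used in place of norminv only so the sketch does not need the Statistics Toolbox (the threshold is meant to be equivalent):

```matlab
function x_flagged = chauvenet_nan(x)
% Hypothetical helper: returns a copy of x with Chauvenet outliers set to NaN.
% Entries that fail x > 0 are left untouched and excluded from the statistics,
% mirroring the filter in the original script.
data_nonzeros = find(x > 0);
data_bio2     = x(data_nonzeros);
n             = numel(data_bio2);

% Z-score of each retained sample
data_zscore = (data_bio2 - mean(data_bio2)) / std(data_bio2);

% Two-sided probability of a deviation at least this large (normal model)
p_two_sided = erfc(abs(data_zscore) / sqrt(2));

% Chauvenet: flag a reading when that probability is below 1/(2n)
index_rejected = find(p_two_sided < 1/(2*n));

% Map back to positions in the original vector and overwrite with NaN
x_flagged = x;
x_flagged(data_nonzeros(index_rejected)) = NaN;
end
```

The output has the same number of elements as x, so row positions still line up with the original data file.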

4 Comments

But with some operations or functions (such as mean()), having a NaN in there will make the result NaN, rather than giving a valid result with the NaNs ignored.
Agreed, but we don’t know how the data will be used. It would be easier to ignore the NaN values by eliminating them first and doing the calculations on the remaining data. That’s all the Statistics Toolbox nanmean and related functions do. Keeping them as NaN values doesn’t risk inadvertently including them in calculations as zero would do, leading to erroneous results.
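To illustrate the point about mean(), here is a small sketch on a made-up vector v. The 'omitnan' flag is available in base MATLAB from R2015a as far as I know; on older releases the logical-index form or nanmean does the same job:

```matlab
v = [1 2 NaN 4];                  % made-up example data

m_plain  = mean(v);               % NaN, because one element is NaN
m_manual = mean(v(~isnan(v)));    % drop the NaNs first, then average
m_flag   = mean(v, 'omitnan');    % same result without the explicit mask
% m_tbx  = nanmean(v);            % Statistics Toolbox equivalent
```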
Thank you very much, guys! Yes, you are right, it would be better to put NaN instead of zeros! The goal is to keep the size of the dataset and replace erroneous data with some kind of pertinent value, like the mean or a linear interpolation across the NaNs.
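For that last step, a possible sketch of both fill strategies on a made-up vector v, assuming the NaNs mark the rejected readings. Note that interp1 without extrapolation returns NaN for points outside the range of valid samples, so leading or trailing NaNs would stay NaN with this approach:

```matlab
v = [1 2 NaN 4 NaN NaN 7];        % made-up data, NaN marks rejected readings

good = ~isnan(v);                 % logical mask of valid readings
idx  = 1:numel(v);                % sample positions

% Option 1: fill every NaN with the mean of the valid readings
v_mean = v;
v_mean(~good) = mean(v(good));

% Option 2: linear interpolation between neighbouring valid readings
v_lin = v;
v_lin(~good) = interp1(idx(good), v(good), idx(~good), 'linear');
```

On R2016b or later, fillmissing(v, 'linear') should be a shorter way to get the interpolated version.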
