How to detect noisy data / outliers?
Hi
I have trouble distinguishing between a completely noisy data set and a data set that contains some outliers. I am trying to do a sort of Grid Independence Check (a term used in Computational Fluid Dynamics for finding the grid size at which the numerical results become independent of the grid/network size) for my data set. I introduce this term only to explain my case better: I have a model that outputs a set of numerical data. Changing the step size in the model changes the number of output values (obviously). However, the output contains noisy points, which should be removed before checking for grid independence (or step size independence). An example output from the model can look like the following image.

These output data points are "independent" of the step size, because the overall data (after removing the outliers) follow a linear relation regardless of the step. However, this can only be detected after the outlier points are removed; otherwise no robust criterion for checking the linear relation at various step sizes would be available (here, I use Pearson/Spearman correlations). Removing outliers from the data set shown above can be done with the MATLAB built-in function stdfilt() or simply:
rmvIdx = (abs(DataPoints- median(DataPoints)) > N*mad(DataPoints));
DataPoints(rmvIdx) = [];
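For readers who want to try the rule above, here is a minimal sketch on synthetic data. The data, the threshold N = 3, and the 1.4826 scaling factor are assumptions for illustration, not values from the post. Note that in the Statistics and Machine Learning Toolbox, mad(X) returns the mean absolute deviation by default and mad(X,1) the median absolute deviation, so the sketch computes the median version by hand with base MATLAB functions.

```matlab
% Synthetic example (hypothetical data, not the poster's model output).
rng(0);                                   % reproducible noise
DataPoints = 5 + 0.1*randn(200, 1);       % well-behaved values around 5
DataPoints([20 75 150]) = [9; -2; 12];    % three planted outliers

N = 3;                                    % threshold in robust-sigma units (assumed)
med = median(DataPoints);
% Median absolute deviation, scaled so that it estimates the standard
% deviation when the bulk of the data is normally distributed.
robustSigma = 1.4826 * median(abs(DataPoints - med));

rmvIdx = abs(DataPoints - med) > N * robustSigma;   % logical mask of flagged points
Cleaned = DataPoints(~rmvIdx);                      % data with flagged points removed
```

Because the rule measures distance from a single global median, it treats a trend as spread. For data that follow a line, applying the same rule to residuals from a fitted line may work better; that is a suggestion, not something stated in the post.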
Nevertheless, both approaches miss some data points and cannot "completely" detect the outliers. Therefore, my first question is: how can I fully detect the outliers in the depicted image? I tried several variations to detect them fully, e.g.:
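% First remove distant outliers with N = 3
N = 3;
rmvIdx = abs(DataPoints - median(DataPoints)) > N*mad(DataPoints);
DataPoints(rmvIdx) = [];
% Remove using local SD (stdfilt requires the Image Processing Toolbox)
Local_std = stdfilt(DataPoints);
% Zero out local SDs close to the most common value; nonzero entries mark outliers
Local_std(abs(Local_std - mode(Local_std)) < 1e-4) = 0;
DataPoints(Local_std ~= 0) = [];   % logical mask of points with atypical local SD
% Smoothing data as a further step (smooth requires the Curve Fitting Toolbox)
DataPoints = smooth(DataPoints, 0.6);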
The above approach finds more outliers; however, this "strict" search for outliers may cause issues with other data sets, such as:

As shown, this data set is "completely noisy" (compared to the previous image); however, the above-mentioned approach to detecting outliers will erroneously report a linear relationship (Pearson/Spearman R > 0.9-0.99) for small step sizes, simply because at small step sizes the noise may be damped by the approach I took. Thus the second question is: how can I detect completely noisy data sets, especially when I use that "strict" approach to find outliers? How should I trade off between these different but highly dependent cases?
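As a reference for the Pearson/Spearman check mentioned above, here is a minimal sketch that computes both coefficients with base MATLAB only. The data are hypothetical, and the sort-based ranking assumes there are no tied values. Spearman's coefficient is computed as the Pearson correlation of the ranks.

```matlab
% Hypothetical data: an exact line with one gross outlier.
x = (1:50)';
y = 2*x + 1;
y(10) = 500;                      % single planted outlier

% Pearson correlation via corrcoef (base MATLAB).
Rp = corrcoef(x, y);
pearsonR = Rp(1, 2);

% Spearman correlation = Pearson correlation of the ranks.
% Ranking by sort position is only valid when there are no ties.
[~, order] = sort(y);
ranksY = zeros(size(y));
ranksY(order) = (1:numel(y))';
Rs = corrcoef(x, ranksY);         % x = 1:50 is already its own rank vector
spearmanR = Rs(1, 2);

fprintf('Pearson R = %.3f, Spearman R = %.3f\n', pearsonR, spearmanR);
```

In this example the single outlier pulls the Pearson coefficient down sharply while the Spearman coefficient stays above 0.9, which is one reason the two are worth reporting side by side.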
Thanks in advance.
Accepted Answer
More Answers (1)
Image Analyst
on 24 Jun 2017
Edited: Image Analyst
on 24 Jun 2017
1 vote

You might also try deleteoutliers() by Mathworker Brett Shoelson: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers
3 Comments
Ive J
on 24 Jun 2017
Image Analyst
on 24 Jun 2017
I don't know why you say "however" as if that rules out RANSAC. To distinguish between data sets that are complete noise and those with a few outliers, why can't you just look at the percentage of data points that lie on the RANSAC line? Complete noise would have a very low percentage, like 5% or so, while data with a line buried in there somewhere would have a higher percentage, like 40% or even higher with "data sets containing few (or detectable) noise and outliers". I don't see any reason why RANSAC can't still be used.
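The inlier-percentage idea can be sketched as follows. This is a bare-bones RANSAC-style line search written for illustration, not a toolbox function and not necessarily the implementation the commenter had in mind. The synthetic data, the residual tolerance tol, and the number of trials nTrials are all assumptions.

```matlab
rng(1);                                    % reproducible example
n = 200;
x = linspace(0, 10, n)';

% Data set 1 (hypothetical): a line with small noise, 20% gross outliers.
yLine = 2*x + 1 + 0.05*randn(n, 1);
bad = randperm(n, 40);
yLine(bad) = 25*rand(40, 1);

% Data set 2 (hypothetical): pure noise over the same range.
yNoise = 25*rand(n, 1);

tol = 0.3;        % max |residual| for a point to count as an inlier (assumed)
nTrials = 500;    % number of random two-point samples (assumed)

datasets = {yLine, yNoise};
inlierFraction = zeros(1, 2);
for d = 1:2
    y = datasets{d};
    bestCount = 0;
    for k = 1:nTrials
        idx = randperm(n, 2);              % two distinct points define a candidate line
        slope = (y(idx(2)) - y(idx(1))) / (x(idx(2)) - x(idx(1)));
        intercept = y(idx(1)) - slope * x(idx(1));
        % Count points whose vertical distance to the candidate line is within tol.
        count = sum(abs(y - (slope*x + intercept)) <= tol);
        bestCount = max(bestCount, count);
    end
    inlierFraction(d) = bestCount / n;     % share of points on the best line found
end
```

A data set would then be classified by comparing inlierFraction against a cutoff chosen for the application. The 5% and 40% figures in the comment are rough guides, and the result depends strongly on tol.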
Ive J
on 24 Jun 2017