How to detect noisy data / outliers?

Question

0 votes

Hi

I am having trouble distinguishing between data that is completely noisy and data that merely contains some outliers. I am trying to do a sort of grid independence check for my data set (a term used in Computational Fluid Dynamics for identifying the grid size at which the numerical results become independent of the grid/network size). I introduce the term only to explain my case better: I have a model that outputs a set of numerical data. Changing the step size in the model changes the size of the numerical output (obviously). However, the output contains noisy data that should be removed before checking for grid independence (or step size independence). An example output from the model looks similar to the following image.

These output data points are "independent" of the step size, because the overall data (after removing the outliers) follow a linear relation regardless of the step size. However, this can only be detected after the outlier points are removed; otherwise no robust criterion would be available to check the linear relation for various step sizes (here, I use Pearson/Spearman correlations). Removing outliers from the data set shown above can be achieved with the MATLAB built-in function stdfilt() or simply:

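N = 3;  % threshold multiplier (N = 3 is the value used further below)
% Note: mad(X) is the mean absolute deviation; mad(X,1) gives the median absolute deviation
rmvIdx = abs(DataPoints - median(DataPoints)) > N*mad(DataPoints);
DataPoints(rmvIdx) = [];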

Nevertheless, both approaches miss some data points and cannot "completely" detect the outliers. Therefore, my first question is: how can I fully detect the outliers in the depicted image? I tried several variations to detect them fully, e.g.:

% First remove distant outliers with N = 3
N = 3;
rmvIdx = abs(DataPoints - median(DataPoints)) > N*mad(DataPoints);
DataPoints(rmvIdx) = [];
% Remove using local SD
Local_std = stdfilt(DataPoints);
Local_std(abs(Local_std - mode(Local_std)) < 1e-4) = 0;
DataPoints(Local_std ~= 0) = [];  % logical index: drop points whose local SD differs from the mode
% Smoothing data as a further step
DataPoints = smooth(DataPoints, 0.6);

The above approach finds more outliers; however, this "strict" search for outliers may cause issues with other data sets, such as:

As shown, this data set is "completely noisy" (compared to the previous image); however, the above-mentioned approach erroneously detects a linear relationship (Pearson/Spearman R > 0.9-0.99) for small step sizes (simply because at small step sizes the noisy data may be damped by the approach I took). Thus the second question is: how can I detect completely noisy data sets, especially when I use that "strict" approach to find outliers? How do I trade off between these different but highly dependent cases?

Thanks in advance.


Answer 1

John D'Errico on 24 Jun 2017

Edited: John D'Errico on 24 Jun 2017

1 vote

Outliers are rare events that do not follow the same distribution as the regular noise in your data. You can think of data with low noise plus some outliers as coming from a mixture distribution: most of the time you get the base (small) noise, but some of the time you get an outlier.

So the first curve has outliers in it, almost too many to be considered rare, but I will concede they can be called outliers. The second curve is just heavily corrupted with large noise. The outliers are the noise there. Outlier detection/correction schemes will fail on the second curve, since it is all outliers.

Next, the second curve appears to have heteroscedastic noise. The noise parameters are not constant along the curve; the standard deviation appears to be roughly proportional to the signal. The noise also appears to come from a distribution that is not continuous. Only discrete levels seem to arise.

I also notice that most of your outliers seem to lie on the low side of that curve, but not all of them. A good scheme will take this fact into account.

Schemes to detect outliers perfectly will be rare and hard to come by. In fact, you will never find a perfect scheme to do so. And the latter mess of a curve you show will never be easily fit very well.

I might also question if possibly the noise might all come from ONE distribution, with very fat tails, but a lot of mass right near zero.

You will be best served if you choose a scheme that deals with the problems I mentioned above, starting with the heteroscedasticity, and the skewed distribution of the outliers, then possibly the apparently discrete nature of the noise.
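As a rough illustration of handling signal-proportional noise before flagging outliers, here is a minimal sketch on synthetic data. Everything in it is an assumption made for the example rather than something taken from the thread: the straight-line trend, the Theil-Sen slope used as the robust trend estimate, the relative residual as the rescaling, and the cutoff of 3. The constant 1.4826 makes the median absolute deviation comparable to a standard deviation for normally distributed data.

```matlab
rng(0);                                   % reproducible synthetic data
n = 200;
x = linspace(1, 10, n)';
y = 2*x .* (1 + 0.02*randn(n, 1));        % noise SD proportional to the signal
idx = randperm(n, 20);
y(idx) = y(idx) .* (0.3 + 0.4*rand(20, 1));   % 20 low-side outliers

% Robust trend: Theil-Sen slope = median of all pairwise slopes
[I, J] = find(triu(ones(n), 1));          % all index pairs with I < J
b = median((y(J) - y(I)) ./ (x(J) - x(I)));
a = median(y - b*x);                      % robust intercept
fitted = a + b*x;

% Rescale residuals by the fitted value so the noise level is comparable along x
relRes = (y - fitted) ./ abs(fitted);
center = median(relRes);
scale  = 1.4826 * median(abs(relRes - center));   % scaled median absolute deviation
isOut  = abs(relRes - center) > 3*scale;          % assumed cutoff of 3
```

If the outliers are known to fall mostly on one side, a one-sided test such as relRes - center < -3*scale might be worth trying instead; that is a judgment call that depends on the data.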

Finally, what is the nature of the underlying curve here? Is it a straight line? What can we assume about that curve?

Were it my problem to solve, I think you have noise that is non-standard. A simple scheme won't be sufficient. I would be thinking along the lines of maximum likelihood estimation of a model, using a mixture distribution, thus some outliers, some just simple noise. Or I might look for a noise distribution with long tails. Again, MLE will be needed to model the process. Or, possibly you might have success by a simple linear rescaling of the entire curve before you try to model it at all.
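To make the long-tailed option concrete, the following is a minimal sketch of a maximum likelihood line fit with Cauchy-distributed noise, using only fminsearch. The data are synthetic, and the straight-line model, the choice of the Cauchy distribution, and the starting point from polyfit are all assumptions made for illustration; a mixture model would need more work than this.

```matlab
rng(1);                                   % reproducible synthetic data
n = 200;
x = linspace(0, 10, n)';
y = 1 + 2*x + 0.1*randn(n, 1);            % line with small Gaussian noise
idx = randperm(n, 30);
y(idx) = y(idx) - (5 + 10*rand(30, 1));   % 30 large low-side outliers

% Ordinary least squares as a starting point (pulled toward the outliers)
p0 = polyfit(x, y, 1);                    % [slope, intercept]
r0 = y - polyval(p0, x);
theta0 = [p0(2); p0(1); log(std(r0))];    % [intercept; slope; log(scale)]

% Negative log-likelihood for residuals r with Cauchy scale s:
%   -log f(r) = log(pi) + log(s) + log(1 + (r/s)^2)
negLogLik = @(t) sum(log(pi) + t(3) + ...
    log(1 + ((y - t(1) - t(2)*x) ./ exp(t(3))).^2));

opts  = optimset('MaxFunEvals', 5000, 'MaxIter', 5000);
theta = fminsearch(negLogLik, theta0, opts);

interceptHat = theta(1);
slopeHat     = theta(2);
scaleHat     = exp(theta(3));             % fitted Cauchy scale
```

Because the Cauchy density has heavy tails, distant points contribute little to the fit, so slopeHat should land much closer to the underlying line than the least-squares slope in p0(1).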

You also need to consider how much energy you are willing to invest in solving the problem. A good solution will take some effort and time to develop. A simple scheme might work some of the time, but it won't be as good. The standard rule applies in almost anything you do: {Good, fast, cheap}. Take 2.

If you want better help, you would want to post some real data. A picture is nice, but data is better.

2 Comments

Ive J on 24 Jun 2017

Dear John,

I greatly appreciate your detailed answer. It is full of points I should consider. I think I should test for heteroscedasticity (maybe with the White test or a similar one) to decide which data sets are completely noisy (like the second image).

Nevertheless, the term "model" in my case refers to constraint-based modeling (here, linear programming), and the data points represent the flux rates of reactions in a biochemical network. The steps are the levels of regulation of these reactions, so as control (regulation) increases, the reaction fluxes change. Each image (two of which are shown in my example) shows the variation of a flux rate with the level of regulation. As seen, for some reactions this regulation generates a linear relationship with some noise. These are the reactions I am interested in detecting, since they have a meaningful flux variation (after noise removal, of course). On the other hand, there may be completely noisy flux rates, which I intend "NOT" to recognize as "linear".

As you mentioned, I tried a mixture of methods to remove outliers, and then took the Spearman correlation between flux rates and the growth of the cell (this variable is quite robust and stable, and I want to check its association with the flux rates). Spearman correlations above a threshold (~0.9) produce a robust set of reactions across various step sizes (or regulation/control levels). I assumed that correlation coefficients below this threshold denote completely noisy data, so those are successfully rejected by this approach. Of course, some minor variations exist, which could be ignored.
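The thresholding step described here could be sketched roughly as follows. The variable names (growth, fluxLinear, fluxNoisy) and the data are hypothetical stand-ins, and the Spearman coefficient is computed by hand from ranks, which is only valid when there are no tied values. With the Statistics and Machine Learning Toolbox, corr(growth, flux, 'Type', 'Spearman') gives the same quantity and handles ties.

```matlab
rng(2);                                   % reproducible synthetic data
n = 100;
growth     = linspace(0.1, 1, n)';        % stand-in for the growth variable
fluxLinear = 3*growth + 0.01*randn(n, 1); % flux that tracks growth
fluxNoisy  = rand(n, 1);                  % flux that is pure noise
fluxes     = {fluxLinear, fluxNoisy};

% Ranks via the double-sort trick (assumes no tied values)
[~, order] = sort(growth);
[~, rankG] = sort(order);

rho = zeros(1, numel(fluxes));
for k = 1:numel(fluxes)
    [~, order] = sort(fluxes{k});
    [~, rankF] = sort(order);
    R = corrcoef(rankG, rankF);           % Pearson correlation of the ranks
    rho(k) = R(1, 2);                     % = Spearman correlation
end

threshold = 0.9;                          % cutoff quoted in the comment above
isLinear  = rho > threshold;
```

Note that smoothing the flux before this step can raise rho on its own, so it may be worth computing the coefficient on the unsmoothed values as well.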

John D'Errico on 25 Jun 2017

Anytime you have a problem like this where you know a fair amount about the process, then you are best off by using your knowledge about the system. As you point out, it may require multiple schemes to find the outliers.


Answer 2

Image Analyst on 24 Jun 2017

Edited: Image Analyst on 24 Jun 2017

1 vote

Perhaps try RANSAC: https://en.wikipedia.org/wiki/Random_sample_consensus

MATLAB toolbox version: https://www.mathworks.com/discovery/ransac.html

You might also try deleteoutliers() by Mathworker Brett Shoelson: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers

3 Comments

Image Analyst on 24 Jun 2017

I don't know why you say "however" as if that rules out RANSAC. To distinguish between data sets that are complete noise and those with few outliers, why can't you just look at the percentage of data points that are on the RANSAC line? Complete noise would have a very low percentage, like 5% or so, while data with a line buried in there somewhere would have a higher percentage, like 40%, or even higher for "data sets containing few (or detectable) noise and outliers". I don't see any reason why RANSAC can't still be used.
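A bare-bones version of that idea, written in plain MATLAB rather than with any toolbox RANSAC routine, might look like the sketch below. The synthetic data, the residual tolerance tol, and the number of trials are assumptions chosen for the example and would need tuning for real flux data.

```matlab
rng(3);                                   % reproducible synthetic data
n = 200;
x = linspace(0, 10, n)';
yLine = 1 + 2*x + 0.1*randn(n, 1);        % line plus small noise
k = randperm(n, 40);
yLine(k) = yLine(k) - (3 + 10*rand(40, 1));   % 20% outliers
yNoise = 20*rand(n, 1);                   % no underlying line at all

sets    = {yLine, yNoise};
tol     = 0.3;                            % max vertical residual for an inlier (assumed)
nTrials = 500;                            % number of random two-point lines (assumed)
inlierFrac = zeros(1, numel(sets));

for s = 1:numel(sets)
    y = sets{s};
    best = 0;
    for t = 1:nTrials
        p = randperm(n, 2);               % two distinct sample points
        slope = (y(p(2)) - y(p(1))) / (x(p(2)) - x(p(1)));
        icpt  = y(p(1)) - slope*x(p(1));
        nIn   = sum(abs(y - (icpt + slope*x)) < tol);
        best  = max(best, nIn);           % keep the best consensus size
    end
    inlierFrac(s) = best / n;             % fraction of points on the best line
end
```

Here inlierFrac(1) should come out high and inlierFrac(2) low, so a cutoff somewhere between the two values separates "line with outliers" from "all noise" for this example.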

Ive J on 24 Jun 2017

Yes, that's a possible solution. I did not consider the percentage threshold, as you've pointed out. I will try it.

Deeply appreciate your clues.
