using PCA to filter outliers in MATLAB

A bit of a silly question (I suspect). I am using MATLAB to preprocess some bulk RNA-seq data. I am calling pca on my normalized counts as an initial way to identify outliers I didn't automatically reject. There are indeed some clear outliers, but I am not sure how to use the score matrix output by pca to filter my original data, because it is now in descending order of explained variance. How can I use the pca scores to get back to the indices of the problematic samples? Thank you so much! Info below:
[coeff,score,latent,tsquared,explained] = pca(normCounts);
size(normCounts) % 136 samples, 24427 measured genes
ans =
   136  24427
  2 Comments
Mathieu NOE on 21 May 2024
It would help if you could share some code and data.
Cooper on 21 May 2024
Hi Mathieu, thanks so much for your speedy response; please find my code below. Unfortunately I can't attach any of the data, since it is sensitive and unpublished. My question is really just about understanding how PCA is implemented in MATLAB and how to return to the original order of an input matrix after outputting rank-ordered scores for explained variance.
Thanks so much
%% normalizing samples
% first we take a pseudo-reference by calculating the geometric mean across
% each one of our genes
pseudoRefSample = geomean(countData2use,2);
nz = pseudoRefSample > 0; % exclude genes with no expression from the calculation
ratios = bsxfun(@rdivide, countData2use(nz,:), pseudoRefSample(nz)); % per-gene ratios to the reference
sizeFactors = median(ratios,1); % one size factor per sample
normCounts = bsxfun(@rdivide, countData2use, sizeFactors);
normCounts(end,:) = []; % last row is always NaN'd out
% remove samples with extreme size factors
extremeVals = sizeFactors > 1.4 | sizeFactors <= 0.6;
normCounts(:,extremeVals) = [];
ptID2use = ptIDZ;
ptID2use(extremeVals) = [];
groupID2use = groupID;
groupID2use(extremeVals) = [];
[coeff,score,latent,tsquared,explained] = pca(normCounts'); % observations (samples) as rows
figure
plot(1:length(explained),explained)
ylabel('explained variance')
xlabel('Principal Component')
figure
scatter(score(:,1),score(:,2)) % columns of score are PCs, so index columns, not rows
axis equal
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
% this will be where (ideally) I filter out the few data points that are clearly far removed from the distribution
a = find(abs(score(:,1)) > 10000)


Accepted Answer

John D'Errico on 21 May 2024
Edited: John D'Errico on 22 May 2024
I've attached my quick shot at an n-dimensional outlier tool, based on Mahalanobis distance. I am pretty sure it has some serious issues, and since I refuse to accept the label of statistician, I'll choose to ignore them. :)
For some random data, with 10% potential outliers...
X = randn(100,2)*randn(2,3) + randn(100,3)/50;
X = [X;randn(10,3)*50+5];
As you should see, the last 10 of the 110 points will be potential outliers, though some of those points might just randomly have fallen inside the upper cloud. Use it in a sequential fashion, removing the worst point, IF the flag is greater than 1.
retainset = 1:110;
outlierlist = [];
continueflag = true;
tol = 0.001;
while continueflag
    outflag = mahalOutliers(X(retainset,:),tol); % attached function: flags each point's relative distance
    [maxflag,maxind] = max(outflag);
    if maxflag > 1
        outlierlist(end+1) = retainset(maxind); % record, then drop the worst point
        retainset(maxind) = [];
    else
        continueflag = false;
    end
end
outlierlist
outlierlist = 1x10
105 108 102 101 103 107 109 104 110 106
So, with a tolerance set at 0.001, the tool identified points numbered 105, 108, 102, etc., in sequence, then happily stopped after it found 10 outliers. Coincidentally, those were the 10 points I added to the set.
As I said, the idea surely has flaws. And I would seriously agree with everything said by @the cyclist.
  1 Comment
Cooper on 22 May 2024
I don't know what to say... I'm so grateful for this incredibly edifying and kind response. I will think critically about my approach to this problem, do some more reading, and hopefully one day pay this forward.
Thank you thank you thank you!!


More Answers (1)

the cyclist on 21 May 2024
There is a built-in MATLAB function, rmoutliers, for detecting and removing outliers.
I frankly did not read all your code, but you might also be interested in this tutorial-style answer I wrote about using MATLAB's pca function. In particular, if you are trying to remove observations that are outliers along the principal component axes, you should be able to
  • apply pca
  • apply rmoutliers on the variable I call dataInPrincipalComponentSpace
  • transform that variable back to the original space
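A minimal sketch of those three steps, assuming X holds your data with samples as rows (the variable name dataInPrincipalComponentSpace follows the linked tutorial answer; rmoutliers here uses its default median-based detection rule):

```matlab
% Step 1: PCA. Rows of score stay in the same order as rows of X.
[coeff, score] = pca(X);
dataInPrincipalComponentSpace = score;

% Step 2: remove rows that are outliers along any principal component axis.
% The second output is a logical vector over the ORIGINAL row order.
[cleanScore, isOutlier] = rmoutliers(dataInPrincipalComponentSpace);
outlierIdx = find(isOutlier);   % indices of the flagged samples

% Step 3: map the retained scores back to the original space. pca centers
% the data, so add the column means back after undoing the rotation.
cleanX = cleanScore * coeff' + mean(X, 1);
```

Because the rows of score are never reordered, outlierIdx indexes directly into the rows of X.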
  5 Comments
John D'Errico on 21 May 2024
Edited: John D'Errico on 21 May 2024
I was kind of vague, partly because I'm not really a true sadistician. I hope I spelled that correctly. ;-)
But there are two issues I see. One is the question of lack of normality. For example, consider the set:
X = exp(randn(1000,3));
X is lognormally distributed. And I am sure any PCA on X would identify some of those points as outliers. But if you consider the log of that sample, you would recognize there are no outliers at all. Everything is as perfectly well behaved as possible in that domain. Even worse, suppose I added in one new point:
X2 = [X;exp([-8 -8 -8])];
I would conjecture that this new point will not even be remotely observed as an outlier by any PCA based scheme, but back in the log space, that last point is out in the weeds. The problem in all of this is PCA implicitly presumes normality. And when you lack normality, PCA will fail to perform well. Worse, any data that contains outliers explicitly fails the presumption of normally distributed data.
The second issue I was thinking of is how a PCA will treat data with an outlier in it. The scores will all be strange, and potentially might miss an outlier. I think I need to come up with an example of what I am thinking, but my proposal below would seem to eliminate that issue anyway.
Anyway, if you want to do this, I would work in terms of a leave-one-out Mahalanobis distance, instead of using a PCA. Thus for each point, compute the mean and covariance matrix of the set, leaving that point out. Now compute the distance for the point you left out. Repeat for every point in your set. Choose the one with the largest distance measure, IF that distance is sufficiently large in terms of Mahalanobis distance. Drop it from your set. Now lather, rinse, and repeat, until none of the points are sufficiently far out to be considered an outlier by this scheme.
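A rough sketch of one pass of that leave-one-out loop, assuming X is n-by-p with rows as observations (the chi-square cutoff is one common choice for "sufficiently large", not something specified above):

```matlab
[n, p] = size(X);
d = zeros(n,1);
for i = 1:n
    others = X;
    others(i,:) = [];              % leave point i out
    mu = mean(others,1);
    S  = cov(others);              % covariance without point i
    xc = X(i,:) - mu;
    d(i) = sqrt(xc / S * xc');     % Mahalanobis distance of the held-out point
end
[worstD, worstIdx] = max(d);
% Drop X(worstIdx,:) only if worstD is large, e.g. beyond a chi-square
% quantile, then recompute on the reduced set and repeat:
cutoff = sqrt(chi2inv(0.999, p)); % Statistics and Machine Learning Toolbox
isFarOut = worstD > cutoff;
```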
Will that scheme fail for the lognormal example I showed before? Of course.
Finally, does rmoutliers do something like this already? I'd need to check on that, but I don't see that it does, at least not at a glance.
the cyclist on 21 May 2024
The transformation to, and back from, the principal components will not change the order of the observations. It will only change the coordinate system.
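A tiny demonstration of that point (toy data, not the OP's counts):

```matlab
X = randn(8,3);                     % 8 observations, 3 variables
[coeff, score] = pca(X);
% Undo the rotation and re-add the column means that pca subtracted.
Xrec = score * coeff' + mean(X,1);
% Rows come back in their original order, and the round trip is exact
% to floating-point precision:
max(abs(Xrec(:) - X(:)))            % on the order of 1e-15
```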
The cautions you received from @John D'Errico are valid (of course). I took the "low road", in just telling you how to accomplish what you asked, rather than the high road of giving you the many things to think about in what you are doing. I have a 90-page book on how to detect and handle outliers, and it's just difficult for me to know where to draw the line on what to share here. :-)
I think the most important short lesson is to ask yourself why you are dropping an observation. The simple fact that it lies away from the bulk of the distribution -- without any hypothesis about why this may have happened -- is not a very good reason to do so.


Release: R2023a