How to interpret the coefficients of the LDA function fitcdiscr for dimensionality reduction?

LDA gives me discriminant functions, similar to the principal components of PCA (with the difference that PCA simply maximizes the variance along the PCs, while LDA uses existing labels to make my clusters as separable as possible).
As I understand the textbook descriptions of LDA, the first DF does the best job of separating the clusters, the second DF the next best, and so on. But when using fitcdiscr in MATLAB I get one function for each possible pair of clusters.
I expected to get (N-1) DFs, with N being the number of clusters, but the way the function returns the coefficients I get (N-1)+(N-2)+...+1 different DFs.
How do I know which DFs are the most important ones? If I wanted to reduce my dimensions as far as possible, which ones should I choose? Do I have to iterate through all possible combinations of DFs myself to check how far apart my clusters are?

Answers (2)

Alexander Jamieson on 27 Apr 2021
Edited: Alexander Jamieson on 27 Apr 2021
Sorry I'm almost a year late; I was curious myself how to actually do this and couldn't find a straightforward answer anywhere on the internet. I eventually figured it out by using a community-built LDA function and comparing its variables to the ones output by the official MATLAB function.
Once you have your LDA model from the fitcdiscr function, you need the generalized eigenvalues and eigenvectors of the BetweenSigma (between-class covariance) and Sigma (within-class covariance) properties of the model, obtained with the eig function. The eigenvalues and eigenvectors are then sorted into descending order, and the resulting projection is simply the product of your original input X and the eigenvector matrix W. Example code:
Mdl = fitcdiscr(X, L);                          % X: n-by-d predictors, L: class labels
[W, LAMBDA] = eig(Mdl.BetweenSigma, Mdl.Sigma); % generalized eigendecomposition; arguments must be in this order!
lambda = diag(LAMBDA);                          % extract the eigenvalues
[lambda, SortOrder] = sort(lambda, 'descend');  % sort eigenvalues in descending order
W = W(:, SortOrder);                            % reorder the eigenvectors to match
Y = X*W;                                        % project the data onto the discriminant axes
The resulting output Y is the equivalent of the score output of the pca function, which you can then visualize directly in the reduced feature space, e.g. by plotting the first 2-3 columns.
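For example, a minimal plotting sketch, assuming the X, L, and sorted W from the code above and keeping only the first two discriminant axes:
Y2 = X * W(:, 1:2);             % project onto the two strongest discriminant axes
gscatter(Y2(:,1), Y2(:,2), L);  % scatter plot colored by class label
xlabel('LD1'); ylabel('LD2');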
  2 Comments
Samuel Acuna on 5 Oct 2021
Edited: Samuel Acuna on 5 Oct 2021
I agree with this answer. I'll add that the maximum number of columns of W you can usefully keep is C-1, where C is the number of classes you trained on (in your case, the number of clusters). When you compute W using the above code you will get more than C-1 columns, but these additional columns do not explain any more variance: their eigenvalues are close to zero, and would be exactly zero if not for floating-point imprecision.
So you can limit W to be d x k, where d is the original number of dimensions and k is the reduced number of dimensions after LDA (with k no greater than C-1).
For your original data set X of n observations (X is n x d), computing Y = X*W then transforms X into Y, which is n x k. You have thus reduced the dimensionality of X using LDA.
This is a useful method to reduce multi-dimensional data onto a 2D or 3D space, which makes visualization and plotting much easier.
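A minimal sketch of that truncation, assuming the X, L, and sorted W from the accepted answer (numClasses and k are just illustrative names):
numClasses = numel(unique(L));   % C, the number of trained classes
k = min(3, numClasses - 1);      % keep at most C-1 discriminant axes (capped at 3 here for plotting)
Wk = W(:, 1:k);                  % d-by-k projection matrix
Y  = X * Wk;                     % n-by-k reduced representation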
heng ma on 24 Nov 2023
Edited: heng ma on 24 Nov 2023
Thank you for providing this solution. However, I still don't fully understand the underlying principle, especially the step of solving a generalized eigenvalue problem. For instance, if the original data is an n*d matrix, with n points and d original dimensions, and LDA is used for C-class classification, then theoretically a space of dimension C-1 is obtained, with C-1 mutually orthogonal axes. Why do these two covariance matrices (between-class and within-class) yield a global feature space? Also, the resulting W is a d*d matrix; assuming d is not less than C-1, why do the first C-1 columns constitute the feature space? Are these C-1 vectors mutually orthogonal, as in PCA? In fact, I ran a test: the PCA components are orthogonal to each other, whereas the pairwise angles of the first C-1 vectors obtained with this W method are concentrated around 90 degrees but not exactly 90, and no more concentrated than the angle distribution of vectors sampled at random on a d-dimensional sphere.
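For reference, a small sketch of the kind of angle check described above, assuming the X, L, and sorted W from the accepted answer:
coeff = pca(X);                               % PCA loadings: columns are orthonormal
C = numel(unique(L));
V = W(:, 1:C-1) ./ vecnorm(W(:, 1:C-1));      % first C-1 LDA directions, normalized to unit length
G = V' * V;                                   % pairwise cosines between the LDA directions
ldaAngles = acosd(G(triu(true(C-1), 1)))      % pairwise angles in degrees: near, but not exactly, 90
pcaCheck = norm(coeff' * coeff - eye(size(coeff, 2)))  % ~0, since PCA loadings are orthonormal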



Aditya Patil on 15 Jul 2020
When you use the fitcdiscr function, it returns the model that best separates the classes. You can check the documentation for ClassificationDiscriminant for how to use the model and how to access its parameters.
For example, you can use the Mu property to get the class means.
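A minimal sketch, assuming X holds the predictors and L the class labels:
Mdl = fitcdiscr(X, L);       % fit the discriminant model
classMeans = Mdl.Mu          % K-by-d matrix of class means, one row per class
pairwiseCoeffs = Mdl.Coeffs  % K-by-K structure array of coefficients for each pair of classes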
  1 Comment
A. D. on 15 Jul 2020
Yes, I'm aware of that, but I don't want to use the full model for classification; I simply intend to use the generated DFs for dimensionality reduction, so I can plot the clustering in my 80+ predictors with only two or three axes as well as possible.
In PCA I can simply choose the first two PCs for that, because I know they explain the most variance. But with the output of the fitcdiscr function I cannot tell which of the DFs to use if I don't want to use all of them.

