How to get linearly independent subset of matrix columns in a high-dimensional matrix

6 views (last 30 days)
Dear all:
I have a few sets of fixed effects in my linear model, which has a collinearity problem since the rank does not match the number of columns.
  • After a wide search on the platform, I found that there is a great function called lincols to solve this issue. However, it occurs to me that the function becomes very slow as the dimension of my fixed effects is very large (at least in the scale of thousands).
My questions are:
  • Is there any alternative (faster) function that could handle a high-dimensional collinearity problem?
  • While the lincols function handles matrices of any kind, I am wondering if there is any way to speed up the algorithm since I am only dealing with dummies (fixed effects).
Thank you very much, and I look forward to hearing from you!
Best,
Long

Accepted Answer

John D'Errico
John D'Errico on 23 Aug 2020
Edited: John D'Errico on 23 Aug 2020
You have thousands of columns, and therefore probably thousands of rows. Why does it not seem surprising that big problems are computationally intensive, and that bigger problems are more so?
However you choose to solve this, the problem will be O(n^3), that is, it will grow in complexity with the cube of the size of the matrix. This means if you double the size of the matrix, I'd expect to see the complexity of the problem to go up by a factor of 8. One way or another, you will be using a column pivoted matrix factorization. There is no magic here that will solve the problem in the blink of an eye. And that you think of the columns as dummies is irrelevant. In any case, you need to use linear algebra to solve the problem, and linear algebra does not care what the columns mean. Big problems take big time, or at least big computers.
  7 Comments
John D'Errico
John D'Errico on 24 Aug 2020
Edited: John D'Errico on 24 Aug 2020
The problem is, in order to use QR for this purpose, you need to use the THREE output version of QR. The reason QR does the work for you, is in the column pivoting. At each step, it kills off what it has effectively already seem, then it takes the column that is most linearly independent form those it has already seen. This is why it works for your purpose.
It is important that qr does this task on sparse matrices, since your problem is sparse. However, if that cannot be pushed onto the GPU, you just need to use the fastest server you can find. The most available cores would be important.
For example, if I try this not very sparse problem, but large enough that it will make my computer work hard enough to get all 8 cores humming for long enough to see that happen:
X = sprand(5000,20000,.01);
[Q,R,E] = qr(X,0);
Then I see MATLAB is indeed using the full capacity of my computer, all 8 cores. On small problems, only 1 core will wake up. And there is a big difference between 2 cores on a laptop and 8 or 12 or more. (A lot of heat generated too.)
My point is, for the most speed, find something with as many physical cores that you can access as possible. If you can get time on something with 32 or 64 or 128 cores, then do so.
Long Hong
Long Hong on 24 Aug 2020
Thanks @John for your suggestions! My university does have some super computers for this purpose (with many cores). I will try to find a way to use it. :D

Sign in to comment.

More Answers (0)

Categories

Find more on Parallel Computing Fundamentals in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!