How to get linearly independent subset of matrix columns in a high-dimensional matrix

Question

Long Hong on 23 Aug 2020


Direct link to this question

https://in.mathworks.com/matlabcentral/answers/583481-how-to-get-linearly-independent-subset-of-matrix-columns-in-a-high-dimensional-matrix

Commented: Long Hong on 24 Aug 2020

Dear all:

I have a few sets of fixed effects in my linear model, which has a collinearity problem since the rank does not match the number of columns.

After a wide search on the platform, I found a great function called lincols that solves this issue. However, the function becomes very slow because the dimension of my fixed effects is very large (at least on the scale of thousands).

My questions are:

1. Is there any alternative (faster) function that could handle a high-dimensional collinearity problem?
2. While the lincols function handles matrices of any kind, I am wondering if there is any way to speed up the algorithm since I am only dealing with dummies (fixed effects).

Thank you very much, and I look forward to hearing from you!

Best,

Long


Answer 1

John D'Errico on 23 Aug 2020


Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/583481-how-to-get-linearly-independent-subset-of-matrix-columns-in-a-high-dimensional-matrix#answer_483998

Edited: John D'Errico on 23 Aug 2020

You have thousands of columns, and therefore probably thousands of rows. Why does it not seem surprising that big problems are computationally intensive, and that bigger problems are more so?

However you choose to solve this, the problem will be O(n^3), that is, the work grows with the cube of the size of the matrix. This means that if you double the size of the matrix, I'd expect the cost to go up by a factor of 8. One way or another, you will be using a column pivoted matrix factorization. There is no magic here that will solve the problem in the blink of an eye. And that you think of the columns as dummies is irrelevant. In any case, you need to use linear algebra to solve the problem, and linear algebra does not care what the columns mean. Big problems take big time, or at least big computers.
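As a rough way to check the cubic growth on your own machine, here is a minimal sketch that times a pivoted QR at size n and again at size 2n. The size n = 500 is an arbitrary choice for illustration, and the measured ratio will not be exactly 8, since multithreading and cache effects also change with problem size.

```matlab
% Time a column-pivoted QR at two sizes and compare the cost.
n  = 500;                 % illustrative size, not taken from the thread
X1 = rand(n);             % n-by-n dense test matrix
X2 = rand(2*n);           % 2n-by-2n dense test matrix

% timeit(f, 3) calls f with three outputs, so the pivoted QR is what gets timed.
t1 = timeit(@() qr(X1,0), 3);
t2 = timeit(@() qr(X2,0), 3);

% For an O(n^3) algorithm the ratio should be somewhere near 2^3 = 8.
fprintf('n = %d: %.4f s, n = %d: %.4f s, ratio = %.1f\n', n, t1, 2*n, t2, t2/t1);
```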

7 Comments

John D'Errico on 24 Aug 2020

Edited: John D'Errico on 24 Aug 2020


I have no idea where you found lincols, so that means I need to find it. It was probably on the file exchange. It always helps if you give a link when you say that you found code.

https://www.mathworks.com/matlabcentral/fileexchange/77437-extract-linearly-independent-subset-of-matrix-columns

But what does lincols do? IT CALLS QR, a column pivoted QR factorization to be exact. And that is what I'd probably use. Look at the code you found.

Is QR the best tool for the job? Probably. A pivoted QR is an excellent tool for this purpose. It is fast, efficient, and numerically stable, and better in all respects than you will get from any other tool you might choose.

The guts of lincols is just a call to that pivoted QR. I won't bother downloading lincols.

function E = testlindep(X)
% Economy-size, column pivoted QR. E is the column permutation vector.
[Q,R,E] = qr(X,0);
end

Now let me test it on a random rank 400 matrix of size 1000x1000.

X = rand(1000,400)*rand(400,1000);
timeit(@() testlindep(X))
ans =
            0.035278595132

And that is far better than any alternative you will find. So lincols will be fast as blazes. The main alternative I might have tried is rref, but rref is not compiled code, and is hugely slower than qr. That it takes more time for larger problems is, as I said, just a reflection that big problems take big time or big computers.

Make sure that you are not memory limited, as that could cause problems.

While someone will surely tell you to compile the code, that will not help, since qr is already incredibly highly optimized, and it is already compiled. If someone tells you to use the parallel computing toolbox, again, that will be a waste of time, since qr is already going to be using all of your available cores on big problems.

Is there anything you can do? Perhaps. You can gain a little bit, if you really don't care how accurate it is.

Xs = single(X);
timeit(@() testlindep(Xs))
ans =
            0.024235015132

So in this case, if I convert the array into single precision, the QR call took about 31% less time (0.0242 versus 0.0353 seconds). This comes at a considerable cost in accuracy. And you would need to hack the lincols code, since the tolerance it uses is far too small in the context of a single precision array.
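To make the tolerance point concrete, here is a minimal sketch of a rank test whose tolerance scales with the precision of the array. The rule tol = max(size(X))*eps(largest pivot) is an assumption chosen for illustration; it is not the tolerance lincols uses.

```matlab
% Same rank-400 test matrix as above, converted to single precision.
rng(0);                                  % fixed seed so the run is repeatable
X  = rand(1000,400)*rand(400,1000);
Xs = single(X);

% Column-pivoted QR in single precision.
[~,Rs,Es] = qr(Xs,0);
ds = abs(diag(Rs));                      % pivot magnitudes, largest first

% Assumed tolerance: eps of a single value returns single-precision spacing,
% so this threshold is many orders of magnitude larger than the double one.
tolSingle = max(size(Xs))*eps(ds(1));
tolDouble = max(size(Xs))*eps(double(ds(1)));

rs = sum(ds > tolSingle);                % estimated numerical rank
fprintf('single tol = %g, double tol = %g, estimated rank = %d\n', ...
        tolSingle, tolDouble, rs);
```

With a double precision tolerance applied to single precision pivots, the roundoff in the trailing pivots would be counted as signal, and the estimated rank would be too high.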

Another option, if you have the Parallel Computing Toolbox, is to offload the computation onto your GPU. (I don't even know if that can be done, since I don't have that toolbox.)

John D'Errico on 24 Aug 2020

Edited: John D'Errico on 24 Aug 2020


The problem is, in order to use QR for this purpose, you need to use the THREE output version of QR. The reason QR does the work for you is the column pivoting. At each step, it kills off what it has effectively already seen, then it takes the column that is most linearly independent from those it has already seen. This is why it works for your purpose.
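Here is a minimal sketch of how the three output form can be turned into a list of independent columns, using a dense rank 400 matrix like the earlier test. The tolerance rule and the variable names (tol, r, idx, Xsub) are my own assumptions for illustration, not the lincols implementation.

```matlab
% Rank-deficient test matrix: rank 400, size 1000x1000.
rng(0);                                % fixed seed so the run is repeatable
X = rand(1000,400)*rand(400,1000);

% Economy-size, column-pivoted QR. With the 0 flag, E is a permutation
% vector such that X(:,E) = Q*R.
[~,R,E] = qr(X,0);

% Pivoting orders the diagonal of R by decreasing magnitude, so the
% numerical rank is the number of pivots above a tolerance.
d   = abs(diag(R));
tol = max(size(X))*eps(d(1));          % assumed tolerance rule
r   = sum(d > tol);

% The first r pivoted columns form a linearly independent subset.
idx  = sort(E(1:r));                   % original column indices, ascending
Xsub = X(:,idx);

fprintf('kept %d of %d columns\n', r, size(X,2));
```

The remaining columns, E(r+1:end), are the ones that are numerically dependent on the columns that were kept.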

It is important that qr does this task on sparse matrices, since your problem is sparse. However, if that cannot be pushed onto the GPU, you just need to use the fastest server you can find. The most available cores would be important.

For example, take this problem, which is not very sparse but is large enough to get all 8 of my cores humming for long enough to see it happen:

X = sprand(5000,20000,.01);
[Q,R,E] = qr(X,0);

Then I see MATLAB is indeed using the full capacity of my computer, all 8 cores. On small problems, only 1 core will wake up. And there is a big difference between 2 cores on a laptop and 8 or 12 or more. (A lot of heat generated too.)

My point is, for the most speed, find a machine with as many physical cores as you can get access to. If you can get time on something with 32 or 64 or 128 cores, then do so.
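If you want to see how much the cores matter on a given machine, here is a rough sketch that times the same pivoted QR with one computational thread and then with the default number. The matrix size is an arbitrary choice for illustration, and the speedup you see will depend on the hardware.

```matlab
% Compare a pivoted QR on one thread against the default thread count.
X = rand(2000);                          % illustrative dense test matrix

nDefault = maxNumCompThreads;            % current limit on computational threads

maxNumCompThreads(1);                    % restrict MATLAB to a single thread
tOne = timeit(@() qr(X,0), 3);           % 3 outputs requests the pivoted form

maxNumCompThreads(nDefault);             % restore the original limit
tAll = timeit(@() qr(X,0), 3);

fprintf('1 thread: %.3f s, %d threads: %.3f s\n', tOne, nDefault, tAll);
```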

Long Hong on 24 Aug 2020

Thanks @John for your suggestions! My university does have some supercomputers for this purpose (with many cores). I will try to find a way to use them. :D
