Matrix algebra very slow on GPU
I've been testing some of the MATLAB matrix routines on a Tesla K20 GPU. So far I've found that chol, lu, \, svd, and eig all run significantly slower on the GPU than on the CPU, even without including the time to transfer the data to the GPU. Is this a common experience? If not, what might I be doing wrong?
7 Comments
Jill Reese
on 11 Nov 2013
What version of MATLAB are you using?
Matt J
on 11 Nov 2013
Are you doing the operations in double precision or single precision?
Bonnie
on 11 Nov 2013
Bonnie
on 11 Nov 2013
Matt J
on 11 Nov 2013
Are they slower than the CPU in single precision as well?
Bonnie
on 11 Nov 2013
Bonnie
on 11 Nov 2013
Answers (2)
Sean de Wolski
on 11 Nov 2013
Edited: Sean de Wolski
on 11 Nov 2013
How are you doing the timing?
If upgrading is an option: in R2013b we released gputimeit, which gives much more accurate measurements of GPU timing, along with a whole year's worth of other improvements.
And, as Jill asked: what exactly are you running?
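To illustrate the difference Sean is pointing at, here is a minimal sketch (matrix size arbitrary; requires R2013b or later for gputimeit):

```matlab
% Sketch: naive tic/toc vs gputimeit for GPU timing.
A = gpuArray.rand(2000);        % 2000x2000 random matrix on the GPU
b = gpuArray.rand(2000, 1);

% tic/toc can under-report GPU time: kernel launches are asynchronous,
% so toc may fire before the GPU has actually finished the work.
tic; x = A \ b; tNaive = toc;

% gputimeit synchronizes the device and averages repeated runs.
tAccurate = gputimeit(@() A \ b);
```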
18 Comments
Matt J
on 11 Nov 2013
The matrices you showed are not of compatible sizes. You should be getting errors.
Bonnie
on 11 Nov 2013
Matt J
on 11 Nov 2013
Can we see an example without the typo?
Edric Ellis
on 12 Nov 2013
Bonnie, have you seen this benchmark? http://www.mathworks.co.uk/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html It shows a series of backslash timings on a K20, with a decent speedup over the CPU. Note that it times the "matrix \ vector" case rather than the "vector \ matrix" case you described in your comment.
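To make the shape distinction concrete (sizes here are illustrative, not taken from the thread):

```matlab
N = 4000; M = 500;
A = rand(N);        b = rand(N, 1);
L = rand(N, 1);     X = rand(N, M);

x = A \ b;   % "matrix \ vector": square solve -- the case the benchmark times
y = L \ X;   % "vector \ matrix": a least-squares fit with one coefficient
             % per column of X -- an unusual shape for mldivide
```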
Sean de Wolski
on 12 Nov 2013
Hi Bonnie, can you put that code in a function file and run the function a couple of times? Timing at the command window is not very accurate.
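Something along these lines, with a warm-up call so that JIT compilation and GPU initialization are excluded from the measurement (names and sizes are made up for illustration):

```matlab
function [t1, t2] = timeBackslash(n, m)
% Time L\X on the CPU and GPU from inside a function, as suggested above.
L  = rand(n, 1);    X  = rand(n, m);
L1 = gpuArray(L);   X1 = gpuArray(X);

y = L \ X;  y1 = L1 \ X1;                       % warm-up runs

tic;  y  = L  \ X;                   t1 = toc;  % CPU
tic;  y1 = L1 \ X1;  wait(gpuDevice); t2 = toc; % GPU, synchronized
end
```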
I can confirm what Bonnie is seeing on a GTX 580 in R2012b:
t1 =
0.2764
t2 =
0.1444
I did this in a function file running several times as Sean suggested.
However, when I run the equivalent operations below, the GPU does much better relative to the CPU, and both implementations beat their mldivide counterparts. So I'm tempted to think that gpuArray.mldivide simply wasn't optimized for matrices with these particular relative dimensions, perhaps because the mtimes-based formulation was considered the more likely use case.
tic,
y = (L.'*X)/norm(L)^2;
t3 = toc
tic
y1 = (L1.'*X1)/norm(L1)^2;
t4 = toc
t3 =
0.0027
t4 =
0.0093
Bonnie
on 13 Nov 2013
Matt J
on 13 Nov 2013
I think you mean the link that Sean posted, not Jill. As he mentioned, you would need to upgrade to R2013b.
Matt J
on 13 Nov 2013
It would help us if you used the code button to format your code separately from your text.
Aside from that, maybe give us some detail about your CPU?
Bonnie
on 13 Nov 2013
So 12 cores? I think you have to normalize your t2 by the number of cores in some way to account for the advantage that a multi-core CPU gives you. There are machines with dozens of cores that a single GPU could never beat. Assuming your average CPU is dual core, that would mean a handicap factor of 6, bringing your speed-up ratio to around 2.8.
Pretty decent, I guess, compared to dual-core benchmarks. I doubt all gpuArray operations are expected to be faster than their CPU counterparts. You just don't want them to be slower than some average CPU.
Bonnie
on 13 Nov 2013
Joss Knight
on 27 Apr 2016
It might be worth answering this question for posterity.
It seems the questioner was testing, at least for the linear solves, a very unusual system: many right-hand sides but only one column in the system matrix. Since this is not a typical case, MLDIVIDE is not optimised for it. To get an accurate answer it has to account for possible poor conditioning by using a QR factorisation, which is less parallelisable than other approaches to solving these equations, one of which is given in the comments to Sean's answer. Another is to solve the normal equations:
% Solve A*X = B for X via the normal equations
R = chol(A'*A);        % R is upper triangular with R'*R = A'*A
X = R\(R'\(A'*B));     % two triangular solves instead of a QR factorisation
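A hedged sketch of applying this on the GPU to the one-column case discussed in this thread (all names and sizes here are assumptions for illustration):

```matlab
% Sketch: solve L1*y = X1 in the least-squares sense via normal equations.
L1 = gpuArray.rand(4000, 1);      % system "matrix": a single column
X1 = gpuArray.rand(4000, 500);    % many right-hand sides

R = chol(L1' * L1);               % 1x1 here, but works for any tall L1
y = R \ (R' \ (L1' * X1));
```

The usual caveat applies: the normal equations square the condition number of the system, so this trades some accuracy for parallelisability.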
For SVD and EIG it is possible the same situation applies, perhaps the questioner was carrying out the SVD on a tall skinny matrix. However, it is true that these functions do not parallelise well. I found that a 2000x2000 random matrix could be factored faster on my K20 than on the CPU, but the performance tails off for larger matrices, presumably due to resource contention on the device. It does make a difference whether you ask for all three factors or just the singular values (or, in the case of EIG, whether you ask for eigenvectors or just the eigenvalues).
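The values-only versus full-factorisation difference Joss mentions can be measured directly; a sketch with an illustrative size (gputimeit's second argument sets the number of requested outputs):

```matlab
A = gpuArray.rand(2000);

tVals = gputimeit(@() svd(A));      % singular values only
tFull = gputimeit(@() svd(A), 3);   % all three factors [U, S, V]
```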
For LU on a general matrix and CHOL on a symmetric matrix I found my K20 was much faster than the CPU, so it would be necessary to see exactly what the questioner was doing when they were timing these functions.