Matrix algebra very slow on GPU

I've been testing some of the MATLAB matrix routines on a Tesla K20 GPU. So far I've found that chol, lu, \, svd, and eig all run significantly slower on the GPU than on the CPU, even without including the time to transfer the data to the GPU. Is this a common experience? If not, what might I be doing wrong?

7 Comments

What version of MATLAB are you using?
Are you doing the operations in double precision or single precision?
I'm running 8.0.0.783 (R2012b) 64bit
I'm doing the operations in double precision.
Are they slower than the CPU in single precision as well?
In single precision the \ function is faster on the GPU than on the CPU, but not by much. Also, in single precision the \ function takes twice as long on the GPU as the same calculation in double precision. That makes no sense to me.
In single precision, the SVD takes twice as long on the GPU as it does on the CPU. It's the same in double precision.


Answers (2)

Sean de Wolski on 11 Nov 2013
Edited: Sean de Wolski on 11 Nov 2013
How are you doing the timing?
If upgrading is an option: in R2013b we released gputimeit, which gives better measurements of GPU timing, along with a whole year's worth of other improvements.
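A minimal sketch of how gputimeit is used (the 5000x5000 system here is just an illustrative size, not the poster's actual test):

```matlab
% gputimeit takes a function handle; it synchronizes the device and
% repeats the call internally to get a stable measurement.
A = gpuArray.rand(5000, 5000);
b = gpuArray.rand(5000, 1);
t = gputimeit(@() A \ b)
```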
And, as Jill asked: what exactly are you running?

18 Comments

Bonnie on 11 Nov 2013
Edited: Matt J on 11 Nov 2013
I'm timing it using tic and toc, but I doubt the problem is the accuracy of the timing. I have run the commands within a larger program that takes several seconds to execute, and that program takes twice as long to run on the GPU.
Right now I'm looking at the very simple set of commands
L = gpuArray.rand(10000,1);
X = gpuArray.rand(5000,5000);
tic
y = L\X;t1 = toc
L1 = gather(L);X1 = gather(X);
tic,y1 = L1\X1;t2 = toc
I get t1/t2 = 1.4193.
The matrices you showed are not of compatible sizes. You should be getting errors.
Sorry, that's a typo. I've run several sizes; it doesn't make much difference in the relative times between the GPU and the CPU.
Can we see an example without the typo?
Bonnie, have you seen this benchmark http://www.mathworks.co.uk/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html ? This shows a series of backslash timings on a K20, and shows a decent speedup over the CPU. Note that this times the case of "matrix \ vector" rather than "vector \ matrix" that you have stated in your comment.
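The two cases look similar but are quite different problems; roughly (a sketch of the shapes involved):

```matlab
A = gpuArray.rand(5000, 5000);
b = gpuArray.rand(5000, 1);
x = A \ b;    % "matrix \ vector": one square solve, one right-hand side

L = gpuArray.rand(5000, 1);
X = gpuArray.rand(5000, 5000);
y = L \ X;    % "vector \ matrix": a least-squares solve with 5000 right-hand sides
```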
Bonnie on 12 Nov 2013
Edited: Matt J on 12 Nov 2013
Matt J, Here is an example copied from the command window
>> L = gpuArray.rand(5000,1);
>> X = gpuArray.rand(5000,5000);
>> tic,y = L\X;t1 = toc
t1 =
0.2782
>> L1 = gather(L);X1 = gather(X);
>> tic,y1 = L1\X1;t2 = toc
t2 =
0.1864
>> t1/t2
ans =
1.4928
Hi Bonnie, can you put that code in a function file and run the function a couple of times? Timing at the command window is not very accurate.
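Something along these lines, for example (a sketch, assuming wait(gpuDevice) is available in your release; the warm-up run and the synchronization calls are there because gpuArray operations can return before the device has finished, so a bare tic/toc can mis-measure):

```matlab
function t = timeSolve(L, X)
% Time L\X with a warm-up run and explicit GPU synchronization.
y = L \ X;           %#ok<NASGU> warm-up run (not timed)
wait(gpuDevice);     % make sure the device is idle before starting
tic;
y = L \ X;           %#ok<NASGU>
wait(gpuDevice);     % make sure the solve has actually finished
t = toc;
end
```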
Matt J on 12 Nov 2013
Edited: Matt J on 12 Nov 2013
I can confirm what Bonnie is seeing for the GTX 580, R2012b
t1 =
0.2764
t2 =
0.1444
I did this in a function file running several times as Sean suggested.
However, when I run the following, equivalent operations below, the GPU does much better relative to the CPU. Also, both implementations perform better than their mldivide counterparts. So I'm tempted to think that gpuArray.mldivide simply wasn't optimized very well for matrices with these particular relative dimensions, perhaps because the mtimes-based formulation was expected to be the more common usage.
tic,
y = (L.'*X)/norm(L)^2;
t3 = toc
tic
y1 = (L1.'*X1)/norm(L1)^2;
t4 = toc
t3 =
0.0027
t4 =
0.0093
Jill, I am unable to access the link you posted. I get a message saying that the document is only accessible to license holders. I registered my license for the Parallel Computing Toolbox but I still cannot access the document.
I think you mean the link that Sean posted, not Jill. As he mentioned, you would need to upgrade to R2013b.
Bonnie on 13 Nov 2013
Edited: Bonnie on 13 Nov 2013
I've now upgraded to R2013a, but I will have to wait for my license administrator before I can upgrade to R2013b.
My key problem right now is with the SVD function, which is running significantly slower on the GPU than on the CPU even for very large matrices. To compare the two, I've written the following function
function time = timefunc(A)
tic;
[u,s,v] = svd(A);
time = toc;
end
executed with the following commands
A = gpuArray.randn(1000,1000);
A1 = gather(A);
t1 = timefunc(A);
t2 = timefunc(A1);
speedup = t2/t1
For a 1000 x 1000 matrix A, speedup = 0.4743 (i.e. a factor of two faster on the CPU than on the GPU). For a 10000 x 10000 matrix A, the speedup increases to 0.7333, still significantly faster on the CPU. The CPU is faster than the GPU for any matrix that fits in my memory.
It would help us if you used the code-formatting button to format your code separately from your text.
Aside from that, can you give us some detail about your CPU?
I have two Intel Xeon E5-2620 CPUs at 2.00 GHz.
Matt J on 13 Nov 2013
Edited: Matt J on 13 Nov 2013
So 12 cores? I think you have to normalize your t2 by the number of cores in some way to account for the advantage that a multi-core CPU gives you. There are machines with dozens of cores that a single GPU could never beat. Assuming your average CPU is dual core, that would mean a handicap factor of 6, bringing your speed-up ratio to around 2.8.
Pretty decent, I guess, compared to dual-core benchmarks. I doubt all gpuArray operations are expected to be faster than their CPU counterparts. You just don't want them to be slower than some average CPU.
Yes, but I'm also running the Tesla K20 GPU. It appears the MATLAB implementation of the SVD algorithm on the GPU is quite inefficient.
Matt J on 13 Nov 2013
Edited: Matt J on 13 Nov 2013
I don't know if we can conclude that. Do we know why GPUs should do better than CPUs when it comes to SVD? It doesn't seem like a very parallel operation, in my mind.
Bonnie on 13 Nov 2013
Edited: Bonnie on 13 Nov 2013
There are several reports in the literature of GPU SVD algorithms that achieve more than sevenfold speedups over CPUs on commodity GPUs. It is very disappointing that the MATLAB algorithm performs so poorly on the Tesla K20.
Matt J on 13 Nov 2013
Edited: Matt J on 13 Nov 2013
You should cite that literature. Maybe the developers will look at it...
In case it's worth mentioning, I seem to be seeing the same speeds on the GTX 580. I suppose it's suspicious that the Tesla K20 doesn't offer more speed than a 500-series GPU.


It might be worth answering this question for posterity.
It seems the questioner was testing the linear solves, at least, with a very unusual system: many right-hand sides but only one column in the system matrix. Since this is not a typical circumstance, MLDIVIDE is not optimised for it. To get an accurate answer it has to account for possible poor conditioning by using a QR factorisation, and this is less parallelisable than other approaches to solving these equations, one of which is given in the comments to Sean's answer. Another is to solve the normal equations:
% Solve A*X = B for X
R = chol(A'*A);
X = R\(R'\(A'*B));
For SVD and EIG it is possible the same situation applies, perhaps the questioner was carrying out the SVD on a tall skinny matrix. However, it is true that these functions do not parallelise well. I found that a 2000x2000 random matrix could be factored faster on my K20 than on the CPU, but the performance tails off for larger matrices, presumably due to resource contention on the device. It does make a difference whether you ask for all three factors or just the singular values (or, in the case of EIG, whether you ask for eigenvectors or just the eigenvalues).
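For instance, the values-only and full-factorisation calls can be timed separately; a sketch using gputimeit, whose second argument is the number of output arguments to request:

```matlab
A = gpuArray.randn(2000, 2000);
tValues  = gputimeit(@() svd(A));      % singular values only
tFactors = gputimeit(@() svd(A), 3);   % U, S and V as well
```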
For LU on a general matrix and CHOL on a symmetric matrix I found my K20 was much faster than the CPU, so it would be necessary to see exactly what the questioner was doing when they were timing these functions.


Asked: 11 Nov 2013
Answered: 27 Apr 2016
