Summing array elements seems to be slow on GPU

I am measuring the execution time of the following function on the CPU and on the GPU:
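function funTestGPU(P,U,K,UN)
for k = 1:P
    H = exp(1i*K);              % element-wise complex exponential
    HU = U.*H;                  % element-wise product, U expanded along dim 3
    UN(k,:) = sum(HU,[1,3]);    % sum over dimensions 1 and 3
end
end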
where U and UN are complex arrays of size P×P and K is a complex array of size P×P×P. So in each iteration I perform an element-wise exp(), an element-wise multiplication of two arrays, and a summation of the elements of a 3D array along two dimensions.
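To make the shapes in the paragraph above concrete, here is a minimal sketch of one loop iteration on the CPU. The tiny size P = 4 and the variable name s are illustrative assumptions, not part of the original post. It relies on implicit expansion (R2016b or later) and on the vector-dimension form of sum (R2018b or later).

```matlab
P = 4;                                   % tiny size, for illustration only
U = complex(rand(P), rand(P));           % P-by-P
K = complex(rand(P,P,P), rand(P,P,P));   % P-by-P-by-P

H  = exp(1i*K);          % element-wise, stays P-by-P-by-P
HU = U.*H;               % U is expanded along dimension 3, result P-by-P-by-P
s  = sum(HU,[1,3]);      % collapses dimensions 1 and 3, leaving 1-by-P

disp(size(H));
disp(size(HU));
disp(size(s));
```

The 1-by-P result is what gets written into row k of UN.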
I test the execution time on CPU and on GPU with the help of the following script
P = 200;
URe = 1/(sqrt(2))*rand(P);
UIm = 1/(sqrt(2))*rand(P);
KRe = 1/(sqrt(2))*rand(P,P,P);
KIm = 1/(sqrt(2))*rand(P,P,P);
% CPU
U = complex(URe, UIm);
K = complex(KRe, KIm);
UN = complex(zeros(P), zeros(P));
fcpu = @() funTestGPU(P,U,K,UN);
tcpu = timeit(fcpu);
disp(['CPU time: ',num2str(tcpu)])
% GPU
U = gpuArray(complex(URe, UIm));
K = gpuArray(complex(KRe, KIm));
UN = gpuArray(complex(zeros(P), zeros(P)));
fgpu = @() funTestGPU(P,U,K,UN);
tgpu = gputimeit(fgpu);
disp(['GPU time: ',num2str(tgpu)])
and I obtain the results
CPU time: 9.0315
GPU time: 3.3894
My concern is that if I remove the last operation from the funTestGPU (summing array elements) I obtain the results
CPU time: 8.0185
GPU time: 0.0045631
So it looks like the summation is the most time-consuming operation on GPU. Is that an expected result?
I wrote analogous code in CuPy and in PyTorch, and there the summation does not seem to be the most time-consuming operation.
I use MATLAB R2019b. My graphics card is an NVIDIA GeForce GTX 1050 Ti (768 CUDA cores), and my processor is an AMD Ryzen 7 3700X (8 physical cores).
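One way to check which step dominates is to time each operation in isolation, with each handle returning its result. The following is a rough sketch rather than a definitive benchmark. The smaller size P = 100, the fallback to timeit when no GPU is found, and the names timeFcn, tExp, tMul and tSum are all assumptions made for illustration. gputimeit requires Parallel Computing Toolbox.

```matlab
P = 100;                                  % smaller than the 200 used above
U = complex(rand(P), rand(P));
K = complex(rand(P,P,P), rand(P,P,P));

% Use the GPU if one is available, otherwise fall back to the CPU
try
    useGPU = gpuDeviceCount > 0;
catch
    useGPU = false;                       % toolbox missing or no usable device
end
if useGPU
    U = gpuArray(U);
    K = gpuArray(K);
    timeFcn = @gputimeit;
else
    timeFcn = @timeit;
end

% Precompute the inputs of the later steps so each step is timed alone
H  = exp(1i*K);
HU = U.*H;

tExp = timeFcn(@() exp(1i*K));            % element-wise exponential
tMul = timeFcn(@() U.*H);                 % element-wise product
tSum = timeFcn(@() sum(HU,[1,3]));        % reduction over dimensions 1 and 3

fprintf('exp: %.4g s, times: %.4g s, sum: %.4g s\n', tExp, tMul, tSum);
```

The absolute numbers depend heavily on the hardware, so only the relative sizes of the three timings are meaningful.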
  2 Comments
Matt J on 27 Apr 2023
Moved: Matt J on 27 Apr 2023
"So it looks like the summation is the most time-consuming operation on GPU. Is that an expected result?"
That's what I would expect. It's the only operation in the chain that is not element-wise.
Damian Suski on 27 Apr 2023
@Matt J Thank you for your comment. Before I ran the tests, I had assumed that the exponential would be the most time-consuming operation, but it turns out that the element-wise operations are not the bottleneck. I just wanted to make sure that I was not missing something obvious.

Accepted Answer

Joss Knight on 27 Apr 2023
These are the results I got on my (somewhat old) GeForce GTX 1080 Ti:
CPU time: 16.1288
GPU time: 0.96266
If I change the datatype to single I get:
CPU time: 14.9785
GPU time: 0.35102
That's roughly 2.7x faster on the GPU (0.96 s down to 0.35 s).
So on the one hand your GPU is pretty slow and your CPU is pretty fast, and on the other maybe you could try using single precision instead, if you don't mind the loss of accuracy.
  1 Comment
Damian Suski on 27 Apr 2023
Well, I would also say that my CPU is quite fast and my GPU is rather weak (only about 800 CUDA cores, 4 GB RAM). Several years ago I bought the cheapest graphics card, without parallel computing in mind.
The results for your card (over 3.5k CUDA cores, 11 GB RAM) are pretty impressive. I tried a GeForce RTX 3060 (over 3.5k CUDA cores, 12 GB RAM) on another computer and it took 1.5 s in double precision. For the analogous code in PyTorch, I tried a Tesla T4 card (freely available on Google Colab), which also took 1.5 s. So the choice of GPU makes a real difference.
I will definitely try single precision, but at the moment it is hard for me to say whether the precision loss will be acceptable for my purpose.
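One possible way to judge whether the precision loss is acceptable is to run the same computation in both precisions on a small problem and compare the results. The sketch below runs on the CPU. The size P = 50, the fixed random seed, and the names refD, resS and relErr are illustrative assumptions, and a relative error measured on random data may not reflect the behaviour of the real application.

```matlab
rng(0);                                   % fixed seed so the run is repeatable
P = 50;
U = complex(rand(P), rand(P))/sqrt(2);
K = complex(rand(P,P,P), rand(P,P,P))/sqrt(2);

% Same computation in double and in single precision
refD = sum(U.*exp(1i*K),[1,3]);
resS = sum(single(U).*exp(1i*single(K)),[1,3]);

% Largest relative deviation of the single result from the double result
relErr = max(abs(double(resS) - refD) ./ abs(refD));
fprintf('max relative error (single vs double): %.3g\n', relErr);
```

On the GPU the same comparison can be made by wrapping U and K in gpuArray before the two computations.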

More Answers (1)

Joss Knight on 27 Apr 2023
Moved: Matt J on 27 Apr 2023
Why are you recomputing H and HU inside the loop? They do not change between iterations. Also, if you remove the sum, the values computed in the first (P-1) iterations are never used, so only the last computation of those values will actually take place.
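A minimal sketch of the point about loop-invariant work, assuming (as in the posted dummy example) that nothing inside the loop depends on k. The names UNloop, row and UNhoisted and the small size P = 20 are illustrative assumptions. In a real procedure where K or U changes with k, this shortcut would not apply.

```matlab
P = 20;
U = complex(rand(P), rand(P));
K = complex(rand(P,P,P), rand(P,P,P));

% Loop as posted: the same H, HU and sum are recomputed P times
UNloop = complex(zeros(P), zeros(P));
for k = 1:P
    H  = exp(1i*K);
    HU = U.*H;
    UNloop(k,:) = sum(HU,[1,3]);
end

% Hoisted version: compute the 1-by-P row once, then copy it into every row
row = sum(U.*exp(1i*K),[1,3]);
UNhoisted = repmat(row, P, 1);

% Largest difference between the two results
disp(max(abs(UNloop(:) - UNhoisted(:))));
```

The hoisted version does the exponential, the product and the reduction once instead of P times, so it is only a fair stand-in for the loop when the loop body truly is invariant.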
  6 Comments
Damian Suski on 28 Apr 2023
I have tried the batching approach on my GPU, but have not noticed any speedup. I will try it on a better GPU and describe the detailed results.
Damian Suski on 18 May 2023
I ran the experiments and did not notice any speedup from batching. Computation time increases in proportion to the batch size.
I have implemented the proper procedure and was able to reproduce the speedup discussed for the dummy example. Computation time dropped from 186 s on the CPU to 42 s on the GPU. On a better graphics card it is shorter still: 21 s. In summary, I'm satisfied with the results.
What still concerns me is that in MATLAB the element-wise exp() is much faster than summing elements along two dimensions. For the analogous calculations in CuPy or PyTorch, the situation seems to be the opposite. May I post my detailed findings here, or should I start a new topic?

Release

R2019b
