Using kernels in a for-loop: GPU computation time scales linearly with iterations

I've got an algorithm in MATLAB that is based on a for-loop over time steps:
for cnt = 1:cnt_max
    % calculation based on measurement data and the result of the previous time step
end
If I use the gpuArray interface and arrayfun, my computation time per iteration scales linearly with cnt. The same happens if I write the functions in CUDA and create kernels in MATLAB to do the calculations with feval:
Make ten different kernels with parallel.gpu.CUDAKernel
Set their GridSize and ThreadBlockSize
Initialize the result variables as gpuArrays
for cnt = 1:cnt_max
    tic;
    data_cnt = gpuArray(data_cnt);   % data is stored in a matrix on the CPU
    result1_cnt = feval(myKernel1, result1_cnt, input);
    (...)
    result10_cnt = feval(myKernel10, result10_cnt, input);
    wait(gpuDevice); toc;
end
I really have no clue why my computation time keeps increasing: I neither create variables inside the loop nor change their size. I am new to GPU computing and CUDA, so I don't know what to do. I use MATLAB R2013b with the Parallel Computing Toolbox and a Tesla K20c GPU.
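One general point worth keeping in mind when timing code like this: gpuArray operations are launched asynchronously, so tic/toc without a synchronization point measures only how long it takes to *enqueue* the work, and still-pending kernels from earlier iterations can then appear to inflate later measurements. A minimal sketch of a synchronized timing pattern (the loop body here is a placeholder, not the actual kernels from the question; gputimeit is also available in recent releases and handles synchronization itself):

```matlab
% Sketch: timing per-iteration GPU work with explicit synchronization.
dev = gpuDevice;                 % handle to the current GPU
t = zeros(1, cnt_max);
for cnt = 1:cnt_max
    tic;
    % ... feval kernel calls / gpuArray arithmetic here ...
    wait(dev);                   % block until all queued GPU work finishes
    t(cnt) = toc;                % now toc reflects actual execution time
end
```

If the per-iteration time still grows with wait in place, the growth is real work (or a growing operation queue from one un-synchronized operation), not a timing artifact.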

5 Comments

Do you have any code that you can post to reproduce the problem?
I just was able to fix it.
At the end of my for-loop I calculate a distance as follows:
res = sqrt( (a-b)^2 + (c-d)^2 )
where res, a, c are gpuArrays and b, d are stored on the CPU. All variables are scalars (1×1).
Changing the matrix power ^2 into the elementwise power .^2 did the trick:
res = sqrt( (a-b).^2 + (c-d).^2 )
I don't know why, because I thought that for scalar values there is no difference between ^2 and .^2. But evidently this was the problem.
If you could post an example, it would still be very useful. You are right that for scalars ^2 and .^2 should be the same, so there may be something we need to investigate there.
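For scalars the two operators do agree numerically, but they dispatch to different functions, which on a gpuArray can take different code paths. A minimal sketch of the distinction (plain MATLAB, no GPU required; the performance difference reported above is specific to the gpuArray implementations):

```matlab
% ^  is matrix power   -> calls mpower(a, 2)
% .^ is elementwise    -> calls power(a, 2)
a = -3;
b1 = a^2;    % mpower: defined via matrix multiplication for square matrices
b2 = a.^2;   % power:  applied element by element
% For a 1x1 input both return 9, but a gpuArray may route mpower
% through a general matrix-power path, which can behave differently
% in speed and in whether the result is stored as complex.
```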
As you can see in the attached plot, in the previous version my computing time per iteration scaled up linearly. With the modification x^2 -> x.^2 I no longer have this problem.
Previous version:
for i = 1:N
    tic;
    % calculate A and B on the GPU
    res = sqrt( (A-datalist(i).x(1,1))^2 + (B-datalist(i).x(2,1))^2 );
    wait(gpuDevice);
    time_per_iteration = toc;
end
Fixed version:
for i = 1:N
    tic;
    % calculate A and B on the GPU
    res = sqrt( (A-datalist(i).x(1,1)).^2 + (B-datalist(i).x(2,1)).^2 );
    wait(gpuDevice);
    time_per_iteration = toc;
end
where A and B are scalar (1×1) gpuArrays and datalist is stored on the CPU.
I am not sure if it is related, but I just discovered another strange behaviour when using ^2 on a gpuArray.
A is a negative scalar gpuArray: A < 0, imag(A) = 0
B = A^2      -> imag(B) = 0.0000e+00  (complex, with zero imaginary part)
B = A.^2     -> imag(B) = 0
B = abs(A)^2 -> imag(B) = 0
B = A*A      -> imag(B) = 0
So if I use ^2 on a negative scalar gpuArray, the result gets an imaginary part. This part is in fact zero, but to MATLAB the result is no longer a real number.
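A quick way to check this storage difference is isreal, which reports whether a value has a complex part allocated at all (complex(x,0) compares equal to x but is stored differently). A minimal sketch, assuming a GPU is present; real() can be used to drop an all-zero imaginary part:

```matlab
A  = gpuArray(-2);   % negative 1x1 gpuArray
B1 = A^2;            % matrix power: may come back stored as complex
B2 = A.^2;           % elementwise power: stays real
isreal(B1)           % may be false even though imag(B1) is zero
isreal(B2)           % true
B1 = real(B1);       % strip the zero imaginary part if it appears
```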


Answers (0)

Asked: 16 Jan 2014

Commented: 22 Jan 2014
