GPU underperforms after some iterations in the for loop
Hello all,
I am attempting to measure the run time of the following function.
x is a complex random matrix:
x = rand(n1,n2)+1i*rand(n1,n2);
where n1 = 168*188*222*12 and n2 is between 1 and 8.
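For a sense of scale, the sizes above can be turned into a memory estimate. This is only arithmetic on the stated dimensions, assuming x is stored as complex double (16 bytes per element), which is what rand produces by default:

```matlab
% Memory footprint of x for the sizes stated in the post
n1 = 168*188*222*12;           % 84,139,776 rows
bytesPerElement = 16;          % complex double: 8 bytes real + 8 bytes imaginary

for n2 = [1 2 8]
    gb = n1 * n2 * bytesPerElement / 1e9;
    fprintf('n2 = %d: x occupies about %.2f GB\n', n2, gb);
end
```

Each column of x is therefore roughly 1.35 GB, and the stacked vector [T1(:);...;T12(:)] built in every loop iteration has the same length as one column.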
A is an array of structs, and each struct contains the Tucker decomposition components of a tensor.
to_GPU passes the array of structs to the GPU, and nmp implements the multiplications that assemble the full 3D tensors T1-T12 from their Tucker components.
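The source of nmp is not shown, so the following is only a sketch of what assembling a full tensor from Tucker components usually involves: a mode-n product of the core G with each factor matrix in turn. The factor sizes, the cell array U, and the permute/reshape approach are assumptions for illustration, not the actual nmp implementation. The result is checked against the Kronecker identity vec(T) = kron(U3, kron(U2, U1)) * vec(G).

```matlab
% Hypothetical small Tucker factors (sizes chosen only for illustration)
G = rand(2, 3, 4);                          % core tensor
U = {rand(5, 2), rand(6, 3), rand(7, 4)};   % factor matrices U1, U2, U3

T = G;
for n = 1:3
    sz = size(T);
    sz(end+1:3) = 1;                        % pad in case trailing dims collapsed
    order = [n, setdiff(1:3, n)];           % bring mode n to the front
    Tn = reshape(permute(T, order), sz(n), []);   % mode-n unfolding
    Tn = U{n} * Tn;                         % multiply along mode n
    szp = sz(order);
    szp(1) = size(U{n}, 1);                 % mode n now has the new length
    T = ipermute(reshape(Tn, szp), order);  % fold back into a 3D tensor
end

% Reference result via the Kronecker identity (only feasible for tiny sizes)
T_ref = reshape(kron(U{3}, kron(U{2}, U{1})) * G(:), 5, 6, 7);
err = max(abs(T(:) - T_ref(:)));
```

The same steps work on gpuArray inputs, since permute, reshape and matrix multiplication are all supported on the GPU.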
function y = mvp_Zbc_herm_adj_pwl_tucker_gpu(A,x)
    [~,m] = size(A);
    Nports = size(x,2);
    y = zeros(m,Nports,'like',x);

    % Transfer the Tucker components of each of the 12 rows of A to the GPU
    Q1 = to_GPU(A(1,:),1);
    Q2 = to_GPU(A(2,:),1);
    Q3 = to_GPU(A(3,:),1);
    Q4 = to_GPU(A(4,:),1);
    Q5 = to_GPU(A(5,:),1);
    Q6 = to_GPU(A(6,:),1);
    Q7 = to_GPU(A(7,:),1);
    Q8 = to_GPU(A(8,:),1);
    Q9 = to_GPU(A(9,:),1);
    Q10 = to_GPU(A(10,:),1);
    Q11 = to_GPU(A(11,:),1);
    Q12 = to_GPU(A(12,:),1);

    for i = 1:m
        % Assemble the full 3D tensors from core G and factors U1, U2, U3
        T1 = nmp(nmp(nmp(Q1(i).G,Q1(i).U1,1),Q1(i).U2,2),Q1(i).U3,3);
        T2 = nmp(nmp(nmp(Q2(i).G,Q2(i).U1,1),Q2(i).U2,2),Q2(i).U3,3);
        T3 = nmp(nmp(nmp(Q3(i).G,Q3(i).U1,1),Q3(i).U2,2),Q3(i).U3,3);
        T4 = nmp(nmp(nmp(Q4(i).G,Q4(i).U1,1),Q4(i).U2,2),Q4(i).U3,3);
        T5 = nmp(nmp(nmp(Q5(i).G,Q5(i).U1,1),Q5(i).U2,2),Q5(i).U3,3);
        T6 = nmp(nmp(nmp(Q6(i).G,Q6(i).U1,1),Q6(i).U2,2),Q6(i).U3,3);
        T7 = nmp(nmp(nmp(Q7(i).G,Q7(i).U1,1),Q7(i).U2,2),Q7(i).U3,3);
        T8 = nmp(nmp(nmp(Q8(i).G,Q8(i).U1,1),Q8(i).U2,2),Q8(i).U3,3);
        T9 = nmp(nmp(nmp(Q9(i).G,Q9(i).U1,1),Q9(i).U2,2),Q9(i).U3,3);
        T10 = nmp(nmp(nmp(Q10(i).G,Q10(i).U1,1),Q10(i).U2,2),Q10(i).U3,3);
        T11 = nmp(nmp(nmp(Q11(i).G,Q11(i).U1,1),Q11(i).U2,2),Q11(i).U3,3);
        T12 = nmp(nmp(nmp(Q12(i).G,Q12(i).U1,1),Q12(i).U2,2),Q12(i).U3,3);

        % Stack the vectorized tensors and apply the conjugate transpose to x
        y(i,:) = ([T1(:);T2(:);T3(:);T4(:);T5(:);T6(:);T7(:);T8(:);T9(:);T10(:);T11(:);T12(:)])'*x;
    end

    y = gather(y);
end
When measuring the time using n2=1, the result is around 10 seconds.
If n2=2 or anything higher, the time goes to 30+ minutes.
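When timing GPU code in MATLAB, it is worth keeping in mind that GPU calls can return before the work has finished, so a plain tic/toc around a single line may under-report. The sketch below shows two common ways to time a GPU operation: gputimeit, and tic/toc bracketed by wait(gpuDevice). The matrix a and the handle f are hypothetical stand-ins for the real function, and the block is skipped when no supported GPU is present.

```matlab
% Timing a GPU operation with explicit synchronization
t_gputimeit = NaN;
t_tictoc = NaN;

if canUseGPU()
    a = rand(2000, 'gpuArray');   % hypothetical stand-in workload
    f = @() a * a;

    % Option 1: gputimeit runs f several times and synchronizes for you
    t_gputimeit = gputimeit(f);

    % Option 2: manual timing with explicit synchronization
    dev = gpuDevice;
    wait(dev);                    % drain pending work before starting the clock
    tic;
    b = f();
    wait(dev);                    % block until the GPU has finished
    t_tictoc = toc;
end
```

Because the function in the post ends with gather, its total run time should already include all GPU work; the synchronization matters mostly when timing individual lines inside the loop.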
The GPU is a NVIDIA Quadro Volta GV100 with 32GB of memory.
In theory, the operations that compute T1-T12 are the most costly ones, and they are the same regardless of n2. Increasing n2 should only increase the GPU memory in use and the cost of the multiplication that computes y(i,:).
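Since memory use is the part that grows with n2, one variation worth trying is to avoid building the stacked vector and instead accumulate the product block by block. The sketch below uses small hypothetical sizes on the CPU and only demonstrates that the two forms give the same result; whether it removes the slowdown in the real code is not established here.

```matlab
% Small stand-ins for the reconstructed tensors T1..T12 (sizes are illustrative)
nBlocks = 12;
blockLen = 2*3*4;
nPorts = 3;
T = cell(1, nBlocks);
for k = 1:nBlocks
    T{k} = rand(2, 3, 4) + 1i*rand(2, 3, 4);
end
x = rand(nBlocks*blockLen, nPorts) + 1i*rand(nBlocks*blockLen, nPorts);

% Form used in the post: stack all tensors, then one large product
tAll = cellfun(@(t) t(:), T, 'UniformOutput', false);
yFull = vertcat(tAll{:})' * x;

% Blockwise alternative: multiply each tensor with its slice of x and sum
yBlock = zeros(1, nPorts, 'like', x);
offset = 0;
for k = 1:nBlocks
    len = numel(T{k});
    yBlock = yBlock + T{k}(:)' * x(offset + (1:len), :);
    offset = offset + len;
end

err = max(abs(yFull - yBlock));   % should be at rounding level
```

The blockwise form never holds a temporary as long as a full column of x, although each slice of x is still copied when it is indexed.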
However, when running the code with n2 = 2 or higher, I noticed some time spikes in the computation of T1-T12 as the for loop proceeds. In general they need less than 0.1 seconds, but sometimes this can take up to 4 seconds. This doesn't happen if n2=1.
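To narrow down where the spikes come from, it may help to log the synchronized time and the free GPU memory for every iteration, and to check whether x is already a gpuArray. If x is a host array, it may be transferred to the GPU each time it is multiplied by a gpuArray. The sketch below is a diagnostic harness under those assumptions; the workload a * a, the small x, and the spike threshold are hypothetical placeholders.

```matlab
% Per-iteration diagnostics: synchronized time and free GPU memory
nIter = 20;
tIter = nan(nIter, 1);
memFree = nan(nIter, 1);
spikes = [];

x = rand(100, 2) + 1i*rand(100, 2);   % hypothetical stand-in for the real x
xOnGpu = isa(x, 'gpuArray');          % false here; check this in the real code

if canUseGPU()
    dev = gpuDevice;
    a = rand(1000, 'gpuArray');       % hypothetical stand-in workload
    for i = 1:nIter
        wait(dev);                    % start from an idle GPU
        tic;
        b = a * a;                    % replace with the T1..T12 assembly
        wait(dev);                    % include all queued GPU work in the timing
        tIter(i) = toc;
        memFree(i) = dev.AvailableMemory;   % bytes currently free on the device
    end
    % Flag iterations far slower than typical (threshold is arbitrary)
    spikes = find(tIter > 10 * median(tIter));
end
```

If the slow iterations line up with dips in memFree, memory pressure from large temporaries is a plausible suspect; if they do not, the cause likely lies elsewhere.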
Any help will be greatly appreciated.