GPU underperforms after some iterations in the for loop

Hello all,

I am attempting to measure the timing of the following function.

x is a complex random matrix, x = rand(n1,n2)+1i*rand(n1,n2);, where n1 is 168*188*222*12 and n2 is something between 1 and 8.

A is an array of structs, and each struct contains the Tucker decomposition components of a tensor. to_GPU passes the array of structs to the GPU, and nmp implements the multiplications for assembling the full 3D tensors T1-T12 from their Tucker components.
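For readers who do not have nmp at hand, the following is a minimal sketch of what a mode-n product of a 3D tensor with a matrix could look like. The name nmode_product_sketch is hypothetical and this is not the actual nmp implementation; it only assumes that nmp(X,U,n) multiplies the tensor X by the matrix U along mode n.

```matlab
function Y = nmode_product_sketch(X, U, n)
% Hypothetical stand-in for nmp (assumption, not the original code).
% X is an I1-by-I2-by-I3 array, U is J-by-In, and n is 1, 2 or 3.
% Y equals X with dimension n replaced by J.
    sz = size(X);
    sz(end+1:3) = 1;                  % pad trailing singleton dimensions
    order = [n, setdiff(1:3, n)];     % bring mode n to the front
    Xn = reshape(permute(X, order), sz(n), []);   % mode-n unfolding
    Yn = U * Xn;                      % J-by-(product of the other dims)
    newsz = sz(order);
    newsz(1) = size(U, 1);            % mode n now has length J
    Y = ipermute(reshape(Yn, newsz), order);      % fold back to a 3D array
end
```

Under that assumption, the full tensor would be rebuilt from a core G and factors U1, U2, U3 as nmode_product_sketch(nmode_product_sketch(nmode_product_sketch(G,U1,1),U2,2),U3,3), which mirrors the nested nmp calls in the function below.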

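function y = mvp_Zbc_herm_adj_pwl_tucker_gpu(A,x)
    % A: struct array with (at least) 12 rows and m columns holding the
    %    Tucker components (fields G, U1, U2, U3)
    % x: n1-by-n2 complex matrix; y: m-by-n2 result, gathered to the host
    
    [~,m]  = size(A);              % number of columns of A = rows of y
    Nports = size(x,2);            % n2, the number of columns of x
    y = zeros(m,Nports,'like',x);  % same class and complexity as x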
    
    Q1  = to_GPU(A(1,:),1);
    Q2  = to_GPU(A(2,:),1);
    Q3  = to_GPU(A(3,:),1);
    Q4  = to_GPU(A(4,:),1);
    Q5  = to_GPU(A(5,:),1);
    Q6  = to_GPU(A(6,:),1);
    Q7  = to_GPU(A(7,:),1);
    Q8  = to_GPU(A(8,:),1);
    Q9  = to_GPU(A(9,:),1);
    Q10 = to_GPU(A(10,:),1);
    Q11 = to_GPU(A(11,:),1);
    Q12 = to_GPU(A(12,:),1);
    
    for i = 1:m
        
        T1  = nmp(nmp(nmp(Q1(i).G,Q1(i).U1,1),Q1(i).U2,2),Q1(i).U3,3);
        T2  = nmp(nmp(nmp(Q2(i).G,Q2(i).U1,1),Q2(i).U2,2),Q2(i).U3,3);
        T3  = nmp(nmp(nmp(Q3(i).G,Q3(i).U1,1),Q3(i).U2,2),Q3(i).U3,3);
        T4  = nmp(nmp(nmp(Q4(i).G,Q4(i).U1,1),Q4(i).U2,2),Q4(i).U3,3);
        T5  = nmp(nmp(nmp(Q5(i).G,Q5(i).U1,1),Q5(i).U2,2),Q5(i).U3,3);
        T6  = nmp(nmp(nmp(Q6(i).G,Q6(i).U1,1),Q6(i).U2,2),Q6(i).U3,3);
        T7  = nmp(nmp(nmp(Q7(i).G,Q7(i).U1,1),Q7(i).U2,2),Q7(i).U3,3);
        T8  = nmp(nmp(nmp(Q8(i).G,Q8(i).U1,1),Q8(i).U2,2),Q8(i).U3,3);
        T9  = nmp(nmp(nmp(Q9(i).G,Q9(i).U1,1),Q9(i).U2,2),Q9(i).U3,3);
        T10 = nmp(nmp(nmp(Q10(i).G,Q10(i).U1,1),Q10(i).U2,2),Q10(i).U3,3);
        T11 = nmp(nmp(nmp(Q11(i).G,Q11(i).U1,1),Q11(i).U2,2),Q11(i).U3,3);
        T12 = nmp(nmp(nmp(Q12(i).G,Q12(i).U1,1),Q12(i).U2,2),Q12(i).U3,3);
        
        % Stack the 12 tensors into one column and apply its conjugate transpose to x
        y(i,:) = ([T1(:);T2(:);T3(:);T4(:);T5(:);T6(:);T7(:);T8(:);T9(:);T10(:);T11(:);T12(:)])'*x;
    end
    
    y = gather(y);
    
end

When measuring the time using n2=1, the result is around 10 seconds.

If n2=2 or anything higher, the time goes to 30+ minutes.

The GPU is an NVIDIA Quadro Volta GV100 with 32GB of memory.

Theoretically, the operations to compute T1-T12 are the most costly ones, and they are the same regardless of the value of n2. Increasing n2 should only increase the GPU memory occupied by x and the cost of the multiplication that computes y(i,:).
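To put rough numbers on the memory side, here is a small sketch of the size of x for a few values of n2. It relies only on the sizes stated above and on x being complex double (16 bytes per element), which is what rand produces here.

```matlab
% Rough memory estimate for x = rand(n1,n2)+1i*rand(n1,n2)
n1 = 168*188*222*12;          % 84,139,776 rows
bytesPerElem = 16;            % complex double: 8 bytes real + 8 bytes imaginary
for n2 = [1 2 8]
    fprintf('n2 = %d: x needs about %.2f GB\n', n2, n1*n2*bytesPerElem/1e9);
end
```

The stacked vector [T1(:);...;T12(:)] must also have n1 elements for the product with x to be defined, so each iteration creates at least one temporary of that length in addition to T1-T12 themselves.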

However, when running the code with n2 = 2 or higher, I noticed time spikes in the computation of T1-T12 as the for loop proceeds. Each of these computations usually needs less than 0.1 seconds, but sometimes it takes up to 4 seconds. This doesn't happen if n2=1.
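Since gpuArray operations can run asynchronously with respect to the host, plain tic/toc readings may attribute time to the wrong statement. The following is a minimal sketch of how per-iteration times and free GPU memory could be recorded with explicit synchronization. The variable names, the iteration count and the dummy workload are placeholders (assumptions), to be replaced by the body of the loop above.

```matlab
dev     = gpuDevice;               % currently selected GPU
nIter   = 20;                      % hypothetical number of iterations to sample
tIter   = zeros(nIter,1);          % wall-clock time per iteration, in seconds
memFree = zeros(nIter,1);          % free GPU memory after each iteration, in bytes

for i = 1:nIter
    wait(dev);                     % make sure earlier GPU work has finished
    t0 = tic;

    % Placeholder workload (assumption): replace with T1..T12 and y(i,:)
    M = rand(2000,'gpuArray');
    P = M*M;

    wait(dev);                     % block until this iteration's GPU work is done
    tIter(i)   = toc(t0);
    memFree(i) = dev.AvailableMemory;
end

disp([tIter, memFree/1e9]);        % seconds and free memory in GB, per iteration
```

If the slow iterations coincide with low values in memFree, that would point toward memory pressure rather than the arithmetic itself, although this sketch alone cannot confirm the cause.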
