GPU computation on mesh using MatLab

Hello,
I am new to GPU computation in Matlab and would like some advice.
I would like to calculate certain physical properties (stress) on a 2D mesh. The value at each grid point is the sum of a number of stresses (which need to be in turn calculated).
Therefore, one could have either 1 loop over the mesh points and a vectorized version of the calculation of the constituent stresses (summing them altogether at the end), or two nested for loops.
The number of constituent stresses does not necessarily match the number of mesh points. Hence, I do not believe I can use the arrayfun tool in Matlab. Furthermore, I would also have to pass through certain physical constants.
I have tried evaluating the first possible structure (i.e. 1 loop over the mesh points and a vectorial calculation of the inputs), by transferring the grids etc. onto the gpu using gpuArray. I find however that the performance is terrible (orders of magnitude slower than CPU computing, even though I have a fairly powerful graphics card, Nvidia GTX670).
A question is, if one has a for loop using arrays on the gpu, does Matlab automatically assign them to different threads?
I have a C version of the calculation of a single input on a single mesh point, so I wonder whether I should simply write a kernel function in CUDA?
Kind Regards,
F

 Accepted Answer

Matt J
Matt J on 28 Jul 2013
Edited: Matt J on 28 Jul 2013
A question is, if one has a for loop using arrays on the gpu, does Matlab automatically assign them to different threads?
No. In fact, I don't think it ever does.
I have a C version of the calculation of a single input on a single mesh point, so I wonder whether I should simply write a kernel function in CUDA?
Maybe, but knowing more about what that calculation looks like might help us see how to do it with built-in gpuArray commands.

6 Comments

This is the function. Its rather complicated...
function [Txp, Typ, Tzp, Txm, Tym, Tzm] = gpuSurfaceTraction(Lx,Ly,nx,ny,thickness,segments,a,MU,NU,virtual_seg)
% GRID CHARACTERISTICS
%generate x-y-z point grid suitable for FieldPointStress_mex
% Lx = ; %x-length from center
% Ly = ; %y-length from center
% nx = ; %nrow
% ny = ; %ncol
% t = ; %half-thickness
% COMPUTE STRESS FIELD IN INFINITE MEDIUM sigma_inf
%generate suitable dislocation segment list
%format of segments array:
% (n0,n1,bx,by,bz,x1,y1,z1,x2,y2,z2)
x1r = segments(:,6:8);
x2r = segments(:,9:11);
br = segments(:,3:5);
if isempty(virtual_seg) %if virtual_seg does not exist, no need to calculate stress contribution of virtual segs and yoffe image stress
x1 = x1r;
x2 = x2r;
b = br;
else %virtual segments are present
x1v = virtual_seg(:,6:8);
x2v = virtual_seg(:,9:11);
bv = virtual_seg(:,3:5);
x1 = vertcat(x1r,x1v);
x2 = vertcat(x2r,x2v);
b = vertcat(br,bv);
end
%%GPU
%generate grid on GPU
gridsp=parallel.gpu.GPUArray.zeros(nx*ny,3);
gridsm=parallel.gpu.GPUArray.zeros(nx*ny,3);
% gridsp=zeros(nx*ny,3);
% gridsm=zeros(nx*ny,3);
%generate grid for upper and lower surface
for i=1:nx
for j=1:ny
gridsp(i+(j-1)*nx,1)=(i-1)/nx*Lx;
gridsp(i+(j-1)*nx,2)=(j-1)/ny*Ly;
gridsp(i+(j-1)*nx,3)=thickness;
gridsm(i+(j-1)*nx,1)=(i-1)/nx*Lx;
gridsm(i+(j-1)*nx,2)=(j-1)/ny*Ly;
gridsm(i+(j-1)*nx,3)=-thickness;
end
end
x=vertcat(gridsp,gridsm); %grid points for top and bottom film
% Transfer dislocation list on GPU
x1=gpuArray(x1);
b=gpuArray(b);
a=gpuArray(a);
MU=gpuArray(MU);
NU=gpuArray(NU);
sx=size(x,1);
sx1=size(x1,1);
Stress_total = parallel.gpu.GPUArray.zeros(sx,6);
%%Calculate Stress (on GPU)
% inputs:x(dim,1:3) a list of positions where you want the stress calculated
% x1(nseg,1:3) a list of starting positions of the dislocations segments
% x2(nseg,1:3) a list of ending positions of the dislocation segments
% b(nseg,1:3) a list of burgers vectors associated with the segments going from x1 to x2
% a=core width
% MU= shear modulus
% NU=poisson's ratio
%
% outputs:Stress(dim,1:6) stresses at the field points requested it will calculate the field point stress as a sum of all the segments
% the components of the stress are organized in the following fashion
% Stress(i,:)=[s_11 s_22 s_33 s_12 s_23 s_13]
for i=1:sx
x_temp=ones(sx1,1)*x(i,:); %vectorize
Diff=x2-x1;
oneoverL=1./sqrt(sum(Diff.*Diff,2));
t=Diff.*[oneoverL oneoverL oneoverL];
R=(x_temp-x1);
Rdt=sum(R.*t,2);
nd=R-[Rdt Rdt Rdt].*t;
d2=sum(nd.*nd,2);
s(:,1)=-sum(R.*t,2);
s(:,2)=-sum((x_temp-x2).*t,2);
a2=a*a;
a2_d2 = a2+d2;
temp=1./a2_d2;
a2d2inv=[temp temp];
Ra = sqrt( [a2_d2 a2_d2] + s.*s);
Rainv=1./Ra;
Ra3inv=Rainv.*Rainv.*Rainv;
sRa3inv=s.*Ra3inv;
s_03=s.*Rainv.*a2d2inv;
s_13=-Rainv;
s_05=(2.*s_03+sRa3inv).*a2d2inv;
s_15=-Ra3inv;
s_25=s_03-sRa3inv;
mf=[-1 1]';
s_03=s_03*mf;
s_13=s_13*mf;
s_05=s_05*mf;
s_15=s_15*mf;
s_25=s_25*mf;
m4p=0.25 * MU / pi;
m8p=0.5 * m4p;
m4pn=m4p / ( 1 - NU );
mn4pn=m4pn*NU;
%a2m4pn=a2 * m4pn;
a2m8p=a2 * m8p;
onev=ones(size(x1,1),1);
zerov=zeros(size(x1,1),1);
eyesix=[onev onev onev zerov zerov zerov];
txb=[ t(:,2).*b(:,3)-t(:,3).*b(:,2) , t(:,3).*b(:,1)-t(:,1).*b(:,3) , t(:,1).*b(:,2)-t(:,2).*b(:,1)];
dxb=[nd(:,2).*b(:,3)-nd(:,3).*b(:,2) , nd(:,3).*b(:,1)-nd(:,1).*b(:,3) , nd(:,1).*b(:,2)-nd(:,2).*b(:,1)];
dxbdt=sum(dxb.*t,2);
dxbdtv=[dxbdt dxbdt dxbdt dxbdt dxbdt dxbdt];
dmd=[ nd(:,1).* nd(:,1) nd(:,2).* nd(:,2) nd(:,3).* nd(:,3) nd(:,1).* nd(:,2) nd(:,2).* nd(:,3) nd(:,3).* nd(:,1)];
tmt=[ t(:,1).* t(:,1) t(:,2).* t(:,2) t(:,3).* t(:,3) t(:,1).* t(:,2) t(:,2).* t(:,3) t(:,3).* t(:,1)];
tmd=[ 2.*t(:,1).* nd(:,1) 2.*t(:,2).* nd(:,2) 2.*t(:,3).* nd(:,3) t(:,1).* nd(:,2)+ nd(:,1).* t(:,2) t(:,2).* nd(:,3)+ nd(:,2).* t(:,3) t(:,3).* nd(:,1)+ nd(:,3).* t(:,1)];
tmtxb=[ 2.*t(:,1).*txb(:,1) 2.*t(:,2).*txb(:,2) 2.*t(:,3).*txb(:,3) t(:,1).*txb(:,2)+ txb(:,1).* t(:,2) t(:,2).*txb(:,3)+ txb(:,2).* t(:,3) t(:,3).*txb(:,1)+txb(:,3).* t(:,1)];
dmtxb=[2.*nd(:,1).*txb(:,1) 2.*nd(:,2).*txb(:,2) 2.*nd(:,3).*txb(:,3) nd(:,1).*txb(:,2)+ txb(:,1).*nd(:,2) nd(:,2).*txb(:,3)+ txb(:,2).*nd(:,3) nd(:,3).*txb(:,1)+txb(:,3).*nd(:,1)];
tmdxb=[ 2.*t(:,1).*dxb(:,1) 2.*t(:,2).*dxb(:,2) 2.*t(:,3).*dxb(:,3) t(:,1).*dxb(:,2)+ dxb(:,1).* t(:,2) t(:,2).*dxb(:,3)+ dxb(:,2).* t(:,3) t(:,3).*dxb(:,1)+dxb(:,3).* t(:,1)];
common=m4pn.*dxbdtv;
I_03=m4pn.*(dxbdtv.*eyesix+dmtxb)-m4p.*tmdxb;
I_13=-mn4pn.*tmtxb;
I_05=common.*(a2.*eyesix+dmd)-a2m8p.*tmdxb;
I_15=a2m8p.*tmtxb-common.*tmd;
I_25=common.*tmt;
Stress= I_03.*[s_03 s_03 s_03 s_03 s_03 s_03]+I_13.*[s_13 s_13 s_13 s_13 s_13 s_13]...
+I_05.*[s_05 s_05 s_05 s_05 s_05 s_05]+I_15.*[s_15 s_15 s_15 s_15 s_15 s_15]+I_25.*[s_25 s_25 s_25 s_25 s_25 s_25];
Stress_total(i,:)=sum(Stress,1);
end
%%COMPUTE TRACTION ON THIN FILM SURFACE
%find normal to planes
l=size(gridsp,1);
P1 = gridsp(1,:);
P2 = gridsp(2,:);
P3 = gridsp(l,:);
normal_p = cross(P1-P2, P1-P3); %find normal to thin film plane (outwards)
normal_p = normal_p/norm(normal_p); %unit vector upper film plane
normal_m = -normal_p; %unit vector lower film plane
nxp = normal_p(1);
nyp = normal_p(2);
nzp = normal_p(3);
nxm = normal_m(1);
nym = normal_m(2);
nzm = normal_m(3);
%Calculate Tp
sx=sx*0.5;
Txp = Stress_total(1:sx,1)*nxp + Stress_total(1:sx,4)*nyp + Stress_total(1:sx,6)*nzp;
Typ = Stress_total(1:sx,4)*nxp + Stress_total(1:sx,2)*nyp + Stress_total(1:sx,5)*nzp;
Tzp = Stress_total(1:sx,6)*nxp + Stress_total(1:sx,5)*nyp + Stress_total(1:sx,3)*nzp;
%Calculate Tm
tsx=sx*2;
sxp=sx+1;
Txm = Stress_total(sxp:tsx,1)*nxm + Stress_total(sxp:tsx,4)*nym + Stress_total(sxp:tsx,6)*nzm;
Tym = Stress_total(sxp:tsx,4)*nxm + Stress_total(sxp:tsx,2)*nym + Stress_total(sxp:tsx,5)*nzm;
Tzm = Stress_total(sxp:tsx,6)*nxm + Stress_total(sxp:tsx,5)*nym + Stress_total(sxp:tsx,3)*nzm;
%Reshape into grid
Txp = reshape(Txp,nx,ny);
Typ = reshape(Typ,nx,ny);
Tzp = reshape(Tzp,nx,ny);
Txm = reshape(Txm,nx,ny);
Tym = reshape(Tym,nx,ny);
Tzm = reshape(Tzm,nx,ny);
% Remove DC-term to remove "acceleration" of grid
Tzp = fft2(Tzp);
Txp = fft2(Txp);
Typ = fft2(Typ);
Tzm = fft2(Tzm);
Txm = fft2(Txm);
Tym = fft2(Tym);
Txp(1,1) = 0;
Typ(1,1) = 0;
Tzp(1,1) = 0;
Txm(1,1) = 0;
Tym(1,1) = 0;
Tzm(1,1) = 0;
Tzp = real(ifft2(Tzp));
Txp = real(ifft2(Txp));
Typ = real(ifft2(Typ));
Tzm = real(ifft2(Tzm));
Txm = real(ifft2(Txm));
Tym = real(ifft2(Tym));
Txp = gather(Txp);
Typ = gather(Typ);
Tzp = gather(Tzp);
Txm = gather(Txm);
Tym = gather(Tym);
Tzm = gather(Tzm);
end
This looks more like a job for CPU plus PARFOR, than the GPU. There appears to be a lot you can do to optimize your for-loop, though. Things like this
Diff=x2-x1;
oneoverL=1./sqrt(sum(Diff.*Diff,2));
t=Diff.*[oneoverL oneoverL oneoverL];
and this
m4p=0.25 * MU / pi;
m8p=0.5 * m4p;
m4pn=m4p / ( 1 - NU );
mn4pn=m4pn*NU;
%a2m4pn=a2 * m4pn;
a2m8p=a2 * m8p;
onev=ones(size(x1,1),1);
zerov=zeros(size(x1,1),1);
eyesix=[onev onev onev zerov zerov zerov];
txb=[ t(:,2).*b(:,3)-t(:,3).*b(:,2) , t(:,3).*b(:,1)-t(:,1).*b(:,3) , t(:,1).*b(:,2)-t(:,2).*b(:,1)];
seem to be the same in every iteration of the loop, independently of i, and therefore could be pre-computed. I don't know why you do things like this
Ra3inv=Rainv.*Rainv.*Rainv;
instead of
Ra3inv=Rainv.^3;
It also looks like you could benefit from BSXFUN. Duplicating data like here
x_temp=ones(sx1,1)*x(i,:); %vectorize
R=(x_temp-x1);
has not been necessary since R2008. Instead, you would do
R=bsxfun(@minus, x(i,:),x1);
Also, the use of FFTs looks like massively inefficient overkill, here. Instead of
Tzp = fft2(Tzp);
Tzp(1,1) = 0;
Tzp = real(ifft2(Tzp));
You could just do
Tzp = Tzp-sum(Tzp(:));
After some thinking, I'm not sure why you need a for-loop here at all. Most of your operations are just adds,subtracts, divides, etc... all of which are abundantly do-able using vectorized statements.
Hello,
Thanks for the comments. I appreciate you taking time to look at my code!
I agree that a lot of things in the i=1:sx loop can be taken out and pre-computed, I had quickly written this from a C-implementation which is slightly different; however, I am not sure why you think I do not need the for-loop. The variable "Stress" is already a vector. What I am doing is, for each grid point, adding various stress contributions (which are in the form of a vector) into "Stress_total".
Hence, why I thought of assigning a node of the 2D grid to a thread in the GPU. A structured grid problem seems a fairly good thing to parallelize on a GPU, no?
I disagree on certain things:
>> array=randn(2500);
>> tic; array.^3; toc;
Elapsed time is 0.000217 seconds.
>> tic; array.*array.*array; toc;
Elapsed time is 0.000013 seconds.
"Unfolding" the power of three makes it faster.
Secondly, using BSXFUN makes it go slower for me (also because I use x_temp later on).
Thirdly, I guess you are right I can avoid doing FFTs, but I think it should be mean() not sum().
Thanks again!!
Regards,
F
however, I am not sure why you think I do not need the for-loop. The variable "Stress" is already a vector. What I am doing is, for each grid point, adding various stress contributions (which are in the form of a vector) into "Stress_total".
That doesn't automatically mean you need a for-loop. If I have an MxNxP array StressVectors, where N is the number of grid points and P is the number of vectors at each grid point, I can just do
Stress_total=sum(StressVectors,3);
If I have different numbers of stress vectors at each grid point, I just set some of them to zero. MATLAB will parallelize the call to sum() using its natural built-in multi-threading or, if you make StressVectors a gpuArray variable, it will leverage the GPU to do it.
Thirdly, I guess you are right I can avoid doing FFTs, but I think it should be mean() not sum().
Not based on what you had written. The DC component of MATLAB's fft(x) is the sum over the x(i), e.g.,
>> y=fft(1:5); y(1)-sum(1:5)
ans =
0
OK.
I'll try vectorizing it that way and also see if I can make the CUDA C kernel file communicate with Matlab and see which is fastest.
Thanks a lot!
F

Sign in to comment.

More Answers (0)

Asked:

on 28 Jul 2013

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!