GPU optimization of looped vector operations

I think I am making a simple mistake. I am comparing a vectorized integration on the GPU and on the CPU. This code takes ~365 s on the GPU (NVIDIA Quadro P5200) and 96 s on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straightforward operation on vectors 90,000 elements long in this example, repeated 90,000 times in the loop. A test multiplication of two 10,000x10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU. In the example below, however, the GPU is slower even though the arrays are large. It seems as though the loop introduces a lot of overhead in the GPU operations.
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
try
canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
canUseGPU = false;
end
%canUseGPU = false;
if canUseGPU
%integral using GPU
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
for i = 1:len
gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
dN_KK =gather(gdN_KK);
else
%integral using CPU (parfor)
parfor i = 1:len
dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
end
end
% Scales data
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc
Joss Knight on 29 Aug 2019
Lloyd, have you tried running your computation in single precision? Your Quadro P5200 has respectable single precision performance of about 8 or 9 teraflops, but like most graphics cards except special ones, its double precision performance is a small fraction of that at about 280 gigaflops (figures from the Wikipedia page where NVIDIA post their specs). This is why you're not getting much better matrix multiply performance out of your GPU than your CPU - this would be dramatically different in single. It is perfectly normal for an algorithm to run faster on the CPU than on one of these graphics-focussed cards, especially if it is an algorithm with a lot of unvectorized loops and a multiprocess parallelization.
Lloyd Bumm on 6 Sep 2019
I noted the single-precision times in the benchmarks below. The effect is only a factor of 2. However, this is not the sort of problem that should be done in single precision.


Accepted Answer

Matt J on 28 Aug 2019
Edited: Matt J on 6 Sep 2019
This modification uses mat2tiles from the File Exchange to help divide the computation into bigger, vectorized chunks.
It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions x([1:i-1, i+1:end]). Those are costly.
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize=1000;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
wait(gpuDevice) % ensure all GPU work has finished before toc
toc %Elapsed time is 2.027665 seconds.
Matt J on 6 Sep 2019
Edited: Matt J on 6 Sep 2019
Okay, I did make a few fixes. To be sure we're on the same page, I'm sharing the test code below; I see strong agreement between the two versions.
nubar_low = 2450;
nubar_high = 2451;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
%%%% ORIGINAL %%%%%
for i = 1:len
gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
version1 = gdN_KK ;
%%%% OPTIMIZED %%%%%%
chunksize=5;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
%wait(gd)
toc %Elapsed time is 2.027665 seconds.
version2 = gdN_KK ;
plot(1:len,version1,'-',1:len,version2,'x'); legend('Lloyd','Matt')
Lloyd Bumm on 6 Sep 2019
At first I didn't recognize what was going on because you had the interval set to one wavenumber (2450-2451), far away from the absorption (2800). I changed the interval back to 2450-3350, increased the chunksize to 1000, scaled it properly, and compared it to my 1D vectorized CPU code (a fairer comparison). It is spot on now.
LB CPU 1D vector optimized: 4.036714 seconds
MJ GPU mat2tiles optimized: 2.554221 seconds
I'll need to figure out if I can implement your solution when the integrals are being evaluated at a subset of the points in the integration (a rough sketch of one possibility follows the code below).
[Attached image: MJ_LB_compare.2019_09_06.png]
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
%%%% 1D vector optimized %%%%%
tic;
part_a = nu_bar .* K;
part_b = nu_bar .^2;
for i = 1:len
part_c = part_a ./ (part_b - part_b(i));
part_c(i) = 0;
dN_KK(i) = sum(part_c);
end
dN_KK = (1/(pi*p_density))*(2*dN_KK + K./(2*nu_bar) + dK);
version1 = dN_KK ;
toc %Elapsed time is 4.036714 seconds.
%%%% OPTIMIZED %%%%%%
tic
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize=1000;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = (1/(pi*p_density))*(2*gdN_KK + gK./(2*gnu_bar) + gdK);
%wait(gd)
toc %Elapsed time is 2.554221 seconds.
version2 = gdN_KK ;
figure
plot(nu_bar,version1,nu_bar,version2,'x'); legend('Lloyd','Matt')


More Answers (1)

Lloyd Bumm on 6 Sep 2019
Edited: Lloyd Bumm on 6 Sep 2019
I had some time to get back to this project today. The code below is set up differently from the OP in that the number of points and the range of the Kramers-Kronig integration differ from the range and density of the points at which it is evaluated.
Below I compare the unoptimized method from the OP against an efficient 1D vectorization and a 2D vectorization, using for loops, parfor loops, and the GPU. The primary improvement is the 1D optimization implemented based on the discussions above.
The parameters and timings are in the code, but I will list them here for convenience.
990001 points per integral; 301 integrals
CPU for unoptimized: 19.893800 seconds.
CPU for 1D vector optimized: 0.319634 seconds.
CPU 2D vectorized: 1.318735 seconds.
CPU for 1D vector optimized single precision: 0.079821 seconds.
CPU 2D vectorized single precision: 0.861958 seconds.
CPU parfor unoptimized: 7.018130 seconds.
CPU parfor 1D vector optimized: 0.736592 seconds.
GPU for un-optimized: 11.348325 seconds.
GPU for 1D vector optimized: 0.540666 seconds.
GPU for 2D vector optimized: 33.338617 seconds.
GPU for un-optimized single precision: 11.187868 seconds.
GPU for 1D vector optimized single precision: 0.471209 seconds.
GPU for 2D vector optimized single precision: 16.836338 seconds.
% nubar_low and nubar_high are the endpoints of the integral in wavenumbers (nu_bar)
% p_density is the number of points per wavenumber
% The K-K integral is evaluated at a subset of wavenumbers (nu_bar_eval)
% on the interval nubar_start to nubar_end, with a point density lower than
% the integral's by the factor eval_sub_density
% The indices in eval_ind must exactly correspond to points in nu_bar
close all
clear all
nubar_low = 100;
nubar_high = 10000;
nubar_start = 2450;
nubar_end = 3350;
p_density = 100; %points per wavenumber for integral
eval_sub_density = 300; % evaluate interval lower than the p_density by this factor
nu_bar = nubar_low:1/p_density:nubar_high; %wavenumber vector for integral
len_nubar = length(nu_bar);
nu_bar_eval = nubar_start:1/(p_density/eval_sub_density):nubar_end; %vector wavenumbers where integral is evaluated
len_eval=length(nu_bar_eval);
bb = (nu_bar >= nubar_start) & (nu_bar <= nubar_end);
ind_start = find(bb,1,'first');
ind_end = find(bb,1,'last');
eval_ind = ind_start:eval_sub_density:ind_end; %indices in nu_bar that correspond to nu_bar_eval
fprintf('%i points per integral; %i integrals\r',len_nubar, len_eval);
n_inf = 0;
%compute K spectrum
K = zeros(size(nu_bar));
k_max = 0.01; %max k
nu_bar_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nu_bar-nu_bar_0).^2 + (gamma/2)^2).^-1 - ((nu_bar+nu_bar_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
dN_KK = zeros(1,len_eval);
%times for len_nubar = 990001; len_nubar_eval = 301
fprintf('\rCPU for unoptimized\r'); % 19.9 s
tic
for i = 1:len_eval
jj = eval_ind(i);
dN_KK(i) = sum(nu_bar([1:jj-1, jj+1:end]) .* K([1:jj-1, jj+1:end]) ./ (nu_bar([1:jj-1, jj+1:end]).^2 - nu_bar(jj).^2));
dN_KK(i) = 2*dN_KK(i) + K(jj)./(2*nu_bar(jj)) + dK(jj);
end
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
figure
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU for 1D vector optimized\r'); % 0.32 s
tic
part_a = nu_bar .* K;
part_b = nu_bar .^2;
for i = 1:len_eval
jj = eval_ind(i);
part_c = part_a ./ (part_b - part_b(jj));
part_c(jj) = 0; %set singular points to zero before sum
dN_KK(i) = sum(part_c);
end
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU 2D vectorized\r'); % 1.32 s
%test for sufficient available memory
mem = memory;
mem_avail = mem.MemAvailableAllArrays;
mem_need = len_nubar*len_eval*8*4;
if mem_need > mem_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_avail);
proceed = true;
end
% proceed if sufficient available memory
if proceed
tic
R_temp = repmat(nu_bar.^2,len_eval,1); % conserve memory: reuse this variable as a temp
R_nubar_K = repmat(nu_bar.*K,len_eval,1);
R_nu_bar_eval_sq = repmat(nu_bar_eval'.^2,1,len_nubar);
R_temp = (R_nubar_K) ./ (R_temp - R_nu_bar_eval_sq);
for i = 1:len_eval
R_temp(i,eval_ind(i)) = 0; %set singular points to zero before sum
end
dN_KK = sum(R_temp,2)';
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
clear R_temp R_nubar_K R_nu_bar_eval_sq; %dump big arrays
plot(nu_bar_eval,dN_KK);
hold on
end
%%%%%%%%%%%%%%%%%%%%%%%%%% single precision %%%%%%%%%%%%%%%%
fprintf('\rCPU for 1D vector optimized single precision\r'); % 0.8 s
tic
snu_bar = single(nu_bar);
sK = single(K);
sdK = single(dK);
snu_bar_eval = single(nu_bar_eval);
seval_ind = single(eval_ind);
sdN_KK = single(dN_KK);
spart_a = snu_bar .* sK;
spart_b = snu_bar .^2;
for i = 1:len_eval
jj = seval_ind(i);
spart_c = spart_a ./ (spart_b - spart_b(jj));
spart_c(jj) = 0; %set singular points to zero before sum
sdN_KK(i) = sum(spart_c);
end
dN_KK = double(2*sdN_KK + sK(seval_ind)./(2*snu_bar_eval) + sdK(seval_ind));
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU 2D vectorized single precision\r'); % 0.86 s
%test for sufficient available memory
mem = memory;
mem_avail = mem.MemAvailableAllArrays;
mem_need = len_nubar*len_eval*4*4;
if mem_need > mem_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_avail);
proceed = true;
end
% proceed if sufficient available memory
if proceed
tic
sR_temp = repmat(single(nu_bar).^2,len_eval,1); % conserve memory: reuse this variable as a temp
sR_nubar_K = repmat(single(nu_bar).*K,len_eval,1);
sR_nu_bar_eval_sq = repmat(single(nu_bar_eval)'.^2,1,len_nubar);
seval_ind = single(eval_ind);
sR_temp = (sR_nubar_K) ./ (sR_temp - sR_nu_bar_eval_sq);
for i = 1:len_eval
sR_temp(i,seval_ind(i)) = 0; %set singular points to zero before sum
end
dN_KK = double(sum(sR_temp,2)');
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
clear sR_temp sR_nubar_K sR_nu_bar_eval_sq; %dump big arrays
plot(nu_bar_eval,dN_KK);
hold on
end
%^^^^^^^^^^^^^^^^^^^^ end single precision ^^^^^^^^^^^^^^^^^^
fprintf('\rCPU parfor unoptimized\r'); % 7.02 s
poolobj = gcp;
tic
parfor i = 1:len_eval
jj = eval_ind(i);
dN_KK(i) = sum(nu_bar([1:jj-1, jj+1:end]) .* K([1:jj-1, jj+1:end]) ./ (nu_bar([1:jj-1, jj+1:end]).^2 - nu_bar(jj).^2));
dN_KK(i) = 2*dN_KK(i) + K(jj)./(2*nu_bar(jj)) + dK(jj);
end
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU parfor 1D vector optimized\r'); % 0.74 s
poolobj = gcp;
tic
part_a = nu_bar .* K;
part_b = nu_bar .^2;
parfor i = 1:len_eval
jj = eval_ind(i);
part_c = part_a ./ (part_b - part_b(jj));
part_c(jj) = 0; %set singular points to zero before sum
dN_KK(i) = sum(part_c);
end
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
% Test for presence of GPU
try
canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
canUseGPU = false;
end
if canUseGPU
fprintf('\rGPU for un-optimized\r'); % 11.35 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
for i = 1:len_eval
jj = eval_ind(i);
gdN_KK(i) = sum(gnu_bar([1:jj-1, jj+1:end]) .* gK([1:jj-1, jj+1:end]) ./ (gnu_bar([1:jj-1, jj+1:end]).^2 - gnu_bar(jj).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(jj)./(2*gnu_bar(jj)) + gdK(jj);
end
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =gather(gdN_KK);
wait(gGPU);
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 1D vector optimized\r'); % 0.54 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gnu_bar_eval = gpuArray(nu_bar_eval);
gdN_KK = gpuArray(dN_KK);
geval_ind = gpuArray(eval_ind);
gpart_a = gnu_bar .* gK;
gpart_b = gnu_bar .^2;
for i = 1:len_eval
jj = geval_ind(i);
gpart_c = gpart_a ./ (gpart_b - gpart_b(jj));
gpart_c(jj) = 0; %set singular points to zero before sum
gdN_KK(i)= sum(gpart_c);
end
gdN_KK = 2*gdN_KK + gK(geval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =gather(gdN_KK);
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 2D vector optimized\r'); % 33.34 s
gGPU = gpuDevice(1);
reset(gGPU);
%test for sufficient memory
mem_GPU_avail = gGPU.AvailableMemory;
mem_need = len_nubar*len_eval*8*5;
if mem_need > mem_GPU_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_GPU_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_GPU_avail);
proceed = true;
end
%proceed if sufficient memory is available
if proceed
tic
gnu_bar = gpuArray(nu_bar);
gnu_bar_eval = gpuArray(nu_bar_eval);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
geval_ind = gpuArray(eval_ind);
R_gtemp = repmat(gnu_bar.^2,len_eval,1);
R_gnubar_K = repmat(gnu_bar.*gK,len_eval,1);
R_gnu_bar_eval_sq = repmat(gnu_bar_eval'.^2,1,len_nubar);
R_gtemp = (R_gnubar_K) ./ (R_gtemp - R_gnu_bar_eval_sq);
for i = 1:len_eval
R_gtemp(i,eval_ind(i)) = 0; %set singular points to zero before sum
end
gdN_KK = sum(R_gtemp,2)';
gdN_KK = 2*gdN_KK + gK(eval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =gather(gdN_KK);
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
end
%%%%%%%%%%%%% single precision tests
fprintf('\rGPU for un-optimized single precision\r'); % 11.19 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(single(nu_bar));
gK = gpuArray(single(K));
gdK = gpuArray(single(dK));
gdN_KK = gpuArray(single(dN_KK));
for i = 1:len_eval
jj = eval_ind(i);
gdN_KK(i) = sum(gnu_bar([1:jj-1, jj+1:end]) .* gK([1:jj-1, jj+1:end]) ./ (gnu_bar([1:jj-1, jj+1:end]).^2 - gnu_bar(jj).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(jj)./(2*gnu_bar(jj)) + gdK(jj);
end
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =double(gather(gdN_KK));
wait(gGPU);
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 1D vector optimized single precision\r'); % 0.47 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(single(nu_bar));
gK = gpuArray(single(K));
gdK = gpuArray(single(dK));
gdN_KK = gpuArray(single(dN_KK));
gnu_bar_eval = gpuArray(single(nu_bar_eval));
geval_ind = gpuArray(single(eval_ind));
gpart_a = gnu_bar .* gK;
gpart_b = gnu_bar .^2;
for i = 1:len_eval
jj = geval_ind(i);
gpart_c = gpart_a ./ (gpart_b - gpart_b(jj));
gpart_c(jj) = 0; %set singular points to zero before sum
gdN_KK(i)= sum(gpart_c);
end
gdN_KK = 2*gdN_KK + gK(geval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =double(gather(gdN_KK));
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 2D vector optimized single precision\r'); % 16.84 s
gGPU = gpuDevice(1);
reset(gGPU);
%test for sufficient memory
mem_GPU_avail = gGPU.AvailableMemory;
mem_need = len_nubar*len_eval*4*5;
if mem_need > mem_GPU_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_GPU_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_GPU_avail);
proceed = true;
end
%proceed if sufficient memory is available
if proceed
tic
gnu_bar = gpuArray(single(nu_bar));
gK = gpuArray(single(K));
gdK = gpuArray(single(dK));
gdN_KK = gpuArray(single(dN_KK));
gnu_bar_eval = gpuArray(single(nu_bar_eval));
geval_ind = gpuArray(single(eval_ind));
R_gtemp = repmat(gnu_bar.^2,len_eval,1);
R_gnubar_K = repmat(gnu_bar.*gK,len_eval,1);
R_gnu_bar_eval_sq = repmat(gnu_bar_eval'.^2,1,len_nubar);
R_gtemp = (R_gnubar_K) ./ (R_gtemp - R_gnu_bar_eval_sq);
for i = 1:len_eval
R_gtemp(i,eval_ind(i)) = 0; %set singular points to zero before sum
end
gdN_KK = sum(R_gtemp,2)';
gdN_KK = 2*gdN_KK + gK(eval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =double(gather(gdN_KK));
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
end
end
Lloyd Bumm on 6 Sep 2019
I edited the above results to include single precision on the CPU and the GPU.
Joss Knight on 6 Sep 2019
Thanks. The only explanation for that is that your cost is all overhead on the GPU, and not computation.
