Parallel matrix operations on GPU using arrayfun: why is it slower than looping, and/or is there a better method for coding this?
I have been attempting to parallelize a stack of matrix operations performed on the same root matrix on a GPU. An MWE is included below. The basic problem can be summed up using this example:
Let im be some matrix of size [M x N]. Let [c1, c2, ...] be some parameter vectors of length K. The operation can be given by the equation
out = sum over kk = 1:K of f(im, c1(kk), c2(kk), ...),
where f() is some function of the input matrix im and the parameters. The output is the same size as the original input matrix. Since the function is being repeated on the same matrix with different parameters, I should be able to parallelize the function evaluations, even if I have to perform the summation as a subsequent step.
I did this using arrayfun, but in my implementation it is slower than running as a loop. I do not understand why, so I have two primary questions:
- Is my implementation bad? Is there a better way to perform this type of operation?
- Is there a fundamental reason arrayfun isn't outperforming the loop?
This is a stand-in for a much more complicated problem, but it should get me where I need to be. There is a minimal working example below. NOTE: the sum over the cells in the arrayfun portion is not the main speed bottleneck; it's arrayfun itself.
%% MWE of highly parallel operations on GPU
function highParMWE
%% Looped Version
% Let's create a 500x500 pixel image.
im = gpuArray(rand(500,500));
% Let's create two 'corrections' to the image, with 101 different realizations.
corr1 = gpuArray(rand(101,1));
corr2 = gpuArray(rand(101,1));
%
% I want to create an output image the same size as the original, but with
% the following equation applied:
%
% new_im = Sum_{ii=1:101} ((im + corr1(ii)) / corr2(ii)).
%
% So I am applying each correction pair to the original image, then summing
% all of the resulting image. Note, this is a stand in for an application
% with much more complex functions than the above equation. (I realize this
% exact example might be more quickly done with 3-D arrays, but that is not
% what I am going for.)
%
% My first instinct is to use a 'for' loop. However, since these are
% relatively small arrays, I was hoping to parallelize the calculation on
% the GPU. I may be misunderstanding the documentation, but arrayfun seems
% like the right way to do this.
%
% BUT, when I time it, the loop is actually faster.
%
t_loop = gputimeit(@() loopit(im,corr1,corr2));
%
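t_array = gputimeit(@() arrayfunit(im,corr1,corr2));
% Report both timings so the comparison is visible when the MWE is run.
fprintf('loop: %.4g s, arrayfun: %.4g s\n', t_loop, t_array);
%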
% My questions:
%
% (1) Is there a better way to do this?
% (2) Why is arrayfun not significantly faster than looping? Often it is
% slower.
end
%
function imfinal = loopit(im,corr1,corr2)
imfinal = gpuArray.zeros(size(im));
for ii = 1:numel(corr1)
    imfinal = imfinal + calculateCorrImage(im,corr1(ii),corr2(ii));
end
imfinal = gather(imfinal);
end
%
function imfinal = arrayfunit(im,corr1,corr2)
% I have to set 'UniformOutput' to false because I am outputting an image
% matrix that is not the same size as my number of corrections.
jj = 1:numel(corr1);
tmp = arrayfun(@(jj) calculateCorrImage_arr(jj,im,corr1,corr2), jj, 'UniformOutput', false);
% arrayfun has returned a cell array, which I still have to sum over.
imfinal = gpuArray.zeros(size(im));
for ii = 1:numel(corr1)
    imfinal = imfinal + tmp{ii};
end
imfinal = gather(imfinal);
%
end
%
function out = calculateCorrImage(im,corr1,corr2)
out = (im+corr1)/corr2;
end
%
function out = calculateCorrImage_arr(ii,im,corr1,corr2)
out = (im+corr1(ii))/corr2(ii);
end
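As a cross-check on the MWE above, here is a minimal sketch that compares the loop against a vectorized form of the same stand-in equation. It runs on the CPU as written, on smaller data than the MWE, and the variable names (`refSum`, `vecSum`, `algSum`, `c1`, `c2`) are illustrative only, not part of the original code. Two caveats: the question says 3-D arrays are not the goal for the real application, so this is a correctness reference rather than a proposed solution, and the algebraic shortcut at the end holds only for this particular f. It may also be relevant that in `arrayfunit` the index vector `jj` is an ordinary CPU array, so the call most likely dispatches to the standard `arrayfun` (a serial loop that also builds a cell array) rather than the gpuArray method; that is a probable explanation, not a confirmed diagnosis.

```matlab
% Sketch only: runs on the CPU as written. Wrapping im, corr1 and corr2 in
% gpuArray(...) should run the same operations on a GPU (assumption: Parallel
% Computing Toolbox is installed and a supported GPU is present).
rng(0);                         % reproducible stand-in data
im    = rand(50, 40);           % smaller than the 500x500 image, for a quick check
corr1 = rand(101, 1);
corr2 = rand(101, 1) + 0.5;     % offset keeps the divisors away from zero

% Reference: the loop from the MWE, one correction pair per iteration.
refSum = zeros(size(im), 'like', im);
for kk = 1:numel(corr1)
    refSum = refSum + (im + corr1(kk)) / corr2(kk);
end

% Vectorized form: place the K corrections along a third dimension so that
% implicit expansion (R2016b or later) builds an M-by-N-by-K stack in one
% expression, then sum along that dimension.
c1 = reshape(corr1, 1, 1, []);
c2 = reshape(corr2, 1, 1, []);
vecSum = sum((im + c1) ./ c2, 3);

% For this specific stand-in the sum also collapses algebraically:
% sum_k (im + c1_k)/c2_k = im*sum(1./c2) + sum(c1./c2).
algSum = im * sum(1 ./ corr2) + sum(corr1 ./ corr2);

maxErrVec = max(abs(vecSum(:) - refSum(:)));
maxErrAlg = max(abs(algSum(:) - refSum(:)));
fprintf('max |vectorized - loop| = %.3g\n', maxErrVec);
fprintf('max |algebraic  - loop| = %.3g\n', maxErrAlg);
```

The vectorized form trades memory for a single large operation: at the MWE's sizes the intermediate 500x500x101 stack of doubles takes roughly 200 MB, so for larger problems the corrections may need to be processed in chunks.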
Thanks again all!