Have I reached the limitations of GPU processing?

I'm working on a project that requires massive, repetitive use of a single function. spmd would be an ideal solution to my problem, but I only have one graphics card in a single computer.
I'm using a trial version of the Parallel Computing Toolbox, and I want to make sure that I'm not overlooking any optimization techniques before concluding that a GPU, and the toolbox, won't be suitable for my needs.
The sample function performs an operation on two arguments, and produces a single output.
C = operation(A,B);
where A, B, and C are all simple numerical arrays of the same type, with the same dimensions and the same number of elements.
The "operation" function works as follows:
function C = operation(A,B)
    Fixed_Size = 100;
    V{1} = [3 5; 2 8; 10 7];            % Three rows
    V{2} = [8 3];                       % One row
    V{3} = [23 2; 19 4; 16 5; 34 9];    % Four rows
    .
    .
    .
    V{Fixed_Size} = [73 28; 45 22];     % Two rows
    C = zeros(1,Fixed_Size);
    for n = 1:Fixed_Size
        part1 = A(V{n}(:,1)) .* B(V{n}(:,2));  % element products for section n
        C(n) = rem(sum(part1),19);             % dot product, mod 19
    end
end
I've tried eliminating the loop with more vectorization, but this slows things down. That's because the matrices in V have varying numbers of rows. To get rid of the loop, and the cell array, I would have to pad the rows so that they're all the same size, which would let "V" be a single matrix. However, this approach is slower because there are more elements to work with.
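For reference, the padded version I tried looks roughly like this (a sketch; the names Vpad and Mask are my own, and I'm assuming the padding uses index 1 with a zero-valued mask so the padded rows contribute nothing to the sums):

```matlab
% Pad every V{n} out to MaxRows rows so V can become one 3-D array.
MaxRows = max(cellfun(@(x) size(x,1), V));
Vpad = ones(MaxRows, 2, Fixed_Size);    % padded index array (dummy index 1)
Mask = zeros(MaxRows, 1, Fixed_Size);   % 1 for real rows, 0 for padding
for n = 1:Fixed_Size
    r = size(V{n},1);
    Vpad(1:r,:,n) = V{n};
    Mask(1:r,1,n) = 1;
end
% One shot over all sections; padded rows are zeroed out by the mask.
Prod = A(Vpad(:,1,:)) .* B(Vpad(:,2,:)) .* Mask;
C = rem(squeeze(sum(Prod,1)).', 19);
```

Even though this removes the loop, the padding adds many extra multiplies, which is why it ended up slower.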
The "operation" function is used in the following way, without the GPU:
tic;
C1 = operation(A1,B1);
C2 = operation(A2,B2);
F = operation(C1,C2);
toc;
Here's the fastest way that I can figure to achieve this using the GPU:
% first the inputs are put into the GPU.
% The time taken to do this isn't included in the time comparison.
A1 = gpuArray(A1);
A2 = gpuArray(A2);
B1 = gpuArray(B1);
B2 = gpuArray(B2);
tic;
F = operation(operation(A1,B1), operation(A2,B2));
toc;
In practice, there are WAY more inputs than just two pairs. Instead of 2 "A" inputs, and 2 "B" inputs, imagine having nearly 200,000. With that many inputs, there will obviously be limits to how many pairs can be handled in a single line for the GPU. But the concept remains the same -- I'm using the GPU to process the "operation" function simultaneously.
Using the GPU in this way is still a lot slower than the first method, which doesn't use the GPU. The only real benefit I've noticed is in scaling: as I increase the number of inputs, the CPU time grows linearly, as expected, but the GPU time doesn't. In one case, doubling the number of inputs increased the GPU time by only about 50%, instead of 100%. Once the number of inputs got considerably large, though, the GPU's time increase became linear as well.
Is this the best that I can expect from a GPU, or am I missing something? I'd hate to conclude that this setup is a failure, only to find that it would have succeeded with a properly used technique. I'm new to GPU computing, so I've only tried what makes sense. Is there some trick to unleashing the real power, or is this the best kind of performance that I could expect for such an algorithm?
Thanks for your time, and for any input. I apologize for such a long question, but I wanted to be thorough.
*Please note: In this sample "operation" function, I've shown "V" as being declared inside the function. That's not how it's set up in practice. In practice, V is a global cell array that the function has access to, which saves the function from having to declare "V" every time it is called.

 Accepted Answer

What you've done is unlikely to work well on the GPU: you are not doing enough work inside the loop.
I will assume for the purposes of this answer that you cannot reorganize your index data in a more efficient way manually. In that case, I suggest vectorizing by concatenating all your indices and creating a single long vector of values from A and B. Keep track of where the last value is in each section and then use that to index the data later to get your dot products:
VMat = cell2mat(V');
LastRowIndex = cumsum(cellfun(@(x)size(x,1), V));
ProdAll = A(VMat(:,1)) .* B(VMat(:,2));
DotProdAll = cumsum(ProdAll);
DotProd = DotProdAll(LastRowIndex); % Index into cumsum to get answer
DotProd = [DotProd(1); diff(DotProd)];
C = rem(DotProd,19);
Make A and B gpuArrays and Bob's your Uncle. Only problem here is the potential for overflow with such a big accumulation.
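To avoid redoing the index bookkeeping on every call, here's a sketch of how you might package this (the helper name operation_vec, and passing the precomputed indices as arguments, are my own choices, not part of your original setup):

```matlab
% Precompute once -- the indices never change between calls.
VMat = cell2mat(V');                                 % stack all index pairs
LastRowIndex = cumsum(cellfun(@(x) size(x,1), V));   % end of each section

% Hypothetical vectorized replacement for "operation":
function C = operation_vec(A, B, VMat, LastRowIndex)
    ProdAll = A(VMat(:,1)) .* B(VMat(:,2));  % all element products at once
    DotProdAll = cumsum(ProdAll);            % running sum over everything
    DotProd = DotProdAll(LastRowIndex);      % running sum at section ends
    DotProd = [DotProd(1); diff(DotProd)];   % per-section dot products
    C = rem(DotProd, 19);
end
```

With A and B as gpuArrays, everything inside operation_vec runs on the GPU; call gather() on the final result to bring it back to the CPU.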

4 Comments

Actually there's a way to do it using accumarray that avoids the cumsum over all the data, but it's a bit more complicated:
VMat = cell2mat(V');
LastRowIndex = cumsum(cellfun(@(x)size(x,1), V));
FirstRowIndexMask = zeros(size(VMat,1),1);
FirstRowIndexMask(LastRowIndex(1:end-1)+1) = 1;
SectionIndices = cumsum(FirstRowIndexMask) + 1;
ProdAll = A(VMat(:,1)) .* B(VMat(:,2));
DotProd = accumarray(SectionIndices, ProdAll);
C = rem(DotProd,19);
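A quick way to convince yourself the two versions agree is to compare against your original loop on small random data (a sketch; the test values here are made up):

```matlab
rng(0);                                   % reproducible test data
V = {[3 5; 2 8; 10 7], [8 3], [23 2; 19 4; 16 5; 34 9]};
A = rand(1,40);  B = rand(1,40);          % long enough for the largest index

% Original loop version, as the reference
Cloop = zeros(1, numel(V));
for n = 1:numel(V)
    Cloop(n) = rem(sum(A(V{n}(:,1)) .* B(V{n}(:,2))), 19);
end

% accumarray version from above
VMat = cell2mat(V');
LastRowIndex = cumsum(cellfun(@(x) size(x,1), V));
FirstRowIndexMask = zeros(size(VMat,1),1);
FirstRowIndexMask(LastRowIndex(1:end-1)+1) = 1;
SectionIndices = cumsum(FirstRowIndexMask) + 1;
C = rem(accumarray(SectionIndices, A(VMat(:,1)) .* B(VMat(:,2))), 19);

assert(max(abs(Cloop(:) - C(:))) < 1e-12)
```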
Thanks, Joss. All I can say is "wow"!
Your algorithm is extremely slick and efficient. I wasn't aware of the "cumsum" and "accumarray" functions.
I made one slight variation to your second approach: I redefined "V" from a global cell array to "VMat" by default, and I made "SectionIndices" a global numerical array of the corresponding indices, so it doesn't have to be recalculated each time the function runs.
The good news is that this is an outstanding implementation that never would have occurred to me. I love the idea of concatenating the matrices in "V" so that there's no padding. Excellent idea!
The bad news is that I (now prematurely) returned the graphics card before reading your solution. I tried the algorithm with the CPU, and there's not much of a difference in terms of speed compared to using the loop and cell array. The loop is negligibly faster. I can only imagine what the speed would have been with your solution and the graphics card.
I returned the graphics card with the intent of upgrading my processor to an 8-core processor (it's currently dual-core). I've already noticed a dramatic increase in speed from moving to a modern dual-core processor: my old machine was single-core on a 32-bit system, and with it my program took a little over three days to complete. With the newer system, that time has been cut to about six hours. I'm hoping the 8-core processor will cut it down to an hour and a half.
Thanks, again, for the outstanding algorithm. Learning about those two functions, alone, is a great help. I'll be kicking myself for a while that I returned the graphics card so soon. I'm tempted to buy it again, just to see what it'll do.
Anyway. Thanks again!
You're welcome. Too bad I didn't get there in time.
Vectorizing code is definitely a skill that takes time to acquire. I can recommend this blog article and this page in the MATLAB documentation for a breakdown of the process and an overview of all the various tools available to you.
It might take a bit of time, but learning to write vectorized code is well worth the effort. Vectorization is the most powerful way of writing fast and efficient MATLAB code.
