Preallocation of Composites using spmd

I'm seeing dramatically non-linear execution times with the test code below, where I assign a different GPU to each of up to 4 spmd workers. (Yes, I do have the hardware.) I then do some work on each worker and time it over 10 trials.
Note the clear line within the trial loop but outside the spmd block.
If that clear is included, the trial_time values make sense. If it is not included, they do not.
As an example, when N_gpus = 2, running with the clear produces trial_time values in a narrow range of 0.0938 to 0.1111, but without the clear I get 0.3884 0.0915 6.4601 15.2599 15.2746 15.2792 15.2892 15.2900.
I'm left pondering that if this were not spmd code I would find a way to preallocate the data, but I'm not sure how to do that with Composites in this case.
Ideas and explanations are welcome.
for N_gpus = 1:4
    poolobj = gcp('nocreate');   % if no pool exists, do not create a new one
    if isempty(poolobj)
        poolobj = parpool(N_gpus);
    end
    poolsize = poolobj.NumWorkers;
    for trial = 1:10
        spmd(N_gpus)
            g = gpuDevice();
        end
        tic
        spmd(N_gpus)
            for m = 1:50
                A = rand(5000, 5000, 'gpuArray');
                B = rand(5000, 5000, 'gpuArray');
                C = A * B;
                max_C = max(C);
            end
        end
        clear A B C;   %% THIS IS THE INTERESTING LINE
        trial_time(trial) = toc;
    end
    tt = mean(trial_time(1:10));
    fprintf('N=%d time=%6.3f \n', N_gpus, tt);
    poolobj = gcp('nocreate');
    delete(poolobj);
end

 Accepted Answer

It seems like this is just an issue of timing and synchronisation. You can see this by adding a call to wait(g) at the end of your spmd block, which eliminates the dependency on the use of clear.
Basically, if you don't call clear, the first call to rand in each trial doesn't find enough pooled memory, so it has to do a raw allocation. As it turns out, it could have freed the memory currently being used by A, but MATLAB doesn't know the assignment isn't going to error, so it has to create the new array first in case A needs to be left unchanged. (This wouldn't be true if your entire script were inside a function, since A wouldn't have to be preserved if there were an error.)
A raw allocation forces the device to synchronise. But when you do call clear, the memory for A, B and C is returned to the pool, so the next trial needs no raw allocation and therefore no synchronisation. The loop happily continues, queuing up 300 or so kernels, exiting the spmd block, and recording the time on the client long before any of those kernels have actually finished.
So when you don't call clear you're usually measuring roughly the actual time of the previous trial, and when you do call clear you're recording completely the wrong time, since the computations haven't finished yet.
Depending on how much GPU memory is needed, how much is available when the code is called, how much is already pooled due to earlier operations (by default MATLAB pools memory up to a quarter of device memory), and whether or not you're inside a function, your timing will give different results. Your best bet for realistic timings is gputimeit, or, if you must, tic and toc in conjunction with wait. However, the pool will always create confusion here, because you don't necessarily know when raw allocations (which are costly even ignoring synchronisation) are going to happen.
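For instance, here is a sketch of the tic/toc-plus-wait variant of the trial loop from the question (a sketch, not a definitive benchmark; the sizes and variable names are taken from the question's code). The wait makes the measured time independent of whether clear is present:

```matlab
tic
spmd(N_gpus)
    for m = 1:50
        A = rand(5000, 5000, 'gpuArray');
        B = rand(5000, 5000, 'gpuArray');
        C = A * B;
        max_C = max(C);
    end
    wait(gpuDevice);   % block this worker until all queued kernels finish
end
trial_time(trial) = toc;   % now measures completed GPU work, clear or no clear
```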

4 Comments

What a great answer. Lots to chew on. I've certainly been exposed to something I wasn't aware of. Thank you.
So the time values recorded with the clear statement in place are consistent, but they are consistent trash. The values in max_C also aren't useful at the end of either the m loop or the trial loop.
The timing values with a wait statement placed after the end of the m loop are consistent, but at the upper end of what I saw before. The actual values in max_C should be good.
Answers always beget questions.
Taking your advice, I took a stab at modifying the code to use gputimeit several ways and consistently ended up with "An anonymous function cannot be defined in an spmd block." That makes sense.
"the pool will always create confusion here because you don't necessarily know when raw allocations are going to happen."
Ouch. I'm guessing that if you know your hardware in advance you can tune your code to minimize raw allocations, but... Ouch.
Thank you again.
Most users want MATLAB to try to minimize allocation costs wherever possible, by whatever means. If you are not that kind of user, you can use
feature('GpuAllocPoolSizeKb', 0);
to turn off pooling (or to control the size of the pool). With this command you can force every array creation to cause a raw allocation. For some people this is what they want, at least to help them determine what's going on.
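A hedged sketch of using that switch (the restore value is illustrative, and I'm assuming from the name that the argument is a size in kilobytes):

```matlab
feature('GpuAllocPoolSizeKb', 0);   % disable pooling: every gpuArray creation raw-allocates
A = rand(5000, 'gpuArray');         % this allocation now goes straight to the device
clear A                             % memory is returned to the device, not to a pool
feature('GpuAllocPoolSizeKb', 256*1024);   % or set an explicit pool size, e.g. 256 MB
```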
Yes, you can't define anonymous functions inside SPMD, but you can define them OUTSIDE SPMD. You can also create a normal function or nested function that takes no arguments and use it with gputimeit.
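A minimal sketch of that workaround, using the sizes from the question (gputimeit takes a function handle with no inputs; calling it inside spmd times each worker's own GPU):

```matlab
% Define the operation OUTSIDE spmd, as a zero-argument anonymous function.
f = @() max(rand(5000, 5000, 'gpuArray') * rand(5000, 5000, 'gpuArray'));

spmd(N_gpus)
    g = gpuDevice();    % each worker selects its own GPU
    t = gputimeit(f);   % gputimeit handles synchronisation and averaging itself
end

per_worker_times = [t{:}];   % t is a Composite; index it on the client
```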
Joss Knight on 24 May 2017
Edited: Joss Knight on 24 May 2017
By the way, the values in max_C are fine. Asynchronous execution NEVER means you get wrong answers. If you ever ask to see, copy, or operate on the results of an operation, it will ensure that operation is finished before doing that (e.g. it won't display max_C without finishing computing max_C).
Good to know!

