Preallocation of Composites using spmd

I'm seeing dramatically non-linear execution times with the test code below, where I assign a different GPU to each of up to 4 spmd workers. (Yes, I do have the hardware.) I then do some work on each worker and time it over 10 trials.
Note the clear line within the trial loop but outside the spmd block.
If that clear is included, the trial_time values make sense. If it is not included, they do not.
As an example, when N_gpus = 2, running with the clear produces trial_time values in a narrow range of 0.0938 to 0.1111, but without the clear I get 0.3884 0.0915 6.4601 15.2599 15.2746 15.2792 15.2892 15.2900.
I'm left pondering that if this were not spmd code I would find a way to preallocate the data, but I'm not sure how to do that with Composites in this case.
Ideas and explanations are welcome.
for N_gpus = 1:4
    poolobj = gcp('nocreate');   % if no pool exists, do not create a new one
    if isempty(poolobj)
        poolobj = parpool(N_gpus);
    end
    poolsize = poolobj.NumWorkers;
    for trial = 1:10
        spmd(N_gpus)
            g = gpuDevice();
        end
        tic
        spmd(N_gpus)
            for m = 1:50
                A = rand(5000, 5000, 'gpuArray');
                B = rand(5000, 5000, 'gpuArray');
                C = A * B;
                max_C = max(C);
            end
        end
        clear A B C;   %% THIS IS THE INTERESTING LINE
        trial_time(trial) = toc;
    end
    tt = mean(trial_time(1:10));
    fprintf('N=%d time=%6.3f \n', N_gpus, tt);
    poolobj = gcp('nocreate');
    delete(poolobj);
end

 Accepted Answer

It seems like this is just an issue of timing and synchronisation. You can see this by adding a call to wait(g) at the end of your spmd block, which eliminates the dependency on the use of clear.
Basically, if you don't call clear, the first call to rand in each trial doesn't find enough pooled memory, so it has to do a raw allocation. As it turns out, it could have freed the memory currently being used by A, but MATLAB doesn't know the assignment isn't going to error, so it has to create the new array first in case A needs to be left unchanged. (This wouldn't be true if your entire script were inside a function, since A wouldn't have to be preserved if there were an error.)
A raw allocation forces the device to synchronise. But when you do call clear, the memory for A, B and C is returned to the pool, so the next trial needs no raw allocation and therefore no synchronisation. The loop happily continues, queuing up 300 or so kernels, exiting the spmd block, and recording the time on the client long before any of those kernels have actually finished.
So when you don't call clear you're usually measuring roughly the actual time of the previous trial, and when you do call clear you're recording completely the wrong time, since the computations haven't finished yet.
Depending on how much GPU memory is needed, how much is available when the code is called, how much is already pooled due to earlier operations (by default MATLAB pools memory up to a quarter of device memory), and whether or not you're inside a function, your timing will give different results. Your best bet for realistic timings is gputimeit, or, if you must, tic and toc in conjunction with wait. However, the pool will always create confusion here, because you don't necessarily know when raw allocations (which are costly even ignoring synchronisation) are going to happen.
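For instance, here is a sketch of the tic/toc-plus-wait variant of the trial loop from the question (a sketch, not a definitive benchmark; the sizes and variable names are taken from the question's code). The wait makes the measured time independent of whether clear is present:

```matlab
tic
spmd(N_gpus)
    for m = 1:50
        A = rand(5000, 5000, 'gpuArray');
        B = rand(5000, 5000, 'gpuArray');
        C = A * B;
        max_C = max(C);
    end
    wait(gpuDevice);   % block this worker until all queued kernels finish
end
trial_time(trial) = toc;   % now measures completed GPU work, clear or no clear
```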

4 Comments

What a great answer. Lots to chew on. I've certainly been exposed to something I wasn't aware of. Thank you.
So the time values recorded with the clear statement in place are consistent, but they are consistent trash. The values in max_C also aren't useful at the end of either the m loop or the trial loop.
The timing values with a wait statement placed after the end of the m loop are consistent, but at the upper end of what I saw before. The actual values in max_C should be good.
Answers always beget questions.
Taking your advice, I took a stab at modifying the code to use gputimeit several ways and consistently ended up with "An anonymous function cannot be defined in an spmd block." That makes sense.
"the pool will always create confusion here because you don't necessarily know when raw allocations are going to happen."
Ouch. I'm guessing that if you know your hardware in advance you can tune your code to minimize raw allocations, but... Ouch.
Thank you again.
Most users want MATLAB to try to minimize allocation costs wherever possible, by whatever means. If you are not that kind of user, you can use
feature('GpuAllocPoolSizeKb', 0);
to turn off pooling (or to control the size of the pool). With this command you can force every array creation to cause a raw allocation. For some people this is what they want, at least to help them determine what's going on.
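A hedged sketch of using that switch (the restore value is illustrative, and I'm assuming from the name that the argument is a size in kilobytes):

```matlab
feature('GpuAllocPoolSizeKb', 0);   % disable pooling: every gpuArray creation raw-allocates
A = rand(5000, 'gpuArray');         % this allocation now goes straight to the device
clear A                             % memory is returned to the device, not to a pool
feature('GpuAllocPoolSizeKb', 256*1024);   % or set an explicit pool size, e.g. 256 MB
```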
Yes, you can't define anonymous functions inside SPMD, but you can define them OUTSIDE SPMD. You can also create a normal function or nested function that takes no arguments and use it with gputimeit.
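A minimal sketch of that workaround, using the sizes from the question (gputimeit takes a function handle with no inputs; calling it inside spmd times each worker's own GPU):

```matlab
% Define the operation OUTSIDE spmd, as a zero-argument anonymous function.
f = @() max(rand(5000, 5000, 'gpuArray') * rand(5000, 5000, 'gpuArray'));

spmd(N_gpus)
    g = gpuDevice();    % each worker selects its own GPU
    t = gputimeit(f);   % gputimeit handles synchronisation and averaging itself
end

per_worker_times = [t{:}];   % t is a Composite; index it on the client
```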
Joss Knight on 24 May 2017
Edited: Joss Knight on 24 May 2017
By the way, the values in max_C are fine. Asynchronous execution NEVER means you get wrong answers. If you ever ask to see, copy, or operate on the results of an operation, it will ensure that operation is finished before doing that (e.g. it won't display max_C without finishing computing max_C).
Good to know!

