Initializing GPU on multiple workers cause an unknown error

Question

Igor Varfolomeev on 25 Jun 2018

https://in.mathworks.com/matlabcentral/answers/407188-initializing-gpu-on-multiple-workers-cause-an-unknown-error

Answered: Igor Varfolomeev on 25 Nov 2018

I've noticed that the following simple code results in a weird error if I use R2016b on a machine with two GTX1080Ti and one K2200:

% start a _new_ Matlab instance first!
parpool(16); 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

The error message I get:

Error using parallel.FevalOnAllFuture/fetchOutputs (line 69)
One or more futures resulted in an error.
Caused by:
    Error using parallel.internal.pool.deserialize>@()gather(gpuArray(1))
    An unexpected error occurred during CUDA execution. The CUDA error was:
    unknown error
    <-- repeated multiple times -->

After that, all GPU functionality gets completely broken:

>> a=gpuArray(1)
Error using gpuArray
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error

Even re-starting Matlab won't help. The fix is to clear the CUDA JIT cache folder, "%USERPROFILE%\AppData\Roaming\NVIDIA\ComputeCache".
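A minimal sketch of clearing that cache folder from MATLAB, assuming the default Windows location given above and that no other process currently has the GPU open. The variable name cacheDir is only illustrative. It is meant to be run in a fresh session, before any gpuDevice or gpuArray call:

```matlab
% Assumed default location of the CUDA JIT cache on Windows (see path above).
cacheDir = fullfile(getenv('USERPROFILE'), 'AppData', 'Roaming', ...
                    'NVIDIA', 'ComputeCache');

% exist(...,'dir') returns 7 for folders; isfolder is not available in R2016b.
if exist(cacheDir, 'dir') == 7
    [ok, msg] = rmdir(cacheDir, 's');   % 's' also removes the folder contents
    if ~ok
        warning('Could not remove %s: %s', cacheDir, msg);
    end
end
```

The driver recreates the folder the next time it needs to JIT-compile something, so the first GPU call after this will be slow again.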

However, the following "longer pre-initialization" works OK for me:

% start a _new_ Matlab instance first and clear CUDA JIT cache if there was an error.
gpuDevice(1)
gather(gpuArray(1)) 
parpool(); 
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) ) 
fetchOutputs(parfevalOnAll(@() gather(gpuArray(1)),1))

AFAIU:

Matlab R2016b that I use here, was designed for CUDA 7.5, and there are no binaries for CUDA Compute Capability 6.1.
That's why Matlab uses CUDA JIT to recompile a ton (~400 MB) of stuff when the user calls any GPU-related function for the first time. (This also causes many "gpuDevice() is slow" questions.)
There's something wrong with that JIT, if combined with parpool (a race condition?).

My system is: Windows 10, CUDA 8.0 (cuda_8.0.61_win10) with patch 2 (cuda_8.0.61.2_windows), nvidia driver r384.94. The CUDA_CACHE_MAXSIZE environment variable is set to 2147483647.

My questions:

Is my "longer pre-initialization" workaround actually "safe"? Is it a real workaround for that "race condition"? Or is it only as good as the original (stable on my specific system, but likely to fail on some other)? Assume I have to stay with R2016b for now, targeting CUDA 8.0 and Pascal GPUs (building a DLL).
The same code works OK in R2017b-R2018a and above. Is that just because they don't use CUDA JIT here? Or is the real underlying issue actually fixed? (I don't have a device with compute capability >6.x at hand, so I'm unable to check that.)
R2017a behaves like R2016b here, even though it claims CUDA 8.0 support: it still writes something (but just ~40MB) to the CUDA JIT cache, fails in test #1 and works in test #2.

10 Comments

Igor Varfolomeev on 28 Jun 2018

> However, this may not be your problem because you can't even seem to select the device. Your best place to start is to ensure that your code runs on each of your GPUs. Select each GPU in turn (on your client MATLAB) and ensure you can use it.

The

for i=1:3; gpuDevice(i); gather(gpuArray(1));end

works OK, unless the JIT cache is broken already.

-----------------------------------------------------------------------------------------------

> Secondly, you are trying to use each GPU from multiple processes. Check whether everything works when you only have three workers in your pool - the same as the number of GPUs.

Yep, it's definitely a good idea to test the same thing with 3 or 2 workers. (The thing is, the 3rd device is a K2200, which has a different CUDA Compute Capability (5.0), which looks like a potential caveat as well.)

I've just tried

% start a _new_ Matlab instance first!
parpool(3); 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

and

% start a _new_ Matlab instance first!
parpool(2); 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

several times, but the result is just the same. I've noticed that the size of the JIT cache (after it fails) is somewhat random: from 100MB to 400MB. Probably there's a chance that everything would be compiled OK. To me, this definitely looks like some kind of race, failing at a random point.

-----------------------------------------------------------------------------------------------

> Finally - why are you running GPU code on a pool of 16 workers? Generally, if most of your computation is on the GPU, you should not have more workers than you have GPUs.

Actually, my idea was to use just two GPUs (and to ensure, with a semaphore, that just one worker is using each GPU at a time). There are roughly 3 steps: read data, do some calculations, write data. While one worker is using a GPU, the others are busy with IO. Otherwise, GPU utilization would be rather low.

-----------------------------------------------------------------------------------------------

> JITted code from R2016b has been known in the past not to work on a 1080i due to bugs in the driver optimizer.

For me, everything works OK, unless I try to run something in parallel (via parfevalOnAll()) without that "longer pre-initialization" trick. I think the issue comes from the JIT trying to compile something in several processes at once. It looks like that race condition is NVidia's fault, but I'm not really sure. I'm also not sure whether this is fixed in newer NVidia drivers. If not, maybe it would be a good idea to add some kind of mutex inside Matlab to ensure that the JIT is not called from multiple processes simultaneously. And maybe you would be able to report this behavior to NVidia, so that they would actually fix it.
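The "mutex" idea can be approximated from user code by letting the workers initialize their GPU one at a time. The following is only a sketch under assumptions: it uses the R2016b-era spmd functions labindex, numlabs and labBarrier, assumes devices 1 and 2 are the two Pascal cards as in the tests in this thread, and has not been verified against the failure described here.

```matlab
nGpus = 2;                      % assumption: devices 1 and 2 are the Pascal cards
if isempty(gcp('nocreate'))
    parpool();                  % default pool size
end

spmd
    devIdx = mod(labindex - 1, nGpus) + 1;   % round-robin worker-to-GPU mapping
    for k = 1:numlabs
        if labindex == k
            % Only worker k touches CUDA during this iteration, so any JIT
            % compilation triggered here does not overlap with other workers.
            gpuDevice(devIdx);
            gather(gpuArray(1));
        end
        labBarrier;             % every worker waits until worker k has finished
    end
end
```

This trades start-up time for serialization: with 16 workers the first one pays the full JIT cost and the rest should mostly hit the already populated cache.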

Igor Varfolomeev on 29 Jun 2018

Edited: Igor Varfolomeev on 29 Jun 2018

> Can you check that this problem is only with your Pascal cards? Exclude one of the Pascal cards from the pool and try again.

Nice idea! The

% clear CUDA JIT cache and restart Matlab first
parpool(2);
spmd
   gpuDevice(labindex+1);
   gather(gpuArray(1));
end

works flawlessly.

Single Maxwell GPU:

% clear CUDA JIT cache and restart Matlab first
parpool(1);
spmd
   gpuDevice(3);
   gather(gpuArray(1));
end

works OK, (cuda JIT cache size is ~42MB after running this).

Single Pascal GPU:

% clear CUDA JIT cache and restart Matlab first
parpool(1);
spmd
   gpuDevice(2);
   gather(gpuArray(1));
end

works as well (CUDA JIT cache size is ~436MB after running this). Moreover, I've backed up these two CUDA JIT cache folders and compared them with a folder comparison tool - there are simply no "collisions" (no files with the same name and path), except the "index" file. Probably that's the reason why running the JIT simultaneously on those two GPUs works OK as well.

-----------------------------------------------------------------------------------------------

> Now try only the two Pascal cards and see if the problem recurs.

Indeed, this:

% clear CUDA JIT cache and restart Matlab first
parpool(2);
spmd
   gpuDevice(labindex);
   gather(gpuArray(1));
end

fails (this time it failed after successfully compiling ~268MB).

-----------------------------------------------------------------------------------------------

> If this is what is happening perhaps there is some issue with two processes reading from the JIT cache.

Well, I agree in general, but I'd say it probably fails on writing stuff.

-----------------------------------------------------------------------------------------------

> However, this does seem like an extremely unusual thing for NVIDIA to have got wrong - the system is designed to have multiple processes using the same GPU.

That's why I found this so interesting. It feels like it just can't be true. If it were NVidia's fault, it would have been either fixed long ago, or at least well known. Well... It still might be their fault - maybe no one noticed this before simply because few CUDA users use CUDA JIT to compile 400MB in one shot.

-----------------------------------------------------------------------------------------------

> I don't know why your JIT cache would be changing after it has been populated, since you are not generating any new kernels, just JITting the ones in the CUDA libraries and PCT libraries.

That's because after that bug happens, the CUDA JIT cache gets broken. And even the simplest commands, like

a=gpuArray(1)

fail, even after restarting Matlab. That's why I have to manually clear the CUDA JIT cache after each such error. So I've decided to clear it before each experiment like that. I think this also improves reproducibility. Also, this way I'm able to see that the resulting size of the JIT cache is somewhat random when it fails, which is also suspicious and looks like a race. (When it works, the size is always the same. Moreover, all those files have bit-identical content each time, except the "index" file.)
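For reference, the cache size quoted in these experiments can be measured from MATLAB with a recursive directory listing. This is a sketch that assumes the default Windows cache location mentioned in the question; cacheDir and sizeMB are illustrative names, and the recursive '**' pattern of dir needs R2016b or later.

```matlab
% Assumed default location of the CUDA JIT cache on Windows.
cacheDir = fullfile(getenv('USERPROFILE'), 'AppData', 'Roaming', ...
                    'NVIDIA', 'ComputeCache');

% List everything below cacheDir; '**' makes dir recurse into subfolders.
entries = dir(fullfile(cacheDir, '**', '*'));
entries = entries(~[entries.isdir]);          % keep files only

% Total size in MB (0 if the folder is missing or empty).
sizeMB = sum([entries.bytes]) / 2^20;
fprintf('ComputeCache: %d files, %.1f MB\n', numel(entries), sizeMB);
```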

-----------------------------------------------------------------------------------------------

> Oh, and one final thing to try is running three MATLAB instances (instead of a pool), use different cards, and check everything works. Ultimately, that's all a pool is, it's a communicating set of MATLAB instances.

Running

   % clear CUDA JIT cache and restart Matlab first
   gpuDevice(1); gather(gpuArray(1))

in two separate Matlab instances produces an even weirder result. This specific command finishes OK, but the next one fails (in both Matlab instances).

   % clear CUDA JIT cache and restart Matlab first
   >> gpuDevice(1); gather(gpuArray(1))
   ans =
     1
   >> gpuDevice(1); gather(gpuArray(1))
   Error using gpuDevice (line 26)
   An unexpected error occurred during CUDA execution. The CUDA error 
   was:
   unknown error

The CUDA JIT cache size is 377MB, which definitely looks like something is wrong. If there's just one Matlab instance, running this command multiple times works OK, of course.

-----------------------------------------------------------------------------------------------

UPD: I repeated that last experiment a few more times (see the table below), because the result above was really weird. It looks like running:

>> gpuDevice(X); gather(gpuArray(1))

simultaneously in two Matlab instances finishes OK with ~33% chance (which is much higher than for parpool). I've tried using X==1 in both instances as well as using X==1 in the first one and X==2 in the second one.

I'd say that the bug is just the same, but the Matlab processes are less perfectly in sync, because I launch these commands manually, and thus the second command is started about a second later.

I ran into that "really weird" result one more time, but only when X==1 in both instances. (Still, I can't say that my results are statistically significant.) Maybe it's somehow related to almost all of the CUDA code compiling successfully. Not sure.

exp#  GPU  Error  Error_on_2nd_command  CacheSize (MB)
1     1,2  true   true                  323.7
2     1,2  false  false                 435.9
3     1,2  true   true                  128.3
4     1,1  false  true                  361.8
5     1,1  false  false                 435.9
6     1,1  true   true                  311.7

Igor Varfolomeev on 3 Jul 2018

> Have you tried clearing the cache, then running gpuArray(1) on one Pascal card. Wait for the JIT cache to be populated fully, then open a pool and run on multiple cards.

Yep, this is the part of the "longer pre-initialization" workaround that I'm currently using. But still, this:

% clear CUDA JIT cache and restart Matlab first
gpuDevice(1)
gather(gpuArray(1)) 
parpool(N); 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

usually fails for N=16. It usually works OK if N<=3 though (probably the issue is still there, but the chance to run into it is somewhat lower).

To make it work more reliably, I also have to run gpuDevice on each worker first (I have to explicitly assign GPUs to workers anyway):

% clear CUDA JIT cache and restart Matlab first
gpuDevice(1)
gather(gpuArray(1)) 
parpool(16); 
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) ) 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

-----------------------------------------------------------------------------------------------

> I'd be very surprised if there's a general issue, since we regularly run multi-GPU code on dual Pascal cards.

Yep, there's always a chance that it's specific to my OS version, CUDA version, GPU driver version... The GPU driver is probably the most questionable of these. I've not tested other versions yet, only r384.94. Maybe I'll do this in a while.

-----------------------------------------------------------------------------------------------

> Is this Windows, and is one of your Pascal cards driving the display?

Yep, it's Windows 10 version 1703 build 15063.1112. All my monitors are already attached to K2200. :)

Igor Varfolomeev on 3 Jul 2018

> I don't really have any more ideas I'm afraid

Personally, my solution was to rewrite the GPU part in plain CUDA and C++, using mex, but without using mxInitGPU and gpuArray at all. This is somewhat hacky (because I have to remember not to use gpuArray), but it works. And it's faster.

-----------------------------------------------------------------------------------------------

> Try downloading a newer driver, then try downloading an OLDER driver.

Yep, maybe I'll try different drivers later. I just don't have enough time for this now.

But, I've just checked that the code from my original message fails on a completely different system (with two GTX1080+K2200 as well, though), with Windows 7 x64 and NVidia 388.19 driver. So, at least, it's not "just on my machine".

-----------------------------------------------------------------------------------------------

> Nonetheless, I'll requisition one and check your code in 16b.

It would be very nice if you were able to reproduce the issue - this confirmation might help a lot if I file a bug report with NVidia.

-----------------------------------------------------------------------------------------------

> since this works fine in later versions of MATLAB

Personally, I'm not sure CUDA JIT works in newer Matlab versions... Isn't it working just because JIT is not actually needed? If I had a Volta GPU, I could have tried to reproduce the same thing in R2017b as well....

-----------------------------------------------------------------------------------------------

> I suspect it is but really, CUDA 7.5 is pretty old now and NVIDIA don't worry themselves too much about supporting the JIT pipeline for older cards, so there could be an issue in your driver that won't be fixed, or will never work because the PTX itself is faulty.

Well, it's definitely possible that this is specific to a particular "pipeline" and a particular PTX version, even though preventing different processes/threads from writing into the same file simultaneously (or whatever) looks like a fairly version-independent part of the code. But from what I currently know, it's also possible that the bug is still there.

-----------------------------------------------------------------------------------------------

> running some of the CUDA toolkit samples

That would be the best possible approach, if there were an easy way to produce 400MB of CUDA binaries at once... With just a few KB, it might be difficult to reproduce the whole thing.

-----------------------------------------------------------------------------------------------

> setting the environment variable CUDA_FORCE_PTX_JIT to 1

Sounds interesting... But this does not give me the same behaviour - the ComputeCache is still almost empty after running those commands - only a few KB. It looks like files are being added and instantly erased. Hmm... Could you please advise - am I doing something wrong here? Were you able to make it populate the ComputeCache?

Joss Knight on 4 Jul 2018

I had a colleague check their dual GTX 1080 system and they saw no issues, with 16b or with the current version with a forced JIT.

> Sounds interesting... But this does not give me the same behaviour - the ComputeCache is still almost empty after running those commands - only a few KB. It looks like files are being added and instantly erased. Hmm... Could you please advise - am I doing something wrong here? Were you able to make it populate the ComputeCache?

This works for me but ... possibly only when your card's architecture is the maximum supported or higher, because if it were lower there would be no compatible PTX in the libraries. So you'll need to run R2017a or R2017b for your Pascal card.

It would be good to establish why upgrading MATLAB is not an option for you.

Igor Varfolomeev on 8 Jul 2018

Edited: Igor Varfolomeev on 8 Jul 2018

> It would be good to establish why upgrading MATLAB is not an option for you.

That's because in this particular case the request to me was to improve the performance without any major changes, like adding new external dependencies (e.g. newer MCR). For the next version, we'll definitely migrate to a newer Matlab.

-----------------------------------------------------------------------------------------------

> This works for me but ... possibly only when your card's architecture is the maximum supported or higher, because if it were lower there would be no compatible PTX in the libraries. So you'll need to run R2017a or R2017b for your Pascal card.

Yep, I've used R2017b (because R2017a has the same issue as R2016b). But this trick does not work for me. I've just tried this once again - the ComputeCache size oscillates between 0 and a few MB while gpuDevice(1) is running (it takes a few minutes, so it's definitely compiling something). In the end, the ComputeCache size is below 1 MB. That's strange. Just in case: I set this environment variable in cmd before starting Matlab, like this:

Microsoft Windows [Version 10.0.15063]
(c) 2017 Microsoft Corporation. All rights reserved.
C:\>cd "C:\Program Files\MATLAB\R2017b\bin\"  
C:\Program Files\MATLAB\R2017b\bin>echo %CUDA_CACHE_MAXSIZE%
2147483647
C:\Program Files\MATLAB\R2017b\bin>echo %CUDA_CACHE_DISABLE%
0    
C:\Program Files\MATLAB\R2017b\bin>set CUDA_FORCE_PTX_JIT=1
C:\Program Files\MATLAB\R2017b\bin>echo %CUDA_FORCE_PTX_JIT%
1
C:\Program Files\MATLAB\R2017b\bin>matlab.exe

In R2018a Update 3, trying to run GPU-related commands with CUDA_FORCE_PTX_JIT=1 produces a different result. The ComputeCache remains empty. There's almost no delay. And it fails on convn:

>> getenv('CUDA_FORCE_PTX_JIT')
ans =
  '1'
    
>> gpuDevice(1);
Warning: The CUDA driver must recompile the GPU libraries because CUDA_FORCE_PTX_JIT is set to '1'. Recompiling can take several minutes. Learn more. 
> In parallel.internal.gpu.selectDevice
In parallel.gpu.GPUDevice.select (line 58)
In gpuDevice (line 21) 
    
>> gpuDevice(2);
Warning: The CUDA driver must recompile the GPU libraries because CUDA_FORCE_PTX_JIT is set to '1'. Recompiling can take several minutes. Learn more. 
> In parallel.internal.gpu.selectDevice
In parallel.gpu.GPUDevice.select (line 58)
In gpuDevice (line 21) 
    
>> a=gpuArray(zeros([9 9 9]));
>> b=gpuArray(zeros([3 3 3]));
>> c=convn(a,b)
Error using gpuArray/convn
An unexpected error occurred trying to launch a kernel. The CUDA error was:
invalid device symbol

Probably it fails because R2018a is designed for CUDA 9, and my current GPU driver does not support it. This is as expected. The strange part is that the ComputeCache is just empty.

-----------------------------------------------------------------------------------------------

> I had a colleague check their dual GTX 1080 system and they saw no issues, with 16b or with the current version with a forced JIT.

Thanks for testing this! But could you please also specify, what was the NVidia driver version?

Given that it works in the "current version", that's probably some newer driver, with CUDA 9 support. Maybe this means that the issue is fixed in newer NVidia drivers. I think I should test this myself as well.

However, maybe, after all, this issue does depend on "mixing Pascal and Maxwell". It looks like at least some aspects depend on it. I've recently noticed that this fails:

% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16); 
fetchOutputs( parfevalOnAll(@gpuDevice,1) )

but this works (tested twice):

% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16); 
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) )

this works as well (tested twice):

% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16); 
spmd
  gpuDevice(mod(labindex,2)+1);
  gather(gpuArray(1));
end

but this fails:

% clear CUDA JIT cache and restart Matlab first
gpuDevice(1);
parpool(16); 
spmd
  gpuDevice(mod(labindex,2)+2);
  gather(gpuArray(1));
end  
    
Starting parallel pool (parpool) using the 'local' profile ... connected to 16 workers.
Warning: An error has occurred during SPMD execution. An attempt has been made to interrupt execution on the workers. If this situation persists, it may be necessary to
interrupt execution using CTRL-C and then deleting and restarting the parallel pool.
    
The error that occurred on worker 13 is:
Error using gpuDevice (line 26)
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
. 
> In spmdlang.RemoteSpmdExecutor/maybeWarnIfInterruptedAndWaiting (line 300)
  In spmdlang.RemoteSpmdExecutor/isComputationComplete (line 131)
  In spmdlang.spmd_feval_impl (line 19)
  In spmd_feval (line 8) 
Error detected on worker 13.
    
Caused by:
  Error using gpuDevice (line 26)
  An unexpected error occurred during CUDA execution. The CUDA error was:
  unknown error

-----------------------------------------------------------------------------------------------

UPD:

I've just tried Nvidia 397.93 driver. And now the original issue is gone, and this:

% clear CUDA JIT cache and restart Matlab first
parpool(16); 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

works OK in R2016b (tested twice). And the ComputeCache size is much smaller - only ~140MB.

So, after all, it looks like the issue does not exist in newer driver versions. So, sorry for the buzz. I should have checked this before. :)

(But the CUDA_FORCE_PTX_JIT in R2017b still behaves the same for me, by the way.)

Answer 1

Igor Varfolomeev on 25 Nov 2018

https://in.mathworks.com/matlabcentral/answers/407188-initializing-gpu-on-multiple-workers-cause-an-unknown-error#answer_348790

As noted in comments, it looks like the issue does not exist in newer driver versions. So, I'm sorry for the buzz.
