CUDA with a Turing GPU and Neural Network Toolbox in R2017b

I'm running MATLAB R2017b with Update 9, and I wanted to train a convolutional neural network using trainNetwork(...) together with an NVIDIA Turing GPU (RTX 2070, driver version 416.34). However, after quite a bit of delay, the following error message shows up:
Training on single GPU.
Initializing image normalization.
|=======================================================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Validation  |  Mini-batch  |  Validation  |  Base Learning  |
|         |             |    (seconds)   |     Loss     |     Loss     |     RMSE     |     RMSE     |      Rate       |
|=======================================================================================================================|
Error using trainNetwork (line 140)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in NetTrainTest (line 67)
net = trainNetwork(inputConv,outputConv,layers,options);
Caused by:
Error using nnet.internal.cnngpu.convolveForward2D
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Training the network on the CPU works fine, and I also get no errors when doing ordinary computations with gpuArray data.
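To illustrate the distinction (my own sketch, not part of the original post): plain gpuArray arithmetic exercises the CUDA runtime but not cuDNN, which is why it can succeed while trainNetwork fails.

```matlab
% Plain gpuArray computation: goes through CUDA, but never calls cuDNN
g = gpuArray.rand(1000);      % 1000x1000 random matrix on the GPU
s = gather(sum(g(:)));        % reduce on the GPU, copy the scalar back

% trainNetwork, by contrast, routes convolutions and activations
% through cuDNN, which is where the failure occurs.
```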
The output of gpuDevice is as follows:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 7.6195e+09
MultiprocessorCount: 36
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Any ideas on what causes this issue? Thank you for your reply.

 Accepted Answer

That is a bit distressing to discover. However, if the option is available to you, you should upgrade MATLAB to R2018a or R2018b. R2017b does not natively support Turing, so there may be issues.

24 Comments

Does that mean the error is internal and there is nothing to be done? Unfortunately, I do not have the possibility to upgrade at the moment.
It looks like you have hit this bug, which was fixed in Update 2. I can only assume that the fix NVIDIA provided only works for Volta and not Turing; we will investigate.
All right, thank you very much!
If you type version -modules after running your code, what does it say on the line with libcudnn.so in it?
GW on 28 Oct 2018 (edited 28 Oct 2018)
I'm on Windows 7 64-bit. The version of the library cudnn64_7.dll (which I assume is the Windows equivalent of the file you mentioned) is 6.14.11.9000 according to Explorer. The product name is given as 'NVIDIA CUDA 9.0.176 cuDNN Library'.
Edit: MATLAB has some issues detecting the version correctly: E:\Programme\Matlab2017b\bin\win64\cudnn64_7.dll Version unknown
That certainly implies you have the fix. Thanks for checking.
GW on 30 Oct 2018 (edited 30 Oct 2018)
Thanks for checking. Do you think you can provide a fix in one of the next updates?
I would say that is unlikely because R2017b is two releases old.
All right. I upgraded to R2018a Update 6 for testing; the bug is in there as well.
Okay thanks. That's very bad news.
All right, I made quite a bit of progress. I suspect there is some odd bug in the CUDA JIT compiler: when I run the train command twice, everything works perfectly on the second call.
So
try
    % The first call may fail with CUDNN_STATUS_EXECUTION_FAILED on Turing
    net = trainNetwork(inputConv,outputConv,layers,options);
catch
    % The second attempt reliably succeeds
    net = trainNetwork(inputConv,outputConv,layers,options);
end
works reliably. Ugly coding but...oh well...
Does R2018b use the CUDA 10 libraries?
Edit: The RTX 2070 rocks. 10-15x speedup compared to my i5-4690k for my application.
Unfortunately, CUDA 10 came out after R2018b shipped, so it is on CUDA 9.1. That is not supposed to make a significant difference, because Turing is not actually a new major compute capability, which means no JIT compilation should be necessary.
I'm glad you've found some sort of workaround. I can't be sure what the issue is, but rather than calling trainNetwork, which is a bit of a heavy hammer, you could try something simple that triggers loading of the cuDNN libraries. You can jump straight to an internal call such as
nnet.internal.cnngpu.reluForward(gpuArray(0));
Perhaps that fixes the issue.
All right, that earns you the checkmark. I added
try
    nnet.internal.cnngpu.reluForward(gpuArray(0));
end
to my startup file; things are really fast now and no more errors show up.
Thank you for the kind and helpful support!
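For anyone adapting this workaround, a slightly fuller startup.m sketch along the same lines; the gpuDeviceCount guard and the warning text are my own additions, not from this thread:

```matlab
% Warm up the cuDNN libraries once at startup so the first real
% trainNetwork call does not hit CUDNN_STATUS_EXECUTION_FAILED.
if gpuDeviceCount > 0
    try
        nnet.internal.cnngpu.reluForward(gpuArray(0));
    catch err
        % On affected Turing setups this first call is expected to fail;
        % subsequent cuDNN calls then succeed.
        warning('cuDNN warm-up failed: %s', err.message);
    end
end
```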
Great news. However, can you try one more thing for me?
Delete your NVIDIA compute cache. You'll find it in %APPDATA%\NVIDIA\ComputeCache; delete that directory, or at least its contents.
Make sure your CUDA_CACHE_MAXSIZE environment variable is set to at least 512 MB. Follow the instructions in this Answer.
Start MATLAB again (without your new command in startup.m). You will have to wait 5 minutes for the cache to be repopulated by the NVIDIA JIT compiler when you first run something on the GPU. But from then on I would expect you never to see this problem again.
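For reference, a minimal sketch of setting that variable (the 512 MB value follows the advice above; CUDA_CACHE_MAXSIZE is NVIDIA's JIT cache-size variable, given in bytes):

```shell
# Set the NVIDIA JIT cache size to 512 MB (the value is in bytes).
# On Windows, the persistent equivalent would be:
#   setx CUDA_CACHE_MAXSIZE 536870912
export CUDA_CACHE_MAXSIZE=$((512 * 1024 * 1024))
echo "$CUDA_CACHE_MAXSIZE"
```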
Thanks for your idea.
I'm afraid that did not fix it. The error still comes up on the first call to one of the CUDA libraries after every MATLAB restart.
By the way, when calling
nnet.internal.cnngpu.reluForward(gpuArray(0));
for the first time, the error is:
Identifier: nnet_cnn:internal:cnngpu:CuDNNError
Message: Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
No cause, no stack.
Pity. It's such odd behaviour that I felt it had to be some stale data in the CUDA cache.
Don't worry about the lack of a stack, that function is a built-in so there is no MATLAB call stack below the entry point.
Hello, I have the same problem when using
[net, info] = trainNetwork(pximds,lgraph,options);
Error:
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
MATLAB: R2018b (Update 2) / GPU: an RTX 2070 as well / CUDA Toolkit 10.0
CUDADevice with properties:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10
ToolkitVersion: 9.1000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 6.8846e+09
MultiprocessorCount: 36
ClockRateKHz: 1635000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Has anyone been able to solve this problem, or are there other ideas?
Thank you for your reply!
Please see the workaround described above.
This question contains the answer with the most complete workaround for all the Turing problems.
M J on 7 Dec 2020 (edited 7 Dec 2020)
Hi, I am planning on getting a GeForce RTX 2080 (I mainly use the trainNetwork function). Do you know of any compatibility issues with MATLAB R2020b, or is it reasonable to go with that option? I'm hoping to significantly speed up my training process. Thank you!
R2020b is not ready for the RTX 2080 for some of the Deep Learning functions, due to some bugs in the NVIDIA-supplied compute libraries.
I have not heard a timeframe for when the fix is expected to be in place.
Sorry to bother you again, but what about the RTX 2060? Is there a list of GPUs that would be compatible with R2020b? Also, I have heard that some GPUs do not significantly increase training speed (with the trainNetwork function), and I'm worried about that. Thanks for your help!
Sorry, my mistake. The RTX 2xxx should be fine in R2020b. It is the RTX 3xxx that are not ready, along with the RTX A6000
Thank you so much!! Really appreciate it

Release: R2017b
Asked: GW on 26 Oct 2018
Commented: M J on 7 Dec 2020
