CUDA with a Turing GPU and Neural Network Toolbox in R2017b

I'm running MATLAB R2017b with Update 9, and I wanted to train a convolutional neural network using trainNetwork(...) together with an NVIDIA Turing GPU (RTX 2070, driver version 416.34). However, after quite a bit of delay, the following error message shows up:
Training on single GPU.
Initializing image normalization.
|=======================================================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Validation  |  Mini-batch  |  Validation  |  Base Learning  |
|         |             |    (seconds)   |     Loss     |     Loss     |     RMSE     |     RMSE     |      Rate       |
|=======================================================================================================================|
Error using trainNetwork (line 140)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in NetTrainTest (line 67)
net = trainNetwork(inputConv,outputConv,layers,options);
Caused by:
Error using nnet.internal.cnngpu.convolveForward2D
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Training the network on the CPU works fine, and I also get no errors when doing ordinary computations with gpuArray data.
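To illustrate the distinction (my own sketch, not part of the original post): plain gpuArray arithmetic exercises the CUDA runtime but not cuDNN, which is why it can succeed while trainNetwork fails.

```matlab
% Plain gpuArray computation: goes through CUDA, but never calls cuDNN
g = gpuArray.rand(1000);      % 1000x1000 random matrix on the GPU
s = gather(sum(g(:)));        % reduce on the GPU, copy the scalar back

% trainNetwork, by contrast, routes convolutions and activations
% through cuDNN, which is where the failure occurs.
```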
The output of gpuDevice is as follows:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 7.6195e+09
MultiprocessorCount: 36
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Any ideas on what causes this issue? Thank you for your reply.

 Accepted Answer

That is a bit distressing to discover. However, if the option is available to you, you should upgrade MATLAB to R2018a or R2018b. R2017b does not natively support Turing, so there may be issues.

24 Comments

Does that mean the error is internal and there is nothing to be done? Unfortunately, I do not have the possibility to upgrade at the moment.
It looks like you have hit this bug, which was fixed in Update 2. I can only assume that the fix NVIDIA provided only works for Volta and not Turing; we will investigate.
All right, thank you very much!
If you type version -modules after running your code, what does it say on the line with libcudnn.so in it?
GW on 28 Oct 2018 (edited 28 Oct 2018)
I'm on Windows 7 64-bit. The version of the library cudnn64_7.dll (which I assume is the Windows equivalent of the file you mentioned) is 6.14.11.9000 according to Explorer. The product name is given as 'NVIDIA CUDA 9.0.176 cuDNN Library'.
Edit: MATLAB has some issues detecting the version correctly: E:\Programme\Matlab2017b\bin\win64\cudnn64_7.dll Version unknown
That certainly implies you have the fix. Thanks for checking.
GW on 30 Oct 2018 (edited 30 Oct 2018)
Thanks for checking. Do you think you can provide a fix in one of the next updates?
I would say that is unlikely because R2017b is two releases old.
All right. I upgraded to R2018a Update 6 for testing; the bug is in there as well.
Okay thanks. That's very bad news.
All right, I made quite a bit of progress. I suspect there is some odd bug in the CUDA JIT compiler: when I run the train command twice, everything works perfectly on the second call.
So
try
    % The first call may fail with CUDNN_STATUS_EXECUTION_FAILED on Turing
    net = trainNetwork(inputConv,outputConv,layers,options);
catch
    % The second attempt reliably succeeds
    net = trainNetwork(inputConv,outputConv,layers,options);
end
works reliably. Ugly coding but...oh well...
Does R2018b use the CUDA 10 libraries?
Edit: The RTX 2070 rocks. 10-15x speedup compared to my i5-4690k for my application.
Unfortunately, CUDA 10 came out after R2018b shipped, so it is on CUDA 9.1. That is not supposed to make a significant difference, because Turing is not actually a new major compute capability, which means no JIT compilation should be necessary.
I'm glad you've found some sort of workaround. I can't be sure what the issue is, but rather than calling trainNetwork, which is a bit of a heavy hammer, you could try something simple that triggers loading of the cuDNN libraries. You can jump straight to an internal call such as
nnet.internal.cnngpu.reluForward(gpuArray(0));
Perhaps that fixes the issue.
All right, that earns you the checkmark. I added
try
    nnet.internal.cnngpu.reluForward(gpuArray(0));
end
to my startup file; things are really fast now and no more errors show up.
Thank you for the kind and helpful support!
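For anyone adapting this workaround, a slightly fuller startup.m sketch along the same lines; the gpuDeviceCount guard and the warning text are my own additions, not from this thread:

```matlab
% Warm up the cuDNN libraries once at startup so the first real
% trainNetwork call does not hit CUDNN_STATUS_EXECUTION_FAILED.
if gpuDeviceCount > 0
    try
        nnet.internal.cnngpu.reluForward(gpuArray(0));
    catch err
        % On affected Turing setups this first call is expected to fail;
        % subsequent cuDNN calls then succeed.
        warning('cuDNN warm-up failed: %s', err.message);
    end
end
```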
Great news. However, can you try one more thing for me?
Delete your NVIDIA compute cache. You'll find it in %APPDATA%\NVIDIA\ComputeCache; delete that directory, or at least its contents.
Make sure your CUDA_CACHE_MAXSIZE environment variable is set to at least 512 MB. Follow the instructions in this Answer.
Start MATLAB again (without your new command in startup.m). You will have to wait 5 minutes for the cache to be repopulated by the NVIDIA JIT compiler when you first run something on the GPU. But from then on I would expect you never to see this problem again.
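For reference, a minimal sketch of setting that variable (the 512 MB value follows the advice above; CUDA_CACHE_MAXSIZE is NVIDIA's JIT cache-size variable, given in bytes):

```shell
# Set the NVIDIA JIT cache size to 512 MB (the value is in bytes).
# On Windows, the persistent equivalent would be:
#   setx CUDA_CACHE_MAXSIZE 536870912
export CUDA_CACHE_MAXSIZE=$((512 * 1024 * 1024))
echo "$CUDA_CACHE_MAXSIZE"
```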
Thanks for your idea.
I'm afraid that did not fix it. The error still comes up on the first call to one of the CUDA libraries after every MATLAB restart.
By the way, when calling
nnet.internal.cnngpu.reluForward(gpuArray(0));
for the first time, the error is:
Identifier: nnet_cnn:internal:cnngpu:CuDNNError
Message: Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
No cause, no stack.
Pity. It's such odd behaviour that I felt it had to be some stale data in the CUDA cache.
Don't worry about the lack of a stack, that function is a built-in so there is no MATLAB call stack below the entry point.
Hello, I have the same problem when using
[net, info] = trainNetwork(pximds,lgraph,options);
Error:
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
MATLAB: R2018b (Update 2) / GPU: an RTX 2070 as well / CUDA Toolkit 10.0
CUDADevice with properties:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10
ToolkitVersion: 9.1000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 6.8846e+09
MultiprocessorCount: 36
ClockRateKHz: 1635000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Has anyone been able to solve this problem, or are there other ideas?
Thank you for your reply!
Please see the workaround described above.
This question contains the answer with the most complete workaround for all the Turing problems.
M J on 7 Dec 2020 (edited 7 Dec 2020)
Hi, I am planning on getting a GeForce RTX 2080 (I mainly use the trainNetwork function). Do you know of any compatibility issues with MATLAB R2020b, or is it reasonable to go with that option? I'm hoping to significantly speed up my training process. Thank you!
R2020b is not ready for the RTX 2080 for some of the Deep Learning functions, due to some bugs in the NVIDIA-supplied compute libraries.
I have not heard a timeframe for when the fix is expected to be in place.
Sorry to bother you again, but what about the RTX 2060? Is there a list of GPUs that would be compatible with R2020b? Also, I have heard that some GPUs do not significantly increase training speed (with the trainNetwork function), and I'm worried about that. Thanks for your help!
Sorry, my mistake. The RTX 2xxx should be fine in R2020b. It is the RTX 3xxx that are not ready, along with the RTX A6000
Thank you so much!! Really appreciate it

Release: R2017b
Asked: GW on 26 Oct 2018
Commented: M J on 7 Dec 2020
