Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.

Scott Stearns on 20 Mar 2021

Commented: Tom Van den heuvel on 21 Sep 2021

Hi,

This error stops training when 'ExecutionEnvironment' is set to 'parallel', 'multi-gpu', or 'gpu'. Training runs uninterrupted when it is set to 'cpu'. I'm running the code for the first time on an Ubuntu 20.04.2 LTS system with a 12-core Intel i9 CPU and two RTX 3070 GPUs. It reports only 12 workers and does not seem to recognize the GPUs.

Any suggestions or help are welcome.

Thank you

Accepted Answer

Joss Knight on 21 Mar 2021

Edited: Joss Knight on 23 Mar 2021

After some investigation (see thread below), this problem seems to be limited to RTX 3080 and 3070 GPUs on Linux. It can be worked around by disabling tensor cores. Restart MATLAB and run

setenv NVIDIA_TF32_OVERRIDE 0

before you do anything else. Note that this workaround reduces performance. Further investigation is under way to find a solution that doesn't require it.
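As a minimal sketch of the workaround above, the variable can be set in function form at the very start of a fresh session and then checked before the GPU is first used. The check with getenv and the canUseGPU guard are additions for illustration, not part of the original answer. Whether parallel pool workers pick up the variable automatically is not confirmed in this thread.

```matlab
% Run first in a freshly started MATLAB session, before any GPU call.
% Function form of: setenv NVIDIA_TF32_OVERRIDE 0
setenv('NVIDIA_TF32_OVERRIDE', '0');

% Confirm that the variable is visible to this MATLAB process.
tf32Override = getenv('NVIDIA_TF32_OVERRIDE');
assert(strcmp(tf32Override, '0'), 'NVIDIA_TF32_OVERRIDE was not set');

% Only now touch the GPU (requires Parallel Computing Toolbox).
if canUseGPU()
    gpu = gpuDevice();
    fprintf('Using %s (compute capability %s)\n', ...
        gpu.Name, gpu.ComputeCapability);
else
    disp('No usable GPU detected; training would fall back to the CPU.');
end
```

Placing these lines in a startup.m file is one way to make sure they run before anything else in the session.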

Original answer:

Are you running MATLAB release R2021a? The 3070 is not supported on earlier releases.

Scott Stearns on 21 Mar 2021

gpuDeviceTable

ans =

  2×5 table

    Index           Name           ComputeCapability    DeviceAvailable    DeviceSelected
    _____    __________________    _________________    _______________    ______________

      1      "GeForce RTX 3070"          "8.6"               true               true
      2      "GeForce RTX 3070"          "8.6"               true               false

%%%%%%%%%%%%%%%

the gpuDevice(i) output for both is the same:

  CUDADevice with properties:

                      Name: 'GeForce RTX 3070'
                     Index: 1
         ComputeCapability: '8.6'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 8.3701e+09
           AvailableMemory: 8.0412e+09
       MultiprocessorCount: 46
              ClockRateKHz: 1770000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

Scott Stearns on 21 Mar 2021

Joss, sorry I didn't get this earlier. Here is the output from the attempted training (ExecutionEnvironment 'multi-gpu'):

training network....
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 2).
Initializing input data normalization.
|======================================================================================================================|
Mini-batch | Validation | Base Learning |
| Loss | Loss | Rate |
|======================================================================================================================|
Error using trainNetwork (line 184)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.

Error in deepLearnWhip1 (line 106)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);

Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
    Error detected on worker 1.
        Error using nnet.internal.cnn.layer.util.Convolution2DGPUStrategy/backward (line 82)
        Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.

Artem Lenskiy on 23 Mar 2021

Edited: Artem Lenskiy on 23 Mar 2021

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3080    Off  | 00000000:01:00.0  On |                  N/A |
| 30%   29C    P0    85W / 320W |    440MiB / 10001MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3080    Off  | 00000000:21:00.0 Off |                  N/A |
| 30%   23C    P8     4W / 320W |     10MiB / 10018MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Please let me know if it helps.

Scott Stearns on 23 Mar 2021

Edited: Scott Stearns on 23 Mar 2021

Tue Mar 23 08:23:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   49C    P8    27W / 240W |   3203MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3070    Off  | 00000000:68:00.0  On |                  N/A |
|  0%   48C    P8    29W / 240W |    271MiB /  7974MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Scott Stearns on 17 Jun 2021

Hi Joss,

This workaround is no longer working. Do we have any progress on the toolboxes/GPU issues?

Here is what I'm seeing:

training network....
Error using trainNetwork (line 184)
GPU support for deep neural networks requires Parallel Computing Toolbox and a supported GPU device.

Error in deepLearnUCSF (line 139)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);

Caused by:
    Error using feval
    Unable to find a supported GPU device. For more information on GPU support, see GPU Support by Release.

I restarted MATLAB and have setenv NVIDIA_TF32_OVERRIDE 0 at the top of my code. Here are the trainingOptions I'm using. Frustrated that this expensive machine is not being used. Hope there's help on this...

Thanks,

Scott

options = trainingOptions('sgdm', ...
    'InitialLearnRate',initialLearnRate, ...
    'Momentum',momentumFactor, ...
    'MaxEpochs',maxEpochs, ...
    'MiniBatchSize',miniBatchSize, ...
    'Shuffle','every-epoch', ...
    'Verbose',true, ...
    'ValidationFrequency',floor(NumTrain/miniBatchSize), ...
    'ValidationData',imagesValidds, ...
    'Plots','training-progress', ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',learnRateDropFactor, ...
    'LearnRateDropPeriod',learnRateDropPeriod, ...
    'CheckpointPath',checkpointPath, ...
    'ExecutionEnvironment','multi-gpu');
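One way to narrow down an "Unable to find a supported GPU device" error like the one above is to check the release and then ask MATLAB what it reports for each device. The following is only a diagnostic sketch, not a fix. It uses the device properties shown earlier in this thread, and the variable names are illustrative. Note that gpuDevice(idx) selects the device and resets it, clearing its memory, so run this before loading any data onto the GPU.

```matlab
% Diagnostic sketch: report the release and every GPU MATLAB can see.
fprintf('MATLAB release: R%s\n', version('-release'));
if verLessThan('matlab', '9.10')   % 9.10 corresponds to R2021a
    disp('Release is older than R2021a; the RTX 3070 is not supported there.');
end

nGpus = gpuDeviceCount;            % requires Parallel Computing Toolbox
fprintf('MATLAB detects %d GPU device(s)\n', nGpus);

for idx = 1:nGpus
    try
        % Selects device idx and resets it (clears its memory).
        gpu = gpuDevice(idx);
        fprintf('%d: %s, supported=%d, available=%d, driver=%g, toolkit=%g\n', ...
            idx, gpu.Name, gpu.DeviceSupported, gpu.DeviceAvailable, ...
            gpu.DriverVersion, gpu.ToolkitVersion);
    catch err
        % An unsupported or unavailable device makes gpuDevice(idx) throw.
        fprintf('%d: could not be selected (%s)\n', idx, err.message);
    end
end
```

If the count is zero or every device fails to select, the problem is likely in the driver or release combination rather than in the training options.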

Joss Knight on 21 Sep 2021

Edited: Joss Knight on 21 Sep 2021

This is fixed in the next update of MATLAB R2021a; however, you'd be better off simply downloading R2021b, which will be out in a week or so.

Unfortunately NVIDIA weren't able to provide us with a fix that has no effect on performance, but we can at least limit the workaround to the problematic convolutions. A proper fix will arrive with the next CUDA upgrade.

We've never seen this problem on Windows.

Tom Van den heuvel on 21 Sep 2021

Thx for the update!

Release: R2021a
