GPU computation freezes randomly on Windows 10

I'm experiencing a strange problem with GPU computation on a Windows 10 machine.
The function that causes the problem is a simple random walk called with arrayfun() for computation on the GPU, so nothing fancy. Since it only adds a random step to the position for a fixed number of time steps, it cannot get stuck in theory.
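The original function was not posted. As a rough sketch of the kind of kernel described above, the following assumes a one-dimensional walk with uniform steps; the names randomWalk, nWalkers and nSteps and all sizes are hypothetical.

```matlab
% Hypothetical reconstruction; the original function was not posted.
% Local functions in a script need R2016b or later; alternatively,
% save randomWalk in its own file randomWalk.m.
nWalkers = 1e6;     % assumed number of independent walkers
nSteps   = 1e4;     % assumed number of time steps per walker

x0    = zeros(nWalkers, 1, 'gpuArray');      % start positions on the GPU
xEnd  = arrayfun(@randomWalk, x0, nSteps);   % compiled into a single GPU kernel
xHost = gather(xEnd);                        % copy the end positions to the host

function x = randomWalk(x, nSteps)
    % Add one uniform random step in [-0.5, 0.5) per time step.
    for k = 1:nSteps
        x = x + rand - 0.5;
    end
end
```

With a large nSteps this is one long-running kernel launch, which is why the timeout setting matters on a WDDM device.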
The exact same code runs perfectly fine on Windows 7 and Windows 8.1 on the same machine, using a GTX 1070 with the TdrLevel 0 registry entry. I tried several different driver versions on Windows 10, but after some random time the computation freezes. The GPU load remains at 100%, but the power consumption drops from 45% to 25% and stays there forever. There is also no monitor connected to this GPU.
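To confirm that the TdrLevel entry is actually in place on the Windows 10 installation, the value can be read from within MATLAB. This is only a sketch: it assumes the standard Windows TDR registry location under GraphicsDrivers, and the variable names are illustrative.

```matlab
% Read the WDDM timeout detection setting from the registry (Windows only).
% Assumption: the value lives under the standard GraphicsDrivers key.
tdrKey = 'SYSTEM\CurrentControlSet\Control\GraphicsDrivers';
try
    tdrLevel = winqueryreg('HKEY_LOCAL_MACHINE', tdrKey, 'TdrLevel');
    fprintf('TdrLevel = %d (0 means timeout detection is disabled)\n', tdrLevel);
catch
    % winqueryreg throws an error when the value does not exist.
    disp('TdrLevel is not set; Windows uses its default timeout behaviour.');
end
```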
Sometimes I can trigger this freeze by opening the Task Manager or GPU-Z, so it seems that the GPU freezes when something tries to query information from it.
How can I debug the reason for this freeze when using arrayfun? When it freezes I cannot use Ctrl+C to stop the computation in MATLAB; I have to kill the MATLAB task. There is also no error in the Command Window.
Many thanks in advance, Dominik

8 Comments

Are you sure you're using the correct device? Try
for i = 1:gpuDeviceCount
    gpuDevice(i) % selects device i (this resets it) and displays its properties
end
I'll admit that the behaviour of Windows GPUs in WDDM mode often defies explanation, but what you have here is a graphics card with timeouts disabled running a long-running kernel and causing your graphics to become suspended. Logically, your GPU is doing some graphics. If this were a laptop it would be easy to explain.
It would be helpful to know what hardware you have and how it is configured. Can you run nvidia-smi and tell me what it says?
I'm using an EVGA 1070 FTW. There is no display connected to the GTX 1070; the display is connected to an old AMD HD 6450. The output of nvidia-smi is as follows (with the computation running):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.13                 Driver Version: 388.13                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070   WDDM  | 00000000:29:00.0 Off |                  N/A |
| 24%   61C    P2    80W / 185W |    279MiB /  8192MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2180      C   ...iles\MATLAB\R2017b\bin\win64\MATLAB.exe N/A      |
+-----------------------------------------------------------------------------+
But as mentioned in my answer below, when I manually reduce the core clock by 100 MHz the freezes are gone. Comparing Windows 8 and 10, I noticed that the GPU boosts higher on Windows 10, which in my opinion causes the task to freeze. So it seems to be a stability issue caused by the factory overclock of the GPU when using it as a compute device.
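One way to compare boost behaviour between the two Windows versions is to log the clock and power draw while the job runs. The sketch below assumes nvidia-smi is on the system path (on some driver versions it lives under C:\Program Files\NVIDIA Corporation\NVSMI instead) and that it is run from a second MATLAB session, since the session doing the computation is busy. Because querying the GPU sometimes seemed to trigger the freeze, the polling itself may disturb the run.

```matlab
% Log SM clock, power draw and utilisation once per second for 10 samples.
% Assumption: nvidia-smi is reachable on the system path.
query = ['nvidia-smi --query-gpu=timestamp,clocks.sm,power.draw,utilization.gpu ' ...
         '--format=csv,noheader'];
nSamples = 10;                      % illustrative number of samples
for k = 1:nSamples
    [status, out] = system(query);  % run the query and capture its output
    if status ~= 0
        warning('nvidia-smi call failed: %s', out);
        break
    end
    fprintf('%s', out);             % one CSV line per sample
    pause(1);
end
```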
System specs:
AMD Ryzen 7 1700X (the reason for switching to Windows 10, since it does not officially support Windows 8.1)
32 GB DDR4 RAM
650 W be quiet! Straight Power E10 PSU
AMD HD 6450 for display output
EVGA 1070 FTW for computation
Win 8.1 Pro / Win 10 Education (Fall Creators Update, but it was the same with the previous version)
OK, things get even stranger. I rebooted the system after the successfully finished computation, and now this "temporary" fix does not work anymore. The task on the GTX 1070 crashes again, even with "debug mode" or a reduced core clock.
To be clear again: the computation gets stuck, and I can still use the computer since the display is connected to the AMD GPU. But the computation process is stuck forever at 100% with reduced power draw. This typically happens 1 to 30 minutes after starting the job. On Windows 8.1 it still runs without any problems.
I'm afraid you've gone beyond my area of expertise. It would appear you need to talk to NVIDIA, since this would appear to be a hardware configuration issue. (Or perhaps you'll find someone more useful than me on this forum of course...)
In answer to your original question, you can't debug an arrayfun kernel in MATLAB, because it's not MATLAB code that's executing but a GPU kernel compiled from that code. But you can try attaching a CUDA debugger or analysing behaviour in one of the CUDA tools, like the Visual Profiler. The profiler can tell you quite a lot about running kernels.
Using the Visual Profiler did indeed help. By looking at the timeline I was able to narrow down the problem to the end of the computation, shortly before or during the "gather()" command. Also, looking at the Resource Monitor in Windows, I noticed an increase in RAM errors, so I decided to set my RAM frequency to 2133 MHz. So far there have been no freezes over several days and different workloads. What leaves me a bit puzzled is that it worked, and still works, fine with the other setting on Windows 7, 8.1 and Linux.
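Since gpuArray operations can run asynchronously, a kernel that has not finished may show up as time spent in gather(). A minimal sketch for separating the two phases from within MATLAB follows; the squaring kernel and the array size are stand-ins, not the original random walk.

```matlab
% Time the kernel and the device-to-host copy separately.
dev = gpuDevice;                          % currently selected GPU
x0  = rand(1e6, 1, 'gpuArray');           % stand-in input data on the GPU

tic;
xEnd = arrayfun(@(x) x.^2 + 1, x0);       % stand-in for the real kernel
wait(dev);                                % block until the kernel has finished
fprintf('kernel finished after %.3f s\n', toc);

tic;
xHost = gather(xEnd);                     % device-to-host copy only
fprintf('gather finished after %.3f s\n', toc);
```

If the first message never appears, the kernel itself is stuck; if only the second one is missing, the transfer to host memory is the more likely suspect.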
I have to conclude that Windows 10 is a mystery :)

Answers (0)

Asked: 1 Nov 2017
Commented: 8 Nov 2017
