GPU Performance Analyzer

The GPU Performance Analyzer exposes GPU and CPU activities, events, and performance metrics in a chronological timeline plot that you can use to visualize, identify, and address performance bottlenecks in the generated CUDA® code.

Profiling Timeline

The Profiling Timeline tab shows the complete trace of the events that have a runtime higher than the threshold value. The timeline captures events such as:

Functions
Deep learning layers
Loops
Memory transfers between the CPU and GPU
GPU memory allocation and deallocation
Kernels

This image shows part of a profiling trace.

The profiling timeline showing the events from about 1 ms to 2.5 ms.

You can use the mouse wheel or an equivalent touchpad option to zoom into and out of the timeline. Alternatively, you can use the timeline summary at the top of the tab to zoom and navigate the timeline plot. Use the Key Bindings button to display the bindings for the GPU Performance Analyzer. Use the Legend button to display the meanings of the colors.

The tooltips on each event indicate the start time, end time, and duration of the selected event on the CPU and the GPU. They also indicate the time elapsed between the kernel launch on the CPU and the actual execution of the kernel on the GPU.
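
The timing values described above can be related to each other with a little arithmetic. The following is a minimal sketch in Python, not part of the GPU Performance Analyzer. The `KernelEvent` record, its field names, the kernel name, and the sample times are all hypothetical and are used only to show how a duration and a launch delay follow from start and end timestamps.

```python
from dataclasses import dataclass


@dataclass
class KernelEvent:
    """Hypothetical record of one kernel event; all times are in milliseconds."""
    name: str
    cpu_launch_start: float  # when the CPU issued the kernel launch
    gpu_start: float         # when the kernel began executing on the GPU
    gpu_end: float           # when the kernel finished on the GPU

    @property
    def gpu_duration(self) -> float:
        # Duration of the kernel on the GPU: end time minus start time.
        return self.gpu_end - self.gpu_start

    @property
    def launch_delay(self) -> float:
        # Time elapsed between the CPU-side launch and GPU-side execution.
        return self.gpu_start - self.cpu_launch_start


# Made-up sample values, for illustration only.
event = KernelEvent("my_kernel", cpu_launch_start=1.000, gpu_start=1.020, gpu_end=1.270)
print(f"{event.name}: duration {event.gpu_duration:.3f} ms, "
      f"launch delay {event.launch_delay:.3f} ms")
```

A large launch delay relative to the kernel duration suggests that the time is dominated by launch overhead rather than by the computation itself.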

By default, in the Filters section of the toolstrip, the Show single run button is selected and the associated drop-down menu is set to the last iteration of the generated code. To see profiling data for a previous iteration, select that run in the drop-down menu. To view profiling data for the entire application, including initialization and termination, select the Show entire profiling session button.

On the Functions and Loops rows, you can navigate between caller and callee functions and loops using the up and down arrows on the right side of the event bar.

Profiling Summary

The Profiling Summary pane provides an overview of the GPU and CPU activities. The bar charts change according to the zoom level of the profiling timeline. This image shows the Profiling Summary pane for the region selected on the timeline. It shows that the GPU utilization is 75%.

The profiling summary. It shows the GPU and CPU events such as the Kernel, CPU overhead, and GPU idle time from 0.977 ms to 2.323 ms.
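
To make a figure such as 75% GPU utilization concrete, here is a minimal Python sketch. It assumes that utilization means the fraction of the visible time window during which the GPU is running kernels, and that the kernel intervals do not overlap. That definition is an assumption made for illustration and is not taken from the tool. The function name `gpu_utilization` and the sample intervals are hypothetical.

```python
def gpu_utilization(kernel_intervals, window):
    """Fraction of `window` covered by kernel intervals.

    Assumes the intervals do not overlap each other. All times are in ms.
    """
    win_start, win_end = window
    if win_end <= win_start:
        raise ValueError("window must have positive length")
    busy = 0.0
    for start, end in kernel_intervals:
        # Clip each kernel to the window, mimicking a zoomed-in view.
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            busy += overlap
    return busy / (win_end - win_start)


# Made-up intervals: the first kernel starts before the window opens.
kernels = [(0.9, 1.3), (1.5, 1.95)]
utilization = gpu_utilization(kernels, window=(1.0, 2.0))
print(f"GPU utilization: {utilization:.0%}")
```

Because each interval is clipped to the window, changing the window changes the result, which mirrors how the summary depends on the zoom level of the timeline.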

Event Statistics

The Event Statistics pane shows additional information for the selected event. For example, suppose your project contains a kernel feature_matching_kernel1. If you select feature_matching_kernel1 in the Profiling Timeline pane, the Event Statistics pane displays more information about it:

The event statistics showing the start time, end time, duration, launch parameters, shared memory, and registers per thread.

Call Tree

The Call Tree pane lists the GPU events called from the CPU. Each event in the call tree shows the execution times as percentages of the caller function. You can use this metric to identify performance bottlenecks in generated code. You can also navigate to specific events on the profiling timeline by clicking on the corresponding events in the call tree.
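
The percentage metric can be illustrated with a small Python sketch. This is not the tool's implementation. The `Node` class, the function `percent_of_caller`, the event names, and the timings are all hypothetical. The sketch only shows how an event's time expressed as a percentage of its caller's time points to the most expensive callee.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical call-tree entry with its total time in milliseconds."""
    name: str
    time_ms: float
    children: List["Node"] = field(default_factory=list)


def percent_of_caller(node: Node, caller_time: Optional[float] = None,
                      depth: int = 0) -> Iterator[Tuple[int, str, float]]:
    """Yield (depth, name, percent of caller time) in depth-first order."""
    # The root has no caller, so it is reported as 100%.
    pct = 100.0 if caller_time is None else 100.0 * node.time_ms / caller_time
    yield depth, node.name, pct
    for child in node.children:
        yield from percent_of_caller(child, node.time_ms, depth + 1)


# Made-up tree: an entry-point function that launches a kernel and a copy.
tree = Node("entry_fcn", 4.0, [Node("kernel_a", 3.0), Node("memcpy_h2d", 0.5)])
rows = list(percent_of_caller(tree))
for depth, name, pct in rows:
    print(f"{'  ' * depth}{name}: {pct:.1f}%")
```

In this made-up tree, `kernel_a` accounts for most of its caller's time, so it would be the first candidate to inspect on the timeline.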

File

To open a GPU Profiling report, use the Open Report button. By default, the gpuPerformanceAnalyzer function generates the gpuProfiler.mldatx report file in the following location:

codegen/target/fcn_name/html

where target is:

mex for CUDA MEX
lib for CUDA libraries
dll for CUDA dynamic libraries

fcn_name is the name of the MATLAB® entry-point function.
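
As an illustration of the folder pattern above, the following Python sketch assembles the default report path from a target and an entry-point function name. The helper `default_report_path` and the function name `myFcn` are hypothetical. The sketch also assumes that the `codegen` folder is relative to the folder in which the code was generated.

```python
from pathlib import PurePosixPath

# Target folder names and the output types they correspond to.
TARGET_FOLDERS = {
    "mex": "CUDA MEX",
    "lib": "CUDA libraries",
    "dll": "CUDA dynamic libraries",
}


def default_report_path(target: str, fcn_name: str) -> PurePosixPath:
    """Build codegen/<target>/<fcn_name>/html/gpuProfiler.mldatx."""
    if target not in TARGET_FOLDERS:
        raise ValueError(f"unknown target {target!r}; expected one of {sorted(TARGET_FOLDERS)}")
    return PurePosixPath("codegen") / target / fcn_name / "html" / "gpuProfiler.mldatx"


# "myFcn" is a placeholder entry-point function name.
report = default_report_path("mex", "myFcn")
print(report)
```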

Note

Each time gpuPerformanceAnalyzer generates the same type of output for the same code, it removes the files from the previous build. If you want to preserve files from a previous build, before starting another build, copy them to a different location.
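
One way to preserve a report before the next build is to copy it under a time-stamped name. The following Python sketch shows the idea. The helper `backup_report` and the backup folder name are hypothetical, and the sketch copies only the report file, not the other files in the build folder.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_report(report_path, backup_dir):
    """Copy a report file into backup_dir under a time-stamped name.

    Returns the path of the copy.
    """
    report_path = Path(report_path)
    if not report_path.is_file():
        raise FileNotFoundError(report_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # A time stamp in the name keeps successive backups separate.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    destination = backup_dir / f"{report_path.stem}_{stamp}{report_path.suffix}"
    shutil.copy2(report_path, destination)  # copy2 also preserves file metadata
    return destination
```

For example, `backup_report("codegen/mex/myFcn/html/gpuProfiler.mldatx", "saved_reports")` would copy the report for a placeholder function `myFcn` into a `saved_reports` folder before you start another build.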

Filtering Options

You can use the Filters section of the toolstrip to filter the report.

Show entire profiling session — Use this option to view profiling results for the entire application, including initialization and termination.
Show single run — Use this option to view profiling results for a single iteration of the generated code.
Under Filter Events:
- Threshold (ms) — Skip events shorter than the given threshold.
- Memory Allocation/Free — Show GPU device memory allocation and deallocation related events on the CPU activities bar.
- Memory Transfer — Show host-to-device and device-to-host memory transfers.
- Kernel — Show CPU kernel launches and GPU kernel activities.
- Other Event — Show other GPU related events such as synchronization and waiting for GPU.
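
The combined effect of the threshold and the event-type filters can be pictured with a short Python sketch. It is not the tool's implementation. The `Event` record, the category labels, and the sample events are hypothetical stand-ins for the options listed above.

```python
from dataclasses import dataclass

ALL_CATEGORIES = frozenset({"kernel", "memory_transfer", "memory_alloc", "other"})


@dataclass(frozen=True)
class Event:
    """Hypothetical timeline event; times are in milliseconds."""
    name: str
    category: str
    start_ms: float
    end_ms: float


def filter_events(events, threshold_ms=0.0, shown=ALL_CATEGORIES):
    """Keep events in a shown category that are not shorter than the threshold."""
    return [
        e for e in events
        if e.category in shown and (e.end_ms - e.start_ms) >= threshold_ms
    ]


# Made-up events, for illustration only.
events = [
    Event("kernel_a", "kernel", 0.0, 0.5),
    Event("h2d_copy", "memory_transfer", 0.5, 0.501),
    Event("alloc", "memory_alloc", 0.6, 0.8),
    Event("sync", "other", 0.8, 1.2),
]
visible = filter_events(events, threshold_ms=0.01, shown={"kernel", "memory_transfer"})
print([e.name for e in visible])
```

In this example the copy is in a shown category but is shorter than the threshold, so only the kernel remains visible.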

Limitations

On the Functions and Loops rows, for short events, it may not be possible to navigate back to the calling function or loop by using the up and down arrows on the right side of the event bar. In such cases, use the call tree to navigate to the functions or loops.
GPU Performance Analyzer displays the row header even if the row does not contain an event.
At low zoom levels, GPU Performance Analyzer represents a densely populated area of short events separated by short distances as a single event. At higher levels of zoom, GPU Performance Analyzer displays the individual events. However, if the event duration is extremely short, it may not be possible to render this event on the timeline plot, even at high zoom levels.
GPU Performance Analyzer uses a single row to represent all the GPU events. In case of multiple CUDA streams, the GPU Activities row may contain overlapping events and the occupancy calculation in the Insights panel may be inaccurate. For example, deep learning libraries such as cuDNN may use multiple CUDA streams.
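
The last limitation can be illustrated with a short Python sketch that shows one way overlapping events can distort a busy-time figure. It does not reproduce how the tool computes occupancy. The function names and the two-stream example are hypothetical. The sketch compares a naive sum of event durations with the length of the union of the intervals.

```python
def naive_busy_time(intervals):
    """Sum of durations; counts overlapping time more than once."""
    return sum(end - start for start, end in intervals)


def merged_busy_time(intervals):
    """Length of the union of the intervals; overlapping time counts once."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Made-up events from two streams that overlap between 2 ms and 4 ms.
stream_events = [(0.0, 4.0), (2.0, 6.0)]
window_ms = 8.0
print(f"naive:  {naive_busy_time(stream_events) / window_ms:.0%}")
print(f"merged: {merged_busy_time(stream_events) / window_ms:.0%}")
```

With these made-up values the naive sum reports a fully busy GPU, while the merged intervals cover only three quarters of the window.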
