Analyze Performance of Code Generated for Deep Learning Networks

This example shows how to analyze the performance of CUDA® code generated for deep learning networks by using the gpuPerformanceAnalyzer function.

The gpuPerformanceAnalyzer function generates code and collects metrics on CPU and GPU activities in the generated code. The function generates a report that contains a chronological timeline plot that you can use to visualize and identify performance bottlenecks in the generated CUDA code. Additionally, the report contains a dashboard that you can use to analyze the deep learning network's performance.

This example generates the performance analysis report for a function that uses a deep learning variational autoencoder (VAE) to generate digit images. For more information, see Generate Digit Images on NVIDIA GPU Using Variational Autoencoder.

Third-Party Prerequisites

CUDA-enabled NVIDIA® GPU.
NVIDIA CUDA toolkit and driver. For information on the supported versions of the compilers and libraries, see Third-Party Hardware.
Environment variables for the compilers and libraries. For setting up the environment variables, see Setting Up the Prerequisite Products.

Verify GPU Environment

To verify that the compilers and libraries for this example are set up correctly, use the coder.checkGpuInstall function.

envCfg = coder.gpuEnvConfig("host");
envCfg.DeepLibTarget = "cudnn";
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

When the Quiet property of the coder.gpuEnvConfig object is set to true, the coder.checkGpuInstall function returns only warning or error messages.
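If you want to act on the outcome programmatically, one option is to capture the structure that coder.checkGpuInstall returns. The following is a minimal sketch, not part of the original example. It reads the field names at run time instead of assuming a fixed set, and it treats any scalar false value as a check that either failed or was not requested.

```matlab
% Sketch: capture the check results instead of relying only on messages.
envCfg = coder.gpuEnvConfig("host");
envCfg.DeepLibTarget = "cudnn";
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;

results = coder.checkGpuInstall(envCfg);   % structure of check outcomes

% Field names are discovered at run time, so this loop does not assume
% which checks the installed release reports.
checkNames = fieldnames(results);
for k = 1:numel(checkNames)
    value = results.(checkNames{k});
    if (islogical(value) || isnumeric(value)) && isscalar(value) && ~value
        fprintf("Check did not pass or was not run: %s\n", checkNames{k});
    end
end
```

Checks that were not enabled in envCfg also report false, so a listed name is a prompt to investigate and not necessarily an error.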

Analyze the Pretrained Variational Autoencoder Network

Autoencoders have two parts: the encoder and the decoder. The encoder takes an image input and outputs a compressed representation. The decoder takes this compressed representation, decodes it, and recreates the original image.

VAEs differ from regular autoencoders in that they do not use the encoding-decoding process to reconstruct an input. Instead, they impose a probability distribution on the latent space, and learn the distribution so that the distribution of outputs from the decoder matches that of the observed data. Then, they sample from this distribution to generate new data.

This example uses the decoder network trained in the Train Variational Autoencoder (VAE) to Generate Images (Deep Learning Toolbox) example. The encoder outputs a compressed representation that is a vector of size latent_dim, and the decoder takes a vector of the same size as input. In this example, latent_dim is 20. The entry-point function receives this value as its latentDim argument.

The VAE Encoder and Decoder layers

Examine the Entry-Point Function

The generateVAE entry-point function loads the dlnetwork object from the trainedDecoderVAENet MAT-file into a persistent variable and reuses the persistent object during subsequent prediction calls. It initializes a dlarray object that contains 25 randomly generated encodings, passes them through the decoder network, and extracts the numeric data of the generated image from the deep learning array object.

type("generateVAE.m")

function generatedImage =  generateVAE(decoderNetFileName,latentDim,Environment) %#codegen
% Copyright 2020-2021 The MathWorks, Inc.

persistent decoderNet;
if isempty(decoderNet)
    decoderNet = coder.loadDeepLearningNetwork(decoderNetFileName);
end

% Generate random noise
randomNoise = dlarray(randn(1,1,latentDim,25,'single'),'SSCB');

if coder.target('MATLAB') && strcmp(Environment,'gpu')
    randomNoise = gpuArray(randomNoise);
end

% Generate new image from noise
generatedImage = sigmoid(predict(decoderNet,randomNoise));

% Extract numeric data from dlarray
generatedImage = extractdata(generatedImage);

end
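Before generating code, it can help to run the entry-point function directly in MATLAB to confirm that the MAT-file loads and the output has the expected shape. The sketch below assumes that trainedDecoderVAENet.mat is on the path and that Deep Learning Toolbox is installed. The "cpu" value for Environment and the figure title are illustrative choices, not part of the original example.

```matlab
% Sketch: run generateVAE in MATLAB before generating code.
latentDim = 20;
matfile = "trainedDecoderVAENet.mat";

% Any value other than "gpu" keeps the data on the host. Passing "gpu"
% moves the input to a gpuArray, which needs Parallel Computing Toolbox.
generatedImage = generateVAE(matfile, latentDim, "cpu");

% The last dimension holds the 25 generated images.
disp(size(generatedImage))

% Show all generated images in one tiled figure.
figure
imshow(imtile(generatedImage))
title("Generated digit images (MATLAB execution)")
```

Because the function draws the encodings from randn, each call produces a different set of images.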

Generate GPU Performance Analyzer Report

To analyze the performance of the generated code, use the gpuPerformanceAnalyzer function. First, create a GPU code configuration object with a MEX build type by passing "mex" to the coder.gpuConfig function.

cfg = coder.gpuConfig("mex");

Set the target language to C++ and enable the GPU memory manager. Then use the coder.DeepLearningConfig function to create a cuDNN deep learning configuration object and assign it to the DeepLearningConfig property of the GPU code configuration object.

cfg.TargetLang = "C++";
cfg.GpuConfig.EnableMemoryManager = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig("cudnn");

Run gpuPerformanceAnalyzer with NumIterations set to 2, which is the default iteration count. The function collects performance data for both iterations, and the GPU Performance Analyzer report opens automatically.

latentDim = 20;
Env = "gpu";
matfile = "trainedDecoderVAENet.mat";
inputs  = {coder.Constant(matfile), coder.Constant(latentDim), coder.Constant(Env)};
designFileName = "generateVAE";

gpuPerformanceAnalyzer(designFileName, inputs, ...
    "Config", cfg, "NumIterations", 2);

### Starting GPU code generation
Code generation successful: View report

### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data

GPU Performance Analyzer report for generateVAE

For this example, the values in the app depend on your hardware. The profiling in this example was performed using MATLAB® R2025a on a machine with a 13th gen Intel® Core™ i9-13900K CPU and an NVIDIA Titan V GPU.
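Because the reported values depend on hardware, a coarse wall-clock measurement on your own machine can serve as a sanity check alongside the report. The following sketch is an addition to the original example. It builds a MEX function with the codegen command by using the same configuration and inputs, and then times it with timeit. It measures end-to-end call time only and does not reproduce the per-event detail of the GPU Performance Analyzer.

```matlab
% Sketch: build the MEX function separately and time it end to end.
cfg = coder.gpuConfig("mex");
cfg.TargetLang = "C++";
cfg.GpuConfig.EnableMemoryManager = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig("cudnn");

latentDim = 20;
Env = "gpu";
matfile = "trainedDecoderVAENet.mat";
inputs = {coder.Constant(matfile), coder.Constant(latentDim), coder.Constant(Env)};

% Generates generateVAE_mex in the current folder.
codegen -config cfg -args inputs generateVAE

% Warm-up call: the first call loads the network into the persistent variable.
generateVAE_mex(matfile, latentDim, Env);

% timeit runs the call repeatedly and returns a typical run time in seconds.
t = timeit(@() generateVAE_mex(matfile, latentDim, Env));
fprintf("Typical MEX call time: %.3f ms\n", 1e3*t);
```

The constant inputs are still passed at run time because the MEX function checks them against the values used during code generation.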

The Profiling Timeline shows the trace of the events on the CPU and GPU. In this example, the timeline shows that most of the execution time is due to events in the GPU Activities row. Additionally, the Profiling Summary shows that the GPU utilization is 85%. Use the mouse wheel or a touchpad to zoom in on the events in the Loops row. The Loops row contains a loop named activations_whileloop_0, which corresponds to the activations for the network.

Profiling Timeline zoomed in to show activations_whileloop_0 in the Loops row

Analyze Network Performance in the Deep Learning Dashboard

Next, analyze the runtime statistics for the network using the Deep Learning Dashboard.

Open the Deep Learning Dashboard

In the toolstrip, click Show Predict Functions to see the deep learning inference functions in the Profiling Timeline. The Show Predict Functions button is enabled because the entry-point function uses a deep learning network.

GPU Performance Analyzer toolstrip showing the Show Predict Functions button

The GPU Performance Analyzer marks deep learning events with a network icon. In this run, there is one predict function, dlnetwork_predict.

Profiling Timeline pane showing the dlnetwork_predict function

Select dlnetwork_predict. In the toolstrip, click Open Deep Learning Dashboard. Alternatively, in the toolbar that appears above dlnetwork_predict, click the Network button. The dashboard shows the runtime statistics from dlnetwork_predict.

Deep Learning Dashboard

Examine Execution Overview

The Execution Overview section shows the time the network took to execute and estimates the network efficiency as a percent.

In this example, the Execution Overview shows that the network took 0.37 ms of the 0.56 ms needed to run the entire entry-point function. The network efficiency, which the dashboard estimates from the runtime statistics, is over 99%. This value indicates that the generated code for the network is well optimized and uses the GPU efficiently.
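The two times reported in the Execution Overview also tell you how much of the entry-point function runs outside the network. The short calculation below uses the 0.37 ms and 0.56 ms values from this run. It is a worked example only, and the result is a different quantity from the Network Efficiency estimate that the dashboard reports.

```matlab
% Worked arithmetic using the values reported for this run.
networkTime = 0.37;   % ms spent in the network
totalTime   = 0.56;   % ms for the entire entry-point function

networkShare = 100 * networkTime / totalTime;   % percent of total time
otherTime    = totalTime - networkTime;         % ms spent outside the network

fprintf("Network share of entry-point time: %.1f%%\n", networkShare);
fprintf("Time outside the network: %.2f ms\n", otherTime);
```

For these values, the network accounts for about 66% of the entry-point time, and the remaining 0.19 ms is spent outside the network.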

Analyze Network Runtime

Use the Runtime Breakdown section to view the CPU and GPU activities during the network execution. This section displays the cumulative CPU and GPU activities side by side over the network execution time. In the Runtime Breakdown, in the GPU row, point to the green rectangle. A tooltip shows that the GPU spent approximately 92% of the network execution time on computation or kernel execution.

Runtime Breakdown showing Compute and Kernels needed 91.6% of the execution time

Analyze Runtime by Operation Type

To identify the layer types that take the longest to execute, use the Runtime Breakdown by Operation Type pane. You can view either the CPU or GPU execution times. In the Runtime Breakdown by Operation Type section, in the Select a Device list, select GPU.

The tree map groups the layers based on their type and displays layers that take longer to execute with more area. In this example, the tree map shows transposed 2-D convolution layers as the layer type that takes the longest to execute on the GPU, taking more than 95% of the execution time.

Runtime Breakdown by Operation Type showing Transposed Convolution 2D Layers needed 95.6% of the execution time

Layers that execute for very short periods do not have labels in the Runtime Breakdown by Operation Type pane. To determine the types of these layers, select them in the Runtime Breakdown by Operation Type pane. In this example, after selecting one of these layers, the Operation Details table shows that the layer is the Rectified Linear Unit (ReLU) layer named relu3.
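To relate the layer names in the Operation Details table to the network itself, you can list the layers of the decoder in MATLAB. The following sketch is an addition to the original example. It assumes that trainedDecoderVAENet.mat is on the path and prints the name and class of each layer, so that a name such as relu3 can be matched to its layer type.

```matlab
% Sketch: list the name and class of each layer in the decoder network.
net = coder.loadDeepLearningNetwork("trainedDecoderVAENet.mat");
layers = net.Layers;

fprintf("%-20s %s\n", "Name", "Class");
for k = 1:numel(layers)
    % class() returns the full class name, which identifies the layer type.
    fprintf("%-20s %s\n", layers(k).Name, class(layers(k)));
end
```

The class names printed here are MATLAB class names, so they might not match the operation-type labels in the tree map word for word.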

Runtime Breakdown by Operation Type and Operation Details panes with the relu3 layer selected