
How Shared GPU Memory Manager Improves Performance of Generated MEX

You can use the GPU memory manager for efficient memory allocation and management and to improve run-time performance. The GPU memory manager creates a collection of large GPU memory pools and manages the allocation and deallocation of memory blocks within these pools. By creating large memory pools, the memory manager reduces the number of calls to the CUDA® memory APIs, improving run-time performance. See GPU Memory Allocation and Minimization.

In particular, when you generate CUDA MEX code, GPU Coder™ creates a single universal memory manager that handles memory management for all running CUDA MEX functions, further improving the performance of those MEX functions. To view the properties of the shared MEX memory manager and manage its allocations, create a gpucoder.MemoryManager object by using the cudaMemoryManager function. To free GPU memory that is not in use, call the freeUnusedMemory function. This topic uses an example to explain how the shared memory manager works.
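For example, you can inspect the shared memory manager from the MATLAB command line. The following lines are a minimal sketch based on the functions named above; the properties that the object displays can vary with your GPU Coder release.

% Create a gpucoder.MemoryManager object for the shared MEX memory manager
% and display its properties.
memManager = cudaMemoryManager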

Obtain Fog Rectification Example Files

This example uses the design file fog_rectification.m and the image file foggyInput.png from the Fog Rectification example. To create a folder that contains these files, run this command.

openExample('gpucoder/FogRectificationGPUExample')

Generate and Profile CUDA MEX with GPU Memory Manager Disabled

Create a GPU code configuration object for generating a MEX function. To generate code that does not use the memory manager, set the EnableMemoryManager property to false.

cfg = coder.gpuConfig("mex");
cfg.GpuConfig.EnableMemoryManager = false;

Generate and profile CUDA MEX code for the design file fog_rectification.m by using the gpuPerformanceAnalyzer function. Specify the input type by using an example value, inputImage, a variable that contains the image data loaded from the foggyInput.png file. Run the GPU Performance Analyzer with the default iteration count of 2.

inputImage = imread("foggyInput.png");
gpuPerformanceAnalyzer("fog_rectification",{inputImage},Config=cfg);

Screenshot of the GPU Performance Analyzer showing the profiling data for the generated MEX with the memory manager disabled.

In the Performance Analyzer report, observe that a significant portion of the execution time is spent on memory allocation and deallocation.

Generate and Profile CUDA MEX with GPU Memory Manager Enabled

Enable the GPU memory manager. Then, generate and profile the CUDA MEX function again.

cfg.GpuConfig.EnableMemoryManager = true;
gpuPerformanceAnalyzer("fog_rectification",{inputImage},Config=cfg);

Screenshot of the GPU Performance Analyzer showing the profiling data for the generated MEX with the memory manager enabled.

Observe that most memory allocation and deallocation events have disappeared from the profiling report, so the generated MEX function now has improved run-time performance. The remaining memory allocation and deallocation activity originates from a call to the Thrust library, which cannot benefit from the GPU memory manager.

Shared Memory Manager Allocations and Deallocations

To see when the shared GPU memory manager allocates large GPU memory pools, select the first run of fog_rectification_mex in the profiling report.

Screenshot of the GPU Performance Analyzer showing the profiling data for the first iteration of the generated MEX with the memory manager enabled.

Observe that, compared to the second run, the first run has three extra GPU memory allocation events in the timeline graph. These events correspond to the allocation of three memory pools by the shared GPU memory manager. Subsequent runs of fog_rectification_mex reuse the memory pools allocated in the first run, thereby improving the run-time performance.
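If you want to observe this effect outside the Performance Analyzer, one approach is to generate the MEX function with codegen and time consecutive calls. This is a sketch, not part of the original example; it assumes a fresh MATLAB session in which the shared memory pools have not yet been allocated and that fog_rectification returns the defogged image as a single output.

codegen -config cfg fog_rectification -args {inputImage}

% The first call may include the allocation of the shared GPU memory pools.
tic; out = fog_rectification_mex(inputImage); tFirst = toc;

% Subsequent calls reuse the pools, so they typically run faster.
tic; out = fog_rectification_mex(inputImage); tSecond = toc;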

For MEX code generation, the memory pools allocated for fog_rectification_mex are preserved after fog_rectification_mex finishes its first execution. This preservation allows CUDA MEX functions that run later to reuse those memory pools. However, for standalone CUDA code generation, the memory pools are private to the target (executable or static/dynamic library) and are deallocated when the standalone target is unloaded from memory.
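Because the pools persist after the MEX function returns, you can release the GPU memory they hold when you no longer need it. The following lines are a minimal sketch that uses the cudaMemoryManager and freeUnusedMemory functions introduced earlier.

% Release pooled GPU memory that the shared memory manager holds but that
% no running CUDA MEX function is currently using.
memManager = cudaMemoryManager;
freeUnusedMemory(memManager);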
