Memory Bottleneck Analysis
Data Alignment
Condition
MATLAB® stores data in column-major order, but an algorithm may be written for an optimized row-major implementation. In the generated code, if the fastest-changing dimension is not accessed in the innermost loop, memory accesses are not coalesced. Often, simply transposing the input matrices fixes this problem.
Action
Try transposing the data.
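As a sketch of this idea, the hypothetical function below (the name and operation are illustrative, not from the original document) transposes its input once at the boundary so that each signal becomes a column and the generated innermost loop walks contiguous, coalesced memory:

```matlab
function out = processSignals(in)  %#codegen
% Hypothetical example: 'in' holds one signal per row (row-major access
% pattern). Transposing makes each signal a contiguous column in memory.
coder.gpu.kernelfun;
x = in.';                % columns are now contiguous in memory
out = (cumsum(x, 1)).';  % operate down the columns, transpose back
end
```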
Small Data Sizes
Condition
If your problem or data size is too small, the overhead of moving data to the GPU (even if only at the I/O boundary) can offset the performance gain of running on the GPU.
Action
Try the algorithm with larger data sizes.
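A quick way to check is to time the generated MEX function across a range of sizes and see where the transfer overhead is amortized. In this sketch, myAlgo_mex is an assumed name for a GPU Coder generated MEX function:

```matlab
% Hypothetical benchmark: time the generated MEX at increasing sizes.
for n = [256 1024 4096 16384]
    A = rand(n, 'single');
    t = timeit(@() myAlgo_mex(A));
    fprintf('n = %5d: %.4f s\n', n, t);
end
```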
Too Many cudaMemcpys
Condition
If you use only coder.gpu.kernel, then everything outside the loop goes to the CPU. To keep most of the code on the GPU, using both pragmas is recommended. Also, the presence of unsupported functions, or of a function or statement that cannot run on the GPU, causes more cudaMemcpy calls to be generated.
Action
Use coder.gpu.kernelfun in addition to coder.gpu.kernel.
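The two pragmas combine as in this illustrative function (the function name and computation are assumptions for the example): coder.gpu.kernelfun maps the surrounding code to the GPU where possible, while coder.gpu.kernel pins the critical loop to its own CUDA kernel.

```matlab
function y = scaleAdd(a, b)  %#codegen
coder.gpu.kernelfun;              % keep surrounding code on the GPU
y = zeros(size(a), 'like', a);
coder.gpu.kernel;                 % map the following loop to a kernel
for i = 1:numel(a)
    y(i) = 2*a(i) + b(i);
end
end
```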
Constant Inputs
Recommendation
If certain inputs of your entry-point function are constant, wrap them by using the coder.Constant object. Use of coder.Constant indicates that these inputs are constant during code generation. Without it, GPU Coder™ considers these inputs to be variables and therefore treats matrices sized by these variables as variable-dimension matrices. GPU Coder does not create good kernels from variable-dimension matrices because dynamic sizing of kernels and dynamic cudaMemcpy function calls are currently not supported.
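For example, a hypothetical entry-point function myFilter with a constant filter length N could be compiled as follows (the function name, input sizes, and constant value are assumptions for the example):

```matlab
% The second input is fixed at code generation time by coder.Constant,
% so matrices sized by it become fixed-size in the generated CUDA code.
cfg = coder.gpuConfig('mex');
codegen -config cfg myFilter -args {rand(1,4096,'single'), coder.Constant(64)}
```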
Stack Memory Usage
Recommendation
Using large amounts of stack memory inside kernels can reduce the performance of the generated code. In such cases, consider rewriting the algorithm differently or breaking it into smaller computations to reduce stack memory usage and improve performance.
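One common rewrite, sketched below with an illustrative function (not from the original document), is to replace a large per-iteration temporary array with a scalar accumulator so that each kernel thread needs only a few registers of local storage:

```matlab
function s = rowNorms(A)  %#codegen
% Hypothetical example: compute the 2-norm of each row without
% materializing a large temporary array inside the loop body.
coder.gpu.kernelfun;
n = size(A, 1);
s = zeros(n, 1, 'like', A);
coder.gpu.kernel;
for i = 1:n
    acc = cast(0, 'like', A);   % scalar accumulator, minimal stack use
    for j = 1:size(A, 2)
        acc = acc + A(i, j)^2;
    end
    s(i) = sqrt(acc);
end
end
```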