Memory Bottleneck Analysis
Data Alignment
Condition
MATLAB® stores data in column-major order, but an algorithm may be written for an optimized row-major implementation. In the generated code, if the fastest-changing dimension is not accessed in the innermost loop, memory accesses are not coalesced. Often, simply transposing the input matrices fixes this problem.
Action
Try transposing the data.
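As a sketch of this idea, the hypothetical function below (the name and operation are illustrative, not from the original document) transposes its input once at the boundary so that each signal becomes a column and the generated innermost loop walks contiguous, coalesced memory:

```matlab
function out = processSignals(in)  %#codegen
% Hypothetical example: 'in' holds one signal per row (row-major access
% pattern). Transposing makes each signal a contiguous column in memory.
coder.gpu.kernelfun;
x = in.';                % columns are now contiguous in memory
out = (cumsum(x, 1)).';  % operate down the columns, transpose back
end
```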
Small Data Sizes
Condition
If your problem or data size is too small, the overhead of moving data to the GPU (even if only at the I/O boundary) can offset the performance gain of running on the GPU.
Action
Try the algorithm with larger data sizes.
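A quick way to check is to time the generated MEX function across a range of sizes and see where the transfer overhead is amortized. In this sketch, myAlgo_mex is an assumed name for a GPU Coder generated MEX function:

```matlab
% Hypothetical benchmark: time the generated MEX at increasing sizes.
for n = [256 1024 4096 16384]
    A = rand(n, 'single');
    t = timeit(@() myAlgo_mex(A));
    fprintf('n = %5d: %.4f s\n', n, t);
end
```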
Too Many cudaMemcpys
Condition
If you use only coder.gpu.kernel, then everything outside the loop goes to the CPU. To keep most of the code on the GPU, using both pragmas is recommended. Also, the presence of unsupported functions, or of a function or statement that cannot run on the GPU, causes more cudaMemcpy calls to be generated.
Action
Use coder.gpu.kernelfun in addition to coder.gpu.kernel.
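The two pragmas combine as in this illustrative function (the function name and computation are assumptions for the example): coder.gpu.kernelfun maps the surrounding code to the GPU where possible, while coder.gpu.kernel pins the critical loop to its own CUDA kernel.

```matlab
function y = scaleAdd(a, b)  %#codegen
coder.gpu.kernelfun;              % keep surrounding code on the GPU
y = zeros(size(a), 'like', a);
coder.gpu.kernel;                 % map the following loop to a kernel
for i = 1:numel(a)
    y(i) = 2*a(i) + b(i);
end
end
```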
Constant Inputs
Recommendation
If certain inputs of your entry-point function are constant, wrap them by using the coder.Constant object. Use of coder.Constant indicates that these inputs are constant during code generation. Without it, GPU Coder™ considers these inputs to be variables and therefore treats matrices sized by these variables as variable-dimension matrices. GPU Coder does not create good kernels from variable-dimension matrices because dynamic sizing of kernels and dynamic cudaMemcpy function calls are currently not supported.
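For example, a hypothetical entry-point function myFilter with a constant filter length N could be compiled as follows (the function name, input sizes, and constant value are assumptions for the example):

```matlab
% The second input is fixed at code generation time by coder.Constant,
% so matrices sized by it become fixed-size in the generated CUDA code.
cfg = coder.gpuConfig('mex');
codegen -config cfg myFilter -args {rand(1,4096,'single'), coder.Constant(64)}
```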
Stack Memory Usage
Recommendation
Using large amounts of stack memory inside kernels can reduce the performance of the generated code. In such cases, consider rewriting the algorithm differently or breaking it into smaller computations to reduce stack memory usage and improve performance.
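One common rewrite, sketched below with an illustrative function (not from the original document), is to replace a large per-iteration temporary array with a scalar accumulator so that each kernel thread needs only a few registers of local storage:

```matlab
function s = rowNorms(A)  %#codegen
% Hypothetical example: compute the 2-norm of each row without
% materializing a large temporary array inside the loop body.
coder.gpu.kernelfun;
n = size(A, 1);
s = zeros(n, 1, 'like', A);
coder.gpu.kernel;
for i = 1:n
    acc = cast(0, 'like', A);   % scalar accumulator, minimal stack use
    for j = 1:size(A, 2)
        acc = acc + A(i, j)^2;
    end
    s(i) = sqrt(acc);
end
end
```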