# Batched matrix multiplicaion with CUDA

4 views (last 30 days)
Peter Egli on 28 Apr 2020
Edited: Erik Meade on 5 May 2020
Hi,
I saw that Matlab R2020a implements new features for the GPU coder, especially the gpucoder.stridedMatrixMultiply. However, I don't understand how the batch is defined there. If you take a look at the generated CUDA code that is shown in the example, it states 1 for the batch size (cf. NVIDIA documentation). Also the variables A,B & C are expected to be 2D and of the dimensionality of the matrices to be processes.
How do I use the function correctly? I have a 3D vector in Matlab which holdes many small matrices, so A(:,:,1), A(:,:,2) and so on. The same applies for B. I would like to process them all at the same time using CUDA. I would like to calculate A(:,:,1)*B(:,:,1) etc using a CUDA function. How can I achieve that with the new GPU coder functionality? How do I interface that from Matlab?
Peter

Erik Meade on 5 May 2020
Edited: Erik Meade on 5 May 2020
Hi Peter,
gpucoder.stridedMatrixMultiply works exactly as you want. You can directly pass A and B to gpucoder.stridedMatrixMultiply and it will compute them in the way you want.
A small example, say you have a function called stridedMultiply:
function c = stridedMultiply(a, b)
c = gpucoder.stridedMatrixMultiply(a, b);
end
Then we can generate code for it and verify that the answer is correct with the following code:
% 3D-vector inputs
a = rand(5,4,100);
b = rand(4,5,100);
% Generate Code
codegen -config coder.gpuConfig('mex') -args {a, b} stridedMultiply
% Verify correctness
c_mex = stridedMultiply_mex(a, b);
c = zeros(size(c_mex));
for i = 1:100
c(:,:,i) = a(:,:,i) * b(:,:,i);
end
% Check MATLAB answer vs. stridedMatrixMultiply generated code
tolerance = 1e-8;
assert(all(abs(c(:) - c_mex(:)) < tolerance));
If we look at the generated code, we will see that the batch size has been properly set to 100:
cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 5,
5, 4, (double *)gpu_alpha1, (double *)&(*gpu_a)[0], 5, 20, (double *)
&(*gpu_b)[0], 4, 20, (double *)gpu_beta1, (double *)&(*gpu_c)[0], 5, 25, 100);
With regards to the example in the doc page you cited, since the input matrices in the example are both 2D, there is only 1 batch to be computed, therefore the parameter is set to 1. I understand your confusion however, since gpucoder.stridedMatrixMultiply is mostly intended to be used with 3D inputs. To clarify, gpucoder.stridedMatrixMultiply multiplies along the first two dimensions only. I understand how that example can be confusing however, and we will look into updating that example.

### Categories

Find more on Get Started with GPU Coder in Help Center and File Exchange

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!