# gpucoder.stridedMatrixMultiply

Optimized GPU implementation of strided and batched matrix multiply operations

## Syntax

``D = gpucoder.stridedMatrixMultiply(A,B)``
``___ = gpucoder.stridedMatrixMultiply(___,Name,Value)``

## Description

`D = gpucoder.stridedMatrixMultiply(A,B)` performs strided matrix-matrix multiplication of a batch of matrices. The input matrices `A` and `B` for each instance of the batch are located at fixed address offsets from their addresses in the previous instance. The `gpucoder.stridedMatrixMultiply` function performs matrix-matrix multiplication of the form:

$D=\alpha AB$

where $\alpha$ is a scalar multiplication factor, and `A`, `B`, and `D` are matrices with dimensions `m`-by-`k`, `k`-by-`n`, and `m`-by-`n`, respectively. You can optionally transpose or Hermitian-conjugate `A` and `B`. By default, $\alpha$ is set to one and the matrices are not transposed. To specify a different scalar multiplication factor and perform transpose operations on the input matrices, use the `Name,Value` pair arguments.

All the batches passed to the `gpucoder.stridedMatrixMultiply` function must be uniform. That is, all instances must have the same dimensions `m`, `n`, and `k`.

`___ = gpucoder.stridedMatrixMultiply(___,Name,Value)` performs the strided batched matrix multiply operation by using the options specified by one or more `Name,Value` pair arguments.
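To make the batched form of $D=\alpha AB$ concrete, the following is a minimal host-side reference sketch in C++. It is not the GPU Coder implementation. The function name `stridedMatMulRef` is hypothetical, and the sketch assumes column-major storage, no transposes, and tightly packed instances, so that each instance starts `m*k`, `k*n`, or `m*n` elements after the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: D_i = alpha * A_i * B_i for every instance i.
// Assumes column-major storage and tightly packed batch instances.
std::vector<double> stridedMatMulRef(const std::vector<double>& A,
                                     const std::vector<double>& B,
                                     std::size_t m, std::size_t n,
                                     std::size_t k, std::size_t batch,
                                     double alpha = 1.0) {
    // Fixed element offsets between consecutive instances.
    const std::size_t strideA = m * k;
    const std::size_t strideB = k * n;
    const std::size_t strideD = m * n;
    assert(A.size() == strideA * batch);
    assert(B.size() == strideB * batch);

    std::vector<double> D(strideD * batch, 0.0);
    for (std::size_t i = 0; i < batch; ++i) {
        const double* Ai = A.data() + i * strideA;
        const double* Bi = B.data() + i * strideB;
        double* Di = D.data() + i * strideD;
        for (std::size_t col = 0; col < n; ++col) {
            for (std::size_t row = 0; row < m; ++row) {
                double acc = 0.0;
                for (std::size_t p = 0; p < k; ++p) {
                    // Element (row,p) of A_i times element (p,col) of B_i.
                    acc += Ai[row + p * m] * Bi[p + col * k];
                }
                Di[row + col * m] = alpha * acc;
            }
        }
    }
    return D;
}
```

Because every instance shares the same `m`, `n`, and `k`, a single stride per array is enough to locate any instance, which is why the batch must be uniform.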

## Examples

Perform a simple batched matrix-matrix multiplication and use the `gpucoder.stridedMatrixMultiply` function to generate CUDA® code that calls the appropriate `cublas<t>gemmStridedBatched` APIs.

In one file, write an entry-point function `myStridedMatMul` that accepts matrix inputs `A` and `B` and a scalar multiplication factor `alpha`. Because the input matrices are not transposed, use the `'nn'` option.

```
function [D] = myStridedMatMul(A,B,alpha)

[D] = gpucoder.stridedMatrixMultiply(A,B,'alpha',alpha, ...
    'transpose','nn');

end
```

To create a type for a matrix of doubles for use in code generation, use the `coder.newtype` function.

```
A = coder.newtype('double',[5 4 100],[0 0]);
B = coder.newtype('double',[4 5 100],[0 0]);
alpha = 0.3;
inputs = {A,B,alpha};
```

To generate a CUDA library, use the `codegen` function.

```
cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myStridedMatMul
```

The generated CUDA code contains kernels `myStridedMatMul_kernelNN` for initializing the input and output matrices. The code also contains the `cublasDgemmStridedBatched` API calls to the cuBLAS library. The following code is a snippet of the generated code.

```
//
// File: myStridedMatMul.cu
//
...
void myStridedMatMul(const double A_data[], const int A_size[3],
                     const double B_data[], const int B_size[3],
                     double alpha, double D_data[], int D_size[3])
{
  double alpha1;
  ...

  beta1 = 0.0;
  cudaMemcpy(gpu_alpha1, &alpha1, 8ULL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_A_data, (void *)A_data, A_size[0] * A_size[1] * A_size[2] *
             sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B_data, (void *)B_data, B_size[0] * B_size[1] * B_size[2] *
             sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_beta1, &beta1, 8ULL, cudaMemcpyHostToDevice);
  if (D_data_dirtyOnCpu) {
    cudaMemcpy(gpu_D_data, &D_data[0], 25 * D_size[2] * sizeof(double),
               cudaMemcpyHostToDevice);
  }

  if (batchDimsA[2] >= batchDimsB[2]) {
    if (batchDimsA[2] >= 1) {
      ntilecols = batchDimsA[2];
    } else {
      ntilecols = 1;
    }
  } else {
    ntilecols = batchDimsB[2];
  }

  cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
    5, 5, 4, (double *)gpu_alpha1, (double *)&gpu_A_data[0], 5, strideA,
    (double *)&gpu_B_data[0], 4, strideB, (double *)gpu_beta1,
    (double *)&gpu_D_data[0], 5, 25, ntilecols);
  cudaMemcpy(&D_data[0], gpu_D_data, 25 * D_size[2] * sizeof(double),
             cudaMemcpyDeviceToHost);
  ...
}
```
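The numeric arguments in the `cublasDgemmStridedBatched` call follow from the input sizes. The sketch below shows one way to derive them for the non-transposed case. The struct `StridedGemmArgs` and the function `makeArgsNN` are hypothetical names, not part of the generated code, and the sketch assumes that both inputs hold the same number of tightly packed instances, so the input strides are simply `m*k` and `k*n`. The generated code computes `strideA`, `strideB`, and the batch count at run time and may handle other cases.

```cpp
#include <cassert>

// Hypothetical container for the size-related arguments of a
// strided batched GEMM with column-major, non-transposed inputs.
struct StridedGemmArgs {
    int m, n, k;                          // A_i is m-by-k, B_i is k-by-n
    int lda, ldb, ldd;                    // leading dimensions of A, B, D
    long long strideA, strideB, strideD;  // elements between instances
    int batchCount;                       // number of instances
};

// sizeA = [m k batch] and sizeB = [k n batch], laid out like the
// A_size and B_size arrays in the generated function signature.
StridedGemmArgs makeArgsNN(const int sizeA[3], const int sizeB[3]) {
    assert(sizeA[1] == sizeB[0]);  // inner dimensions must agree
    assert(sizeA[2] == sizeB[2]);  // assumption: equal batch counts

    StridedGemmArgs a;
    a.m = sizeA[0];
    a.k = sizeA[1];
    a.n = sizeB[1];
    // With no transpose, the leading dimension is the row count.
    a.lda = a.m;
    a.ldb = a.k;
    a.ldd = a.m;
    // Tightly packed instances: the stride is the element count of one matrix.
    a.strideA = static_cast<long long>(a.m) * a.k;
    a.strideB = static_cast<long long>(a.k) * a.n;
    a.strideD = static_cast<long long>(a.m) * a.n;
    a.batchCount = sizeA[2];
    return a;
}
```

For the sizes used in this example, `[5 4 100]` and `[4 5 100]`, the sketch yields `m = 5`, `n = 5`, `k = 4`, leading dimensions 5, 4, and 5, and an output stride of 25, which match the literals in the generated call.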

## Input Arguments

`A,B`: Operands, specified as vectors or matrices. `gpucoder.stridedMatrixMultiply` multiplies along the first two dimensions.

Data Types: `double` | `single` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
Complex Number Support: Yes

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `D = gpucoder.stridedMatrixMultiply(A,B,'alpha',0.3,'transpose','CC');`

`alpha`: Value of the scalar used for multiplication with `A`, that is, the factor $\alpha$ in $D=\alpha AB$. The default value is one.

`transpose`: Character vector or string composed of two characters, indicating the operations performed on the matrices `A` and `B` prior to matrix multiplication. Possible values for each character are normal (`'N'`), transposed (`'T'`), or complex conjugate transpose (`'C'`). By default, neither matrix is transposed.
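As an illustration of how the two-character option determines the size of each output matrix, the following C++ sketch computes the dimensions of one instance of `D`. The helper names `opDims` and `outputDims` are hypothetical. The sketch assumes that the first character applies to `A` and the second to `B`, and that the option is case-insensitive, since the examples on this page use both `'nn'` and `'CC'`.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <utility>

// Rows and columns of op(X) for one operation character.
// 'N' keeps the matrix as is; 'T' and 'C' both swap its dimensions.
std::pair<int, int> opDims(int rows, int cols, char op) {
    switch (std::toupper(static_cast<unsigned char>(op))) {
        case 'N':
            return {rows, cols};
        case 'T':
        case 'C':
            return {cols, rows};
        default:
            throw std::invalid_argument("operation must be N, T, or C");
    }
}

// Size of one output matrix D = alpha * op(A) * op(B).
// Assumption: transpose[0] applies to A and transpose[1] applies to B.
std::pair<int, int> outputDims(int rowsA, int colsA, int rowsB, int colsB,
                               const std::string& transpose) {
    if (transpose.size() != 2) {
        throw std::invalid_argument("transpose must have two characters");
    }
    const std::pair<int, int> a = opDims(rowsA, colsA, transpose[0]);
    const std::pair<int, int> b = opDims(rowsB, colsB, transpose[1]);
    if (a.second != b.first) {
        throw std::invalid_argument("inner dimensions must agree");
    }
    return {a.first, b.second};
}
```

Under these assumptions, 5-by-4 and 4-by-5 inputs with `'nn'` give a 5-by-5 result, and 4-by-5 inputs for both `A` and `B` give the same 5-by-5 result with `'TN'`, because the first operand is transposed before the multiplication.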

## Output Arguments

`D`: Product, returned as a scalar, vector, or matrix. Array `D` has the same number of rows as input `A` and the same number of columns as input `B`.

## Version History

Introduced in R2020a