gpucoder.matrixMatrixKernel

Optimized GPU implementation of functions containing matrix-matrix operations

Syntax

``C = gpucoder.matrixMatrixKernel(fun,A,B)``
``C = gpucoder.matrixMatrixKernel(___,orientation)``
``C = gpucoder.matrixMatrixKernel(___,vectorizedSim)``

Description

`C = gpucoder.matrixMatrixKernel(fun,A,B)` generates kernels from functions that contain GEMM-like operations. For example, matching feature points between two images by using:

• The sum of absolute differences (SAD): `F() = @(a,b)abs(a-b)`

• The sum of squared differences (SSD): `F() = @(a,b)(a-b).*(a-b)`

`C = gpucoder.matrixMatrixKernel(___,orientation)` has the optional argument `orientation` that specifies the orientation of the `A` and `B` matrices.

`C = gpucoder.matrixMatrixKernel(___,vectorizedSim)` has the optional argument `vectorizedSim` that specifies use of vectorized operations during MATLAB® simulation and CPU code generation. The function handle `fun` must support vector inputs, take one row or column from `A` and one column or row from `B`, and output a vector equivalent to `arrayfun(fun, A, B)`.

Examples

This example performs a simple matrix-matrix multiplication and uses the `matrixMatrixKernel` design pattern to generate CUDA® code.

In one file, write an entry-point function `matMul_nn` that accepts two matrix inputs `f1` and `f2`. Use the MATLAB function `times` to multiply `f1` and `f2` element by element; the `@` operator creates a handle to the function. Insert the `gpucoder.matrixMatrixKernel()` statement. The input matrices are not transposed, so use the `'nn'` option. The final argument, `true`, is the `vectorizedSim` setting.

```
function scores = matMul_nn(f1, f2)
    scores = gpucoder.matrixMatrixKernel(@times, f1, f2, 'nn', true);
end
```

Use the `codegen` function to generate a CUDA MEX function.

```
codegen -config coder.gpuConfig('mex') ...
    -args {ones(1024,1024,'double'),ones(1024,1024,'double')} ...
    -report matMul_nn
```

The generated CUDA code contains two kernels: `matMul_nn_kernel1` for initializing the output matrix `scores` and `matrixMatrixKernel` that performs the `times` operation. The following is a snippet of the generated code.

```
cudaMemcpy(*gpu_f2, cpu_f2, 8388608UL, cudaMemcpyHostToDevice);
matMul_nn_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_f2, *gpu_B);
cudaMemcpy(*gpu_f1, cpu_f1, 8388608UL, cudaMemcpyHostToDevice);
matrixMatrixKernel<<<1024U, 64U>>>(*gpu_f1, *gpu_B, *gpu_scores);
cudaMemcpy(cpu_scores, *gpu_scores, 8388608UL, cudaMemcpyDeviceToHost);
```
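One way to sanity-check the generated MEX function is to compare its output with the built-in matrix product, because applying `times` to a row and column pair and summing the result is an ordinary dot product. The following is a minimal sketch, assuming that the `codegen` command above succeeded, that a supported GPU is available, and that the MEX function uses the default name `matMul_nn_mex`. The variable names `expected` and `relErr` are illustrative only.

```matlab
% Assumes matMul_nn_mex was generated by the codegen command above and is
% on the MATLAB path. Inputs must match the 1024-by-1024 double arguments
% specified with -args.
f1 = rand(1024, 1024);
f2 = rand(1024, 1024);

scores = matMul_nn_mex(f1, f2);   % GPU result from the generated MEX
expected = f1 * f2;               % reference result from MATLAB

% Compare with a relative tolerance; GPU and CPU summation order can
% differ, so exact equality is not expected.
relErr = max(abs(scores(:) - expected(:))) / max(abs(expected(:)));
fprintf('Maximum relative error: %g\n', relErr);
```

A small relative error (near the rounding level for `double`) suggests that the kernel reproduces the matrix product.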

Input Arguments

`fun`: Function to apply to the elements of the input arrays, specified as a function handle. `fun` is a handle to a user-defined function. It takes one row or column from matrix `A` and one row or column from matrix `B` and outputs a vector with the same type as the input. The output vector is then summed to compute a single scalar value in `C`.

Data Types: `function_handle`

`A`, `B`: Numeric inputs, which must be either the same size or have compatible sizes. For example, if `A` is an `M`-by-`K` matrix and `B` is a `K`-by-`N` matrix, then `C` is an `M`-by-`N` matrix.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `logical` | `half`
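The computation described for `fun`, `A`, and `B` can be sketched in plain MATLAB with two loops. This is a reference illustration of the semantics only, not the generated kernel and not how GPU Coder implements it. The names `sadFun`, `M`, and `N` and the sample data are assumptions made for the example.

```matlab
% Reference sketch of the matrixMatrixKernel semantics in plain MATLAB.
% It does not require GPU Coder and is not the generated GPU code.
A = [1 2 3; 4 5 6];          % M-by-K, here 2-by-3
B = [1 0; 0 1; 2 2];         % K-by-N, here 3-by-2
sadFun = @(a,b) abs(a-b);    % sum of absolute differences (SAD)

M = size(A, 1);
N = size(B, 2);
C = zeros(M, N);
for i = 1:M
    for j = 1:N
        % Pair row i of A with column j of B, apply the function
        % element by element, then sum the resulting vector.
        C(i,j) = sum(sadFun(A(i,:).', B(:,j)));
    end
end
disp(C)
```

Replacing `sadFun` with `@times` in this sketch gives the same result as `A*B`, which is why the pattern is described as GEMM-like.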

`orientation`: Character vector or string composed of two characters, indicating the operation performed on the matrices `A` and `B` prior to matrix multiplication. The first character applies to `A` and the second to `B`. Each operation can be normal (`'N'`), transposed (`'T'`), or complex conjugate transpose (`'C'`).

Possible values are:

• `'nn'` - Matrices `A` and `B` are normal.

• `'nt'` - Matrix `B` is transposed.

• `'tn'` - Matrix `A` is transposed.

• `'tt'` - Both matrices `A` and `B` are transposed.
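The effect of the orientation option can be illustrated with a plain MATLAB sketch. It assumes that a `'t'` character means the corresponding input is stored transposed and is transposed back before rows and columns are paired. That reading is an interpretation of the list above, not a statement about the generated code, and the names `Bt`, `opA`, and `opB` are hypothetical.

```matlab
% Sketch of one reading of the orientation option; plain MATLAB only.
A  = [1 2 3; 4 5 6];      % 2-by-3, used as stored ('n')
Bt = [1 0 2; 0 1 2];      % 2-by-3, that is, B stored transposed ('t')
orientation = 'nt';

% Apply the per-matrix operation so that opA is M-by-K and opB is K-by-N.
if orientation(1) == 't', opA = A.';  else, opA = A;  end
if orientation(2) == 't', opB = Bt.'; else, opB = Bt; end

fun = @times;
C = zeros(size(opA, 1), size(opB, 2));
for i = 1:size(opA, 1)
    for j = 1:size(opB, 2)
        % Row i of opA is paired with column j of opB, then summed.
        C(i,j) = sum(fun(opA(i,:).', opB(:,j)));
    end
end
disp(C)
```

With `@times`, the result of this sketch matches `A*Bt.'`, so under this assumption the `'nt'` option avoids an explicit transpose of the second input.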

`vectorizedSim`: Specify whether to use vectorized operations during MATLAB simulation and CPU code generation.

Output Arguments

`C`: Product, returned as a scalar, vector, or matrix. Array `C` has the same number of rows as input `A` and the same number of columns as input `B`.

Version History

Introduced in R2017b