Poor performance in mex-files created using mexcuda

I've noticed that MATLAB GPU-enabled code can suddenly become very slow when a seemingly innocent line is added. I therefore want to call CUDA directly through MEX, and in particular I'd like to call cuFFT. However, when I compile a MEX-file and call cuFFT, I get very poor performance compared to MATLAB's fft (which seems to use cuFFT). What is going on? The cuFFT part of the MEX-file seems to run 50-200 times slower than MATLAB's fft calling the same library. Is mexcuda optimized? Here is the code I used. Of course, there may well be an error on my part:
% Matlab side code
% Compile using:
% >> mexcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\lib\x64" -lcufft abc.cu
A = gpuArray.randn(600,600,32,'single') + 1i*randn(600,600,32,'single');
tic
B = abc(A);
toc;
tic, for ii = 1:30, B = fft2(A); end; toc
AA = gather(A);
tic, for ii = 1:30, B = fft2(AA); end; toc
%%Output from a run
>> testabc
Elapsed time is 0.193155 seconds. % Mex file
Elapsed time is 0.004172 seconds. % Matlab fft2
Elapsed time is 1.455618 seconds. % Matlab CPU
// Mex-file code in the file abc.cu
#include "mex.h"
#include "gpu/mxGPUArray.h"
#include <cufft.h>
// Internal type for complex. Same as cufftComplex, just another name
typedef float2 Complex;
/*
* Host code (MEX gateway function; no custom device kernels are used)
*/
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, mxArray const *prhs[])
{
char const * const errId = "parallel:gpu:mexGPUExample:InvalidInput";
char const * const errMsg = "Invalid input to MEX file.";
/* Declare all variables.*/
mxGPUArray const *A;
mxGPUArray *B;
Complex const *pA;
Complex *pB;
/* Initialize the MathWorks GPU API. */
mxInitGPU();
/* Throw an error if the input is not a GPU array. */
if (nrhs!=1) {
mexErrMsgIdAndTxt(errId, errMsg);
}
for (int ii = 0; ii<1; ii++)
if (!(mxIsGPUArray(prhs[ii])))
mexErrMsgIdAndTxt(errId, errMsg);
A = mxGPUCreateFromMxArray(prhs[0]);
// Verify that input is single arrays before extracting the pointer.
if (mxGPUGetClassID(A) != mxSINGLE_CLASS )
{
mexErrMsgIdAndTxt(errId, errMsg);
}
/* Get the pointer to the data */
pA = (Complex const *)(mxGPUGetDataReadOnly(A));
/* Create a GPUArray to hold the result and get its underlying pointer. */
B = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(A),
mxGPUGetDimensions(A),
mxGPUGetClassID(A),
mxGPUGetComplexity(A),
MX_GPU_DO_NOT_INITIALIZE);
pB = (Complex *)(mxGPUGetData(B));
// Now we can do work!
mwSize const * dimSize = mxGPUGetDimensions(A);
// FFT test
cufftHandle plan;
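int dd[2];
// cuFFT is row-major and MATLAB is column-major, so the slowest-varying
// cuFFT dimension dd[0] is MATLAB's second dimension.
dd[0] = (int) dimSize[1];
dd[1] = (int) dimSize[0];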
int Nq = (int) dimSize[2];
int L = 30;
cufftPlanMany(&plan, 2, dd, NULL,0,0,NULL,0,0,CUFFT_C2C,Nq);
for (int i = 0; i<L; i++)
{
// Do the fft
cufftExecC2C(plan,(cufftComplex *) pA,(cufftComplex *) pB,CUFFT_FORWARD);
}
/* Wrap the result up as a MATLAB gpuArray for return. */
plhs[0] = mxGPUCreateMxArrayOnGPU(B);
// Free resources
cufftDestroy(plan);
mxGPUDestroyGPUArray(A);
mxGPUDestroyGPUArray(B);
}

1 Comment

I suggest using gputimeit to time code running on the GPU, as this takes account of the asynchronous nature of GPU calls.
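As a rough sketch of what that could look like for the code in the question (assuming abc.cu has been compiled as shown there; the variable names tMex and tFft are just placeholders, and the imaginary part is generated on the GPU here, which differs slightly from the original script):

```matlab
% Sketch only: time the MEX function and fft2 with gputimeit, which
% waits for the GPU to finish before it stops the clock.
A = gpuArray.randn(600,600,32,'single') + 1i*gpuArray.randn(600,600,32,'single');

tMex = gputimeit(@() abc(A));    % one call runs the 30-iteration loop inside abc.cu
tFft = gputimeit(@() fft2(A));   % one call runs a single fft2

% Divide by 30 so both numbers refer to one batched 2-D FFT.
fprintf('abc:  %.6f s per call, %.6f s per FFT\n', tMex, tMex/30);
fprintf('fft2: %.6f s per call\n', tFft);
```

Note that the MEX-file in the question already loops 30 times internally, so its per-call time has to be divided by 30 before comparing it with a single fft2 call.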

Answers (1)

I implemented your code and got the same results as you. Then I added gpu = gpuDevice at the top and wait(gpu) before each call to tic and toc, and found that your code was slightly faster. So I think it's just to do with the way you're timing.
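For reference, the change described above amounts to something like the following. This is a minimal sketch based on the test script in the question, not the exact script that was run:

```matlab
% Sketch only: synchronize with the GPU before starting and stopping the timer,
% so that tic/toc measure completed work rather than just the kernel launches.
gpu = gpuDevice;   % handle to the currently selected GPU
A = gpuArray.randn(600,600,32,'single') + 1i*randn(600,600,32,'single');

wait(gpu); tic
B = abc(A);                           % MEX file, 30 FFTs inside
wait(gpu); toc

wait(gpu); tic
for ii = 1:30, B = fft2(A); end       % MATLAB fft2 on the GPU, 30 FFTs
wait(gpu); toc
```

Without the wait(gpu) calls, toc can return while the GPU is still working on the queued fft2 calls, which would explain an elapsed time as small as the one reported for the fft2 loop.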

Asked: 7 Jun 2016

Answered: 8 Jun 2016
