Poor performance in mex-files created using mexcuda
I've noticed that MATLAB GPU-enabled code suddenly becomes very slow when a seemingly innocent line is added. I therefore want to call CUDA through a MEX-file; in particular, I would like to call cuFFT. However, when I compile a MEX-file that calls cuFFT, I get very poor performance compared to MATLAB's fft (which seems to use cuFFT). What is going on? The cuFFT part of the MEX-file seems to run 50-200 times slower than MATLAB's fft calling the same library. Is mexcuda optimized? Here is the code I used. It is of course possible that I have made an error somewhere:
% Matlab side code
% Compile using:
% >> mexcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\lib\x64" -lcufft abc.cu
A = gpuArray.randn(600,600,32,'single') + 1i*randn(600,600,32,'single');
tic
B = abc(A);
toc;
tic, for ii = 1:30, B = fft2(A); end; toc
AA = gather(A);
tic, for ii = 1:30, B = fft2(AA); end; toc
%%Output from a run
>> testabc
Elapsed time is 0.193155 seconds. % Mex file
Elapsed time is 0.004172 seconds. % Matlab fft2
Elapsed time is 1.455618 seconds. % Matlab CPU
// Mex-file code in the file abc.cu
#include "mex.h"
#include "gpu/mxGPUArray.h"
#include <cufft.h>
// Internal type for complex values. Same as cufftComplex, just under another name
typedef float2 Complex;
/*
 * Host code: MEX gateway function (no custom device kernels are needed)
 */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
    char const * const errId = "parallel:gpu:mexGPUExample:InvalidInput";
    char const * const errMsg = "Invalid input to MEX file.";

    /* Declare all variables. */
    mxGPUArray const *A;
    mxGPUArray *B;
    Complex const *pA;
    Complex *pB;

    /* Initialize the MathWorks GPU API. */
    mxInitGPU();

    /* Throw an error if the input is not exactly one GPU array. */
    if (nrhs != 1 || !mxIsGPUArray(prhs[0])) {
        mexErrMsgIdAndTxt(errId, errMsg);
    }

    A = mxGPUCreateFromMxArray(prhs[0]);

    /* Verify that the input is a complex single 3-D array before extracting the pointer. */
    if (mxGPUGetClassID(A) != mxSINGLE_CLASS ||
        mxGPUGetComplexity(A) != mxCOMPLEX ||
        mxGPUGetNumberOfDimensions(A) != 3) {
        mxGPUDestroyGPUArray(A);
        mexErrMsgIdAndTxt(errId, errMsg);
    }

    /* Get the pointer to the data. */
    pA = (Complex const *)(mxGPUGetDataReadOnly(A));

    /* mxGPUGetDimensions allocates its result, so fetch it once and mxFree it later. */
    mwSize const *dimSize = mxGPUGetDimensions(A);

    /* Create a GPUArray to hold the result and get its underlying pointer. */
    B = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(A),
                            dimSize,
                            mxGPUGetClassID(A),
                            mxGPUGetComplexity(A),
                            MX_GPU_DO_NOT_INITIALIZE);
    pB = (Complex *)(mxGPUGetData(B));

    // Now we can do work!
    // FFT test. cuFFT expects row-major sizes (last entry varies fastest),
    // while MATLAB is column-major (first dimension varies fastest).
    cufftHandle plan;
    int dd[2];
    dd[0] = (int) dimSize[1];
    dd[1] = (int) dimSize[0];
    int Nq = (int) dimSize[2];   // number of 2-D slices in the batch
    int L = 30;                  // number of repetitions, for timing only

    cufftPlanMany(&plan, 2, dd, NULL, 0, 0, NULL, 0, 0, CUFFT_C2C, Nq);
    for (int i = 0; i < L; i++)
    {
        // Do the FFT
        cufftExecC2C(plan, (cufftComplex *) pA, (cufftComplex *) pB, CUFFT_FORWARD);
    }

    /* Wrap the result up as a MATLAB gpuArray for return. */
    plhs[0] = mxGPUCreateMxArrayOnGPU(B);

    // Free resources
    cufftDestroy(plan);
    mxFree((void *) dimSize);
    mxGPUDestroyGPUArray(A);
    mxGPUDestroyGPUArray(B);
}
1 Comment
Edric Ellis
on 8 Jun 2016
I suggest using gputimeit to time code running on the GPU, as this takes account of the asynchronous nature of GPU calls.
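A minimal sketch of how gputimeit could be applied to the comparison in the question, assuming the compiled MEX-file abc is on the MATLAB path. The variable names tMex and tFft are only illustrative. Because the MEX-file runs its FFT 30 times internally (and also creates and destroys the cuFFT plan on each call), dividing its time by 30 gives only a rough per-FFT figure.

```matlab
% Build the test input on the GPU, as in the question.
A = gpuArray.randn(600,600,32,'single') + 1i*gpuArray.randn(600,600,32,'single');

% gputimeit synchronizes with the GPU, so queued work is included in the time.
tMex = gputimeit(@() abc(A)) / 30;   % abc performs 30 FFTs per call
tFft = gputimeit(@() fft2(A));       % one batched fft2 per call

fprintf('MEX cuFFT : %g s per FFT (approx.)\n', tMex);
fprintf('MATLAB fft2: %g s per FFT\n', tFft);
```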
Answers (1)
Joss Knight
on 8 Jun 2016
I implemented your code and got the same results as you. Then I added gpu = gpuDevice at the top and wait(gpu) before each call to tic and toc, and found that your code was slightly faster. So I think it's just to do with the way you're timing.
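The change described above might look roughly like the following sketch. It reuses the test script from the question and assumes the compiled MEX-file abc is on the path. The calls to wait(gpu) make MATLAB block until all queued GPU work has finished, so tic and toc bracket the actual execution rather than just the time needed to launch it.

```matlab
gpu = gpuDevice;
A = gpuArray.randn(600,600,32,'single') + 1i*gpuArray.randn(600,600,32,'single');

% Time the MEX-file (30 FFTs inside one call).
wait(gpu); tic
B = abc(A);
wait(gpu); toc

% Time 30 calls to MATLAB's fft2 on the GPU.
wait(gpu); tic
for ii = 1:30, B = fft2(A); end
wait(gpu); toc
```

Without the wait calls, the second measurement may return before the FFTs have finished, which would make fft2 look much faster than it is.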