Why does EMLMEX generate a slow mex file?
2 views (last 30 days)
Show older comments
I have an eml-compatibile m-function that I converted to a mex file using emlmex. The m-function takes about 0.11 seconds to run based on tic-toc. The mex function takes about 1.1 seconds to run. Are there any factors I should be aware of when using emlmex that can cause a x10 slowdown? emlmex options? I'm using R2010a with lcc. The m-function arguments include some 512x512 arrays, some of which are complex, and the function does some fft2/ifft2 stuff along with a few other operations on the complex data. None of the input arguments are changing in the body of the function, so I hope that they are not being passed by value.
0 Comments
Answers (3)
Kaustubha Govind
on 9 Feb 2011
MATLAB uses the JIT/Accelerator and highly optimized BLAS routines for FFT, so it is likely that the code runs much faster in MATLAB that the corresponding generated code (which may not be as optimized).
2 Comments
Kaustubha Govind
on 10 Feb 2011
No, emlmex generates C code for the MATLAB function and then compiles it into a MEX-file. It is not necessary that the generated code call into the BLAS library (in all likelihood it doesn't, because the code is supposed to be standalone).
Mike Hosea
on 23 May 2011
Using emlmex to speed up FFT/IFFT is not generally possible. MATLAB already uses FFTW, which is compiled and optimized, selects an algorithm that seems especially fast for the data and hardware. Currently, emlmex generates a generic radix-2 FFT/IFFT and then compiles it with your C compiler. How your C compiler manages the cache, however well it generates code, that's how it's going to work relative to what is even possible with a generic radix-2 FFT. Well, that's almost true. There are some arbitrary cutoffs there for whether the twiddle factors are pre-computed in a lookup table or computed on the fly using sine and cosine. That sort of thing matters. Especially considering that data has to marshaled to and from a mex file, the only way emlmex-generated FFT going to be competitive with FFTW is if we're talking about a small enough vector that the FFTW overhead dominates, which is pretty small, in my experience.
In general emlmex does not speed up anything that in MATLAB is already executed in compiled libraries, e.g. LAPACK-type linear algebra computations, including linear system solves, SVD, EIG, and, of course, FFT. The kinds of things that emlmex speeds up are operations that are heavier on the MATLAB interpreter, e.g. computations that require lots of logic and/or loops. It also tends to speed up fixedpoint applications dramatically.
2 Comments
Mike Hosea
on 2 Jun 2011
Hard for me to say about FFTW speedups. Seems like a lot sometimes to me as well, but we're basically talking about an operation here that is really fast, and we discover these problems by running it over large data sets or by running it lots of times, otherwise it finishes in milliseconds. Having looked at it a number of times, I've found that performance varies a lot in a relative sense, I presume because of cache management. If your data set is variable in length or large enough that MATLAB Coder does not precompute the twiddle factors, then this can easily make each FFT take significantly longer in the relative sense. It is difficult to decide a priori what the right time versus space trade-off is for that table of sines or cosines.
Anyway, if you have a problem for which FFTW is a lot faster, and you are just running mex files in MATLAB or simulating in Simulink, then you can use FFTW by declaring fft "extrinsic". If you pre-declare the output size, e.g.
eml.extrinsic('fft');
y = complex(zeros(1024,1));
y = fft(x);
when you know that the output of fft is, in this example, 1024-by-1 and 'double', then you'll get the speed of FFTW minus the overhead of shuffling the data back and forth to MATLAB so that it can call FFTW. You'll have to experiment to see if it helps the speed any.
The Simulink FFT block has its own code, but it is a similar situation.
No, I don't think code generation would benefit from rewriting the matrix multiplications. Hopefully you would have BLAS enabled, and this would be done with xGEMM, and then (assuming you're running MKL BLAS) it should go as fast as the guys at Intel know how to make it.
--
Mike
James Tursa
on 9 Feb 2011
Is your m-file code small enough to post? I would have expected that the emlmex routine would be calling the same functions that the m-file calls, but maybe not.
6 Comments
Mike Hosea
on 2 Jun 2011
Just for completeness here, let me note that probably about half of that, if not a bit more, is actually array bounds checking. You can turn that off by doing
cfg = emlcoder.MEXConfig;
cfg.IntegrityChecks = false;
emlmex emlmextest -eg {u} -o emlmextestmex -s cfg
Quite generally FFTW does better and better over the simple algorithm the larger the data, and I'm not sure, but I seem to recall that FFTW has some facility for doing FFT2 a bit more efficiently than just iterating through the columns (which is what the simple algorithm does). For n=512, my setup is giving me about 7x improvement from FFTW. This drops to around 3x at n=128, and at 32 I had to run it longer to get decent numbers, but I got this:
>> u = complex(randn(32),randn(32));
y = 0*u;
cfg = emlcoder.MEXConfig;
cfg.IntegrityChecks = false;
emlmex emlmextest -eg {u} -o emlmextestmex -s cfg
disp('m-file'); tic, for ii=1:1000,y=emlmextest(u);end,toc
disp('mex-file'); tic, for ii=1:1000,y=emlmextestmex(u);end,toc
disp('m-file'); tic, for ii=1:1000,y=emlmextest(u);end,toc
disp('mex-file'); tic, for ii=1:1000,y=emlmextestmex(u);end,toc
m-file
Elapsed time is 0.149498 seconds.
mex-file
Elapsed time is 0.084005 seconds.
m-file
Elapsed time is 0.106787 seconds.
mex-file
Elapsed time is 0.071497 seconds.
See Also
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!