Why does EMLMEX generate a slow mex file?

Question

Paul on 9 Feb 2011

https://in.mathworks.com/matlabcentral/answers/1115-why-does-emlmex-generate-a-slow-mex-file

I have an eml-compatible m-function that I converted to a mex file using emlmex. The m-function takes about 0.11 seconds to run based on tic-toc. The mex function takes about 1.1 seconds to run. Are there any factors I should be aware of when using emlmex that can cause a 10x slowdown? Any emlmex options I should be setting? I'm using R2010a with lcc. The m-function arguments include some 512x512 arrays, some of which are complex, and the function does some fft2/ifft2 stuff along with a few other operations on the complex data. None of the input arguments are modified in the body of the function, so I hope that they are not being passed by value.

Answer 1

Kaustubha Govind on 9 Feb 2011

https://in.mathworks.com/matlabcentral/answers/1115-why-does-emlmex-generate-a-slow-mex-file#answer_1569

MATLAB uses the JIT/Accelerator and highly optimized BLAS routines for FFT, so it is likely that the code runs much faster in MATLAB than the corresponding generated code (which may not be as optimized).

2 Comments

Paul on 10 Feb 2011

Wouldn't the eml version of FFT also use the same highly optimized BLAS routine?

Kaustubha Govind on 10 Feb 2011

No, emlmex generates C code for the MATLAB function and then compiles it into a MEX-file. The generated code does not necessarily call into the BLAS library (in all likelihood it doesn't, because the code is supposed to be standalone).

Answer 2

Mike Hosea on 23 May 2011

https://in.mathworks.com/matlabcentral/answers/1115-why-does-emlmex-generate-a-slow-mex-file#answer_10991

Using emlmex to speed up FFT/IFFT is not generally possible. MATLAB already uses FFTW, which is compiled and optimized and selects an algorithm that seems especially fast for the data and hardware. Currently, emlmex generates a generic radix-2 FFT/IFFT and then compiles it with your C compiler. How well your C compiler generates code and manages the cache determines how close the result gets to what is even possible with a generic radix-2 FFT. Well, that's almost true. There are some arbitrary cutoffs for whether the twiddle factors are pre-computed in a lookup table or computed on the fly using sine and cosine. That sort of thing matters. Especially considering that data has to be marshaled to and from a mex file, the only way an emlmex-generated FFT is going to be competitive with FFTW is if we're talking about a vector small enough that the FFTW overhead dominates, and that is pretty small, in my experience.
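
For readers unfamiliar with the term, here is a minimal sketch of what a generic radix-2 FFT with a pre-computed twiddle table can look like. It is an illustration only and not the code that emlmex generates; the function name radix2_fft_sketch is hypothetical, and the sketch assumes the input length is a power of two. Save it as radix2_fft_sketch.m.

```matlab
function X = radix2_fft_sketch(x)
% Illustrative iterative radix-2 decimation-in-time FFT (hypothetical
% helper, NOT the emlmex-generated code). Input length must be 2^p.
x = x(:);
n = numel(x);
assert(n >= 1 && bitand(n, n-1) == 0, 'Length must be a power of two.');

% Bit-reversal permutation of the input indices.
nbits = round(log2(n));
idx = zeros(n, 1);
for k = 0:n-1
    r = 0;
    v = k;
    for b = 1:nbits
        r = 2*r + mod(v, 2);
        v = floor(v/2);
    end
    idx(k+1) = r + 1;
end
X = complex(x(idx));

% Pre-computed twiddle table: w(k+1) = exp(-2*pi*i*k/n), k = 0..n/2-1.
w = exp(-2i*pi*(0:n/2-1).'/n);

% Butterfly stages for sub-transform lengths 2, 4, ..., n.
len = 2;
while len <= n
    half = len/2;
    step = n/len;            % stride into the twiddle table
    for start = 1:len:n
        for j = 0:half-1
            t = w(j*step + 1) * X(start + j + half);
            u = X(start + j);
            X(start + j)        = u + t;
            X(start + j + half) = u - t;
        end
    end
    len = 2*len;
end
end
```

Replacing the table lookup w(j*step + 1) with exp(-2i*pi*j/len) inside the inner loop gives the "computed on the fly" variant, which trades memory for repeated sine/cosine evaluations. A quick sanity check is to compare the result against the built-in fft on a random power-of-two-length vector.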

In general, emlmex does not speed up anything that in MATLAB is already executed in compiled libraries, e.g. LAPACK-type linear algebra computations, including linear system solves, SVD, EIG, and, of course, FFT. The kinds of things that emlmex speeds up are operations that are heavier on the MATLAB interpreter, e.g. computations that require lots of logic and/or loops. It also tends to speed up fixed-point applications dramatically.
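
As an illustration of the second category, here is a sketch of a function dominated by scalar logic inside a loop, which is the kind of code described above as heavier on the interpreter. The function name, the thresholds, and the example inputs in the build comment are all hypothetical, and whether the MEX version actually runs faster on a given setup has to be measured.

```matlab
function n = count_crossings_sketch(x, lo, hi)
%#eml
% Hypothetical example: counts upward crossings of hi, with hysteresis
% (the counter re-arms only after the signal drops below lo).
n = 0;
armed = true;
for k = 1:numel(x)
    if armed && x(k) > hi
        n = n + 1;
        armed = false;
    elseif ~armed && x(k) < lo
        armed = true;
    end
end
end

% Build with the same emlmex syntax used elsewhere in this thread
% (example inputs are assumptions):
%   x = zeros(1e6,1); lo = 0.5; hi = 1.5;
%   emlmex count_crossings_sketch -eg {x,lo,hi} -o count_crossings_sketch_mex
```

Calling count_crossings_sketch([0 2 0 2 0 2], 0.5, 1.5) returns 3, which gives a simple way to confirm that the m-file and the MEX file agree before comparing their timings.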

2 Comments

Paul on 24 May 2011

Mike,

Based on your response and what I got from tech support, I'm not surprised that emlmex didn't speed up the FFT2/IFFT2 in this simple example. But should it be that much slower? Seems like a lot to me, but I confess I'm far from an expert on the advantages of FFTW relative to generic routines.

Also, I'm pretty sure that the tic/toc timings there are representative of what I was getting with multiple runs when I was playing with it at the keyboard before generating the simple script for posting here.

I'm still back on R2010a. Do you think I would see any improvements in the mex file using the new tools in R2011b?

I'm not hard over on the need for a mex file for its own sake. What I have is an m-file that I want to be able to run from MATLAB directly and also call from Simulink via an Embedded MATLAB Function block wrapper. So the first step was to make sure that my m-file was eml-compatible and gave me the expected answers. I was testing that via emlmex when I discovered the runtime difference. I then called it from Simulink and let Simulink generate the code (which I assume uses the same core as emlmex), and I felt that my Simulink model was also running a lot more slowly than I expected.

Are there any tips for writing m-code that might be more suitable for code generation? For example, in m-code I would happily write:

Z = A*B*C*D % all variables are large matrices

Would code generation work better if it was written as

temp1 = C*D;
temp2 = B*temp1;
Z = A*temp2;

I assume that the generated code is already doing something like this. But could there be a case where the generated code is generating too many intermediate variables and suffering from extraneous data copies?

Thanks,

Paul

Mike Hosea on 2 Jun 2011

Hard for me to say about FFTW speedups. Seems like a lot sometimes to me as well, but we're basically talking about an operation here that is really fast, and we discover these problems by running it over large data sets or by running it lots of times, otherwise it finishes in milliseconds. Having looked at it a number of times, I've found that performance varies a lot in a relative sense, I presume because of cache management. If your data set is variable in length or large enough that MATLAB Coder does not precompute the twiddle factors, then this can easily make each FFT take significantly longer in the relative sense. It is difficult to decide a priori what the right time versus space trade-off is for that table of sines or cosines.

Anyway, if you have a problem for which FFTW is a lot faster, and you are just running mex files in MATLAB or simulating in Simulink, then you can use FFTW by declaring fft "extrinsic". If you pre-declare the output size, as in the example below where the output of fft is known to be 1024-by-1 and 'double', then you'll get the speed of FFTW minus the overhead of shuffling the data back and forth to MATLAB so that it can call FFTW.

eml.extrinsic('fft');
y = complex(zeros(1024,1));
y = fft(x);

You'll have to experiment to see if it helps the speed any.

The Simulink FFT block has its own code, but it is a similar situation.

No, I don't think code generation would benefit from rewriting the matrix multiplications. Hopefully you would have BLAS enabled, and this would be done with xGEMM, and then (assuming you're running MKL BLAS) it should go as fast as the guys at Intel know how to make it.

--

Mike
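
One detail on the matrix-product question: in MATLAB, A*B*C*D is evaluated left to right, i.e. ((A*B)*C)*D, so the temp1/temp2 version asked about above changes the order of association rather than just naming the temporaries. The two agree up to rounding, and for square matrices of equal size the operation count is the same, but with unequal sizes the cost can differ a lot. The sizes below are hypothetical and chosen only to make that visible.

```matlab
% Hypothetical sizes: three square matrices and a column vector.
A = randn(200, 200);
B = randn(200, 200);
C = randn(200, 200);
D = randn(200, 1);

% Left to right: two 200x200 matrix-matrix products, then one
% matrix-vector product.
Z1 = A*B*C*D;

% Right to left, as in the temp1/temp2 form: three matrix-vector products.
temp1 = C*D;
temp2 = B*temp1;
Z2 = A*temp2;

% Same result up to floating-point rounding.
relerr = norm(Z1 - Z2) / norm(Z1);
fprintf('relative difference: %g\n', relerr);
```

When every operand is a large square matrix, as in the original question, either order performs three matrix-matrix products, which is consistent with the answer that rewriting the expression is not expected to help.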

Answer 3

James Tursa on 9 Feb 2011

https://in.mathworks.com/matlabcentral/answers/1115-why-does-emlmex-generate-a-slow-mex-file#answer_1579

Is your m-file code small enough to post? I would have expected that the emlmex routine would be calling the same functions that the m-file calls, but maybe not.

6 Comments

Paul on 16 Feb 2011

Here is an example function:

function y = emlmextest(u)
%#eml
y = ifft2(fft2(u));
end

And here is a test script:

u = complex(randn(512),randn(512));
y = 0*u;
emlmex emlmextest -eg {u} -o emlmextestmex
disp('m-file'); tic, for ii=1:10,y=emlmextest(u);end,toc
disp('mex-file');tic, for ii=1:10,y=emlmextestmex(u);end,toc

And the results:

m-file
Elapsed time is 1.549288 seconds.
mex-file
Elapsed time is 21.033057 seconds.

I hope the EML versions of fft2 and ifft2 aren't really 15 times slower than their Matlab counterparts.

Mike Hosea on 2 Jun 2011

Just for completeness here, let me note that probably about half of that, if not a bit more, is actually array bounds checking. You can turn that off by doing

cfg = emlcoder.MEXConfig;
cfg.IntegrityChecks = false;
emlmex emlmextest -eg {u} -o emlmextestmex -s cfg

Quite generally FFTW does better and better over the simple algorithm the larger the data, and I'm not sure, but I seem to recall that FFTW has some facility for doing FFT2 a bit more efficiently than just iterating through the columns (which is what the simple algorithm does). For n=512, my setup is giving me about 7x improvement from FFTW. This drops to around 3x at n=128, and at 32 I had to run it longer to get decent numbers, but I got this:

>> u = complex(randn(32),randn(32));
y = 0*u;
cfg = emlcoder.MEXConfig;
cfg.IntegrityChecks = false;
emlmex emlmextest -eg {u} -o emlmextestmex -s cfg
disp('m-file'); tic, for ii=1:1000,y=emlmextest(u);end,toc
disp('mex-file'); tic, for ii=1:1000,y=emlmextestmex(u);end,toc
disp('m-file'); tic, for ii=1:1000,y=emlmextest(u);end,toc
disp('mex-file'); tic, for ii=1:1000,y=emlmextestmex(u);end,toc

m-file
Elapsed time is 0.149498 seconds.
mex-file
Elapsed time is 0.084005 seconds.
m-file
Elapsed time is 0.106787 seconds.
mex-file
Elapsed time is 0.071497 seconds.
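
The two passes above give noticeably different m-file timings (0.149 s, then 0.107 s), so a single tic/toc measurement can mislead. A minimal sketch of a slightly more careful timing loop, using only tic and toc, is shown here. The handle f is a stand-in built from the built-in fft2/ifft2; substituting @emlmextest or @emlmextestmex is left to the reader, and the repetition counts are arbitrary assumptions.

```matlab
% Stand-in workload; replace with @emlmextest or @emlmextestmex
% (those names come from the test script in this thread).
f = @(x) ifft2(fft2(x));
u = complex(randn(32), randn(32));

y = f(u);          % warm-up call, deliberately not timed

nrep  = 5;         % number of timed repetitions (arbitrary)
niter = 1000;      % calls per repetition (arbitrary)
t = zeros(nrep, 1);
for r = 1:nrep
    tic;
    for ii = 1:niter
        y = f(u);
    end
    t(r) = toc;
end

% Report the best and the median repetition rather than a single run.
fprintf('best %.6f s, median %.6f s per %d calls\n', min(t), median(t), niter);
```

Reporting the minimum or median of several repetitions reduces the influence of one-time costs such as first-call loading of the MEX file.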
