R2018b real times complex multiplication

I just installed R2018b alongside an existing 2017b installation. I'm getting a puzzling result with execution speed as follows:
>> version
ans =
'9.5.0.944444 (R2018b)'
>> X = randn( 5000, 5000 ); y = randn( 5000, 1 ) * i;
>> tic; X * y; toc
Elapsed time is 0.237362 seconds.
versus
>> version
ans =
'9.3.0.713579 (R2017b)'
>> X = randn( 5000, 5000 ); y = randn( 5000, 1 ) * i;
>> tic; X * y; toc
Elapsed time is 0.010306 seconds.
These are running on the same PC at the same time. Why is R2018b nearly 24x slower at this operation than R2017b?

1 Comment

madhan ravi
madhan ravi on 21 Dec 2018
Edited: madhan ravi on 21 Dec 2018
Elapsed time is 0.391901 seconds. (using 9.5.0.944444 (R2018b))


 Accepted Answer

James Tursa
James Tursa on 21 Dec 2018
Edited: James Tursa on 21 Dec 2018
Probably related to the fact that complex variables changed to an interleaved storage format in R2018a. So my guess is this is what is happening:
R2017b: Only two calls to the BLAS real matrix*vector multiply routine are needed, the results being put into the real & imaginary parts of the output. I.e., one 5000x5000 * 5000x1 multiply for the real*real part and another 5000x5000 * 5000x1 multiply for the real*imag part. Note that no data copying is needed prior to these calls, and the results can be used directly in the output variable without any additional copying needed.
R2018b: The real 5000x5000 matrix is probably first deep copied into an interleaved complex version, then one call to the BLAS complex matrix*vector multiply routine is used to produce the output. So, the extra work that this does over the R2017b version is the allocation for the temporary 5000x5000x2 matrix elements, the deep data copy of the 5000x5000 real matrix into the real part of that interleaved complex matrix, and then twice as much multiplying (5000x5000x2 elements * 5000x1x2 elements multiply).
My guess is the memory allocation and deep data copy is probably the biggest time killer, but I would have to do some tests to verify this. The actual multiply should only be about twice as slow.
So, if you know that downstream your 5000x5000 real matrix will be used to multiply by complex variables many times, it would be best to make it complex first up front so you only incur the temp allocation and deep data copy once.
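That pattern might look like the following sketch (sizes and the loop are just illustrative):

```matlab
X = randn(5000, 5000);          % real matrix reused many times downstream
Xc = complex(X);                % pay the allocation + deep copy once, up front

for k = 1:100
    y = randn(5000, 1) * 1i;    % complex vectors arriving later
    z = Xc * y;                 % complex*complex BLAS call, no per-call conversion
end
```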
This points out one advantage of the old separate storage format: You could mix and match real and complex variables in matrix multiply operations and accomplish all of it with multiple real BLAS matrix multiply calls without any data copying needed either before or after the operations. This is not true anymore as of R2018a.
EDIT 12/21/2018
Here are the results of some breakout timing tests
>> version
ans =
'9.3.0.713579 (R2017b)'
>> computer
ans =
'PCWIN64'
>> X = rand(7000,7000); y = rand(7000,1)*1i;
>> tic;X*y;toc
Elapsed time is 0.019165 seconds.
and
>> version
ans =
'9.4.0.813654 (R2018a)'
>> computer
ans =
'PCWIN64'
>> X = rand(7000,7000); y = rand(7000,1)*1i;
>> tic;X*y;toc
Elapsed time is 0.626945 seconds.
>> tic;C=complex(X);toc
Elapsed time is 0.579250 seconds.
>> tic;C*y;toc
Elapsed time is 0.023560 seconds.
So, as suspected, the bulk of the extra timing appears to be in the conversion of the large real matrix to complex interleaved format. The actual matrix multiply isn't that much more than the R2017b version.

5 Comments

Thanks for the detailed explanation, which makes perfect sense and clarifies the core of the issue.
I guess I'm still puzzled as to why MathWorks would implement such a change in storage format without carefully considering all the implications. The solution of pre-defining a large real matrix as complex, for the purpose of multiplying it with complex variables later, does not really work for my application unfortunately -- it is an input to a function that performs a number of operations, some involving multiplying by complex matrices, others by real matrices.
I guess I could create two versions of 'X' -- one real, one complex -- pass them both, and then try to keep track of which version to use for each matrix operation. That sounds like an unnecessary headache and would still result in less efficient execution. On the other hand, the setting seems very common: I often work with complex matrices, but almost always need to perform a combination of real/real, real/complex, and complex/complex operations. So this change in storage format is really quite regressive for my purposes.
I guess I will stick to R2017b for now and hope MathWorks will fix this in future releases -- i.e. offer an option to choose between "separate" and "interleaved complex" storage, or implement some automated way to eliminate the unnecessary overhead created by copying large matrices.
Anyway, thanks again for the excellent answer.
James Tursa
James Tursa on 24 Dec 2018
Edited: James Tursa on 24 Dec 2018
"... why Mathworks would implement such a change in storage format ..."
To match the BLAS and LAPACK libraries that they use in the background for linear algebra. Although there is a timing downside for some operations (like mixed real/complex matrix multiply), there is a timing upside to other operations (like getting eigen values/vectors of complex matrices). E.g., for the latter, R2017b would be the version forced to do a data copy before & after the operation, whereas R2018a would not suffer this issue.
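As a sketch of the kind of operation that benefits (I haven't benchmarked this particular case, and the sizes are arbitrary):

```matlab
% eig on a complex matrix hands the data straight to LAPACK, which expects
% interleaved complex storage:
A = complex(randn(2000), randn(2000));  % complex 2000x2000 matrix
tic; [V, D] = eig(A); toc              % R2018a+ can pass A to LAPACK as-is;
                                       % R2017b must first interleave a copy
                                       % and de-interleave the results after
```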
"... hope Mathworks will fix this in future releases ..."
Unlikely. As pointed out above, there are pluses and minuses to both the R2017b- and R2018a+ behaviors for this change. I doubt it would make sense to TMW to maintain code to support the two storage formats simultaneously in any future MATLAB version.
Side Note: This was a really good observation on your part and an excellent question to ask.
MATLAB's FFT, which is based on FFTW, also works with interleaved complex data, though I did not run any benchmark to see whether R2018a improves the speed.
Does anybody know which storage scheme the GPU prefers? I would guess that, because of the parallel structure, interleaved data is the natural way to store it.
@Bruno gpuArray has used interleaved-complex since it was first introduced in R2010b (you could observe this using the MEX interface, or CUDAKernel). This is one of the reasons why there are subtle differences between gpuArray and double for some operations - mostly down to the fact that gpuArray doesn't drop all-zero imaginary parts (this is for performance reasons - as is pretty much everything else as far as gpuArray is concerned).
Thank you Edric.
So far I'm not yet using the GPU, but it is something whose evolution I try to follow.

