How to speed up reading from N-dimensional matrix in a nested loop?

Hi,
I've got a piece of code that has to run in many nested loops. In the next-to-most inner loop I need to grab some elements from an N-dimensional matrix (N>=5) and perform some calculations on these values. When running the profiler I found that the bottle neck is reading elements from the N-dimensional matrix. The matrix size will typically be [9,9,9,9,9,100] so even though it is big, it's not extremely large.
I wonder if you guys have any good idea on how to speed up this code? Any ideas would be extremely welcome - I've spent days reading on google and trying different things in the profiler, and the quickest way I've found is to stack all the data into a 3D matrix, and looping through the 3rd dimension of this new matrix to perform the calculations.
See a code example below (in this code there might be ways to get rid of some of the loops, but that won't be the case in my real code since that will have a lot of logic on each level). It takes ~1.6sec to run on my laptop, and of this ~1.35 sec is spent reading from the N-dimensional matrix...
Kindly, Anders
% File: test.m
clear
%%This section can not be changed in my real code).
%Set up some grids:
small_grid=3;
large_grid=4;
H=small_grid;
a=linspace(1,4,large_grid)'; A=length(a);
b=linspace(1,3,small_grid)'; B=length(b);
c=linspace(1,2,small_grid)';C=length(c);
d=linspace(1,4,small_grid)';D=length(d);
e=linspace(1,4,10)';E=length(e);
f=1:10;
[ga gb gc gd ge]=ndgrid(a,b,c,d,e);
I=32;%
fn1=ga./3+gb/.5 +gc./3 +gd./4 +ge/4;%Some function
fn2=ga./1+gb/.1 +gc./10 +gd./4 +ge/.1;%Some other function
soughtValue=zeros(length(f),8);
tic
for h=1:H
for a_=1:A
aa=rand(I,1).*(a(end)-a(1))+a(1);%In the real code each level will have a lot of logic...
for b_=1:B
bb=rand(I,1).*(b(end)-b(1))+b(1);
for c_=1:C
cc=rand(I,1).*(c(end)-c(1))+c(1);
for d_=1:D
dd=rand(I,1).*(d(end)-d(1))+d(1);
for e_=1:E
ee=rand(I,1).*(e(end)-e(1))+e(1);
% Calculate which elements to pick out
before1=floor(rand(length(f),1)*(length(a)-1)+1);
after1=before1+1;
before2=floor(rand(length(f),1)*(length(b)-1)+1);
after2=before2+1;
c_index=floor(rand(length(f),1)*(C-1)+1);
d_index=floor(rand(length(f),1)*(D-1)+1);
% Some parameters
a1=floor(rand(length(f))*(length(f)-1)+1);
a2=floor(rand(size(f))*(length(f)-1)+1);
a3=floor(rand(size(f))*(length(f)-1)+1);
a4=floor(rand(size(f))*(length(f)-1)+1);
%%This is the section that can be changed in order to speed up calculations
for ff=1:length(f)
%I've found that it's quickest to store the data in a 3D
%matrix and then loop over the last dimension of this matrix
%to perform the calculations. But even this is slow!
new_grid(:,:,1)= (fn1(:,:,c_index(ff) ,d_index(ff) ,ff));
new_grid(:,:,2)= (fn1(:,:,c_index(ff)+1,d_index(ff) ,ff));
new_grid(:,:,3)= (fn1(:,:,c_index(ff) ,d_index(ff)+1,ff));
new_grid(:,:,4)= (fn1(:,:,c_index(ff)+1,d_index(ff)+1,ff));
new_grid(:,:,5)= (fn2(:,:,c_index(ff) ,d_index(ff) ,ff));
new_grid(:,:,6)= (fn2(:,:,c_index(ff)+1,d_index(ff) ,ff));
new_grid(:,:,7)= (fn2(:,:,c_index(ff) ,d_index(ff)+1,ff));
new_grid(:,:,8)= (fn2(:,:,c_index(ff)+1,d_index(ff)+1,ff));
for gg=1:8
soughtValue(ff,gg) =...
( a1(ff).*new_grid(before1(ff),before2(ff),gg) ) ...
+( a2(ff).*new_grid(before1(ff),after2(ff),gg) ) ...
+( a3(ff).*new_grid(after1(ff) ,before2(ff),gg) ) ...
+( a4(ff).*new_grid(after1(ff) ,after2(ff),gg) );
end
% The first iteration of the loop will look like this. Notice
% that this is many times slower than using the loop!
% soughtValue(ff,1)=...
% ( a1(ff).*(fn1(before1(ff),before2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a2(ff).*(fn1(before1(ff),after2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a3(ff).*(fn1(after1(ff),before2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a4(ff).*(fn1(after1(ff),after2(ff),c_index(ff) ,d_index(ff) ,ff)) );
end
%%End of code that can be changed
end
end
end
end
end
end
toc

 Accepted Answer

Hi Anders,
Just by replacing the 8 lines defining new_grid with
d3index = [c_index(ff),c_index(ff)+1] ;
new_grid = cat(3, ...
fn1(:,:,d3index,d_index(ff),ff), ...
fn1(:,:,d3index,d_index(ff)+1,ff), ...
fn2(:,:,d3index,d_index(ff),ff), ...
fn2(:,:,d3index,d_index(ff)+1,ff)) ;
you would make it slightly more efficient. About the loop that follows, are you sure that a1(ff),..,a4(ff) are correct indexing-wise? These are 2D arrays that you are accessing using linear indexing.. which seems weird in this context in the sense that you will get the first column of each matrix with ff in [1,..,numel(f)], which we usually index with e.g. a1(ff,1).
Cheers,
Cedric

7 Comments

I actually wouldn't recommend using CAT here, since it will always allocate fresh memory for its output. Direct assignment to the new_grid(:,:,i) will all be done in-place. See my Answer for an alternative.
I agree, but I made tests of a solution based on a reshape that was slower than the cat.. I don't know why. I am studying your solution right now to see how to do it properly with the reshape. I had also eliminated the 'gg' loop initially, but it happened to be slower than the loop.
Hi Cedric,
Thank you so much for your reply! Unfortunately I'm not in the office at the moment so won't be able to test your solution - I will do it first thing tomorrow and get back to you!
Regarding the a1-a4's: These are supposed to be vectors. For some reason I made a mistake when generating the dummy a1 in the example file, but the a2-a4 are correct. The a1 line should read: a1=floor(rand(size(f))*(length(f)-1)+1);
I'll get back when I have tried your (and the other) solutions! Kindly Anders
My pleasure! Note that the discussion brought by Matt below is more valuable than this answer if you are trying to optimize your loop, as it brings a lot of useful information that would be crucial if you were going for an in-depth introspection of your computation.
That's kind of you to say, Cedric, though I'm not so sure. So far, your proposal in this answer thread is the fastest we've found! Although, I did eventually manage to get my condensed loop to within a few percent of yours.
The a1 line should read: a1=floor(rand(size(f)) x (length(f)-1)+1);
If that's the case then all lines a1...a4 are computed with the exact same statement. It then makes more sense to compute a single A matrix rather than 4 separate "a" variables
A=floor(rand(length(f),4)*(length(f)-1)+1);
where now a1=A(:,1), a2=A(:,2), etc...
Hi, I just gave it a shot and it did indeed work out to be 20-25% faster than my orignal code. Many thanks Cedric!!!
(Regarding the a's: you're correct Matt! It's better to create them using one line and one A-matrix. But in my 'real' code this is not possible and thus I left them as they are!

Sign in to comment.

More Answers (2)

You can get rid of the loop over gg
soughtValue(ff,:)= ...
( a1(ff).*new_grid(before1(ff),before2(ff),:) ) ...
+( a2(ff).*new_grid(before1(ff),after2(ff),:) ) ...
+( a3(ff).*new_grid(after1(ff) ,before2(ff),:) ) ...
+( a4(ff).*new_grid(after1(ff) ,after2(ff),:) );
Also, you should avoid repeated matrix indexing operations. If you're going to be reusing portions of an array, save them to a temporary variable, e.g.,
cff=c_index(ff);
dff=d_index(ff);
tmp=[cff,cff+1];
new_grid(:,:,1:2)= (fn1(:,:,tmp ,dff ,ff));
new_grid(:,:,3:4)= (fn1(:,:,tmp ,dff+1,ff));
                  new_grid(:,:,5:6)=  (fn2(:,:,tmp ,dff ,ff));
                  new_grid(:,:,7:8)=  (fn2(:,:,tmp ,dff+1,ff));

4 Comments

If you pre-permute f1 and fn2 before the loops
fn1=permute(fn1,[1 2 5 3 4]);
fn2=permute(fn2,[1 2 5 3 4]);
[M,N,~]=size(fn1);
then you can simplify this even further
cff=c_index(ff);
dff=d_index(ff);
tmpc=[cff,cff+1];
tmpq=[dff,dff+1];
new_grid(:,:,1:4)= reshape(fn1(:,:,ff, tmpc , tmpq),M,N,4);
new_grid(:,:,5:8)= reshape(fn2(:,:,ff, tmpc , tmpq),M,N,4);
Surprisingly, this is 10% slower than the cat. I brought the first block back into the part that he can modify, before the 'ff' loop:
if all([h,a_,b_,c_,d_,e_]==1)
fn1=permute(fn1,[1 2 5 3 4]);
fn2=permute(fn2,[1 2 5 3 4]);
[M,N,~]=size(fn1);
end
Not so surprisingly. PERMUTE is expensive. It wouldn't make sense to do it inside any of the loops. I'm also not sure what this
if all([h,a_,b_,c_,d_,e_]==1)
is for.
It is for permuting fn1 and fn2 only at the first overall iteration, just to fulfill the requirement not the bring modifications outside of the block that he define as being open to modifications.
The if all() doesn't slow down significantly the execution as far as I can see.

Sign in to comment.

Alright, well here's my fastest version. With the following, I get a 25% speed-up over the original code.
fn=cat(6,fn1,fn2); %PRIOR TO LOOPS
for ff=1:length(f)
new_grid= fn(before1(ff)+[0,1],before2(ff)+[0,1],...
c_index(ff)+[0,1] , d_index(ff)+[0,1],...
ff,:);
AA(4)=a4(ff);
AA(3)=a2(ff);
AA(2)=a3(ff);
AA(1)=a1(ff);
soughtValue(ff,:)=AA*reshape(new_grid,4,8);
end

6 Comments

Hats off for the conciseness! The ugly cat() still wins the toc on my machine, but I think that I will stop trying to understand why before I need yet another aspirin ;)
Actually you don't find the same soughtValue as Anders in his version, but I don't think that we should go on before knowing whether he really meant to linearly index his 2D arrays a1,..,a4..
I get the same soughtValue as Anders, in my tests. It's probably because you didn't run each version with the same seed for the random number generator.
I did actually (I compute "simultaneously" three pairs of new_grid and soughtValue suffixed with _anders/_matt/_cedric), but as it's getting late I certainly made another mistake elsewhere!
Thanks a lot Matt! The first part of the code is really nice, but it seems like it's a tiny bit slower than Cedrics suggestion. But I will keep it as a concice way of writing it! :)
Regarding the last part: I seems to me that this takes ~twice as much computational time as the 8-iteration-loop. Or am I missing something here?
Cheers, Anders
Hi Anders. Not sure what we're calling "the last part". I definitely find that
soughtValue(ff,:)=AA*reshape(new_grid,4,8);
slows things down, as compared to a for-loop, but I haven't measured whether one is twice as fast as the other. It certainly doesn't make more than a 5% impact on the total code's execution time. My fastest version right now is below. By my timings, it's a few percent slower than Cedric's.
My best guess as to why the for-loop is faster is that, maybe, it doesn't allocate any temporary arrays. Possibly, a matrix-vector multiplication always puts its output in a temporary vector, whereas your for-loop might copy results into soughtValue(ff,gg) in a purely in-place fashion.
A few final remarks on all of this:
  1. Because you're jumping around to random locations in an array pulling out values, there are going to be serious limits on how fast you can get. Memory accesses are fastest when done sequentially.
  2. You're doing lots of redundant calculations throughout the code as whole. It doesn't seem to be a good idea to be calling length(f) and size(f) so repeatedly within a for-loop. Each execution by itself is obviously pretty trivial, but the overhead of all those function calls probably adds up.
%Pre-loop computations
fn=cat(6,fn1,fn2);
new_grid=zeros(4,8);
nf=length(f);
for ff=1:nf
new_grid(:) = fn(before1(ff)+[0,1],before2(ff)+[0,1],...
c_index(ff)+[0,1] , d_index(ff)+[0,1],...
ff,:);
%AA1=a1(ff); AA2=a2(ff); AA3=a3(ff); AA4=a4(ff);
for gg=1:8
soughtValue(ff,gg) =...
a1(ff).*new_grid(1,gg)+ ...
a2(ff).*new_grid(2,gg)+ ...
a3(ff).*new_grid(3,gg)+ ...
a4(ff).*new_grid(4,gg);
end
end

Sign in to comment.

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!