How to speed up reading from N-dimensional matrix in a nested loop?
Show older comments
Hi,
I've got a piece of code that has to run in many nested loops. In the next-to-most inner loop I need to grab some elements from an N-dimensional matrix (N>=5) and perform some calculations on these values. When running the profiler I found that the bottle neck is reading elements from the N-dimensional matrix. The matrix size will typically be [9,9,9,9,9,100] so even though it is big, it's not extremely large.
I wonder if you guys have any good idea on how to speed up this code? Any ideas would be extremely welcome - I've spent days reading on google and trying different things in the profiler, and the quickest way I've found is to stack all the data into a 3D matrix, and looping through the 3rd dimension of this new matrix to perform the calculations.
See a code example below (in this code there might be ways to get rid of some of the loops, but that won't be the case in my real code since that will have a lot of logic on each level). It takes ~1.6sec to run on my laptop, and of this ~1.35 sec is spent reading from the N-dimensional matrix...
Kindly, Anders
% File: test.m
clear
%%This section can not be changed in my real code).
%Set up some grids:
small_grid=3;
large_grid=4;
H=small_grid;
a=linspace(1,4,large_grid)'; A=length(a);
b=linspace(1,3,small_grid)'; B=length(b);
c=linspace(1,2,small_grid)';C=length(c);
d=linspace(1,4,small_grid)';D=length(d);
e=linspace(1,4,10)';E=length(e);
f=1:10;
[ga gb gc gd ge]=ndgrid(a,b,c,d,e);
I=32;%
fn1=ga./3+gb/.5 +gc./3 +gd./4 +ge/4;%Some function
fn2=ga./1+gb/.1 +gc./10 +gd./4 +ge/.1;%Some other function
soughtValue=zeros(length(f),8);
tic
for h=1:H
for a_=1:A
aa=rand(I,1).*(a(end)-a(1))+a(1);%In the real code each level will have a lot of logic...
for b_=1:B
bb=rand(I,1).*(b(end)-b(1))+b(1);
for c_=1:C
cc=rand(I,1).*(c(end)-c(1))+c(1);
for d_=1:D
dd=rand(I,1).*(d(end)-d(1))+d(1);
for e_=1:E
ee=rand(I,1).*(e(end)-e(1))+e(1);
% Calculate which elements to pick out
before1=floor(rand(length(f),1)*(length(a)-1)+1);
after1=before1+1;
before2=floor(rand(length(f),1)*(length(b)-1)+1);
after2=before2+1;
c_index=floor(rand(length(f),1)*(C-1)+1);
d_index=floor(rand(length(f),1)*(D-1)+1);
% Some parameters
a1=floor(rand(length(f))*(length(f)-1)+1);
a2=floor(rand(size(f))*(length(f)-1)+1);
a3=floor(rand(size(f))*(length(f)-1)+1);
a4=floor(rand(size(f))*(length(f)-1)+1);
%%This is the section that can be changed in order to speed up calculations
for ff=1:length(f)
%I've found that it's quickest to store the data in a 3D
%matrix and then loop over the last dimension of this matrix
%to perform the calculations. But even this is slow!
new_grid(:,:,1)= (fn1(:,:,c_index(ff) ,d_index(ff) ,ff));
new_grid(:,:,2)= (fn1(:,:,c_index(ff)+1,d_index(ff) ,ff));
new_grid(:,:,3)= (fn1(:,:,c_index(ff) ,d_index(ff)+1,ff));
new_grid(:,:,4)= (fn1(:,:,c_index(ff)+1,d_index(ff)+1,ff));
new_grid(:,:,5)= (fn2(:,:,c_index(ff) ,d_index(ff) ,ff));
new_grid(:,:,6)= (fn2(:,:,c_index(ff)+1,d_index(ff) ,ff));
new_grid(:,:,7)= (fn2(:,:,c_index(ff) ,d_index(ff)+1,ff));
new_grid(:,:,8)= (fn2(:,:,c_index(ff)+1,d_index(ff)+1,ff));
for gg=1:8
soughtValue(ff,gg) =...
( a1(ff).*new_grid(before1(ff),before2(ff),gg) ) ...
+( a2(ff).*new_grid(before1(ff),after2(ff),gg) ) ...
+( a3(ff).*new_grid(after1(ff) ,before2(ff),gg) ) ...
+( a4(ff).*new_grid(after1(ff) ,after2(ff),gg) );
end
% The first iteration of the loop will look like this. Notice
% that this is many times slower than using the loop!
% soughtValue(ff,1)=...
% ( a1(ff).*(fn1(before1(ff),before2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a2(ff).*(fn1(before1(ff),after2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a3(ff).*(fn1(after1(ff),before2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a4(ff).*(fn1(after1(ff),after2(ff),c_index(ff) ,d_index(ff) ,ff)) );
end
%%End of code that can be changed
end
end
end
end
end
end
toc
Accepted Answer
More Answers (2)
You can get rid of the loop over gg
soughtValue(ff,:)= ...
( a1(ff).*new_grid(before1(ff),before2(ff),:) ) ...
+( a2(ff).*new_grid(before1(ff),after2(ff),:) ) ...
+( a3(ff).*new_grid(after1(ff) ,before2(ff),:) ) ...
+( a4(ff).*new_grid(after1(ff) ,after2(ff),:) );
Also, you should avoid repeated matrix indexing operations. If you're going to be reusing portions of an array, save them to a temporary variable, e.g.,
cff=c_index(ff);
dff=d_index(ff);
tmp=[cff,cff+1];
new_grid(:,:,1:2)= (fn1(:,:,tmp ,dff ,ff));
new_grid(:,:,3:4)= (fn1(:,:,tmp ,dff+1,ff));
new_grid(:,:,5:6)= (fn2(:,:,tmp ,dff ,ff));
new_grid(:,:,7:8)= (fn2(:,:,tmp ,dff+1,ff));
4 Comments
If you pre-permute f1 and fn2 before the loops
fn1=permute(fn1,[1 2 5 3 4]);
fn2=permute(fn2,[1 2 5 3 4]);
[M,N,~]=size(fn1);
then you can simplify this even further
cff=c_index(ff);
dff=d_index(ff);
tmpc=[cff,cff+1];
tmpq=[dff,dff+1];
new_grid(:,:,1:4)= reshape(fn1(:,:,ff, tmpc , tmpq),M,N,4);
new_grid(:,:,5:8)= reshape(fn2(:,:,ff, tmpc , tmpq),M,N,4);
Cedric
on 21 Jan 2013
Surprisingly, this is 10% slower than the cat. I brought the first block back into the part that he can modify, before the 'ff' loop:
if all([h,a_,b_,c_,d_,e_]==1)
fn1=permute(fn1,[1 2 5 3 4]);
fn2=permute(fn2,[1 2 5 3 4]);
[M,N,~]=size(fn1);
end
Matt J
on 21 Jan 2013
Not so surprisingly. PERMUTE is expensive. It wouldn't make sense to do it inside any of the loops. I'm also not sure what this
if all([h,a_,b_,c_,d_,e_]==1)
is for.
Cedric
on 21 Jan 2013
It is for permuting fn1 and fn2 only at the first overall iteration, just to fulfill the requirement not the bring modifications outside of the block that he define as being open to modifications.
The if all() doesn't slow down significantly the execution as far as I can see.
Alright, well here's my fastest version. With the following, I get a 25% speed-up over the original code.
fn=cat(6,fn1,fn2); %PRIOR TO LOOPS
for ff=1:length(f)
new_grid= fn(before1(ff)+[0,1],before2(ff)+[0,1],...
c_index(ff)+[0,1] , d_index(ff)+[0,1],...
ff,:);
AA(4)=a4(ff);
AA(3)=a2(ff);
AA(2)=a3(ff);
AA(1)=a1(ff);
soughtValue(ff,:)=AA*reshape(new_grid,4,8);
end
6 Comments
Cedric
on 22 Jan 2013
Hats off for the conciseness! The ugly cat() still wins the toc on my machine, but I think that I will stop trying to understand why before I need yet another aspirin ;)
Anders Österling
on 23 Jan 2013
Hi Anders. Not sure what we're calling "the last part". I definitely find that
soughtValue(ff,:)=AA*reshape(new_grid,4,8);
slows things down, as compared to a for-loop, but I haven't measured whether one is twice as fast as the other. It certainly doesn't make more than a 5% impact on the total code's execution time. My fastest version right now is below. By my timings, it's a few percent slower than Cedric's.
My best guess as to why the for-loop is faster is that, maybe, it doesn't allocate any temporary arrays. Possibly, a matrix-vector multiplication always puts its output in a temporary vector, whereas your for-loop might copy results into soughtValue(ff,gg) in a purely in-place fashion.
A few final remarks on all of this:
- Because you're jumping around to random locations in an array pulling out values, there are going to be serious limits on how fast you can get. Memory accesses are fastest when done sequentially.
- You're doing lots of redundant calculations throughout the code as whole. It doesn't seem to be a good idea to be calling length(f) and size(f) so repeatedly within a for-loop. Each execution by itself is obviously pretty trivial, but the overhead of all those function calls probably adds up.
%Pre-loop computations
fn=cat(6,fn1,fn2);
new_grid=zeros(4,8);
nf=length(f);
for ff=1:nf
new_grid(:) = fn(before1(ff)+[0,1],before2(ff)+[0,1],...
c_index(ff)+[0,1] , d_index(ff)+[0,1],...
ff,:);
%AA1=a1(ff); AA2=a2(ff); AA3=a3(ff); AA4=a4(ff);
for gg=1:8
soughtValue(ff,gg) =...
a1(ff).*new_grid(1,gg)+ ...
a2(ff).*new_grid(2,gg)+ ...
a3(ff).*new_grid(3,gg)+ ...
a4(ff).*new_grid(4,gg);
end
end
Categories
Find more on Response Computation and Visualization in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!