3 views (last 30 days)

Dear All, I have the following problem.

There are relatively simple operations of barycentric interpolation in a large, multi-dimensional array, which leads to long computing times. The code is given as follows:

clear all;

clc;

% auxilliary matrices and vectors

na = 100;

nm = 100;

nL = 10;

a = 1:na;

m = 1:nm;

[ap_grid, mp_grid] = ndgrid(a,m);

[a_grid, m_grid] = ndgrid(a,m);

A = randn(100,na,nm,2,2,3);

M = randn(100,na,nm,2,2,3);

AP_int1 = zeros(100,na,nm,2,2,3,nL);

MP_int1 = zeros(100,na,nm,2,2,3,nL);

AP_int2 = zeros(100,na,nm,2,2,3,nL);

MP_int2 = zeros(100,na,nm,2,2,3,nL);

a_gridVEC = repmat( permute( a_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);

m_gridVEC = repmat( permute( m_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);

% looped computations

tic

for q1 = 1:100

for q2 = 1:2

for q3 = 1:2

for q4 = 1:3

for r1 = 1:nL % a time consuming loop

ap1 = ap_grid(1,1); % in the real situation points ap and mp change every iteration

ap2 = ap_grid(1,2);

ap3 = ap_grid(2,1);

mp1 = mp_grid(1,1);

mp2 = mp_grid(1,2);

mp3 = mp_grid(2,1);

a1 = A(q1,1,1,q2,q3,q4); % in the real situation the 2nd and 3rd indices of A and M change every iteration

a2 = A(q1,1,2,q2,q3,q4);

a3 = A(q1,2,1,q2,q3,q4);

m1 = M(q1,1,1,q2,q3,q4);

m2 = M(q1,1,2,q2,q3,q4);

m3 = M(q1,2,1,q2,q3,q4);

for r2 = 1:na

for r3 = 1:nm

a_int1 = a_grid(r2,r3);

m_int1 = m_grid(r2,r3);

% barycentric interpolation (weights)

w1 = ( (m2-m3)*(a_int1-a3) + (a3-a2)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );

w2 = ( (m3-m1)*(a_int1-a3) + (a1-a3)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );

w3 = 1 - w1 - w2;

% barycentric interpolation (results)

if w1 < -0.25 || w2 < -0.25 || w3 < -0.25 % we allow for only 'a bit' of extrapolation

AP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN;

MP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN;

else

AP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * ap1 + w2 * ap2 + w3 * ap3;

MP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * mp1 + w2 * mp2 + w3 * mp3;

end

end % loop with nm iterations

end % loop with na iterations

end % a time consuming loop

end

end

end

end

toc

% vectorized computations

tic

for r1 = 1:nL % a time consuming loop (in the real situation points ap,mp,a and m change every iteration of this loop)

ap1 = repmat( ap_grid(1,1), 100,na,nm,2,2,3);

ap2 = repmat( ap_grid(1,2), 100,na,nm,2,2,3);

ap3 = repmat( ap_grid(2,1), 100,na,nm,2,2,3);

mp1 = repmat( mp_grid(1,1), 100,na,nm,2,2,3);

mp2 = repmat( mp_grid(1,2), 100,na,nm,2,2,3);

mp3 = repmat( mp_grid(2,1), 100,na,nm,2,2,3);

a1 = repmat( A(:,1,1,:,:,:), 1,na,nm,1,1,1);

a2 = repmat( A(:,1,2,:,:,:), 1,na,nm,1,1,1);

a3 = repmat( A(:,2,1,:,:,:), 1,na,nm,1,1,1);

m1 = repmat( M(:,1,1,:,:,:), 1,na,nm,1,1,1);

m2 = repmat( M(:,1,2,:,:,:), 1,na,nm,1,1,1);

m3 = repmat( M(:,2,1,:,:,:), 1,na,nm,1,1,1);

% barycentric interpolations (weights)

w1 = ( (m2-m3).*(a_gridVEC-a3) + (a3-a2).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );

w2 = ( (m3-m1).*(a_gridVEC-a3) + (a1-a3).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );

w3 = 1 - w1 - w2;

w1(w1<-0.25) = NaN; % we allow for only 'a bit' of extrapolation

w2(w2<-0.25) = NaN;

w3(w3<-0.25) = NaN;

% barycentric interpolations (results)

AP_int1(:,:,:,:,:,:,r1) = w1 .* ap1 + w2 .* ap2 + w3 .* ap3;

MP_int1(:,:,:,:,:,:,r1) = w1 .* mp1 + w2 .* mp2 + w3 .* mp3;

end

toc

Unfortunately, vectorization of these looped operations makes computational times only worse, contradicting the common Matlab knowledge that vectorization always helps. Why is it so? Is there any other way of speeding up the code? The parfor loop also slows down the code. Would it help to write this piece of code in C++ and call it from Matlab?

Matt J
on 14 Feb 2020

Edited: Matt J
on 14 Feb 2020

Getting rid of repmat (requires R2016b or later) and working with single float precision will get you some speed-up.

In doubles:

Elapsed time is 5.396878 seconds.

Elapsed time is 4.874902 seconds.

In singles:

Elapsed time is 3.312500 seconds.

Elapsed time is 2.872178 seconds.

% auxilliary matrices and vectors

na = 100;

nm = 100;

nL = 10;

a = 1:na;

m = 1:nm;

[ap_grid, mp_grid] = ndgrid(a,m);

[a_grid, m_grid] = ndgrid(a,m);

A = randn(100,na,nm,2,2,3);

M = randn(100,na,nm,2,2,3);

AP_int1 = zeros(100,na,nm,2,2,3,nL);

MP_int1 = zeros(100,na,nm,2,2,3,nL);

AP_int2 = zeros(100,na,nm,2,2,3,nL);

MP_int2 = zeros(100,na,nm,2,2,3,nL);

a_gridVEC = repmat( permute( a_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);

m_gridVEC = repmat( permute( m_grid, [3 1 2 4 5 6] ), 100,1,1,2,2,3);

C=cellfun(@single,{A,M,AP_int1,MP_int1 , AP_int2, MP_int2, a_gridVEC,m_gridVEC,ap_grid, mp_grid, a_grid, m_grid},'uni',0);

[A,M,AP_int1,MP_int1 , AP_int2, MP_int2, a_gridVEC,m_gridVEC,ap_grid, mp_grid, a_grid, m_grid]=deal(C{:});

% looped computations

tic

for q1 = 1:100

for q2 = 1:2

for q3 = 1:2

for q4 = 1:3

for r1 = 1:nL % a time consuming loop

ap1 = ap_grid(1,1); % in the real situation points ap and mp change every iteration

ap2 = ap_grid(1,2);

ap3 = ap_grid(2,1);

mp1 = mp_grid(1,1);

mp2 = mp_grid(1,2);

mp3 = mp_grid(2,1);

a1 = A(q1,1,1,q2,q3,q4); % in the real situation the 2nd and 3rd indices of A and M change every iteration

a2 = A(q1,1,2,q2,q3,q4);

a3 = A(q1,2,1,q2,q3,q4);

m1 = M(q1,1,1,q2,q3,q4);

m2 = M(q1,1,2,q2,q3,q4);

m3 = M(q1,2,1,q2,q3,q4);

for r2 = 1:na

for r3 = 1:nm

a_int1 = a_grid(r2,r3);

m_int1 = m_grid(r2,r3);

% barycentric interpolation (weights)

w1 = ( (m2-m3)*(a_int1-a3) + (a3-a2)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );

w2 = ( (m3-m1)*(a_int1-a3) + (a1-a3)*(m_int1-m3) ) / ( (m2-m3)*(a1-a3) + (a3-a2)*(m1-m3) );

w3 = 1 - w1 - w2;

% barycentric interpolation (results)

if w1 < -0.25 || w2 < -0.25 || w3 < -0.25 % we allow for only 'a bit' of extrapolation

AP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN;

MP_int2(q1,r2,r3,q2,q3,q4,r1) = NaN;

else

AP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * ap1 + w2 * ap2 + w3 * ap3;

MP_int2(q1,r2,r3,q2,q3,q4,r1) = w1 * mp1 + w2 * mp2 + w3 * mp3;

end

end % loop with nm iterations

end % loop with na iterations

end % a time consuming loop

end

end

end

end

toc

% vectorized computations

tic

for r1 = 1:nL % a time consuming loop (in the real situation points ap,mp,a and m change every iteration of this loop)

ap1 = ( ap_grid(1,1));

ap2 = ( ap_grid(1,2));

ap3 = ( ap_grid(2,1));

mp1 = ( mp_grid(1,1));

mp2 = ( mp_grid(1,2));

mp3 = ( mp_grid(2,1));

a1 = ( A(:,1,1,:,:,:));

a2 = ( A(:,1,2,:,:,:));

a3 = ( A(:,2,1,:,:,:));

m1 = ( M(:,1,1,:,:,:));

m2 = ( M(:,1,2,:,:,:));

m3 = ( M(:,2,1,:,:,:));

% barycentric interpolations (weights)

w1 = ( (m2-m3).*(a_gridVEC-a3) + (a3-a2).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );

w2 = ( (m3-m1).*(a_gridVEC-a3) + (a1-a3).*(m_gridVEC-m3) ) ./ ( (m2-m3).*(a1-a3) + (a3-a2).*(m1-m3) );

w3 = 1 - w1 - w2;

w1(w1<-0.25) = NaN; % we allow for only 'a bit' of extrapolation

w2(w2<-0.25) = NaN;

w3(w3<-0.25) = NaN;

% barycentric interpolations (results)

AP_int1(:,:,:,:,:,:,r1) = w1 .* ap1 + w2 .* ap2 + w3 .* ap3;

MP_int1(:,:,:,:,:,:,r1) = w1 .* mp1 + w2 .* mp2 + w3 .* mp3;

end

toc

Matt J
on 14 Feb 2020

Using the gpuArrays would also help the vectorized version even further. The following was using the GeForce GTX 1050,

na = 50;

nm = 50;

nL = 10;

....

Elapsed time is 1.304855 seconds. %double floats on CPU

Elapsed time is 0.521651 seconds. %double floats on GPU

Elapsed time is 0.785172 seconds. %single floats on CPU

Elapsed time is 0.278735 seconds. %single floats on GPU

Sign in to comment.

Matt J
on 14 Feb 2020

Edited: Matt J
on 14 Feb 2020

Sign in to answer this question.

Opportunities for recent engineering grads.

Apply Today
## 3 Comments

## Direct link to this comment

https://in.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#comment_797237

⋮## Direct link to this comment

https://in.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#comment_797237

## Direct link to this comment

https://in.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#comment_797241

⋮## Direct link to this comment

https://in.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#comment_797241

## Direct link to this comment

https://in.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#comment_797250

⋮## Direct link to this comment

https://in.mathworks.com/matlabcentral/answers/505402-speeding-up-nested-for-loops-when-vectorization-seems-to-fail#comment_797250

Sign in to comment.