Inevitable broadcast variable in parfor?

Hi,
I want to perform some calculations in a parfor loop using the neighbouring data in a 3D matrix. Currently I am using the following code, which has the problem that data is a 'broadcast variable'. This slows down the calculations and causes memory problems. (This is a small-scale example in which 18.5 MB is sent to the workers. The actual data matrix will be around 100x100x65536, i.e. 2.6 GB vs 3.6 MB in this example.)
rng('default');
height = 60;
width = 60;
ntimes = 2^8;
radius = 10;
block_size = 2*radius+1;
data = randn(height, width, ntimes, 'single'); % create some random data
output = zeros(height, width, ntimes, 'single');
result = zeros(height-2*radius, width-2*radius, block_size, block_size, ntimes, 'single');
ticBytes(gcp); tic;
parfor y = radius+1:height-radius
    % create temporary array for parfor loop
    temp_r = zeros(width-2*radius, block_size, block_size, ntimes, 'single');
    for x = radius+1:width-radius
        % here I calculate the fft of a block surrounding the pixel of interest.
        % Since this indexing violates the slicing rules, data is a broadcast variable
        a = fft(data(y-radius:y+radius, x-radius:x+radius, :), [], 3);
        % For simplicity's sake I store the fft (in reality I do more calculations).
        temp_r(x-radius, :, :, :) = real(a);
    end
    result(y-radius, :, :, :, :) = temp_r;
end
toc; tocBytes(gcp);
To overcome this problem I considered using linear indexing. However, I then cannot figure out how to avoid the broadcasting. Due to the large size of the actual data, I think that splitting the data into (2*radius+1)-by-(2*radius+1) blocks for every single pixel is not feasible.
rng('default');
height = 60;
width = 60;
ntimes = 2^8;
radius = 10;
data = randn(height, width, ntimes, 'single'); % create some random data
data = reshape(data, [height*width, ntimes]);
% determination of linear indices
[cols, rows] = meshgrid(radius+1:height-radius, radius+1:width-radius);
linear_ids = sub2ind([height, width], rows, cols);
% create an array with linear ids of neighbouring pixels
[x,y] = meshgrid(radius*(-height-1):height:radius*height-radius, 0:radius*2);
mask = x + y;
% the neighbour matrix has 2*radius+1 columns (the neighbours)
% and the rows are the different pixels under study
neighbour_matrix = bsxfun(@plus, linear_ids(:), mask(:)');
output = zeros(length(cols) * length(rows),(2*radius+1)^2, ntimes);
ticBytes(gcp); tic;
parfor i = 1:length(cols) * length(rows)
    neighbours = neighbour_matrix(i,:);
    % calculate fft using the neighbours here, but data is still a broadcast variable
    % but how do I make data a sliced variable
    a = fft(data(neighbours, :), [], 2);
    output(i, :, :) = a;
end
toc; tocBytes(gcp);
I would be very grateful if someone could point me in the right direction. Thank you in advance!
tl;dr
Is it possible to avoid having a broadcast variable when a vector of indices is being used?
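(Editor's note: one workaround worth mentioning, not proposed in this thread, is parallel.pool.Constant from the Parallel Computing Toolbox. It does not make data a sliced variable, but it transfers the array to each worker only once per pool instead of once per parfor execution, which can cut repeated transfer overhead. A minimal sketch against the first code block above; note each worker still holds the full array, so per-worker memory use is unchanged:)

C = parallel.pool.Constant(data);   % copied to each worker once, reused across parfor calls
parfor y = radius+1:height-radius
    temp_r = zeros(width-2*radius, block_size, block_size, ntimes, 'single');
    for x = radius+1:width-radius
        % index the worker-local copy instead of the broadcast variable
        a = fft(C.Value(y-radius:y+radius, x-radius:x+radius, :), [], 3);
        temp_r(x-radius, :, :, :) = real(a);
    end
    result(y-radius, :, :, :, :) = temp_r;
end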
  2 Comments
Matt J on 30 Aug 2019
Edited: Matt J on 30 Aug 2019
The actual data matrix will be around 100x100x65536 (i.e. 2.6 GB vs 3.6 MB in this example).
I think you should probably decimate your time samples from 65536 down to maybe a few hundred. It shouldn't take that many samples just to fit a line ('poly1') unless you have seriously bad measurement noise.
pigeon on 2 Sep 2019
Hi Matt, the number of samples influences the resolution of the Fourier transform. The linear fits I get are greatly improved by increasing the sample size. But the suggestion is interesting, as I might try zero-padding the signal in the loop to reduce the amount of data sent to each worker. I will soon look into the effectiveness of this.
For clarification, I don't use all 2^16 samples for the linear fit, usually only the first 4 points, as afterwards the signal-to-noise ratio becomes low.
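(Editor's note: as a sketch of the zero-padding idea mentioned above, MATLAB's fft pads the input with trailing zeros when the requested transform length exceeds the input length, so a truncated time series can be transferred and padded on the worker. The sample counts below are hypothetical; zero-padding interpolates the spectrum rather than adding new information:)

nkeep = 2^10;                       % hypothetical number of samples actually sent to workers
nfft  = 2^16;                       % desired transform length
block = data(y-radius:y+radius, x-radius:x+radius, 1:nkeep);
a = fft(block, nfft, 3);            % fft zero-pads along dim 3 up to nfft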


Accepted Answer

Matt J on 30 Aug 2019
Edited: Matt J on 30 Aug 2019
I doubt you need any for-loops at all. Because your fft is along the 3rd dimension in all the sub-blocks, you could do all the FFT work at once with
dataFFT= real( fft(data,[],3) );
The rest of the operations involved in deriving "result" consist of extracting sliding 2D windows from dataFFT and copying them into a larger matrix. You should probably not be doing that, as it entails a lot of duplicate storage. What are you planning to do with the sub-blocks once you have them?
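(Editor's note: a minimal sketch of why the single FFT call above replaces the per-block FFTs in the question's first code block, whose variable names are assumed here: fft along dimension 3 acts independently on each (y,x) pixel, so any sub-block of the global result equals the FFT of that sub-block.)

dataFFT = real(fft(data, [], 3));   % one FFT over the full array
% For any window position (y, x), this slice of the global result:
block = dataFFT(y-radius:y+radius, x-radius:x+radius, :);
% equals the per-block computation from the original loop:
%   real(fft(data(y-radius:y+radius, x-radius:x+radius, :), [], 3))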
  7 Comments
Matt J on 2 Sep 2019
Edited: Matt J on 2 Sep 2019
which might result in a 5-10% performance improvement
I would expect much more improvement. The for-loop now runs for only N^2=400 iterations (everything else is vectorized) whereas previously you had 3600 iterations.
The problem with the code above is, however, that the whole dataFFT matrix is still sent to each worker and not just a slice of it, which results in severe RAM problems.
Only if you use 65536 time samples, but you shouldn't do that, as I mentioned in my comment above. Also, even if you were to succeed in slicing the dataFFT matrix in some way, you would still have the bigger problem of the sheer size of the resulting correlation data. It is 400 times the size of your initial data array, so with 2^16 time samples it would take around 1000 GB just to store.
the number of samples influences the resolution of the Fourier transform. The linear fits I get are greatly improved by increasing the sample size.
Maybe you need that resolution for the correlation calculation, but I doubt you need it for the Gaussian/linear fitting that follows. So, I think you need to discard the high resolution after the correlation calculations are complete. Below, I give an example modification of the code that does this.
As also illustrated in the code, an alternative to parfor that would solve the broadcast problems is to do the correlation calculations on the GPU (if you have a good graphics card). This would avoid data duplication and probably be much faster than parfor since all the required operations are very well GPU-optimized by the Parallel Computing Toolbox.
% Correlation calculations
dataFFT = real(fft(gpuArray(data), [], 3));
N = 2*radius + 1;
nkept = numel(1:100:size(data, 3));
correlations = zeros(N^2, size(data, 1), size(data, 2), nkept, 'like', dataFFT); % preallocate
for i = 1:N^2
    [yshift, xshift] = ind2sub([N, N], i);
    xyshift = [yshift, xshift] - (radius+1);
    tmp = ifft(dataFFT .* imtranslate(dataFFT, xyshift), [], 3, 'symmetric');
    correlations(i, :, :, :) = tmp(:, :, 1:100:end); % decimate by factor of 100
end
pigeon on 2 Sep 2019
I would expect much more improvement. The for-loop now runs for only N^2=400 iterations (everything else is vectorized) whereas previously you had 3600 iterations.
When I assume that the ifft call determines the runtime of the loop, I can use the following code. (I think this assumption is reasonable, as I suspect the time required for matrix slicing in the old method to be similar to that for circshift in the new method.)
width = 108; % assuming square image, i.e. width == height
radius = 10;
ifft_calls_new = (2*radius+1)^2; % ifft call for each pixel of the correlation block
ifft_calls_old = (width-radius)^2; % ifft call for each pixel in the image
ifft_size_new = width-radius; % cropped image size
ifft_size_old = (2*radius+1); % block size
tic;
for i = 1:ifft_calls_new/10 % divide by 10 to reduce the calculation time
    a = ifft(ones(2^12, ifft_size_new, ifft_size_new));
end
time_new = toc;
tic;
for i = 1:ifft_calls_old/10
    a = ifft(ones(2^12, ifft_size_old, ifft_size_old));
end
time_old = toc;
fprintf('Gain is %.2f%% \n', (time_new - time_old) / time_old * 100);
I then find that the new method is only about 1% faster on my machine. The reason it is not much faster is that the matrices for the ifft calls are much larger.
an alternative to parfor that would solve the broadcast problems is to do the correlation calculations on the GPU
This would indeed be an option. Using a GPU (GTX 1060, 6 GB), it was indeed faster than a parfor loop with broadcasting, but I thought the latter might be faster still if the broadcasting could be avoided.
Only if you use 65336 time samples, but you shouldn't do that, as I mentioned in my comment above
Agreed, that's why I actually do not save the correlations of all 65536 samples.
In my comment above I have the following line:
out = c(2:nbins+1, :, :) .* scale - 1;
Here, I only save the timepoints 2 to 5. That's why my result matrix is fairly small.


More Answers (0)

Release

R2018b
