Inevitable broadcast variable in parfor?

Hi,
I want to perform some calculations in a parfor loop using the neighbouring data in a 3D matrix. Currently I am using the following code, which has the problem that data is a 'broadcast variable'. This slows down the calculations and causes memory problems. (This is a small-scale example in which 18.5 MB is sent to the workers. The actual data matrix will be around 100x100x65536, i.e. 2.6 GB vs 3.6 MB in this example.)
rng('default');
height = 60;
width = 60;
ntimes = 2^8;
radius = 10;
block_size = 2*radius+1;
data = randn(height, width, ntimes, 'single'); % create some random data
output = zeros(height, width, ntimes, 'single');
result = zeros(height-2*radius, width-2*radius, block_size, block_size, ntimes, 'single');
ticBytes(gcp); tic;
parfor y = radius+1:height-radius
    % create temporary array for parfor loop
    temp_r = zeros(width-2*radius, block_size, block_size, ntimes, 'single');
    for x = radius+1:width-radius
        % here I calculate the fft of a block surrounding the pixel of interest.
        % Since this indexing violates the slicing rules, data is a broadcast variable
        a = fft(data(y-radius:y+radius, x-radius:x+radius, :), [], 3);
        % For simplicity's sake I store the fft (in reality I do more calculations).
        temp_r(x-radius, :, :, :) = real(a);
    end
    result(y-radius, :, :, :, :) = temp_r;
end
toc; tocBytes(gcp);
To overcome this problem I considered using linear indexing. However, I then cannot figure out how to avoid the broadcasting. Due to the large size of the actual data, I think that splitting the data into (2*radius+1)-by-(2*radius+1) blocks for every single pixel is not feasible.
rng('default');
height = 60;
width = 60;
ntimes = 2^8;
radius = 10;
data = randn(height, width, ntimes, 'single'); % create some random data
data = reshape(data, [height*width, ntimes]);
% determination of linear indices
[cols, rows] = meshgrid(radius+1:height-radius, radius+1:width-radius);
linear_ids = sub2ind([height, width], rows, cols);
% create an array with linear ids of neighbouring pixels
[x,y] = meshgrid(radius*(-height-1):height:radius*height-radius, 0:radius*2);
mask = x + y;
% the neighbour matrix has 2*radius+1 columns (the neighbours)
% and the rows are the different pixels under study
neighbour_matrix = bsxfun(@plus, linear_ids(:), mask(:)');
output = zeros(length(cols) * length(rows),(2*radius+1)^2, ntimes);
ticBytes(gcp); tic;
parfor i = 1:length(cols) * length(rows)
    neighbours = neighbour_matrix(i,:);
    % calculate fft using the neighbours here, but data is still a broadcast variable
    % but how do I make data a sliced variable
    a = fft(data(neighbours, :), [], 2);
    output(i, :, :) = a;
end
toc; tocBytes(gcp);
I would be very grateful if someone could point me in the right direction. Thank you in advance!
tl;dr
Is it possible to avoid having a broadcast variable when a vector of indices is being used?
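(Editor's note: one workaround worth mentioning, not proposed in this thread, is parallel.pool.Constant from the Parallel Computing Toolbox. It does not make data a sliced variable, but it transfers the array to each worker only once per pool instead of once per parfor execution, which can cut repeated transfer overhead. A minimal sketch against the first code block above; note each worker still holds the full array, so per-worker memory use is unchanged:)

C = parallel.pool.Constant(data);   % copied to each worker once, reused across parfor calls
parfor y = radius+1:height-radius
    temp_r = zeros(width-2*radius, block_size, block_size, ntimes, 'single');
    for x = radius+1:width-radius
        % index the worker-local copy instead of the broadcast variable
        a = fft(C.Value(y-radius:y+radius, x-radius:x+radius, :), [], 3);
        temp_r(x-radius, :, :, :) = real(a);
    end
    result(y-radius, :, :, :, :) = temp_r;
end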
  2 Comments
Matt J on 30 Aug 2019
Edited: Matt J on 30 Aug 2019
The actual data matrix will be around 100x100x65536 (i.e. 2.6 GB vs 3.6 MB in this example).
I think you should probably decimate your time samples from 65536 down to maybe a few hundred. It shouldn't take that many samples just to fit a line ('poly1') unless you have seriously bad measurement noise.
pigeon on 2 Sep 2019
Hi Matt, the number of samples influences the resolution of the Fourier transform. The linear fits I get are greatly improved by increasing the sample size. But the suggestion is interesting, as I might try zero-padding the signal in the loop to reduce the amount of data sent to each worker. I will soon look into the effectiveness of this.
For clarification, I don't use all 2^16 samples for the linear fit, usually only the first 4 points, as afterwards the signal-to-noise ratio becomes low.
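(Editor's note: as a sketch of the zero-padding idea mentioned above, MATLAB's fft pads the input with trailing zeros when the requested transform length exceeds the input length, so a truncated time series can be transferred and padded on the worker. The sample counts below are hypothetical; zero-padding interpolates the spectrum rather than adding new information:)

nkeep = 2^10;                       % hypothetical number of samples actually sent to workers
nfft  = 2^16;                       % desired transform length
block = data(y-radius:y+radius, x-radius:x+radius, 1:nkeep);
a = fft(block, nfft, 3);            % fft zero-pads along dim 3 up to nfft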


Accepted Answer

Matt J on 30 Aug 2019
Edited: Matt J on 30 Aug 2019
I doubt you need any for-loops at all. Because your fft is along the 3rd dimension in all the sub-blocks, you could do all the FFT work at once with
dataFFT= real( fft(data,[],3) );
The rest of the operations involved in deriving "result" consist of extracting sliding 2D windows from dataFFT and copying them into a larger matrix. You should probably not be doing that, as it entails a lot of duplicate storage. What are you planning to do with the sub-blocks once you have them?
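(Editor's note: a minimal sketch of why the single FFT call above replaces the per-block FFTs in the question's first code block, whose variable names are assumed here: fft along dimension 3 acts independently on each (y,x) pixel, so any sub-block of the global result equals the FFT of that sub-block.)

dataFFT = real(fft(data, [], 3));   % one FFT over the full array
% For any window position (y, x), this slice of the global result:
block = dataFFT(y-radius:y+radius, x-radius:x+radius, :);
% equals the per-block computation from the original loop:
%   real(fft(data(y-radius:y+radius, x-radius:x+radius, :), [], 3))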
  7 Comments
Matt J on 2 Sep 2019
Edited: Matt J on 2 Sep 2019
which might result in a 5-10% performance improvement
I would expect much more improvement. The for-loop now runs for only N^2=400 iterations (everything else is vectorized) whereas previously you had 3600 iterations.
The problem with the code above is, however, that the whole dataFFT matrix is still sent to each worker and not just a slice of it, which results in severe RAM problems.
Only if you use 65536 time samples, but you shouldn't do that, as I mentioned in my comment above. Also, even if you were to succeed in slicing the dataFFT matrix in some way, you would still have the bigger problem of the sheer size of the resulting correlation data. It is 400 times the size of your initial data array, so with 2^16 time samples it would take around 1000 GB just to store.
the number of samples influences the resolution of the Fourier transform. The linear fits I get are greatly improved by increasing the sample size.
Maybe you need that resolution for the correlation calculation, but I doubt you need it for the Gaussian/linear fitting that follows. So, I think you need to discard the high resolution after the correlation calculations are complete. Below, I give an example modification of the code that does this.
As also illustrated in the code, an alternative to parfor that would solve the broadcast problems is to do the correlation calculations on the GPU (if you have a good graphics card). This would avoid data duplication and probably be much faster than parfor since all the required operations are very well GPU-optimized by the Parallel Computing Toolbox.
% Correlation calculations
dataFFT = real(fft(gpuArray(data), [], 3));
N = 2*radius + 1;
nkept = numel(1:100:size(data, 3));
correlations = zeros(N^2, size(data, 1), size(data, 2), nkept, 'like', dataFFT); % preallocate
for i = 1:N^2
    [yshift, xshift] = ind2sub([N, N], i);
    xyshift = [yshift, xshift] - (radius+1);
    tmp = ifft(dataFFT .* imtranslate(dataFFT, xyshift), [], 3, 'symmetric');
    correlations(i, :, :, :) = tmp(:, :, 1:100:end); % decimate by factor of 100
end
pigeon on 2 Sep 2019
I would expect much more improvement. The for-loop now runs for only N^2=400 iterations (everything else is vectorized) whereas previously you had 3600 iterations.
When I assume that the ifft call determines the runtime of the loop, I can use the following code. (I think this assumption is reasonable, as I suspect the time required for matrix slicing in the old method to be similar to that for circshift in the new method.)
width = 108; % assuming square image, i.e. width == height
radius = 10;
ifft_calls_new = (2*radius+1)^2; % ifft call for each pixel of the correlation block
ifft_calls_old = (width-radius)^2; % ifft call for each pixel in the image
ifft_size_new = width-radius; % cropped image size
ifft_size_old = (2*radius+1); % block size
tic;
for i = 1:ifft_calls_new/10 % divide by 10 to reduce the calculation time
    a = ifft(ones(2^12, ifft_size_new, ifft_size_new));
end
time_new = toc;
tic;
for i = 1:ifft_calls_old/10
    a = ifft(ones(2^12, ifft_size_old, ifft_size_old));
end
time_old = toc;
fprintf('Gain is %.2f%% \n', (time_new - time_old) / time_old * 100);
I then find that the new method is only about 1% faster on my machine. The reason it is not much faster is that the matrices for the ifft calls are much larger.
an alternative to parfor that would solve the broadcast problems is to do the correlation calculations on the GPU
This would indeed be an option. Using a GPU (GTX 1060, 6 GB), it was indeed faster than a parfor loop with broadcasting, but I thought the latter might be faster still if the broadcasting could be avoided.
Only if you use 65336 time samples, but you shouldn't do that, as I mentioned in my comment above
Agreed, that's why I actually do not save the correlations of all 65536 samples.
In my comment above I have the following line:
out = c(2:nbins+1, :, :) .* scale - 1;
Here, I only save the timepoints 2 to 5. That's why my result matrix is fairly small.


More Answers (0)

Release

R2018b
