Inevitable broadcast variable in parfor?
Hi,
I want to perform some calculations in a parfor loop using the neighbouring data in a 3D matrix. Currently I am using the following code, which has the problem that data is a 'broadcast variable'. This slows down the calculations and causes memory problems. (This is a small-scale example where 18.5 MB is sent to the workers. The actual data matrix will be around 100x100x65536, i.e. 2.6 GB vs 3.6 MB in this example.)
rng('default');
height = 60;
width = 60;
ntimes = 2^8;
radius = 10;
block_size = 2*radius+1;
data = randn(height, width, ntimes, 'single'); % create some random data
output = zeros(height, width, ntimes, 'single');
result = zeros(height-2*radius, width-2*radius, block_size, block_size, ntimes, 'single');
ticBytes(gcp); tic;
parfor y = radius+1:height-radius
    % create a temporary array for the parfor loop
    temp_r = zeros(width-2*radius, block_size, block_size, ntimes, 'single');
    for x = radius+1:width-radius
        % Here I calculate the fft of a block surrounding the pixel of interest.
        % Since this indexing violates the slicing rules, data is a broadcast variable.
        a = fft(data(y-radius:y+radius, x-radius:x+radius, :), [], 3);
        % For simplicity's sake I store only the fft (in reality I do more calculations).
        temp_r(x-radius, :, :, :) = real(a);
    end
    result(y-radius, :, :, :, :) = temp_r;
end
toc; tocBytes(gcp);
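For what it's worth, one way to make the variable sliced is to pre-copy each overlapping row band into a cell array and index it only with the loop variable; this does satisfy the slicing rules, but it duplicates the 2*radius halo rows of every band, so I am not sure it scales (sketch):
% Pre-slice overlapping row bands into a cell array (duplicates the halo rows).
% 'bands' is sliced in the parfor because it is indexed by the loop variable only.
nbands = height - 2*radius;
bands = cell(nbands, 1);
for k = 1:nbands
    bands{k} = data(k:k+2*radius, :, :);   % (2*radius+1) x width x ntimes
end
parfor k = 1:nbands
    band = bands{k};                       % sliced, not broadcast
    temp_r = zeros(width-2*radius, block_size, block_size, ntimes, 'single');
    for x = radius+1:width-radius
        a = fft(band(:, x-radius:x+radius, :), [], 3);
        temp_r(x-radius, :, :, :) = real(a);
    end
    result(k, :, :, :, :) = temp_r;
end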
To overcome this problem I considered using linear indexing. However, I still cannot figure out how to avoid the broadcasting. Due to the large size of the actual data, I think that splitting the data into (2*radius+1)-sized blocks for every single pixel beforehand is not feasible.
rng('default');
height = 60;
width = 60;
ntimes = 2^8;
radius = 10;
data = randn(height, width, ntimes, 'single'); % create some random data
data = reshape(data, [height*width, ntimes]);
% determination of linear indices
[cols, rows] = meshgrid(radius+1:height-radius, radius+1:width-radius);
linear_ids = sub2ind([height, width], rows, cols);
% create an array with linear ids of neighbouring pixels
[x,y] = meshgrid(radius*(-height-1):height:radius*height-radius, 0:radius*2);
mask = x + y;
% the neighbour matrix has 2*radius+1 columns (the neighbours)
% and the rows are the different pixels under study
neighbour_matrix = bsxfun(@plus, linear_ids(:), mask(:)');
output = zeros(length(cols) * length(rows),(2*radius+1)^2, ntimes);
ticBytes(gcp); tic;
parfor i = 1:length(cols) * length(rows)
    neighbours = neighbour_matrix(i, :);
    % calculate the fft using the neighbours here; data is still a broadcast variable
    % -- how do I make data a sliced variable?
    a = fft(data(neighbours, :), [], 2);
    output(i, :, :) = a;
end
toc; tocBytes(gcp);
I would be very grateful if someone could point me in the right direction. Thank you in advance!
tl;dr
Is it possible to avoid having a broadcast variable when a vector of indices is being used?
2 Comments
Matt J on 30 Aug 2019 (edited)
The actual data matrix will be around 100x100x65536 (i.e. 2.6 GB vs 3.6 MB in this example).
I think you should probably decimate your time samples from 65536 down to a few hundred. It shouldn't take that many samples just to fit a line ('poly1') unless you have seriously bad measurement noise.
Accepted Answer
Matt J on 30 Aug 2019 (edited)
I doubt you need any for-loops at all. Because your fft is along the 3rd dimension in all the sub-blocks, you can do all the FFT work in a single call:
dataFFT = real( fft(data, [], 3) );
The rest of the operations involved in deriving "result" consist of extracting sliding 2D windows from dataFFT and copying them into a larger matrix. You should probably not do that, as it entails a lot of duplicate storage. What are you planning to do with the sub-blocks once you have them?
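To see why a single call suffices: the FFT along the 3rd dimension is computed independently for each (y,x) pixel, so the FFT of any sub-block is just a window into dataFFT (pixel coordinates below chosen arbitrarily for illustration):
y = 30; x = 30;  % arbitrary pixel of interest
blockFFT = dataFFT(y-radius:y+radius, x-radius:x+radius, :);
% identical to real(fft(data(y-radius:y+radius, x-radius:x+radius, :), [], 3))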
7 Comments
Matt J on 2 Sep 2019 (edited)
which might result in a 5-10% performance improvement
I would expect much more improvement. The for-loop now runs for only N^2=400 iterations (everything else is vectorized) whereas previously you had 3600 iterations.
The problem with the code above is, however, that the whole dataFFT matrix is still sent to each worker and not just a slice of it, which results in severe RAM problems.
Only if you use 65536 time samples, but you shouldn't do that, as I mentioned in my comment above. Also, even if you were to succeed in slicing the dataFFT matrix in some way, you would still have the bigger problem of the sheer size of the resulting correlation data. It is roughly 400 times the size of your initial data array, so with 2^16 time samples it would take around 1000 GB just to store.
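(The arithmetic behind that estimate, assuming single precision, i.e. 4 bytes per element:)
N = 2*radius + 1;                    % 21, so N^2 = 441 shifts (roughly 400)
bytes = N^2 * 100 * 100 * 2^16 * 4;  % shifts x height x width x time samples x bytes
bytes / 2^30                         % about 1077 GB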
The number of samples influences the resolution of the Fourier transform. The linear fits I get are greatly improved by increasing the sample size.
Maybe you need that resolution for the correlation calculation, but I doubt you need it for the Gaussian/linear fitting that follows. So, I think you need to discard the high resolution after the correlation calculations are complete. Below, I give an example modification of the code that does this.
As also illustrated in the code, an alternative to parfor that would solve the broadcast problems is to do the correlation calculations on the GPU (if you have a good graphics card). This would avoid data duplication and probably be much faster than parfor since all the required operations are very well GPU-optimized by the Parallel Computing Toolbox.
% Correlation calculations
dataFFT = real(fft(gpuArray(data), [], 3));
N = 2*radius + 1;
ntimes_dec = numel(1:100:ntimes); % number of retained (decimated) time samples
correlations = zeros(N^2, height, width, ntimes_dec, 'single', 'gpuArray'); % preallocate
for i = 1:N^2
    [yshift, xshift] = ind2sub([N, N], i);
    xyshift = [yshift, xshift] - (radius + 1);
    tmp = ifft(dataFFT .* imtranslate(dataFFT, xyshift), [], 3, 'symmetric');
    correlations(i, :, :, :) = tmp(:, :, 1:100:end); % decimate by a factor of 100
end
correlations = gather(correlations); % bring the result back to the CPU