Can codistributed arrays be randomly sampled by workers in an SPMD block?
Hello, I am pretty new to using SPMD, but I think it is what I need for a big data application. I have a dataset that is too large to open on one core (~20 GB), so I am loading it in chunks on multiple cores and creating a codistributed array. That part is no problem. However, I would like to take random samples from the entire dataset. I can do this slowly on the client by using the gather function (Part 3 of the sample code). However, I will be doing this for quite a few iterations with a much larger dataset than shown below, so I want to know whether the workers that are holding the dataset open can take samples of the entire dataset inside an spmd block. This seems like it should be possible, but when I try to implement it I get communication errors between the workers (Part 2 of the sample code).
Here I have prepared some dummy code to explain my issue. I created four dummy datasets (a, b, c, d) that I load on four workers to create a codistributed array (Part 1). This part works well. The next spmd block (Part 2) is my attempt to have each worker take a random sample of 20 values from the codistributed array. However, when I do this I get communication errors. My overall goal is to have each worker take and store multiple random samples from the entire dataset and deliver them to the client at the end of the spmd block. Part 3 is how I am currently doing things: I hold the dataset open via a codistributed array and then run my sampling or bootstrapping on the client using the gather function. This works for small numbers of iterations, but will likely be really inefficient with my real dataset and the number of loops I need to run (~26,000). I feel I am missing an opportunity to speed things up with spmd. Any help would be appreciated. I am using MATLAB 2017b. Thanks!
% These are the variables I will be loading for this example, just for
% reference. My real files will be much larger rainfall datasets.
% a = rand(100,1);
% b = rand(100,1);
% c = rand(100,1);
% d = rand(100,1);
files = dir(); % Grab file names from current directory
parpool(4) % Open a parallel pool with 4 workers
tic
% Part 1: This spmd block loads the data on each of the workers
spmd
    datstruct = load(files(labindex+2).name); % The load function loads the file as a structure
    nam = fieldnames(datstruct); % Grabs the field names in the structure
    data = datstruct.(nam{1,1}); % Pulls the data out of the structure as a double array
    datstruct = []; % Clears the structure from memory
    % Create a codistributed array across workers
    codistr = codistributor1d(2,[1 1 1 1],[100 4]); % Creates 1D distribution scheme
    data = codistributed.build(data,codistr); % Creates codistributed array across the 4 workers
    data = data(:); % Want data in a vector for sampling
end
toc
% Part 2: My attempt to do sampling within the spmd block. It gives a
% communication error
spmd
    samples = datasample(data,20);
end
% Part 3: This is the way I am currently analyzing the data on the client. It is
% not ideal since I am not taking advantage of the cores that are holding
% the dataset open.
% This loop would be for random sampling
tic
perc = zeros(100,5); % One row per iteration of the loop below
for i = 1:100
    samdat = gather(datasample(data,1000));
    perc(i,:) = prctile(samdat,[50,75,95,99,99.9]);
end
toc
% This loop would be for bootstrapping at each grid point
tic
boot = zeros(2,5,100);
for i = 1:100
    bootdat = gather(data(data<0.5));
    boot(:,:,i) = bootci(1000,@(x)prctile(x,[50,75,95,99,99.9]),bootdat);
end
toc
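A minimal sketch of one way Part 2 might be restructured. It rests on an assumption that is not confirmed anywhere in this thread: that the communication error arises because each worker draws different random indices, while indexing into a codistributed array requires every worker to make the same call. The sketch avoids indexing the codistributed array at all. Every worker draws the same global indices from an identically seeded stream, keeps only the indices that fall in its own local part, and the pieces are then combined with gcat. The seed value 42 and the variable names are illustrative, and the code assumes that data is still a 1-D codistributed column vector after data(:).

```matlab
% Sketch only: sample 20 values from the whole codistributed vector inside spmd
spmd
    nSamp = 20;
    % Assumption: seeding identically on every worker gives identical indices
    s = RandStream('mt19937ar', 'Seed', 42);
    idx = randi(s, numel(data), nSamp, 1); % Global indices, drawn with replacement

    % Range of global indices stored on this worker
    codist = getCodistributor(data);
    [lo, hi] = globalIndices(codist, codist.Dimension);

    % Pick out the requested values that live on this worker
    localData = getLocalPart(data);
    mine = idx(idx >= lo & idx <= hi);
    localSample = localData(mine - lo + 1);

    % Combine the pieces so every worker holds the full sample
    samples = gcat(localSample, 1);
end
sampleOnClient = samples{1}; % samples is a Composite on the client
```

The combined sample comes back grouped by worker rather than in draw order, which should not matter for percentiles. For repeated sampling, the stream would be created once before the loop so that each iteration draws fresh indices.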
Answers (0)