Move from parfor to parfeval?
I have a large simulation that runs on an LSF cluster that supports Parallel Computing Toolbox. Right now, the meat of the effort is in a parfor loop:
% Loop over cells of storm
celllist = find(~gotResult);
stormtmp = storm(celllist); % storm is array of handle-type classes
parfor i = 1:length(celllist)
    icell = celllist(i);
    fprintf('Cell %d...\n', icell);
    matfileObj = poolData.Value; %#ok<PFBNS>
    ofac = zeros(sizevec,'single');
    ofac = stormtmp(i).obsMatrix(grid,ofac,geom,rparms);
    matfileObj.gotResult(1,icell) = true;
    matfileObj.testOut(1,icell) = {ofac};
end
poolData is a parallel.pool.Constant holding a matfile object that is specific to each worker. At the end of the loop, I consolidate the results, clear the temporary files, and go to the next iteration of the model. This gives me robustness against cluster crashes, which can be distressingly common when the total job takes a month (hence the check of gotResult at the start; I can have a partial result from before a crash).
The primary annoyance with parfor is that the allocation of work units to workers happens rigidly at the start. With 60+ workers on 5-10 computers in a shared facility, there is no guarantee that they take a similar amount of time to finish. I find that the last 25% or more of the execution time of each iteration is spent waiting for a dwindling number of my workers to finish their assignments.
I've read the documentation for parfeval, and it seems to give me a way of managing each execution more carefully, but it's too convoluted for me to see how I get there. Any tips? It would seem that I would start with a find() to get the first N cells (N = number of workers) that need completing, and then enter a while loop using afterEach() where I could check for a valid result, then get the next cell on the list and assign it? Maybe I can do it with one main while loop and just remove entries from celllist as they complete? My head is spinning...
1 Comment
Jeff Miller
on 25 Jan 2022
So the problem is that lots of workers are sitting idle while waiting for the slow ones to finish their assignments in this parfor loop? Instead of waiting here, you'd like to have them start on their assignments for the next parfor loop for the next "iteration of the model".
If that's right, maybe a simpler approach is to change this parfor loop so that it runs through all the cell/model combinations, something like this:
parfor i = 1:length(celllist)*NofModelIterations
For each i you'd need a little logic to work out which model and which cell you wanted, plus invoke the right model, but that might not be hard.
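As a rough illustration of the "little logic" mentioned above, the sketch below maps a single linear loop index back to a (cell, model iteration) pair with ind2sub. The names nCells, nModelIterations, and pairs are hypothetical, and a plain for loop stands in for parfor so the snippet runs without a pool. It only makes sense if the model iterations do not depend on each other's results.

```matlab
% Hypothetical sizes, for illustration only.
nCells = 4;             % stand-in for length(celllist)
nModelIterations = 3;   % stand-in for NofModelIterations

pairs = zeros(nCells*nModelIterations, 2);
for i = 1:nCells*nModelIterations   % would be parfor in the real code
    % Column-major decomposition: the cell index varies fastest.
    [iCellPos, iModel] = ind2sub([nCells, nModelIterations], i);
    pairs(i,:) = [iCellPos, iModel];
end
```

Each row of pairs then holds the position in celllist and the model iteration that loop index i would work on.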
Accepted Answer
Edric Ellis
on 26 Jan 2022
To run using parfeval, you basically need to pull out the body of your parfor loop into a function, something like this:
celllist = find(~gotResult);
stormtmp = storm(celllist); % storm is array of handle-type classes
futures(1:length(celllist)) = parallel.FevalFuture; % preallocate the array of futures
for i = 1:length(celllist)
    % Schedule computation for each index in celllist. Each individual
    % function evaluation is executed separately on the workers.
    futures(i) = parfeval(@oneComputation, 0, celllist(i), stormtmp(i), poolData, ...
        sizevec, grid, geom, rparms);
end
% You could simply wait for completion like this:
wait(futures);

function oneComputation(icell, stormEl, poolData, sizevec, grid, geom, rparms)
    % Variables that parfor broadcast automatically (sizevec, grid, geom,
    % rparms) must be passed in explicitly here.
    matfileObj = poolData.Value;
    ofac = zeros(sizevec,'single');
    ofac = stormEl.obsMatrix(grid,ofac,geom,rparms);
    matfileObj.gotResult(1,icell) = true;
    matfileObj.testOut(1,icell) = {ofac};
end
Note that I simply added a call to wait(futures) after scheduling the work; as I understand it, the worker results are all stored in the matfileObj.
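If you would rather react to each computation as it finishes instead of blocking on wait, one option is fetchNext, which returns the next unread completed future in completion order. The following is a minimal sketch under stated assumptions, not the poster's actual code: workFcn and cellIds are hypothetical stand-ins for the real per-cell computation and celllist, a parallel pool is assumed to be available, and the function returns one output so there is something to fetch. As far as I understand, queued parfeval requests are handed to workers as they become idle, so no explicit "assign the next cell" loop should be needed.

```matlab
% Sketch only: workFcn and cellIds are placeholders, not the original code.
workFcn = @(k) k.^2;     % stand-in for the real per-cell computation
cellIds = [3 7 8 12];    % stand-in for celllist
n = numel(cellIds);

% Queue everything up front; the pool runs the futures on its workers.
futures(1:n) = parallel.FevalFuture;
for i = 1:n
    futures(i) = parfeval(workFcn, 1, cellIds(i));
end

results = zeros(1, n);
done = false(1, n);
for k = 1:n
    % fetchNext blocks until some unread future has finished and returns
    % its index in the futures array along with its output.
    [idx, value] = fetchNext(futures);
    results(idx) = value;
    done(idx) = true;
    fprintf('Cell %d finished (%d of %d)\n', cellIds(idx), k, n);
end
```

Because done is updated on the client as each result arrives, it could play the role of gotResult if the bookkeeping is moved off the workers.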
It might be worth taking this approach a step further. If each worker computation takes a "long" time, then you might be better off using batch jobs. This will share resources on your cluster better because you don't need to keep a parallel pool running, with possibly-idle workers. The API to batch is similar to the API to parfeval, and specifies a single function evaluation on a worker. One difference is that it doesn't support parallel.pool.Constant, so you'd need to build the matfileObj directly on the worker.
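A minimal sketch of the batch variant, assuming the default cluster profile returned by parcluster points at the LSF cluster. The anonymous function and cellIds are hypothetical stand-ins; a real task function would open its own matfile on the worker (for example with matfile(filename,'Writable',true)) rather than rely on parallel.pool.Constant.

```matlab
% Sketch only: the task function and cellIds are placeholders.
c = parcluster;          % default cluster profile (assumed to be the LSF cluster)
cellIds = [3 7 8 12];    % stand-in for celllist
n = numel(cellIds);

% Submit one independent batch job per cell; no parallel pool is held open.
jobs = cell(1, n);
for i = 1:n
    jobs{i} = batch(c, @(k) k.^2, 1, {cellIds(i)});
end

results = zeros(1, n);
for i = 1:n
    wait(jobs{i});                  % block until this job has finished
    out = fetchOutputs(jobs{i});    % cell array of the job's outputs
    results(i) = out{1};
    delete(jobs{i});                % remove the job's data from the cluster
end
```

Each job is scheduled by the cluster on its own, so a slow cell ties up only the worker that runs it.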