Batch scheduler or pick up the run after wall time expires or checkpoints
Hi,
I have a MATLAB program which I submitted as an sbatch file. However, the wall time for the MATLAB license on the supercomputer I submitted to expires after 24 hours. My data is saved at each step, so I am not worried about losing data at any point. I am trying to find a way to resume the run where it left off in the for loop. For example, with loops like the ones below, if my code stops at i = 45000 and j = 30000 after 24 hours, how should I restart at i = 45001 and j = 30001? Any hints?
for i = 1:100000
    for j = 1:60000
        code....
        save(data)
    end
end
Answers (1)
Jan
on 26 Jan 2021
Edited: Jan
on 26 Jan 2021
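DataFile = 'C:\Temp\YourData.mat';
if isfile(DataFile)
    S = load(DataFile);   % load returns a struct with one field per saved variable
    Data = S.Data;
else
    Data.i0 = 1;
    Data.j0 = 1;
end
for i = Data.i0:100000
    Data.i0 = i;
    for j = Data.j0:60000
        Data.j0 = j;
        Data.value = code...
        save(DataFile, 'Data');
    end
    Data.j0 = 1;   % later outer iterations must start the inner loop at 1 again
end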
Saving the file 6'000'000'000 times will waste a lot of time. The following version loses the progress of the inner loop that was running when the job stopped, but is most likely much faster:
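DataFile = 'C:\Temp\YourData.mat';
if isfile(DataFile)
    S = load(DataFile);   % load returns a struct with one field per saved variable
    Data = S.Data;
else
    Data.i0 = 1;
end
for i = Data.i0:100000
    for j = 1:60000
        Data.value = code...
    end
    Data.i0 = i + 1;   % iteration i is complete, so a restart continues with i + 1
    save(DataFile, 'Data');
end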
A smarter option would be to save the file only once per minute, using clock and etime. Then you lose less than a minute of work per day, but need only the time for 1440 save commands.
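A minimal sketch of the time-based idea, under stated assumptions: it uses tic and toc instead of clock and etime (a swapped-in timer, not what the answer names), the file name, the 60 second interval, and the small loop sizes are placeholders, and the addition stands in for the real computation. The elapsed time is checked after each outer iteration, so a restart can lose up to one interval plus one outer iteration.

```matlab
% Sketch only: file name, interval, loop sizes and the "work" are placeholders.
DataFile = fullfile(tempdir, 'checkpoint_demo.mat');   % hypothetical location
saveInterval = 60;        % seconds between checkpoints (assumed value)
nOuter = 5;               % small stand-ins for 100000 and 60000
nInner = 1000;

if isfile(DataFile)
    S = load(DataFile);   % resume from the last checkpoint
    Data = S.Data;
else
    Data.i0 = 1;
    Data.value = 0;
end

lastSave = tic;           % start the checkpoint timer
for i = Data.i0:nOuter
    for j = 1:nInner
        Data.value = Data.value + 1;   % placeholder for the real computation
    end
    Data.i0 = i + 1;      % next outer iteration to run after a restart
    % Save when the interval has elapsed, and always after the final iteration.
    if toc(lastSave) >= saveInterval || i == nOuter
        save(DataFile, 'Data');
        lastSave = tic;
    end
end
```

Running the script a second time finds the checkpoint, loads Data.i0 = nOuter + 1, and skips the loop entirely.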
2 Comments
Jan
on 26 Jan 2021
Removing data with clear commands is usually a waste of time in MATLAB.
Adding a folder to the path only to access the local files is a bad design. Only Matlab's functions should be part of the path, but there is no reason to expand the path for accessing data files. Use absolute pathnames instead.
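As a small illustration of that advice, the following sketch builds a full file name with fullfile and passes it directly to save and load, without touching the MATLAB path. The folder and file names are hypothetical, and tempdir only stands in for whatever absolute data folder the real job uses.

```matlab
% Hypothetical folder and file names, for illustration only.
dataFolder = fullfile(tempdir, 'project_data');     % stands in for an absolute data folder
if ~isfolder(dataFolder)
    mkdir(dataFolder);
end
maskFile = fullfile(dataFolder, 'mask_demo.mat');   % full path, so no addpath is needed

ocean_mask = true(3, 4);                            % placeholder data
save(maskFile, 'ocean_mask');

S = load(maskFile);                                 % the file is found by its full path
```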
I cannot run your code, but it looks like the part "mask.ocean_mask(lat_band(lat),lon)" might consume a considerable part of the processing time. Do you have good reasons to access the MAT file so often? Is it too large to fit into RAM?
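If the mask does fit into memory, reading it once and indexing the in-memory copy avoids the repeated disk access. The sketch below assumes that mask in the original code is a matfile object, which the thread does not confirm. The file name, the array size, and the row index 90 are made up for the demonstration.

```matlab
% Hypothetical file and variable contents; the real mask file is not shown in the thread.
maskFile = fullfile(tempdir, 'ocean_mask_demo.mat');
ocean_mask = rand(180, 360) > 0.3;        % placeholder logical mask
save(maskFile, 'ocean_mask', '-v7.3');    % -v7.3 lets matfile read parts of a variable

% Slow pattern: every indexing operation reads from disk.
m = matfile(maskFile);
countDisk = 0;
for lon = 1:360
    countDisk = countDisk + m.ocean_mask(90, lon);
end

% Faster pattern: read the array once, then index the copy in memory.
S = load(maskFile, 'ocean_mask');
mask = S.ocean_mask;
countMem = 0;
for lon = 1:360
    countMem = countMem + mask(90, lon);
end
```

Both loops compute the same count. Only the second one keeps the file access out of the loop.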
"There is no pre-allocation of memory like data.mat." - I've shown you a method for storing the loop indices so that restarting the script resumes the loops where they stopped before. This is not a pre-allocation. Nevertheless, this method cannot work reliably in your case, because you append data to open files. If the scheduler of the cloud system cancels your process abruptly, there is no guarantee that the open files are left in a consistent state. There is no way to modify your code to make it safely stoppable.
As far as I can see, you need a completely different approach. Opening 366 files simultaneously and the time-consuming access to the MAT file might be the main problems. If "code" does not hide any huge computations, the code should run in far less than 24 hours. So instead of changing the code to tolerate a hard stop after 24 hours of computing time, it is much better to improve the code so that it runs efficiently. Currently it seems to spend its time only on stressing the disk.