Handling memory when working with very large data (.mat) files.

I am working with two 5D arrays (A5D and B5D) saved in a big_mat_file.mat file. The size of these arrays is specified in the code below. The total size of the big_mat_file.mat file is around 20 GB. I want to perform three simple operations on these matrices, as shown in the code. I have access to my university's computing cluster. When I run the following code with 120 workers and 400 GB of memory, I receive the following error:
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 395)
In parallel_function>distributed_execution (line 746)
In parallel_function (line 578)
Can someone please help me understand what is causing this error? Is it because of low memory? Is there any other way to do the following operations?
clear; clc;
load("big_mat_file.mat");
% it has two very huge 5D arrays "A5D" and "B5D", and two small arrays "as" and "bs"
% size of both A5D and B5D is [41 16 8 80 82]
% size of "as" is [1 80] and size of "bs" is [1 82]
xs = -12:0.1:12;
NX = length(xs);
ys = 0:0.4:12;
NY = length(ys);
total_iterations = NX * NY;
results = zeros(total_iterations, 41, 16, 8);
XXs = zeros(total_iterations, 1);
YYs = zeros(total_iterations, 1);
parfor idx = 1:total_iterations
    [ix, iy] = ind2sub([NX, NY], idx);
    x = xs(ix);
    y = ys(iy);
    term1 = 1./(exp(1/y*(A5D - x)) + 10); % operation 1
    to_integrate = B5D.*term1; % operation 2
    XXs(idx) = x;
    YYs(idx) = y;
    results(idx, :, :, :) = trapz(as, trapz(bs, to_integrate, 5), 4); % operation 3
end
XXs = reshape(XXs, [NX, NY]);
YYs = reshape(YYs, [NX, NY]);
results = reshape(results, [NX, NY, 41, 16, 8]);
clear A5D B5D
save('saved_data.mat','-v7.3');

Accepted Answer

Saurabh on 30 Aug 2024
Edited: Saurabh on 30 Aug 2024
It seems that you encounter this error while performing operations on large 5D arrays (around 20 GB in total) on your university's computing cluster.
A heterogeneous cluster environment could be one cause of this issue.
The following quote is from the system requirements for MATLAB Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval, spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
If this is not the case, then try giving each worker more memory per core: with 400 GB split across 120 workers, each worker is allocated only roughly 3-3.5 GB. If reducing the worker count solves the issue, then the workers must have had insufficient memory.
If that is the case, the MathWorks documentation on resolving out-of-memory errors has further troubleshooting steps.
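For example, here is a minimal sketch of requesting fewer, larger workers; the profile name "myCluster" is a placeholder for your own cluster profile:
% Sketch: fewer workers means a larger memory share per worker.
% "myCluster" is a placeholder; substitute your cluster profile name.
c = parcluster("myCluster");
pool = parpool(c, 40); % 40 workers instead of 120 -> roughly 10 GB each from 400 GB
% ... run the parfor loop here ...
delete(pool); % release the workers when finished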
I hope this helps.
  1 Comment
Luqman Saleem
Luqman Saleem on 31 Aug 2024
Thank you very much. It was a memory problem. Using a smaller number of workers worked.


More Answers (1)

Sam Marshalik on 30 Aug 2024
You are likely running out of memory on the workers. You are not using sliced input variables (see Sliced Variables - MATLAB & Simulink) to access the 5D matrices, so a complete copy of A5D and B5D is sent to every worker. Those arrays are likely big enough that the workers run out of memory on those machines. I would suggest running fewer workers (to give each worker access to more memory), using sliced input variables so that only part of each matrix is passed to a worker, or running on machines with more memory.
To test this theory, run your job and monitor memory usage on those machines; if this is the issue, you should see memory usage max out.
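Because every iteration uses A5D and B5D in full, they cannot easily be sliced over idx. As an illustration of a related memory-saving technique (a sketch, not part of the answer above), a parallel.pool.Constant lets each worker load the arrays from disk once and reuse them across iterations, so the client never transfers its copies to the pool; XXs and YYs are omitted for brevity:
data = parallel.pool.Constant(@() load("big_mat_file.mat")); % each worker loads the file once
xs = -12:0.1:12; NX = numel(xs);
ys = 0:0.4:12; NY = numel(ys);
total_iterations = NX * NY;
results = zeros(total_iterations, 41, 16, 8);
parfor idx = 1:total_iterations
    [ix, iy] = ind2sub([NX, NY], idx);
    d = data.Value; % struct with fields A5D, B5D, as, bs
    term1 = 1./(exp(1/ys(iy)*(d.A5D - xs(ix))) + 10);
    results(idx, :, :, :) = trapz(d.as, trapz(d.bs, d.B5D.*term1, 5), 4);
end
Note that each worker still needs enough memory to hold the full arrays plus the intermediate term1 array, so this complements running fewer workers per node rather than replacing it.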
