How to parallelize when loading lots of data

62 views (last 30 days)
David K on 15 Apr 2021
Edited: David K on 20 Apr 2021
I have been struggling to speed up my MATLAB code, which loads a number of large data files into custom class objects. My two classes are DataGroup and DataObj: each DataObj loads a single independent .mat data file, and a DataGroup holds an array of DataObj so that information spread across the multiple files can be accessed from one place. The data is split up this way because every file contains a structure with a completely different set of fields, so there is no good way to combine them into a single file. Since .mat files are a compressed format, I believe a large portion of the load time is actually spent on decompression rather than on reading from the HDD.
Below is a simplified version of the code I am trying to speed up. My problem is that the parfor doesn't help and actually makes things worse, because of the large overhead of sending the large rawData back to the client. I know the overhead is the cause because I accidentally left my saveobj overload setting obj.rawData = [], and that significantly improved the speed. The name and ID properties were still being set properly, so I know the data was being loaded.
The saveobj overload matters because the way MATLAB transfers variables between workers (and presumably back to the client) is to effectively "save" them and then "load" them at the destination. Since DataObj is a custom class, my overloaded saveobj is called before each object is sent back to the client. I also tried overloading loadobj to re-load the data a second time; this preserved the information, but was slightly slower than the plain serial for loop.
Here are my timing results on a small dataset. The different parfor rows indicate whether the saveobj and loadobj overloads were enabled or commented out.
% MODE                                                | TIME
% for loop                                            | 16 s
% parfor with saveobj and loadobj (data loaded twice) | 18 s
% parfor with saveobj (data lost)                     | 7 s
% parfor with no overloads                            | 160 s
Is there a better way to parallelize this so it runs faster? I've considered loading a couple of key DataObjs first and then loading the rest in the background, updating the DataGroup.dataSet property in the background as well, but I'm not sure how to do that.
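Roughly, this is the structure I have in mind. It's an untested sketch, and keyIdx, restIdx, and futures are just placeholder names:
% Untested sketch: load a couple of key files up front, queue the rest with
% parfeval, and fill in DataGroup.dataSet from the client later on.
keyIdx  = [1 2];                                    % placeholder: files needed right away
restIdx = setdiff(1:numel(filenames), keyIdx);

allData = DataGroup(filenames(keyIdx));             % key files, loaded immediately

for k = 1:numel(restIdx)
    % one future per remaining file; the DataObj constructor runs on a worker
    futures(k) = parfeval(@DataObj, 1, filenames{restIdx(k)}); %#ok<SAGROW>
end

% Later, ideally only when the rest of the data is actually needed:
for k = 1:numel(restIdx)
    [~, obj] = fetchNext(futures);                  % returns when any future finishes
    allData.dataSet(end+1) = obj;                   % update happens on the client
end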
UPDATE:
I tried parfeval, but that didn't offer any speed benefits since I fetch the results immediately after the loop. Is there a way to give parfeval a class method to update an existing object? I was thinking I could put filler values in my DataObj and then use parfeval to call an object method that loads the data and updates the values in the background. However, when I tried that, the parfeval call wouldn't actually update the object values. I'm not sure how (or whether it's possible) to get it to communicate back to the client and update the original object in the background. See my comment below this post for the code I tried for this method.
run_script
% Data files are all .mat containing a single structure with many different fields
% The fieldnames are different for every data file
% Data files can range from a couple kB to over 1GB
filenames = {'file01.mat','file02.mat',...,'file100.mat'};
tic; allData = DataGroup(filenames); toc
DataGroup.m
classdef DataGroup < handle
    properties
        dataSet
    end
    methods
        function obj = DataGroup(filenames)
            % Build one DataObj per file in parallel, then store the array
            dataSet = repmat(DataObj.empty(), 1, length(filenames));
            parfor iFile = 1:length(filenames)
                dataSet(iFile) = DataObj(filenames{iFile});
            end
            obj.dataSet = dataSet;
        end
    end
end
DataObj.m
classdef DataObj < handle
    properties
        rawData
        name
        ID
        srcFile
    end
    methods
        function obj = DataObj(filename)
            obj.rawData = load(filename);
            names = fieldnames(obj.rawData);
            obj.name = names{1};
            obj.ID = parse(obj.name); % some basic character parsing
            obj.srcFile = filename;
        end
        function b = saveobj(obj)
            % Drop the heavy payload before the object is serialized
            b = obj;
            b.rawData = [];
        end
    end
    methods (Static)
        function obj = loadobj(b)
            % Re-load the data from the source file on deserialization
            obj = DataObj(b.srcFile);
        end
    end
end

Answers (1)

Gaurav Garg on 20 Apr 2021
Hi David,
As you have already noticed, communication overhead between the client and the workers is the probable cause of the slowdown in the parallel version.
You can try to speed the program up with parfeval (see the documentation), since you do not need the loads to run synchronously.
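For example, here is a minimal sketch of that pattern. It assumes a parallel pool is already running and simply reuses your DataObj constructor; the temporary cell array is only there to keep the original file order.
% Minimal parfeval sketch: queue one load per file, then collect the results
% as they complete instead of blocking on a single parfor.
futures(1:numel(filenames)) = parallel.FevalFuture;    % preallocate the future array
for iFile = 1:numel(filenames)
    futures(iFile) = parfeval(@DataObj, 1, filenames{iFile});
end

% ... the client is free to do other work here while the workers load ...

loaded = cell(1, numel(filenames));
for iFile = 1:numel(filenames)
    [idx, obj] = fetchNext(futures);    % returns as soon as any one load finishes
    loaded{idx} = obj;                  % idx preserves the original file order
end
dataSet = [loaded{:}];                  % back to a 1-by-N DataObj array
Note that each DataObj is still transferred from the worker back to the client when you fetch it, so your saveobj/loadobj overloads still run; the main benefit is that the client does not have to sit idle while the loads are in progress.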
You can also try generating C/C++ code with MATLAB Coder and timing the resulting MEX file. MEX files are generally faster, but in some cases they may not be fast enough, or some functions may not be supported for code generation. For example, in your case the load function is supported for code generation, but parse is not.
  1 Comment
David K on 20 Apr 2021
I tried parfeval, but that didn't offer any speed benefits since I fetch the results immediately after the loop. Is there a way to give parfeval a class method to update an existing object? I was thinking I could put filler values in my DataObj and then use parfeval to call an object method to load the data and update the values in the background. However when I tried that, the parfeval call wouldn't actually update the object values. I'm not sure how (or if it's possible) to get it to communicate back to the client to update the original object in the background. Here's a basic version of what I tried:
function obj = DataObj(filename)
    % Constructor fills in placeholder values and kicks off a background load
    obj.rawData = [];
    obj.info1 = 'TBD';
    obj.info2 = 'TBD';
    parfeval(@loadData, 0, obj, filename);   % loadData returns no outputs
end

function loadData(obj, filename)
    % Intended to run on a worker and update the handle object in place
    obj.rawData = load(filename);
    obj.info1 = parse(obj.rawData.info1);
    obj.info2 = parse(obj.rawData.info2);
end
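Would something like a parallel.pool.DataQueue be a reasonable way to get the data back to the client? Here is a rough, untested sketch of what I mean; loadAndSend and fillIn are just placeholder helpers, and it assumes dataSet has already been filled with placeholder DataObj entries:
% Untested sketch: workers send the loaded structs back through a DataQueue,
% and a client-side callback copies them into the existing handle objects.
dq = parallel.pool.DataQueue;
afterEach(dq, @(msg) fillIn(allData, msg));       % callback runs on the client

for iFile = 1:numel(filenames)
    parfeval(@loadAndSend, 0, dq, filenames{iFile}, iFile);   % fire and forget
end

function loadAndSend(dq, filename, index)
    raw = load(filename);                             % heavy work happens on a worker
    send(dq, struct('index', index, 'data', raw));    % ship the result back to the client
end

function fillIn(group, msg)
    % placeholder helper: group is the DataGroup handle, and dataSet(index)
    % is a placeholder DataObj created up front with filler values
    group.dataSet(msg.index).rawData = msg.data;
end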

Release: R2020b