What is the correct way to save a large MATLAB structure?

Owen
Owen on 20 Nov 2024 at 8:38
Commented: Steven Lord on 21 Nov 2024 at 16:40
I have a MATLAB structure which is just over 21GB in memory (from whos) and when I save this to a MAT file with the "-v7.3" and "-nocompression" flags it takes well over an hour (on a high performance workstation with an NVMe SSD) and I get a file which is 77GB on disk. I understand that there is some overhead in saving to a MAT file and that "-nocompression" will result in larger files than with compression (but I gave up after about 3 hours waiting for a compressed version to save), but how can 56GB of "overhead" be considered acceptable?
I only need to save this one structure and I won't be adding any other data or modifying the MAT file, so all of the additional features of the v7.3 format are of no use to me, I just need support for >2GB variables. I attempted to use the undocumented getByteStreamFromArray function to get a byte array I can just dump to a file but this just returned "Error during serialization".
Am I somehow missing a "correct" way to do this efficiently? Or are my only options to either split my data into a bunch of <2GB variables to save in a v7 format or write my own serializer? I appreciate that doing either of these isn't exactly a massive job, I'm just very surprised there isn't better native support for large files!

Answers (2)

Matt J
Matt J on 20 Nov 2024 at 16:22
Edited: Matt J on 20 Nov 2024 at 16:44
but how can 56GB of "overhead" be considered acceptable?
It depends on what your struct contains. Field data containing handle objects, for example, will give a deceptively small memory total according to whos() because only the handle, and not the data it points to, is counted. However, when you save to a .mat file, the entirety of the data pointed to by the handle will be cloned, resulting in a much larger file. Example:
s.h=gcf;
whos('s').bytes
ans =
176
save file1 s
dir('file1.mat').bytes
ans =
1836
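If you want to check whether this applies to your struct, a quick sketch (not from the original answer, and assuming a scalar struct s) is to test each top-level field for handle objects:
% true for every field whose value is a handle object (its real size is hidden from whos)
isHandleField = structfun(@(f) isa(f, 'handle'), s)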
Or are my only options to either split my data into a bunch of <2GB variables to save in a v7 format
It is hard for me to imagine why one would ever want 21GB of data in a single file. It would block off a huge chunk of contiguous disk space and it would take forever to load.
  2 Comments
Owen
Owen on 21 Nov 2024 at 15:34
Good point about handles, but that's not my issue as the fields are all conventional arrays (mostly numerical with the occasional string).
The data itself is fairly unstructured so there's no obvious way to partition it off into separate files and then only load the files as needed. I also have a similar sized text file (19GB) that I work with regularly and that loads up with readlines in around 2 minutes, whereas the MAT takes 30 minutes.
The structure I'm using behaves very well in memory, allowing me to perform the necessary operations on it extremely efficiently. So if I were to split it up into multiple files it would just need to be recombined again on loading to achieve the same functionality. I appreciate this is fairly trivial to do, I just find it strange that I have to do it myself and there isn't some sort of "dump this data to a binary file" function built in!
Steven Lord
Steven Lord on 21 Nov 2024 at 16:40
What's the general layout of the struct? Is it a scalar struct with a few large fields, is it a scalar struct with many small-to-medium fields, is it a non-scalar struct, etc.? This could impact how much overhead there is.
As an analogy, consider an egg carton. You could have an egg carton that wraps each egg in a small cardboard box of its own and then ties all those small boxes together (a non-scalar struct with one element per egg and one field named egg in each element) or you could have one that stores each egg in a cup of its own, but does not completely enclose the egg like the first example (a regular array.) Both store the same eggs, but one uses more material (overhead) and takes longer to access the egg/data.
Your suggestion of "dump this data to a binary file" is interesting, but would you expect to be able to dump the data (which I assume you mean something like the raw contents of memory) in one release of MATLAB and read in that dumped data in a different release of MATLAB? [And even if you don't, do you think other users would expect that to be a requirement for that type of feature and be annoyed/angry if it didn't support that workflow?] If so, we would need to very carefully consider how any internal change to how structs (or more generally, any arrays) are organized in memory would affect this dumping/reading process.



Rahul
Rahul on 21 Nov 2024 at 8:32
Hi Owen,
The issue you're encountering stems from the design of MATLAB's MAT-file formats and the inherent inefficiencies of the -v7.3 format for your specific use case.
  • With the -v7.3 flag you can store variables larger than 2GB; the format is HDF5-based and applies compression by default.
  • Without the -v7.3 flag (e.g. the default -v7 format) the data is still compressed (-v6 and earlier are not), but no individual variable can exceed 2GB.
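For reference, the formats are chosen with flags to save; a minimal sketch, assuming your variable is called myStruct:
% HDF5-based format, supports variables larger than 2GB; -nocompression trades file size for save speed
save('data_v73.mat', 'myStruct', '-v7.3', '-nocompression');
% older compressed format, usually more compact, but every variable must stay under 2GB
save('data_v7.mat', 'myStruct', '-v7');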
MAT-file structure: The -v7.3 format uses HDF5 as its backend, which is highly versatile but not optimized for cases where you have a single, large variable. HDF5 is designed for general-purpose storage, including metadata and other overheads that can lead to excessive file sizes.
Serialization limitations: Large and complex data structures like struct can incur significant overhead because every field and subfield is treated as a separate dataset in HDF5.
getByteStreamFromArray is limited to serializing objects that MATLAB's internal serializer can handle. Structures or arrays with greater than 4 GB of data often hit limitations in MATLAB's serialization mechanism.
Some of the possible solutions that could resolve this issue are as follows:
Split into multiple variables and use -v7 format
  • If feasible, divide your large structure into several smaller variables, each <2GB.
  • Save these in the older -v7 format, which is more space-efficient for such cases.
fields = fieldnames(myStruct);
for i = 1:numel(fields)
    fieldData = myStruct.(fields{i});   % pull out just this field so only it is saved
    save(['part_' fields{i} '.mat'], 'fieldData', '-v7');
end
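Alternatively, save has a documented '-struct' option that writes each field of a scalar struct as a separate variable in a single file, which avoids the loop and makes recombining trivial (a sketch, assuming each field stays under the 2GB per-variable limit):
% save every field of myStruct as its own variable in one -v7 file
save('parts.mat', '-struct', 'myStruct', '-v7');
% reassemble later: load returns a struct with the same field names
myStruct = load('parts.mat');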
Write a custom serializer
  • If the structure doesn't contain complex objects, you can recursively serialize it into a binary file with custom MATLAB code.
fid = fopen('large_struct.bin', 'w');
fwrite(fid, myStruct.someField, class(myStruct.someField)); % fwrite cannot write a struct directly; write each field in turn (someField is a placeholder name)
fclose(fid);
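A fuller sketch of this idea (my own illustration, assuming a scalar struct whose fields are all numeric or char arrays) writes a small header per field so the file can be read back later:
% minimal per-field serializer; string/object fields would need extra handling
function dumpStruct(s, filename)
    fid = fopen(filename, 'w');
    names = fieldnames(s);
    fwrite(fid, numel(names), 'uint32');            % number of fields
    for k = 1:numel(names)
        data = s.(names{k});
        writeString(fid, names{k});                 % field name
        writeString(fid, class(data));              % class, e.g. 'double'
        fwrite(fid, ndims(data), 'uint32');         % number of dimensions
        fwrite(fid, size(data), 'uint64');          % dimensions
        fwrite(fid, data, class(data));             % raw data
    end
    fclose(fid);
end
function writeString(fid, str)
    fwrite(fid, numel(str), 'uint32');
    fwrite(fid, str, 'char');
end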
Use low-level HDF5 tools
  • If you need to stick to an HDF5-based file, you can use MATLAB's low-level HDF5 functions to write the fields directly as datasets, without the MAT-file overhead.
% h5create/h5write work on numeric arrays, not structs, so write each field as its own dataset (myField is a placeholder name)
h5create('large_struct.h5', '/myStruct/myField', size(myStruct.myField), 'Datatype', class(myStruct.myField));
h5write('large_struct.h5', '/myStruct/myField', myStruct.myField);
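Looping over all (numeric) fields might look like the following sketch:
flds = fieldnames(myStruct);
for k = 1:numel(flds)
    data = myStruct.(flds{k});
    h5create('large_struct.h5', ['/myStruct/' flds{k}], size(data), 'Datatype', class(data));
    h5write('large_struct.h5', ['/myStruct/' flds{k}], data);
end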
If performance is your priority and you don't need the -v7.3 features, you can split the data into smaller parts and use the -v7 format.
Moreover, if your workstation supports parallel computing, you can consider using MATLAB's Parallel Computing Toolbox to parallelize the saving process. This might help speed up the process, especially if your structure can be split into independent parts.
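A rough sketch of that idea (my own illustration, not a tested recommendation): split the fields up front and save each piece on a worker. Note that save cannot be called directly inside parfor because of transparency rules, so it has to be wrapped in a small helper function, and whether this actually speeds anything up depends on how well a single NVMe drive handles parallel writes.
flds = fieldnames(myStruct);
parts = cellfun(@(f) myStruct.(f), flds, 'UniformOutput', false);   % sliced per worker
parfor k = 1:numel(flds)
    saveField(['part_' flds{k} '.mat'], flds{k}, parts{k});
end
function saveField(filename, name, value)
    tmp.(name) = value;                 % rebuild a one-field struct
    save(filename, '-struct', 'tmp', '-v7');
end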
To know more about the low-level HDF5 functions used in the above code, refer to the MATLAB documentation for h5create and h5write.
Best!
  1 Comment
Owen
Owen on 21 Nov 2024 at 15:41
Thanks for the detailed examples, I'm going to go the route of splitting up the structure into smaller parts. Unfortunately there's no logical way to partition the data (as it's very unstructured) so I'll need to recombine it back into the large structure on loading. I will write some custom load and save functions to achieve this for me.
I am still a bit surprised there isn't better large file support directly in MATLAB itself. Once I've got that 21GB in memory (along with numerous other pieces of data, some of which are also many GB!), MATLAB makes working with them very easy and extremely efficient. So not having an efficient way to save/load it feels like an oversight, but maybe that's just me!

