Is it possible to create a sparse binary (.bin) file on disk?

I have a project where I would like to save my results to a binary (.bin) file that is stored on disk. Results need to be saved as they are generated (so that memory can be cleared), but the order in which these results are added to the binary file is not necessarily sequential (e.g., first I write to bytes 1-100, then 1001-1100, then 301-400, etc.).
In order to write non-sequentially to a binary file, I believe that file needs to be pre-allocated on the disk in some form or another. Is it possible to create a "sparse" binary file that has an area on disk set aside but which does not require writing zeros to every bit in the .bin file? I know how many bytes the file will take up when I am done saving to it, so this isn't a problem. Alternatively, is there a way for me to write non-sequentially to a binary file without pre-allocating it first?
Thanks.

Accepted Answer

Anthony Barone on 25 May 2018
Edited: Anthony Barone on 25 May 2018
In case anyone comes across this question looking for the same thing... at some point in the last year I figured out a much better way to do this. Make a system call to one of the following:
fallocate (Linux/UNIX - create or extend file)
fsutil file createnew (Windows - create file)
fsutil file seteof (Windows - extend file)
mkfile -n (MacOS - create file)
I haven't figured out how to extend an existing file on macOS, but since this is a very unusual use case for me, I have it set up to either zero-write to the end of the file, or to read the data, delete the file, allocate a larger one, and re-write the data when a file on macOS needs to be sparse-extended.
This is effectively instant, since it is true write-less allocation. For example, as a test I just allocated a 4 GB file in 0.05 seconds.
That said, writing non-sequentially to a file like this can be very slow, so you might be better off padding with zeros and writing data to the end of the file on the fly as needed, but write-less allocation is possible to implement from within MATLAB.
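For reference, here is a rough sketch of how these calls might be wrapped from MATLAB. The helper name allocateFile is just illustrative, and the exact command flags should be double-checked for your OS version:
function allocateFile(File, numBytes)
  % Sketch only: reserve numBytes on disk for File without writing zeros,
  % using the OS tools listed above via a system call.
  if ispc
    cmd = sprintf('fsutil file createnew "%s" %d', File, numBytes);
  elseif ismac
    cmd = sprintf('mkfile -n %d "%s"', numBytes, File);
  else  % Linux/UNIX
    cmd = sprintf('fallocate -l %d "%s"', numBytes, File);
  end
  [status, msg] = system(cmd);
  if status ~= 0
    error('allocateFile:failed', 'Allocation failed: %s', msg);
  end
end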

More Answers (2)

Jan on 13 Mar 2017
You can use this to expand (or shrink) a file efficiently: FEX: FileResize. It is twice as fast as appending zeros with fwrite.
function InsertData(File, Data, Format, Pos)
  fid = fopen(File, 'r+');
  if fid == -1
    error('*** %s: Cannot open file: %s', mfilename, File);
  end
  fseek(fid, 0, 'eof');     % Spool to end
  Len = ftell(fid);         % Current length of the file in bytes
  if Pos > Len
    FileResize(File, Pos);  % Grow the file without writing zeros
  end
  fseek(fid, Pos, 'bof');   % Move to the insertion point
  fwrite(fid, Data, Format);
  fclose(fid);
end
If multiple workers write to the same file... Hm. I'm not sure what happens when two workers access the same file and one writes into a section that is currently being expanded by the other.
What about inventing your own "sparse" file format?
function InsertData(File, Data, Format, Pos)
  fid = fopen(File, 'a');
  if fid == -1
    error('*** %s: Cannot open file: %s', mfilename, File);
  end
  Header = [Pos, ndims(Data), size(Data)];  % Record target position and array shape
  fwrite(fid, Header, 'uint64');
  fwrite(fid, Data, Format);
  fclose(fid);
end
A method for reading the blocks back or creating the full file in a post-processing step would be equally easy. The file is read or spooled through in blocks afterwards, but this will not be dramatically slower.
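A rough sketch of such a post-processing step, assuming every block was written by the function above and with one known Format for all blocks (e.g. 'double'); the name ExpandSparseFile is just illustrative:
function ExpandSparseFile(SparseFile, FullFile, Format)
  % Sketch: rebuild the full binary file from the block-wise format above.
  % Assumes all blocks share the same Format, e.g. 'double'.
  fidIn  = fopen(SparseFile, 'r');
  fidOut = fopen(FullFile, 'w');
  if fidIn == -1 || fidOut == -1
    error('*** %s: Cannot open files.', mfilename);
  end
  while true
    Pos = fread(fidIn, 1, 'uint64');      % Target byte offset of this block
    if isempty(Pos)                       % End of the sparse file reached
      break;
    end
    nDim = fread(fidIn, 1, 'uint64');
    Siz  = fread(fidIn, nDim, 'uint64');
    Data = fread(fidIn, prod(Siz), ['*', Format]);
    fseek(fidOut, 0, 'eof');
    Len = ftell(fidOut);
    if Pos > Len                          % Pad with zeros up to the target offset
      fwrite(fidOut, zeros(1, Pos - Len, 'uint8'), 'uint8');
    end
    fseek(fidOut, Pos, 'bof');
    fwrite(fidOut, Data, Format);
  end
  fclose(fidIn);
  fclose(fidOut);
end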
Anthony Barone on 13 Mar 2017
Thanks for suggesting FileResize. I will have to experiment to see whether it works correctly with multiple workers.
As far as making my own type of sparse file - there is a specific format for the file I am writing to that will allow it to be used in other applications (in particular .segy, which stores binary data along with a pre-defined list of header information). Making my own format would just require me to re-format it into the desired format when the code finishes, and as such wouldn't save me any time or trouble.
That said, even if I didn't have a target format I'm not sure this would be a good idea. The data is being written in such a way that sequential blocks of information are likely to be loaded together when you are loading part of the data (they represent data from locations that are physically close to each other). Introducing this type of sparse format would help initially, but it seems like it would create significantly more work for accessing data once a significant amount of data has been added to the file, since reads would have to jump around the file instead of proceeding sequentially.



Walter Roberson on 10 Mar 2017
Unfortunately, No.
The POSIX standard operation that allows for sparse files is to fseek() to a location past end of file and write data there; the file system is then permitted to leave "holes" in the parts where nothing has been written.
Unfortunately, in MATLAB, if you fseek() beyond the end of file, the location "sticks" at the end of file.
Therefore, in MATLAB, if you want to write to a scattered location, the general write procedure is:
  1. fopen() without the 't' (text) attribute (important!), with 'a' access (not 'w' or 'w+' or 'a+' for this purpose)
  2. fseek() to end of file
  3. ftell() to determine the position of the end of file, in bytes
  4. if the current end of file is before the place you need to be, fwrite() 0's to the place you need to be; otherwise fseek() to the place you need to be
  5. fwrite() the data you want
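Something like this sketch covers steps 2-5 (writeAt is just an illustrative name; if the permission chosen in step 1 forces every write to the end of the file on your platform, 'r+' is the usual alternative once the file exists):
function writeAt(fid, Pos, Data, Format)
  % Sketch of steps 2-5: write Data at byte offset Pos, zero-filling any gap.
  % Assumes fid was opened in binary mode as described in step 1.
  fseek(fid, 0, 'eof');                                   % step 2: spool to end
  Len = ftell(fid);                                       % step 3: current length in bytes
  if Pos > Len
    fwrite(fid, zeros(1, Pos - Len, 'uint8'), 'uint8');   % step 4: pad the gap with zeros
  else
    fseek(fid, Pos, 'bof');                               % step 4: jump to the offset
  end
  fwrite(fid, Data, Format);                              % step 5: write the data
end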
The general read procedure is:
  1. fopen() without the 't' (text) attribute (important!), with 'r' or 'a' or 'a+' access (not 'w' or 'w+') -- it is fine to keep the file open with 'a' access for reading and writing
  2. fseek() to the position you need to be
  3. ftell() to determine the position you ended up in, in bytes
  4. if the current position is before the place you need to be, the data has not been written yet, so act appropriately
  5. otherwise fread() the data, keeping in mind that you might encounter end of file if you were not consistent about the blocksize -- or even if the end of file happened to be exactly at the place you want to start reading
You can modify this procedure to test that the entire block of data is available before you read it.
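And a matching sketch of the read procedure, including the whole-block check (readAt is just an illustrative name):
function [Data, ok] = readAt(fid, Pos, Count, Format)
  % Sketch of the read procedure: ok is false if the block of Count elements
  % at byte offset Pos has not been (fully) written yet.
  Data = [];
  fseek(fid, Pos, 'bof');                               % step 2: seek to the target offset
  ok = (ftell(fid) == Pos);                             % steps 3-4: did we actually get there?
  if ok
    [Data, nRead] = fread(fid, Count, ['*', Format]);   % step 5: read Count elements
    ok = (nRead == Count);                              % was the whole block available?
  end
end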
Anthony Barone on 13 Mar 2017
To clarify, when I referred to "accessing the files" the only access that is required is a single access to write the data. After the data is written I won't need to access the written data again until after the code has finished running and all results from the code have been written to disk.
This makes me think that using memmapfile would just result in unnecessary additions to the virtual memory addresses, and wouldn't actually give any benefit since I don't need to access the data again after it is written. Am I correct in thinking this, or do I misunderstand something?

