Slowdown of Reading Large Binary Files

Matthew Bloom on 13 Nov 2018
Commented: Matthew Bloom on 14 Nov 2018
I am trying to read from a large binary file (315 GB). It is an .op4 file written by Nastran and contains a single matrix of size NxM. I can determine ahead of time which of the M columns I need, so I do not have to read the entire file into memory. The file starts with a file header, followed by all the data. Each block of data is preceded by a header of five uint32 values, which tells you how many "words" to read and where in the NxM matrix that data belongs. This is repeated M times. My sample code is below.
The file has about 2,300,000 columns. The script runs well for the first ~300,000 columns, but then it suddenly starts to slow down exponentially.
Timing the script shows that over 90% of the time is spent on the temp_header=fread(fid,5,'uint32') line. I have tried reading only the header lines ahead of time in a single read, using the 'skip' option in fread, but that also bogs down after about 20% of the file.
One additional note: the test case I am running keeps only about 20 of the 2,300,000 columns, so memory use in the workspace is not the issue.
% ind is a logical vector specifying which columns to retain in memory
fid = fopen([Path fname],'r');
if fid > 0
    fseek(fid,0,'bof'); % Ensure we are at the beginning of the file
    header = fread(fid,5,'uint32'); % File header: five uint32 values
    NCOL = header(2);
    NROW = header(3);
    NF = header(4);
    NTYPE = header(5);
    NAME = strtrim(fread(fid,[1,8],'*char')); % ASCII name of the matrix, if required
    data = zeros(NROW,sum(ind)); % Preallocate only the columns being kept
    icol2 = 1; % Next free column in data
    tic
    for col = 1:NCOL
        if ~feof(fid)
            temp_header = fread(fid,5,'uint32');
            icol = temp_header(3); % Current column index
            irow = temp_header(4); % Start reading at row...
            NW = temp_header(5); % Number of words in the current column
            if ind(icol)
                % Two words per float64 value, so read NW/2 values
                data(irow:irow+NW/2-1,icol2) = fread(fid,NW/2,'float64');
                icol2 = icol2 + 1;
            else
                fseek(fid,NW/2*8,'cof'); % Skip this column's data (8 bytes per value)
            end
        end
    end
    fclose(fid);
end
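For reference, the header-only pass mentioned above (reading every column header in one fread call with the skip argument) could look roughly like this. It is only a sketch and assumes that every column record has the same length, which is not guaranteed for a real file. The small synthetic file, its dimensions, and the name TESTMAT are made up for illustration and only mimic the layout implied by the sample code; this is not a real OP4 file.

```matlab
% Build a tiny synthetic file with the layout implied by the sample code
% (hypothetical data, NOT a real Nastran OP4 file).
fname = [tempname '.bin'];
NCOL = 4; NROW = 3;
fid = fopen(fname, 'w');
fwrite(fid, [0 NCOL NROW 2 2], 'uint32');      % 5-word file header
fwrite(fid, uint8('TESTMAT '), 'uint8');       % 8-byte matrix name
for c = 1:NCOL
    fwrite(fid, [0 0 c 1 2*NROW], 'uint32');   % column header: ..., icol, irow, NW
    fwrite(fid, c*(1:NROW), 'float64');        % NW/2 double-precision values
end
fclose(fid);

% Header-only pass: read 5 uint32 values, skip the column data, repeat.
% ASSUMPTION: every column holds the same number of values (NROW here).
fid = fopen(fname, 'r');
fseek(fid, 5*4 + 8, 'bof');                    % jump past file header and name
skipBytes = NROW * 8;                          % bytes of data after each header
headers = fread(fid, [5, NCOL], '5*uint32', skipBytes);
fclose(fid);
delete(fname);

icol = headers(3, :);   % column index stored in each header
NW   = headers(5, :);   % word count stored in each header
```

If the column lengths vary (for example, sparse or trimmed columns), a fixed skip cannot work and each header has to be read before the next offset is known.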

Answers (1)

Image Analyst on 13 Nov 2018
Maybe try memmapfile(). I've never used it myself so I can't offer anything beyond a suggestion to look into it.
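A rough, untested-on-large-files sketch of what a memmapfile approach might look like is shown here. It maps a file as raw bytes and walks the column headers with typecast. The synthetic file, the variable names, and the choice of column 3 are hypothetical, and the layout is only what the question's sample code implies. typecast uses the machine's native byte order, so this assumes the file was written on a machine with the same byte order.

```matlab
% Tiny synthetic file with the layout implied by the question's sample code
% (hypothetical data, NOT a real Nastran OP4 file).
fname = [tempname '.bin'];
NCOL = 4; NROW = 3;
fid = fopen(fname, 'w');
fwrite(fid, [0 NCOL NROW 2 2], 'uint32');      % 5-word file header
fwrite(fid, uint8('TESTMAT '), 'uint8');       % 8-byte matrix name
for c = 1:NCOL
    fwrite(fid, [0 0 c 1 2*NROW], 'uint32');   % column header
    fwrite(fid, c*(1:NROW), 'float64');        % column data
end
fclose(fid);

% Map the file as bytes; data is paged in on access, not loaded up front.
m = memmapfile(fname, 'Format', 'uint8');
nTotal = numel(m.Data);
pos = 5*4 + 8;                 % zero-based byte offset of the first column header
wanted = 3;                    % hypothetical column of interest
col = [];
while pos < nTotal
    hdr = double(typecast(m.Data(pos+1 : pos+20), 'uint32'));  % 5 uint32 words
    nBytes = hdr(5)/2 * 8;     % NW words -> NW/2 doubles -> bytes
    if hdr(3) == wanted
        col = typecast(m.Data(pos+21 : pos+20+nBytes), 'double');
        break
    end
    pos = pos + 20 + nBytes;   % advance to the next column header
end
clear m                        % release the mapping before deleting the file
delete(fname);
```

Mapping all 315 GB at once may not be practical. memmapfile accepts 'Offset' and 'Repeat' arguments, so one option would be to map a limited window of the file and move it forward as needed.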
  3 Comments
Image Analyst on 13 Nov 2018
It seems like fseek() should tell it to skip a number of bytes. Is fseek() not working for you?
Matthew Bloom on 14 Nov 2018
I am using fseek to move forward past the data I do not need. The slowdown comes from having to read the header line every time. I need to read it because it tells you where the data is located within the matrix, and how many bytes to read or skip for that column.
Referring back to the code in the original post, the line that is causing the slowdown is:
temp_header=fread(fid,5,'uint32');
In the end this line gets called 2,300,000 times. It reads very fast for the first 20% of the calls, but then starts to slow down a lot.
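To pin down where the slowdown begins, one option is to time only the header reads in batches and record the file offset at the start of each batch. This is a diagnostic sketch, not a fix. The synthetic file and the batch size are made up for illustration; on the real file a batch of, say, 10,000 columns might be more useful.

```matlab
% Tiny synthetic file with the layout implied by the original post
% (hypothetical data, NOT a real Nastran OP4 file).
fname = [tempname '.bin'];
NCOL = 4; NROW = 3;
fid = fopen(fname, 'w');
fwrite(fid, [0 NCOL NROW 2 2], 'uint32');      % 5-word file header
fwrite(fid, uint8('TESTMAT '), 'uint8');       % 8-byte matrix name
for c = 1:NCOL
    fwrite(fid, [0 0 c 1 2*NROW], 'uint32');   % column header
    fwrite(fid, c*(1:NROW), 'float64');        % column data
end
fclose(fid);

batch   = 2;                       % columns per timing batch (hypothetical)
nBatch  = NCOL / batch;
elapsed = zeros(nBatch, 1);        % seconds spent in header fread per batch
offsets = zeros(nBatch, 1);        % file position at the start of each batch

fid = fopen(fname, 'r');
fseek(fid, 5*4 + 8, 'bof');        % skip file header and matrix name
for b = 1:nBatch
    offsets(b) = ftell(fid);
    for k = 1:batch
        t = tic;
        hdr = fread(fid, 5, 'uint32');
        elapsed(b) = elapsed(b) + toc(t);   % time the header read only
        fseek(fid, hdr(5)/2*8, 'cof');      % skip the column data
    end
end
fclose(fid);
delete(fname);
```

Plotting elapsed against offsets would show whether the slowdown starts at a particular file position or grows steadily with the number of calls.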
