How to read fixed-length character arrays from a binary file while skipping the bytes in between

I'm trying to read a binary file, and I'm wondering if there is a better way to read the characters. A portion of the binary file repeats the same sequence x times:
32 characters, integer, integer, integer, integer
If I had just 32 characters repeated x times, I could use:
names = convertCharsToStrings(fread(fileID, [ 32 x], '*char'));
I would like to do something like this instead since there are bytes in between:
names = convertCharsToStrings(fread(fileID, [ 32 x], '*char', 32 + 8*4));
That is not working the way I hoped. I assume it is reading one byte, then skipping 32 + 8*4 bytes before reading the next byte.
I have a workaround where I read the first character of each of the x sequences, then read the 2nd character, and so on.
names( 1:x, 1:32) = ' ';
for i = 1:32
names(:, i) = convertCharsToStrings(fread(fileID, [ 1 x ], '*char', 31 + 8 *4));
fseek(fileID, -8*(x*4)-32*x + 1, 0);
end
This accomplishes what I need, but is there an easier, better, or faster way to do this?
Walter Roberson on 9 Dec 2024
You can reduce the load a bit if you read each group of 8 characters as a uint64, typecast it to uint8, and char() the result. You would only need to loop 4 times instead of 32, and the I/O would be more efficient.
Note that you might need to fopen() with 'ieee-be' to get the right byte order when you do the above.
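A minimal sketch of this idea, assuming the layout from the question (a 32-character name followed by four 8-byte integers per record). The scratch file, its contents, and the variable names are made up for illustration. With the file opened in the machine's native byte order, typecast should hand the bytes back in file order.

```matlab
% Build a small scratch file: 3 records of a 32-byte name plus four int64 values.
x = 3;
recordSize = 32 + 4*8;
fname = [tempname '.bin'];                      % hypothetical scratch file
fid = fopen(fname, 'w');
for k = 1:x
    padded = sprintf('%-32s', sprintf('name%d', k));   % pad the name to 32 characters
    fwrite(fid, uint8(padded), 'uint8');
    fwrite(fid, int64([k 10*k 100*k -k]), 'int64');
end
fwrite(fid, zeros(1, 64, 'uint8'), 'uint8');    % stand-in for the rest of the file
fclose(fid);

% Read the names 8 bytes at a time: 4 strided passes instead of 32.
fid = fopen(fname, 'r');
nameBytes = zeros(32, x, 'uint8');
for k = 1:4
    fseek(fid, (k-1)*8, 'bof');                 % start of the k-th 8-byte group
    words = fread(fid, [1 x], '*uint64', recordSize - 8);
    % typecast turns each uint64 back into its 8 raw bytes
    nameBytes((k-1)*8 + (1:8), :) = reshape(typecast(words, 'uint8'), 8, x);
end
fclose(fid);
delete(fname);

names = strtrim(cellstr(char(nameBytes')));     % x-by-1 cell array of names
```

Each pass picks up one 8-byte slice of every name, so the loop count drops from 32 to 4 while the seek-and-skip pattern stays the same as in the original workaround.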


Accepted Answer

Arjun on 10 Dec 2024
Hi @Andre,
I see that you are looking for an efficient way to extract strings from a binary file in which they are interleaved with integers in a fixed pattern.
In this case, each sequence is a 32-byte string followed by 4 integers. You can open the file for binary reading and calculate the size of each sequence (32 bytes for the string and 16 bytes for the integers, assuming 4-byte integers). By iterating over the number of sequences, you can read each complete block at once using ‘fread’. The first 32 bytes of each block can then be extracted and converted into a character string, which you store in a pre-allocated string array. After processing all sequences, you can close the file and display the extracted names.
This approach reduces the number of input/output operations by reading each sequence with a single ‘fread’ call instead of reading its parts separately. Pre-allocating the ‘names’ array also avoids the overhead of growing it inside the loop.
Kindly refer to the code below for a better understanding:
% Open the file for reading (binary is the default mode)
fileID = fopen('dummyfile.bin', 'r');
if fileID == -1
    error('Could not open dummyfile.bin');
end
% Size of one complete sequence, assuming 4 bytes per integer
% (use 4 * 8 instead if the integers are 8 bytes each, as the 8*4 in the question suggests)
sequenceSize = 32 + 4 * 4;
% Preallocate the array for names; x is the number of sequences
names = strings(x, 1);
for i = 1:x
    % Read one sequence (32 characters + 4 integers)
    data = fread(fileID, sequenceSize, '*uint8');
    % Extract the 32 characters as a row vector
    nameChars = char(data(1:32))';
    % Convert the characters to a string
    names(i) = convertCharsToStrings(nameChars);
end
% Close the file
fclose(fileID);
% Display the names
disp(names);
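As a possible alternative to looping over the sequences, fread's precision argument accepts a repeat count ('N*source=>output'), in which case the skip is applied after each group of N values rather than after every value. The sketch below assumes 8-byte integers (as the 8*4 in the question suggests) and uses a made-up scratch file; the file name, contents, and variable names are illustrative only.

```matlab
% Build a small scratch file: 3 records of a 32-byte name plus four int64 values.
x = 3;
fname = [tempname '.bin'];                      % hypothetical scratch file
fid = fopen(fname, 'w');
for k = 1:x
    padded = sprintf('%-32s', sprintf('name%d', k));   % pad the name to 32 characters
    fwrite(fid, uint8(padded), 'uint8');
    fwrite(fid, int64([k 10*k 100*k -k]), 'int64');
end
fwrite(fid, zeros(1, 64, 'uint8'), 'uint8');    % stand-in for the rest of the file
fclose(fid);

fid = fopen(fname, 'r');
% Pass 1: read 32 characters, skip the 4*8 integer bytes, repeat for x records.
nameChars = fread(fid, [32 x], '32*uint8=>char', 4*8)';
% Pass 2: start at the first integer, read 4 int64 values, skip the next 32-byte name.
fseek(fid, 32, 'bof');
ints = fread(fid, [4 x], '4*int64=>int64', 32)';
fclose(fid);
delete(fname);

names = strtrim(cellstr(nameChars));            % x-by-1 cell array of names
```

Here the whole set of names comes from one fread call and the integers from a second, so the number of calls does not grow with x. In a real file, the first pass would start from wherever the repeated portion begins rather than from the start of the file.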
I hope this helps!
Andre Aroyan on 10 Dec 2024
This definitely helped and gave me the following idea for the I/O process. Now I'm stuck figuring out the best way to get arrays for the integers.
Here is my code now.
data = fread(fileID, [ ( 32 + 8*4 ) x ], '*uint8')';
names = strtrim(char(data(:,1:32)));
% Preallocate as int64 so the values are not converted to double on assignment
integerArray1 = zeros(x,1,'int64');
integerArray2 = zeros(x,1,'int64');
integerArray3 = zeros(x,1,'int64');
integerArray4 = zeros(x,1,'int64');
for i = 1:x
    integerArray1(i) = typecast(data(i,33:40),'int64');
    integerArray2(i) = typecast(data(i,41:48),'int64');
    integerArray3(i) = typecast(data(i,49:56),'int64');
    integerArray4(i) = typecast(data(i,57:64),'int64');
end
I could basically get an (x,8) matrix of uint8 for each of the integer arrays by using data(:,33:40), but I can't figure out how to turn that matrix into an (x,1) integer array without a for loop.
Any ideas, or is this as efficient, in terms of speed, as I can get it? I'm fixating on efficiency because x is often in the millions for this application.
Thanks for the help so far!
Walter Roberson on 10 Dec 2024
Because typecast accepts only scalars and vectors, not matrices, you cannot pass it data(:,33:40) directly; applied row by row, it needs a loop of one kind or another (possibly using arrayfun()). An alternative is to calculate the values:
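d64 = uint64(data);
% These weights treat byte 33 as the most significant (big-endian);
% reverse the order of the exponents for little-endian data.
integerArray1 = d64(:,33) * 2^56 + d64(:,34) * 2^48 + d64(:,35) * 2^40 + d64(:,36) * 2^32 ...
    + d64(:,37) * 2^24 + d64(:,38) * 2^16 + d64(:,39) * 2^8 + d64(:,40);
integerArray1 = typecast(integerArray1, 'int64');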
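Another option that may avoid both the loop and the arithmetic: typecast does accept a vector, so the bytes can be transposed, flattened so that each record's 8 bytes are contiguous, and converted in one call. This is only a sketch; the data matrix below is a made-up stand-in for the x-by-64 uint8 matrix from the comment above, and it assumes the file's byte order matches the machine's native order (swapbytes could be applied to the result otherwise).

```matlab
% Hypothetical stand-in for the x-by-64 uint8 matrix read from the file.
x = 3;
vals = int64([1 10 100 -1; 2 20 200 -2; 3 30 300 -3]);
data = zeros(x, 64, 'uint8');
for k = 1:x
    data(k, 1:32)  = uint8(sprintf('%-32s', sprintf('name%d', k)));
    data(k, 33:64) = typecast(vals(k, :), 'uint8');    % native byte order
end

% One integer column: transpose so each record's 8 bytes are contiguous,
% flatten to a column vector, then convert all records in a single call.
bytes = data(:, 33:40)';                        % 8-by-x
integerArray1 = typecast(bytes(:), 'int64');    % x-by-1 int64

% All four integer columns at once.
bytesAll = data(:, 33:64)';                     % 32-by-x
intsAll = reshape(typecast(bytesAll(:), 'int64'), 4, x)';   % x-by-4 int64
```

The transpose matters because MATLAB stores arrays column by column: without it, the flattened vector would interleave bytes from different records.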


Release

R2022b