Speed up processing of larger binary files

Dear all,
I have to process thousands of binary files (16 MB each) by reading pairs of them and creating a bit-level data structure (usually a 1x134217728 array) so that I can process them at bit level.
Currently I am doing this the following way:
conv = @(c) uint8(bitget(c,1:32));
measurement = NaN(1,(sizeOfMeasurements*8)) %(1,134217728)
fid = fopen(fileName, 'rb');
byteContent = fread(fid,'uint32');
fclose(fid);
bitRepresentation1 = arrayfun(conv, byteContent, 'UniformOutput', false);
measurement=[bitRepresentation1{:}];
end
However, reading a single file takes minutes, which makes evaluating the entire data set very time-consuming.
UPDATE: I replaced fopen successfully by memmapfile using the code below:
m=memmapfile(fileName,'Format',{'uint32', [4194304 1], 'byteContent'});
byteContent=m.data.byteContent;
byteContent = double(byteContent);
I printed timing information (using tic/toc) for the individual instructions and it turns out that the bottleneck is:
bitRepresentation1 = arrayfun(conv, byteContent, 'UniformOutput', false); % see first line of code for conv
Are there more efficient ways of transforming byteContent into an array that stores one bit per index?
UPDATE2: I received a suggestion from another source that the conv function introduces superfluous loops. The new code looks like this:
fid = fopen(fileName, 'rb');
bitContent = fread(fid,'*ubit64');
fclose(fid);
conv = @(ii) uint8(bitget(bitContent, ii));
bitRepresentation = arrayfun(conv, 1:64, 'UniformOutput', false);
measurement = reshape(cat(2, bitRepresentation{:})', 1, []);
This brings the execution time of the line bitRepresentation = arrayfun[...] down from 39 s to 0.5 s. However, the bottleneck is now the very last line, which takes 5 s.

5 Comments

m = memmapfile(file,'Format','double') ;
Try this...any error?
What is the purpose of your line:
bitRepresentation1 = arrayfun(conv, byteContent, 'UniformOutput', false);
Guillaume on 29 Nov 2016
Edited: Guillaume on 29 Nov 2016
What I don't understand is why each single bit has to be stored as individual numbers, wasting memory and processing time.
Computers already have a very efficient way of storing and processing arrays of bits. It's called uint8, uint16, etc.
Here is a novel idea: use a bit to store a bit rather than a byte to store a bit. Leave your numbers as is. Use 8 times less memory.
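To illustrate the idea of leaving the numbers packed, here is a minimal sketch. The values in words and other are made up for the example and do not come from the measurement files; only built-in functions (bitget, bitxor) are used.

```matlab
% Hypothetical example data: three packed 32-bit words (not from the real files).
words = uint32([5; 0; 4294967295]);
other = uint32([4; 0; 4294967294]);

% Test bit 3 (1 = least significant bit) of every word without unpacking.
bit3 = bitget(words, 3);

% Compare two packed measurements: XOR leaves a 1 wherever the bits differ.
diffBits = bitxor(words, other);

% Count the differing bits per word by summing over the 32 bit positions.
nDiff = zeros(size(diffBits));
for k = 1:32
    nDiff = nDiff + double(bitget(diffBits, k));
end
totalDiff = sum(nDiff);
```

Whether this is practical depends on what the bit-level processing has to do; operations such as indexing single bits by position need extra arithmetic on packed data.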
@Guillaume: Storing a bit in a bit is very efficient for storage, but the processing is much harder, e.g. for logical indexing. I'm using a C-mex script for logical indexing with bit fields, which is remarkably faster than indexing with LOGICAL vectors. I guess the main effect is not the compact storage of the bits, but that Matlab does not pre-allocate efficiently. For a LOGICAL version see: FEX: CopyMask. I'm still astonished.
Did you try timing dec2bin() or de2bi() compared to bitget()?
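One way to run such a comparison is sketched below with timeit. The test data is random and hypothetical, and de2bi is left out because it requires the Communications Toolbox. Note that dec2bin returns the most significant bit first, while the bitget columns here start with the least significant bit.

```matlab
% Hypothetical test data: 1e5 random bytes (not the original measurement files).
data = uint8(randi([0 255], 1e5, 1));

% Variant A: one bitget call per bit position, N-by-8 result, LSB first.
viaBitget = @() [bitget(data, 1), bitget(data, 2), bitget(data, 3), ...
                 bitget(data, 4), bitget(data, 5), bitget(data, 6), ...
                 bitget(data, 7), bitget(data, 8)];

% Variant B: dec2bin gives an N-by-8 char matrix, MSB first;
% subtracting '0' turns the characters into the numbers 0 and 1.
viaDec2bin = @() uint8(dec2bin(double(data), 8) - '0');

tBitget  = timeit(viaBitget);
tDec2bin = timeit(viaDec2bin);
fprintf('bitget: %.4f s, dec2bin: %.4f s\n', tBitget, tDec2bin);

% Sanity check: both variants agree once the column order is reversed.
sameBits = isequal(viaBitget(), fliplr(viaDec2bin()));
```

The absolute timings depend on the MATLAB release and the machine, so no numbers are quoted here.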

Answers (1)

Jan on 29 Nov 2016
Edited: Jan on 29 Nov 2016
Omit this line:
measurement = NaN(1,(sizeOfMeasurements*8)) %(1,134217728)
A pre-allocation is a waste of time if the result is overwritten later.
If you want to access the data bitwise, use an integer type:
byteContent = fread(fid, '*uint32'); % Instead of storing it in a DOUBLE
Creating a large cell is not efficient. I assume that these lines can be replaced:
bitRepresentation1 = arrayfun(conv, byteContent, 'UniformOutput', false);
measurement=[bitRepresentation1{:}];
If you explain the wanted result, a suggestion for a replacement is possible and I will expand my answer.
[EDITED]
fid = fopen(FileName, 'r');
if fid == -1
    error('Cannot open file: %s', FileName);
end
Data = fread(fid, [8, inf], 'ubit1=>uint8');
fclose(fid);
Now each bit is stored as a UINT8 element with the value 1 or 0.
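The read above can be tried on a small file first. The following is only a sketch: the temporary file and its two bytes (0x55 and 0x0F) are hypothetical test data. It checks the shape and the number of set bits per byte and makes no claim about the order of the bits within a column.

```matlab
% Write a hypothetical two-byte test file: 0x55 and 0x0F.
tmpFile = [tempname, '.bin'];
fid = fopen(tmpFile, 'w');
if fid == -1
    error('Cannot open file: %s', tmpFile);
end
fwrite(fid, uint8([85, 15]), 'uint8');
fclose(fid);

% Read it back bit by bit: one column of 8 bits per byte.
fid = fopen(tmpFile, 'r');
if fid == -1
    error('Cannot open file: %s', tmpFile);
end
Data = fread(fid, [8, inf], 'ubit1=>uint8');
fclose(fid);
delete(tmpFile);

% Both test bytes have exactly four bits set.
bitsPerByte = sum(Data, 1);
```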
Perhaps this is faster when used in place of the fread call above, i.e. before the fclose (at least it is in R2009a: 0.25 sec on a virtual machine for a 16 MB file):
Data = fread(fid, inf, '*uint8');
Result = [bitget(Data, 1), bitget(Data, 2), bitget(Data, 3), ...
bitget(Data, 4), bitget(Data, 5), bitget(Data, 6), ...
bitget(Data, 7), bitget(Data, 8)];
What a pity that bitget(X, 1:8) is not valid in Matlab when X is not a scalar.
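If a 1-by-(8*N) row vector is needed, the bits can be written into a preallocated 8-by-N matrix and flattened with a single reshape, which avoids both the cell array and the transpose. This is a sketch under assumptions: Data below is a hypothetical two-byte stand-in for the file content, and the order "most significant bit first" is a guess at the wanted layout.

```matlab
% Hypothetical stand-in for the file content: the bytes 0x55 and 0x0F.
Data = uint8([85; 15]);

% Preallocate 8-by-N so that each COLUMN holds the bits of one byte.
N = numel(Data);
Bits = zeros(8, N, 'uint8');
for k = 1:8
    % Row k receives bit (9-k), so the most significant bit comes first.
    Bits(k, :) = bitget(Data, 9 - k).';
end

% Column-major reshape concatenates the columns: bits of byte 1, byte 2, ...
measurement = reshape(Bits, 1, []);
```

For the least significant bit first, use bitget(Data, k) inside the loop instead.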

2 Comments

Dear Jan,
Given a 16 MB binary file, the wanted result is an array A of dimensions 1x134217728, where every index of the array stores the respective bit (either 0 or 1).
To give a more illustrative example: if the binary file consists of only one byte, 0x55, the array A shall be of size 1x8 with the values 01010101.
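The one-byte example can be reproduced directly, because bitget accepts a vector of bit positions when the value is a scalar. The sketch below shows both bit orders; which one is meant is an assumption, since 01010101 corresponds to reading 0x55 with the most significant bit first.

```matlab
% The single byte from the example: 0x55 = 85.
b = uint8(85);

% Most significant bit first: bit positions 8 down to 1.
msbFirst = bitget(b, 8:-1:1);

% Least significant bit first: bit positions 1 up to 8.
lsbFirst = bitget(b, 1:8);

disp(msbFirst);
disp(lsbFirst);
```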
See [EDITED]

Asked: 29 Nov 2016
Commented: 30 Nov 2016