Import data from text file with lines of unequal length as fast as possible.

Hi all,
I wish to import data from a text file which has 6,000 lines. Each line of the .txt file holds a different number of comma-separated values, and the first two values on each line do not need to be imported (see attached sample).
What I have accomplished so far is below:
clear ; clc
filename = 'sample.txt' ;
nL = 13 ; % Number of lines
% Initialize
fileID = fopen(filename) ;
tline = fgetl(fileID) ;
Nline = cell(nL, 1) ;
% Extract Lines in Text File
tic
i = 0 ;
while ischar(tline)
i = i + 1 ;
% Format & Save first line
nline = str2num(tline) ;
nline(1:2) = [] ; % Remove first two numbers
Nline(i) = {nline} ; % Save imported data in cell array
% Extract next line
tline = fgetl(fileID) ;
end
toc
Elapsed time is 0.060131 seconds.
fclose(fileID) ;
However, as the real file is 6000 lines, and there are hundreds of those to process, I would like to ask you if you can think of a faster way to accomplish the above.
I have narrowed it down to a few things that might help:
  1. Preallocating (I have done this and it helps a lot).
  2. The str2num function is incredibly slow, but I haven't found an equivalent way to accomplish the same task.
  3. I don't know if saving everything in a cell array instead of a vector slows things down. Unfortunately, the unequal length of the lines limits my options there, but I am open to other suggestions.
Thanks for the help in advance.

Accepted Answer

Voss on 6 Jun 2022
Edited: Voss on 6 Jun 2022
I don't know if it's "as fast as possible", but this may be faster than fgetl/str2num:
fid = fopen('sample.txt');
A = {};
tic
while ~feof(fid)
A{end+1,1} = fscanf(fid,'%f,').';
A{end}([1 2]) = []; % Remove first two numbers
end
toc
Elapsed time is 0.006596 seconds.
fclose(fid);
disp(A);
{[-184 -184 -186 -184 -184 -186 -186 -184 -184 -184 -186 -186 -186 -186 -184 -184 -184 -188 -188 -190 -190 -188 -188 -190 -190 -188 -184 -184 -188 -188 -186 -184 -184 -188 -188 -184 -186 … ]}
{[-184 -182 -188 -190 -192 -188 -184 -182 -184 -184 -186 -188 -184 -182 -184 -184 -184 -186 -184 -188 -192 -194 -194 -190 -188 -186 -188 -188 -190 -192 -192 -192 -190 -190 -190 -186 -186 … ]}
{[-190 -186 -184 -188 -188 -182 -184 -186 -186 -186 -186 -188 -188 -188 -188 -188 -188 -188 -186 -190 -190 -186 -184 -188 -192 -190 -184 -184 -188 -192 -190 -190 -190 -184 -186 -188 -186 … ]}
{[-190 -194 -188 -184 -186 -186 -188 -190 -192 -190 -190 -188 -186 -186 -184 -184 -184 -184 -184 -182 -186 -188 -192 -190 -188 -188 -188 -188 -186 -184 -186 -190 -192 -190 -186 -184 -188 … ]}
{[-186 -188 -188 -188 -190 -194 -192 -188 -188 -186 -186 -188 -184 -184 -186 -186 -188 -188 -186 -186 -186 -186 -184 -186 -186 -188 -192 -190 -188 -186 -188 -192 -190 -188 -188 -184 -182 … ]}
{[-184 -182 -180 -182 -186 -188 -188 -190 -190 -188 -188 -186 -184 -182 -182 -186 -186 -188 -186 -186 -188 -188 -190 -192 -190 -188 -186 -184 -184 -186 -188 -188 -188 -190 -184 -182 -184 … ]}
{[-192 -192 -188 -186 -186 -186 -186 -186 -184 -184 -182 -184 -186 -186 -186 -186 -186 -186 -190 -190 -188 -186 -188 -190 -194 -192 -186 -186 -186 -186 -186 -188 -188 -192 -190 -186 -186 … ]}
{[-184 -186 -184 -182 -180 -182 -184 -186 -186 -186 -186 -188 -186 -182 -186 -190 -188 -186 -188 -190 -186 -186 -186 -186 -184 -186 -188 -192 -192 -190 -188 -186 -190 -188 -186 -188 -188 … ]}
{[-190 -188 -188 -190 -188 -188 -188 -190 -190 -188 -186 -186 -184 -186 -186 -184 -184 -186 -190 -190 -188 -186 -184 -186 -188 -188 -186 -186 -186 -184 -186 -190 -192 -190 -192 -190 -184 … ]}
{[-190 -190 -186 -180 -182 -186 -186 -188 -186 -186 -186 -188 -190 -190 -184 -184 -186 -188 -186 -184 -186 -190 -188 -188 -188 -188 -186 -184 -186 -188 -186 -180 -180 -184 -186 -188 -190 … ]}
{[-194 -190 -182 -184 -188 -192 -190 -190 -192 -190 -188 -188 -186 -186 -186 -186 -188 -190 -190 -190 -188 -184 -186 -190 -192 -188 -184 -182 -184 -184 -184 -186 -186 -182 -184 -186 -186 … ]}
{[-184 -186 -190 -194 -196 -194 -194 -196 -194 -190 -188 -188 -186 -186 -186 -186 -190 -192 -192 -192 -190 -192 -196 -192 -188 -188 -186 -186 -186 -186 -186 -186 -186 -188 -188 -186 -186 … ]}
{[-188 -190 -190 -186 -184 -184 -188 -186 -182 -182 -186 -188 -186 -184 -182 -184 -186 -188 -190 -192 -192 -190 -186 -184 -186 -190 -190 -184 -184 -186 -186 -188 -188 -186 -188 -188 -188 … ]}
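If the file ends with a newline, the last fscanf call in a loop like the one above can return an empty array, and deleting elements 1 and 2 of an empty array fails. The following is only a sketch of one way to guard against that, not part of the original answer. The file content is made up for the demonstration and written to a temporary file.

```matlab
% Build a small test file (made-up content) that ends with a newline.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1,2,-184,-186\n3,4,-190\n5,6,-188,-184,-182\n');
fclose(fid);

fid = fopen(fname, 'r');
A = {};
while ~feof(fid)
    % '%f,' reads numbers until one is not followed by a comma (end of line)
    v = fscanf(fid, '%f,').';
    if ~isempty(v)              % skip the empty read after a trailing newline
        A{end+1, 1} = v(3:end); % drop the first two numbers
    end
end
fclose(fid);
delete(fname);
```

Indexing with v(3:end) instead of deleting v([1 2]) also tolerates a line that holds only two numbers.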
  6 Comments
KostasK on 7 Jun 2022
Thanks to all for the effort, @Jan and @Voss. The code in your last comment seems to work very well for me.
Jan on 7 Jun 2022
Edited: Jan on 7 Jun 2022
Thanks for sharing the timings on your machine. This means that my older R2018b version has a severe drawback in the loop.
Reading the data in one command is very efficient: good idea! cellfun is usually slower than a loop:
fid = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C = strsplit(data, newline);
if isempty(C{end})
C(end) = [];
end
A = cell(numel(C), 1);
for k = 1:numel(C) % Or better: PARFOR!
v = sscanf(C{k}, '%f,');
A{k} = v(3:end).';
end
Now this can be combined with my evil parser if the OP is really bold and has an insane need for speed.
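One caveat when combining the block read with a character-by-character parser: fread returns the raw bytes, so a file with Windows line endings leaves a carriage return at the end of every line after strsplit(data, newline). sscanf stops there without harm, but a parser that treats every unknown character as a digit would corrupt the last value. A minimal sketch, assuming such files can occur (the sample string is made up), removes the carriage returns before splitting:

```matlab
% Made-up file content with Windows line endings (CR+LF).
data = sprintf('1,2,-184,-186\r\n3,4,-190\r\n');

data(data == char(13)) = [];   % remove all carriage returns
C = strsplit(data, newline);
if isempty(C{end})             % trailing newline leaves an empty last piece
    C(end) = [];
end
A = cell(numel(C), 1);
for k = 1:numel(C)
    v = sscanf(C{k}, '%f,');
    A{k} = v(3:end).';         % drop the first two numbers, store as row
end
```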
This is the code I've used for measuring the timings:
function testFileImport
filename = 'sample.txt';
% Original version: ----------------------------
tic;
nL = 6000 ;
fileID = fopen(filename) ;
tline = fgetl(fileID) ;
Nline = cell(nL, 1) ;
i = 0 ;
while ischar(tline)
i = i + 1 ;
nline = str2num(tline);
nline(1:2) = [];
Nline{i} = nline; % Slightly modified
tline = fgetl(fileID);
end
fclose(fileID);
toc
% Voss 1: ------------------------------------
tic
fid = fopen(filename);
A = cell(6000, 1);
k = 0;
while ~feof(fid)
v = fscanf(fid,'%f,').';
if ~isempty(v)
k = k + 1;
A{k} = v(3:end); % Remove first two numbers
end
end
fclose(fid);
toc
% Voss 2: -----------------------------------
tic
fid = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
A = strsplit(data,newline());
A = cellfun(@(x)sscanf(x,'%f,').',A,'Uniform',false);
if isempty(A{end})
A(end) = [];
end
A = cellfun(@(x)x(3:end),A,'Uniform',false).';
toc
% Looped Voss 2: ----------------------------
tic
fid = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C = strsplit(data, newline);
if isempty(C{end})
C(end) = [];
end
A = cell(numel(C), 1);
for k = 1:numel(C)
v = sscanf(C{k}, '%f,');
A{k} = v(3:end).';
end
toc
% Parallelized loop: --------------------------
if isempty(gcp)
parpool;
end
tic
fid = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C = strsplit(data, newline);
if isempty(C{end})
C(end) = [];
end
A = cell(numel(C), 1);
parfor k = 1:numel(C)
v = sscanf(C{k}, '%f,');
A{k} = v(3:end).';
end
toc
% Parallelized and evil parser: ----------------
tic
fid = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C = strsplit(data, newline);
if isempty(C{end})
C(end) = [];
end
A = cell(numel(C), 1);
parfor k = 1:numel(C)
v = Line2Integer(C{k});
A{k} = v(3:end);
end
toc
end
% **********************************
function x = Line2Integer(s)
% How ugly! But sscanf(s, '%d,') is slower
% The numbers must be integers, sign is caught, no trailing comma
% Don't blame me: I will claim not to be the author.
sgn = 1;
a = 0;
ix = 0;
x = zeros(1, 2000);
for k = 1:numel(s)
switch s(k)
case '-'
sgn = -1;
case ','
ix = ix + 1;
x(ix) = sgn * a;
sgn = 1;
a = 0;
otherwise
a = 10 * a + s(k) - 48;
end
end
ix = ix + 1; % Flush last value
x(ix) = sgn * a;
x = x(1:ix); % Crop unused elements
end
And the result on my i7/Win/R2018b:
Elapsed time is 2.894834 seconds. % Original
Elapsed time is 15.670646 seconds. % fscanf loop
Elapsed time is 1.801146 seconds. % Block read/strsplit/cellfun
Elapsed time is 1.699081 seconds. % Block read/strsplit/loop
Elapsed time is 0.946646 seconds. % Block read/strsplit/PARFOR loop
Elapsed time is 0.682529 seconds. % Block read/strsplit/PARFOR loop/evil parser
Good work, Voss. 24% of the original run time.
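The percentage can be checked directly from the timings listed above. This is just the arithmetic, with the values copied in the same order:

```matlab
% Timings copied from the output above (seconds), in the same order.
t = [2.894834, 15.670646, 1.801146, 1.699081, 0.946646, 0.682529];
relPct = 100 * t / t(1);       % run time as a percentage of the original
fprintf('%6.1f %%\n', relPct); % one line per variant
```

The last entry rounds to 24%.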


More Answers (1)

Jan on 6 Jun 2022
Edited: Jan on 7 Jun 2022
Sorry, I'm not proud of this code, but it is 23% faster on my machine for a test file with 6000 lines:
filename = 'sample.txt' ;
fileID = fopen(filename) ;
tic
nL = 6000; % Number of lines
Nline = cell(nL, 1);
i = 0;
tline = fgetl(fileID);
while ischar(tline)
i = i + 1;
nline = Line2Integer(tline);
Nline{i} = nline(3:end);
tline = fgetl(fileID);
end
toc
fclose(fileID);
% **********************************
function x = Line2Integer(s)
% How ugly! But sscanf(s, '%d,') is slower
% The numbers must be integers, sign is caught, no trailing comma
% Don't blame me: I will claim not to be the author.
sgn = 1;
a = 0;
ix = 0;
x = zeros(1, 2000);
for k = 1:numel(s)
switch s(k)
case '-'
sgn = -1;
case ','
ix = ix + 1;
x(ix) = sgn * a;
sgn = 1;
a = 0;
otherwise
a = 10 * a + s(k) - 48;
end
end
ix = ix + 1; % Flush last value
x(ix) = sgn * a;
x = x(1:ix); % Crop unused elements
end
This is evil. sscanf is much smarter and more robust, but slower in consequence.
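To make the robustness difference concrete, here is a small sketch with a made-up line that contains a blank and a decimal value. The loop repeats the digit-accumulating logic of Line2Integer inline so that the example runs on its own:

```matlab
s = '12, -3.5,7';              % made-up line with a blank and a decimal

vScan = sscanf(s, '%f,').';    % skips the blank and parses the decimal

% Same digit-accumulating logic as Line2Integer, inlined for this demo.
sgn = 1; a = 0; ix = 0;
vHand = zeros(1, numel(s));
for k = 1:numel(s)
    switch s(k)
        case '-'
            sgn = -1;
        case ','
            ix = ix + 1;
            vHand(ix) = sgn * a;
            sgn = 1; a = 0;
        otherwise
            a = 10 * a + s(k) - 48;  % every other character counts as a digit
    end
end
ix = ix + 1;                   % flush last value
vHand(ix) = sgn * a;
vHand = vHand(1:ix);

disp(isequal(vScan, vHand));   % the two results differ for this input
```

The hand-written parser is only safe when every line is guaranteed to contain nothing but digits, minus signs and commas.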
For the testing, I've read the same file repeatedly from an SSD. The OS might store it in a RAM cache. For reading a lot of files in the real application fgetl might be the bottleneck, such that the small speedup of 23% might melt away.
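One way to see where the time goes with many files is to time reading and parsing separately. The sketch below is only an illustration: it creates a temporary folder with three tiny made-up files (the names sample1.txt etc. are hypothetical) and uses the block read with sscanf. With real data, the files would already exist and the cache effect described above still applies.

```matlab
% Create a temporary folder with a few small files (made-up content).
folder = tempname;
mkdir(folder);
for n = 1:3
    fid = fopen(fullfile(folder, sprintf('sample%d.txt', n)), 'w');
    fprintf(fid, '1,2,%d,%d\n', -180 - n, -190 - n);
    fclose(fid);
end

files = dir(fullfile(folder, '*.txt'));
result = cell(numel(files), 1);
tRead = 0;
tParse = 0;
for n = 1:numel(files)
    t0 = tic;                                  % time spent on file access
    fid = fopen(fullfile(folder, files(n).name), 'r');
    data = fread(fid, '*char').';
    fclose(fid);
    tRead = tRead + toc(t0);

    t0 = tic;                                  % time spent on parsing
    C = strsplit(data, newline);
    if isempty(C{end})
        C(end) = [];
    end
    A = cell(numel(C), 1);
    for k = 1:numel(C)
        v = sscanf(C{k}, '%f,');
        A{k} = v(3:end).';
    end
    result{n} = A;
    tParse = tParse + toc(t0);
end
fprintf('read: %.4f s, parse: %.4f s\n', tRead, tParse);
rmdir(folder, 's');                            % clean up the temporary folder
```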

Release

R2022a
