Import data from text file with lines of unequal length as fast as possible.

Question

KostasK on 6 Jun 2022

https://in.mathworks.com/matlabcentral/answers/1735060-import-data-from-text-file-with-lines-of-unequal-length-as-fast-as-possible

Edited: Jan on 7 Jun 2022

Accepted Answer: Voss

sample.txt

Hi all,

I wish to import data from a text file which has 6,000 lines. The data in the .txt file is structured in lines of unequal lengths, separated by commas, and the first two samples on each line do not need to be imported (see attached sample).

What I have accomplished so far is below:

clear ; clc
filename = 'sample.txt' ;  
nL = 13 ; % Number of lines
% Initialize
fileID = fopen(filename) ;
tline = fgetl(fileID) ;
Nline = cell(nL, 1) ;
% Extract Lines in Text File
    tic
    i = 0 ;
while ischar(tline)
    i = i + 1 ;
    
    % Format & Save first line
    nline = str2num(tline) ;
    nline(1:2) = [] ; % Remove first two numbers
    Nline(i) =  {nline} ; % Save imported data in cell array
    % Extract next line
    tline = fgetl(fileID) ;
end
    toc
Elapsed time is 0.060131 seconds.
    

However, as the real file is 6000 lines, and there are hundreds of those to process, I would like to ask you if you can think of a faster way to accomplish the above.

I have narrowed it down to a few things that might help:

Preallocating (I have done this and it helps a lot).
The str2num function is incredibly slow; however, I haven't found an equivalent way to accomplish the same task.
I don't know if saving everything in a cell array instead of a vector slows things down. Unfortunately, because the lines have unequal lengths, my options are limited there, but I am open to other suggestions.

Thanks for the help in advance.

Answer 1

Voss on 6 Jun 2022

https://in.mathworks.com/matlabcentral/answers/1735060-import-data-from-text-file-with-lines-of-unequal-length-as-fast-as-possible#answer_980280

Edited: Voss on 6 Jun 2022

I don't know if it's "as fast as possible", but this may be faster than fgetl/str2num:

fid = fopen('sample.txt');
A = {};
tic
while ~feof(fid)
    A{end+1,1} = fscanf(fid,'%f,').';
    A{end}([1 2]) = []; % Remove first two numbers
end
toc
Elapsed time is 0.006596 seconds.
fclose(fid);
disp(A);
    {[-184 -184 -186 -184 -184 -186 -186 -184 -184 -184 -186 -186 -186 -186 -184 -184 -184 -188 -188 -190 -190 -188 -188 -190 -190 -188 -184 -184 -188 -188 -186 -184 -184 -188 -188 -184 -186 … ]}
    {[-184 -182 -188 -190 -192 -188 -184 -182 -184 -184 -186 -188 -184 -182 -184 -184 -184 -186 -184 -188 -192 -194 -194 -190 -188 -186 -188 -188 -190 -192 -192 -192 -190 -190 -190 -186 -186 … ]}
    {[-190 -186 -184 -188 -188 -182 -184 -186 -186 -186 -186 -188 -188 -188 -188 -188 -188 -188 -186 -190 -190 -186 -184 -188 -192 -190 -184 -184 -188 -192 -190 -190 -190 -184 -186 -188 -186 … ]}
    {[-190 -194 -188 -184 -186 -186 -188 -190 -192 -190 -190 -188 -186 -186 -184 -184 -184 -184 -184 -182 -186 -188 -192 -190 -188 -188 -188 -188 -186 -184 -186 -190 -192 -190 -186 -184 -188 … ]}
    {[-186 -188 -188 -188 -190 -194 -192 -188 -188 -186 -186 -188 -184 -184 -186 -186 -188 -188 -186 -186 -186 -186 -184 -186 -186 -188 -192 -190 -188 -186 -188 -192 -190 -188 -188 -184 -182 … ]}
    {[-184 -182 -180 -182 -186 -188 -188 -190 -190 -188 -188 -186 -184 -182 -182 -186 -186 -188 -186 -186 -188 -188 -190 -192 -190 -188 -186 -184 -184 -186 -188 -188 -188 -190 -184 -182 -184 … ]}
    {[-192 -192 -188 -186 -186 -186 -186 -186 -184 -184 -182 -184 -186 -186 -186 -186 -186 -186 -190 -190 -188 -186 -188 -190 -194 -192 -186 -186 -186 -186 -186 -188 -188 -192 -190 -186 -186 … ]}
    {[-184 -186 -184 -182 -180 -182 -184 -186 -186 -186 -186 -188 -186 -182 -186 -190 -188 -186 -188 -190 -186 -186 -186 -186 -184 -186 -188 -192 -192 -190 -188 -186 -190 -188 -186 -188 -188 … ]}
    {[-190 -188 -188 -190 -188 -188 -188 -190 -190 -188 -186 -186 -184 -186 -186 -184 -184 -186 -190 -190 -188 -186 -184 -186 -188 -188 -186 -186 -186 -184 -186 -190 -192 -190 -192 -190 -184 … ]}
    {[-190 -190 -186 -180 -182 -186 -186 -188 -186 -186 -186 -188 -190 -190 -184 -184 -186 -188 -186 -184 -186 -190 -188 -188 -188 -188 -186 -184 -186 -188 -186 -180 -180 -184 -186 -188 -190 … ]}
    {[-194 -190 -182 -184 -188 -192 -190 -190 -192 -190 -188 -188 -186 -186 -186 -186 -188 -190 -190 -190 -188 -184 -186 -190 -192 -188 -184 -182 -184 -184 -184 -186 -186 -182 -184 -186 -186 … ]}
    {[-184 -186 -190 -194 -196 -194 -194 -196 -194 -190 -188 -188 -186 -186 -186 -186 -190 -192 -192 -192 -190 -192 -196 -192 -188 -188 -186 -186 -186 -186 -186 -186 -186 -188 -188 -186 -186 … ]}
    {[-188 -190 -190 -186 -184 -184 -188 -186 -182 -182 -186 -188 -186 -184 -182 -184 -186 -188 -190 -192 -192 -190 -186 -184 -186 -190 -190 -184 -184 -186 -186 -188 -188 -186 -188 -188 -188 … ]}
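
The loop above reads one line per fscanf call because the format '%f,' stops as soon as the literal comma fails to match, which happens at the line break after the last number of a line. A minimal sketch of that behaviour follows. It assumes lines without a trailing comma, and the three-line temporary file is a made-up stand-in for sample.txt. The numel(v) > 2 guard is an addition for files that end with a line break, where the last fscanf call returns an empty result.

```matlab
% Hypothetical three-line sample; real files are assumed to look like this
% (comma-separated numbers, no trailing comma at the end of a line).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', '1,2,-184,-186', '3,4,-188', '5,6,-190,-192,-194');
fclose(fid);

fid = fopen(fname);
A = {};
while ~feof(fid)
    v = fscanf(fid, '%f,').';    % stops where ',' fails to match the line break
    if numel(v) > 2              % skip the empty read after the final line break
        A{end+1,1} = v(3:end);   % drop the first two numbers
    end
end
fclose(fid);
delete(fname);                   % clean up the temporary file
celldisp(A)
```

If a line did end with a comma, the format would keep matching across the line break and several lines would be merged into one cell, so it is worth checking the file layout before relying on this.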

6 Comments

KostasK on 7 Jun 2022

Thanks to all for the effort, @Jan and @Voss. The code in your last comment seems to work very well for me.

Jan on 7 Jun 2022

Edited: Jan on 7 Jun 2022

Thanks for sharing the timings on your machine. This means that my older R2018b version has a severe drawback in the loop.

Reading the data in one command is very efficient: good idea! cellfun is usually slower than a loop:

fid  = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C    = strsplit(data, newline);
if isempty(C{end})
   C(end) = [];
end
A = cell(numel(C), 1);
for k = 1:numel(C)   % Or better: PARFOR!
   v    = sscanf(C{k}, '%f,');
   A{k} = v(3:end).';
end

Now this can be combined with my evil parser if the OP is really bold and has an insane need for speed.

This is the code I've used for measuring the timings:

function testFileImport
filename = 'sample.txt';
% Original version: ----------------------------
tic;
nL = 6000 ;
fileID = fopen(filename) ;
tline = fgetl(fileID) ;
Nline = cell(nL, 1) ;
i = 0 ;
while ischar(tline)
   i = i + 1 ;
   nline      = str2num(tline);
   nline(1:2) = [];
   Nline{i}   = nline;   % Slightly modified
   tline      = fgetl(fileID);
end
fclose(fileID);
toc
% Voss 1: ------------------------------------
tic
fid = fopen(filename);
A = cell(6000, 1);
k = 0;
while ~feof(fid)
    v = fscanf(fid,'%f,').';
    if ~isempty(v)
       k    = k + 1;
       A{k} = v(3:end); % Remove first two numbers
    end
end
fclose(fid);
toc
% Voss 2: -----------------------------------
tic
fid  = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
A = strsplit(data,newline());
A = cellfun(@(x)sscanf(x,'%f,').',A,'Uniform',false);
if isempty(A{end})
    A(end) = [];
end
A = cellfun(@(x)x(3:end),A,'Uniform',false).';
toc
% Looped Voss 2: ----------------------------
tic
fid  = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C    = strsplit(data, newline);
if isempty(C{end})
   C(end) = [];
end
A = cell(numel(C), 1);
for k = 1:numel(C)
   v = sscanf(C{k}, '%f,');
   A{k} = v(3:end).';
end
toc
% Parallelized loop: --------------------------
if isempty(gcp('nocreate'))   % Query only; a bare gcp may already start a pool
   parpool;
end
tic
fid  = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C    = strsplit(data, newline);
if isempty(C{end})
   C(end) = [];
end
A = cell(numel(C), 1);
parfor k = 1:numel(C)
   v    = sscanf(C{k}, '%f,');
   A{k} = v(3:end).';
end
toc
% Parallelized and evil parser: ----------------
tic
fid  = fopen(filename);
data = fread(fid,'*char').';
fclose(fid);
C    = strsplit(data, newline);
if isempty(C{end})
   C(end) = [];
end
A = cell(numel(C), 1);
parfor k = 1:numel(C)
   v    = Line2Integer(C{k});
   A{k} = v(3:end);
end
toc
end
% **********************************
function x = Line2Integer(s)
% How ugly! But sscanf(s, '%d,') is slower
% The numbers must be integers, sign is caught, no trailing comma
% Don't blame me: I will claim not to be the author.
sgn = 1;
a   = 0;
ix  = 0;
x   = zeros(1, 2000);
for k = 1:numel(s)
   switch s(k)
      case '-'
         sgn = -1;
      case ','
         ix    = ix + 1;
         x(ix) = sgn * a;
         sgn   = 1;
         a     = 0;
      otherwise
         a = 10 * a + s(k) - 48;
   end
end
ix    = ix + 1;   % Flush last value
x(ix) = sgn * a; 
x     = x(1:ix);  % Crop unused elements
end

And the result on my i7/Win/R2018b:

Elapsed time is 2.894834 seconds.   % Original
Elapsed time is 15.670646 seconds.  % fscanf loop
Elapsed time is 1.801146 seconds.   % Block read/strsplit/cellfun
Elapsed time is 1.699081 seconds.   % Block read/strsplit/loop
Elapsed time is 0.946646 seconds.   % Block read/strsplit/PARFOR loop
Elapsed time is 0.682529 seconds.   % Block read/strsplit/PARFOR loop/evil parser

Good work, Voss. 24% of the original run time.
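
A further variant that might be worth timing, building on the block-read versions above: parse the whole text with a single sscanf call and then cut the result back into lines by counting commas. This is only a sketch and has not been measured here. It assumes every line is non-empty and has no trailing comma, and the in-memory string stands in for the output of fread.

```matlab
% Hypothetical in-memory sample standing in for fread(fid,'*char').'
data = sprintf('1,2,-3,4\n5,6,-7\n8,9,10,11,12\n');

data = strrep(data, char(13), '');          % tolerate CRLF line endings
data = strtrim(data);                        % drop the trailing line break
C    = strsplit(data, newline);              % one char vector per line
n    = cellfun(@(s) nnz(s == ',') + 1, C);   % values per line = commas + 1

% Turn line breaks into commas so that a single sscanf call parses everything
v = sscanf(strrep(data, newline, ','), '%f,').';

A = mat2cell(v, 1, n).';                     % split back into one cell per line
A = cellfun(@(x) x(3:end), A, 'UniformOutput', false);  % drop first two numbers
```

Whether this beats the loop depends on the MATLAB release and the file size, so it would need to go through the same timing harness as the other variants.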

Answer 2

Jan on 6 Jun 2022

https://in.mathworks.com/matlabcentral/answers/1735060-import-data-from-text-file-with-lines-of-unequal-length-as-fast-as-possible#answer_980295

Edited: Jan on 7 Jun 2022

Sorry, I'm not proud of this code, but it is 23% faster on my machine for a test file with 6000 lines:

filename = 'sample.txt' ;
fileID   = fopen(filename) ;
tic
nL    = 6000; % Number of lines
Nline = cell(nL, 1);
i     = 0;
tline = fgetl(fileID);
while ischar(tline)
   i = i + 1;  
   nline    = Line2Integer(tline);
   Nline{i} = nline(3:end);
   tline    = fgetl(fileID);
end
toc
fclose(fileID);
% **********************************
function x = Line2Integer(s)
% How ugly! But sscanf(s, '%d,') is slower
% The numbers must be integers, sign is caught, no trailing comma
% Don't blame me: I will claim not to be the author.
sgn = 1;
a   = 0;
ix  = 0;
x   = zeros(1, 2000);
for k = 1:numel(s)
   switch s(k)
      case '-'
         sgn = -1;
      case ','
         ix    = ix + 1;
         x(ix) = sgn * a;
         sgn   = 1;
         a     = 0;
      otherwise
         a = 10 * a + s(k) - 48;
   end
end
ix    = ix + 1;   % Flush last value
x(ix) = sgn * a;
x     = x(1:ix);  % Crop unused elements
end

This is evil. sscanf is much smarter and more robust, but slower in consequence.
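
To make the robustness trade-off concrete, here is a small sketch of a guard that could sit in front of a hand-written parser such as Line2Integer. The helper name isPlainIntegerLine and the sample lines are made up for illustration. The idea is that only lines consisting of digits, minus signs and commas go to the fast parser, while anything else (spaces, decimal points, exponents) falls back to sscanf.

```matlab
% Hypothetical guard: only digits, minus signs and commas are safe for a
% hand-written integer parser; anything else is handed to sscanf.
isPlainIntegerLine = @(s) all(ismember(s, '0123456789-,'));

sampleLines = {'-184,-186,12', '-184, -186, 12', '1.5,2e3'};
for k = 1:numel(sampleLines)
    s = sampleLines{k};
    if isPlainIntegerLine(s)
        fprintf('fast parser is safe: %s\n', s);
    else
        v = sscanf(s, '%f,').';   % robust fallback
        fprintf('sscanf fallback    : %s -> %s\n', s, mat2str(v));
    end
end
```

The check itself costs time on every line, so it may eat part of the speedup. It is probably most useful as a one-off validation of a new batch of files.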

For the testing, I've read the same file repeatedly from an SSD. The OS might store it in a RAM cache. When reading a lot of files in the real application, fgetl might be the bottleneck, so the small speedup of 23% might melt away.
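
For less noisy numbers than a single tic/toc pair, timeit can be used. It calls the function several times and returns a median. The sketch below is only an illustration: the temporary file and the parseAll handle are made up, and the caching caveat above still applies because the same file is read on every call.

```matlab
% Write a small hypothetical test file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', '1,2,-184,-186', '3,4,-188', '5,6,-190,-192,-194');
fclose(fid);

% Hypothetical handle wrapping one complete import (block read + sscanf per line)
parseAll = @() cellfun(@(s) sscanf(s, '%f,').', ...
    strsplit(strtrim(fileread(fname)), newline), 'UniformOutput', false);

t = timeit(parseAll);             % median over several calls
fprintf('Median time per call: %.6f s\n', t);
delete(fname);                    % clean up the temporary file
```

To estimate the real workload, it would be more representative to time a loop over many different files than to time one file repeatedly.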
