How to read large text data into matlab

116 views (last 30 days)
Hi every one I have a text file up to 10 GB which has to be read into matlab. The part of the data is listed below
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935
2242 -25.5898 -24.3081 -29.3729 0.00671099 0.00205397 -0.0108453
2243 9.2867 27.1493 -37.9274 -0.00115821 0.00912371 -0.00178601
2244 3.89714 -48.5019 70.5903 0.0041159 -0.00255481 -0.0029498
2245 49.8803 -40.1819 -5.30361 -0.0106695 0.0224494 0.00918698
2246 0.22115 -19.9758 -2.30173 0.0190817 0.0262146 -0.0153229
2247 -53.6289 50.5517 -23.5032 0.00388499 -0.00559089 0.000787281
.
.
.
.
.
.
.
.
ITEM: TIMESTEP
10
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166
2242 -14.4928 20.0993 3.55963 0.00244236 0.00203074 -0.0162865
2243 -2.64823 8.26566 23.6457 -0.000503352 0.0140246 -0.00909782
2244 -153.189 40.6383 -12.0141 0.00192712 -0.00177534 -0.00194966
2245 35.0712 -14.4107 6.31868 0.00668828 0.012556 0.00468532
2246 22.0675 -14.7867 61.4774 0.0182799 0.0194239 -0.00942033
2247 -3.80959 -88.6786 1.61222 0.00459477 -0.00577238 0.000324204
2248 -18.4777 -9.35017 -1.12766 0.0146401 0.00924069 -0.00730373
2249 16.2354 -7.34658 -25.1694 -0.0169203 0.0249397 0.0085598
2250 110.508 19.9749 -4.95758 -0.00500049 0.000961677 0.00667405
2251 -7.46059 3.35324 -41.665 0.0175383 -0.00791068 -0.00702065
Basically, it has many parts which start with the "ITEM: TIMESTEP". I have to skip the first 9 lines for each part and then read the other lines.
I tried the textscan function (May be I misused it), but it is very slow. Is there a faster way to do it in Matlab?
  1 Comment
Cedric
Cedric on 6 Jan 2018
Edited: Cedric on 6 Jan 2018
How much RAM do you have available?

Sign in to comment.

Accepted Answer

Cedric
Cedric on 6 Jan 2018
Edited: Cedric on 6 Jan 2018
If you have enough RAM for this, the following could run a little faster. It is way less versatile than Per's solution though, and exploits specific characters present in the header. You may have to adapt it a bit if there are e.g. other types of header content:
content = fileread( 'data.txt' ) ;
blockEnds = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds = [blockEnds(2:end), numel( content )] ;
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks = numel( blockStarts ) ;
data = cell( nBlocks, 1 ) ;
fprintf( '%d blocks found.\n', nBlocks ) ;
for bId = 1 : nBlocks
data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end
PS: this takes < 20s for a 1GB data file on a small laptop (with 32GB RAM though).
  2 Comments
Fan Li
Fan Li on 8 Jan 2018
Thanks. This way is faster.
Abdullahi Samantar
Abdullahi Samantar on 12 Dec 2018
Hi Fan Li,
How did you manipulate Cedric code to get your large txt file (lammps) run.
I couldnt figure it out.
Thank you

Sign in to comment.

More Answers (2)

per isakson
per isakson on 5 Jan 2018
Edited: per isakson on 6 Jan 2018
Given:
  • All headers consist of 9 lines
  • All data blocks consist of 7 columns of numerical data
  • The blocks of numerical data should be converted to double. (Added later.)
  • The columns of the data are separated by space, char(32)
  • There is RAM enough to store the parsed data. Nearly 10GB is needed to store in double. Single would introduce a rounding error.
Try:
>> cac = cssm( 'cssm.txt' );
>> whos cac
Name Size Bytes Class Attributes
cac 1x6 3696 cell
>> cac
cac =
[7x7 double] [11x7 double] [7x7 double] [11x7 double] [7x7 double] [11x7 double]
>>
>> cac{1}
ans =
1.0e+03 *
2.2410 0.0515 -0.0480 -0.0556 0.0000 -0.0000 -0.0000
2.2420 -0.0256 -0.0243 -0.0294 0.0000 0.0000 -0.0000
2.2430 0.0093 0.0271 -0.0379 -0.0000 0.0000 -0.0000
2.2440 0.0039 -0.0485 0.0706 0.0000 -0.0000 -0.0000
2.2450 0.0499 -0.0402 -0.0053 -0.0000 0.0000 0.0000
2.2460 0.0002 -0.0200 -0.0023 0.0000 0.0000 -0.0000
2.2470 -0.0536 0.0506 -0.0235 0.0000 -0.0000 0.0000
>>
where
function cac = cssm( ffs )
fid = fopen( ffs );
cac = cell(1,0);
while not( feof( fid ) )
cac(1,end+1) = textscan( fid, '%f%f%f%f%f%f%f' ...
, 'Headerlines',9, 'CollectOutput',true );
end
fclose( fid );
end
and cssm.txt contains three copies of the data of the question
"textscan [...] is very slow. Is there a faster way [...] Matlab?" AFAIK: No, not significantly faster. However, I don't agree that it's very slow.
  2 Comments
Fan Li
Fan Li on 6 Jan 2018
Hi per isakson
I am using fgetl function which is faster.
per isakson
per isakson on 6 Jan 2018
Edited: per isakson on 6 Jan 2018
That's comparing apples to oranges. I assumed without stating it that the "numerical blocks" should be parsed.

Sign in to comment.


Steven Lord
Steven Lord on 8 Jan 2018
If your data is too large to fit in memory all at once, consider using a datastore. Since you have data in the headers that I assume you want to access, using a TabularTextDatastore probably won't suit your needs. You may need to use a general FileDatastore or develop your own custom datastore using your knowledge of the way your data is formatted.
Once you have a datastore you could use it to create a tall array.
  1 Comment
Fan Li
Fan Li on 16 Jan 2018
Edited: Fan Li on 16 Jan 2018
Hi Steven Lord
I have read the part about tall array and datastore. It is useful for me. But I do not know how to skip the headers which I do not need . The function I am using now and other method provided here is not for datastore. There is limited source for skipping the headers with datastore. So, can you tell me how to skip the headers with the datastore? The format of the data is provided above.
Thanks

Sign in to comment.

Categories

Find more on Large Files and Big Data in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!