Read big file with mixed data types with datastore
I've got a file which is 300 GB in size. A piece of it can be found in the attached file. I've read that the best way to handle this kind of file is to read it into a datastore.
As you can see, the first two lines are characters, while the following lines are a combination of floats and integers. Is it possible to read them with predefined data types? I know from fscanf that you can specify the data type, but when I use datastore it interprets every line as a string.
Answers (1)
Stephen23
on 25 Nov 2024
ds = datastore('./*.txt', 'Type','tabulartext', 'NumHeaderLines',2, 'TextscanFormats',repmat("%f",1,5));
T = preview(ds)
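To check that the columns really come back as numbers rather than strings, something like the following sketch could be used. The sample file, its two header lines, the space delimiter and the five-column layout are all assumptions made up for illustration, since the attached file is not shown here; 'ReadVariableNames' is set to false on the assumption that the header lines do not hold column names.

```matlab
% Hypothetical sample file: two text header lines, then five numeric columns
% (three floats and two integers per row). Not the poster's actual file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header line one\nheader line two\n');
fprintf(fid, '%g %g %g %d %d\n', [rand(3,10); randi(9,2,10)]);
fclose(fid);

% Skip the two header lines and force every column to be parsed as a number.
ds = datastore(fname, 'Type','tabulartext', 'NumHeaderLines',2, ...
    'ReadVariableNames',false, 'Delimiter',' ', ...
    'TextscanFormats',repmat("%f",1,5));

T = preview(ds);

% True if every variable in the preview table is numeric.
allNumeric = all(varfun(@isnumeric, T, 'OutputFormat','uniform'));
```

If allNumeric is false for the real file, the delimiter or the number of format specifiers probably does not match the file layout.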
7 Comments
Walter Roberson
on 19 Dec 2024 at 0:22
I do not know what documentation you are referring to?
The documentation for fopen() says "If you do not specify an encoding scheme when opening a file for reading, fopen uses auto character-set detection to determine the encoding." Details about the auto-detection are left unspecified, so hypothetically it might have to scan through the entire file (just in case there are some UTF-8 sequences somewhere in the file). But no auto-detection is done if you specify a text encoding.
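As a rough sketch of what specifying the encoding looks like, the fourth input of fopen names the encoding explicitly. The sample file and its three-column layout are invented for illustration, and UTF-8 is only an assumption about the real file. The textscan call shows the fscanf-style per-column types mentioned in the question.

```matlab
% Hypothetical sample file: two text header lines, then float/float/integer rows.
fname = [tempname '.txt'];
fid = fopen(fname, 'w', 'n', 'UTF-8');
fprintf(fid, 'header line one\nheader line two\n1.5 2.5 3\n4.5 5.5 6\n');
fclose(fid);

% Open for reading with an explicit encoding, so no auto-detection is needed.
fid = fopen(fname, 'r', 'n', 'UTF-8');
[~, ~, ~, enc] = fopen(fid);   % query the encoding associated with the file ID

h1 = fgetl(fid);               % first header line, as text
h2 = fgetl(fid);               % second header line, as text

% Two double columns and one int32 column.
C = textscan(fid, '%f %f %d');
fclose(fid);
```

For a datastore, the 'FileEncoding' name-value argument of the tabular text datastore plays a similar role.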
datastore is good for processing lots of line-oriented data, because it can automatically break line-oriented files up into pieces and process them in chunks. But the processing has to be of a kind that makes sense to do in chunks. For example, if the processing required calculating the standard deviation of the first column of data, then all of the data would have to be read in first.
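For a task that does split cleanly into per-chunk partial results, such as a sum or a mean, the chunked loop might look like the sketch below. The sample file, the two-column layout and the ReadSize of 4 rows are assumptions chosen only to keep the example small.

```matlab
% Hypothetical sample file: two text header lines, then two numeric columns.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header line one\nheader line two\n');
fprintf(fid, '%g %g\n', [1:10; 11:20]);
fclose(fid);

% ReadSize limits how many rows each call to read() returns.
ds = datastore(fname, 'Type','tabulartext', 'NumHeaderLines',2, ...
    'ReadVariableNames',false, 'Delimiter',' ', ...
    'TextscanFormats',["%f","%f"], 'ReadSize',4);

total = 0;   % running sum of the first column
count = 0;   % running number of rows seen
while hasdata(ds)
    T = read(ds);                    % next chunk, as a table
    total = total + sum(T{:,1});     % accumulate the partial sum
    count = count + height(T);
end
colMean = total / count;             % mean of the first column over all chunks
```

Only one chunk is held in memory at a time, so the same loop shape should scale to a file much larger than RAM, provided the per-chunk work stays this simple.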