Loading part of a text file (i.e., fileread the first X bytes)

I'm using fileread to load data. The problem is that the files are large (several MB) and I only need to load and process the first fraction (say 100 kB) of each file. There are over 1M files, so the computation time wasted loading all of this "data fat" at the end of the files adds up to several days.
Does anyone know of a way to use fileread (or something similar) where you can specify to only load part of the file into MATLAB's memory buffer? With this many files even saving a fraction of a second will make a big difference.

Accepted Answer

Walter Roberson on 2 Oct 2019
Edited: Walter Roberson on 2 Oct 2019
You would use fopen(), fread() with a size, then fclose(). You would want to use a "precision" specifier such as '*char'.
However, if there is a possibility that your files are UTF encoded or use some other multibyte character set, then you need to define more clearly what the size is intended to indicate. Is it (say) 100000 bytes that then potentially have to be decoded, or do you want to read 100000 decoded characters?
Also, remember to take line terminators into account in your counting. Does your file use carriage returns as well as linefeeds?
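The fopen()/fread()/fclose() sequence described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name readHead and its nBytes argument are hypothetical, the size is taken to mean raw bytes (not decoded characters), and the file is assumed to be UTF-8 or plain ASCII. A multibyte character cut in half at the byte limit may not decode cleanly, so the last character or two of the result should be treated with care.

```matlab
function txt = readHead(filename, nBytes)
% READHEAD  Hypothetical helper: return the first nBytes of a file as text.
% Assumption: nBytes counts raw bytes, and the file is UTF-8 (or ASCII).
    fid = fopen(filename, 'r');
    if fid == -1
        error('readHead:openFailed', 'Could not open %s', filename);
    end
    closer = onCleanup(@() fclose(fid));   % close the file even if fread errors

    % Read at most nBytes raw bytes as a row vector. If the file is
    % shorter than nBytes, fread simply returns what is there.
    bytes = fread(fid, [1 nBytes], '*uint8');

    % Decode the bytes to characters. For pure ASCII data, char(bytes)
    % would give the same result.
    txt = native2unicode(bytes, 'UTF-8');
end
```

It would then be called in place of fileread, for example txt = readHead('somefile.txt', 100000), where the file name is a placeholder. Reading bytes and decoding afterwards keeps the meaning of the size unambiguous; reading with a character precision instead makes the count depend on the encoding associated with the file.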
  2 Comments
Scott on 2 Oct 2019
Thanks Walter. I'll see how the run time compares. I likely could also save time by not passing the full block of text to the various parsing functions as well.
This is helpful, thanks!
Walter Roberson on 2 Oct 2019
Extracting the beginning of a character vector is not always more efficient if the parsing code can already cope with extra characters beyond what you need. But if you are using regexp, be sure to make .* lazy by adding the ? quantifier (that is, write .*?), or use the 'dotexceptnewline' option with .*. A plain .* is greedy: it implicitly moves the pointer to the end of the entire stretch of characters and then works backwards to find a match, instead of finding the first match from the current position.
Extracting the beginning of a character vector usually does not cost much and can save you from having to carefully code .*?, but when you are counting "fractions of a second" it is a small cost that might not strictly be necessary. Extracting the beginning before parsing is cleaner programming in most cases, but not always the utmost optimization.
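As a small illustration of the regexp point, the sketch below compares a greedy .*, a lazy .*?, and .* with 'dotexceptnewline' on a made-up string. The sample text, the pattern, and the variable names are all assumptions chosen for the example, not taken from the original poster's data.

```matlab
% Made-up sample text: the wanted value is on the first line and the
% rest stands in for the unneeded tail of a large file.
txt = sprintf('key=alpha;\nfiller line\nkey=beta;\n');

% Greedy: .* runs to the end of the text and backtracks, so the token
% extends to the LAST ';' in the text.
greedyTok = regexp(txt, 'key=(.*);', 'tokens', 'once');

% Lazy: .*? stops at the first ';' after 'key='.
lazyTok = regexp(txt, 'key=(.*?);', 'tokens', 'once');

% Greedy, but '.' may not cross a newline, so the match stays on one line.
lineTok = regexp(txt, 'key=(.*);', 'tokens', 'once', 'dotexceptnewline');

% Alternative: cut the text down first, then parse only the head.
% The length 12 is arbitrary and only suits this sample string.
head = txt(1:min(numel(txt), 12));
headTok = regexp(head, 'key=(.*);', 'tokens', 'once');
```

Here lazyTok, lineTok and headTok all hold 'alpha', while greedyTok spans several lines. Which variant is fastest on real files would need to be measured with timeit or tic/toc on representative data.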
