textscan does not read all rows

Ricardo Lopez A. on 19 Jan 2020
Edited: Jeremy Hughes on 20 Jan 2020
Hi,
I am dealing with very large .txt files and trying to use textscan to open them. I have a smaller .txt file with the same format that I was able to open with readtable. The resulting table has 47 variables and 1389712 rows.
Here is readtable code:
data=readtable('Building.txt');
Here is the textscan code:
formatSpec='%s%f%s%s%s%s%f%f%f%f%f%s%s%f%f%f%f%s%f%f%f%f%f%f%f%f%f%s%f%s%s%s%s%s%s%s%f%f%s%s%s%f%s%f%s%f%f';
fid = fopen('Building.txt','r');
data1 = textscan(fid,formatSpec,'Delimiter','|');
fclose(fid);
data1 has 47 variables, but only 36299 rows instead of 1389712 rows. I would use readtable, but it is far too slow for the large .txt files.
Please note that I obtained the formatSpec from the readtable result: summary(data) shows the type of each variable.
This is an example of the format of the text files I am trying to use (lots of missing data I know):
EE760424-42D5-E511-80C1-3863BB43AC67|0||RESIDENTIAL STRUCTURE||RR000|||1||||||||| |0|0||0||.00||.00||C|0|||||||||||||||99748186| |38001|7017
EF760424-42D5-E511-80C1-3863BB43AC67|0||RESIDENTIAL STRUCTURE||RR000|||1||||||||| |0|0||0||.00||.00||C|0|||||||||||||||99748257| |38001|7017
Thanks a lot!
  1 Comment
dpb on 19 Jan 2020
You sure there aren't missing values in the readtable table? It's much more forgiving of a bad format or missing data than textscan is.
I don't think anybody can do much here without a sample file to work on... it should zip up pretty compactly.


Answers (1)

Jeremy Hughes on 20 Jan 2020
Edited: Jeremy Hughes on 20 Jan 2020
If you pass 'ReturnOnError',false in the textscan call, textscan will throw an error at the point where the format fails to match your file, instead of silently returning what it has read so far. The mismatch is likely due to the missing data.
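To illustrate the difference, here is a minimal sketch on a made-up three-row file (the file contents and the two-column format are assumptions for the demo, not your data). It first shows the default behavior, where textscan stops quietly and the file position tells you where it gave up, and then the same read with 'ReturnOnError',false:

```matlab
% Hypothetical sample file: row 2 has text where %f expects a number.
tmp = [tempname '.txt'];
fid = fopen(tmp,'w');
fprintf(fid,'a|1\nb|oops\nc|3\n');
fclose(fid);

% Default behavior: textscan stops at the first field it cannot convert
% and returns whatever it managed to read, without any error.
fid = fopen(tmp,'r');
C = textscan(fid,'%s%f','Delimiter','|');
stoppedEarly = ~feof(fid);     % true if part of the file was left unread
if stoppedEarly
    pos  = ftell(fid);         % byte offset where reading stopped
    rest = fgetl(fid);         % remainder of the offending line
    fprintf('Stopped at byte %d, unread text: %s\n', pos, rest);
end
fclose(fid);

% With 'ReturnOnError',false the same mismatch is raised as an error.
msg = '';
fid = fopen(tmp,'r');
try
    C2 = textscan(fid,'%s%f','Delimiter','|','ReturnOnError',false);
catch err
    msg = err.message;         % describes the field that failed to convert
end
fclose(fid);
disp(msg)
delete(tmp);
```

On your real file, the same feof/ftell/fgetl check after the original textscan call should point at the row where reading stopped (around row 36299).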
readtable tries to read using a detected format and, when that fails, updates the format and re-reads the file. It may be slow because it's reading multiple times trying to get the format correct. You could pass that same formatSpec into readtable, but it will likely error in the same way as textscan (just not silently).
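For reference, passing a format into readtable is done with the 'Format' name-value pair. A small sketch on a made-up three-column file (contents and format are assumptions for the demo, and it assumes the file has no header row):

```matlab
% Hypothetical sample file with an empty numeric field in row 2.
tmp = [tempname '.txt'];
fid = fopen(tmp,'w');
fprintf(fid,'a|1|x\nb||y\n');
fclose(fid);

% Read with an explicit format instead of letting readtable detect one.
T = readtable(tmp,'Format','%s%f%s','Delimiter','|', ...
    'ReadVariableNames',false);
disp(T)
delete(tmp);
```

In this sketch the empty numeric field comes back as NaN. With your file you would substitute the 47-specifier formatSpec from the question.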
If you try detectImportOptions with the file, then readtable, you might have faster/better results.
file = 'Building.txt';  % file name taken from the question
opts = detectImportOptions(file,'Delimiter','|','ExpectedNumVariables',47)
%% Check if this looks right
tp = preview(file,opts)
%% If the variable types look correct in tp, you don't need this step.
formatSpec='%s%f%s%s%s%s%f%f%f%f%f%s%s%f%f%f%f%s%f%f%f%f%f%f%f%f%f%s%f%s%s%s%s%s%s%s%f%f%s%s%s%f%s%f%s%f%f';
fmt = split(formatSpec(2:end),'%');
opts = setvartype(opts,strcmp(fmt,'f'),'double');
opts = setvartype(opts,strcmp(fmt,'s'),'char');
%% Read the whole file.
T = readtable(file,opts);
I can't really test this without your file, but it should work (maybe with some tweaking).

Release

R2017a
