textscan doesn't stop at blank space in txt file
Hi, I'm trying to import data from a txt file using the textscan function. I thought it was supposed to stop at the first blank space it sees, but it seems to be grabbing data beyond the blank space. My Group1 should stop at the first blank space before "Events", but it includes "Events", "100", and "Subject".
I'm using the following code so far:
[file_list, path_n] = uigetfile('.txt','Select the Files to Process','Multiselect','on');
fidi = fopen(file_list);
Group1 = textscan(fidi, '%s %s %s %f %s %s','HeaderLines',3, 'Delimiter','\t');
Attached is the txt file data:
4 Comments
dpb
on 21 Oct 2022
Edited: dpb
on 21 Oct 2022
"...the blank space doesn't match the "%s" specifier (or so I believe),"
Well, that isn't a correct assumption either; a blank is a valid character like any other. However, unless told otherwise with the optional 'Whitespace' named parameter, blanks are considered whitespace and are ignored or treated as delimiters, except inside quoted strings, where they are significant.
Again, the textscan doc Algorithms section states:
"When matching data to a text conversion specifier, textscan reads until it finds a delimiter or an end-of-line character."
But the format spec was '%s %s %s %f %s %s', which gets reapplied over and over until it either fails or reaches the end of file. In this case it found the %s and a numeric it could convert, but then the following records fail.
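A minimal sketch of that reapplication, run on an in-memory character vector instead of the attached file. The sample text and the shortened three-field format are made up for illustration. The second output of textscan reports how far the scan got, so the stopping point can be checked rather than assumed:

```matlab
% Hypothetical tab-delimited sample: two data rows, a blank line, then a
% section header ("Events") that does not fit the row format.
txt = sprintf('a\tb\t1.5\nc\td\t2.5\n\nEvents\n100\tSubject\n');

% The format is reapplied field by field until a conversion fails or the
% input runs out; nothing in the format itself says "stop at a blank".
[C, pos] = textscan(txt, '%s %s %f', 'Delimiter', '\t');

% Inspect what was actually consumed instead of assuming where it stopped.
disp(C{1});
fprintf('Scan ended at character %d of %d\n', pos, numel(txt));
```

If rows from the "Events" block show up in C{1}, that is the same effect described in the question.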
Another alternative to parsing with textscan, when such a break is known to be in the file, is to just accept the error, resynch the file pointer to the next expected record, and then carry on with the next section's format string. This can be tricky if the file doesn't have fixed-length records like the example: fgetl will advance to the next end of line, but depending upon file content, that may not include all of the next record to be scanned, and backing up to the previous end of record isn't easily supported in stream files. In this particular file, however, with the failure in the header line, that would work, and you could subsequently get the second group in the same open with textscan as
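fidi = fopen(file_list);
% first group: three text fields, one numeric, two text fields
fmt=[repmat('%s',1,3) '%f' repmat('%s',1,2)];
G1=textscan(fidi,fmt,'HeaderLines',3,'Delimiter','\t','collectoutput',1);
% second group: same layout with one fewer trailing text field
fmt=[repmat('%s',1,3) '%f' repmat('%s',1,1)];
fgetl(fidi); % resynch to BOL next header group
G2=textscan(fidi,fmt,'Delimiter','\t','collectoutput',1);
fclose(fidi); % release the file handle when done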
Personally, I'd still opt for higher level parsing tools instead of having to then put the above into something useful...
Walter Roberson
on 21 Oct 2022
All textscan formats other than %c and %[] skip leading whitespace as defined by the Whitespace option (or default list of whitespace characters if no option was passed.) And %c is perfectly happy to read a space.
If you need a space to be rejected then you have two possibilities:
- pass Whitespace option that does not include space; or
- use %[^ ], keeping in mind that it would happily gobble a number and return it as a character vector
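A rough sketch of both possibilities on made-up input strings (not the attached file). The exact split is worth checking in your MATLAB release, so no output is claimed here:

```matlab
% Hypothetical input: two tab-separated fields that contain spaces.
txt = sprintf('left side\tright side');

% Possibility 1: empty Whitespace list, so a space is an ordinary character
% and only the tab delimiter (or end of line) ends a field.
A = textscan(txt, '%s', 'Delimiter', '\t', 'Whitespace', '');
disp(A{1});

% Possibility 2: %[^ ] reads characters up to, but not including, a space.
% Applied once to text that starts with digits, it still returns text.
B = textscan('123 abc', '%[^ ]', 1);
disp(class(B{1}));
```

The second call illustrates the caveat above: a leading number is not converted, so any numeric field read this way needs str2double afterwards.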
Answers (1)
dpb
on 20 Oct 2022
fn=websave('walking_01.txt','https://www.mathworks.com/matlabcentral/answers/uploaded_files/1163318/walking_01.txt'); % download once, reuse the local file
opt=detectImportOptions(fn, ...
'numheaderlines',2, ...
'readvariablenames',1, ...
'delimiter','\t', ...
'expectednumvariables',6, ...
'missingrule','fill');
opt.VariableTypes(1)={'char'};
tG=readtable(fn,opt);
ix=find(contains(tG.Subject,'Events'));
tG=tG(1:ix-1,:);
[head(tG);tail(tG)]
Got to thinking -- each of the first two sections would make a great table, and each can be imported in part directly. Unfortunately, readtable isn't set up to read from memory, but I thought it worth showing an import options object and what it can do.
"In anger" (as my old Scottish power plant testing engineer friend used to say), I'd still probably first read the file in toto, use that to find the sections, and then parse them.
The first two sections are pretty easy; not so sure about the "Devices" section -- the "Moment" section also looks ok, although it appears empty in this dataset.
1 Comment
dpb
on 20 Oct 2022
Edited: dpb
on 22 Oct 2022
SECTIONS={'Gait Cycle','Events','Devices'};
F=readlines(websave('walking_01.txt','https://www.mathworks.com/matlabcentral/answers/uploaded_files/1163318/walking_01.txt'));
ix=find(startsWith(F,SECTIONS))
Gives the section starting locations for internal parsing -- or use those to limit the ranges read using readtable from the file itself.
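A minimal sketch of the internal-parsing route, using a small made-up string array in place of the downloaded file. The row contents and the blocks variable are illustrative assumptions, not taken from walking_01.txt:

```matlab
SECTIONS = {'Gait Cycle','Events','Devices'};

% Hypothetical stand-in for the output of readlines on the real file.
F = ["Gait Cycle"; "Subject Context Name"; "s01 Left Stride"; ""; ...
     "Events"; "s01 Left FootStrike"; ""; ...
     "Devices"; "100"];

% Section start rows, found the same way as above.
ix = find(startsWith(F, SECTIONS));

% Each section runs to the line before the next section (or to end of file).
stops = [ix(2:end) - 1; numel(F)];

% Collect the non-empty lines of each section for later parsing.
blocks = cell(numel(ix), 1);
for k = 1:numel(ix)
    chunk = F(ix(k):stops(k));
    blocks{k} = chunk(strlength(chunk) > 0);
end
```

Each element of blocks can then be split on tabs, or the ix and stops pairs can be used to limit the rows readtable reads from the file.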