Extract values from a .txt file with a weird layout
Hi, here is my problem: I'd like to extract values (doubles) from a file where some columns are not separated by any kind of delimiter.
The text file can look like this:
1994010103
12.05 54.60 38.00 0.28
12.10 54.60 43.00 0.30
13.10 54.60 99.00 0.33
13.15 54.60100.00 0.34
13.20 54.60 0.00 0.00
13.25 54.60128.00 0.16
I'm interested in the values in the third column. The first row is a date/time, and I need to get rid of it.
My solution to this problem is:
fid = fopen('file');
T = textscan(fid,'%s','delimiter',{'\n'});
fclose(fid);
ngx = 39;
ngy = 34;
n = ngx*ngy;
t = 5839;
for i = 1:t
    T{1}((i-1)*n+1) = []; % get rid of the date/time, which occurs before every block of n rows
end
interest = zeros(length(T{1}),1);
for i = 1:length(T{1})
    interest(i) = str2double(T{1}{i}(12:18)); % extract the interesting characters from every row and convert them into a double
end
This code works, but I'm dealing with millions of rows, and the loop makes the computation time really long.
If you have any idea how to reduce the computation time, that'd be great!
Thanks
More Answers (1)
Ken Atwell
on 7 Mar 2014
Try replacing your second loop with something along the lines of:
T = strjoin(T{1}', '\n'); % T{1} is the cell array of lines returned by your textscan call
interest = textscan(T, '%*11c %6f %*[^\n]');
interest = interest{1};
strjoin is a newer function to convert your cell array of strings to a single long string, which is what textscan will expect. If strjoin is not available in your version of MATLAB, http://www.mathworks.com/matlabcentral/fileexchange/31862-strjoin may help.
The textscan formatter string has three parts:
- Ignore the first 11 characters (%*11c)
- Interpret the next six characters as a floating point number (%6f, the data you are interested in)
- Ignore the remainder of the line (%*[^\n])
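The first loop in the question (dropping the date/time rows) can probably be vectorized as well. A minimal sketch, assuming every block is one header row followed by exactly n data rows; the toy block size n = 3 and the second header value are made up for illustration (the real file uses n = 39*34):

```matlab
% Toy cell array of lines: 2 blocks, each a date/time header plus n = 3 data rows.
% The second header ('1994010104') is hypothetical, not taken from the post.
L = {'1994010103'
     '12.05 54.60 38.00 0.28'
     '12.10 54.60 43.00 0.30'
     '13.10 54.60 99.00 0.33'
     '1994010104'
     '13.15 54.60100.00 0.34'
     '13.20 54.60 0.00 0.00'
     '13.25 54.60128.00 0.16'};
n = 3;

% Headers sit at rows 1, n+2, 2n+3, ..., i.e. every (n+1)-th row.
% Deleting them in one indexing operation avoids the row-by-row loop.
L(1:n+1:end) = [];

disp(numel(L))   % number of data rows left
```

With the variables from the question, the same idea would read T{1}(1:n+1:end) = []; applied once, before any rows have been deleted.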
1 Comment
If you're going to do that, you may as well just write
c=cell2mat(textscan(fid,'%*11c %6f %*[^\n]','delimiter',''))
which, as you note, correctly skips the right number of columns. For the OP's problem, he could then call the above in a loop, also including
'headerlines',1
and the numeric count for the number of lines per subsection in the file.
That solves the OP's specific problem, since he only wants the one column, but there's still the gaping hole in MATLAB functionality for the general case of parsing the whole file correctly without machinations.
I've posted examples like this in discussions of this before, but I don't recall you being one of the participants, so the following clearly demonstrates what's simply broken in C:
>> cc=(textscan(fid,['%5s' repmat('%6s',1,3)],'delimiter',''))
>> [cc{1} cc{2} cc{3} cc{4}]
ans =
'12.05' '54.60 ' '38.00 ' '0.28'
'12.10' '54.60 ' '43.00 ' '0.30'
'13.10' '54.60 ' '99.00 ' '0.33'
'13.15' '54.601' '00.00 ' '0.34'
'13.20' '54.60 ' '0.00 ' '0.00'
'13.25' '54.601' '28.00 ' '0.16'
>>
NB the second and subsequent columns: they all begin with a non-whitespace character instead of the blank (or other character) that actually occupies the first position of the field if one counts positions by the format string's field widths. That is consistent with the definition of field width in C, but it is a practically wrong-headed definition. Consequently the 2nd column has the string '54.60_' or '54.601', NOT the expected/needed/desired '_54.60' (I used the underscore to emphasize the blank). And the last column ends up not even being full width.
C simply cannot keep its hands off the trailing location despite being explicitly told to do so. In kindergarten you get sent to the corner for timeout if you keep taking your neighbor's crayon... :)
ADDENDUM:
BTW, the above also depends upon the fact that there's always a whitespace character AFTER the 3rd column. Observe what happens if I make the case a little tougher:
>> type test.dat
13.10 54.60 99.00  0.33
13.15 54.60100.00200.34
13.20 54.60  0.00300.00
13.25 54.60128.00-40.16
Now I've filled in the full 6-column field in the 4th column in some lines, so the whitespace isn't there. Now the results are really screwed up, and you have to go back to actual column-counting parsing. I'd kind of forgotten about the problem one runs into with real files, concentrating too much on the specific solution to the OP's particular problem/request.
>> cc=cell2mat(textscan(fid,['%5f' repmat('%6f',1,3)],'delimiter',''))
cc =
13.1000 54.6000 99.0000 0.3300
13.1500 54.6010 0.0020 0.3400
13.2000 54.6000 0.0030 0
13.2500 54.6010 28.0000 -40.1600
The correct array is
13.1000 54.6000 99.0000 0.33
13.1500 54.6000 100.0000 200.34
13.2000 54.6000 0.0000 300.00
13.2500 54.6000 128.0000 -40.16
Note also the last anomaly in behavior: owing to the '-', the parser still manages to get the last row right. I've worked out before exactly how the rules produce that, but it's convoluted enough that I don't recall the details off the top of my head. It has to do with how whitespace is handled.
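For what it's worth, the "actual column-counting parsing" mentioned above might look something like the following. This is only a sketch, assuming every line has the same fixed layout with field widths 5, 6, 6, 6 (so 23 characters per line); the padding of the sample rows is reconstructed on that assumption, and the variable names are just for illustration:

```matlab
% Lines of test.dat as a char matrix, one row per line.
% Assumed layout: fixed field widths 5, 6, 6, 6 (23 characters per line).
rows = ['13.10 54.60 99.00  0.33'
        '13.15 54.60100.00200.34'
        '13.20 54.60  0.00300.00'
        '13.25 54.60128.00-40.16'];

widths = [5 6 6 6];
stops  = cumsum(widths);        % last character position of each field
starts = stops - widths + 1;    % first character position of each field

vals = zeros(size(rows,1), numel(widths));
for k = 1:numel(widths)
    % Slice the k-th field out of every row at once, then convert.
    % The loop runs over fields, not over rows.
    vals(:,k) = str2double(cellstr(rows(:, starts(k):stops(k))));
end

disp(vals)
```

Because the fields are cut by position before any number conversion happens, a neighboring field that runs right up against the next one cannot bleed into it. For a real file, the char matrix could be built with char(T{1}) from the cell array of lines, provided all lines have the same layout.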