Extract values from a .txt file with a weird layout

Hi, here is my problem: I'd like to extract values (doubles) from a file in which some columns are not separated by any kind of delimiter.
The text file can look like this:
1994010103
12.05 54.60 38.00 0.28
12.10 54.60 43.00 0.30
13.10 54.60 99.00 0.33
13.15 54.60100.00 0.34
13.20 54.60 0.00 0.00
13.25 54.60128.00 0.16
I'm interested in the values in the third column. The first row is a date/time stamp that I need to get rid of.
My solution to this problem is:
fid = fopen('file');
T = textscan(fid,'%s','delimiter',{'\n'});
fclose(fid);
ngx = 39;
ngy = 34;
n = ngx*ngy;   % number of data rows between two date/time rows
t = 5839;      % number of date/time rows in the file
for i = 1:t
    T{1}((i-1)*n+1) = [];   % get rid of the date/time row that precedes every block of n data rows
end
interest = zeros(length(T{1}),1);
for i = 1:length(T{1})
    interest(i) = str2double(T{1}{i}(12:18));   % extract the characters of interest from every row and convert them to a double
end
This code works, but I'm dealing with millions of rows and the loops make the computation time really long.
If you have any idea how to reduce the computation time, that would be great!
Thanks

 Accepted Answer

dpb on 7 Mar 2014
Edited: dpb on 7 Mar 2014
This has been a point of contention of mine "since forever": there's no automatic way to read fixed-width column data in C (and hence in Matlab). TMW definitely needs to add the feature, so I recommend adding your voice to the feature enhancement list for it. Maybe in 30 more years...
But, with the tools as they are:
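nSkip = ngx*ngy + 1;   % rows per block: one date row plus the ngx*ngy data rows from the question
c = textscan(fid,'%s','delimiter','\n'); c = char(c{:});   % read, to char array
c(1:nSkip:end,:) = [];   % delete the date rows
data = str2num(c(:,12:18));   % convert the desired columns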
If you need more columns, either repeat the last line for each column range, or use the known field widths to insert a delimiter at the desired locations and then use textscan on the array.
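A minimal sketch of the "insert a delimiter" route, assuming the 5/6/6/6-character layout used elsewhere in this thread. The sample rows are hypothetical, padded to that layout, and sscanf is swapped in for textscan only because it makes the example shorter; it is not necessarily the fastest choice.

```matlab
% Hypothetical sample rows, padded to the assumed 5/6/6/6-character layout.
rows = {'12.05 54.60 38.00  0.28'
        '13.15 54.60100.00  0.34'
        '13.25 54.60128.00-40.16'};
c = char(rows);                      % N-by-23 char array

widths = [5 6 6 6];                  % assumed field widths
stops  = cumsum(widths);             % last column of each field
starts = stops - widths + 1;         % first column of each field

% Copy each field into a wider array, leaving one blank after every field.
nRows = size(c, 1);
d = repmat(' ', nRows, sum(widths) + numel(widths));
pos = 0;
for k = 1:numel(widths)
    d(:, pos + (1:widths(k))) = c(:, starts(k):stops(k));
    pos = pos + widths(k) + 1;       % the skipped blank acts as the delimiter
end

% sscanf reads a char matrix in column order, so transpose to scan row by row.
data = sscanf(d.', '%f', [numel(widths), Inf]).';
```

The loop runs once per field, not once per row, so the work per row stays vectorized. data(:,3) is then the third column.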
ADDENDUM:
The best thing for Matlab, if you can, is to change how the files are generated going forward so that they use a delimiter or at least a wider field. But, of course, sometimes that isn't feasible.

8 Comments

Can you clarify what MATLAB is missing? MATLAB has long supported reading fixed-width data. See "Field Width" in the textscan doc:
>> type test.dat
12.05 54.60 38.00 0.28
12.10 54.60 43.00 0.30
13.10 54.60 99.00 0.33
13.15 54.60100.00 0.34
13.20 54.60 0.00 0.00
13.25 54.60128.00 0.16
>> [a,b,c,d]=textread('test.dat',['%5f' repmat('%6f',1,3)]);
>> [a b c d]
ans =
12.0500 54.6000 38.0000 0.2800
12.1000 54.6000 43.0000 0.3000
13.1000 54.6000 99.0000 0.3300
13.1500 54.6010 0 0.3400
13.2000 54.6000 0 0
13.2500 54.6010 28.0000 0.1600
>> cc=cell2mat(textscan(fid,['%5f' repmat('%6f',1,3)]))
cc =
12.0500 54.6000 38.0000 0.2800
12.1000 54.6000 43.0000 0.3000
13.1000 54.6000 99.0000 0.3300
13.1500 54.6010 0 0.3400
13.2000 54.6000 0 0
13.2500 54.6010 28.0000 0.1600
>>
Note that despite telling both textread and textscan that the field width is fixed, the C-compatible scanning doesn't terminate the field at the count and begin the next one at the following character; it "eats" the character after, and thus the two values 100 and 128 are interpreted as 0 and 28 instead. Adding 'delimiter','' to textscan doesn't help.
I've never found any way around this behavior and in previous discussions it's been said it is C-Standard compatible and expected behavior.
Can you write a working parser for the above file?
My suggestion has been that Matlab should also have a Fortran FORMAT-like option that will cleanly deal with fixed-width fields w/o delimiters and also have the added benefit that one can then write repeat fields cleanly as well as recursion, etc., etc., etc., ...
In the past I've written a mex utility that passes a limited subset to Fortran, but in the move back to the farm I appear to have lost the source, and it was involved enough that I've never had the inclination to reinvent it.
Gotcha, it looks like I got lucky when I used %c.
dpb on 8 Mar 2014
Edited: dpb on 9 Mar 2014
The problem is the definition of field in C --
"...textscan reads the number of characters or digits specified by the field width or precision, or up to the first delimiter, whichever comes first."
The key there is the phrase "characters or digits": it doesn't count delimiters as characters but consumes them uncounted, except for %c. It's just not a useful definition for fixed-width formatted files if the full column width is used.
%c does count characters correctly, but it's not a very useful way to read data one wants to parse; in your case, where you're throwing those characters away, it's fine. But %s is fraught with pain: it eats whitespace until a delimiter regardless of count, iirc. I always have to go back and test, because while it is consistent, it's so confusing that I find it almost impossible to predict the result just from looking at a particular input and the "what seems logical" format string for a given case.
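To make the difference concrete, here is a small sketch that scans one hypothetical line (padded to the 5/6/6/6 layout assumed in this thread) in two ways. It uses sscanf as a stand-in for the C-style width rule, on the assumption that it skips leading blanks uncounted just as textread and textscan do above.

```matlab
s = '13.15 54.60100.00  0.34';     % hypothetical line, assumed 5/6/6/6 layout

% C-style scanning: leading blanks are skipped and not counted toward the width.
v = sscanf(s, '%5f%6f%6f%6f').';

% Position-based slicing: each field is exactly the characters in its columns.
w = str2double({s(1:5), s(6:11), s(12:17), s(18:23)});
```

If sscanf applies the same rule, v should show the same 54.6010 and 0 pattern as the textread output earlier in the thread, while w holds 13.15, 54.60, 100 and 0.34.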
Anyway, I think this is the first time a TMW person has said anything other than the equivalent of "tough, that's just the way it is". Now that you see the problem can I plead with you to add an internal advocate voice to provide a solution?
As said above, I think the ideal solution would be to provide Fortran-like FORMAT-compatible i/o. Given that Matlab started with FORTRAN, it always seemed a shame to me that they ever got away from the clearly far better system and syntax than that used in C, altho I understand that since they went to C it's simpler to just mimic the development language's i/o.
Submit a Service Request for it. Quote this thread (URL) in your request.
Well, I relented and tried but that requires a login that I don't remember. TMW has surely made it a pita any more. :(
You want to?
Star Strider on 9 Mar 2014
Edited: Star Strider on 9 Mar 2014
I did, and I quoted this thread as my ‘justification’. (I also suggested an I/O format descriptor for engineering notation that would behave like the E/e descriptor, since that issue arose recently.) I figure the more ‘votes’ this issue gets in the form of Service Requests, the more likely it is to appear sooner rather than later.
If you’re logged in here, you should also be logged into everything else. (I like right-clicking because it’s easier to keep track of things. I just close the extra tabs when I’m finished.)
See if this works:
  • at the top of this page, right click on MathWorks.com
  • right click on Support
  • near the end of the page, right click on My Service Requests and create a new service request
I suggest you copy the URL for this thread first, then paste it in your Service Request. You’ve pretty much discussed everything of significance here, so there’s no need to retype it there.
Also, although you can’t vote for your own answer (I added my vote) you can vote for the question. (I did.)
dpb on 9 Mar 2014
Edited: dpb on 9 Mar 2014
OK, that path did work; following the direct link for some reason didn't recognize the login info and I get tired of the barriers very quickly in my dotage. :(
So, I did it again -- if it's as much longer before TMW does anything since my first submittals, I'll be about 100...I guess it will be a_good_thing (tm) if I am still able to use Matlab at all at that point to see the results. :)
I can't even count the number of times this has come up just in the <2 yr since I started following the forum a little, as promised in return for the complimentary updated license TMW generously provided after retirement. Quite a number asked the specific question OP did, and several others had this problem as the underlying reason for the query even though the question wasn't direct: the poster was bogged down in the processing, so the question asked was fairly far removed from the root cause.
For some reason beyond my ken such a fundamental lack apparently has just never seemed important to anybody inside TMW with the clout to actually get anything done about it.
Having once done the mex interface to FORMAT, I know it has some difficulties if you try to implement a fully functional version that handles every possible feature, but a workable subset that handles probably 90-95% of real-world cases isn't too bad, and I'd think TMW should be able to do it in at most a couple of months or so if it would just dedicate some resources to it. I thought at the time my version was probably about 80% of the way to being releasable, but even with that as a starter I wasn't able to generate any interest.


More Answers (1)

Try replacing your second loop with something along the lines of:
str = strjoin(T{1}', '\n');   % T{1} is the cell array of lines returned by textscan
interest = textscan(str, '%*11c %6f %*[^\n]');
interest = interest{1};
strjoin is a newer function to convert your cell array of strings to a single long string, which is what textscan will expect. If strjoin is not available in your version of MATLAB, http://www.mathworks.com/matlabcentral/fileexchange/31862-strjoin may help.
The textscan formatter string has three parts:
  1. Ignore the first 11 characters (%*11c)
  2. Interpret the next six characters as a floating-point number, the data you are interested in (%6f)
  3. Ignore the remainder of the line (%*[^\n])
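The three parts can be tried on a short in-memory string before running them on the real file. This is only a sketch: the two sample lines are hypothetical and padded to the 5/6/6/6 layout assumed in this thread, and it relies on textscan accepting a character vector as its first input.

```matlab
% Two hypothetical lines in the assumed 5/6/6/6 layout, each ending in a newline.
txt = sprintf('%s\n', '12.05 54.60 38.00  0.28', '13.15 54.60100.00  0.34');

% Skip 11 characters, read up to 6 characters as a number, skip the rest of the line.
out = textscan(txt, '%*11c %6f %*[^\n]');
interest = out{1};                 % third-column values as a column vector
```

Note that this works here because a blank follows the third column in every line; a file whose fourth column fills its whole width needs position-based parsing instead.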

1 Comment

dpb on 8 Mar 2014
Edited: dpb on 10 Mar 2014
Iff'en you're going to do that, may as well just write --
c=cell2mat(textscan(fid,'%*11c %6f %*[^\n]','delimiter',''))
which does as you note correctly skip the right number of columns. For OP's problem, he could then loop over the above also including
'headerlines',1
and the numeric count for the number of lines per subsection in the file.
Solves the OP's specific problem since only wants the one column, but still there's the gaping hole in Matlab functionality of the general case of parsing the whole file correctly w/o machinations.
I've posted examples like this during this discussion before but I don't recall you being one of the conversants so the following clearly demonstrates what's simply broke in C--
>> cc=(textscan(fid,['%5s' repmat('%6s',1,3)],'delimiter',''))
>> [cc{1} cc{2} cc{3} cc{4}]
ans =
'12.05' '54.60 ' '38.00 ' '0.28'
'12.10' '54.60 ' '43.00 ' '0.30'
'13.10' '54.60 ' '99.00 ' '0.33'
'13.15' '54.601' '00.00 ' '0.34'
'13.20' '54.60 ' '0.00 ' '0.00'
'13.25' '54.601' '28.00 ' '0.16'
>>
NB the second and subsequent columns: they all begin with a nonwhite character instead of the blank or character that is the actual content of the initial field column if one counts position based on the format string field widths. That is consistent with the definition of what the field width means in C, but it is simply a wrong-headed definition in practice. Consequently the 2nd has the string '54.60_' or '54.601', NOT the expected/needed/desired '_54.60', where I used the underscore to emphasize the blank. And it ends up with the last column not even being full width.
C simply cannot keep its hands off the trailing location despite being explicitly told to do so. In kindergarten you get sent to the corner for timeout if you keep taking your neighbor's crayon... :)
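For comparison, a sketch of what a position-based split of one line gives, using a regular expression that takes exactly 5, 6, 6 and 6 characters. The sample line is hypothetical and padded to that assumed layout.

```matlab
s = '13.15 54.60100.00  0.34';     % hypothetical line, assumed 5/6/6/6 layout

% Each group captures a fixed number of characters, blanks included.
tok = regexp(s, '^(.{5})(.{6})(.{6})(.{6})$', 'tokens', 'once');

% Leading blanks are harmless to str2double.
vals = str2double(tok);
```

Here tok{2} is ' 54.60' with its leading blank intact and tok{3} is '100.00', which is the split the format string "looks like" it should produce.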
ADDENDUM:
BTW, the above also depends upon the fact that there's always a whitespace character AFTER the 3rd column. Observe what happens if we make the case a little tougher:
>> type test.dat
13.10 54.60 99.00 0.33
13.15 54.60100.00200.34
13.20 54.60 0.00300.00
13.25 54.60128.00-40.16
Now I've filled in the full 6-column field in the 4th column in some lines so the whitespace isn't there. Now the results are really screwed up and you have to go back to the actual column-counting parsing. I'd kinda' forgotten about the problem one runs into with real files concentrating too much on the specific solution to OP's particular problem/request.
>> cc=cell2mat(textscan(fid,['%5f' repmat('%6f',1,3)],'delimiter',''))
cc =
13.1000 54.6000 99.0000 0.3300
13.1500 54.6010 0.0020 0.3400
13.2000 54.6000 0.0030 0
13.2500 54.6010 28.0000 -40.1600
The correct array is
13.1000 54.6000 99.0000 0.33
13.1500 54.6000 100.0000 200.34
13.2000 54.6000 0.0000 300.00
13.2500 54.6000 128.0000 -40.16
Note also the last anomaly in behavior--owing to the '-', the parser manages to still get it right. I've worked that out before on just exactly how the rules say so, but it's convoluted enough I don't recall just otomh exactly how it does it but it has to do with what is done with whitespace.
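A sketch of the column-counting parse for this tougher case, assuming the lines are padded to the 5/6/6/6 layout (the rows below are retyped by hand on that assumption, so the spacing is hypothetical). Each field is cut out by position and converted separately.

```matlab
% Hypothetical rows for the tougher case, padded to the assumed 5/6/6/6 layout.
rows = {'13.10 54.60 99.00  0.33'
        '13.15 54.60100.00200.34'
        '13.20 54.60  0.00300.00'
        '13.25 54.60128.00-40.16'};
c = char(rows);

widths = [5 6 6 6];                % assumed field widths
stops  = cumsum(widths);           % last column of each field
starts = stops - widths + 1;       % first column of each field

data = zeros(size(c, 1), numel(widths));
for k = 1:numel(widths)
    % Slice field k for all rows at once, then convert the substrings.
    data(:, k) = str2double(cellstr(c(:, starts(k):stops(k))));
end
```

With these rows data matches the correct array listed above, including the full-width fourth column.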

Asked: 7 Mar 2014
Edited: dpb on 10 Mar 2014
