Extract value from .txt. Weird layout.

Question

2 votes

Hi, so here is my problem: I'd like to extract values (doubles) from a file in which adjacent columns are sometimes not separated by any kind of delimiter.

The text file can look like this:

 1994010103
12.05 54.60 38.00  0.28
12.10 54.60 43.00  0.30
13.10 54.60 99.00  0.33
13.15 54.60100.00  0.34
13.20 54.60  0.00  0.00
13.25 54.60128.00  0.16

I'm interested in the values in the third column. The first row is a date/time stamp, which I need to get rid of.

My solution to this problem is:

fid = fopen('file');
T = textscan(fid,'%s','delimiter',{'\n'});   % read every line as one string
fclose(fid);
ngx=39;
ngy=34;
n=ngx*ngy;   % data rows between two date/time rows
t=5839;      % number of date/time rows
for i =1:t
  T{1}((i-1)*n+1)=[];   % get rid of the date/time row that precedes each block of n data rows
end
interest = zeros(length(T{1}),1);
for i =1:length(T{1})
  interest(i) = str2double(T{1}{i}(12:18));   % extract the interesting characters from every row and convert them into a double
end

This code works, but I'm dealing with millions of rows and the loops make the computation time really long.

If you have any idea how to reduce the computation time, that'd be great!

Thanks

Answer 1

dpb on 7 Mar 2014

Edited: dpb on 7 Mar 2014

4 votes

This has been a point of contention of mine "since forever" -- there's no automatic way to read fixed-width column data with C-style format scanning (and hence in Matlab). TMW definitely needs to add the feature, so I recommend adding your voice to the feature enhancement list for it. Maybe in 30 more years...

But, with tools as are--

c=textscan(fid,'%s','delimiter','\n'); c=char(c{:});   % read, to char array
c(1:nSkip:end,:)=[];                                   % delete the date rows; nSkip = lines per block, date row included
data=str2num(c(:,12:18));                              % convert the desired columns

If you need more columns, either repeat the above for each one or use the known field widths to insert a delimiter at the desired locations and then use textscan on the array.
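
The delimiter-insertion route could look something like the sketch below. It assumes field widths of 5, 6, 6 and 6 characters (read off the sample rows in the question), that every data row has the same length, and that the date rows are already gone; the variable names are made up for the illustration.

```matlab
% Sample data rows from the question (date rows assumed already removed).
rows = {'12.05 54.60 38.00  0.28'
        '13.10 54.60 99.00  0.33'
        '13.15 54.60100.00  0.34'
        '13.20 54.60  0.00  0.00'
        '13.25 54.60128.00  0.16'};
c   = char(rows);                        % N-by-23 char array, one row per line
sep = repmat(',', size(c,1), 1);         % a column of commas
eol = repmat(char(10), size(c,1), 1);    % a column of newlines
% Splice a comma between the assumed fixed-width fields (5,6,6,6 characters).
d   = [c(:,1:5) sep c(:,6:11) sep c(:,12:17) sep c(:,18:23) eol];
txt = reshape(d', 1, []);                % flatten row by row into one char vector
vals  = cell2mat(textscan(txt, '%f%f%f%f', 'Delimiter', ','));
third = vals(:,3);                       % the column of interest
```

The splice works on the whole char array at once, so no loop over the rows is needed.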

ADDENDUM:

The best thing for Matlab, if you can, is to change how the files are generated going forward so they use a delimiter or at least a wider field. But, of course, sometimes it isn't feasible to do so.

8 Comments

dpb on 7 Mar 2014

Edited: dpb on 7 Mar 2014

>> type test.dat
12.05 54.60 38.00  0.28
12.10 54.60 43.00  0.30
13.10 54.60 99.00  0.33
13.15 54.60100.00  0.34
13.20 54.60  0.00  0.00
13.25 54.60128.00  0.16
>> [a,b,c,d]=textread('test.dat',['%5f' repmat('%6f',1,3)]);
>> [a b c d]
ans =
12.0500   54.6000   38.0000    0.2800
12.1000   54.6000   43.0000    0.3000
13.1000   54.6000   99.0000    0.3300
13.1500   54.6010         0    0.3400
13.2000   54.6000         0         0
13.2500   54.6010   28.0000    0.1600
>> cc=cell2mat(textscan(fid,['%5f' repmat('%6f',1,3)]))
cc =
12.0500   54.6000   38.0000    0.2800
12.1000   54.6000   43.0000    0.3000
13.1000   54.6000   99.0000    0.3300
13.1500   54.6010         0    0.3400
13.2000   54.6000         0         0
13.2500   54.6010   28.0000    0.1600
>>

Note that despite telling both textread and textscan that the field width is fixed, the C-compatible scanning doesn't terminate the field at the count and begin the next one at the next character, but "eats" the character after, and thus the two values of 100 and 128 are interpreted as 0 and 28 instead. Adding 'delimiter','' to textscan doesn't help.

I've never found any way around this behavior and in previous discussions it's been said it is C-Standard compatible and expected behavior.

Can you write a working parser for the above file?

My suggestion has been that Matlab should also have a Fortran FORMAT-like option that will cleanly deal with fixed-width fields w/o delimiters, with the added benefit that one can then write repeat fields cleanly as well as recursion, etc.

In the past I've written a mex utility that passes a limited subset to Fortran, but in the move back to the farm I appear to have lost the source, and it was involved enough that I've never had the inclination to reinvent it.
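
Until such an option exists, the column counting that a FORMAT descriptor would do can be approximated by slicing a char array at known offsets. This is only a minimal sketch: it assumes field widths of 5, 6, 6 and 6 (taken from the sample rows in the question) and equal-length lines, and `widths`, `first` and `last` are illustrative names, not part of any Matlab API.

```matlab
% Sample rows from the question; widths are an assumption read off the layout.
rows   = {'12.05 54.60 38.00  0.28'
          '13.15 54.60100.00  0.34'
          '13.25 54.60128.00  0.16'};
widths = [5 6 6 6];
c      = char(rows);              % N-by-sum(widths) char array
last   = cumsum(widths);          % last character position of each field
first  = last - widths + 1;       % first character position of each field
vals   = zeros(size(c,1), numel(widths));
for k = 1:numel(widths)           % loop over fields, not over rows
    % cellstr keeps the leading blanks; str2double ignores them on conversion
    vals(:,k) = str2double(cellstr(c(:, first(k):last(k))));
end
disp(vals)
```

The loop runs once per field, so its cost does not grow with the number of rows the way a per-line loop does.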

dpb on 8 Mar 2014

Edited: dpb on 9 Mar 2014

The problem is the definition of field in C --

"...textscan reads the number of characters or digits specified by the field width or precision, or up to the first delimiter, whichever comes first."

The key there is the phrase characters or digits -- it doesn't count delimiters as characters but consumes them uncounted, except for %c. It's just not a useful definition for fixed-width formatted files if the full column width is used.

%c does count characters correctly but it's not a very useful way to read data one wants to parse; in your case, where you're throwing those away, it's fine. But %s is fraught with pain: it eats whitespace until a delimiter regardless of count, iirc. I always have to go back and test because, while it is consistent, it's so confusing that I find it almost impossible to predict the result just from looking at a particular input and the "what seems logical" format string for a given case.

Anyway, I think this is the first time a TMW person has said anything other than the equivalent of "tough, that's just the way it is". Now that you see the problem can I plead with you to add an internal advocate voice to provide a solution?

As said above, I think the ideal solution would be to provide Fortran-like FORMAT-compatible i/o. Given that Matlab started with FORTRAN, it always seemed a shame to me that they ever got away from a system and syntax clearly far better than that used in C, altho I understand that since they went to C it's simpler to just mimic the development language's io.

Star Strider on 9 Mar 2014

Edited: Star Strider on 9 Mar 2014

I did, and I quoted this thread as my ‘justification’. (I also suggested an I/O format descriptor for engineering notation that would behave like the E/e descriptor, since that issue arose recently.) I figure the more ‘votes’ this issue gets in the form of Service Requests, the more likely it is to appear sooner rather than later.

If you’re logged in here, you should also be logged into everything else. (I like right-clicking because it’s easier to keep track of things. I just close the extra tabs when I’m finished.)

See if this works:

at the top of this page, right click on MathWorks.com
right click on Support
near the end of the page, right click on My Service Requests and create a new service request

I suggest you copy the URL for this thread first, then paste it in your Service Request. You’ve pretty much discussed everything of significance here, so there’s no need to retype it there.

Also, although you can’t vote for your own answer (I added my vote) you can vote for the question. (I did.)

dpb on 9 Mar 2014

Edited: dpb on 9 Mar 2014

OK, that path did work; following the direct link for some reason didn't recognize the login info and I get tired of the barriers very quickly in my dotage. :(

So, I did it again -- if it's as much longer before TMW does anything since my first submittals, I'll be about 100...I guess it will be a_good_thing (tm) if I am still able to use Matlab at all at that point to see the results. :)

I can't even count the number of times this has come up just in the <2 yr since I started following the forum a little, as promised for the complimentary updated license TMW generously provided after retirement. Quite a number of posters have asked the specific question OP did, and several others had this problem as the underlying reason for their query even though the question wasn't direct: the poster was bogged down in the processing, so the question asked was fairly far removed from root cause.

For some reason beyond my ken such a fundamental lack apparently has just never seemed important to anybody inside TMW with the clout to actually get anything done about it.

Having once done the mex interface to FORMAT, I know it has some difficulties if you try to implement a fully-functional version that handles every possible feature, but a workable subset that handles probably 90-95% of real-world cases isn't too bad, and I'd think TMW should be able to do it in at most a couple of months or so if they would just dedicate some resources to it. I thought at the time my version was probably about 80% of the way to being releasable, but even with that as a starter I wasn't able to generate any interest.

Answer 2

Ken Atwell on 7 Mar 2014

0 votes

Try replacing your second loop with something along the lines of:

 T = strjoin(T{1}', '\n');   % T{1} holds the lines; strjoin wants a 1-by-n cell array of strings
 interest = textscan(T, '%*11c %6f  %*[^\n]');
 interest = interest{1};

strjoin is a newer function to convert your cell array of strings to a single long string, which is what textscan will expect. If strjoin is not available in your version of MATLAB, http://www.mathworks.com/matlabcentral/fileexchange/31862-strjoin may help.

The textscan formatter string has three parts:

Ignore the first 11 characters (%*11c)
Interpret the next six characters as a floating-point number (%6f, the data you are interested in)
Ignore the remainder of the line (%*[^\n])
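
To see the three parts at work, the format can be tried on a few rows from the question held in a single string. This is only a sketch on sample data; `rows`, `txt` and `third` are names made up for the illustration.

```matlab
% Four data rows from the question, including the two with no blank
% between the second and third columns.
rows = {'12.05 54.60 38.00  0.28'
        '13.15 54.60100.00  0.34'
        '13.20 54.60  0.00  0.00'
        '13.25 54.60128.00  0.16'};
txt = strjoin(rows', '\n');                      % one long string, lines joined by newlines
out = textscan(txt, '%*11c %6f %*[^\n]');        % skip 11 chars, read one number, skip the rest
third = out{1};                                  % numeric column vector of the third field
disp(third)
```

Because the first two fields are skipped as raw characters rather than parsed as numbers, the missing blank before 100.00 and 128.00 never gets a chance to confuse the scan.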

1 Comment

dpb on 8 Mar 2014

Edited: dpb on 10 Mar 2014

Iff'en you're going to do that, may as well just write --

c=cell2mat(textscan(fid,'%*11c %6f %*[^\n]','delimiter',''))

which does as you note correctly skip the right number of columns. For OP's problem, he could then loop over the above also including

'headerlines',1

and the numeric count for the number of lines per subsection in the file.

Solves the OP's specific problem since only wants the one column, but still there's the gaping hole in Matlab functionality of the general case of parsing the whole file correctly w/o machinations.

I've posted examples like this during this discussion before but I don't recall you being one of the conversants so the following clearly demonstrates what's simply broke in C--

>> cc=(textscan(fid,['%5s' repmat('%6s',1,3)],'delimiter',''))
>> [cc{1} cc{2} cc{3} cc{4}]
ans = 
  '12.05'    '54.60 '    '38.00 '    '0.28'
  '12.10'    '54.60 '    '43.00 '    '0.30'
  '13.10'    '54.60 '    '99.00 '    '0.33'
  '13.15'    '54.601'    '00.00 '    '0.34'
  '13.20'    '54.60 '    '0.00  '    '0.00'
  '13.25'    '54.601'    '28.00 '    '0.16'
>>

NB the second and subsequent columns--they all begin with a nonwhite character instead of the blank or character that is the actual content in the initial field column if one counts position based on the format string field widths. That is consistent with the definition of what the field width means in C, but it is simply a practically wrong-headed definition. Consequently the 2nd has the string '54.60_' or '54.601', NOT the expected/needed/desired '_54.60', where I used the underscore to emphasize the blank. And, it ends up with the last column not even being full width.

C simply cannot keep its hands off the trailing location despite being explicitly told to do so. In kindergarten you get sent to the corner for timeout if you keep taking your neighbor's crayon... :)

ADDENDUM:

BTW, the above also depends upon the fact that there's always a whitespace character AFTER the 3rd column--observe what happens if you make the case a little tougher:

>> type test.dat
13.10 54.60 99.00  0.33
13.15 54.60100.00200.34
13.20 54.60  0.00300.00
13.25 54.60128.00-40.16

Now I've filled in the full 6-column field in the 4th column in some lines so the whitespace isn't there. Now the results are really screwed up and you have to go back to the actual column-counting parsing. I'd kinda' forgotten about the problem one runs into with real files concentrating too much on the specific solution to OP's particular problem/request.

>> cc=cell2mat(textscan(fid,['%5f' repmat('%6f',1,3)],'delimiter',''))
cc =
 13.1000   54.6000   99.0000    0.3300
 13.1500   54.6010    0.0020    0.3400
 13.2000   54.6000    0.0030         0
 13.2500   54.6010   28.0000  -40.1600

The correct array is

13.1000  54.6000   99.0000    0.33
13.1500  54.6000  100.0000  200.34
13.2000  54.6000    0.0000  300.00
13.2500  54.6000  128.0000  -40.16

Note also the last anomaly in behavior--owing to the '-', the parser manages to still get it right. I've worked that out before on just exactly how the rules say so, but it's convoluted enough I don't recall just otomh exactly how it does it but it has to do with what is done with whitespace.
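
One way to do the column-counting parse for this tougher file is to split each line by position with a regular expression and convert the pieces afterwards. This is a sketch, not a general solution: it assumes the 5, 6, 6, 6 field widths used throughout this thread, and the four rows are written here with the full 5-character first field.

```matlab
rows = {'13.10 54.60 99.00  0.33'
        '13.15 54.60100.00200.34'
        '13.20 54.60  0.00300.00'
        '13.25 54.60128.00-40.16'};
% One capture group per field; .{n} takes exactly n characters, blanks included.
tok  = regexp(rows, '^(.{5})(.{6})(.{6})(.{6})', 'tokens', 'once');
tok  = vertcat(tok{:});      % N-by-4 cell array of field strings
vals = str2double(tok);      % N-by-4 numeric array; leading blanks are ignored
disp(vals)
```

Since the split is purely positional, it does not matter whether a field is preceded by a blank, a digit or a minus sign.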
