Reading in ascii files with white space as delimiter.
I am trying to read in a very simple ascii file that looks like the following:
   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV
    hPa     m      C      C      %    g/kg    deg   knot     K      K      K 
-----------------------------------------------------------------------------
  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6
  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2
  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1
  ...
There seem to be a dozen functions that I can read this in with but I'm struggling with all of them.
The simplest seems to be dlmread. I'm currently using the command:
M = dlmread('radiosonde.ascii',' ',3,1)
However this seems to register a single space as the delimiter instead of all the white space. If I use:
M = dlmread('radiosonde.ascii')
It registers the white space as the delimiter, but I cannot tell it to ignore the headers. Is there some way to specify white space as the delimiter while also ignoring the headers?
Is there a better way to do this? Why hasn't Mathworks streamlined reading text files to be one universal function?
Answers (2)
  Kevin Claytor
      
 on 9 Nov 2015
        importdata seems to work pretty well (but doesn't directly get you the headers):
importdata('radiosonde.ascii', ' ', 3)
If you know the exact format, textscan is what the code auto-generated by the import tool (right click > import data) uses:
fileID = fopen('radiosonde.ascii','r');   % textscan needs an open file identifier
startRow = 4;
formatSpec = '%7s%7s%7s%7s%7s%7s%7s%7s%7s%7s%s%[^\n\r]';
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'HeaderLines' ,startRow-1, 'ReturnOnError', false);
fclose(fileID);
  dpb
      
      
 on 9 Nov 2015
        
      Edited: dpb
      
      
 on 13 Nov 2015
  
      "The better way..."
x=textread('radiosonde.ascii','','headerlines',3);
I hadn't noted before the symptom of repeated delimiters with dlmread; agreed that's a pit[proverbial]a[ppendage].
IMO, it's unfortunate TMW has chosen to deprecate textread in favor of textscan; textread has the advantages that it
- returns a "regular" double array instead of only a cell array,
- doesn't need the extra fopen/fclose step where a single file read suffices, and
- as shown below, "counts" the record length and returns the correct shape automagically, whereas textscan has to be told or one has to reshape the returned array.
The above equivalent in textscan would be
fid=fopen('radiosonde.ascii','r');   % textscan wants a file handle, not a name
x=cell2mat(textscan(fid,repmat('%f',1,11), ...
                        'delimiter',' ', ...
                        'headerlines',3, ...
                        'multipledelimsasone',1));
fid=fclose(fid);
textscan is the one general function, but there are so many possibilities (as in infinite) to cover that making something both general and flexible is difficult; hence the specialized functions for specific cases. It does seem as though a multiple-delimiters option would be a worthwhile enhancement for those functions; as noted, I hadn't run into that behavior before because I tend to use the textread route for the above reasons. There are things textread can't do that textscan can (being callable on the same file multiple times is a major one), but instead of deprecating it, TMW should bring it up to the level of textscan, imo. Alternatively, there's the option I've asked for since textscan was introduced: give it an optional ability to return the double array directly and to accept a file name as well as a file handle.
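Until textscan has such an option, the open/read/close steps can be bundled by hand. The following is only a minimal sketch under some assumptions: it writes the sample rows from the question to a temporary file so it can be run anywhere, the variable names (fname, rows, x) are made up for illustration, and it relies on textscan's default whitespace handling rather than on 'delimiter',' '.

```matlab
% Recreate the sample file from the question in a temporary location.
rows = {
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV'
    '    hPa     m      C      C      %    g/kg    deg   knot     K      K      K '
    '-----------------------------------------------------------------------------'
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6'
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2'
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'};
fname = [tempname '.ascii'];          % hypothetical scratch file name
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', rows{:});        % '%s' keeps the literal '%' in the units line intact
fclose(fid);

% Read it back: skip 3 header lines, 11 numeric fields per record,
% and collect the 11 columns into one double array.
fid = fopen(fname, 'r');
C = textscan(fid, repmat('%f', 1, 11), ...
             'HeaderLines', 3, ...
             'CollectOutput', true);
fclose(fid);
x = C{1};                             % plain double array, one row per record

delete(fname);                        % remove the scratch file
```

With 'CollectOutput' the eleven columns come back in a single cell, so C{1} stands in for the cell2mat call.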
(+) ADDENDUM/ERRATUM
Actually, on reading the source for dlmread I noticed something I hadn't seen before (and I don't think it's documented, at least not well): if one passes an empty string as the format string, textscan works out the number of fields per input record internally and returns the data as a regular numeric block with that many columns. That is a super result that should be shouted from the rooftops by TMW but seems to be a closely held secret:
    >> cell2mat(textscan(fid,'','collectoutput',1,'headerlines',3))
  ans =
    Columns 1 through 8
    994.0000  270.0000    7.0000    6.0000   93.0000    5.9300   40.0000   10.0000
    989.0000  312.0000    6.2000    5.2000   93.0000    5.6400   42.0000   12.0000
    972.0000  455.0000    4.8000    4.0000   95.0000    5.2700   48.0000   18.0000
    Columns 9 through 11
    280.6000  297.1000  281.6000
    280.2000  295.9000  281.2000
    280.2000  294.9000  281.1000
  >>
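For anyone wanting to try the empty-format behavior without the original file, here is a small self-contained sketch. It assumes the sample rows from the question, written to a temporary file; the names fname, rows and nCols are illustrative only.

```matlab
% Recreate the sample file from the question in a temporary location.
rows = {
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV'
    '    hPa     m      C      C      %    g/kg    deg   knot     K      K      K '
    '-----------------------------------------------------------------------------'
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6'
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2'
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1'};
fname = [tempname '.ascii'];          % hypothetical scratch file name
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

% Empty format string: textscan infers the number of numeric fields
% per record from the data after the skipped header lines.
fid = fopen(fname, 'r');
C = textscan(fid, '', 'HeaderLines', 3, 'CollectOutput', true);
fclose(fid);
x = C{1};
nCols = size(x, 2);                   % column count was never specified above

delete(fname);                        % remove the scratch file
```

The point of the sketch is that the column count appears nowhere in the call; nCols is whatever textscan inferred from the records.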
2 Comments
  dpb
      
      
 on 9 Nov 2015
				
      Edited: dpb
      
      
 on 10 Nov 2015
  
			>> help dlmread
 dlmread Read ASCII delimited file.
    RESULT = dlmread(FILENAME) reads numeric data from the ASCII
    delimited file FILENAME.  The delimiter is inferred from the formatting
    of the file.
    RESULT = dlmread(FILENAME,DELIMITER) reads numeric data from the ASCII
    delimited file FILENAME using the delimiter DELIMITER.  The result is
    returned in RESULT.  Use '\t' to specify a tab.
    When a delimiter is inferred from the formatting of the file,
    consecutive whitespaces are treated as a single delimiter.  By
    contrast, if a delimiter is specified by the DELIMITER input, any
    repeated delimiter character is treated as a separate delimiter.
    ...
I'd forgotten this detail; the behavior is documented. The problem is, there's no way with the interface as designed to specify the header rows and not the delimiter...it's a remnant of the original procedural interface design of the functions; quite often they weren't written to be as general as could/should have been.
dlmread is an m-file; it wouldn't be too hard to extend it to handle the case--
The preprocessing section looks like the following:
...
% Get Delimiter
if nargin==1 % Guess default delimiter
  [fid, theMessage] = fopen(filename);
  if fid < 0
    error(message('MATLAB:dlmread:FileNotOpened', filename, theMessage));
  end
  str = fread(fid, 4096,'*char')';
  frewind(fid);
  delimiter = guessdelim(str);
  if isspace(delimiter); delimiter = ''; end 
else
  delimiter = sprintf(delimiter); % Interpret \t (if necessary)
end
...
If one were to use [] as a placeholder for the delimiter but also provide the R,C offsets, nargin still counts the placeholder as an argument, so it would be pretty easy to test for an empty second argument in addition to the one-argument case and have it do the delimiter search. Then the multiple-vs-single distinction would come into play only when the delimiter is explicitly specified.
Would take a little more effort to handle that case as well, but certainly doable (and probably should have been).
ADDENDUM
Modified the above if to
if nargin==1 || (nargin>1 && isempty(delimiter)) % Guess default delimiter
and voila! using [] as a placeholder for the DELIMITER argument lets one specify the offset row,column arguments and still get the behavior of the multiple delimiters as one and automagic determination of same.
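The modified test can be checked in isolation without touching dlmread itself. This is just a sketch of the branch logic: shouldGuess is a made-up name, and nArgs stands in for the value nargin would have inside dlmread.

```matlab
% Hypothetical stand-in for the modified condition: guess the delimiter
% when only the file name is given, or when the delimiter slot is empty.
shouldGuess = @(nArgs, delimiter) nArgs == 1 || (nArgs > 1 && isempty(delimiter));

guessNameOnly    = shouldGuess(1, []);    % dlmread(filename)
guessPlaceholder = shouldGuess(4, []);    % dlmread(filename, [], R, C)
guessExplicit    = shouldGuess(4, ' ');   % dlmread(filename, ' ', R, C)
```

Only the last case, where a delimiter is explicitly given, falls through to the else branch and keeps the one-delimiter-per-character behavior.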
  dpb
      
      
 on 12 Nov 2015
				BTW, the above working for the example file is sorta happenstance; the documentation also includes the caveat
All data in the input file must be numeric. dlmread does not operate 
on files containing nonnumeric data, even if the specified rows and
columns for the read contain numeric data only.
The example file is an anomaly that does, in fact, read correctly when the headers are skipped; not all will. I've not pursued this part in depth, but it undoubtedly has to do with the fact that the delimiter search reads an arbitrary 4096 characters and infers the delimiter from them, making assumptions that may turn out to be incorrect for a general line.


