Extract data from text file

Question

0 votes

sample data.txt

I have a text file, 'sample data.txt', whose data is not in a usable form. I need to read this file, extract the data, and tabulate it in the order shown in the figure below. I am not sure how to do this.

I would really appreciate it if someone could guide me. Thank you.


Guillaume on 29 Apr 2019

The format of your text file is dreadful! Has it been altered in any way from its original format? It would be much easier to parse if the column data were separated by a tab or comma character instead of spaces, and if the table header weren't split onto two lines in the middle of one of the column headers.

The screenshot that you show doesn't match the text file you've attached and therefore leaves some questions unanswered:

It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multiword entries? If so, how can we identify which column a word belongs to?
The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? Even better, in my opinion, would be to convert to numbers; in that case, should Full thickness be converted to 100?

Unfortunately, because of that awful formatting, you're going to have to write a parser for the file and make plenty of assumptions that may be invalid and cause the parsing to fail on future files. If you can get the same data in a more sensible format that would be better.
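
If normalised output were wanted, the inconsistent spacing after < or > could be removed in one step. This is only a sketch on hypothetical values copied from the last column, not read from the real file; it leaves entries such as Full thickness untouched and makes no numeric conversion.

```matlab
% Hypothetical sample of last-column entries (assumed, not read from the file).
vals = {'< 50%','<50%','> 50%','Full thickness','Nil'};
% Drop any whitespace that directly follows a '<' or '>' sign.
cleaned = regexprep(vals,'([<>])\s+','$1');
disp(cleaned)
```

Keeping the original text (no such call) remains the safer choice if the exact spelling in the file matters.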

matlab noob on 29 Apr 2019

I really appreciate your reply. Regarding the questions you asked:

It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to? The other columns can also have multiword entries.
The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? I need the original text as it is; no conversion is wanted in my case.

Meanwhile, I'm looking for something that can read the specific strings in between the data that I'd like to extract. Could that idea be applied to this case?

Thank you.


Answer 1

Stephen23 on 29 Apr 2019

Edited: Stephen23 on 29 Apr 2019


2 votes

sample data.txt

That is a very badly formatted file. For example, the field delimiters are space characters, yet space characters also occur within the fields (with no text delimiters to group the fields together). There is no robust general solution for parsing such a poorly formatted file, although in some limited cases (such as with prior knowledge of the field contents) you might be able to parse it. Parsing such files will always be fragile. On that basis I assumed that the fields contain only text of the number and types that you have shown, i.e. each line contains exactly:

1 or 2 words (starts with 'Basal' or 'Mid' or 'Apical', or is exactly 'Apex')
1 number
1 word
('Nil' or 'Present')
('Nil' or 'Present')
('Nil' or 'Full thickness' or a percentage)

This matches all of the seventeen rows in your example data file:

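str = fileread('sample data.txt'); % read the whole file into one character vector
% one token per field: segment, number, wall motion, rest, stress, enhancement
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
	'\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];
tkn = regexpi(str,rgx,'tokens'); % case-insensitive match, one cell of tokens per row
tkn = vertcat(tkn{:})            % stack the rows into one cell array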

Giving:

tkn = 
    'Basal Anterior'         '1'     'Hypokinetic'    'Nil'        'Nil'        '50%'           
    'Basal Anteroseptal'     '2'     'Dyskinetic'     'Present'    'Present'    'Full thickness'
    'Basal Inferoseptal'     '3'     'Hypokinetic'    'Present'    'Present'    '50%'           
    'Basal Inferior'         '4'     'Hypokinetic'    'Nil'        'Present'    '50%'           
    'Basal Inferolateral'    '5'     'Normal'         'Nil'        'Nil'        'Nil'           
    'Basal Anterolateral'    '6'     'Normal'         'Nil'        'Nil'        'Nil'           
    'Mid Anterior'           '7'     'Hypokinetic'    'Nil'        'Nil'        '<50%'          
    'Mid Anteroseptal'       '8'     'Dyskinetic'     'Present'    'Present'    'Full thickness'
    'Mid Inferoseptal'       '9'     'Akinetic'       'Present'    'Present'    'Full thickness'
    'Mid Inferior'           '10'    'Hypokinetic'    'Nil'        'Present'    '<50%'          
    'Mid Inferolateral'      '11'    'Normal'         'Nil'        'Nil'        'Nil'           
    'Mid Anterolateral'      '12'    'Normal'         'Nil'        'Nil'        '<50%'          
    'Apical Anterior'        '13'    'Akinetic'       'Nil'        'Nil'        '50%'           
    'Apical Septal'          '14'    'Akinetic'       'Nil'        'Nil'        '< 50%'         
    'Apical Inferior'        '15'    'Akinetic'       'Nil'        'Nil'        '> 50%'         
    'Apical Lateral'         '16'    'Hypokinetic'    'Nil'        'Nil'        'Full thickness'
    'Apex'                   '17'    'Akinetic'       'Nil'        'Nil'        'Full thickness'
>> size(tkn)
ans =
    17     6
>>     

Clearly you can put that into a table if you really want to:

>> hdr = {'LeftVentricularSegments','No','WallMotion','PerfusionAtRest','PerfusionAtStress','DelayedGadoliniumEnhancement'};
>> T = cell2table(tkn,'VariableNames',hdr)
T = 
    LeftVentricularSegments     No      WallMotion      PerfusionAtRest    PerfusionAtStress    DelayedGadoliniumEnhancement
    _______________________    ____    _____________    _______________    _________________    ____________________________
    'Basal Anterior'           '1'     'Hypokinetic'    'Nil'              'Nil'                '50%'                       
    'Basal Anteroseptal'       '2'     'Dyskinetic'     'Present'          'Present'            'Full thickness'            
    'Basal Inferoseptal'       '3'     'Hypokinetic'    'Present'          'Present'            '50%'                       
    'Basal Inferior'           '4'     'Hypokinetic'    'Nil'              'Present'            '50%'                       
    'Basal Inferolateral'      '5'     'Normal'         'Nil'              'Nil'                'Nil'                       
    'Basal Anterolateral'      '6'     'Normal'         'Nil'              'Nil'                'Nil'                       
    'Mid Anterior'             '7'     'Hypokinetic'    'Nil'              'Nil'                '<50%'                      
    'Mid Anteroseptal'         '8'     'Dyskinetic'     'Present'          'Present'            'Full thickness'            
    'Mid Inferoseptal'         '9'     'Akinetic'       'Present'          'Present'            'Full thickness'            
    'Mid Inferior'             '10'    'Hypokinetic'    'Nil'              'Present'            '<50%'                      
    'Mid Inferolateral'        '11'    'Normal'         'Nil'              'Nil'                'Nil'                       
    'Mid Anterolateral'        '12'    'Normal'         'Nil'              'Nil'                '<50%'                      
    'Apical Anterior'          '13'    'Akinetic'       'Nil'              'Nil'                '50%'                       
    'Apical Septal'            '14'    'Akinetic'       'Nil'              'Nil'                '< 50%'                     
    'Apical Inferior'          '15'    'Akinetic'       'Nil'              'Nil'                '> 50%'                     
    'Apical Lateral'           '16'    'Hypokinetic'    'Nil'              'Nil'                'Full thickness'            
    'Apex'                     '17'    'Akinetic'       'Nil'              'Nil'                'Full thickness' 
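
All six variables in that table are text. If the No column is more useful as a number, it could be converted afterwards. A minimal sketch, using a hypothetical two-row, three-column subset of tkn rather than the real file:

```matlab
% Hypothetical subset of the token array (assumed values for illustration).
tkn = {'Basal Anterior','1','Hypokinetic'; 'Apex','17','Akinetic'};
hdr = {'LeftVentricularSegments','No','WallMotion'};
T = cell2table(tkn,'VariableNames',hdr);
% Replace the text column with its numeric equivalent.
T.No = str2double(T.No);
disp(T)
```

The other columns stay as text, which matches the request to keep the original strings unchanged.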


matlab noob on 1 May 2019

Edited: matlab noob on 1 May 2019


% capture next line 
nl = '[\r\n]+';
% read the text file
file = fileread(a);
expression = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)',... % extract all string begin with 'Apex', 'Basal', 'Mid', 'Apical'
              nl,'(\d+)',... % number after the LVsegments
              nl, '([A-Z][a-z]+\s?[a-z]+)',...% Wall motion 
              nl,'(([<>]\s?)?\d+\%|(\d+\%)|([<>]\s?)?\d+\s?\%|[A-Z]+\s?[-][A-Z]+|[A-Z][a-z]+\s?[a-z]+)' % Delayed Gadolinium Enhancement
              ]; 
str = regexpi(file, expression, 'tokens');
str = vertcat(str{:});
% Insert header for each data extracted
header = {'Left_Ventricular_Segments','No','Wall_Motion','Delayed_Gadolinium_Enhancement'};
% Data tabulation
table = cell2table(str,'VariableNames', header)

This is the code I wrote to read all of my text files (100+), but I have run into a problem.

As mentioned before, the text files do not arrange their data consistently.

Most of the text files consist of

"Left_Ventricular_Segments" "No" "Wall_Motion" "Delayed_Gadolinium_Enhancement"

However, a few of the text files contain one extra column:

"Left_Ventricular_Segments" "No" "Wall_Motion" "Perfusion Defect At Stress" "Delayed_Gadolinium_Enhancement"

I've read about the conditional operators (?(cond)expr) and (?(cond)expr1|expr2). Are they applicable in my situation? I'm still struggling with how to use them.

Or is there a smarter way to include this condition in the code? Otherwise I will go the dumb way and add another line for this purpose, as follows. Thank you.

% capture next line 
nl = '[\r\n]+';
expression = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)',... % extract all string begin with 'Apex', 'Basal', 'Mid', 'Apical'
              nl,'(\d+)',... % number after the LVsegments
              nl, '([A-Z][a-z]+\s?[a-z]+)',...% Wall motion 
              nl, '([A-Z][a-z]+\s?[a-z]+)',...% Perfusion Defect At Stress
              nl,'(([<>]\s?)?\d+\%|(\d+\%)|([<>]\s?)?\d+\s?\%|[A-Z]+\s?[-][A-Z]+|[A-Z][a-z]+\s?[a-z]+)' % Delayed Gadolinium Enhancement
              ]; 
table = 
    Left_Ventricular_Segments     No      Wall_Motion     Perfusion_Defect_At_Stress    Delayed_Gadolinium_Enhancement
    'Basal Anterior'             '1'     'Hypokinetic'    'Nil'                         'Nil'                         
   ...
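
One alternative to a conditional operator is to test for the extra header first and then assemble the expression from per-field pieces. The sketch below is only an illustration under assumptions: the file text is a hypothetical two-record sample with one field per line, the header is assumed to read "Perfusion Defect At Stress", and the field patterns are simplified. It uses a literal space instead of \s inside the fields so that a field cannot run across a line break into the next one.

```matlab
% Hypothetical file contents, one field per line (assumed layout).
file = sprintf('%s\n','Left Ventricular Segments','No','Wall Motion', ...
    'Perfusion Defect At Stress','Delayed Gadolinium Enhancement', ...
    'Basal Anterior','1','Hypokinetic','Nil','50%', ...
    'Apex','17','Akinetic','Nil','Full thickness');
nl  = '[\r\n]+';                                         % line break(s)
seg = '(Apex|(?:Basal|Mid|Apical) +[A-Z][a-z]+)';        % segment name
num = '(\d+)';                                           % segment number
wrd = '([A-Z][a-z]+(?: [a-z]+)?)';                       % one or two words
dge = '((?:[<>] ?)?\d+ ?%|[A-Z][a-z]+(?: [a-z]+)?)';     % percentage or words
% Decide from the header whether the extra column is present.
hasStress = ~isempty(regexpi(file,'Perfusion\s+Defect\s+At\s+Stress','once'));
if hasStress
    expression = [seg,nl,num,nl,wrd,nl,wrd,nl,dge];
else
    expression = [seg,nl,num,nl,wrd,nl,dge];
end
tkn = regexpi(file,expression,'tokens');
tkn = vertcat(tkn{:})
```

Because the column count is decided once per file, each file still yields a rectangular cell array that cell2table accepts.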

Stephen23 on 1 May 2019

Edited: Stephen23 on 1 May 2019


This code reads your three later files, where each field is on its own line.

The code relies on one main assumption: that the header name "No" appears by itself on one line, which is used to anchor and identify the block of data that you are looking for. The other lines are simply contiguous with that header name. It also uses the "No" field values to identify the number of fields: this requires that only the "No" fields contain numeric values.

R = '([^\n]+\n)*No(\n[^\n]+)+'; % regular expression, contiguous around "No".
S = dir('sample*.txt');
N = numel(S);
C = cell(1,N);
for k = 1:N
	str = fileread(S(k).name);
	str = regexprep(str,'\r\n','\n'); % replace Windows newlines.
	M = regexp(str,R,'match','once'); % match lines of text file.
	P = regexp(M,'\n','split');       % split lines into cell array.
	V = str2double(P);                % convert lines into numbers.
	D = mean(diff(find(~isnan(V))));  % spacing of non-NaN (i.e. "No") lines = number of fields.
	H = regexprep(P(1:D),'\s+','_');  % get header lines.
	X = strcmpi(P{D+1},'Enhancement');   % identify superfluous header.
	A = reshape(P(1+X+D:end),D,[]).';    % get data lines.
	T = cell2table(A,'variableNames',H); % convert data + header into table.
	C{k} = T;
end

Giving:

>> C{:}
ans = 
    Left_Ventricular_Segments     No     Perfusion_defect_at_rest    Perfusion_defect_at_stress     Wall_Motion     Delayed_Gadolinium
    _________________________    ____    ________________________    __________________________    _____________    __________________
    'Basal Anterior'             '1'     'Nil'                       'Nil'                         'Normal'         'Mid wall'        
    'Basal Anteroseptal'         '2'     'Nil'                       'Nil'                         'Normal'         'Mid wall'        
    'Basal Inferoseptal'         '3'     'Nil'                       'Nil'                         'Normal'         'Mid wall'        
    'Basal Inferior'             '4'     'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Basal Inferolateral'        '5'     'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Basal Anterolateral'        '6'     'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Mid Anterior'               '7'     'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Mid Anteroseptal'           '8'     'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Mid Inferoseptal'           '9'     'Present'                   'Present'                     'Hypokinetic'    'Full thickness'  
    'Mid Inferior'               '10'    'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Mid Inferolateral'          '11'    'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Mid Anterolateral'          '12'    'Nil'                       'Nil'                         'Normal'         'Nil'             
    'Apical Anterior'            '13'    'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Apical Septal'              '14'    'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Apical Inferior'            '15'    'Present'                   'Present'                     'Akinetic'       'Full thickness'  
    'Apical Lateral'             '16'    'Nil'                       'Nil'                         'Normal'         '<50%'            
    'Apex'                       '17'    'Nil'                       'Nil'                         'Dystkinetic'    '<50%'            
ans = 
    Left_Ventricular_Segments     No     Wall_Motion    Perfusion_At_Rest    Perfusion_At_Stress    Delayed_Gadolinium
    _________________________    ____    ___________    _________________    ___________________    __________________
    'Basal Anterior'             '1'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Anteroseptal'         '2'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Inferoseptal'         '3'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Inferior'             '4'     'Normal'       'Nil'                'Nil'                  '50% (mid wall)'  
    'Basal Inferolateral'        '5'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Basal Anterolateral'        '6'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Anterior'               '7'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Anteroseptal'           '8'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Inferoseptal'           '9'     'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Inferior'               '10'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Inferolateral'          '11'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Mid Anterolateral'          '12'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Anterior'            '13'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Septal'              '14'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Inferior'            '15'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apical Lateral'             '16'    'Normal'       'Nil'                'Nil'                  'Nil'             
    'Apex'                       '17'    'Normal'       'Nil'                'Nil'                  'Nil'             
ans = 
    Left_Ventricular_Segments     No      Wall_Motion     Perfusion_Defect_At_Stress    Delayed_Gadolinium
    _________________________    ____    _____________    __________________________    __________________
    'Basal Anterior'             '1'     'Hypokinetic'    'Nil'                         'Nil'             
    'Basal Anteroseptal'         '2'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Basal Inferoseptal'         '3'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Basal Inferior'             '4'     'Hypokinetic'    'Nil'                         'Nil'             
    'Basal Inferolateral'        '5'     'Hypokinetic'    'Nil'                         'Nil'             
    'Basal Anterolateral'        '6'     'Hypokinetic'    'Nil'                         'Nil'             
    'Mid Anterior'               '7'     'Hypokinetic'    'Nil'                         '50%'             
    'Mid Anteroseptal'           '8'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Mid Inferoseptal'           '9'     'Hypokinetic'    'Nil'                         'Mid wall'        
    'Mid Inferior'               '10'    'Hypokinetic'    'Possibly'                    'Nil'             
    'Mid Inferolateral'          '11'    'Hypokinetic'    'Nil'                         'Nil'             
    'Mid Anterolateral'          '12'    'Hypokinetic'    'Nil'                         'Nil'             
    'Apical Anterior'            '13'    'Akinetic'       'Nil'                         '50%'             
    'Apical Septal'              '14'    'Hypokinetic'    'Nil'                         '50%'             
    'Apical Inferior'            '15'    'Hypokinetic'    'Nil'                         '50%'             
    'Apical Lateral'             '16'    'Hypokinetic'    'Nil'                         '< 50%'           
    'Apex'                       '17'    'Dyskinetic'     'Nil'                         '50%'             
>> 
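
The field-counting step can be seen in isolation on a small example. This is a sketch on hypothetical lines (four header names followed by two records), not on the real files, and it omits the check for the superfluous "Enhancement" header line:

```matlab
% Hypothetical lines: 4 header names followed by 2 records of 4 fields each.
P = {'Left Ventricular Segments','No','Wall Motion','Delayed Gadolinium', ...
     'Basal Anterior','1','Normal','Nil', ...
     'Apex','17','Akinetic','Nil'};
V = str2double(P);               % NaN everywhere except the "No" values
idx = find(~isnan(V));           % positions of the numeric lines
D = mean(diff(idx));             % spacing between them = fields per record
H = regexprep(P(1:D),'\s+','_'); % first D lines are the header names
A = reshape(P(D+1:end),D,[]).';  % remaining lines, one record per row
T = cell2table(A,'VariableNames',H)
```

Note that this needs at least two records: with a single numeric line, diff returns an empty array and the mean is NaN.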

matlab noob on 1 May 2019

Thank you so much for your help and explanation. I think I understand your concept. However, I'll need some time to understand the code. Once again, thank you!


Answer 2

KSSV on 29 Apr 2019


0 votes

T = readtable(myfile)


Guillaume on 29 Apr 2019

There is no way that readtable can cope with the sample file supplied by the OP.

matlab noob on 29 Apr 2019

I appreciate the reply. Thanks!
