Extract data from text file

I have this 'sample data.txt' text file with the data not in the right form. I need to read this text file and extract the data and tabulate it in the order as shown in figure below. I am not sure how can I do it.
Really appreciate it if someone can help to guide me. Thank you.

2 Comments

The format of your text file is dreadful! Has it been altered in any way from its original format? It would be much easier to parse if the column data was separated by a tab or comma character instead of spaces and if the table header wasn't split onto two lines within one of the column header.
The screenshot that you show doesn't match the text file you've attached and therefore leave some questions unanswered:
  • It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to?
  • The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? Even better in my opinion would be to convert to numbers, in that case should Full thickness be converted to 100?
Unfortunately, because of that awful formatting, you're going to have to write a parser for the file and make plenty of assumptions that may be invalid and cause the parsing to fail on future files. If you can get the same data in a more sensible format that would be better.
Really appreciate for your reply regarding this question. According to the question you've asked
  • It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to? The other column can also have multi word entris
  • The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? I'll need the original text as it is, no conversion is encourage in my case.
Meanwhile I'm searching something that can read specific string inbetween those data that I'll like to extract out. Is it possible for this idea to apply for this case?
Thank you.

Sign in to comment.

 Accepted Answer

Stephen23
Stephen23 on 29 Apr 2019
Edited: Stephen23 on 29 Apr 2019
That is a very badly formatted file. For example, the field delimiters are space characters and space characters also occur within the fields (without any text delimiters to group the fields together). There is no robust general solution for parsing such a poorly formatted file, altough in some limited cases (such as with prior knowledge of the field contents) you might be able to parse it but parsing such files will always be fragile. On that basis I assumed that the fields contain only the text in the number and types that you have shown, i.e. each line contains exactly:
  1. 1 or 2 words (starts with 'Basal' or 'Mid' or 'Apical', or constitutes 'Apex')
  2. 1 number
  3. 1 word
  4. ('Nil' or 'Present')
  5. ('Nil' or 'Present')
  6. ('Nil' or 'Full thickness' or a percentage)
This matches all of the seventeen rows in your example data file:
str = fileread('sample data.txt');
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
'\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];
tkn = regexpi(str,rgx,'tokens');
tkn = vertcat(tkn{:})
Giving:
tkn =
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil' '50%'
'Basal Anteroseptal' '2' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Present' 'Present' '50%'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Present' '50%'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' 'Nil' '<50%'
'Mid Anteroseptal' '8' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferoseptal' '9' 'Akinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferior' '10' 'Hypokinetic' 'Nil' 'Present' '<50%'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' '<50%'
'Apical Anterior' '13' 'Akinetic' 'Nil' 'Nil' '50%'
'Apical Septal' '14' 'Akinetic' 'Nil' 'Nil' '< 50%'
'Apical Inferior' '15' 'Akinetic' 'Nil' 'Nil' '> 50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' 'Nil' 'Full thickness'
'Apex' '17' 'Akinetic' 'Nil' 'Nil' 'Full thickness'
>> size(tkn)
ans =
17 6
>>
Clearly you can put that into a table if you really want to:
>> hdr = {'LeftVentricularSegments','No','WallMotion','PerfusionAtRest','PerfusionAtStress','DelayedGadoliniumEnhancement'};
>> T = cell2table(tkn,'VariableNames',hdr)
T =
LeftVentricularSegments No WallMotion PerfusionAtRest PerfusionAtStress DelayedGadoliniumEnhancement
_______________________ ____ _____________ _______________ _________________ ____________________________
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil' '50%'
'Basal Anteroseptal' '2' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Present' 'Present' '50%'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Present' '50%'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' 'Nil' '<50%'
'Mid Anteroseptal' '8' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferoseptal' '9' 'Akinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferior' '10' 'Hypokinetic' 'Nil' 'Present' '<50%'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' '<50%'
'Apical Anterior' '13' 'Akinetic' 'Nil' 'Nil' '50%'
'Apical Septal' '14' 'Akinetic' 'Nil' 'Nil' '< 50%'
'Apical Inferior' '15' 'Akinetic' 'Nil' 'Nil' '> 50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' 'Nil' 'Full thickness'
'Apex' '17' 'Akinetic' 'Nil' 'Nil' 'Full thickness'

12 Comments

Really appreciate for your help and yes the text file is badly formatted. I can upload another sample file that shows different arragement.
I agree about the point that you've mention, each of the text file contain 'Basal', 'Mid', 'Apical', 'Apex'. However, I'm afraid that it won't work for other text file. I'll try out your method with other text file.
Meanwhile, I'm also looking into "Data extraction from text file start and end with specific string". I'm not sure if my concept regarding this is in the right path...
However, I'll proceed with your method first. I'll be back in case I need some assistance.
Thank you very much for the early response.
Stephen23
Stephen23 on 29 Apr 2019
Edited: Stephen23 on 29 Apr 2019
@matlab noob: the second file format would be much easier to work with, as it places each field on its own line. Should be quite simple to parse with a variety of methods (including textscan, regular expressions, etc.).
However if you are looking for one parser that can handle both those file formats AND also fulfill that each field can have multiple words then you will be searching in vain.
"I'll be back in case I need some assistance."
Of course, anytime.
"Thank you very much for the early response."
My pleasure. Remember to vote for my answer!
can you explain about this
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
'\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];
or any site recommended as a guideline to understand what is the use of '[A-Z]' 's+' '[<>]\s?'...
As you've mention that sample data 2 will be easier to work with, I'll need your help to guide me in contracting the code.
Becuase I am quite new to matlab, hope that you won't mind. Thank you.
You are in luck, Stephen has even written a tool to help understand Regular Expressions! See https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-tool
Briefly,
square brackets denote list of characters, any one of which must be matched at that location. Ranges are permitted inside the [], so [A-Z] is the same as [ABCDEFGHIJKLMNOPQRSTUVWXYZ] and so matches any one upper case Latin alphabet character
Likewise [a-z] is an abbreviation for the range of lower-case letters and so matches any one lower case Latin alphabet character.
The + right after [a-z] means that one or more of what was just before that must be matched.
Put together, [A-Z][a-z]+ means that you must have an upper case character, followed immediately by one or more lower-case letters.
The \s is any one whitespace character. The + after it is "one or more", so \s+ means one or more whitespace characters.
@Walter Roberson Thank you very much in sharing the link and also the details explanation regarding the question I've asked previously.
Meanwhile, I've succeed in building the code to read the other text file.
Thank you very much for the help from all of you!
Stephen23
Stephen23 on 30 Apr 2019
Edited: Stephen23 on 30 Apr 2019
"...any site recommended as a guideline to understand..."
Regular expressions are a special, very powerful language for matching combinations of characters by type, number, position, etc.. Learning how to use regular expressions requires reading the documentation a lot. By which I really mean a lot: there is no shortcut.
I still find myself reading the regular expression documentation to re-check points like the exact syntax for negative look-behind assertions.
% capture next line
nl = '[\r\n]+';
% read the text file
file = fileread(a);
expression = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)',... % extract all string begin with 'Apex', 'Basal', 'Mid', 'Apical'
nl,'(\d+)',... % number after the LVsegments
nl, '([A-Z][a-z]+\s?[a-z]+)',...% Wall motion
nl,'(([<>]\s?)?\d+\%|(\d+\%)|([<>]\s?)?\d+\s?\%|[A-Z]+\s?[-][A-Z]+|[A-Z][a-z]+\s?[a-z]+)' % Delayed Gadolinium Enhancement
];
str = regexpi(file, expression, 'tokens');
str = vertcat(str{:});
% Insert header for each data extracted
header = {'Left_Ventricular_Segments','No','Wall_Motion','Delayed_Gadolinium_Enhancement'};
% Data tabulation
table = cell2table(str,'VariableNames', header)
This is the code that I've done to read all my text file (100+), but I face some problem.
Recalling the problem of the text file, it does not have a consistent arrangement of data.
Some text file (mostly) consist of
"Left_Ventricular_Segments" "No" "Wall_Motion" "Delayed_Gadolinium_Enhancement"
which apply to most of the cases.
However, some of the text file (only a few) consist of one extra column
"Left_Ventricular_Segments" "No" "Wall_Motion" "Perfusion Defect At Stress" "Delayed_Gadolinium_Enhancement"
I've read that there is this (?(cond)expr) & (?(cond)expr1|expr2) is it applicable in my situation? Meanwhile still struggling on how to use this...
Or is there any smarter way in including this condition into the code? Esle I will go for a dumb way by adding another line for this purposes. Thank you.
% capture next line
nl = '[\r\n]+';
expression = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)',... % extract all string begin with 'Apex', 'Basal', 'Mid', 'Apical'
nl,'(\d+)',... % number after the LVsegments
nl, '([A-Z][a-z]+\s?[a-z]+)',...% Wall motion
nl, '([A-Z][a-z]+\s?[a-z]+)',...% Perfusion Defect At Stress
nl,'(([<>]\s?)?\d+\%|(\d+\%)|([<>]\s?)?\d+\s?\%|[A-Z]+\s?[-][A-Z]+|[A-Z][a-z]+\s?[a-z]+)' % Delayed Gadolinium Enhancement
];
table =
Left_Ventricular_Segments No Wall_Motion Perfusion_Defect_At_Stress Delayed_Gadolinium_Enhancement
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil'
...
Stephen23
Stephen23 on 1 May 2019
Edited: Stephen23 on 1 May 2019
@matlab noob: The reason I used a rather complex regular expression in my answer is because in your original file (uploaded as part of your question) each line had multiple fields. But now your use of newlines indicates that the file format is different, and each field is on its own line (matching the file you uploaded here). As I wrote earlier, this second file format hugely simplifies your task as you only need to match the pattern (e.g. those fixed words in the first field) and the lines themselves. Forget about the rest of it: while you might get it to work, trying to adapt my answer when each field is already split onto its own line is too complex.
Also keep in mind that there is no guarantee that any robust parsing is possible with such badly formatted files.
If you want help with creating a simpler regular expression (or some other way to parse the second file format) then please upload a few sample files, including some with the extra column.
Sample data 3.txt consist of both "Perfusion Defect At Stress" and "Perfusion at Rest"
Sample data 4.txt consist of only 1 extra "Perfusion Defect At Stress"
Thank you.
This code reads your three later files, where each field is on its own line.
The code relies on one main assumption: that the header name "No" appears by itself on one line, which is used to anchor and identify the block of data that you are looking for. The other lines are simply contiguous with that header name. It also uses the "No" field values to identify the number of fields: this requires that only the "No" fields constitute numeric values.
R = '([^\n]+\n)*No(\n[^\n]+)+'; % regular expression, contiguous around "No".
S = dir('sample*.txt');
N = numel(S);
C = cell(1,N);
for k = 1:N
str = fileread(S(k).name);
str = regexprep(str,'\r\n','\n'); % replace Windows newlines.
M = regexp(str,R,'match','once'); % match lines of text file.
P = regexp(M,'\n','split'); % split lines into cell array.
V = str2double(P); % convert lines into numbers.
D = mean(diff(find(~isnan(V)))); % identify non-NaN (i.e. "No" lines").
H = regexprep(P(1:D),'\s+','_'); % get heater lines.
X = strcmpi(P{D+1},'Enhancement'); % identify superfluous header.
A = reshape(P(1+X+D:end),D,[]).'; % get data lines.
T = cell2table(A,'variableNames',H); % convert data + header into table.
C{k} = T;
end
Giving:
>> C{:}
ans =
Left_Ventricular_Segments No Perfusion_defect_at_rest Perfusion_defect_at_stress Wall_Motion Delayed_Gadolinium
_________________________ ____ ________________________ __________________________ _____________ __________________
'Basal Anterior' '1' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Anteroseptal' '2' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Inferoseptal' '3' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Inferior' '4' 'Nil' 'Nil' 'Normal' 'Nil'
'Basal Inferolateral' '5' 'Nil' 'Nil' 'Normal' 'Nil'
'Basal Anterolateral' '6' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Anterior' '7' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Mid Anteroseptal' '8' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Mid Inferoseptal' '9' 'Present' 'Present' 'Hypokinetic' 'Full thickness'
'Mid Inferior' '10' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Inferolateral' '11' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Anterolateral' '12' 'Nil' 'Nil' 'Normal' 'Nil'
'Apical Anterior' '13' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Septal' '14' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Inferior' '15' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Lateral' '16' 'Nil' 'Nil' 'Normal' '<50%'
'Apex' '17' 'Nil' 'Nil' 'Dystkinetic' '<50%'
ans =
Left_Ventricular_Segments No Wall_Motion Perfusion_At_Rest Perfusion_At_Stress Delayed_Gadolinium
_________________________ ____ ___________ _________________ ___________________ __________________
'Basal Anterior' '1' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anteroseptal' '2' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Inferoseptal' '3' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Inferior' '4' 'Normal' 'Nil' 'Nil' '50% (mid wall)'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anteroseptal' '8' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferoseptal' '9' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferior' '10' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Anterior' '13' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Septal' '14' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Inferior' '15' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Lateral' '16' 'Normal' 'Nil' 'Nil' 'Nil'
'Apex' '17' 'Normal' 'Nil' 'Nil' 'Nil'
ans =
Left_Ventricular_Segments No Wall_Motion Perfusion_Defect_At_Stress Delayed_Gadolinium
_________________________ ____ _____________ __________________________ __________________
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil'
'Basal Anteroseptal' '2' 'Hypokinetic' 'Nil' 'Mid wall'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Nil' 'Mid wall'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Nil'
'Basal Inferolateral' '5' 'Hypokinetic' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Hypokinetic' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' '50%'
'Mid Anteroseptal' '8' 'Hypokinetic' 'Nil' 'Mid wall'
'Mid Inferoseptal' '9' 'Hypokinetic' 'Nil' 'Mid wall'
'Mid Inferior' '10' 'Hypokinetic' 'Possibly' 'Nil'
'Mid Inferolateral' '11' 'Hypokinetic' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Hypokinetic' 'Nil' 'Nil'
'Apical Anterior' '13' 'Akinetic' 'Nil' '50%'
'Apical Septal' '14' 'Hypokinetic' 'Nil' '50%'
'Apical Inferior' '15' 'Hypokinetic' 'Nil' '50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' '< 50%'
'Apex' '17' 'Dyskinetic' 'Nil' '50%'
>>
Thank you so much for your help and explaination. I think I'm able to understand your concept. However, I'll need some time to understand the code. Once agian thank you!

Sign in to comment.

More Answers (1)

T = readtable(myfile)

2 Comments

There is no way that readtable can cope with the sample file supplied by the OP.
Appreciate for the reply. Thanks!

Sign in to comment.

Categories

Asked:

on 29 Apr 2019

Edited:

on 1 May 2019

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!