MATLAB Answers

How to read in multiple text files, each containing multiple lines/formats?

Ankit Pasi on 16 May 2021
Answered: Mathieu NOE on 16 May 2021
Hi,
Thanks in advance for reading and for any support. I am trying to read multiple text files in a folder, for which I have the following code. The data come from this Kaggle dataset: https://www.kaggle.com/kmader/pulmonary-chest-xray-abnormalities
files = dir(fullfile('archive.1/ChinaSet_AllFiles/','ChinaSet_AllFiles','ClinicalReadings','*.txt'));
N = length(files)
data = []
for i = 1:N
t = files(i).name;
formatspec = '%s %s%*[^\r\n]%*[\r\n]+%s';
file = fopen(fullfile(files(i).folder,t),'r');
A = textscan(file , formatspec, 'delimiter','\n');
data = [data; A];
fclose(file)
end
It loops through the files fine but the files themselves have some data inconsistencies such as the following:
Usual Files:
femal 32yrs
normal
Other files:
male 40yrs
PTB in the right upper field
I need three columns for each file, for example: male, 40yrs, "PTB in the right upper field"
Can someone please help?

Answers (2)

dpb on 16 May 2021
It's very difficult to say without example files to see the nuances, but I'd handle the two records above more like this:
d=dir(fullfile('archive.1/ChinaSet_AllFiles/','ChinaSet_AllFiles','ClinicalReadings','*.txt'));
tData=[]; % empty table placeholder
for i = 1:numel(d) % iterate over dir struct
fid=fopen(fullfile(d(i).folder,d(i).name),'r'); % open file in turn
data=textscan(fid,'%s','delimiter','\n','whitespace',''); % read each line as one cellstr element
data=data{1}; % textscan returns a 1x1 cell; unwrap it to get the cellstr of records
tmp=split(strtrim(data(1))); % split the first record into sex, age fields
tData=[tData;table(tmp(1),tmp(2),data(2),'VariableNames',{'Gender','Age','Diagnosis'})]; % append row to table
fclose(fid);
end
The above assumes these are the only two record types and that they all follow the same pattern: two fields on the first line and one long record on the second.
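If the first line is not always that clean (extra spaces, or no space between sex and age), one alternative to split is to pull the fields out with a regular expression. This is only a minimal sketch: the helper name parseReading and the pattern are illustrative assumptions, not something checked against the whole dataset. It assumes the first non-blank line starts with a sex word followed by an age token that begins with a digit, and that whatever follows is the diagnosis.

```matlab
% Hypothetical helper (save as parseReading.m): parse the lines of one reading.
% Assumes line 1 looks like "<sex> <age>" and the remaining lines are the diagnosis.
function [sex, age, diagnosis] = parseReading(lines)
    lines = strtrim(string(lines(:)));   % accept cellstr or string input, as a column
    lines(lines == "") = [];             % drop blank lines
    sex = ""; age = ""; diagnosis = "";  % defaults when a field cannot be found
    if isempty(lines)
        return
    end
    % letters, optional whitespace, then a token starting with a digit
    tok = regexp(char(lines(1)), '^([A-Za-z]+)\s*(\d\S*)', 'tokens', 'once');
    if ~isempty(tok)
        sex = string(tok{1});
        age = string(tok{2});
    end
    if numel(lines) >= 2
        diagnosis = join(lines(2:end), " ");  % keep multi-line diagnoses together
    end
end
```

Inside the loop, the call would be something like [sex, age, diagnosis] = parseReading(data) after unwrapping the textscan output, or parseReading(readlines(fname)) on releases that have readlines. Files whose first line does not match the pattern come back with empty sex and age, which makes them easy to list and inspect by hand.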

Mathieu NOE on 16 May 2021
Hello,
I have to admit that I am no expert with textscan, so someone else will probably write better code than mine, but this is what I tried and tested as a workaround:
files = dir(fullfile('archive.1/ChinaSet_AllFiles/','ChinaSet_AllFiles','ClinicalReadings','*.txt'));
N = length(files);
data = [];
% for i = 1:N
% t = files(i).name;
% formatspec = '%s %s%*[^\r\n]%*[\r\n]+%s';
% file = fopen(fullfile(files(i).folder,t),'r');
% A = textscan(file , formatspec, 'delimiter',' ');
% data = [data; A];
% fclose(file)
% end
for i = 1:N
t = files(i).name;
rr = readlines(fullfile(files(i).folder,t));
temp = split(rr{1});
% remove empty cells
empty = cellfun('isempty',temp);
temp(empty) = [];
% finally...
A = [temp' rr{2}];
data = [data; A];
end
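Once the loop has finished, data is an N-by-3 cell array, and it may be convenient to turn it into a table with named columns. A minimal sketch, using the two sample records from the question as stand-in data; the column names and the extra AgeYears variable are assumptions for illustration, and the numeric conversion only works where the age really looks like '<number>yrs':

```matlab
% Stand-in for the N-by-3 cell array built by the loop above
data = {'femal', '32yrs', 'normal'; ...
        'male',  '40yrs', 'PTB in the right upper field'};

% Name the three columns (names chosen here for illustration)
T = cell2table(data, 'VariableNames', {'Gender','Age','Diagnosis'});

% Optional: numeric age, assuming every entry looks like '<number>yrs';
% entries that do not fit the pattern become NaN
T.AgeYears = str2double(erase(T.Age, 'yrs'));
```

Rows where AgeYears is NaN point to files whose first line deviates from the usual layout.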
