How would I 'trim-the-fat' off of individual text files that are part of a loop?

Hello,
I'm working on a script that's going to read single-column, no-header .txt files. In a perfect world each file would be an exact multiple of 36,000,000 lines of data; however, the data gets stored with an additional 1 to 5,000,000 lines. I do not need this extra data.
What I'm currently using is a file splitter on the Linux command line that splits the data into 36,000,000-line chunks and removes anything shorter than that. Here's what that looks like:
clear
echo "Hello Human. Please enter the date of the data to be analyzed [mmddyyyy]"
echo
read DataAnalysis
echo
echo "Would you like to analyze DEF, LFM, or SUM?"
echo
read DataType
echo
echo Thank you Human, please wait.......
echo
cd $DataAnalysis
split -d -l 36000000 *Live*$DataType* x0
split -d -l 36000000 *Dead*$DataType* x1
#Below, this removes anything with a length less than the bin time. This removes excess data
find . -name 'x*' | xargs -i bash -c 'if [ $(wc -l {}|cut -d" " -f1) -lt 36000000 ] ; then rm -f {}; fi'
mkdir Chopped
mv -S .txt x0* Chopped
mv -S .txt x1* Chopped
#Below, this turns all files into .txt files by adding the .txt suffix
find . -name 'x*' -print0 | xargs -0 -i% mv % %.txt
echo
echo
echo "*****Data Chop Complete Human*****"
echo
echo
Now this script is dependent on there being a single "LIVE" file and a single "DEAD" file, which isn't always going to be the case. I'm going to have multiple files with arbitrary names that need to be analyzed and concatenated in a specific order. What I currently have for file selection in MATLAB is the following:
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\Data\*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','LabData');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
When I load the file produced by this script, how would I write a script to scan every file and ignore the excess data points beyond the last complete 36,000,000-line block?
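Something along these lines is roughly what I have in mind (an untested sketch; it reuses the FileNames/PathNames variables from the selection script above and assumes each file should be truncated to whole 36,000,000-line blocks):
N = 36e6;                                             % lines per complete block
if ischar(FileNames), FileNames = {FileNames}; end    % uigetfile returns a char for a single selection
for k = 1:numel(FileNames)
    fid = fopen(fullfile(PathNames, FileNames{k}), 'r');
    data = fscanf(fid, '%f');                         % single-column numeric data
    fclose(fid);
    data = data(1:N*floor(numel(data)/N));            % drop the partial tail
    % ...analysis/concatenation in the required order goes here...
end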
  1 Comment
dpb on 27 Sep 2019
Edited: dpb on 27 Sep 2019
Have pointed this out numerous times but will try yet again... use fullfile() to build file names from the pieces-parts instead of string concatenation operations, and you won't have to mess with what the file separator character is; MATLAB will take care of it automagically at runtime.
A corollary of the above is not to store system-specific separator characters in the saved base names, but to build the full names at runtime from the name strings alone using fullfile, so they'll also match the OS you're running on.
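For instance, applied to the snippet above (just a sketch, reusing the PathNames and Filenamesave variables from the question):
% fullfile() inserts the correct file separator for whatever OS the code runs on
Filenamesave = fullfile(PathNames, [Filenamesave '.mat']);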


Accepted Answer

dpb on 27 Sep 2019
Edited: dpb on 28 Sep 2019
" I'm going to have multiple files with arbitrary names, that need to be analyzed and concat[e]nated "
Presuming this is related to the previous topic, I'd (yet again) suggest it's probably not necessary (or even desirable) to generate all the arbitrary intermediate files...
N=yourbignumber;
fid=fopen('yourreallyreallybigfile.txt','r');
while ~feof(fid)
  [data,nread]=fscanf(fid,'%f',N);
  if nread==N
    % whatever you want to do with the full section results goes here
  else
    % anything you want to do with the short section results goes here
  end
end
fid=fclose(fid);
Inside that full-section clause can go the other loop we just went through that uses the second magic number of 400K records to process.
  2 Comments
EL on 29 Sep 2019
This makes sense. So each file that will have a tail end of data I don't need will simply be ignored, if I'm understanding this correctly?
The files being loaded aren't split files. These are the raw data files that are separate because data acquisition had to stop due to a mandatory change in conditions. I always have live and dead data, and sometimes I have to stop data acquisition to adjust the instrument or conditions. Each time I stop data acquisition, new files are generated. It's just how our software works.
dpb on 30 Sep 2019
"each file which will ahve a tail end of data I don't need will simply be ignored"
Depends. The above will read up to N records: there could be fewer records in the file, there could be an error inside the file, or there could be N or more records but an out-of-memory problem when reading the full N.
In the above, you'll read however many sets there are in the file before the loop quits, but you'll know how many records were read each time and can take action accordingly.
If you have only a fixed number of total records that are wanted (some multiple of N), then you would need a counter to keep track of how many sets you've read and break when that's done.
In the other thread, it was presumed N is the total number of records wanted, and there's no need in that case for the while loop. This would be how to read the N=400K blocks if you don't read the whole wanted set in one go.
How you do this is up to you in the end; I was just trying to get you past the original postings of some time ago that broke the big file up into a zillion little ones.
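Roughly, the counter idea would look something like this (untested sketch; nWanted and the file name are placeholders):
N=yourbignumber;                       % records per set
nWanted=yourtotalsets;                 % placeholder: how many full sets you want in total
nRead=0;
fid=fopen('yourreallyreallybigfile.txt','r');
while ~feof(fid) && nRead<nWanted
  [data,nread]=fscanf(fid,'%f',N);
  if nread==N
    nRead=nRead+1;
    % process the full set here
  end
end
fid=fclose(fid);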


More Answers (1)

Guillaume on 26 Sep 2019
Edited: Guillaume on 26 Sep 2019
If I understood correctly:
opt = detectImportOptions(yourtextfile);
opt.DataLines = [1 36e6]; %only read the first 36000000 lines if there are more
data = readtable(yourtextfile, opt); % in R2019a or later, use readmatrix instead if you want a plain matrix
If the files are guaranteed to have at least 36,000,000 lines then this would work as well:
data = csvread(yourtextfile, 0, 0, [0 0 36e6-1 0]); % range form is [R1 C1 R2 C2]
but it will error if there are fewer than 36,000,000 lines, unlike the first option, which will read whatever is there.
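If you need to do this for every file picked in the uigetfile step, a loop along these lines should work (untested sketch; it assumes FileNames is the cell array and PathNames the folder from the question's selection script):
alldata = cell(1, numel(FileNames));
for k = 1:numel(FileNames)
    thisfile = fullfile(PathNames, FileNames{k});
    opt = detectImportOptions(thisfile);
    opt.DataLines = [1 36e6];          % read at most the first 36,000,000 lines
    alldata{k} = readtable(thisfile, opt);
end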
  1 Comment
dpb on 27 Sep 2019
Edited: dpb on 27 Sep 2019
One can always put the read in a try...catch block to handle the short file section case.
N=yourbignumber;
fid=fopen(yourtextfile,'r');
try
  data=fscanf(fid,'%f',N);
catch ME
  % anything you want to do with the short section results goes here
end
fid=fclose(fid);
The above also will not error no matter the file size (well, it might, but you've anticipated it and have a way to handle it gracefully).
The other advantage of this approach is that you get a direct 1D double array; the readtable option above returns the data in a MATLAB table object which, for just one variable, doesn't have much benefit.


Release

R2018a
