Streamline for loops and indexing of large files

I am wondering if anyone has any advice on how I could streamline a script I am working on.
The purpose of the script is to subset a large file (e.g. 7000×72000) according to time values which occur in the first column of the large file and the first column of a second file of interest. So I have two datasheets where some of the rows match in time, and I want to pull out the rows which match. I have 10 of these very large files and only one or two of them will contain times that match, so the first task is to find out which of the very large files corresponds to the file of interest before loading it in.
folder = 'Y:\AIS data\CPA_FILT\CPA_ACOU\'; %folder containing files of interest
CPA_ACOU_files = dir(fullfile(folder,'*.csv')); %list each file in folder
count = 1; %index for output
for j = 1:length(CPA_ACOU_files) %for each file
    %read in 'interesting' files
    CPA = readtable(fullfile(folder,CPA_ACOU_files(j).name));
    CPA_t = datetime(CPA.Var1,'ConvertFrom','datenum'); %convert time to datetime
    CPA_times = timeofday(CPA_t(:,1)); %time-of-day component
    CPA_datesonly = CPA_t(:,1)-CPA_times; %dates only of time vector
    %read in times of large datasheets (PSD) and find any that match interesting files
    tvec_folder = 'Y:\AIS data\CPA_FILT\TVEC\';
    tvec_files = dir(fullfile(tvec_folder,'*.csv'));
    %for each tvec, do ismember, and if any times match interesting times, store name of tvec file
    %(shortcut to load PSD so we don't have to do it manually)
    for a = 1:length(tvec_files)
        tvec{a} = readtable(fullfile(tvec_folder,tvec_files(a).name)); %read in tvec
        tvec{a} = datetime(tvec{a}.Var1,'ConvertFrom','datenum'); %convert times to datetime
        tvec_times = timeofday(tvec{a});
        tvec_datesonly = tvec{a}-tvec_times;
        matches = ismember(CPA_datesonly,tvec_datesonly);
        if any(matches)
            matchingfiles(a,:) = string(tvec_files(a).name);
        end
    end
    matchingfiles = rmmissing(matchingfiles); %tvec files with times which match interesting file times
    for c = 1:length(matchingfiles) %read in corresponding PSD file
        %remove TVEC from start of filename to get filename which matches PSD file
        filename_parts = strsplit(matchingfiles(c),'_');
        filename_parts(:,1) = [];
        filename = strjoin(filename_parts,'_');
        %read in PSD file
        PSDfolder = 'Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\';
        PSDfiles{c} = readtable(fullfile(PSDfolder,filename));
    end
    %for each PSD, subset by rows which match interesting files
    %(if they occur over multiple PSD files, they are concatenated into one output file)
    for i = 1:length(PSDfiles)
        PSD_t = datetime(PSDfiles{1,i}.Var1,'ConvertFrom','datenum');
        PSD_t(:,2) = PSD_t+minutes(2); %two-minute window for filtering PSD files
        [rows,cols] = size(CPA);
        for k = 1:rows %for each hydrophone recording with CPA
            acou_CPA_file = CPA.Var1(k); %get time of file
            acou_CPA_file = datetime(acou_CPA_file,'ConvertFrom','datenum');
            for m = 1:length(PSD_t) %check during which PSD file it occurs
                starttime = PSD_t(m,1); %two-minute period to check between
                endtime = PSD_t(m,2);
                if (acou_CPA_file>=starttime) && (acou_CPA_file<=endtime)
                    CPA_PSD(count,:) = PSDfiles{1,i}{m,:}; %store PSD data in output
                    count = count+1;
                end
            end
        end
        CPA_PSD_t = datetime(CPA_PSD(:,1),'ConvertFrom','datenum'); %check times
        %outfolder=('Y:\AIS data\CPA_FILT\CPA_PSD\');
        %writetable(CPA_PSD,strcat(outfolder,'acouCPA_',PSDfile(i).name));
    end
end
  2 Comments
Mathieu NOE on 12 Oct 2020
Hello,
This is my two cents' suggestion: instead of loading the very large files into MATLAB, I'd try to find a pre-processing method that copies only the time vector from each data file into a new data_time file, and then run the MATLAB code on those much smaller files. Once you know the files and time indices of interest, you can do the post-processing of the data in MATLAB.
If you only have to do this occasionally, you can try a powerful text editor like TextPad (https://www.textpad.com/home); otherwise a batch/exe file might do the trick.
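A MATLAB-side sketch of that pre-processing idea, in case an external tool is inconvenient: detectImportOptions plus SelectedVariableNames lets readtable pull in only the first (time) column of each large file. The folder paths are taken from the question; the TVEC_ output naming is an assumption based on the filenames used later in the script.
PSDfolder = 'Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\'; %large PSD files (path from the question)
tvec_folder = 'Y:\AIS data\CPA_FILT\TVEC\'; %destination for the small time-only files (assumed)
PSDfiles = dir(fullfile(PSDfolder,'*.csv'));
for n = 1:length(PSDfiles)
    opts = detectImportOptions(fullfile(PSDfolder,PSDfiles(n).name));
    opts.SelectedVariableNames = opts.VariableNames(1); %import only the first (time) column
    t = readtable(fullfile(PSDfolder,PSDfiles(n).name),opts);
    writetable(t,fullfile(tvec_folder,['TVEC_' PSDfiles(n).name])); %TVEC_ prefix is a hypothetical naming convention
end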
Louise Wilson on 12 Oct 2020
Hi Mathieu, thank you! I have actually already saved and stored the time vector from each .csv into a separate data_time file for each of the large files. But, how would I then access each row of interest without having to load in the entire file?
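One way to read only part of a file, rather than the whole thing, is to restrict the import options before calling readtable: SelectedVariableNames limits the columns and DataLines limits the row range. A minimal sketch, with a hypothetical file name and row range:
opts = detectImportOptions(fullfile(PSDfolder,'examplePSD.csv')); %hypothetical file name
opts.DataLines = [500 600]; %read only lines 500-600 of the file
PSDchunk = readtable(fullfile(PSDfolder,'examplePSD.csv'),opts);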


Accepted Answer

Seth Furman on 15 Oct 2020
1) Consider using innerjoin or outerjoin to find the rows with matching row times between two tables.
2) Consider using tall tables for data that are too big to fit into memory.
For example,
>> tt1 = table2timetable(tall(tabularTextDatastore('tt1.csv')))

tt1 =

  M×1 tall timetable

       Time        Data
    ___________    ____

    15-Oct-2020     1
    16-Oct-2020     2
    17-Oct-2020     3
    18-Oct-2020     4
    19-Oct-2020     5
    20-Oct-2020     6
    21-Oct-2020     7
    22-Oct-2020     8
        :           :
        :           :

>> tt2 = table2timetable(tall(tabularTextDatastore('tt2.csv')))

tt2 =

  4×1 tall timetable

       Time        Data
    ___________    ____

    16-Oct-2020     11
    20-Oct-2020     12
    22-Oct-2020     13
    25-Oct-2020     14

>> tJoin = innerjoin(tt1,tt2,'Keys','Time')

tJoin =

  M×2 tall timetable

    Time    Data_tt1    Data_tt2
    ____    ________    ________

     ?         ?           ?
     ?         ?           ?
     ?         ?           ?
     :         :           :
     :         :           :

Preview deferred. Learn more.

>> tJoin = gather(tJoin)

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 7: Completed in 0.31 sec
- Pass 2 of 7: Completed in 0.17 sec
- Pass 3 of 7: Completed in 0.39 sec
- Pass 4 of 7: Completed in 0.47 sec
- Pass 5 of 7: Completed in 0.32 sec
- Pass 6 of 7: Completed in 0.49 sec
- Pass 7 of 7: Completed in 0.71 sec
Evaluation completed in 3.4 sec

tJoin =

  3×2 timetable

       Time        Data_tt1    Data_tt2
    ___________    ________    ________

    16-Oct-2020       2           11
    20-Oct-2020       6           12
    22-Oct-2020       8           13
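As a rough in-memory sketch of how the innerjoin idea could map onto the files in the question (the file names here are hypothetical, and Var1 is assumed to be the datenum column in both files):
CPA = readtable('Y:\AIS data\CPA_FILT\CPA_ACOU\exampleCPA.csv'); %hypothetical file name
PSD = readtable('Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\examplePSD.csv'); %hypothetical file name
CPA.Time = datetime(CPA.Var1,'ConvertFrom','datenum');
PSD.Time = datetime(PSD.Var1,'ConvertFrom','datenum');
CPA_PSD = innerjoin(PSD,CPA(:,{'Time'}),'Keys','Time'); %PSD rows whose times appear in the CPA file
Note that an exact-key join only finds rows whose times match exactly; if the CPA and PSD times only agree to within the two-minute window used in the original script, a windowed comparison like the one in the question is still needed.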

More Answers (0)

Release: R2020a