Streamline for loops and indexing of large files

I am wondering if anyone has any advice on how I could streamline a script I am working on.
The purpose of the script is to subset a large file (e.g. 7000×72000) according to time values which occur in the first column of the large file and the first column of a second file of interest. So I have two datasheets where some of the rows match in time, and I want to pull out the rows which match. I have 10 of these very large files and only one or two of them will contain times that match, so the first task is to find out which of the very large files corresponds to the file of interest before loading it in.
folder = 'Y:\AIS data\CPA_FILT\CPA_ACOU\'; %folder containing files of interest
CPA_ACOU_files = dir(fullfile(folder,'*.csv')); %list each file in folder
count = 1; %index for output
for j = 1:length(CPA_ACOU_files) %for each file
    %read in 'interesting' files
    CPA = readtable(fullfile(folder,CPA_ACOU_files(j).name));
    CPA_t = datetime(CPA.Var1,'ConvertFrom','datenum'); %convert time to datetime
    CPA_times = timeofday(CPA_t(:,1)); %time-of-day component
    CPA_datesonly = CPA_t(:,1)-CPA_times; %dates only of time vector
    %read in times of large datasheets (PSD) and find any that match interesting files
    tvec_folder = 'Y:\AIS data\CPA_FILT\TVEC\';
    tvec_files = dir(fullfile(tvec_folder,'*.csv'));
    %for each tvec, do ismember, and if any times match interesting times, store name of tvec file
    %(shortcut to load PSD so we don't have to do it manually)
    for a = 1:length(tvec_files)
        tvec{a} = readtable(fullfile(tvec_folder,tvec_files(a).name)); %read in tvec
        tvec{a} = datetime(tvec{a}.Var1,'ConvertFrom','datenum'); %convert times to datetime
        tvec_times = timeofday(tvec{a});
        tvec_datesonly = tvec{a}-tvec_times;
        matches = ismember(CPA_datesonly,tvec_datesonly);
        if any(matches)
            matchingfiles(a,:) = string(tvec_files(a).name);
        end
    end
    matchingfiles = rmmissing(matchingfiles); %tvec files with times which match interesting file times
    for c = 1:length(matchingfiles) %read in corresponding PSD file
        %remove TVEC from start of filename to get filename which matches PSD file
        filename_parts = strsplit(matchingfiles(c),'_');
        filename_parts(:,1) = [];
        filename = strjoin(filename_parts,'_');
        %read in PSD file
        PSDfolder = 'Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\';
        PSDfiles{c} = readtable(fullfile(PSDfolder,filename));
    end
    %for each PSD, subset by rows which match interesting files
    %(if they occur over multiple PSD files, they are concatenated into one output file)
    for i = 1:length(PSDfiles)
        PSD_t = datetime(PSDfiles{1,i}.Var1,'ConvertFrom','datenum');
        PSD_t(:,2) = PSD_t+minutes(2); %two-minute window for filtering PSD files
        [rows,cols] = size(CPA);
        for k = 1:rows %for each hydrophone recording with CPA
            acou_CPA_file = CPA.Var1(k); %get time of file
            acou_CPA_file = datetime(acou_CPA_file,'ConvertFrom','datenum');
            for m = 1:length(PSD_t) %check during which PSD file it occurs
                starttime = PSD_t(m,1); %two-minute period to check between
                endtime = PSD_t(m,2);
                if (acou_CPA_file>=starttime) && (acou_CPA_file<=endtime)
                    CPA_PSD(count,:) = PSDfiles{1,i}{m,:}; %store PSD data in output
                    count = count+1;
                end
            end
        end
        CPA_PSD_t = datetime(CPA_PSD(:,1),'ConvertFrom','datenum'); %check times
        %outfolder=('Y:\AIS data\CPA_FILT\CPA_PSD\');
        %writetable(CPA_PSD,strcat(outfolder,'acouCPA_',PSDfile(i).name));
    end
end
  2 Comments
Mathieu NOE on 12 Oct 2020
Hello,
This is my two cents' suggestion: instead of loading the very large files into MATLAB, I'd try to find a pre-processing method that copies only the time vector from each data file into a new data_time file, and then run the MATLAB code on those much smaller files. Once you know the files and time indices of interest, you can do the post-processing of the data in MATLAB.
If you only have to do this occasionally, you can try a powerful text editor like TextPad (https://www.textpad.com/home); otherwise a batch/exe file might do the trick.
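A MATLAB-side sketch of that pre-processing idea, in case an external tool is inconvenient: detectImportOptions plus SelectedVariableNames lets readtable pull in only the first (time) column of each large file. The folder paths are taken from the question; the TVEC_ output naming is an assumption based on the filenames used later in the script.
PSDfolder = 'Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\'; %large PSD files (path from the question)
tvec_folder = 'Y:\AIS data\CPA_FILT\TVEC\'; %destination for the small time-only files (assumed)
PSDfiles = dir(fullfile(PSDfolder,'*.csv'));
for n = 1:length(PSDfiles)
    opts = detectImportOptions(fullfile(PSDfolder,PSDfiles(n).name));
    opts.SelectedVariableNames = opts.VariableNames(1); %import only the first (time) column
    t = readtable(fullfile(PSDfolder,PSDfiles(n).name),opts);
    writetable(t,fullfile(tvec_folder,['TVEC_' PSDfiles(n).name])); %TVEC_ prefix is a hypothetical naming convention
end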
Louise Wilson on 12 Oct 2020
Hi Mathieu, thank you! I have actually already saved and stored the time vector from each .csv into a separate data_time file for each of the large files. But, how would I then access each row of interest without having to load in the entire file?
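One way to read only part of a file, rather than the whole thing, is to restrict the import options before calling readtable: SelectedVariableNames limits the columns and DataLines limits the row range. A minimal sketch, with a hypothetical file name and row range:
opts = detectImportOptions(fullfile(PSDfolder,'examplePSD.csv')); %hypothetical file name
opts.DataLines = [500 600]; %read only lines 500-600 of the file
PSDchunk = readtable(fullfile(PSDfolder,'examplePSD.csv'),opts);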


Accepted Answer

Seth Furman on 15 Oct 2020
1) Consider using innerjoin or outerjoin to find the rows with matching row times between two tables.
2) Consider using tall tables for data that are too big to fit into memory.
For example,
>> tt1 = table2timetable(tall(tabularTextDatastore('tt1.csv')))

tt1 =

  M×1 tall timetable

       Time        Data
    ___________    ____

    15-Oct-2020     1
    16-Oct-2020     2
    17-Oct-2020     3
    18-Oct-2020     4
    19-Oct-2020     5
    20-Oct-2020     6
    21-Oct-2020     7
    22-Oct-2020     8
        :           :
        :           :

>> tt2 = table2timetable(tall(tabularTextDatastore('tt2.csv')))

tt2 =

  4×1 tall timetable

       Time        Data
    ___________    ____

    16-Oct-2020     11
    20-Oct-2020     12
    22-Oct-2020     13
    25-Oct-2020     14

>> tJoin = innerjoin(tt1,tt2,'Keys','Time')

tJoin =

  M×2 tall timetable

    Time    Data_tt1    Data_tt2
    ____    ________    ________

     ?         ?           ?
     ?         ?           ?
     ?         ?           ?
     :         :           :
     :         :           :

Preview deferred. Learn more.

>> tJoin = gather(tJoin)

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 7: Completed in 0.31 sec
- Pass 2 of 7: Completed in 0.17 sec
- Pass 3 of 7: Completed in 0.39 sec
- Pass 4 of 7: Completed in 0.47 sec
- Pass 5 of 7: Completed in 0.32 sec
- Pass 6 of 7: Completed in 0.49 sec
- Pass 7 of 7: Completed in 0.71 sec
Evaluation completed in 3.4 sec

tJoin =

  3×2 timetable

       Time        Data_tt1    Data_tt2
    ___________    ________    ________

    16-Oct-2020       2           11
    20-Oct-2020       6           12
    22-Oct-2020       8           13
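As a rough in-memory sketch of how the innerjoin idea could map onto the files in the question (the file names here are hypothetical, and Var1 is assumed to be the datenum column in both files):
CPA = readtable('Y:\AIS data\CPA_FILT\CPA_ACOU\exampleCPA.csv'); %hypothetical file name
PSD = readtable('Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\examplePSD.csv'); %hypothetical file name
CPA.Time = datetime(CPA.Var1,'ConvertFrom','datenum');
PSD.Time = datetime(PSD.Var1,'ConvertFrom','datenum');
CPA_PSD = innerjoin(PSD,CPA(:,{'Time'}),'Keys','Time'); %PSD rows whose times appear in the CPA file
Note that an exact-key join only finds rows whose times match exactly; if the CPA and PSD times only agree to within the two-minute window used in the original script, a windowed comparison like the one in the question is still needed.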

More Answers (0)

Release: R2020a