What exactly is going on with this code? I'm having difficulty understanding it

Hello,
I'm struggling to understand the code below. I'll give my best interpretation of what's going on, and I hope that I'm right.
for FilesToRead = 1 : length(FileNames)
    y = load(fullfile(Directory, FileNames{FilesToRead}));
    y = {y};
    n = 400000;
    for i=1:length(y)
        x{i}=(1:length(y{i}))';
    end
    for ii=1:length(y)
        for k=1:length(y{ii})/n
            coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
            yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
        end
    end
    for ii=1:length(y)
        dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
        dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
    end
    for i=1:length(dd2)
        t{i}=zeros(1,length(dd2{i}));
    end
    t{1}=1:length(dd2{1});
    for i=2:length(dd2)
        t{i}= t{i-1}(end)+[1:length(dd2{i})];
    end
    clear coeffs dd i ii k prompt ws yfit
    yRaw=y; xRaw=x;
    x=t; y=dd2;
    clear dd2 t
    % Window for variance - 5 seconds = 100,000
    n = 100000;
    for i=1:length(y)
        for k=1:length(y{i})/n
            a(k)=var(y{i}(n*(k-1)+1:k*n));
        end
        v{i}=a;
        a=[];
    end
    t{1}=1:length(v{1});
    for i=2:length(y)
        t{i}=t{i-1}(end)+[1:length(v{i})];
    end
    for i=1:length(t)
        tm{i} = t{i}.*((n/20000)/60); % time in minutes
    end
end
So essentially, every 400,000 data points, a 1st degree polynomial is fitted to those 400,000 points. Then the data is zeroed, or flattened: the slope of the fitted polynomial is removed, with each data point having the fitted value at its location subtracted from it.
This is done to the entire data set. Those new values are then stored, and variance data is gathered from every 100,000 data points.
Is this the general gist?
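If my reading is right, the pipeline would look roughly like this Python sketch (not the original MATLAB; window sizes shrunk to toy values of 8 and 4 instead of 400,000 and 100,000, and the helper names `linfit`, `detrend_windowed`, and `windowed_variance` are my own):

```python
from statistics import variance

def linfit(xs, ys):
    # Least-squares slope and intercept, like MATLAB's polyfit(x, y, 1).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def detrend_windowed(y, win):
    # Fit a line to each complete window of `win` samples and subtract it
    # ("flattening"); a leftover tail shorter than `win` is dropped, matching
    # the original loop bound k = 1 : length(y)/n.
    flat = []
    for start in range(0, len(y) - len(y) % win, win):
        xs = list(range(start, start + win))
        ys = y[start:start + win]
        slope, intercept = linfit(xs, ys)
        flat.extend(v - (slope * x + intercept) for x, v in zip(xs, ys))
    return flat

def windowed_variance(y, win):
    # Sample variance of each complete `win`-sample window (MATLAB's var).
    return [variance(y[s:s + win]) for s in range(0, len(y) - len(y) % win, win)]
```

On a purely linear ramp the flattened data is all (near) zeros, so every variance window comes out zero -- i.e. the pipeline measures fluctuation about the local trend, not the trend itself.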

9 Comments

I added a missing end as the last line and fixed the indentation.
It's tough to analyse the code without running it.
I thought I went through this at some length a week or two ago???
It wasn't clear why it was written as it was--it seemed a very roundabout way to do it, in all likelihood.
I didn't dig through the entire morass to be absolutely certain regarding the minute details as there wasn't any data supplied to check it out with easily. As Per says, it would be easier given the convoluted way it was written to actually test what it's doing with some known data...the size is immaterial in figuring out the algorithm--400 points are just as good as 400,000.
I'm struggling with understanding the code below
Best is to go back to the original author, complain loudly that there is no comment in the code and ask them to explain it.
I gave up after reading this:
y = {y};
%...
for i = 1:length(y)
y is purposefully created as a 1 element cell array, hence length(y) is guaranteed to be 1 and the for is pointless. So not only is the code not documented but it's most likely wrong.
Ugh. Nasty miserable code, written by someone who forgot (more likely, never learned) that comments are legal. In fact, comment lines are free!
Personally, I'd decide that whatever it does, you would be better off writing code from scratch. Certainly safer, as there is no assurance that you could ever debug this mess in reasonable time if there was a problem. And how would you know if there was a problem anyway?
@G -- From the previous time I looked into the whole mess for OP I'm pretty sure I determined the above is a do-nothing loop that has no bearing on anything else later on.
@J -- OP did share there was one comment at the beginning of the code -- that it, basically, was thrown together -- but not one that actually described the computations. :)
I'm with John, this code should be discarded and new, documented code should be written. There are too many issues with it. e.g.:
  • it assumes that the input length is a multiple of 400,000. If it's not, it will error.
  • It uses some very strange and pointless constructs such as
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
which is simply:
dd{ii} = yfit{ii}(:);
  • it dedicates a whole loop to just preallocating arrays, but the next loop entirely discards the preallocated arrays and replaces them with new ones.
  • all of the loops are a complete waste of time since all of them iterate over just one value.
  • probably more...
I cannot rewrite the code without understanding what, overall, is trying to be done.
I have three 400,000-length files ready if you'd like to test with those?
Here's the new code I'm working with to test things out:
clc
clear
%% This section is to load the correct folder to gather data from
% ***Below is the same code for Linux!
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
%% This is to select the save file name
prompt = 'Enter the name for the processed files according to mmddyyyy_bug_media_oC ';
Filenamesave = input(prompt, 's');
%% This is to select the signal type. This should have already been done with datachop =]
prompt = 'Enter the Signal Type DEF, LFM or SUM ';
signal = input(prompt, 's');
%% Below are some code alternatives that were previously tried. Maybe I should delete these......
% FileNames=dir(Directory);
%FileNames=dir(fullfile(Directory, '*.txt'))
% FileNames=dir(Directory);
% FileNames(1:2) = [];
% FileNames = {FileNames.name};
%% This selects all .txt files in the folder. If the linux script DataChop is used, then you're good!
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
disp(' ');
disp(' ');
disp('File structure generated');
disp(' ');
disp(' ');
%% This is the run inputs
% signal = 'DEF';
% epflplot = 'N';
% rawplot = 'N';
% fftplot = 'N';
% ft = 20000;
% bt = 30;
%% This is the loop to process the data. First, a 1st degree
% polynomial is fitted to the data, with the data points being
% zeroed according to that slope. Once zeroed, variance is gathered.
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
y = {y};
n = 400000;
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance - 5 seconds = 100,000
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
for i=1:length(t)
tm{i} = t{i}.*((n/20000)/60); % time in minutes
end
%% Saving Data
numstr=num2str(FilesToRead);
PathSave=strcat(Directory, Filenamesave,'_',signal,'_', numstr, '.mat');
save(strcat(PathSave), 't', 'tm', 'v','xRaw', 'x', 'yRaw','y','n','FileNames');
%% This is a counter, giving the display an output of the number of files processed.
denom=num2str(length(FileNames));
numer=num2str(FilesToRead);
fraction=strcat(numer,'/',denom);
counter=strcat(fraction, ' Files Processed');
disp(counter);
%% Writing variance data to ASCII File
% dlmwrite(filename, v, delimiter, row, col)
% Data is appended each time. Data is added by row.
%
VarianceASCIIFilename=strcat(Directory2, 'VarianceData', '_', DataDate, '_', signal, '.txt');
dlmwrite(VarianceASCIIFilename, v{1,1}, 'delimiter', '\t', '-append');
%% Appending Data to .mat file
% VarianceMATFilename=strcat(Directory2, 'VarianceData', '_', DataDate, '_', signal, '_');
% save(VarianceMATFilename, 'v', '-append');
%
%% Hopefully the clear/clc doesn't mess anything up!!
% clear v t tm xRaw x yRaw y n
end
% disp(' ');
% disp(' ');
% disp('All Files Processed');
% disp(' ');
% disp(' ');
% disp('Graph Building Starting');
%% Making Graphs
%Starting with variance. Using data from ascii file. Each row is a new
%'interval' of data. In our case, 30 minutes if 'DataChop' functions
%correctly
%% Box and Whisker Plots, 30 min bin time
%Load individual variables with the following format
% load('FilePath/Name', 'variable');
% For plotting the data, do....
% boxplot('variable'{1,1});
% For loading your .txt file; below is an example 09232019
%
%% Various Variables from previous script
% Variables that were not saved with this update. Might be needed when
% making graphs
% -v7.3
% Names
%PathNames
%ft
This makes the same outputs.
If you set the directory to the three files linked, you'll process them in about 30 seconds.
...
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
...
We talked about this issue previously I think, too...???
Use fullfile() to build qualified filenames without specifying the system-specific delimiter character; it will automagically recognize which system it is running on and do the right thing for you.
%% This selects all .txt files in the folder.
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
...
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
...
is just
rootfolder = 'Z:Data';
Chopped = 'Chopped';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory=fullfile(rootfolder,DataDate,Chopped);
FileNames=dir(fullfile(Directory,'*.txt'));
...
for FilesToRead=1:length(FileNames)
y=load(fullfile(Directory,FileNames(FilesToRead).name));
...
and will include only those with the .txt extension and have the system-dependency handled automatically.
I don't think there's any point in then turning y into a cell array instead of just operating on the resulting vector as native double; that alone should speed things up noticeably as well as reduce memory footprint.
Then, as G has noted (and I did in our previous conversations as well), the loop and initial x are completely superfluous and a waste of time and memory.
Also as I noted before, one could then compute the statistics on N elements of the vector piecewise by reshape() to NxM columns, use the inbuilt MATLAB facility to compute by default over columns for builtin routines like mean and var, or iterate over the M columns for those which haven't been vectorized, such as polyfit. At that point you may find it convenient to store the coefficients in a cell array since there are two per column, but it isn't clear there's any need to actually save them at all rather than just process each column (section) and move on to the next. Doing it this way, whether you save the intermediates or not, would eliminate almost all the convoluted indexing that makes so much of the code so difficult to parse.
Once you have done that, working through the rest of the machinations should be relatively straightforward with the same idea, although I still contend it would be far simpler to use a much smaller dataset initially, with N=400 or somesuch, so you don't get overwhelmed by just the size and can see results quickly and on screen instead of being inundated by the sheer mass of numbers that obfuscates the underlying problem. Once you do that, you can go back and verify your revised code produces the same results for a real data set over that time length.
Again, it would also be wise to recode so that these magic numbers aren't buried in the bowels of the code but are easily set and modifiable so the code could be used generically if conditions do change in future.
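To make the reshape-into-columns idea concrete, here is a small Python analogue (not MATLAB; `columns` is a hypothetical stand-in for reshape(y, n, []), and the window sizes are deliberately named constants rather than buried magic numbers):

```python
from statistics import variance

FLATTEN_WIN = 400  # stand-in for the 400,000 "magic number"; a named, settable parameter
VAR_WIN = 100      # stand-in for the 100,000 variance window

def columns(y, n):
    # Split y into its complete length-n chunks, like reshape(y(1:m*n), n, []).
    # Any incomplete tail is dropped, as the original loop bounds do.
    m = len(y) // n
    return [y[i * n:(i + 1) * n] for i in range(m)]

# Per-column statistics then become a simple map over the chunks:
y = [(i % 7) * 1.0 for i in range(1200)]
per_window_var = [variance(col) for col in columns(y, VAR_WIN)]
```

Each window gets its statistic in one pass, with no n*(k-1)+1 : k*n index arithmetic in sight.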


 Accepted Answer

Removing all the unnecessary cruft, this is probably what the code reduces to. I've gone with the fact that your files are 400,000 elements long to remove the badly coded split into 400,000 elements.
I haven't tested the code so I may have got it wrong. I've kept the meaningless variable names; if I had written it, I would have used much better variable names (seriously, a as a variable name?).
%inputs:
y = load(fullfile(Directory, FileNames{FilesToRead})); %since the file is text, it would be better to use csvread or similar instead of load
x = (1:numel(y))';
coeffs = polyfit(x, y(:), 1);
yfit = polyval(coeffs, x); %much simpler than coding the polynomial yourself
dd2 = y - yfit;
n = 100000;
a = var(reshape(dd2, n, []));
tm = (1:numel(a)) * ((n/20000)/60); %time in minutes, one value per variance window (the original scaled the window index, not the sample index)
I've not kept the t variable that is recreated again and again, as it doesn't do anything useful. If you still need it, it's the same as x.
It would be simple to add the split into 400,000 elements if it's really needed.
edit: as discussed below, I missed that dd2 gets renamed to y, replacing the original y. Seriously, who does that? What was wrong with the original dd2 (other than both being meaningless)?

18 Comments

"I've gone with the fact that your files are 400,000 elements long to remove the badly coded split into 400,000 elements."
Ah! Now I remember...there was a previous thread or part of the other on recoding that asked about splitting a humongous file into 400K record chunks. Apparently, they're actually collecting data at a high sample rate over very long time periods and using the arbitrary 400K segments to detrend.
Seemed to me that the detrended data were saved before but don't seem to be in the above. Eric should double-check on that...otherwise seems no point at all in doing so.
There is, of course, in the Signal Processing TB a detrend function altho it has a fair amount of overhead but could reduce the high-level coding here....
I copied and pasted the loop (window for flattening and variance data) to fit the outputs of a linux script. I didn't change anything except what gets inputted. Here's the original below
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\........\*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','Data');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
clear;
clc;
close all;
%% Retrieve files
tstart=tic
prompt = 'Enter the name of a .mat file to run (e.g. file_mmddyyyy_signal.mat) ';
files = input(prompt,'s');
load(files);
%% Terminate or proceed
if isequal(FileNames{1},0)
disp('User selected Cancel') %display value of variable
else
for i=1:numel(FileNames)
clear ii
c=class(FileNames{i});
if c=='char'
FileNames{i}=cellstr(FileNames{i});
end
for ii=1:numel(FileNames{i})
disp(['User selected ', fullfile(PathNames{i}, FileNames{i}{ii})])
end
end
%% User query to run/save Figure output
%Code for commandline inputs
prompt = 'Enter file name for figures according to: mmddyyyy_bug_media_oC ';
Filenamesave = cellstr(input(prompt,'s'));
prompt = 'Enter signal type for files';
signal = input(prompt,'s');
if isempty(signal)
signal = 'DEF';
end
prompt = 'Make EPFL plots? Y/N [Y]: ';
epflplot = input(prompt,'s');
if isempty(epflplot)
epflplot = 'Y';
end
prompt = 'Make raw signal and variance plots? Y/N [Y]: ';
rawplot = input(prompt,'s');
if isempty(rawplot)
rawplot = 'Y';
end
prompt = 'Make FFT plots? Y/N [Y]: ';
fftplot = input(prompt,'s');
if isempty(fftplot)
fftplot = 'Y';
end
prompt = 'Enter the data collection rate (Hz)[20000]: ';
ft = input(prompt);
if isempty(ft)
ft = 20000;
end
% Binning inputs
prompt = 'Would you like to bin the data? [Y]/N ';
bo = input(prompt);
if isempty(bo)
bo = 'Y';
end
if bo=='Y'
% user inputs the bin time, if no input, script continues
prompt = 'Enter the bin time in minutes ';
bt = input(prompt);
end
%% Import and concatonate data segments
for i=1:length(FileNames)
y{i}=[];
for ii=1:length(FileNames{i})
Path{i}{ii}=fullfile(PathNames{i}, FileNames{i}{ii});
fileID{ii} = fopen(Path{i}{ii});
c=[];
while ~feof(fileID{ii})
C = textscan(fileID{ii},'\t%f',10000);
c=[c,C'];
end
fclose(fileID{ii});
c = c(~cellfun(@isempty, c));
gg=cell2mat(c');
yp{ii}=gg;
test=regexp(FileNames{i}{1}, 'txt','ONCE');
if isempty(test)
Names{i}=FileNames{i}{1};
else
[~,Names{i}]=fileparts(FileNames{i}{1});
end
%waitbar (i/length(FileNames),h);
end
for ii= 1:length(FileNames{i})
y{i}=[y{i};yp{ii}]; % concatonates all y data for each batch
end
end
Names=cellstr(Names);
Names2=strrep(Names,'_',' ');
%close(h);
disp('Files are loaded successfully!')
clear ans c C fileID gg yp h i ii Path question test prompt
% clear ans c C fileID gg h i Path question test prompt
%% Bin the data
% this reallocates y{i}, if not used, script still runs
if isempty(bt)
return
else
bd=bt*60*ft; % number of data points in the bin
for i=1:length(y)
f(i)=floor(length(y{i})/bd);
% yb{1}=y{1}(1:bd);
% N{1}=Names{i}; %replicate name to each bin
% N2{1}=Names2{i}; %replicate name to each bin
% if f>1 % multiple bins allocated or data cut to single bin size
for ii=1:f(i)
bds=(bd*ii)-(bd-1); % bin data start
bde=bds+(bd-1); % bin data end
yb{(sum(f(1:i-1))+ii)}=y{i}(bds:bde); % bin the data
N{(sum(f(1:i-1))+ii)}=Names{i}; %replicate name to each bin
N2{(sum(f(1:i-1))+ii)}=Names2{i}; %replicate name to each bin
end
% end
end
y=yb; % reallocate to y vector
Names=N; % update to reflect new bins
Names2=N2; % update to reflect new bins
clear yb N N2
end
%% Process the data
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
% Window for flattening
n = 400000;
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
%%%%%In between here is 500 lines of graph making. I'm omitting this for now. Afterwards, there's...
%% Rename variables
for i=1:length(t)
tm{i} = t{i}.*((n/ft)/60); % time in minutes
end
%Save variable output.
%v = variance
save(strcat(PathNamesSave{1},Filenamesave{1},'_',signal), 't', 'tm', 'v','xRaw', 'x', 'yRaw', 'y','n', 'ft','Filenamesave','FileNames','PathNames','PathNamesSave','Names','Names2','-v7.3');
disp('Done')
Just to be clear, the variance calculations are not using the data that has been zeroed or flattened? I don't see this. We're saying the variance is gathered from the raw data?
"Just to be clear, the variance calculations are not using the data that has been zeroed or flattened?"
Yes, it is. G missed (and I'd forgotten, and missed again in this read-through) the almost-buried line
y=dd2;
that overwrites the raw data. As he notes, the sorry variable naming convention along with the massive obfuscation makes parsing of what is probably a very simple straightforward computation quite painful.
But, with that, I'd guess odds are quite good G's recast will reproduce the intent of the original with the assumption he made that you did split up the original very long file into the multitude of shorter ones.
I still contend, as we discussed in the previous thread on this same topic, that splitting the files up is also a waste of compute time and memory -- and of your time -- compared to just processing the actual files.
But, little steps first, maybe; see if in fact if you just do what it looks like needs being done above you don't get satisfactory results.
Yep, completely missed the renaming of dd2 to y. Bonkers! How did this code end up being used for some critical processing? Did nobody review it?
Honestly, as we said before, it would be much better to start from scratch. Specify what the code should do rather than reverse-engineering extremely bad code.
I want to rewrite the code, but I do not know exactly what it's intending to do. I started out with hardly a clue. I was thrown in the deep end and told to run this script as part of a function, which ended up not working.
As far as I can tell, a 1st degree polynomial is fitted to a series of data. The difference between this fitted line and the x axis is gathered, with the corresponding fitted value subtracted from each data point, 'leveling' the data and thereby removing any natural drift in the data.
Then the variance is gathered from this new data set.
I intend on rewriting this in R, because I'm failing at MATLAB and tired of being frustrated every day. I was able to write some code in linux very easily and R looks to be a similar format. But I need to reverse engineer what this code is doing so I don't miss anything critical.
Is there something I'm missing regarding the analysis of the data?
"write [some] code in linux very easily "
What does that mean? linux is the OS, not a language. There are command shells with batch commands that run under linux, granted, but there are those for Windows as well. I use the JPSoftware CMD replacement shell as it has a lot more features than does CMD, but it's certainly not mandatory as there's nothing here that can't be done pretty straightforward with ML.
R is different entirely; unless you already know something of it, I don't see learning yet another interpreted language system is the answer to the problem.
As G showed, you could implement what you describe above in about 10 lines of MATLAB code or less.
It's unfortunate your predecessor left such a sorry bag of stuff, but when you strip it back to the basics, there's really very little actually going on in the above code.
I want to rewrite the code, but I do not know exactly what it's intending to do
I find it amazing that the overall goal of some code that is actually in use is not known. Surely, you must know what the ultimate purpose of the whole code chain is. In that case, I contend you'd be better off starting from scratch and forgetting anything that this code does.
If the whole code is just a black box, stuff goes in, magic happens, stuff comes out. How can that be used with confidence that the magic is the right magic?
As dpb pointed out, the linear fit + subtraction is a detrend. So here we have many lines of very obfuscated code just to do a simple one-line operation. Then you calculate the variance in batches.
The first thing to ask is: is this what should be done with the data? Does it even make sense to do that? If not, then there's no point even looking at the code.
So, maybe you should just bin that code, write a completely fresh spec of what some new code should do and ask for help in doing that.
I was told snippets of what exactly is happening; however, no one went into detail because I was told to use what was given to me and write a script using this as a function.
However, that ended up not working, so instead of beating a dead horse, which I've been doing for the last month, I'm trying to rewrite it. Before I can do that though, I want to make sure there isn't something in the script that I'm missing, because I'm pretty stupid when it comes to programming and I'm betting that I'm missing something. I cannot rewrite code that I do not understand the intention of, and I don't feel like wasting my time writing code when I've missed something critical. Now I feel I understand the intention, and yes, it makes sense given how I understand the machine to function.
I did take a 16-hour R course, and it seemed much more intuitive, so I was going to translate the overall goal of what I thought this script was doing into R. I felt like it would be much easier. MATLAB has been nothing but a nightmare so far, and I'm tired of getting stressed out.
For some reason, every time I make progress in MATLAB, the next step is a road block, and I have to backtrack again to find a right answer. The script that I had modified was working, except I couldn't get it to write data in any format. I couldn't figure out how to keep the append option from overwriting old variables, and couldn't figure out how to rename the new variables with a counter I wrote in (hopefully keeping all variables nice and organized, and making them easier to call up in an organized, numerical format). I can write them in ASCII format, however I assumed that was not the 'optimal' way.
Anyway, I'm going to integrate this script with a loop I previously made so files are read in 400,000-point segments. I'll probably end up posting that when I inevitably get stuck again.
I would submit R would only seem more intuitive because you've been given a piece of trash and have been trying to use it instead of just rewriting it from the start. The obfuscation inside this piece of code looks almost to be intentional -- maybe there was a case of an outside consultant trying to produce future work for himself or somesuch going on.
I don't see how it could be much more intuitive than
y=load(data)
y=detrend(y)
v=var(y)
tm=[0:numel(y)-1]*dt
which is what the whole thing boils down to ignoring minor bookkeeping details.
I'll grant the original code doesn't look like that superficially, but it definitely isn't MATLAB that caused that but either a truly incompetent coder or done deliberately.
It would seem to be a reasonable thing to make your employer aware of the two side by side and to have an internal design review to ensure the specifications for the program are formalized going forward.
"couldn't figure out how to rename the new variables with a counter "
That is absolutely the wrong approach -- that way will inevitably lead to nightmares and other unmaintainable code.
We would need to see what it was you were trying to accomplish by doing that in order to provide the best option, but that is where perhaps cell array could be useful.
Then again, if it has to do with this idea of breaking up big files into chunks, I'll reiterate yet again that that is an unneeded and wasteful step instead of processing the original file either in pieces of N size or, alternatively, as G has suggested using tall arrays that are able to address data larger than what will fit in memory as if it were in memory from a programming standpoint.
The first alternative could be as simple as the above code outlined including
N=YourBigNumberRecords
tend=0;
fidi=fopen(testinputfile,'r')
fido=fopen(outputfile,'w')
while ~feof(fidi)
y=fscanf(fidi,'%f',N)
y=detrend(y)
v=var(y)
tm=tend+[0:numel(y)-1]*dt
tend=tm(end)+dt
fwrite(fido,y)
fwrite(fido,v)
end
fclose('all')
which would interleave the y,v sections; you could choose some alternative schemes as well of additional files or using structures or whatever, but the above would serve just fine as is.
It's not clear where the plotting is in the scheme--it can either be for each subsection by including in the loop or at the end by reading what it needs from the file--positions can be computed by counting records and fseek could easily retrieve any time position desired as every element is a single double eight bytes.
In my humble ( :) ) opinion you need to step back from the code and decide what is the end objective then implement that as G has said -- continuing to tie the organization to the original abomination is never going to lead to anything good--things will only get worse as continue to try to graft on new features or the like.
ADDENDUM
The Q? we discussed some time back then of setting up a process to do a whole series of files (experiments) is then reduced to simply calling the above function with the list of those files. There's a FEX submission (I forget whose just now) filefun that is the same idea for files as are the builtin arrayfun or cellfun that operate on arrays or cell arrays--give it the list of files and it calls the given function for them in a one-line syntax for you.
This is where I currently am:
%% New Script. Faster. Better. Stronger. Started 25SEPT2019
clear;
clc;
%% For Testing
% load TestingNew;
%% Retrieve Files
tstart=tic
rootDirectory = 'Z:\ColdTech\NMDIII\Data\';
prompt = 'Enter Date of Experiment to be Processed (mmddyyyy) ';
DataDate = input(prompt, 's');
prompt = 'Enter the signal to be processed (DEF, LFM or SUM) ';
signal = input(prompt, 's');
prompt = 'Enter the name for the Output File (mmddyyyy_bug_media_oC) ';
outputfilename = input(prompt, 's');
Directory = strcat(rootDirectory, DataDate, '\');
FileNames = dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, signal)); % Files will be in the Directory, and they must be the signal type and a .txt file. This is enough to discriminate
outputfile = strcat(Directory, outputfilename, '.mat');
%% Data Analysis
% Flattening the Data and Gathering Variance
bt = 36000000;
N =400000; % Window of Flattening, 20 seconds of data
n = 100000; % Window of Variance, 5 seconds of data
tend = 0;
for FilesToRead = 1:length(FileNames);
FileDirectory = strcat(Directory, FileNames);
fidi=fopen(FileDirectory{FilesToRead},'r'); %This should be what varies
fido=fopen(outputfile,'w'); %This should be the same
save_struct = struct('y', [], 'v', [], 'yflat', [], 'filename', []);
save_struct = repmat(save_struct,1,3);
save_struct(FilesToRead).filename = FileDirectory{FilesToRead};
while ~feof(fidi);
ybt=fscanf(fidi,'%f\n',bt);
% if ybt<length(bt); % There is always 'fat' at the end of the data. This removes it. Should happen at the end of every file. Should NOT end the loop
% return % Is this formatted correctly? not sure if return or break is better.....
% end
for y=1:length(ybt)/N;
% for y=fscanf(ybt, '%f\n', N) % Maybe need to add a 1:length(ybt)/N
y=fscanf(ybt, '%f\n', N)
yflat=detrend(y);
v=var(reshape(yflat, n, []));
tm=tend+[0:numel(y)-1]*dt;
tend=tm(end)+dt;
save_struct(FilesToRead).v = v;
save_struct(FilesToRead).y = y;
save_struct(FilesToRead).yflat = yflat;
end
end
end
save('my_filename', 'save_struct');
%% Data Analysis using fwrite
Directory = strcat(rootDirectory, DataDate, '\');
FileNames = dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, signal)); % Files will be in the Directory, and they must be the signal type and a .txt file. This is enough to discriminate
outputfile = strcat(Directory, outputfilename, '.mat');
My files, in an ideal world, would be an exact multiple of 36,000,000. However, I always have excess data at the end, between 1-3 million lines. I tried to add an 'if' statement, so that if a bin doesn't meet that requirement, the loop would return and continue to the next file. That isn't working, and I cannot save data currently. I think it's because fscanf is only supposed to read files, rather than variables.
Rather than continuing on before I get stuck again: do you think I'm on the right track?
Looks much cleaner, indeed! :) I don't have time this instant to really read it in detail, but I'd say you are on the right route, indeed.
Just so we're on the same track when I do get a chance (hopefully later on this evening or in the morning) a couple questions...
Q1: Is this file you're processing now the full experiment file without the partitioning of it into multiple pieces as before? I THINK that's what I'm reading, just to make sure am on same page in thinking through the process.
Q2: Why the variable names bt and ybt? What does "bt" stand for, and why would you submit that it's a good variable name for a length constant? To me, at least, it makes reading the code difficult because it doesn't seem to have any meaning related to what it is. Making the code as human-readable as possible--ideally reading as easily as a novel--is always a goal.
while ~feof(fidi);
ybt=fscanf(fidi,'%f\n',bt);
% if ybt<length(bt); % There is always 'fat' at the end of the data. This removes it. Should happen at the end of every file. Should NOT end the loop
% return % Is this formatted correctly? not sure if return or break is better.....
% end
for y=1:length(ybt)/N;
% for y=fscanf(ybt, '%f\n', N) % Maybe need to add a 1:length(ybt)/N
y=fscanf(ybt, '%f\n', N)
Q3: Can you hold ybt in memory ok on your system and other systems you expect to be able to run this on? You did not check the optional return value of the number of elements read so you don't know whether the read returned the whole wanted number or not. I would expect so, 3.6E6/8/(1024*1024) --> <30 MB on machines with GB of memory these days, but your code needs to verify and handle the case if it fails--if nothing else, inform the user and abort; better have backup logic to process the file in smaller chunks.
Q4: What do you think is the purpose of the second loop using fscanf? You've already got the double array ybt in memory; just operate on it. Oh--I get it...this appears to be the attempt to implement the suggestion to just loop over the files back when were reading 400K individual files. OK, now that you are processing the whole file, really instead of actually reading the full 3.6E6 array at once, if the processing is really always on these chunks of 400K elements at a time (or some other arbitrary set as may evolve in future so should definitely be a variable and not a buried "magic number"), it really makes sense to just read that number each time from the file, operate on it and then go on to the next. Then you keep a secondary running count variable of the total number of elements you've read/processed and quit when that reaches the other, bigger magic number (3.6E6 here, or N1/N2 --> 3.6E6/400K = 9 iterations of the loop).
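The chunked-read pattern described in Q4 can be sketched at toy scale -- a demo file is written first so the snippet is self-contained, and the sizes here are small stand-ins for the real N = 400000 and bt = 36000000:

```matlab
% Toy demo of reading a file N values at a time with a running count,
% instead of pulling the whole bt-element array into memory at once.
fname = tempname;                     % scratch file for the demo
fid = fopen(fname, 'w');
fprintf(fid, '%f\n', 1:23);           % 23 values: deliberately not a multiple of N
fclose(fid);

N = 5; bt = 23; nRead = 0;            % toy chunk size and total
fidi = fopen(fname, 'r');
while nRead < bt
    [chunk, count] = fscanf(fidi, '%f\n', N);
    if count == 0, break; end         % ran out of data early
    % ... process chunk here (detrend, var, etc.) ...
    nRead = nRead + count;            % secondary running count, as suggested
end
fclose(fidi);
```

Checking `count` on each read is what guards against the short last chunk and a file that ends early.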
OK, I do have to run right now, but you definitely are getting towards something that is much, much better and will get you to an end goal pretty soon now, I think.
Oh, "just one more thing!"
Q5: Where is the plotting over what sequence of data in the process? The individual smaller sections or the whole thing or...?? Would help knowing that so ensure get the data organized for it, too.
ADDENDUM:
OBTW, if you can hold the 3.6E6 in memory, it gets even easier because detrend() is also vectorized to operate by columns as many MATLAB-supplied functions are. In that case, you can do the thing as
ybt=fscanf(fidi,'%f\n',bt); % read the 3.6E6 (bt) entries
y=detrend(reshape(ybt,N,[])); % detrend in N-sized blocks by column
y=y(:); % put back into column vector
if you don't need the raw data, then you can do it all in place and dispense with the second variable.
Also, if you read the full bt elements at once, you don't need the while loop at all--from your description once you have read that many observations, you don't care about the rest of the file content so there's no need to do anything at all with it. Just close that file and go on to the next (after completing analysis of the current one, of course).
ADDENDUM SECOND:
" once you have read that many observations, you don't care about the rest of the file content so there's no need to do anything at all with it."
If you really do want to trim the files all to that magic length, then at this point you could just rewrite the file with the ybt vector before closing and that objective would then be done, too. Of course if you do that, that other data will be gone irretrievably so just in case somebody at some future date might want to retrieve more than that, probably better to just go on to the next file and leave the existing one alone. Or, of course, at the expense of extra disk space you create the fixed-length file in parallel with the original with some related name rather than overwrite the original.
OK, here's my first pass through about how I'd structure what I understand. I still need the file-naming pattern to be sure the wildcard construction for dir() is right, but this removes the dependence on the system file separator yours had.
% Retrieve Files
Drive='Z:';
Root=fullfile(Drive,'ColdTech','NMDIII','Data');
prompt = 'Enter Date of Experiment to be Processed (mmddyyyy) ';
DataDate = input(prompt, 's');
prompt = 'Enter the signal to be processed (DEF, LFM or SUM) ';
signal = input(prompt, 's');
prompt = 'Enter the name for the Output File (mmddyyyy_bug_media_oC) ';
outputfilename = input(prompt, 's');
Directory = fullfile(Root, DataDate);
% Files will be in the Directory, and they must be the signal type and a .txt file.
d=dir(fullfile(Directory, ['*' signal '*.txt'])); % fullfile supplies the separator between folder and pattern
outputfile = fullfile(Directory, [outputfilename '.mat']); % Directory already contains Drive and Root
%% Data Analysis
% Flattening the Data and Gathering Variance
bt = 36000000;
N =400000; % Window of Flattening, 20 seconds of data
n = 100000; % Window of Variance, 5 seconds of data
dt = 1/20000; % sample period; 20 kHz rate implied by the original n/20000/60 scaling
tend = 0;
for i=1:numel(d)
fname=fullfile(d(i).folder,d(i).name);
fidi=fopen(fname,'r'); % input file; this is what varies per iteration
if i==1 % preallocate once, sized to the number of files found
save_struct = repmat(struct('y',[],'v',[],'yflat',[],'filename',[]), 1, numel(d));
end
save_struct(i).filename=fname;
y=fscanf(fidi,'%f',bt);
yflat=detrend(reshape(y,N,[])); % detrended signal over N samples
yflat=yflat(:); % reshape back to column vector
v=var(reshape(yflat, n, []));
v=v(:);
tm=tend+[0:numel(y)-1]*dt;
tend=tm(end)+dt;
save_struct(i).v = v;
save_struct(i).y = y;
save_struct(i).yflat = yflat;
fclose(fidi);
end
save('my_filename', 'save_struct');
Needs the output file names updated to be used but basics I think are there.
I shortened some variable names to shorten the code -- long loop indexing variables are more clutter than help -- "i" is the traditional index and since there's no imaginary arithmetic in sight here, there are no issues. You'll want to turn this into a function anyway for the longer term, at which point it will also become local and not matter for the base workspace.
I also just returned the dir() struct in d and pull the file name and folder from it.
AFAICT, there is only one file of the type for the experiment so there's no need for that loop I think altho I left it in there in case I don't quite follow.
I just saw the 3rd option for detrend too and realized it would remove the need for a secondary loop. It seems that it requires an array to define where the cutting would happen. In our case, it would be every 400,000, so I was thinking something along the lines of
split=[(((i-1)*N)+1):i*N]
yflat=detrend(reshape(y,N,[split]));
EDIT 1: Now that I look at it, this would require a loop for the split variable to work correctly
EDIT 3: This works better.
split = linspace(1, bt, N)
yflat=detrend(reshape(y,N,[split]));
Also, what's the purpose of tend?
Also, I wasn't hoping to trim the data, I was hoping to have it be ignored. I figured the following would make it go back to the top of the loop, starting with the next iteration
if i~=bt
return
end
This is just to remove data I do not need, and not to have data from one condition falling into the bin time of another condition.
EDIT 2: Also, how does the reshape function help us in this situation? The documentation for detrend doesn't make it look like it's necessary.
"... 3rd option for detrend too and realized it would remove the need for a secondary loop"
No. Well, that would be one way. But then must build a secondary vector of the breakpoint indices and pass it. Since your breakpoints are at 400K, and detrend operates by column anyway, that's what is done via reshape() to let it operate on columns by default (just like var).
No secondary loop is needed. I didn't try timing to compare, but I'd expect that version to outperform the one with specific breakpoints as well as not needing the array for memory savings plus there's no optimization better than that of not computing something that isn't needed.
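A small command-line check of that claim, at the toy scale recommended below -- block-wise detrend via reshape() gives the same residuals as the original per-block polyfit-and-subtract loop (sizes here are made up: 4 blocks of 10 instead of 400K):

```matlab
% Verify: detrend on reshaped columns == per-block 1st-degree fit and subtract.
N = 10;                                            % toy block length
y = (1:40)' + 0.01*sin((1:40))';                   % trending data with some wiggle
yr = reshape(detrend(reshape(y, N, [])), [], 1);   % reshape route: one call
yloop = zeros(size(y));                            % explicit loop route
for k = 1:numel(y)/N
    idx = (k-1)*N+1 : k*N;
    c = polyfit((1:N)', y(idx), 1);                % 1st-degree fit to this block
    yloop(idx) = y(idx) - polyval(c, (1:N)');      % subtract the fitted line
end
% max(abs(yr - yloop)) is within machine precision of zero
```

The residual of a full linear fit doesn't depend on where the x axis starts, which is why detrend's internal 1:N index per column reproduces the original code's continuing-index fits.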
I STRONGLY recommend you build a small test dataset and experiment at the command line so you can see how these things work visually so you get a better feel for what's actually happening. The sheer magnitude of a 400K segment of a 36M array precludes that, but the logic is identically the same for 10 lines out of 40, say, and you can inspect that for correctness visually plus understand what the code is actually doing.
"I figured the following would ... go to top of loop..."
That's totally unnecessary; the full 36M elements have been read and the loop goes on its merry way after computing stuff with them (via the reshape() option outlined above). Only if you can't hold that much in memory and process each 400K in individual reads will you need a loop within the same file. NB: the while ~feof() loop is gone; unneeded, too.
The rest was just hypothetical of what you could do; what the intent was wasn't made totally explicit.
On the last, that's the same question back over again -- yes, you could do it w/o reshape() but it's easier coding using it and since reshape is simply an internal adjustment of indexing of the same data, it's quite low in overhead so is efficient computationally besides. You can do some timing testing if you care to, but I'm confident enough the performance will be at least as good plus you don't have to compute the breakpoints such that I'm not going that route at all. That option is provided in case there were variable breakpoint spacings to be accounted for; not what you have.
You haven't really directly answered any of the Q? I posed nor provided the filenaming pattern sufficiently to be able to check out the modifications there for system independence so still have a few details to either iron out or debug.
In particular, the Q? about whether there really is ever more than one file returned for the given date/type analysis at the beginning is the key one. If this is going to be used interactively, I'd suggest possibly using uigetfile dialog instead of manual prompts requiring typing in as present, but that can wait.
tend also actually isn't needed any longer if one builds the time vector for the whole enchilada--that would again be needed if one looped over the 400K chunks to create the continuous time for each section picking up from the last. There's any number of ways to build that corollary vector and there's a lot to be said if it is always the specific sampling rate to not even bother to do anything except compute it on the fly and not save it at all...other than if one ever does just read sections of a file starting somewhere in the middle. But even then, if it's a constant and never changes, it's just dt*N for whatever N is chosen or N==T/dt to compute the offset given a desired time.
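That on-the-fly offset arithmetic can be written out directly -- the 20 kHz rate below is an assumption inferred from the original code's n/20000/60 scaling:

```matlab
% Compute time offsets from constants; no stored time vector or tend needed.
fs = 20000;             % samples per second (assumed sampling rate)
dt = 1/fs;
N  = 400000;            % 20-second flattening window
k  = 3;                 % some chunk number
t0 = (k-1)*N*dt;        % start time of the k-th N-sample chunk, in seconds
T  = 60;                % a desired time, in seconds
idx = round(T/dt) + 1;  % the sample index corresponding to time T
```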
Q1: Is this file you're processing now the full experiment file without the partitioning of it into multiple pieces as before? I THINK that's what I'm reading, just to make sure am on same page in thinking through the process.
It is the base number of files produced by the machine. There will be a minimum of two, maybe more. It depends on how often I have to stop data acquisition and change conditions.
Q2: Why the variable names bt and ybt? What does "bt" stand for and why would you submit that's a good variable name for a length constant? To me, at least, it makes reading the code difficult because it doesn't seem to have any meaning related to what it is. Making the code human-readable as much as a novel plot as possible is always a goal.
bt was the bin time that the box plots would be made in, 36,000,000, or 30 minutes of data. ybt was a variable I threw in to connect bt and y between different lines. Not really needed anymore though.
Q3: Can you hold ybt in memory ok on your system and other systems you expect to be able to run this on? You did not check the optional return value of the number of elements read so you don't know whether the read returned the whole wanted number or not. I would expect so, 3.6E6/8/(1024*1024) --> <30 MB on machines with GB of memory these days, but your code needs to verify and handle the case if it fails--if nothing else, inform the user and abort; better have backup logic to process the file in smaller chunks.
Not using ybt anymore, so we don’t need it.
Q4: What do you think is the purpose of the second loop using fscanf? You've already got the double array ybt in memory; just operate on it. Oh--I get it...this appears to be the attempt to implement the suggestion to just loop over the files back when were reading 400K individual files. OK, now that you are processing the whole file, really instead of actually reading the full 3.6E6 array at once, if the processing is really always on these chunks of 400K elements at a time (or some other arbitrary set as may evolve in future so should definitely be a variable and not a buried "magic number"), it really makes sense to just read that number each time from the file, operate on it and then go on to the next. Then you keep a secondary running count variable of the total number of elements you've read/processed and quit when that reaches the other, bigger magic number (3.6E6 here, or N1/N2 --> 3.6E6/400K = 9 iterations of the loop).
I don’t recall off the top of my head. It was something I thought of that may have let the script work.
Q5: Where is the plotting over what sequence of data in the process? The individual smaller sections or the whole thing or...?? Would help knowing that so ensure get the data organized for it, too.
All the data analyzed would be plotted together. I intend on having box plots made with 30 minutes of data (36,000,000 divided by 100,000 for variance results in n=360 per box plot).
Then the raw y data over time, so the drift of the machine can be measured.
Then the flattened y data over time, so the average amplitude can be viewed without noise (the drift)
Then the flattened y data would undergo FFT, so the most common amplitudes can be viewed as a control.
Additionally, violin graphs over 30 minute bins would measure which amplitudes were most common in their bin time.
Regarding uigetfile, this script will be run remotely through command line without an interface. The data date and the signal type is enough to differentiate between data. I also intend on having a list populate of the files selected and what order they’ll be processed, so the script can be terminated if the wrong files were loaded.
"Not using ybt anymore, so we don’t need it."
That doesn't answer the Q? underlying -- can you hold the full vector in memory on the systems you're running on? What you use for the variable name is immaterial. If so, then the full file can be operated on at once and there's no need to loop through the file at all. If not, then you need the loop to read segments and process them...but you don't need to (and shouldn't) separate out into individual files to do that.
"data date and the signal type is enough to differentiate between data"
Yeah, you've said that but you're going at it in what appears to be very awkward ways...if you would illustrate the naming pattern used, I'm certain a wildcard pattern could be built from the inputs you've asked for from the user which will have to come from somewhere, no matter how you actually end up running the script (which should undoubtedly be turned into a function for data integrity and scoping) so could have dir return only those of interest directly.
Anyway, details of interface and some minor efficiencies aside, looks like you've got a handle on how to proceed from here; I'll reiterate on the idea of playing at the command line with small dataset so you can understand what the reshape stuff is doing and how.
Think I'll bow out now...good luck.


More Answers (1)

"If you set the directory to the three fiels linked, you'll process them in about 30 seconds."
I did set breakpoint in debugger and ran the first of the three to the point of computing dd2
As we already knew, the result of
y=load('x000.txt');
ypr=detrend(y); % use the Signal Processing TB detrend function
resulted in
K>> dd2{1}(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>> ypr(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>>
which are the same result in one line of code instead of the 50 or so in your function and runs in a fraction of a second. And, in fact,
K>> max(abs(dd2{1}-ypr))
ans =
2.8449e-16
K>>
shows the difference in the result by different computation method is within machine precision of a double precision variable so we've verified that's what the gosh-awful mess actually does.
So, all you now need is to compute the variance of the detrended time series over the time length of choice -- as G shows and I recommended, the easiest way to do this is to reshape() the vector to Nx[] array and let Matlab vector operation by column on the array compute the values for each column in one call.
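That variance step, sketched on toy numbers (the block length n here is arbitrary, standing in for the real 100,000):

```matlab
% One variance per n-sample block via reshape: each column is one block,
% and var() operates column-by-column in a single call.
n = 5;                                % toy variance-window length
yflat = randn(20, 1);                 % stand-in for the detrended signal
v = var(reshape(yflat, n, []));       % row vector, one entry per block
% e.g. v(2) is the same value as var(yflat(6:10))
```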
It appears the only thing left is the computation of the tm array and then plotting the results.
All the calculations could be in a loop of 20 lines at the absolute outside; more like 10 or even fewer it would seem probable.
Wrap that in an outer loop that opens the big file and reads the desired number of records (400K or whatever were chosen) and computes each section.

4 Comments

[EL Answer moved to comment...dpb]
How would I be able to append data to the same variable? This is an issue I had with saving data on the recent modifications I was trying to make, and the '-append' option under save overwrites existing variables, rather than truly appending them. I could save the files under csv formats, but I feel that .mat files would be best. Does it matter?
We can't answer what matters for your application--that's entirely up to your needs or other requirements.
What size of files are you actually needing to keep and for what purposes? Frankly, for large files if one intends to continue to add to them, stream files would be the most efficient by far in both speed and size.
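A sketch of that stream-file idea (the filename is hypothetical): opening a flat binary file in append mode with fwrite truly appends across runs, unlike save's '-append', which replaces an existing variable:

```matlab
% Append results to a flat binary stream file across two "runs",
% then read the accumulated data back in one go.
fname = tempname;                    % hypothetical results file
fid = fopen(fname, 'a');             % 'a' creates the file or appends to it
fwrite(fid, [1 2 3], 'double');      % first batch of results
fclose(fid);
fid = fopen(fname, 'a');             % a later run appends more
fwrite(fid, [4 5], 'double');
fclose(fid);
fid = fopen(fname, 'r');
allv = fread(fid, inf, 'double');    % column vector of everything written
fclose(fid);
```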
@EL, were you the one asking the same question on discord?
If so, then start a new question (here on answers) for just that, explaining in a lot more details what you want to save, to what kind of file (mat, text, excel, binary, whatever), and why it is incremental.
Note that if your input data files are huge, then using datastore and tall arrays would probably be a good idea. Again, this requires discarding the current code.
Yeah, that was me on discord.
I'll start a new question regarding that.



Asked: EL on 23 Sep 2019
Commented: dpb on 29 Sep 2019
