What exactly is going on with this code? I'm having difficulty understanding it

Hello,
I'm struggling to understand the code below. I'll give my best interpretation of what's going on, and I hope that I'm right.
for FilesToRead = 1 : length(FileNames)
    y = load(fullfile(Directory, FileNames{FilesToRead}));
    y = {y};
    n = 400000;
    for i=1:length(y)
        x{i}=(1:length(y{i}))';
    end
    for ii=1:length(y)
        for k=1:length(y{ii})/n
            coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
            yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
        end
    end
    for ii=1:length(y)
        dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
        dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
    end
    for i=1:length(dd2)
        t{i}=zeros(1,length(dd2{i}));
    end
    t{1}=1:length(dd2{1});
    for i=2:length(dd2)
        t{i}= t{i-1}(end)+[1:length(dd2{i})];
    end
    clear coeffs dd i ii k prompt ws yfit
    yRaw=y; xRaw=x;
    x=t; y=dd2;
    clear dd2 t
    % Window for variance - 5 seconds = 100,000
    n = 100000;
    for i=1:length(y)
        for k=1:length(y{i})/n
            a(k)=var(y{i}(n*(k-1)+1:k*n));
        end
        v{i}=a;
        a=[];
    end
    t{1}=1:length(v{1});
    for i=2:length(y)
        t{i}=t{i-1}(end)+[1:length(v{i})];
    end
    for i=1:length(t)
        tm{i} = t{i}.*((n/20000)/60); % time in minutes
    end
end
So essentially, every 400,000 data points, a 1st degree polynomial is fitted to those 400,000 points. Then the data is zeroed, or flattened: the slope of the fitted polynomial is removed, with each data point having the fitted value at its location subtracted from it.
This is done to the entire data set. Those new values are then stored, and variance data is gathered from every 100,000 data points.
Is this the general gist?
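If my reading is right, the pipeline would look roughly like this Python sketch (not the original MATLAB; window sizes shrunk to toy values of 8 and 4 instead of 400,000 and 100,000, and the helper names `linfit`, `detrend_windowed`, and `windowed_variance` are my own):

```python
from statistics import variance

def linfit(xs, ys):
    # Least-squares slope and intercept, like MATLAB's polyfit(x, y, 1).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def detrend_windowed(y, win):
    # Fit a line to each complete window of `win` samples and subtract it
    # ("flattening"); a leftover tail shorter than `win` is dropped, matching
    # the original loop bound k = 1 : length(y)/n.
    flat = []
    for start in range(0, len(y) - len(y) % win, win):
        xs = list(range(start, start + win))
        ys = y[start:start + win]
        slope, intercept = linfit(xs, ys)
        flat.extend(v - (slope * x + intercept) for x, v in zip(xs, ys))
    return flat

def windowed_variance(y, win):
    # Sample variance of each complete `win`-sample window (MATLAB's var).
    return [variance(y[s:s + win]) for s in range(0, len(y) - len(y) % win, win)]
```

On a purely linear ramp the flattened data is all (near) zeros, so every variance window comes out zero -- i.e. the pipeline measures fluctuation about the local trend, not the trend itself.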

9 Comments

I added a missing end as the last line and fixed the indentation.
It's tough to analyse the code without running it.
I thought I went through this at some length a week or two ago???
It wasn't clear why it was written as it was--it seemed a very roundabout way to do it, in all likelihood.
I didn't dig through the entire morass to be absolutely certain regarding the minute details as there wasn't any data supplied to check it out with easily. As Per says, it would be easier given the convoluted way it was written to actually test what it's doing with some known data...the size is immaterial in figuring out the algorithm--400 points are just as good as 400,000.
I'm struggling with understanding the code below
Best is to go back to the original author, complain loudly that there is no comment in the code and ask them to explain it.
I gave up after reading this:
y = {y};
%...
for i = 1:length(y)
y is purposefully created as a 1 element cell array, hence length(y) is guaranteed to be 1 and the for is pointless. So not only is the code not documented but it's most likely wrong.
Ugh. Nasty miserable code, written by someone who forgot (more likely, never learned) that comments are legal. In fact, comment lines are free!
Personally, I'd decide that whatever it does, you would be better off writing code from scratch. Certainly safer, as there is no assurance that you could ever debug this mess in reasonable time if there was a problem. And how would you know if there was a problem anyway?
@G -- From the previous time I looked into the whole mess for OP I'm pretty sure I determined the above is a do-nothing loop that has no bearing on anything else later on.
@J -- OP did share there was one comment at the beginning of the code -- that it, basically, was thrown together -- but not one that actually described the computations. :)
I'm with John, this code should be discarded and new, documented code should be written. There are too many issues with it. e.g.:
  • it assumes that the input length is a multiple of 400,000. If it's not, it will error.
  • It uses some very strange and pointless constructs such as
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
which is simply:
dd{ii} = yfit{ii}(:);
  • it dedicates a whole loop to just preallocating arrays, but the next loop entirely discards the preallocated arrays and replaces them with new ones.
  • all of the loops are a complete waste of time since all of them iterate over just one value.
  • probably more...
I cannot rewrite the code without understanding what, overall, is trying to be done.
I have three 400,000-length files ready if you'd like to test with those?
Here's the new code I'm working with to test things out:
clc
clear
%% This section is to load the correct folder to gather data from
% ***Below is the same code for Linux!
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
%% This is to select the save file name
prompt = 'Enter the name for the processed files according to mmddyyyy_bug_media_oC ';
Filenamesave = input(prompt, 's');
%% This is to select the signal type. This should have already been done with datachop =]
prompt = 'Enter the Signal Type DEF, LFM or SUM ';
signal = input(prompt, 's');
%% Below are some code alternatives that were previously tried. Maybe I should delete these......
% FileNames=dir(Directory);
%FileNames=dir(fullfile(Directory, '*.txt'))
% FileNames=dir(Directory);
% FileNames(1:2) = [];
% FileNames = {FileNames.name};
%% This selects all .txt files in the folder. If the linux script DataChop is used, then you're good!
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
disp(' ');
disp(' ');
disp('File structure generated');
disp(' ');
disp(' ');
%% This is the run inputs
% signal = 'DEF';
% epflplot = 'N';
% rawplot = 'N';
% fftplot = 'N';
% ft = 20000;
% bt = 30;
%% This is the loop to process the data. First, a 1st degree
% polynomial is fitted to the data, with the data points being
% zeroed according to that slope. Once zeroed, variance is gathered.
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
y = {y};
n = 400000;
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance - 5 seconds = 100,000
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
for i=1:length(t)
tm{i} = t{i}.*((n/20000)/60); % time in minutes
end
%% Saving Data
numstr=num2str(FilesToRead);
PathSave=strcat(Directory, Filenamesave,'_',signal,'_', numstr, '.mat');
save(strcat(PathSave), 't', 'tm', 'v','xRaw', 'x', 'yRaw','y','n','FileNames');
%% This is a counter, giving the display an output of the number of files processed.
denom=num2str(length(FileNames));
numer=num2str(FilesToRead);
fraction=strcat(numer,'/',denom);
counter=strcat(fraction, ' Files Processed');
disp(counter);
%% Writing variance data to ASCII File
% dlmwrite(filename, v, delimiter, row, col)
% Data is appended each time. Data is added by row.
%
VarianceASCIIFilename=strcat(Directory2, 'VarianceData', '_', DataDate, '_', signal, '.txt');
dlmwrite(VarianceASCIIFilename, v{1,1}, 'delimiter', '\t', '-append');
%% Appending Data to .mat file
% VarianceMATFilename=strcat(Directory2, 'VarianceData', '_', DataDate, '_', signal, '_');
% save(VarianceMATFilename, 'v', '-append');
%
%% Hopefully the clear/clc doesn't mess anything up!!
% clear v t tm xRaw x yRaw y n
end
% disp(' ');
% disp(' ');
% disp('All Files Processed');
% disp(' ');
% disp(' ');
% disp('Graph Building Starting');
%% Making Graphs
%Starting with variance. Using data from ascii file. Each row is a new
%'interval' of data. In our case, 30 minutes if 'DataChop' functions
%correctly
%% Box and Whisker Plots, 30 min bin time
%Load individual variables with the following format
% load('FilePath/Name', 'variable');
% For plotting the data, do....
% boxplot('variable'{1,1});
% For loading your .txt file; below is an example 09232019
%
%% Various Variables from previous script
% Variables that were not saved with this update. Might be needed when
% making graphs
% -v7.3
% Names
%PathNames
%ft
This makes the same outputs.
If you set the directory to the three files linked, you'll process them in about 30 seconds.
...
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
...
We talked about this issue previously I think, too...???
Use fullfile() to build qualified filenames without specifying the system-specific delimiter character; it will automagically recognize which system it is running on and do the right thing for you.
%% This selects all .txt files in the folder.
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
...
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
...
is just
rootfolder = 'Z:Data';
Chopped = 'Chopped';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory=fullfile(rootfolder,DataDate,Chopped);
FileNames=dir(fullfile(Directory,'*.txt'));
...
for FilesToRead=1:length(FileNames)
y=load(fullfile(Directory,FileNames(FilesToRead).name));
...
and will include only those with the .txt extension and have the system-dependency handled automatically.
I don't think there's any point in then turning y into a cell array instead of just operating on the resulting vector as native double; that alone should speed things up noticeably as well as reduce memory footprint.
Then, as G has noted (and I did in our previous conversations as well), the loop and initial x are completely superfluous and a waste of time and memory.
Also as I noted before, one could then compute the statistics on N elements of the vector piecewise by reshape() to NxM columns, use the inbuilt MATLAB facility to compute by default over columns for builtin routines like mean and var, or iterate over the M columns for those which haven't been vectorized, such as polyfit. At that point you may find it convenient to store the coefficients in a cell array since there are two per column, but it isn't clear there's any need to actually save them at all rather than just process each column (section) and move on to the next. Doing it this way, whether you save the intermediates or not, would eliminate almost all the convoluted indexing that makes so much of the code so difficult to parse.
Once you have done that, working through the rest of the machinations should be relatively straightforward with the same idea, although I still contend it would be far simpler to use a much smaller dataset initially, with N=400 or somesuch, so you don't get overwhelmed by just the size and can see results quickly and on screen instead of being inundated by the sheer mass of numbers that obfuscates the underlying problem. Once you do that, you can go back and verify your revised code produces the same results for a real data set over that time length.
Again, it would also be wise to recode so that these magic numbers aren't buried in the bowels of the code but are easily set and modifiable so the code could be used generically if conditions do change in future.
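To make the reshape-into-columns idea concrete, here is a small Python analogue (not MATLAB; `columns` is a hypothetical stand-in for reshape(y, n, []), and the window sizes are deliberately named constants rather than buried magic numbers):

```python
from statistics import variance

FLATTEN_WIN = 400  # stand-in for the 400,000 "magic number"; a named, settable parameter
VAR_WIN = 100      # stand-in for the 100,000 variance window

def columns(y, n):
    # Split y into its complete length-n chunks, like reshape(y(1:m*n), n, []).
    # Any incomplete tail is dropped, as the original loop bounds do.
    m = len(y) // n
    return [y[i * n:(i + 1) * n] for i in range(m)]

# Per-column statistics then become a simple map over the chunks:
y = [(i % 7) * 1.0 for i in range(1200)]
per_window_var = [variance(col) for col in columns(y, VAR_WIN)]
```

Each window gets its statistic in one pass, with no n*(k-1)+1 : k*n index arithmetic in sight.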


 Accepted Answer

Removing all the unnecessary cruft, this is probably what the code reduces to. I've gone with the fact that your files are 400,000 elements long to remove the badly coded split into 400,000 elements.
I haven't tested the code so I may have got it wrong. I've kept the meaningless variable names; if I had written it, I would have used much better variable names (seriously, a as a variable name?).
%inputs:
y = load(fullfile(Directory, FileNames{FilesToRead})); %since the file is text, it would be better to use csvread or similar instead of load
x = (1:numel(y))';
coeffs = polyfit(x, y(:), 1);
yfit = polyval(coeffs, x); %much simpler than coding the polynomial yourself
dd2 = y - yfit;
n = 100000;
a = var(reshape(dd2, n, []));
tm = (1:numel(a)) * ((n/20000)/60); %time in minutes, one value per variance window (the original scaled the window index, not the sample index)
I've not kept the t variable that is recreated again and again, as it doesn't do anything useful. If you still need it, it's the same as x.
It would be simple to add the split into 400,000 elements if it's really needed.
edit: as discussed below, I missed that dd2 gets renamed to y, replacing the original y. Seriously, who does that? What was wrong with the original dd2 (other than both being meaningless)?

18 Comments

"I've gone with the fact that your files are 400,000 elements long to remove the badly coded split into 400,000 elements."
Ah! Now I remember...there was a previous thread or part of the other on recoding that asked about splitting a humongous file into 400K record chunks. Apparently, they're actually collecting data at a high sample rate over very long time periods and using the arbitrary 400K segments to detrend.
Seemed to me that the detrended data were saved before but don't seem to be in the above. Eric should double-check on that...otherwise seems no point at all in doing so.
There is, of course, in the Signal Processing TB a detrend function altho it has a fair amount of overhead but could reduce the high-level coding here....
I copied and pasted the loop (window for flattening and variance data) to fit the outputs of a linux script. I didn't change anything except what gets inputted. Here's the original below
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\........\*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','Data');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
clear;
clc;
close all;
%% Retrieve files
tstart=tic
prompt = 'Enter the name of a .mat file to run (e.g. file_mmddyyyy_signal.mat) ';
files = input(prompt,'s');
load(files);
%% Terminate or proceed
if isequal(FileNames{1},0)
disp('User selected Cancel') %display value of variable
else
for i=1:numel(FileNames)
clear ii
c=class(FileNames{i});
if c=='char'
FileNames{i}=cellstr(FileNames{i});
end
for ii=1:numel(FileNames{i})
disp(['User selected ', fullfile(PathNames{i}, FileNames{i}{ii})])
end
end
%% User query to run/save Figure output
%Code for commandline inputs
prompt = 'Enter file name for figures according to: mmddyyyy_bug_media_oC ';
Filenamesave = cellstr(input(prompt,'s'));
prompt = 'Enter signal type for files';
signal = input(prompt,'s');
if isempty(signal)
signal = 'DEF';
end
prompt = 'Make EPFL plots? Y/N [Y]: ';
epflplot = input(prompt,'s');
if isempty(epflplot)
epflplot = 'Y';
end
prompt = 'Make raw signal and variance plots? Y/N [Y]: ';
rawplot = input(prompt,'s');
if isempty(rawplot)
rawplot = 'Y';
end
prompt = 'Make FFT plots? Y/N [Y]: ';
fftplot = input(prompt,'s');
if isempty(fftplot)
fftplot = 'Y';
end
prompt = 'Enter the data collection rate (Hz)[20000]: ';
ft = input(prompt);
if isempty(ft)
ft = 20000;
end
% Binning inputs
prompt = 'Would you like to bin the data? [Y]/N ';
bo = input(prompt);
if isempty(bo)
bo = 'Y';
end
if bo=='Y'
% user inputs the bin time, if no input, script continues
prompt = 'Enter the bin time in minutes ';
bt = input(prompt);
end
%% Import and concatonate data segments
for i=1:length(FileNames)
y{i}=[];
for ii=1:length(FileNames{i})
Path{i}{ii}=fullfile(PathNames{i}, FileNames{i}{ii});
fileID{ii} = fopen(Path{i}{ii});
c=[];
while ~feof(fileID{ii})
C = textscan(fileID{ii},'\t%f',10000);
c=[c,C'];
end
fclose(fileID{ii});
c = c(~cellfun(@isempty, c));
gg=cell2mat(c');
yp{ii}=gg;
test=regexp(FileNames{i}{1}, 'txt','ONCE');
if isempty(test)
Names{i}=FileNames{i}{1};
else
[~,Names{i}]=fileparts(FileNames{i}{1});
end
%waitbar (i/length(FileNames),h);
end
for ii= 1:length(FileNames{i})
y{i}=[y{i};yp{ii}]; % concatonates all y data for each batch
end
end
Names=cellstr(Names);
Names2=strrep(Names,'_',' ');
%close(h);
disp('Files are loaded successfully!')
clear ans c C fileID gg yp h i ii Path question test prompt
% clear ans c C fileID gg h i Path question test prompt
%% Bin the data
% this reallocates y{i}, if not used, script still runs
if isempty(bt)
return
else
bd=bt*60*ft; % number of data points in the bin
for i=1:length(y)
f(i)=floor(length(y{i})/bd);
% yb{1}=y{1}(1:bd);
% N{1}=Names{i}; %replicate name to each bin
% N2{1}=Names2{i}; %replicate name to each bin
% if f>1 % multiple bins allocated or data cut to single bin size
for ii=1:f(i)
bds=(bd*ii)-(bd-1); % bin data start
bde=bds+(bd-1); % bin data end
yb{(sum(f(1:i-1))+ii)}=y{i}(bds:bde); % bin the data
N{(sum(f(1:i-1))+ii)}=Names{i}; %replicate name to each bin
N2{(sum(f(1:i-1))+ii)}=Names2{i}; %replicate name to each bin
end
% end
end
y=yb; % reallocate to y vector
Names=N; % update to reflect new bins
Names2=N2; % update to reflect new bins
clear yb N N2
end
%% Process the data
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
% Window for flattening
n = 400000;
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
%%%%%In between here is 500 lines of graph making. I'm omitting this for now. Afterwards, there's...
%% Rename variables
for i=1:length(t)
tm{i} = t{i}.*((n/ft)/60); % time in minutes
end
%Save variable output.
%v = variance
save(strcat(PathNamesSave{1},Filenamesave{1},'_',signal), 't', 'tm', 'v','xRaw', 'x', 'yRaw', 'y','n', 'ft','Filenamesave','FileNames','PathNames','PathNamesSave','Names','Names2','-v7.3');
disp('Done')
Just to be clear, the variance calculations are not using the data that has been zeroed or flattened? I don't see this. We're saying the variance is gathered from the raw data?
"Just to be clear, the variance calculations are not using the data that has been zeroed or flattened?"
Yes, it is. G missed (and I'd forgotten, and missed again in this read-through) the almost-buried line
y=dd2;
that overwrites the raw data. As he notes, the sorry variable naming convention along with the massive obfuscation makes parsing of what is probably a very simple straightforward computation quite painful.
But, with that, I'd guess odds are quite good G's recast will reproduce the intent of the original with the assumption he made that you did split up the original very long file into the multitude of shorter ones.
I still contend, as we discussed in the previous thread on this same topic, that splitting the files up is also a waste of compute time and memory -- and of your time -- compared to just processing the actual files.
But, little steps first, maybe; see if in fact if you just do what it looks like needs being done above you don't get satisfactory results.
Yep, completely missed the renaming of dd2 to y. Bonkers! How did this code end up being used for some critical processing? Did nobody review it?
Honestly, as we said before, it would be much better to start from scratch. Specify what the code should do rather than reverse-engineering extremely bad code.
I want to rewrite the code, but I do not know exactly what it's intending to do. I started out with hardly a clue. I was thrown in the deep end and told to run this script as part of a function, which ended up not working.
As far as I can tell, a 1st degree polynomial is fitted to a series of data. The difference between this fitted line and the x axis is gathered, with the corresponding fitted value subtracted from each data point, 'leveling' the data and thereby removing any natural drift in the data.
Then the variance is gathered from this new data set.
I intend on rewriting this in R, because I'm failing at MATLAB and tired of being frustrated every day. I was able to write some code in linux very easily and R looks to be a similar format. But I need to reverse engineer what this code is doing so I don't miss anything critical.
Is there something I'm missing regarding the analysis of the data?
"write [some] code in linux very easily "
What does that mean? linux is the OS, not a language. There are command shells with batch commands that run under linux, granted, but there are those for Windows as well. I use the JPSoftware CMD replacement shell as it has a lot more features than does CMD, but it's certainly not mandatory as there's nothing here that can't be done pretty straightforward with ML.
R is different entirely; unless you already know something of it, I don't see learning yet another interpreted language system is the answer to the problem.
As G showed, you could implement what you describe above in about 10 lines of MATLAB code or less.
It's unfortunate your predecessor left such a sorry bag of stuff, but when you strip it back to the basics, there's really very little actually going on in the above code.
I want to rewrite the code, but I do not know exactly what it's intending to do
I find it amazing that the overall goal of some code that is actually in use is not known. Surely, you must know what the ultimate purpose of the whole code chain is. In that case, I contend you'd be better off starting from scratch and forgetting anything that this code does.
If the whole code is just a black box, stuff goes in, magic happens, stuff comes out. How can that be used with confidence that the magic is the right magic?
As dpb pointed out, the linear fit + subtraction is a detrend. So here we have many lines of very obfuscated code just to do a simple one-line operation. Then you calculate the variance in batches.
The first thing to ask is: is this what should be done with the data? Does it even make sense to do that? If not, then there's no point even looking at the code.
So, maybe you should just bin that code, write a completely fresh spec of what some new code should do and ask for help in doing that.
I was told snippets of what exactly is happening; however, no one went into detail because I was told to use what was given to me and write a script using this as a function.
However, that ended up not working, so instead of beating a dead horse, which I've been doing for the last month, I'm trying to rewrite it. Before I can do that though, I want to make sure there isn't something in the script that I'm missing, because I'm pretty stupid when it comes to programming and I'm betting that I'm missing something. I cannot rewrite code that I do not understand the intention of, and I don't feel like wasting my time writing code when I've missed something critical. Now I feel I understand the intention, and yes, it makes sense given how I understand the machine to function.
I did take a 16-hour R course, and it seemed much more intuitive, so I was going to translate the overall goal of what I thought this script was doing into R. I felt like it would be much easier. MATLAB has been nothing but a nightmare so far, and I'm tired of getting stressed out.
For some reason, every time I make progress in MATLAB, the next step is a road block, and I have to backtrack again to find a right answer. The script that I had modified was working, except I couldn't get it to write data in any format. I couldn't figure out how to keep the append option from overwriting old variables, and couldn't figure out how to rename the new variables with a counter I wrote in (hopefully keeping all variables nice and organized, and making them easier to call up in an organized, numerical format). I can write them in ASCII format, however I assumed that was not the 'optimal' way.
Anyway, I'm going to integrate this script with a loop I previously made so files are read in 400,000-point segments. I'll probably end up posting that when I inevitably get stuck again.
I would submit R would only seem more intuitive because you've been given a piece of trash and have been trying to use it instead of just rewriting it from the start. The obfuscation inside this piece of code looks almost to be intentional -- maybe there was a case of an outside consultant trying to produce future work for himself or somesuch going on.
I don't see how it could be much more intuitive than
y=load(data)
y=detrend(y)
v=var(y)
tm=[0:numel(y)-1]*dt
which is what the whole thing boils down to ignoring minor bookkeeping details.
I'll grant the original code doesn't look like that superficially, but it definitely isn't MATLAB that caused that but either a truly incompetent coder or done deliberately.
It would seem to be a reasonable thing to make your employer aware of the two side by side and to have an internal design review to ensure the specifications for the program are formalized going forward.
"couldn't figure out how to rename the new variables with a counter "
That is absolutely the wrong approach -- that way will inevitably lead to nightmares and other unmaintainable code.
We would need to see what it was you were trying to accomplish by doing that in order to provide the best option, but that is where perhaps cell array could be useful.
Then again, if it has to do with this idea of breaking up big files into chunks, I'll reiterate yet again that that is an unneeded and wasteful step instead of processing the original file either in pieces of N size or, alternatively, as G has suggested using tall arrays that are able to address data larger than what will fit in memory as if it were in memory from a programming standpoint.
The first alternative could be as simple as the above code outlined including
N=YourBigNumberRecords
tend=0;
fidi=fopen(testinputfile,'r')
fido=fopen(outputfile,'w')
while ~feof(fidi)
y=fscanf(fidi,'%f',N)
y=detrend(y)
v=var(y)
tm=tend+[0:numel(y)-1]*dt
tend=tm(end)+dt
fwrite(fido,y)
fwrite(fido,v)
end
fclose('all')
which would interleave the y,v sections; you could choose some alternative schemes as well of additional files or using structures or whatever, but the above would serve just fine as is.
It's not clear where the plotting is in the scheme--it can either be for each subsection by including in the loop or at the end by reading what it needs from the file--positions can be computed by counting records and fseek could easily retrieve any time position desired as every element is a single double eight bytes.
In my humble ( :) ) opinion you need to step back from the code and decide what is the end objective then implement that as G has said -- continuing to tie the organization to the original abomination is never going to lead to anything good--things will only get worse as continue to try to graft on new features or the like.
ADDENDUM
The Q? we discussed some time back then of setting up a process to do a whole series of files (experiments) is then reduced to simply calling the above function with the list of those files. There's a FEX submission (I forget whose just now) filefun that is the same idea for files as are the builtin arrayfun or cellfun that operate on arrays or cell arrays--give it the list of files and it calls the given function for them in a one-line syntax for you.
This is where I currently am:
%% New Script. Faster. Better. Stronger. Started 25SEPT2019
clear;
clc;
%% For Testing
% load TestingNew;
%% Retrieve Files
tstart=tic
rootDirectory = 'Z:\ColdTech\NMDIII\Data\';
prompt = 'Enter Date of Experiment to be Processed (mmddyyyy) ';
DataDate = input(prompt, 's');
prompt = 'Enter the signal to be processed (DEF, LFM or SUM) ';
signal = input(prompt, 's');
prompt = 'Enter the name for the Output File (mmddyyyy_bug_media_oC) ';
outputfilename = input(prompt, 's');
Directory = strcat(rootDirectory, DataDate, '\');
FileNames = dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, signal)); % Files will be in the Directory, and they must be the signal type and a .txt file. This is enough to discriminate
outputfile = strcat(Directory, outputfilename, '.mat');
%% Data Analysis
% Flattening the Data and Gathering Variance
bt = 36000000;
N =400000; % Window of Flattening, 20 seconds of data
n = 100000; % Window of Variance, 5 seconds of data
tend = 0;
for FilesToRead = 1:length(FileNames);
FileDirectory = strcat(Directory, FileNames);
fidi=fopen(FileDirectory{FilesToRead},'r'); %This should be what varies
fido=fopen(outputfile,'w'); %This should be the same
save_struct = struct('y', [], 'v', [], 'yflat', [], 'filename', []);
save_struct = repmat(save_struct,1,3);
save_struct(FilesToRead).filename = FileDirectory{FilesToRead};
while ~feof(fidi);
ybt=fscanf(fidi,'%f\n',bt);
% if ybt<length(bt); % There is always 'fat' at the end of the data. This removes it. Should happen at the end of every file. Should NOT end the loop
% return % Is this formatted correctly? not sure if return or break is better.....
% end
for y=1:length(ybt)/N;
% for y=fscanf(ybt, '%f\n', N) % Maybe need to add a 1:length(ybt)/N
y=fscanf(ybt, '%f\n', N)
yflat=detrend(y);
v=var(reshape(yflat, n, []));
tm=tend+[0:numel(y)-1]*dt;
tend=tm(end)+dt;
save_struct(FilesToRead).v = v;
save_struct(FilesToRead).y = y;
save_struct(FilesToRead).yflat = yflat;
end
end
end
save('my_filename', 'save_struct');
%% Data Analysis using fwrite
Directory = strcat(rootDirectory, DataDate, '\');
FileNames = dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, signal)); % Files will be in the Directory, and they must be the signal type and a .txt file. This is enough to discriminate
outputfile = strcat(Directory, outputfilename, '.mat');
My files, in an ideal world, would be an exact multiple of 36,000,000. However, I always have excess data at the end, between 1-3 million lines. I tried to add an 'if' statement, so that if a bin doesn't meet that requirement, the loop would return and continue to the next file. That isn't working, and I cannot save data currently. I think it's because fscanf is only supposed to read files, rather than variables.
Rather than continuing on before I get stuck again: do you think I'm on the right track?
Looks much cleaner, indeed! :) I don't have time this instant to really read it in detail, but I'd say you are on the right route, indeed.
Just so we're on the same track when I do get a chance (hopefully later on this evening or in the morning) a couple questions...
Q1: Is this file you're processing now the full experiment file without the partitioning of it into multiple pieces as before? I THINK that's what I'm reading, just to make sure am on same page in thinking through the process.
Q2: Why the variable names bt and ybt? What does "bt" stand for, and why would you submit that it's a good variable name for a length constant? To me, at least, it makes reading the code difficult because it doesn't seem to have any meaning related to what it is. Making the code as human-readable as possible--ideally reading as easily as a novel--is always a goal.
while ~feof(fidi);
ybt=fscanf(fidi,'%f\n',bt);
% if ybt<length(bt); % There is always 'fat' at the end of the data. This removes it. Should happen at the end of every file. Should NOT end the loop
% return % Is this formatted correctly? not sure if return or break is better.....
% end
for y=1:length(ybt)/N;
% for y=fscanf(ybt, '%f\n', N) % Maybe need to add a 1:length(ybt)/N
y=fscanf(ybt, '%f\n', N)
Q3: Can you hold ybt in memory ok on your system and other systems you expect to be able to run this on? You did not check the optional return value of the number of elements read so you don't know whether the read returned the whole wanted number or not. I would expect so, 3.6E6/8/(1024*1024) --> <30 MB on machines with GB of memory these days, but your code needs to verify and handle the case if it fails--if nothing else, inform the user and abort; better have backup logic to process the file in smaller chunks.
Q4: What do you think is the purpose of the second loop using fscanf? You've already got the double array ybt in memory; just operate on it. Oh--I get it...this appears to be the attempt to implement the suggestion to just loop over the files back when were reading 400K individual files. OK, now that you are processing the whole file, really instead of actually reading the full 3.6E6 array at once, if the processing is really always on these chunks of 400K elements at a time (or some other arbitrary set as may evolve in future so should definitely be a variable and not a buried "magic number"), it really makes sense to just read that number each time from the file, operate on it and then go on to the next. Then you keep a secondary running count variable of the total number of elements you've read/processed and quit when that reaches the other, bigger magic number (3.6E6 here, or N1/N2 --> 3.6E6/400K = 9 iterations of the loop).
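The chunked-read pattern described in Q4 can be sketched at toy scale -- a demo file is written first so the snippet is self-contained, and the sizes here are small stand-ins for the real N = 400000 and bt = 36000000:

```matlab
% Toy demo of reading a file N values at a time with a running count,
% instead of pulling the whole bt-element array into memory at once.
fname = tempname;                     % scratch file for the demo
fid = fopen(fname, 'w');
fprintf(fid, '%f\n', 1:23);           % 23 values: deliberately not a multiple of N
fclose(fid);

N = 5; bt = 23; nRead = 0;            % toy chunk size and total
fidi = fopen(fname, 'r');
while nRead < bt
    [chunk, count] = fscanf(fidi, '%f\n', N);
    if count == 0, break; end         % ran out of data early
    % ... process chunk here (detrend, var, etc.) ...
    nRead = nRead + count;            % secondary running count, as suggested
end
fclose(fidi);
```

Checking `count` on each read is what guards against the short last chunk and a file that ends early.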
OK, I do have to run right now, but you definitely are getting towards something that is much, much better and will get you to an end goal pretty soon now, I think.
Oh, "just one more thing!"
Q5: Where is the plotting over what sequence of data in the process? The individual smaller sections or the whole thing or...?? Would help knowing that so ensure get the data organized for it, too.
ADDENDUM:
OBTW, if you can hold the 3.6E6 in memory, it gets even easier because detrend() is also vectorized to operate by columns as many MATLAB-supplied functions are. In that case, you can do the thing as
ybt=fscanf(fidi,'%f\n',bt); % read the 3.6E6 (bt) entries
y=detrend(reshape(ybt,N,[])); % detrend in N-sized blocks by column
y=y(:); % put back into column vector
if you don't need the raw data, then you can do it all in place and dispense with the second variable.
Also, if you read the full bt elements at once, you don't need the while loop at all--from your description once you have read that many observations, you don't care about the rest of the file content so there's no need to do anything at all with it. Just close that file and go on to the next (after completing analysis of the current one, of course).
ADDENDUM SECOND:
" once you have read that many observations, you don't care about the rest of the file content so there's no need to do anything at all with it."
If you really do want to trim the files all to that magic length, then at this point you could just rewrite the file with the ybt vector before closing and that objective would then be done, too. Of course if you do that, that other data will be gone irretrievably so just in case somebody at some future date might want to retrieve more than that, probably better to just go on to the next file and leave the existing one alone. Or, of course, at the expense of extra disk space you create the fixed-length file in parallel with the original with some related name rather than overwrite the original.
OK, here's my first pass through about how I'd structure what I understand. I still need the file-naming pattern to be sure the wildcard construction for dir() is right, but this removes the dependence on the system file separator yours had.
% Retrieve Files
Drive='Z:';
Root=fullfile(Drive,'ColdTech','NMDIII','Data');
prompt = 'Enter Date of Experiment to be Processed (mmddyyyy) ';
DataDate = input(prompt, 's');
prompt = 'Enter the signal to be processed (DEF, LFM or SUM) ';
signal = input(prompt, 's');
prompt = 'Enter the name for the Output File (mmddyyyy_bug_media_oC) ';
outputfilename = input(prompt, 's');
Directory = fullfile(Root, DataDate);
% Files will be in the Directory, and they must be the signal type and a .txt file.
d=dir(fullfile(Directory, ['*' signal '*.txt'])); % fullfile supplies the separator between folder and pattern
outputfile = fullfile(Directory, [outputfilename '.mat']); % Directory already contains Drive and Root
%% Data Analysis
% Flattening the Data and Gathering Variance
bt = 36000000;
N =400000; % Window of Flattening, 20 seconds of data
n = 100000; % Window of Variance, 5 seconds of data
dt = 1/20000; % sample period; 20 kHz rate implied by the original n/20000/60 scaling
tend = 0;
for i=1:numel(d)
fname=fullfile(d(i).folder,d(i).name);
fidi=fopen(fname,'r'); % input file; this is what varies per iteration
if i==1 % preallocate once, sized to the number of files found
save_struct = repmat(struct('y',[],'v',[],'yflat',[],'filename',[]), 1, numel(d));
end
save_struct(i).filename=fname;
y=fscanf(fidi,'%f',bt);
yflat=detrend(reshape(y,N,[])); % detrended signal over N samples
yflat=yflat(:); % reshape back to column vector
v=var(reshape(yflat, n, []));
v=v(:);
tm=tend+[0:numel(y)-1]*dt;
tend=tm(end)+dt;
save_struct(i).v = v;
save_struct(i).y = y;
save_struct(i).yflat = yflat;
fclose(fidi);
end
save('my_filename', 'save_struct');
Needs the output file names updated to be used but basics I think are there.
I shortened some variable names to shorten the code -- long loop indexing variables are more clutter than help -- "i" is the traditional index and since there's no imaginary arithmetic in sight here, there are no issues. You'll want to turn this into a function anyway for the longer term, at which point it will also become local and not matter for the base workspace.
I also just returned the dir() struct in d and pull the file name and folder from it.
AFAICT, there is only one file of the type for the experiment so there's no need for that loop I think altho I left it in there in case I don't quite follow.
I just saw the 3rd option for detrend too and realized it would remove the need for a secondary loop. It seems that it requires an array to define where the cutting would happen. In our case, it would be every 400,000, so I was thinking something along the lines of
split=[(((i-1)*N)+1):i*N]
yflat=detrend(reshape(y,N,[split]));
EDIT 1: Now that I look at it, this would require a loop for the split variable to work correctly
EDIT 3: This works better.
split = linspace(1, bt, N)
yflat=detrend(reshape(y,N,[split]));
Also, what's the purpose of tend?
Also, I wasn't hoping to trim the data, I was hoping to have it be ignored. I figured the following would make it go back to the top of the loop, starting with the next iteration
if i~=bt
return
end
This is just to remove data I do not need, and not to have data from one condition falling into the bin time of another condition.
EDIT 2: Also, how does the reshape function help us in this situation? The documentation for detrend doesn't make it look like it's necessary.
"... 3rd option for detrend too and realized it would remove the need for a secondary loop"
No. Well, that would be one way. But then must build a secondary vector of the breakpoint indices and pass it. Since your breakpoints are at 400K, and detrend operates by column anyway, that's what is done via reshape() to let it operate on columns by default (just like var).
No secondary loop is needed. I didn't try timing to compare, but I'd expect that version to outperform the one with specific breakpoints as well as not needing the array for memory savings plus there's no optimization better than that of not computing something that isn't needed.
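A small command-line check of that claim, at the toy scale recommended below -- block-wise detrend via reshape() gives the same residuals as the original per-block polyfit-and-subtract loop (sizes here are made up: 4 blocks of 10 instead of 400K):

```matlab
% Verify: detrend on reshaped columns == per-block 1st-degree fit and subtract.
N = 10;                                            % toy block length
y = (1:40)' + 0.01*sin((1:40))';                   % trending data with some wiggle
yr = reshape(detrend(reshape(y, N, [])), [], 1);   % reshape route: one call
yloop = zeros(size(y));                            % explicit loop route
for k = 1:numel(y)/N
    idx = (k-1)*N+1 : k*N;
    c = polyfit((1:N)', y(idx), 1);                % 1st-degree fit to this block
    yloop(idx) = y(idx) - polyval(c, (1:N)');      % subtract the fitted line
end
% max(abs(yr - yloop)) is within machine precision of zero
```

The residual of a full linear fit doesn't depend on where the x axis starts, which is why detrend's internal 1:N index per column reproduces the original code's continuing-index fits.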
I STRONGLY recommend you build a small test dataset and experiment at the command line so you can see how these things work visually so you get a better feel for what's actually happening. The sheer magnitude of a 400K segment of a 36M array precludes that, but the logic is identically the same for 10 lines out of 40, say, and you can inspect that for correctness visually plus understand what the code is actually doing.
"I figured the following would ... go to top of loop..."
That's totally unnecessary; the full 36M elements have been read and the loop goes on its merry way after computing stuff with them (via the reshape() option outlined above). Only if you can't hold that much in memory and process each 400K in individual reads will you need a loop within the same file. NB: the while ~feof() loop is gone; unneeded, too.
The rest was just hypothetical of what you could do; what the intent was wasn't made totally explicit.
On the last, that's the same question back over again -- yes, you could do it w/o reshape() but it's easier coding using it and since reshape is simply an internal adjustment of indexing of the same data, it's quite low in overhead so is efficient computationally besides. You can do some timing testing if you care to, but I'm confident enough the performance will be at least as good plus you don't have to compute the breakpoints such that I'm not going that route at all. That option is provided in case there were variable breakpoint spacings to be accounted for; not what you have.
You haven't really directly answered any of the Q? I posed nor provided the filenaming pattern sufficiently to be able to check out the modifications there for system independence so still have a few details to either iron out or debug.
In particular, the Q? about whether there really is ever more than one file returned for the given date/type analysis at the beginning is the key one. If this is going to be used interactively, I'd suggest possibly using uigetfile dialog instead of manual prompts requiring typing in as present, but that can wait.
tend also actually isn't needed any longer if one builds the time vector for the whole enchilada--that would again be needed if one looped over the 400K chunks to create the continuous time for each section picking up from the last. There's any number of ways to build that corollary vector and there's a lot to be said if it is always the specific sampling rate to not even bother to do anything except compute it on the fly and not save it at all...other than if one ever does just read sections of a file starting somewhere in the middle. But even then, if it's a constant and never changes, it's just dt*N for whatever N is chosen or N==T/dt to compute the offset given a desired time.
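That on-the-fly offset arithmetic can be written out directly -- the 20 kHz rate below is an assumption inferred from the original code's n/20000/60 scaling:

```matlab
% Compute time offsets from constants; no stored time vector or tend needed.
fs = 20000;             % samples per second (assumed sampling rate)
dt = 1/fs;
N  = 400000;            % 20-second flattening window
k  = 3;                 % some chunk number
t0 = (k-1)*N*dt;        % start time of the k-th N-sample chunk, in seconds
T  = 60;                % a desired time, in seconds
idx = round(T/dt) + 1;  % the sample index corresponding to time T
```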
Q1: Is this file you're processing now the full experiment file without the partitioning of it into multiple pieces as before? I THINK that's what I'm reading, just to make sure am on same page in thinking through the process.
It is the base number of files produced by the machine. There will be a minimum of two, maybe more. It depends on how often I have to stop data acquisition and change conditions.
Q2: Why the variable names bt and ybt? What does "bt" stand for and why would you submit that's a good variable name for a length constant? To me, at least, it makes reading the code difficult because it doesn't seem to have any meaning related to what it is. Making the code human-readable as much as a novel plot as possible is always a goal.
bt was the bin time that the box plots would be made in, 36,000,000, or 30 minutes of data. ybt was a variable I threw in to connect bt and y between different lines. Not really needed anymore though.
Q3: Can you hold ybt in memory ok on your system and other systems you expect to be able to run this on? You did not check the optional return value of the number of elements read so you don't know whether the read returned the whole wanted number or not. I would expect so, 3.6E6/8/(1024*1024) --> <30 MB on machines with GB of memory these days, but your code needs to verify and handle the case if it fails--if nothing else, inform the user and abort; better have backup logic to process the file in smaller chunks.
Not using ybt anymore, so we don’t need it.
Q4: What do you think is the purpose of the second loop using fscanf? You've already got the double array ybt in memory; just operate on it. Oh--I get it...this appears to be the attempt to implement the suggestion to just loop over the files back when were reading 400K individual files. OK, now that you are processing the whole file, really instead of actually reading the full 3.6E6 array at once, if the processing is really always on these chunks of 400K elements at a time (or some other arbitrary set as may evolve in future so should definitely be a variable and not a buried "magic number"), it really makes sense to just read that number each time from the file, operate on it and then go on to the next. Then you keep a secondary running count variable of the total number of elements you've read/processed and quit when that reaches the other, bigger magic number (3.6E6 here, or N1/N2 --> 3.6E6/400K = 9 iterations of the loop).
I don’t recall off the top of my head. It was something I thought of that may have let the script work.
Q5: Where is the plotting over what sequence of data in the process? The individual smaller sections or the whole thing or...?? Would help knowing that so ensure get the data organized for it, too.
All the data analyzed would be plotted together. I intend on having box plots made with 30 minutes of data (36,000,000 divided by 100,000 for variance results in n=360 per box plot).
Then the raw y data over time, so the drift of the machine can be measured.
Then the flattened y data over time, so the average amplitude can be viewed without noise (the drift)
Then the flattened y data would undergo FFT, so the most common amplitudes can be viewed as a control.
Additionally, violin graphs over 30 minute bins would measure which amplitudes were most common in their bin time.
Regarding uigetfile, this script will be run remotely through command line without an interface. The data date and the signal type is enough to differentiate between data. I also intend on having a list populate of the files selected and what order they’ll be processed, so the script can be terminated if the wrong files were loaded.
"Not using ybt anymore, so we don’t need it."
That doesn't answer the Q? underlying -- can you hold the full vector in memory on the systems you're running on? What you use for the variable name is immaterial. If so, then the full file can be operated on at once and there's no need to loop through the file at all. If not, then you need the loop to read segments and process them...but you don't need to (and shouldn't) separate out into individual files to do that.
"data date and the signal type is enough to differentiate between data"
Yeah, you've said that but you're going at it in what appears to be very awkward ways...if you would illustrate the naming pattern used, I'm certain a wildcard pattern could be built from the inputs you've asked for from the user which will have to come from somewhere, no matter how you actually end up running the script (which should undoubtedly be turned into a function for data integrity and scoping) so could have dir return only those of interest directly.
Anyway, details of interface and some minor efficiencies aside, looks like you've got a handle on how to proceed from here; I'll reiterate on the idea of playing at the command line with small dataset so you can understand what the reshape stuff is doing and how.
Think I'll bow out now...good luck.


More Answers (1)

"If you set the directory to the three fiels linked, you'll process them in about 30 seconds."
I did set breakpoint in debugger and ran the first of the three to the point of computing dd2
As we already knew, the result of
y=load('x000.txt');
ypr=detrend(y); % use the Signal Processing TB detrend function
resulted in
K>> dd2{1}(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>> ypr(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>>
which are the same result in one line of code instead of the 50 or so in your function and runs in a fraction of a second. And, in fact,
K>> max(abs(dd2{1}-ypr))
ans =
2.8449e-16
K>>
shows the difference in the result by different computation method is within machine precision of a double precision variable so we've verified that's what the gosh-awful mess actually does.
So, all you now need is to compute the variance of the detrended time series over the time length of choice -- as G shows and I recommended, the easiest way to do this is to reshape() the vector to Nx[] array and let Matlab vector operation by column on the array compute the values for each column in one call.
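That variance step, sketched on toy numbers (the block length n here is arbitrary, standing in for the real 100,000):

```matlab
% One variance per n-sample block via reshape: each column is one block,
% and var() operates column-by-column in a single call.
n = 5;                                % toy variance-window length
yflat = randn(20, 1);                 % stand-in for the detrended signal
v = var(reshape(yflat, n, []));       % row vector, one entry per block
% e.g. v(2) is the same value as var(yflat(6:10))
```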
It appears the only thing left is the computation of the tm array and then plotting the results.
All the calculations could be in a loop of 20 lines at the absolute outside; more like 10 or even fewer it would seem probable.
Wrap that in an outer loop that opens the big file and reads the desired number of records (400K or whatever were chosen) and computes each section.

4 Comments

[EL Answer moved to comment...dpb]
How would I be able to append data to the same variable? This is an issue I had with saving data on the recent modifications I was trying to make, and the '-append' option under save overwrites existing variables, rather than truly appending them. I could save the files under csv formats, but I feel that .mat files would be best. Does it matter?
We can't answer what matters for your application--that's entirely up to your needs or other requirements.
What size of files are you actually needing to keep and for what purposes? Frankly, for large files if one intends to continue to add to them, stream files would be the most efficient by far in both speed and size.
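A sketch of that stream-file idea (the filename is hypothetical): opening a flat binary file in append mode with fwrite truly appends across runs, unlike save's '-append', which replaces an existing variable:

```matlab
% Append results to a flat binary stream file across two "runs",
% then read the accumulated data back in one go.
fname = tempname;                    % hypothetical results file
fid = fopen(fname, 'a');             % 'a' creates the file or appends to it
fwrite(fid, [1 2 3], 'double');      % first batch of results
fclose(fid);
fid = fopen(fname, 'a');             % a later run appends more
fwrite(fid, [4 5], 'double');
fclose(fid);
fid = fopen(fname, 'r');
allv = fread(fid, inf, 'double');    % column vector of everything written
fclose(fid);
```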
@EL, were you the one asking the same question on discord?
If so, then start a new question (here on answers) for just that, explaining in a lot more details what you want to save, to what kind of file (mat, text, excel, binary, whatever), and why it is incremental.
Note that if your input data files are huge, then using datastore and tall arrays would probably be a good idea. Again, this requires discarding the current code.
Yeah, that was me on discord.
I'll start a new question regarding that.



Asked: EL on 23 Sep 2019
Commented: dpb on 29 Sep 2019
