What exactly is going on with this code? I'm having difficulty understanding it
Hello,
I'm struggling with understanding the code below. I was hoping to give my best interpretation of what's going on, and hope that I'm right.
for FilesToRead = 1 : length(FileNames)
    y = load(fullfile(Directory, FileNames{FilesToRead}));
    y = {y};
    n = 400000;
    for i = 1:length(y)
        x{i} = (1:length(y{i}))';
    end
    for ii = 1:length(y)
        for k = 1:length(y{ii})/n
            coeffs{ii}(k,:) = polyfit(x{ii}(n*(k-1)+1 : k*n), y{ii}(n*(k-1)+1 : k*n), 1);
            yfit{ii}(k,:) = coeffs{ii}(k,2) + coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
        end
    end
    for ii = 1:length(y)
        dd{ii} = subsref(yfit{ii}.', substruct('()', {':'})).';
        dd2{ii} = y{ii}(1:length(dd{ii})) - dd{ii}';
    end
    for i = 1:length(dd2)
        t{i} = zeros(1, length(dd2{i}));
    end
    t{1} = 1:length(dd2{1});
    for i = 2:length(dd2)
        t{i} = t{i-1}(end) + [1:length(dd2{i})];
    end
    clear coeffs dd i ii k prompt ws yfit
    yRaw = y; xRaw = x;
    x = t; y = dd2;
    clear dd2 t
    % Window for variance - 5 seconds = 100,000
    n = 100000;
    for i = 1:length(y)
        for k = 1:length(y{i})/n
            a(k) = var(y{i}(n*(k-1)+1:k*n));
        end
        v{i} = a;
        a = [];
    end
    t{1} = 1:length(v{1});
    for i = 2:length(y)
        t{i} = t{i-1}(end) + [1:length(v{i})];
    end
    for i = 1:length(t)
        tm{i} = t{i}.*((n/20000)/60); % time in minutes
    end
end
So essentially, for every 400,000 data points, a 1st-degree polynomial is fitted to those 400,000 points. Then the data is zeroed, or flattened: the fitted line is subtracted so its slope becomes 0, with each data point reduced by the value of the fitted line at its location.
This is done to the entire data set. The new values are then stored, and variance is computed over every 100,000 data points.
Is this the general gist?
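The interpretation above can be sketched in plain Python (illustrative only; the original is MATLAB, and the helper names and tiny block sizes here are made up for demonstration, standing in for the 400,000 / 100,000 in the real code):

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope a and intercept b for y = a*x + b."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def blockwise_detrend(y, n):
    """Fit and subtract a straight line from each full block of n samples;
    a trailing partial block is dropped, mirroring the loop bound length(y)/n."""
    out = []
    for k in range(len(y) // n):
        xs = list(range(k * n, (k + 1) * n))
        ys = y[k * n : (k + 1) * n]
        a, b = fit_line(xs, ys)
        out.extend(v - (a * x + b) for x, v in zip(xs, ys))
    return out

def windowed_variance(y, m):
    """Sample variance (n-1 denominator, as MATLAB's var) per window of m."""
    vs = []
    for k in range(len(y) // m):
        w = y[k * m : (k + 1) * m]
        mu = sum(w) / m
        vs.append(sum((v - mu) ** 2 for v in w) / (m - 1))
    return vs
```

If the interpretation is right, a perfectly linear block detrends to all zeros, so its windowed variances come out zero as well.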
9 Comments
per isakson
on 23 Sep 2019
I added a missing end as the last line and fixed the indentation.
It's tough to analyse the code without running it.
dpb
on 23 Sep 2019
I thought I went through this at some length a week or two ago???
It wasn't clear why it was written as it was--seemed a very roundabout way to do it in all likelihood.
I didn't dig through the entire morass to be absolutely certain regarding the minute details as there wasn't any data supplied to check it out with easily. As Per says, given the convoluted way it was written it would be easier to actually test what it's doing with some known data...the size is immaterial in figuring out the algorithm--400 points are just as good as 400,000.
Guillaume
on 23 Sep 2019
I'm struggling with understanding the code below
Best is to go back to the original author, complain loudly that there is no comment in the code and ask them to explain it.
I gave up after reading this:
y = {y};
%...
for i = 1:length(y)
y is purposefully created as a 1 element cell array, hence length(y) is guaranteed to be 1 and the for is pointless. So not only is the code not documented but it's most likely wrong.
John D'Errico
on 23 Sep 2019
Ugh. Nasty miserable code, written by someone who forgot (more likely, never learned) that comments are legal. In fact, comment lines are free!
Personally, I'd decide that whatever it does, you would be better off writing code from scratch. Certainly safer, as there is no assurance that you could ever debug this mess in reasonable time if there was a problem. And how would you know if there was a problem anyway?
dpb
on 23 Sep 2019
@G -- From the previous time I looked into the whole mess for OP I'm pretty sure I determined the above is a do-nothing loop that has no bearing on anything else later on.
@J -- OP did share there was one comment at the beginning of the code -- that it basically was thrown together -- but not one that actually described the computations. :)
Guillaume
on 23 Sep 2019
I'm with John, this code should be discarded and new, documented code should be written. There are too many issues with it. e.g.:
- it assumes that the input length is a multiple of 400,000. If it's not, it will error.
- It uses some very strange and pointless constructs such as
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
which is simply:
dd{ii} = yfit{ii}(:);
- it dedicates a whole loop to just preallocating arrays, but the next loop entirely discards the preallocated arrays and replaces them with new ones.
- all of the loops are a complete waste of time since all of them iterate over just one value.
- probably more...
EL
on 23 Sep 2019
dpb
on 24 Sep 2019
...
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
...
We talked about this issue previously I think, too...???
Use fullfile() to build qualified filenames without specifying the system-specific delimiter character; it will automagically recognize which system you're running on and do the right thing for you.
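The same portability idea, shown in Python with `os.path.join` purely for illustration (the date string here is a hypothetical stand-in for DataDate):

```python
import os.path

# Like MATLAB's fullfile(): the platform's own separator is inserted
# between the pieces, so no hard-coded '/' or '\' is needed.
directory = os.path.join('Z:Data', '2019-09-23', 'Chopped')
```

Either way, the system-dependent delimiter never appears in the source.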
%% This selects all .txt files in the folder.
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
...
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
...
is just
rootfolder = 'Z:Data';
Chopped = 'Chopped';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory=fullfile(rootfolder,DataDate,Chopped);
FileNames=dir(fullfile(Directory,'*.txt'));
...
for FilesToRead=1:length(FileNames)
y=load(fullfile(Directory,FileNames(FilesToRead).name));
...
and will include only those with the .txt extension and have the system-dependency handled automatically.
I don't think there's any point in then turning y into a cell array instead of just operating on the resulting vector as native double; that alone should speed things up noticeably as well as reduce memory footprint.
Then, as G has noted (and I did in our previous conversations as well), the loop and initial x is completely superfluous and a waste of time and memory.
Also as I noted before, one could then compute the statistics on N elements of the vector piecewise by reshape() to NxM columns, and use the inbuilt MATLAB facility of builtin routines like mean and var to compute over columns by default, or iterate over the M columns for those which haven't been vectorized such as polyfit. At that point you may find it convenient to store the coefficients in a cell array since there are two per column, but it isn't clear there's any need to actually save them at all rather than just process each column (section) and move on to the next. Doing it this way, whether the intermediates are saved or not, would eliminate almost all the convoluted indexing that makes so much of the code so difficult to parse.
Once that is done, working through the rest of the machinations should be relatively straightforward with the same idea, although I still contend it would be far simpler to start with a much smaller dataset, N=400 or somesuch, so you don't get overwhelmed by the sheer size and can see results quickly and on screen instead of being inundated by the mass of numbers that obfuscates the underlying problem. Once you've done that, you can go back and verify your revised code produces the same results for a real data set over that time length.
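The reshape-to-columns idea can be illustrated in Python (not the original MATLAB; the helper names are made up, and the chunk length 4 stands in for the real N):

```python
def as_columns(y, n):
    """Split y into consecutive chunks of n samples -- the 'columns' of a
    MATLAB reshape(y, n, []); any trailing partial chunk is discarded."""
    return [y[k * n : (k + 1) * n] for k in range(len(y) // n)]

def sample_var(w):
    """Sample variance (n-1 denominator), matching MATLAB's var default."""
    mu = sum(w) / len(w)
    return sum((v - mu) ** 2 for v in w) / (len(w) - 1)

# One statistic per column replaces all the n*(k-1)+1 : k*n index arithmetic.
chunks = as_columns([0.0, 2.0, 4.0, 6.0, 1.0, 1.0, 1.0, 1.0], 4)
variances = [sample_var(c) for c in chunks]
```

The point is that once the data is cut into fixed-size columns, each per-section statistic is a single call per column rather than hand-built index ranges.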
Again, it would also be wise to recode so that these magic numbers aren't buried in the bowels of the code but are easily set and modifiable so the code could be used generically if conditions do change in future.
More Answers (1)
"If you set the directory to the three fiels linked, you'll process them in about 30 seconds."
I did set a breakpoint in the debugger and ran the first of the three to the point of computing dd2.
As we already knew, the result of
y=load('x000.txt');
ypr=detrend(y); % use the Signal Processing TB detrend function
resulted in
K>> dd2{1}(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>> ypr(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>>
which is the same result in one line of code instead of the 50 or so in your function, and runs in a fraction of a second. And, in fact,
K>> max(abs(dd2{1}-ypr))
ans =
2.8449e-16
K>>
shows the difference in the result by different computation method is within machine precision of a double precision variable so we've verified that's what the gosh-awful mess actually does.
So, all you now need is to compute the variance of the detrended time series over the time length of choice -- as G shows and I recommended, the easiest way to do this is to reshape() the vector to an Nx[] array and let MATLAB's column-wise operation compute the values for each column in one call.
It appears the only thing left is the computation of the tm array and then plotting the results.
All the calculations could be in a loop of 20 lines at the absolute outside; more like 10 or even fewer seems probable.
Wrap that in an outer loop that opens the big file and reads the desired number of records (400K or whatever were chosen) and computes each section.
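A rough sketch of that condensed pipeline, in Python for illustration only (the 20,000 samples/second rate and the block/window roles come from the thread; the tiny sizes and the function name are made up, standing in for 400,000 and 100,000):

```python
FS = 20000   # samples per second (from the thread)
BLOCK = 8    # detrend block length; 400,000 in the real code
WIN = 4      # variance window; 100,000 in the real code

def line_fit(xs, ys):
    """Least-squares slope a and intercept b of a straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def process(series):
    # 1) remove a straight line from each full block
    resid = []
    for k in range(len(series) // BLOCK):
        xs = range(k * BLOCK, (k + 1) * BLOCK)
        ys = series[k * BLOCK : (k + 1) * BLOCK]
        a, b = line_fit(list(xs), ys)
        resid.extend(y - (a * x + b) for x, y in zip(xs, ys))
    # 2) sample variance (n-1 denominator, like MATLAB var) per window
    vs = []
    for k in range(len(resid) // WIN):
        w = resid[k * WIN : (k + 1) * WIN]
        mu = sum(w) / WIN
        vs.append(sum((v - mu) ** 2 for v in w) / (WIN - 1))
    # 3) window index -> time in minutes, as tm does: k*(WIN/FS)/60
    tm = [(k + 1) * (WIN / FS) / 60 for k in range(len(vs))]
    return vs, tm
```

The outer file loop would then just call something like process() on each record it reads.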
4 Comments
dpb
on 25 Sep 2019
[EL Answer moved to comment...dpb]
How would I be able to append data to the same variable? This is an issue I had with saving data on the recent modifications I was trying to make, and the '-append' option under save overwrites existing variables rather than truly appending them. I could save the files in csv format, but I feel that .mat files would be best. Does it matter?
dpb
on 25 Sep 2019
We can't answer what matters for your application--that's entirely up to your needs or other requirements.
What size of files are you actually needing to keep and for what purposes? Frankly, for large files if one intends to continue to add to them, stream files would be the most efficient by far in both speed and size.
Guillaume
on 25 Sep 2019
@EL, were you the one asking the same question on discord?
If so, then start a new question (here on answers) for just that, explaining in a lot more details what you want to save, to what kind of file (mat, text, excel, binary, whatever), and why it is incremental.
Note that if your input data files are huge, then using datastore and tall arrays would probably be a good idea. Again, this requires discarding the current code.
EL
on 25 Sep 2019