When is it better to use a multi-level-struct than a table?
I am processing data logged in ~4000 text files. I initially read the data into a multi-level structure because the hierarchical nature seemed to make more sense for how I collected the data: 5 different configurations, each tested at 20 different positions, with each position containing 40 angles. Each angle is a separate experiment with environmental parameters (time, temp, speed) and 42 data channels, each having a mean or RMS value, a tare value, and a standard deviation (I can calculate these as I read the files and then store only scalars in the struct).
I abandoned the struct because reading the data back out was too burdensome. For instance, if I want to plot data channel 5 mean against data channel 1 mean for a certain angle (say 10deg) at all locations of one config, I thought I would use something like:
% pseudo code just for illustration, haven't tried, wouldn't work
x = [data.config(3).pos(:).ang(10).chan(5).mean];
y = [data.config(3).pos(:).ang(10).chan(1).mean];
plot(x,y)
But what I learned is that you cannot address more than one level of a struct at a time. Instead, you must run a series of nested loops, one for every criterion you want to query by, each moving its contents into a temporary variable for the next loop to operate on.
With a table, on the other hand, I can store everything in one large flat table where each row is an angle (one row for each experiment) and just have a ton of columns. The downside to this in my mind is that the table now contains so much more repetitive data. For instance, the struct could parent all of the sub-structs back to one of the five configurations, but the table must have ~4000 extra cells so that each row knows what config it is a member of. The upside is that querying out data is much simpler, e.g.:
% also example code which I haven't tried, may not be correct
x = data.mean(data.config==3 & data.ang==10 & data.chan==5);
y = data.mean(data.config==3 & data.ang==10 & data.chan==1);
plot(x,y)
So I am guessing it is a matter of preference, but going through all of this is making me wonder: when and why do you choose a multi-level struct over a table, and are there other, even better options?
6 Comments
"why do you choose a multi-level struct over a table..."
Almost never.
"...and are there other even better options?"
Tables.
You are over-thinking this. One of the main goals of good data design includes making processing data easier: is your goal a) to process the data and get some results or b) to fight your data with lots and lots of nested loops? Once you decide on nested loops then you have painted yourself into that corner...
"The downside to this in my mind is that the table now contains so much more repetitive data."
The downside seems to be only in your mind: did you run out of memory using a table? What exactly is preventing you from using a table?
Consider the actual concrete benefits: use tables and you will open up many MATLAB tools that simplify data importing, processing, and exporting your results. On top of that you make it much easier to explore your data (a very important part of any researcher's work), e.g. notice some trend then plot it, or use some other grouping criteria, or do whatever statistical magic on some other variables than you first considered during the planning phase... easy with a table, a real pain when you are fighting your data using nested loops.
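As a rough illustration of that kind of exploration, here is a minimal sketch. The table layout (one row per config/pos/ang/chan combination), the column name meanVal, and the random values are all assumptions made up for the example, not the OP's real data.

```matlab
% Hypothetical long-format table: one row per (config, pos, ang, chan).
[config, pos, ang, chan] = ndgrid(1:2, 1:3, 1:4, 1:2);
data = table(config(:), pos(:), ang(:), chan(:), ...
    'VariableNames', {'config','pos','ang','chan'});
rng(0);                                   % reproducible placeholder values
data.meanVal = rand(height(data), 1);     % stand-in for the channel means

% Average channel 1 over all positions, for every config/angle pair.
ch1 = data(data.chan == 1, :);
grpStats = groupsummary(ch1, {'config','ang'}, 'mean', 'meanVal');
disp(grpStats)
```

Changing the grouping variables or the summary method is a one-line edit, which is the kind of after-the-fact exploration that is awkward with nested loops.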
The data is repetitive because it is repetitive: when lots of permutations of test parameters/conditions/whatever are used for tests then naturally each individual testcase includes parameter values that also occur in other testcases. That is normal and expected.
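To put a number on the cost of that repetition, the sketch below measures one repeated config column. The row count of 5*20*40 comes from the question; storing the label as uint8 instead of the default double is just one possible choice.

```matlab
% Cost of repeating the config label on every row (sizes from the question).
nRows = 5*20*40;                                 % one row per experiment
configDouble = repelem((1:5)', nRows/5);         % default double storage
configSmall  = repelem(uint8(1:5)', nRows/5);    % compact integer storage
infoDouble = whos('configDouble');
infoSmall  = whos('configSmall');
fprintf('double: %d bytes, uint8: %d bytes\n', ...
    infoDouble.bytes, infoSmall.bytes);
```

Either way the column costs a few kilobytes, which is unlikely to matter next to the measurements themselves.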
"But what I learned is that you cannot address more than one level of a struct at a time"
Structures do not have "levels": what you have are lots of separate, nested structures.
Even though you might think of those data in some kind of higher-dimensional space or as having some kind of abstract tree structure, that does not mean that a tree structure is the best way to store or process that data using a computer. A computer is not an abstract machine: it stores lots of numbers in long lists (of various flavors): in general the closer you get to lists, the better. That is rather the art of good data design (which in turn leads to much better code design): trying to map an abstract concept onto something concrete that a computer can efficiently process.
Tables are perfect for one test-case per line. That simple rule of thumb works for a lot of data design. Break it only when needed.
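A minimal sketch of that layout applied to the OP's query might look like the following. It assumes a long format with one row per config/pos/ang/chan combination, a made-up column name meanVal, and placeholder values instead of real measurements.

```matlab
% Hypothetical long-format table: one row per (config, pos, ang, chan).
[chan, ang, pos, config] = ndgrid(1:42, 1:40, 1:20, 1:5);
data = table(config(:), pos(:), ang(:), chan(:), ...
    'VariableNames', {'config','pos','ang','chan'});
data.meanVal = data.pos + data.chan/100;   % placeholder values, not real data

% Channel 5 mean against channel 1 mean at angle index 10,
% for all positions of config 3.
sel = data.config == 3 & data.ang == 10;
x = data.meanVal(sel & data.chan == 5);
y = data.meanVal(sel & data.chan == 1);
plot(x, y, 'o-')
```

Here x and y pair up because the rows happen to be ordered by position. With real data, calling sortrows on the position column first would be a safer way to guarantee that.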
Walter Roberson
on 23 Oct 2023
"But what I learned is that you cannot address more than one level of a struct at a time"
You can use getfield.
Not in the way that the user shows (note that the OP stated that field POS has "20 different positions"). GETFIELD is not a general solution that magically flattens intermediate multiple elements of nested structures. In fact, the GETFIELD code has always included a code comment that specifically excludes it being used in that way: "% Always return first element (even for comma separated list result)".
For every non-scalar nested structure GETFIELD by design only returns data from the first nested structure element. So its use would require as many loops as the user has non-scalar indices for any of the nested structures.
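For the specific query in the question only POS is non-scalar, so one loop is enough. The sketch below hides that loop inside ARRAYFUN rather than removing it. The struct contents are placeholder values invented for the example, with field names taken from the question.

```matlab
% Hypothetical nested struct with the field names from the question.
for p = 1:4
    for c = 1:5
        data.config(3).pos(p).ang(10).chan(c).mean = 100*p + c;
    end
end

% One implicit loop over the only non-scalar level (pos).
x = arrayfun(@(s) s.ang(10).chan(5).mean, data.config(3).pos);
y = arrayfun(@(s) s.ang(10).chan(1).mean, data.config(3).pos);
plot(x, y, 'o-')
```

Every additional non-scalar level in a query would need another nested ARRAYFUN or loop, so this does not scale the way a logical index into a table does.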
Walter Roberson
on 23 Oct 2023
If you have S(J).A(K).B(L) and you are doing sweeps over J K L, then you have your choice of implementations:
% version 1: peel off one nested structure per loop level
for J = 1 : limitJ
    SJ = S(J);
    for K = 1 : limitK
        SJAK = SJ.A(K);
        for L = 1 : limitL
            Value = SJAK.B(L);
            do_something(Value);
        end
    end
end
% version 2: index the full path inside the innermost loop
for J = 1 : limitJ
    for K = 1 : limitK
        for L = 1 : limitL
            Value = S(J).A(K).B(L);
            do_something(Value);
        end
    end
end
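% version 3: nested arrayfun calls (getfield takes its indices as cell arrays)
arrayfun(@(J) arrayfun(@(K) arrayfun(@(L) ...
    do_something(getfield(S, {J}, 'A', {K}, 'B', {L})), ...
    1:limitL, 'UniformOutput', false), ...
    1:limitK, 'UniformOutput', false), ...
    1:limitJ, 'UniformOutput', false)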
and there would be another arrayfun version that iterates over structure members that just isn't coming to mind at the moment but I am sure is possible.
So getfield() is one of the options that does not require creating temporary variables (other than internally)
"If you have S(J).A(K).B(L) and you are doing sweeps over J K L..."
What relevance does that have to the specific example given by the OP? Not much.
"there would be another arrayfun version that iterates over structure members that just isn't coming to mind at the moment but I am sure is possible"
It is possible if you pass scalar structures as the function inputs. But warning: tectonic plates move much faster.
(hint: that approach is the partner to version 1, just like version 3 is the partner to version 2)
"So getfield() is one of the options that does not require creating temporary variables (other than internally)"
And yet... it is not really an option. None of those "versions" actually deliver what the OP requires: the numeric vectors x and y (for plotting, as the OP clearly states).
Versions 1 & 2 are the nested loops the OP already knows about. Version 3 (very slowly) creates nested cell arrays inside nested cell arrays inside another cell array. Flattening multiply nested cell arrays (to get the numeric vectors x & y, which are what the OP needs) requires either multiple comma-separated lists (with associated temporary variables) or more nested loops or recursion... or some other even worse kind of horror. So you are right back to square one.
@cdlapoin: these examples should make it quite clear why you should be using tables.
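If the data already sits in a nested struct, one way to move to a table is to pay for the nested loops once and flatten everything. This is only a sketch: the sizes are tiny, the values are placeholders, and the column name meanVal is made up for the example.

```matlab
% Hypothetical nested struct (tiny sizes, placeholder values).
nCfg = 2; nPos = 3; nAng = 2; nChan = 2;
for c = 1:nCfg
    for p = 1:nPos
        for a = 1:nAng
            for k = 1:nChan
                S.config(c).pos(p).ang(a).chan(k).mean = 1000*c + 100*p + 10*a + k;
            end
        end
    end
end

% Flatten once: one row per (config, pos, ang, chan).
rows = zeros(nCfg*nPos*nAng*nChan, 5);
r = 0;
for c = 1:nCfg
    for p = 1:nPos
        for a = 1:nAng
            for k = 1:nChan
                r = r + 1;
                rows(r, :) = [c, p, a, k, S.config(c).pos(p).ang(a).chan(k).mean];
            end
        end
    end
end
data = array2table(rows, ...
    'VariableNames', {'config','pos','ang','chan','meanVal'});
```

After that, every later query is a logical index on the table columns rather than another set of loops.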
cdlapoin
on 23 Oct 2023