When is it better to use a multi-level-struct than a table?

I am processing data logged in ~4000 text files. I initially read the data into a multi-level structure because the hierarchical nature seemed to make more sense for how I collected the data: 5 different configurations, each tested at 20 different positions, with each position containing 40 angles, each angle being a separate experiment with environmental parameters (time, temp, speed) and 42 data channels, each having a mean or RMS value, a tare value, and a standard deviation (I can calculate these as I read the files and then store only scalars in the struct).
I abandoned the struct because reading the data back out was too burdensome. For instance, if I want to plot data channel 5 mean against data channel 1 mean for a certain angle (say 10deg) at all locations of one config, I thought I would use something like:
% pseudo code just for illustration, haven't tried, wouldn't work
x = [data.config(3).pos(:).ang(10).chan(5).mean];
y = [data.config(3).pos(:).ang(10).chan(1).mean];
plot(x,y)
But what I learned is that you cannot address more than one level of a struct at a time. Instead, you must run a series of nested loops, one for each criterion you want to query by, and move its contents into a temporary variable for the next loop to operate on.
With a table, on the other hand, I can store everything in one large flat table where each row is an angle (one row for each experiment) and just have a ton of columns. The downside to this in my mind is that the table now contains so much more repetitive data. For instance, the struct could parent all of the sub-structs back to one of the five configurations, but the table must have ~4000 extra cells so that each row knows what config it is a member of. The upside is that querying out data is much simpler, e.g.:
% also example code which I haven't tried, may not be correct
x = data.mean(config==3 & ang==10 & chan==5);
y = data.mean(config==3 & ang==10 & chan==1);
plot(x,y)
So I am guessing it is a matter of preference, but going through all of this is making me wonder: when and why do you choose a multi-level struct over a table, and are there other even better options?

6 Comments

"why do you choose a multi-level struct over a table..."
Almost never.
"...and are there other even better options?"
Tables.
You are over-thinking this. One of the main goals of good data design includes making processing data easier: is your goal a) to process the data and get some results or b) to fight your data with lots and lots of nested loops? Once you decide on nested loops then you have painted yourself into that corner...
"The downside to this in my mind is that the table now contains so much more repetitive data."
The downside seems to be only in your mind: did you run out of memory using a table? What exactly is preventing you from using a table?
Consider the actual concrete benefits: use tables and you will open up many MATLAB tools that simplify data importing, processing, and exporting your results. On top of that you make it much easier to explore your data (a very important part of any researcher's work), e.g. notice some trend then plot it, or use some other grouping criteria, or do whatever statistical magic on some other variables than you first considered during the planning phase... easy with a table, a real pain when you are fighting your data using nested loops.
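As a minimal sketch of the kind of exploration described above, the following groups a toy table by one or more variables using groupsummary. The table contents and the variable names (config, ang, ch1_mean) are made up for illustration and are not the OP's actual data.

```matlab
% Toy table: 5 configs with 4 rows each; all values are placeholders.
config   = repelem((1:5)', 4);
ang      = repmat([0; 0; 10; 10], 5, 1);
ch1_mean = (1:20)';
T = table(config, ang, ch1_mean);

% Mean of ch1_mean per config; the result variable is named mean_ch1_mean.
G = groupsummary(T, 'config', 'mean', 'ch1_mean');

% Changing the grouping criteria is a one-line change.
G2 = groupsummary(T, {'config', 'ang'}, 'mean', 'ch1_mean');
```

Switching to a different grouping only changes the second argument; no loop structure has to be rewritten.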
The data is repetitive because it is repetitive: when lots of permutations of test parameters/conditions/whatever are used for tests then naturally each individual testcase includes parameter values that also occur in other testcases. That is normal and expected.
"But what I learned is that you cannot address more than one level of a struct at a time"
Structures do not have "levels": what you have are lots of separate, nested structures.
Even though you might think of those data in some kind of higher-dimensional space or as having some kind of abstract tree structure, that does not mean that a tree structure is the best way to store or process that data using a computer. A computer is not an abstract machine: it stores lots of numbers in long lists (of various flavors): in general the closer you get to lists, the better. That is rather the art of good data design (which in turn leads to much better code design): trying to map an abstract concept onto something concrete that a computer can efficiently process.
Tables are perfect for one test-case per line. That simple rule of thumb works for a lot of data design. Break it only when needed.
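A minimal sketch of the one-test-case-per-row layout, assuming a wide table with one row per experiment (config x position x angle) and one column per channel statistic. The column names ch1_mean and ch5_mean are hypothetical, and the channel values are random placeholders rather than real measurements.

```matlab
% One row per experiment: 5 configs x 20 positions x 40 angles = 4000 rows.
[ang, pos, config] = ndgrid(1:40, 1:20, 1:5);
T = table(config(:), pos(:), ang(:), 'VariableNames', {'config', 'pos', 'ang'});

% Placeholder channel means; real values would come from the log files.
T.ch1_mean = rand(height(T), 1);
T.ch5_mean = rand(height(T), 1);

% Select config 3 at angle index 10, across all positions.
rows = T.config == 3 & T.ang == 10;
x = T.ch5_mean(rows);
y = T.ch1_mean(rows);
plot(x, y, 'o')
```

The logical mask replaces the nested loops: adding another criterion means adding another `&` term.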
"But what I learned is that you cannot address more than one level of a struct at a time"
You can use getfield
"You can use getfield"
Not in the way that the user shows (note that the OP stated that field POS has "20 different positions"). GETFIELD is not a general solution that magically flattens intermediate multiple elements of nested structures. In fact the GETFIELD code has always included a code comment that specifically excludes it being used in that way: "% Always return first element (even for comma separated list result)".
For every non-scalar nested structure GETFIELD by design only returns data from the first nested structure element. So its use would require as many loops as the user has non-scalar indices for any of the nested structures.
If you have S(J).A(K).B(L) and you are doing sweeps over J K L, then you have your choice of implementations:
% version 1
for J = 1 : limitJ
    SJ = S(J);
    for K = 1 : limitK
        SJAK = SJ.A(K);
        for L = 1 : limitL
            Value = SJAK.B(L);
            do_something(Value);
        end
    end
end
% version 2
for J = 1 : limitJ
    for K = 1 : limitK
        for L = 1 : limitL
            Value = S(J).A(K).B(L);
            do_something(Value);
        end
    end
end
%version 3
arrayfun(@(J) arrayfun(@(K) arrayfun(@(L) do_something(getfield(S, J, 'A', K, 'B', L)), 1:limitL, 'un', 0), 1:limitK, 'un', 0), 1:limitJ, 'un', 0)
There would also be another arrayfun version that iterates over structure members; it just isn't coming to mind at the moment, but I am sure it is possible.
So getfield() is one of the options that does not require creating temporary variables (other than internally).
"If you have S(J).A(K).B(L) and you are doing sweeps over J K L..."
What relevance does that have to the specific example given by the OP? Not much.
"there would be another arrayfun version that iterates over structure members that just isn't coming to mind at the moment but I am sure is possible"
It is possible if you pass scalar structures as the function inputs. But warning: tectonic plates move much faster.
(hint: that approach is the partner to version 1, just like version 3 is the partner to version 2)
"So getfield() is one of the options that does not require creating temporary variables (other than internally)"
And yet... it is not really an option. None of those "versions" actually deliver what the OP requires: the numeric vectors x and y (for plotting, as the OP clearly states).
Versions 1 & 2 are the nested loops the OP already knows about. Version 3 (very slowly) creates nested cell arrays inside nested cell arrays inside another cell array. Flattening multiply nested cell arrays (to get the numeric vectors x & y, which are what the OP needs) requires either multiple comma-separated lists (with associated temporary variables) or more nested loops or recursion... or some other even worse kind of horror. So you are right back to square one.
@cdlapoin: these examples should make it quite clear why you should be using tables.
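If the data has already been read into nested structures, one option is to pay for the nested loops once and flatten everything into a table. The following is a sketch only: it builds a tiny structure in the shape the OP describes (sizes reduced, values made up) and then converts it. The table variable names are hypothetical.

```matlab
% Build a small nested structure in the OP's shape (placeholder values).
nC = 2; nP = 3; nA = 4;
for c = 1:nC
    for p = 1:nP
        for a = 1:nA
            data.config(c).pos(p).ang(a).chan(1).mean = c + p/10 + a/100;
            data.config(c).pos(p).ang(a).chan(5).mean = 10*c + p + a/10;
        end
    end
end

% Flatten once: one table row per experiment.
nRows = nC * nP * nA;
config = zeros(nRows, 1); pos = zeros(nRows, 1); ang = zeros(nRows, 1);
ch1_mean = zeros(nRows, 1); ch5_mean = zeros(nRows, 1);
r = 0;
for c = 1:numel(data.config)
    for p = 1:numel(data.config(c).pos)
        for a = 1:numel(data.config(c).pos(p).ang)
            r = r + 1;
            e = data.config(c).pos(p).ang(a);   % scalar struct for one experiment
            config(r) = c; pos(r) = p; ang(r) = a;
            ch1_mean(r) = e.chan(1).mean;
            ch5_mean(r) = e.chan(5).mean;
        end
    end
end
T = table(config, pos, ang, ch1_mean, ch5_mean);

% Every later query is a logical mask, with no further loops.
rows = T.config == 2 & T.ang == 3;
x = T.ch5_mean(rows);
```

After the conversion the loops never need to be written again, which is the practical difference between the two layouts.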
Stephen, I see your point and it's well taken. A couple thousand duplicate values is not really a problem if the performance is fine, and my datasets are not so large that a small performance hit would be much of a problem anyway.


 Accepted Answer

We are discussing in https://www.mathworks.com/matlabcentral/answers/556024-what-frustrates-you-about-matlab-2#answer_1337061 why row-by-row access to a table can be much slower than some of the alternatives. A lot is going to depend on how you use the data after it has been put into the data structure.
If all of the data is numeric, using a numeric array will typically be fastest... but again it depends on the data access patterns. Sometimes cell arrays are faster, as recently explored in https://www.mathworks.com/matlabcentral/answers/2035921-access-time-of-data-in-cell-array-vs-matrix#answer_1336881

2 Comments

I see, so I could just hold all the same data in a flat matrix and keep track of my column names separately, and that might be faster in some cases, but in most cases the difference would be negligible (that is my reading of the linked topic anyway).
I'm not really hearing that there is ever a time where the nested data structures would be the better option.
Just using tables from now on may be what I go with then. I like the readability of calling variables by name, and I like having the workspace kept clean by storing those variables within the table.
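For comparison, the flat-matrix alternative mentioned above could look roughly like this. It is a sketch under assumed names: colNames and the col helper are hypothetical, and the sizes and values are toy placeholders.

```matlab
% Column names are tracked separately from the numeric matrix.
colNames = {'config', 'pos', 'ang', 'ch1_mean', 'ch5_mean'};
[ang, pos, config] = ndgrid(1:4, 1:3, 1:2);       % 24 toy experiments
M = [config(:), pos(:), ang(:), zeros(24, 2)];
M(:, 4) = M(:, 1) + M(:, 2) / 10;                 % placeholder ch1_mean
M(:, 5) = 10 * M(:, 1) + M(:, 2);                 % placeholder ch5_mean

% Hypothetical helper: look up a column index by name.
col = @(name) find(strcmp(colNames, name));

rows = M(:, col('config')) == 2 & M(:, col('ang')) == 3;
x = M(rows, col('ch5_mean'));

% The same data can be turned into a table at any point.
T = array2table(M, 'VariableNames', colNames);
```

The matrix version needs the name lookup on every access, which is the readability cost being weighed against any speed gain.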
Use tables.
Most likely you will spend far more time writing, debugging, and maintaining your code than your code will spend running. Therefore making sure that your data and code are clear and correct is of the utmost importance, and will save you time overall. Tables are a great way to achieve that clarity.
"I'm not really hearing that there is ever a time where the nested data structures would be the better option."
Something like this would be difficult without nested structures or a similar data type:
It implements a trie (https://en.wikipedia.org/wiki/Trie) using actual MATLAB (not low-level) code.


Release

R2023b

Asked:

on 23 Oct 2023

Edited:

on 24 Oct 2023
