- Generate bootstrap samples at the subject level, i.e. sample subjects with replacement.
- Collect all observations for the selected subjects to form the training set for each tree.
- Train a single tree on that training set using "fitctree" or similar.
- Repeat the above to train as many trees as needed.
- Aggregate the predictions manually, e.g. by majority vote for classification.
Random Forest with paired observations: how to maintain subject separation
When using classifiers like SVM, I keep observations from each subject together by using a custom cross-validation partition. A random forest uses bootstrap aggregation instead of cross-validation, so I need a way of telling it to keep each subject's observations together, i.e. a subject has to be either fully in or fully out of the bag, not some observations in and some out.
I can write code to generate the bootstrapped data that TreeBagger would use for each tree, analogous to a custom CVPartition, but there seems to be no way of passing this to TreeBagger. How does one achieve this in MATLAB?
(I do realise that one solution for keeping subjects together is to run cross-validation on top of bagging, but that shouldn't be necessary and it greatly slows the whole process down; e.g. 10-fold CV would be expected to take ten times as long. I could also roll the whole random forest process manually, but then I wouldn't have a TreeBagger object that I can pass to other functions, etc.)
% Standard call: bags at the observation level, with no way to pass
% subject-grouped bootstrap indices
rf = TreeBagger(numTrees, X, Y, ...
    'Method', 'classification', ...
    'OOBPrediction', 'on', ...
    'NumPredictorsToSample', mtry, ...
    'MinLeafSize', 3);
Answers (1)
Sameer
on 8 Jul 2025
Hi @Leon
"TreeBagger" in MATLAB performs standard bootstrap aggregation at the observation level; it doesn't natively support grouped sampling (e.g. sampling by subject rather than by individual observation).
Although "TreeBagger" doesn't expose an option to specify custom bootstrap indices directly (the way "cvpartition" does for cross-validation), one common workaround is to implement subject-level bagging manually: train individual decision trees on subject-level bootstrap samples, then aggregate their predictions to mimic "TreeBagger".
Here's the general approach:
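As a minimal sketch (assuming X, Y, numTrees and mtry as in your question, plus a per-row subjectID vector; all other names are placeholders):

subjects  = unique(subjectID);          % one entry per subject
nSubjects = numel(subjects);
trees  = cell(numTrees, 1);
oobIdx = cell(numTrees, 1);             % out-of-bag rows per tree

for t = 1:numTrees
    % Bootstrap at the subject level: draw subjects with replacement
    drawn = subjects(randi(nSubjects, nSubjects, 1));

    % Collect all rows of every drawn subject; a subject drawn twice
    % contributes its rows twice, mirroring ordinary bootstrapping
    trainIdx = [];
    for s = drawn'                      % numeric IDs; use strcmp for cellstr IDs
        trainIdx = [trainIdx; find(subjectID == s)]; %#ok<AGROW>
    end

    % One tree per bootstrap sample, with random predictor sampling
    trees{t} = fitctree(X(trainIdx, :), Y(trainIdx), ...
        'NumVariablesToSample', mtry, 'MinLeafSize', 3);

    % Rows whose subject was never drawn are out of bag for this tree
    oobIdx{t} = find(~ismember(subjectID, drawn));
end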
With this you can build a custom structure or class to manage the trained trees and mimic the interface for prediction and out-of-bag estimation.
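For example, prediction by majority vote might look like this (a sketch, assuming a cell array trees of manually trained decision trees and new data Xnew):

% Majority-vote aggregation over manually trained trees
numTrees = numel(trees);
votes = cell(size(Xnew, 1), numTrees);
for t = 1:numTrees
    % Normalise labels to cellstr so numeric, categorical and cellstr Y all work
    votes(:, t) = cellstr(string(predict(trees{t}, Xnew)));
end
predicted = mode(categorical(votes), 2);   % row-wise majority vote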
Also, if your dataset isn't too large and you're okay with the computational cost, cross-validation with a subject-grouped "cvpartition"-style split is still the most robust option with standard MATLAB tools, though as you mentioned it's slower.
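Since "cvpartition" itself has no subject-grouping option, such folds are usually assigned by hand, for instance (a sketch; K and the variable names are illustrative):

% Assign whole subjects to folds so no subject straddles train and test
K = 10;
subjects = unique(subjectID);
foldOf = mod(randperm(numel(subjects)), K) + 1;   % random fold in 1..K per subject

for k = 1:K
    testRows = ismember(subjectID, subjects(foldOf == k));
    mdl = TreeBagger(numTrees, X(~testRows, :), Y(~testRows), ...
        'Method', 'classification');
    predY = predict(mdl, X(testRows, :));
    % ...compare predY with Y(testRows) to score fold k
end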
Hope this helps!