Random Forest with paired observations: how to maintain subject separation

When using classifiers like SVM, I keep observations from each subject together by using a custom cross-validation partition. Random forest uses bootstrap aggregation instead of cross-validation, so I need a way of telling it to keep each subject's observations together: i.e. a subject has to be either fully in or fully out of the bag, not some observations in and some out.
I can write code to generate the bootstrapped data that TreeBagger would use for each tree, analogous to a custom cvpartition, but there seems to be no way of passing this to TreeBagger. How does one achieve this in MATLAB?
(I do realise that one solution to keep subjects together is to use cross-validation on top of bagging, but that shouldn't be necessary and greatly slows the whole process down, e.g. 10-fold CV would be expected to take ten times as long. I could also manually roll the whole random forest process, but then I don't have a TreeBagger object that I can pass to other functions, etc.)
rf = TreeBagger(numTrees, X, Y, ...
    'Method', 'classification', ...
    'OOBPrediction', 'on', ...
    'NumPredictorsToSample', mtry, ...
    'MinLeafSize', 3);

Answers (1)

Sameer on 8 Jul 2025
"TreeBagger" in MATLAB performs standard bootstrap aggregation at the observation level, and it doesn't natively support grouped sampling (e.g., sampling by subject rather than by individual observation).
Although "TreeBagger" doesn't expose an option to directly specify custom bootstrap indices (the way "cvpartition" does for cross-validation), one common workaround is to implement "subject-level bagging" manually: train individual decision trees on subject-level bootstrap samples and aggregate them to mimic "TreeBagger".
Here's the general approach:
  1. "Generate bootstrap samples at the subject level" — i.e., sample subjects with replacement.
  2. "Collect all observations for the selected subjects" to form the training set for each tree.
  3. "Train a single tree using 'fitctree'" or similar on that training set.
  4. "Repeat" the above to train as many trees as needed.
  5. "Aggregate the predictions manually" — e.g., by majority vote for classification.
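The five steps above can be sketched roughly as follows. This is a minimal illustration, not tested code; it assumes hypothetical variables X (predictor matrix), Y (labels), subjID (per-observation subject identifier), Xnew (data to predict on), and reuses numTrees and mtry from the question:

```matlab
subjects = unique(subjID);
nSubj    = numel(subjects);
trees    = cell(numTrees, 1);

for t = 1:numTrees
    % Step 1: sample subjects (not observations) with replacement
    bootSubj = subjects(randi(nSubj, nSubj, 1));
    % Step 2: collect every observation of each sampled subject,
    % duplicating a subject's rows if it is drawn more than once
    idx = [];
    for s = bootSubj.'
        idx = [idx; find(subjID == s)]; %#ok<AGROW>
    end
    % Step 3: train a single tree on this subject-level bootstrap sample
    trees{t} = fitctree(X(idx, :), Y(idx), ...
        'NumVariablesToSample', mtry, 'MinLeafSize', 3);
end  % Step 4: repeat for as many trees as needed

% Step 5: aggregate predictions manually by majority vote
preds = cellfun(@(tr) predict(tr, Xnew), trees, 'UniformOutput', false);
yhat  = mode(categorical([preds{:}]), 2);
```

The aggregation line assumes the per-tree predictions concatenate into one array per row; depending on how Y is stored (categorical, cell array of char, numeric) you may need to adapt that step.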
With this you can build a custom structure or class that manages the trained trees and mimics the "TreeBagger" interface for prediction and out-of-bag estimation.
Also, if your dataset isn't too large and you're okay with the computational cost, using cross-validation with grouped "cvpartition" over subjects is still the most robust option with standard MATLAB tools — though as you mentioned, it’s slower.
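For the grouped cross-validation route, one way to keep subjects intact is to partition the unique subjects rather than the observations, then map each observation to its subject's fold. A rough sketch, again assuming a hypothetical subjID vector:

```matlab
subjects = unique(subjID);
cp = cvpartition(numel(subjects), 'KFold', 10);  % partition subjects, not rows

for k = 1:cp.NumTestSets
    testSubj = subjects(test(cp, k));          % subjects held out in fold k
    testIdx  = ismember(subjID, testSubj);     % all of their observations
    trainIdx = ~testIdx;
    % ... train on X(trainIdx,:), evaluate on X(testIdx,:) ...
end
```

Because the partition is built over subjects, every observation from a given subject lands in the same fold by construction.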
Hope this helps!
  1 Comment
Leon on 8 Jul 2025
Thanks for your input. This is basically what I meant by "I could also manually roll the whole random forest process, but then I don't have a TreeBagger object that I can pass to other functions, etc." It's frustrating that MATLAB doesn't allow you to keep subjects together. The code I have experimented with for doing this manually is much slower than TreeBagger.


Release

R2024a
