the relation between score and probability for ensemble classification
I am dealing with a binary classification problem and I want to get the probability of a prediction using the RUSBoost algorithm. According to the documentation at http://www.mathworks.com/help/stats/compactclassificationensemble.predict.html, the score generated by each tree is the probability of the observation originating from this class, computed as the fraction of observations of this class in a tree leaf, and predict averages these scores over all trees in the ensemble. However, I cannot see that in the scores I get. For the 2 classes, the scores do not sum to 1, and sometimes a score is larger than 1. It seems that the classification is decided by the larger of the two class scores.
So what is the right way to understand the score? How can I infer the probability from the score? Is there any way to do it? Thanks!
Answers (1)
Shubham
on 10 Nov 2023
Hi Hui,
The RUSBoost algorithm combines AdaBoost-style boosting with random undersampling of the majority class. The scores returned for such a boosted ensemble do not directly represent probabilities. Instead, they reflect the strength or confidence of the prediction for each class, which is why they need not sum to 1 across the two classes and can exceed 1.
The scores from the individual trees in the ensemble are combined into a single score per class. In binary classification, the class with the higher score is assigned as the predicted class.
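As a small illustration of that decision rule, here is a sketch in Python. The score values are made up for the example and are not output from any real ensemble; `predict_class` is a hypothetical helper, not a library function.

```python
# Hypothetical per-class scores for three observations
# (columns: class 0, class 1). Values are invented for illustration.
scores = [
    [0.8, 1.9],
    [2.3, 0.4],
    [-0.5, 0.7],
]

def predict_class(row):
    # Return the index of the largest score, i.e. the predicted class.
    return max(range(len(row)), key=lambda k: row[k])

predicted = [predict_class(row) for row in scores]

# The scores in a row are not probabilities: they need not sum to 1.
row_sums = [sum(row) for row in scores]

# For two classes, the sign of the score difference carries the same
# decision as picking the larger score.
margins = [row[1] - row[0] for row in scores]

print("predicted classes:", predicted)
print("row sums:", row_sums)
print("margins (class 1 minus class 0):", margins)
```

The margin, or either class's score on its own, is the kind of single number that a calibration step can later map to a probability.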
To infer probabilities from the scores, you can use Platt scaling, also known as sigmoid calibration. Platt scaling fits a logistic regression model with the ensemble score as the predictor and the true class label as the response. The fitted model then maps any new score to an estimated class probability.
Here's a step-by-step approach to perform Platt scaling:
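- Collect the scores generated by the RUSBoost ensemble for your dataset, ideally for observations that were not used to train the ensemble (for example, held-out or cross-validated scores), so the calibration is not fitted to overly optimistic scores.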
- Split your dataset into a training set and a validation set.
- Fit a logistic regression model using the scores as the predictor variable and the true class labels as the response variable, using the training set.
- Predict the probabilities using the fitted logistic regression model for the validation set.
- Evaluate the performance of the calibrated probabilities using appropriate metrics (e.g., log loss, Brier score) and adjust the model if necessary.
By applying Platt scaling, you can obtain calibrated probabilities that reflect the likelihood of an observation belonging to each class.
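The steps above can be sketched in a few lines of Python. This is only a minimal sketch under several assumptions: the scores are synthetic stand-ins for ensemble output (one score per observation, for the positive class), the logistic model p = 1 / (1 + exp(-(a*s + b))) is fitted by plain gradient descent instead of a library routine, and the names `fit_platt`, `brier_score`, and `log_loss` are hypothetical helpers written for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_platt(scores, labels, lr=0.1, n_iter=3000):
    # Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss.
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def brier_score(probs, labels):
    # Mean squared difference between predicted probability and outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels, eps=1e-12):
    # Mean negative log-likelihood, with clipping to avoid log(0).
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Synthetic data standing in for ensemble scores (an assumption for
# this sketch): positives tend to score higher than negatives.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(400)]
scores = [random.gauss(1.5 if y == 1 else -1.5, 1.5) for y in labels]

# Split into a calibration (training) set and a validation set.
train_s, train_y = scores[:300], labels[:300]
valid_s, valid_y = scores[300:], labels[300:]

# Fit the sigmoid on the training split, then calibrate the validation scores.
a, b = fit_platt(train_s, train_y)
valid_p = [sigmoid(a * s + b) for s in valid_s]

print("a = %.3f, b = %.3f" % (a, b))
print("Brier score:", round(brier_score(valid_p, valid_y), 4))
print("log loss:", round(log_loss(valid_p, valid_y), 4))
```

For comparison, always predicting 0.5 gives a Brier score of 0.25 and a log loss of about 0.693, so calibrated probabilities should score below those values on the validation set. In practice you would replace the synthetic scores with the scores returned by your ensemble and could use a standard logistic regression routine in place of the hand-written fit.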