Credit Scoring Using Logistic Regression and Decision Trees

This example shows how to create and compare two credit scoring models. The workflow includes:

Training both a logistic regression model (base model) and a decision tree model (challenger model) to predict probabilities of default (PDs).
Validating the models by comparing the values of different validation metrics between the challenger model and the base model.

In this example, the base model is a logistic regression model, whereas the challenger model is a decision tree model. This example compares a credit scorecard logistic regression model (using creditscorecard) and decision tree scoring model (using fitctree), and presents a workflow to train the models, compute the PDs, and perform model validation using the risk.validation namespace. The models in this example are straightforward and trained using basic options for illustrative purposes, but you can use this workflow to compare more sophisticated champion and challenger models.

Compute Probabilities of Default Using Logistic Regression

First, create a scoring model using creditscorecard. Apply automatic binning using the autobinning function and use the default autobinning options. You can also use more advanced automatic and manual binning operations to improve the model by using the Binning Explorer. To train a logistic regression model, use fitmodel with the full model option to include all predictors in the model. Then, compute the PDs using probdefault. For a detailed description of this workflow, see Bin Data to Create Credit Scorecards Using Binning Explorer or Case Study for Credit Scorecard Analysis.

% Create a creditscorecard object, bin data, and fit a logistic regression model
load CreditCardData.mat
scl = creditscorecard(data,'IDVar','CustID');
scl = autobinning(scl);
scl = fitmodel(scl,'VariableSelection','fullmodel');

Generalized linear regression model:
    logit(status) ~ 1 + CustAge + TmAtAddress + ResStatus + EmpStatus + CustIncome + TmWBank + OtherCC + AMBalance + UtilRate
    Distribution = Binomial

Estimated Coefficients:
                   Estimate        SE         tStat        pValue  
                   _________    ________    _________    __________

    (Intercept)      0.70246    0.064039       10.969    5.3719e-28
    CustAge           0.6057     0.24934       2.4292      0.015131
    TmAtAddress       1.0381     0.94042       1.1039       0.26963
    ResStatus         1.3794      0.6526       2.1137      0.034538
    EmpStatus        0.89648     0.29339       3.0556     0.0022458
    CustIncome       0.70179     0.21866       3.2095     0.0013295
    TmWBank           1.1132     0.23346       4.7683    1.8579e-06
    OtherCC           1.0598     0.53005       1.9994      0.045568
    AMBalance         1.0572     0.36601       2.8884     0.0038718
    UtilRate       -0.047597     0.61133    -0.077858       0.93794


1200 observations, 1190 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91, p-value = 1.05e-15

% Compute the corresponding probabilities of default
pdL = probdefault(scl);

Compute Probabilities of Default Using Decision Trees

Next, create the challenger model. Use the Statistics and Machine Learning Toolbox™ function fitctree to fit a decision tree (DT) to the data. By default, the splitting criterion is Gini's diversity index. In this example, set a maximum number of splits to avoid overfitting and specify which predictors are categorical. For information on additional training options that can improve the model, see the Name-Value Arguments in fitctree.
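As a point of reference for the default splitting criterion, Gini's diversity index of a node is one minus the sum of the squared class fractions in that node. The following sketch evaluates the index for the root node, using the class counts that appear later in this example (803 nondefaults and 397 defaults out of 1200 observations). The variable names are illustrative and are not part of fitctree.

```matlab
% Gini's diversity index for a node: 1 - sum(p.^2), where p holds the class fractions.
% Root node counts taken from this example: 803 nondefaults and 397 defaults.
classCounts = [803 397];
p = classCounts/sum(classCounts);
giniRoot = 1 - sum(p.^2);
fprintf('Gini index at the root node: %.4f\n',giniRoot)   % prints 0.4428

% A pure node (only one class present) has an index of 0
giniPure = 1 - sum([1 0].^2);
```

An index of 0 means the node contains a single class, so splits that produce purer child nodes lower the index.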

% Create and view classification tree
CategoricalPreds = {'ResStatus','EmpStatus','OtherCC'};
dt = fitctree(data,'status~CustAge+TmAtAddress+ResStatus+EmpStatus+CustIncome+TmWBank+OtherCC+UtilRate',...
    'MaxNumSplits',30,'CategoricalPredictors',CategoricalPreds);
disp(dt)

  ClassificationTree
           PredictorNames: {'CustAge'  'TmAtAddress'  'ResStatus'  'EmpStatus'  'CustIncome'  'TmWBank'  'OtherCC'  'UtilRate'}
             ResponseName: 'status'
    CategoricalPredictors: [3 4 7]
               ClassNames: [0 1]
           ScoreTransform: 'none'
          NumObservations: 1200


  Properties, Methods

The decision tree is shown below. You can also use the view function with the name-value argument Mode set to "graph" to visualize the tree as a graph.

view(dt)

Decision tree for classification
 1  if CustIncome<30500 then node 2 elseif CustIncome>=30500 then node 3 else 0
 2  if TmWBank<60 then node 4 elseif TmWBank>=60 then node 5 else 1
 3  if TmWBank<32.5 then node 6 elseif TmWBank>=32.5 then node 7 else 0
 4  if TmAtAddress<13.5 then node 8 elseif TmAtAddress>=13.5 then node 9 else 1
 5  if UtilRate<0.255 then node 10 elseif UtilRate>=0.255 then node 11 else 0
 6  if CustAge<60.5 then node 12 elseif CustAge>=60.5 then node 13 else 0
 7  if CustAge<46.5 then node 14 elseif CustAge>=46.5 then node 15 else 0
 8  if CustIncome<24500 then node 16 elseif CustIncome>=24500 then node 17 else 1
 9  if TmWBank<56.5 then node 18 elseif TmWBank>=56.5 then node 19 else 1
10  if CustAge<21.5 then node 20 elseif CustAge>=21.5 then node 21 else 0
11  class = 1
12  if EmpStatus=Employed then node 22 elseif EmpStatus=Unknown then node 23 else 0
13  if TmAtAddress<131 then node 24 elseif TmAtAddress>=131 then node 25 else 0
14  if TmAtAddress<97.5 then node 26 elseif TmAtAddress>=97.5 then node 27 else 0
15  class = 0
16  class = 0
17  if ResStatus in {Home Owner Tenant} then node 28 elseif ResStatus=Other then node 29 else 1
18  if TmWBank<52.5 then node 30 elseif TmWBank>=52.5 then node 31 else 0
19  class = 1
20  class = 1
21  class = 0
22  if UtilRate<0.375 then node 32 elseif UtilRate>=0.375 then node 33 else 0
23  if UtilRate<0.005 then node 34 elseif UtilRate>=0.005 then node 35 else 0
24  if CustIncome<39500 then node 36 elseif CustIncome>=39500 then node 37 else 0
25  class = 1
26  if UtilRate<0.595 then node 38 elseif UtilRate>=0.595 then node 39 else 0
27  class = 1
28  class = 1
29  class = 0
30  class = 1
31  class = 0
32  class = 0
33  if UtilRate<0.635 then node 40 elseif UtilRate>=0.635 then node 41 else 0
34  if CustAge<49 then node 42 elseif CustAge>=49 then node 43 else 1
35  if CustIncome<57000 then node 44 elseif CustIncome>=57000 then node 45 else 0
36  class = 1
37  class = 0
38  class = 0
39  if CustIncome<34500 then node 46 elseif CustIncome>=34500 then node 47 else 1
40  class = 1
41  class = 0
42  class = 1
43  class = 0
44  class = 0
45  class = 1
46  class = 0
47  class = 1

The decision tree has a predict function, where the first output is the predicted class and the second output contains the estimated probability of each class for every observation.

% Extract probabilities of default
[~,ObservationClassProb] = predict(dt,data);
pdDT = ObservationClassProb(:,2);

ObservationClassProb is a NumObs-by-2 array that contains the class probabilities for all observations. The order of the columns is the same as the order of the classes in dt.ClassNames. In this example, the class names are [0 1], where 0 represents the good (nondefault) label, which is the class with the highest count in the raw data. The first column corresponds to nondefaults, whereas the second column contains the actual PDs. You use the PDs in the scoring or validation sections of the workflow.

Predictor Importance for Credit Scorecard Model

Predictor importance is related to the concept of predictor weights, as the weight of a predictor determines its significance in calculating the final score and the PD. For credit scorecards, the weights are determined by dividing the range of points for each predictor by the total range of points for the entire credit scorecard model.

In this example, use the formatpoints function with the PointsOddsAndPDO name-value argument for scaling. Set the following parameters:

Target points
Target odds
Number of points to double the odds (PDO)

The odds double every time the score increases by the number of points to double the odds (PDO). The formatpoints function solves for the scaling parameters so that the scaled scores are consistent with the target points, the target odds, and the PDO.
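The following is a minimal sketch of this scaling. It assumes the common scorecard convention that the scaled points are a linear function of the log odds, points = Shift + Slope*log(odds). The names Shift, Slope, and pointsAtOdds are illustrative and are not outputs of formatpoints.

```matlab
% Solve for the linear scaling points = Shift + Slope*log(odds) such that
%   points equal TargetPoints when odds equal TargetOdds, and
%   points increase by PDO whenever the odds double.
TargetPoints = 500;
TargetOdds = 2;
PDO = 50;

Slope = PDO/log(2);
Shift = TargetPoints - Slope*log(TargetOdds);
pointsAtOdds = @(odds) Shift + Slope*log(odds);   % assumed linear-in-log-odds form

fprintf('Points at odds 2: %g, points at odds 4: %g\n',pointsAtOdds(2),pointsAtOdds(4))
```

With these targets, odds of 2 map to 500 points and odds of 4 map to 550 points, that is, one PDO higher.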

% Choose target points, target odds, and PDO values
TargetPoints = 500;
TargetOdds = 2;
PDO = 50;

% Format points and compute points range
scl = formatpoints(scl,'PointsOddsAndPDO',[TargetPoints TargetOdds PDO]);
[PointsTable,MinPts,MaxPts] = displaypoints(scl);
PtsRange = MaxPts - MinPts;
disp(PointsTable(1:10,:))

      Predictors            Bin         Points
    _______________    _____________    ______

    {'CustAge'    }    {'[-Inf,33)'}    37.008
    {'CustAge'    }    {'[33,37)'  }    38.342
    {'CustAge'    }    {'[37,40)'  }    44.091
    {'CustAge'    }    {'[40,46)'  }    51.757
    {'CustAge'    }    {'[46,48)'  }    63.826
    {'CustAge'    }    {'[48,58)'  }     64.97
    {'CustAge'    }    {'[58,Inf]' }    82.826
    {'CustAge'    }    {'<missing>'}       NaN
    {'TmAtAddress'}    {'[-Inf,23)'}    49.058
    {'TmAtAddress'}    {'[23,83)'  }    57.325

fprintf('Minimum points: %g, Maximum points: %g\n',MinPts,MaxPts)

Minimum points: 348.705, Maximum points: 683.668

The weights are defined as the range of points, for any given predictor, divided by the range of points for the entire scorecard.

Predictor = unique(PointsTable.Predictors,'stable'); 
NumPred = length(Predictor);
Weight  = zeros(NumPred,1);

for ii = 1 : NumPred
   Ind = strcmpi(Predictor{ii},PointsTable.Predictors);
   MaxPtsPred = max(PointsTable.Points(Ind));
   MinPtsPred = min(PointsTable.Points(Ind));
   Weight(ii) = 100*(MaxPtsPred-MinPtsPred)/PtsRange;
end

PredictorWeights = table(Predictor,Weight);
PredictorWeights(end+1,:) = PredictorWeights(end,:);
PredictorWeights.Predictor{end} = 'Total';
PredictorWeights.Weight(end) = sum(Weight);
disp(PredictorWeights)

       Predictor       Weight 
    _______________    _______

    {'CustAge'    }     13.679
    {'TmAtAddress'}     5.1564
    {'ResStatus'  }     8.7945
    {'EmpStatus'  }      8.519
    {'CustIncome' }     19.259
    {'TmWBank'    }     24.557
    {'OtherCC'    }     7.3414
    {'AMBalance'  }     12.365
    {'UtilRate'   }    0.32919
    {'Total'      }        100

% Plot a bar chart of the weights
figure
bar(PredictorWeights.Weight(1:end-1))
title('Predictor Importance Estimates for Credit Scorecard');
ylabel('Estimates (%)');
xlabel('Predictors');
xticklabels(PredictorWeights.Predictor(1:end-1));

[Figure: bar chart "Predictor Importance Estimates for Credit Scorecard"; x-axis: Predictors; y-axis: Estimates (%).]

Using Decision Trees for Predictor Importance

When you use decision trees, you can investigate predictor importance using the predictorImportance function. For each predictor, the function sums the changes in the risk due to splits on that predictor and divides the sum by the number of branch nodes. A high value in the output array indicates a strong predictor.

imp = predictorImportance(dt);

figure;
bar(100*imp/sum(imp)); % to normalize on a 0-100% scale
title('Predictor Importance Estimates for Decision Tree');
ylabel('Estimates (%)');
xlabel('Predictors');
xticklabels(dt.PredictorNames);

[Figure: bar chart "Predictor Importance Estimates for Decision Tree"; x-axis: Predictors; y-axis: Estimates (%).]

In this case, 'CustIncome' (the predictor used for the split at the root node) is the most important predictor, followed by 'UtilRate', and so on. The predictor importance step can help in predictor screening for data sets with a large number of predictors.

Normalize the predictor importance for decision trees to a percentage from 0 through 100%, then compare the two models in a combined bar chart.

Ind = ismember(Predictor,dt.PredictorNames);
w = zeros(size(Weight));
w(Ind) = 100*imp'/sum(imp);
figure
bar([Weight,w]);
title('Predictor Importance Estimates');
ylabel('Estimates (%)');
xlabel('Predictors');
h = gca;
xticklabels(Predictor)
legend({'logit','DT'})

[Figure: grouped bar chart "Predictor Importance Estimates"; x-axis: Predictors; y-axis: Estimates (%); series: logit, DT.]

Comparing the predictor importance of the two models, not only do the weights differ across models, but the predictors that each model relies on also diverge. The decision tree does not use 'AMBalance', which is not included in the fitctree formula, or 'OtherCC', which does not appear in any split. In the logistic regression model, 'UtilRate' carries almost no weight (about 0.3%).

Note that these results depend on the binning algorithm you choose for the creditscorecard object and the parameters used in fitctree to build the decision tree. Different parameter choices made during training may bring the importance of different predictors closer across models.

Model Validation

You can use various validation tools from the toolbox to assess and compare these models. For instance, both creditscorecard and decision trees offer validation tools. creditscorecard has a validatemodel function that supports discrimination metrics and visualizations. Decision trees have tools such as the predictorImportance function used earlier, tools for cross validation (Improving Classification Trees and Regression Trees), and tools for interpretability (Interpret Machine Learning Models). However, not all validation tools supported in the scorecard are supported by decision trees, and vice versa. The risk.validation namespace offers a convenient way to compute the same validation metrics for any model and compare the results.

This example performs in-sample validation, where you use the same data set for both training and validating the models. Alternatively, you can split the data into training and testing, retrain the models with the training set, and use the test data to perform out-of-sample validation.
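The out-of-sample alternative can be sketched as follows. This sketch is not part of the workflow in this example. It assumes a 30% holdout fraction chosen arbitrarily, uses cvpartition from Statistics and Machine Learning Toolbox to split the data, and introduces the illustrative variable names trainData, testData, scTrain, dtTrain, pdLTest, and pdDTTest.

```matlab
% Hold out 30% of the observations for testing (the fraction is an arbitrary choice)
load CreditCardData.mat
rng('default')   % for reproducibility of the random partition
c = cvpartition(data.status,'HoldOut',0.3);
trainData = data(training(c),:);
testData = data(test(c),:);

% Retrain the scorecard on the training set and predict PDs on the test set
scTrain = creditscorecard(trainData,'IDVar','CustID');
scTrain = autobinning(scTrain);
scTrain = fitmodel(scTrain,'VariableSelection','fullmodel','Display','off');
pdLTest = probdefault(scTrain,testData);

% Retrain the decision tree on the training set and predict PDs on the test set
dtTrain = fitctree(trainData,'status~CustAge+TmAtAddress+ResStatus+EmpStatus+CustIncome+TmWBank+OtherCC+UtilRate',...
    'MaxNumSplits',30,'CategoricalPredictors',{'ResStatus','EmpStatus','OtherCC'});
[~,classProbTest] = predict(dtTrain,testData);
pdDTTest = classProbTest(:,2);   % second column corresponds to class 1 (default)
```

You can then pass pdLTest, pdDTTest, and testData.status to the same validation metrics that the rest of this example applies to the in-sample predictions.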

Recall that the scorecard PD is stored in the variable pdL, and the decision tree PD is stored in pdDT. For validation, you need the predicted PDs and the binary response. In this case, the response data is in data.status. In this section, you arrange the validation data (predictions by both models and response values) in a new table.

ValidationData = table(pdL,pdDT,data.status,VariableNames=["PDScorecard" "PDDecisionTree" "DefaultFlag"]);
head(ValidationData)

    PDScorecard    PDDecisionTree    DefaultFlag
    ___________    ______________    ___________

      0.24717          0.19672            0     
      0.20283         0.090909            0     
      0.31063          0.19672            0     
       0.1677          0.26977            0     
      0.18661         0.090909            0     
      0.14176          0.19672            0     
      0.51817          0.40323            1     
       0.2793          0.19672            0

With the predictions and the observed responses in one table, you can compute discrimination metrics to measure how well each model differentiates between defaulters and nondefaulters. The next section highlights a set of discrimination metrics you can use in your workflows.

Discrimination

Discrimination measures how well the model predictions rank the customers by risk. Higher risk customers should be assigned riskier scores such as higher PD values. In this section, compute the following discrimination metrics:

Accuracy ratio (AR) – risk.validation.accuracyRatio
Area under the curve (AUC) – risk.validation.areaUnderCurve
Kolmogorov-Smirnov (KS) – risk.validation.kolmogorovSmirnov
Brier score – risk.validation.brierScore

This example includes Brier score as a discrimination metric, although it can also be used as a calibration metric in other applications.

DiscriminationResults = table;
DiscriminationResults.AR = zeros(2,1); % Accuracy ratio
DiscriminationResults.AUC = zeros(2,1); % Area under the curve
DiscriminationResults.KS = zeros(2,1); % Kolmogorov-Smirnov
DiscriminationResults.Brier = zeros(2,1); % Brier score
DiscriminationResults.Properties.RowNames = ["Scorecard"; "Decision Tree"];

for ii=1:2 % for each model
   PDVar = ValidationData.Properties.VariableNames{ii};
   DiscriminationResults.AR(ii) = risk.validation.accuracyRatio(ValidationData.(PDVar),ValidationData.DefaultFlag);
   DiscriminationResults.AUC(ii) = risk.validation.areaUnderCurve(ValidationData.(PDVar),ValidationData.DefaultFlag);
   DiscriminationResults.KS(ii) = risk.validation.kolmogorovSmirnov(ValidationData.(PDVar),ValidationData.DefaultFlag);
   DiscriminationResults.Brier(ii) = risk.validation.brierScore(ValidationData.(PDVar),ValidationData.DefaultFlag);
end
disp(DiscriminationResults)

                       AR         AUC        KS        Brier 
                     _______    _______    _______    _______

    Scorecard        0.32515    0.66258    0.23204    0.20519
    Decision Tree    0.38903    0.69451    0.29666    0.19166

For AR, AUC and KS, higher values mean better discrimination, whereas for Brier score, lower values indicate better discrimination. These are simple models, included here for illustration purposes, and there should be no general conclusions about these classes of models based on these results. Both models can be improved with additional tuning during training.
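To make the metric definitions concrete, the following sketch computes all four metrics from first principles on a small made-up data set (eight loans with no tied PDs). It does not show how the risk.validation functions are implemented. It only illustrates the definitions, including the identity AR = 2*AUC - 1, which the values in the previous table also satisfy up to rounding.

```matlab
% Made-up PDs and default flags for eight loans (illustrative data only)
pd = [0.9 0.8 0.7 0.6 0.4 0.3 0.2 0.1]';
y  = [1 1 0 1 0 0 1 0]';

% AUC: fraction of (defaulter, nondefaulter) pairs in which the defaulter has the higher PD
pdBad = pd(y==1);
pdGood = pd(y==0);
numPairs = numel(pdBad)*numel(pdGood);
auc = (sum(pdBad > pdGood','all') + 0.5*sum(pdBad == pdGood','all'))/numPairs;

% Accuracy ratio from the AUC
ar = 2*auc - 1;

% KS: maximum gap between the cumulative proportions of bads and goods,
% going from the riskiest loan to the safest one
[~,idx] = sort(pd,'descend');
ySorted = y(idx);
propBads = cumsum(ySorted==1)/sum(y==1);
propGoods = cumsum(ySorted==0)/sum(y==0);
ks = max(propBads - propGoods);

% Brier score: mean squared difference between the PD and the 0/1 outcome
brier = mean((pd - y).^2);

fprintf('AUC = %.2f, AR = %.2f, KS = %.2f, Brier = %.2f\n',auc,ar,ks,brier)
% AUC = 0.75, AR = 0.50, KS = 0.50, Brier = 0.20
```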

A deciles report is commonly used when analyzing model discrimination. This report groups the portfolio data into deciles by PD, and then displays a range of metrics for each decile.

Set the Model variable to "Scorecard" or "Decision Tree" to generate the deciles report for the corresponding model. The risk.validation.thresholdMetrics function generates the information needed for the report.

Model = "Scorecard";
if Model=="Scorecard"
   PDVar = "PDScorecard";
else
   PDVar = "PDDecisionTree";
end

PDSelected = ValidationData.(PDVar);
DecileNumberByPD = risk.validation.groupNumberByQuantile(PDSelected,"deciles");
PDAvgByDecile = groupsummary(PDSelected,DecileNumberByPD,"mean");
PDMappedToDecilePD = PDAvgByDecile(DecileNumberByPD);

metricsByDecile = risk.validation.thresholdMetrics(PDMappedToDecilePD,ValidationData.DefaultFlag);
disp(metricsByDecile)

    Threshold    TruePositiveRate    FalsePositiveRate    RateOfPositivePredictions    TruePositives    FalsePositives    TrueNegatives    FalseNegatives
    _________    ________________    _________________    _________________________    _____________    ______________    _____________    ______________

     0.58115               0                    0                      0                      0                0               803              397      
     0.58115         0.17632             0.062267                    0.1                     70               50               753              327      
     0.46854         0.32242              0.13948                    0.2                    128              112               691              269      
     0.41592         0.44081              0.23039                    0.3                    175              185               618              222      
      0.3696          0.5466              0.32752                    0.4                    217              263               540              180      
     0.33316         0.64736              0.42715                    0.5                    257              343               460              140      
     0.29963         0.74055              0.53051                    0.6                    294              426               377              103      
     0.27108         0.82872              0.63636                    0.7                    329              511               292               68      
     0.23646         0.90176              0.74969                    0.8                    358              602               201               39      
     0.19635         0.96725              0.86675                    0.9                    384              696               107               13      
     0.13643               1                    1                      1                    397              803                 0                0

Reformat the table to reflect the meaning of these metrics in the context of credit risk. "Bads" in the column names refers to defaulters, whereas "Goods" refers to nondefaulters.

decileReport = renamevars(metricsByDecile,metricsByDecile.Properties.VariableNames,...
   ["PD" "Proportion of Bads" "Proportion of Goods" "Proportion of Borrowers" "Cumulative Bads" "Cumulative Goods" "Goods in Higher Deciles" "Bads in Higher Deciles"]);
decileReport(1,:) = []; % First row is mostly for plotting purposes
decileReport = addvars(decileReport,(1:height(decileReport))',Before="PD",NewVariableNames="Decile");
disp(decileReport)

    Decile      PD       Proportion of Bads    Proportion of Goods    Proportion of Borrowers    Cumulative Bads    Cumulative Goods    Goods in Higher Deciles    Bads in Higher Deciles
    ______    _______    __________________    ___________________    _______________________    _______________    ________________    _______________________    ______________________

       1      0.58115         0.17632               0.062267                    0.1                     70                 50                     753                       327          
       2      0.46854         0.32242                0.13948                    0.2                    128                112                     691                       269          
       3      0.41592         0.44081                0.23039                    0.3                    175                185                     618                       222          
       4       0.3696          0.5466                0.32752                    0.4                    217                263                     540                       180          
       5      0.33316         0.64736                0.42715                    0.5                    257                343                     460                       140          
       6      0.29963         0.74055                0.53051                    0.6                    294                426                     377                       103          
       7      0.27108         0.82872                0.63636                    0.7                    329                511                     292                        68          
       8      0.23646         0.90176                0.74969                    0.8                    358                602                     201                        39          
       9      0.19635         0.96725                0.86675                    0.9                    384                696                     107                        13          
      10      0.13643               1                      1                      1                    397                803                       0                         0

The Proportion of Borrowers column shows the proportion of the entire portfolio that is included up to that row in the table. For the scorecard model, this has increments of 0.1, as expected, since these are the deciles. The Proportion of Bads and Proportion of Goods columns show that for the 10% of the portfolio with the highest PD, about 17.6% of the defaulters have been identified, and only about 6.2% of the nondefaulters were assigned such a high PD value. For 20% of the portfolio, the model already correctly labeled 32.2% of the defaulters, and only incorrectly labeled 13.9% of the nondefaulters as risky. An ideal model would show the Proportion of Bads increasing quickly while the Proportion of Goods stays low initially and only grows towards the bottom of the table where PD values are low and Cumulative Goods are high.

For the decision tree model in this example, there are only 6 rows and the proportion in each row does not match 10%. This is because the model tends to predict some PD values very often. Here is the frequency table of the predicted PD values for the decision tree model.

tabulate(ValidationData.PDDecisionTree)

      Value    Count   Percent
     0.0625       16      1.33%
  0.0909091       44      3.67%
   0.166667        6      0.50%
   0.196721      244     20.33%
        0.2        5      0.42%
       0.25       20      1.67%
   0.252688      186     15.50%
   0.269767      215     17.92%
   0.285714        7      0.58%
   0.333333        3      0.25%
   0.403226      248     20.67%
        0.5        8      0.67%
   0.513761      109      9.08%
   0.571429        7      0.58%
   0.645161       31      2.58%
   0.666667        9      0.75%
   0.727273       11      0.92%
          1       31      2.58%

The tabulated output shows the various percentages of the portfolio and the associated PD. For example, the first row indicates that 1.33% of the portfolio gets a PD of 6.25%. It is not possible to split the portfolio into bins with exactly 10% of observations when using the PD value as the split criterion. In this case, when determining the deciles, the data ends up grouped into 6 bins, where some bins contain over 20% of observations.

The Kolmogorov-Smirnov (KS) plot shows how the proportion of bads and the proportion of goods grow for a model, as a function of the PD value (or a score value). Typically, it is sorted from riskier scores on the left (such as high PD) to safer scores on the right (such as low PD). The KS metric is the maximum difference between these proportions.
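As a quick check of this definition, the following sketch computes the maximum difference from the scorecard deciles report shown earlier. The values are copied from the Proportion of Bads and Proportion of Goods columns, and the variable names are illustrative.

```matlab
% Proportion of bads and goods for deciles 1 to 10 of the scorecard deciles report
propBads  = [0.17632 0.32242 0.44081 0.5466 0.64736 0.74055 0.82872 0.90176 0.96725 1];
propGoods = [0.062267 0.13948 0.23039 0.32752 0.42715 0.53051 0.63636 0.74969 0.86675 1];

% Decile-level KS: largest gap between the two cumulative proportions
[ksDecile,decileAtMax] = max(propBads - propGoods);
fprintf('Decile-level KS: %.4f at decile %d\n',ksDecile,decileAtMax)   % 0.2202 at decile 5
```

The decile-level value is slightly below the loan-level KS of 0.23204 reported earlier, because the aggregated report evaluates the gap at only ten thresholds.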

figure;
plot(metricsByDecile.Threshold,metricsByDecile.TruePositiveRate)
hold on
plot(metricsByDecile.Threshold,metricsByDecile.FalsePositiveRate)
hold off
grid on
ax = gca;
ax.XDir = "reverse";
title(strcat("KS Plot by Deciles: ",Model))
xlabel("PD")
ylabel("Proportion")
legend("Proportion of Bads","Proportion of Goods",Location="best")

[Figure: line plot "KS Plot by Deciles: Scorecard"; x-axis: PD; y-axis: Proportion; lines: Proportion of Bads, Proportion of Goods.]

The receiver operating characteristic (ROC) curve plots the Proportion of Bads versus the Proportion of Goods directly, without showing the PD values.

figure;
plot(flipud(metricsByDecile.FalsePositiveRate),flipud(metricsByDecile.TruePositiveRate))
grid on
title(strcat("ROC by Deciles: ",Model))
xlabel("Proportion of Goods")
ylabel("Proportion of Bads")

[Figure: line plot "ROC by Deciles: Scorecard"; x-axis: Proportion of Goods; y-axis: Proportion of Bads.]

In this case, a curve that quickly increases is preferred, since this shows that the defaulters are identified faster than the nondefaulters, while the score or PD changes from riskier to safer values.
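The area under this decile-level ROC curve can be approximated with the trapezoidal rule. The following sketch uses the values from the scorecard deciles report, with the point (0,0) added at the start. The variable names are illustrative.

```matlab
% ROC points from the scorecard deciles report, starting at (0,0)
propGoods = [0 0.062267 0.13948 0.23039 0.32752 0.42715 0.53051 0.63636 0.74969 0.86675 1];
propBads  = [0 0.17632 0.32242 0.44081 0.5466 0.64736 0.74055 0.82872 0.90176 0.96725 1];

% Trapezoidal approximation of the area under the decile-level ROC curve
aucDecile = trapz(propGoods,propBads);
fprintf('Decile-level AUC: %.3f\n',aucDecile)   % 0.660
```

The result is close to, and slightly below, the loan-level AUC of 0.66258 reported for the scorecard.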

You can generate either plot at the individual loan level as well, without aggregating by deciles. For example, you can make the KS plot at the loan level by using the following code.

[ksValue,ksOutput] = risk.validation.kolmogorovSmirnov(ValidationData.(PDVar),ValidationData.DefaultFlag);
figure;
plot(ksOutput.Metrics.Threshold,ksOutput.Metrics.TruePositiveRate)
hold on
plot(ksOutput.Metrics.Threshold,ksOutput.Metrics.FalsePositiveRate)
hold off
grid on
ax = gca;
ax.XDir = "reverse";
title(strcat("KS Plot, Loan-Level: ",Model))
xlabel("PD")
ylabel("Proportion")
legend("Proportion of Bads","Proportion of Goods",Location="best")

[Figure: line plot "KS Plot, Loan-Level: Scorecard"; x-axis: PD; y-axis: Proportion; lines: Proportion of Bads, Proportion of Goods.]

Calibration

Model calibration measures how close the predicted PDs are to the actual default rates. It requires grouping individual loans so that a default rate for the group can be computed and compared to the average PD of the group. For individual loans, the response value is a 0 or 1 value, whereas the predicted PD for the loan is a continuous value. By grouping, you can measure the default rate within a group (the number of defaults divided by the number of loans), which is a value between 0 and 1, and compare it to the group's PD.

For credit scoring models, where each individual borrower may get its own PD value, it is common to use deciles as groups. For credit rating models, each rating is a natural group.

Some calibration metrics, such as the Hosmer-Lemeshow test and the root mean square error (RMSE), require all the groups in the portfolio to compute a single calibration value for the portfolio.

Hosmer-Lemeshow is a statistical test where the null hypothesis states that the expected number of defaults in the groups matches the observed defaults. A rejection of the null hypothesis means that expected and observed default counts are not close enough and the models should be reviewed. This example reports the rejection flag and the corresponding p-value (smaller values suggest a mismatch between expected and observed defaults). For more information about this test, see risk.validation.hosmerLemeshowTest.
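The following sketch illustrates the test statistic for the scorecard deciles. It assumes the usual form of the Hosmer-Lemeshow statistic and a chi-square reference distribution with the number of groups minus 2 degrees of freedom. Both are assumptions about the method, not a description of how risk.validation.hosmerLemeshowTest is implemented, although they reproduce the scorecard p-value reported in this example. The group data is copied from the Probability, NumEvents, and NumTrials columns of the binomial test results table later in this example.

```matlab
% Group PDs, observed defaults, and group sizes for the scorecard deciles
pdGroup     = [0.13643 0.19635 0.23646 0.27108 0.29963 0.33316 0.3696 0.41592 0.46854 0.58115]';
numDefaults = [13 26 29 35 37 40 42 47 58 70]';
numLoans    = 120*ones(10,1);

% Statistic: sum over groups of (observed - expected)^2 / (n*p*(1-p))
expectedDefaults = numLoans.*pdGroup;
hlStat = sum((numDefaults - expectedDefaults).^2 ./ (expectedDefaults.*(1 - pdGroup)));

% Assumed reference distribution: chi-square with (number of groups - 2) degrees of freedom.
% gammainc(x/2,df/2,'upper') is the upper tail of a chi-square distribution with df degrees of freedom.
df = numel(pdGroup) - 2;
pValue = gammainc(hlStat/2,df/2,'upper');
fprintf('HL statistic: %.2f, p-value: %.3f\n',hlStat,pValue)   % about 2.03 and 0.980
```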

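The RMSE metric, computed with the rmse function, is the square root of the weighted average of the squared errors between the predicted PD and the observed default rate of each group, with the number of loans in each group as the weights.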
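The weighted RMSE can be written out explicitly as in the following sketch, which uses the scorecard values from the Probability, ObservedProbability, and NumTrials columns of the binomial test results table later in this example. The variable names are illustrative.

```matlab
% Group PDs, observed default rates, and group sizes for the scorecard deciles
pdGroup     = [0.13643 0.19635 0.23646 0.27108 0.29963 0.33316 0.3696 0.41592 0.46854 0.58115]';
defaultRate = [0.10833 0.21667 0.24167 0.29167 0.30833 0.33333 0.35 0.39167 0.48333 0.58333]';
numLoans    = 120*ones(10,1);

% Weighted RMSE: square root of the weighted mean of the squared errors
rmseWeighted = sqrt(sum(numLoans.*(pdGroup - defaultRate).^2)/sum(numLoans));
fprintf('Weighted RMSE: %.4f\n',rmseWeighted)   % 0.0171
```

Because all deciles of the scorecard contain 120 loans, the weights have no effect here. They matter for the decision tree, where the group sizes differ.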

CalibrationResults = table;
CalibrationResults.HosmerLemeshowReject = zeros(2,1); % Hosmer-Lemeshow test rejection flag
CalibrationResults.HosmerLemeshowPValue = zeros(2,1); % Hosmer-Lemeshow test p-value
CalibrationResults.RMSE = zeros(2,1); % Root mean square error
CalibrationResults.Properties.RowNames = ["Scorecard"; "Decision Tree"];

for ii=1:2 % for each model
   PDVar = ValidationData.Properties.VariableNames{ii};
   PDSelected = ValidationData.(PDVar);

   DecileNumberByPD = risk.validation.groupNumberByQuantile(PDSelected,"deciles");
   PDAvgByDecile = groupsummary(PDSelected,DecileNumberByPD,"mean");
   [NumDefaultsByDecile,~,NumLoansByDecile] = groupsummary(ValidationData.DefaultFlag,DecileNumberByPD,"sum");

   [HLReject,HLOutput] = risk.validation.hosmerLemeshowTest(PDAvgByDecile,NumDefaultsByDecile,NumLoansByDecile);
   CalibrationResults.HosmerLemeshowReject(ii) = HLReject;
   CalibrationResults.HosmerLemeshowPValue(ii) = HLOutput.PValue;
   CalibrationResults.RMSE(ii) = rmse(PDAvgByDecile,NumDefaultsByDecile./NumLoansByDecile,Weights=NumLoansByDecile);
end
disp(CalibrationResults)

                     HosmerLemeshowReject    HosmerLemeshowPValue       RMSE   
                     ____________________    ____________________    __________

    Scorecard                 0                    0.97998             0.017103
    Decision Tree             0                          1           9.2096e-16

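You can interpret the results in the table as follows:

A rejection flag of 0 means that the test does not reject the null hypothesis, so there is no evidence of a mismatch between expected and observed defaults.
A higher p-value indicates closer agreement between expected and observed defaults.
A lower RMSE indicates better calibration.

The near-perfect results for the decision tree are a consequence of in-sample validation. The PD that a classification tree assigns to a leaf is the observed default fraction of the training observations in that leaf, so the group PDs and the group default rates coincide when you validate with the training data.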

Other calibration tests, such as the binomial test, measure the calibration of each individual group. For more information about this test, see risk.validation.binomialTest. For the binomial test, the null hypothesis states that the PD for each group matches the group's default rate. Rejection of the null hypothesis means that the PD underestimates the default rate. As before, deciles are commonly used for grouping.
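The following sketch illustrates the idea for a single group. It assumes that the p-value is the upper-tail probability of observing at least the actual number of defaults when the number of defaults follows a binomial distribution with the group's PD. This assumption is consistent with the results reported later in this example, but the implementation of risk.validation.binomialTest may differ. The inputs are the values for the lowest-PD decile of the scorecard, and the variable names are illustrative.

```matlab
% Lowest-PD decile of the scorecard: 120 loans, group PD of 0.13643, 13 observed defaults
n = 120;
p = 0.13643;
observedDefaults = 13;

% Binomial probabilities for k = observedDefaults,...,n, computed on the log scale for stability
k = (observedDefaults:n)';
logPmf = gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1) + k*log(p) + (n-k)*log(1-p);

% Assumed p-value: probability of at least observedDefaults defaults under the null hypothesis
pValue = sum(exp(logPmf));
fprintf('Upper-tail p-value: %.2f\n',pValue)   % about 0.85
```

A large p-value such as this one means that the observed number of defaults is not unusually high for the group's PD, so the test does not reject.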

Model = "Scorecard";
if Model=="Scorecard"
   PDVar = "PDScorecard";
else
   PDVar = "PDDecisionTree";
end

PDSelected = ValidationData.(PDVar);
DecileNumberByPD = risk.validation.groupNumberByQuantile(PDSelected,"deciles");
PDAvgByDecile = groupsummary(PDSelected,DecileNumberByPD,"mean");
[NumDefaultsByDecile,~,NumLoansByDecile] = groupsummary(ValidationData.DefaultFlag,DecileNumberByPD,"sum");
[BinTestReject,BinTestOutput] = risk.validation.binomialTest(PDAvgByDecile,NumDefaultsByDecile,NumLoansByDecile);

You can see that each group has its own test result. The main output of the function shows the rejection flag for each group.

disp(BinTestReject)

The structure BinTestOutput contains a table with detailed information about the test results for each group. The first column of the table displays the rejection flag for the test.

disp(BinTestOutput.Results)

    RejectBinTest    PValue     NumEvents    CriticalValue    ConfidenceLevel    NumTrials    Probability    ObservedProbability
    _____________    _______    _________    _____________    _______________    _________    ___________    ___________________

          0           0.8493       13             24               0.95             120         0.13643            0.10833      
          0          0.32138       26             32               0.95             120         0.19635            0.21667      
          0          0.48175       29             37               0.95             120         0.23646            0.24167      
          0          0.33819       35             42               0.95             120         0.27108            0.29167      
          0           0.4516       37             45               0.95             120         0.29963            0.30833      
          0           0.5327       40             50               0.95             120         0.33316            0.33333      
          0           0.7031       42             54               0.95             120          0.3696               0.35      
          0          0.73511       47             60               0.95             120         0.41592            0.39167      
          0          0.40712       58             66               0.95             120         0.46854            0.48333      
          0          0.51955       70             80               0.95             120         0.58115            0.58333
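
To illustrate how a p-value of this kind can arise, the following is a minimal sketch that computes an upper-tail binomial probability for the first group using only base MATLAB functions. It assumes that the p-value is the probability of observing at least NumEvents defaults in NumTrials independent loans when the true PD equals the Probability column. That reading is an assumption made for illustration, not a description of how risk.validation.binomialTest is implemented. The names logBinoPmf and upperTailProb are hypothetical helpers.

```matlab
% Log of the binomial probability mass function, written with gammaln so
% that it stays numerically stable for large numbers of trials
logBinoPmf = @(x,n,p) gammaln(n+1) - gammaln(x+1) - gammaln(n-x+1) ...
    + x.*log(p) + (n-x).*log(1-p);

% Upper-tail probability P(X >= k) for X ~ Binomial(n,p), with 0 < p < 1
upperTailProb = @(k,n,p) sum(exp(logBinoPmf(k:n,n,p)));

% Inputs copied from the first row of the results table
numTrials   = 120;
numEvents   = 13;
predictedPD = 0.13643;

pValueSketch = upperTailProb(numEvents,numTrials,predictedPD);
disp(pValueSketch)
```

If the assumption holds, the displayed value should be comparable to the PValue entry in the first row. A small upper-tail probability would indicate more defaults than the predicted PD can plausibly explain.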

The binomial test assumes the defaults are independent. To assess the effect of correlation between events, you can apply the correlated binomial test, which requires a correlation value as an additional input parameter. This example uses a 10% correlation for illustration purposes. For more information about the correlated binomial test, see risk.validation.correlatedBinomialTest.

CorrValue = 0.1;
[~,CorrBinTestOutput] = risk.validation.correlatedBinomialTest(PDAvgByDecile,NumDefaultsByDecile,NumLoansByDecile,CorrValue);
disp(CorrBinTestOutput.Results)

    RejectCorrBinTest    PValue     NumEvents    CriticalValue    ConfidenceLevel    NumTrials    Probability    ObservedProbability    Correlation    EventCorrelation
    _________________    _______    _________    _____________    _______________    _________    ___________    ___________________    ___________    ________________

            0            0.60583       13             35               0.95             120         0.13643            0.10833              0.1            0.043043    
            0            0.38439       26             46               0.95             120         0.19635            0.21667              0.1            0.050366    
            0             0.4511       29             53               0.95             120         0.23646            0.24167              0.1            0.054047    
            0             0.4075       35             58               0.95             120         0.27108            0.29167              0.1            0.056624    
            0            0.45226       37             62               0.95             120         0.29963            0.30833              0.1            0.058393    
            0            0.48512       40             67               0.95             120         0.33316            0.33333              0.1            0.060106    
            0            0.55054       42             72               0.95             120          0.3696               0.35              0.1            0.061564    
            0            0.56914       47             78               0.95             120         0.41592            0.39167              0.1            0.062864    
            0            0.46487       58             84               0.95             120         0.46854            0.48333              0.1            0.063643    
            0            0.51871       70             96               0.95             120         0.58115            0.58333              0.1            0.062926

The critical values are higher for the correlated binomial test, showing that a higher number of defaults is acceptable under the assumption that there is some level of correlation between defaults.

The following plot summarizes the calibration results from the uncorrelated and correlated binomial tests.

figure;
bar(BinTestOutput.Results.Probability.*BinTestOutput.Results.NumTrials)
hold on
plot(BinTestOutput.Results.NumEvents,'*')
plot(BinTestOutput.Results.CriticalValue,'x')
plot(CorrBinTestOutput.Results.CriticalValue,'+')
hold off
grid on
legend("Predicted","Observed","Binomial Test","Correlated Binomial Test",Location="best")
xlabel("Group Number")
ylabel("Defaults")
title(strcat("Model Calibration with Binomial Test Limits: ",Model))

Figure: Model Calibration with Binomial Test Limits: Scorecard. Defaults by group number, with a bar series for Predicted and marker series for Observed, Binomial Test, and Correlated Binomial Test.

The models compared in this example are simple, and you can improve them with additional parameter tuning during training. This example also demonstrates readily available tools that you can use to train and validate credit scoring models.

More About

Logistic regression links the score and the PD through the logistic regression function, and is the default fitting and scoring model when you work with creditscorecard objects. Decision trees have gained popularity in credit scoring and are now commonly used to fit data and predict default. Decision tree algorithms follow a top-down approach where the data set is split according to a chosen metric, such as the Gini index, information value, or entropy. For more information, see Decision Trees. The risk.validation namespace offers a range of tools to validate and compare the models.
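
As an illustration of two of the split metrics mentioned above, the following is a minimal sketch that computes the Gini index and the entropy of a single tree node from its class proportions. The node labels are hypothetical, and the code shows only the textbook definitions of the metrics, not how fitctree evaluates candidate splits internally.

```matlab
% Hypothetical class labels at a tree node (1 = default, 0 = no default)
nodeLabels = [1 0 0 1 0 0 0 1 0 0];

pDefault   = mean(nodeLabels);         % fraction of defaults in the node
classProbs = [pDefault, 1-pDefault];   % class proportions

% Gini index: 1 minus the sum of squared class proportions
giniIndex = 1 - sum(classProbs.^2);

% Entropy in bits; skip empty classes so that 0*log2(0) is treated as 0
nonzero     = classProbs(classProbs > 0);
nodeEntropy = -sum(nonzero.*log2(nonzero));

fprintf("Gini index: %.4f, entropy: %.4f bits\n",giniIndex,nodeEntropy)
```

Both metrics equal 0 for a pure node and are largest when the classes are evenly mixed, so a split is attractive when it produces child nodes with lower values than the parent.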