risk.validation.kolmogorovSmirnov

Kolmogorov-Smirnov statistic

Since R2025a

    Description

    ksValue = risk.validation.kolmogorovSmirnov(Score,BinaryResponse) returns the two-sample Kolmogorov-Smirnov (KS) statistic, where Score contains numeric values that represent rankings or predictions from a binary classification model, such as probability of default (PD) estimates. BinaryResponse specifies the target state of each value in Score. This syntax is well-suited for binary classification models.

    ksValue = risk.validation.kolmogorovSmirnov(Sample1,Sample2) calculates the two-sample KS statistic for the data in Sample1 and Sample2.

    ksValue = risk.validation.kolmogorovSmirnov(___,SortDirection=sortdir) specifies the sorting direction of the unique values in Score or in Sample1 and Sample2.

    [ksValue,Output] = risk.validation.kolmogorovSmirnov(___) also returns Output, a structure that contains the KS statistic, the value at which the statistic is attained, and additional information about the test.

    Examples

    Compute the Kolmogorov-Smirnov (KS) statistic for credit scores by using the kolmogorovSmirnov function. In this example, you use the credit validation data set, which includes a table, ScorecardValidationData, that contains credit scores and their corresponding default status information.

    Load and display the credit validation data.

    load CreditValidationData.mat
    head(ScorecardValidationData)
        CreditScore      PD       Default
        ___________    _______    _______
    
          579.86       0.14182       0   
          563.65       0.17143       0   
          549.52       0.20106       0   
          546.25       0.20845       0   
          485.34       0.37991       0   
          482.07       0.39065       0   
          579.86       0.14182       1   
          451.73         0.494       0   
    

    Extract the variables CreditScore and Default from the table ScorecardValidationData. Use Default as the BinaryResponse input argument.

    Scores = ScorecardValidationData.CreditScore;
    BinaryResponse = ScorecardValidationData.Default;

    Compute the KS statistic by using the kolmogorovSmirnov function with the fully qualified namespace risk.validation. For credit models, you can sort the scores from lower scores to higher scores by setting the SortDirection name-value argument to "ascending". This setting ensures that the function sorts the scores from higher risk individuals to lower risk individuals.

    [ksValue,Output] = risk.validation.kolmogorovSmirnov(Scores,BinaryResponse,SortDirection="ascending")
    ksValue = 
    0.1770
    
    Output = struct with fields:
        KolmogorovSmirnovStatistic: 0.1770
            KolmogorovSmirnovScore: 476.4030
                           Metrics: [107×3 table]
    
    

    The output structure, Output, contains the KS statistic and the value in Score that attains this statistic. Display the metrics Threshold, TruePositiveRate, and FalsePositiveRate contained in the table Output.Metrics.

    head(Output.Metrics)
        Threshold    TruePositiveRate    FalsePositiveRate
        _________    ________________    _________________
    
         408.99                 0                   0     
         408.99          0.071429            0.012821     
         410.12          0.079365            0.017094     
         430.66          0.087302            0.017094     
         435.52          0.087302            0.025641     
         436.65           0.10317            0.029915     
         439.33           0.11905            0.029915     
         440.45           0.13492            0.029915     
    

    Calculate the Kolmogorov-Smirnov (KS) statistic for two samples containing risk-theoretical profit and loss (RTPL) data and hypothetical profit and loss (HPL) data, respectively. The vectors RTPL and HPL contain the RTPL and HPL data for 250 trading days, or one year, of a simulated portfolio.

    load("PandLValues.mat")
    [ksValue,Output] = risk.validation.kolmogorovSmirnov(RTPL,HPL)
    ksValue = 
    0.0280
    
    Output = struct with fields:
        KolmogorovSmirnovStatistic: 0.0280
            KolmogorovSmirnovPoint: -1.0261e+03
                     Distributions: [501×3 table]
    
    

    The output indicates that the largest distance between the empirical cumulative distribution function (CDF) for RTPL and the empirical CDF for HPL is 0.028.

    Display the evaluation points and values for the empirical CDFs.

    Output.Distributions
    ans=501×3 table
        EvaluationPoint    EmpiricalCDF1    EmpiricalCDF2
        _______________    _____________    _____________

        -3.9596e+04         0         0
        -3.9596e+04    0.0040         0
        -3.0298e+04    0.0040    0.0040
        -2.2525e+04    0.0040    0.0080
        -2.1882e+04    0.0080    0.0080
        -2.0224e+04    0.0120    0.0080
        -2.0065e+04    0.0120    0.0120
        -1.9575e+04    0.0120    0.0160
        -1.8832e+04    0.0160    0.0160
        -1.7563e+04    0.0160    0.0200
        -1.7370e+04    0.0160    0.0240
        -1.7006e+04    0.0160    0.0280
        -1.6749e+04    0.0200    0.0280
        -1.6713e+04    0.0240    0.0280
          ⋮
    
    

    Input Arguments

    Score values, specified as a numeric vector. The values indicate quantities such as rankings or model predictions, for example, probability of default (PD) or loss given default (LGD) estimates. For more information, see Algorithms.

    Data Types: single | double

    Binary response, specified as a numeric or logical vector, that contains values of 1 (true) or 0 (false). The binary response represents the target state for each value in Score.

    When you specify BinaryResponse, risk.validation.kolmogorovSmirnov creates two samples from the data in Score. The sample given by the 0 values in BinaryResponse corresponds to the FalsePositiveRate column of the Output.Metrics table, and the sample given by the 1 values corresponds to the TruePositiveRate column. For more information, see Algorithms.
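
    The split described above can be sketched with base MATLAB functions. This is a minimal illustration on made-up data, not the function's actual implementation. The variable names (score, response, positives, negatives, thresholds, tpr, fpr, ksSketch) are hypothetical, and the use of >= to accumulate rates at each threshold under a descending sort is an assumption.

```matlab
% Hypothetical toy data (not from CreditValidationData.mat)
score    = [0.9 0.8 0.7 0.6 0.4 0.3];   % e.g., PD-like scores
response = logical([1 1 0 1 0 0]);      % 1 = target state, 0 = otherwise

% Split the scores into two samples using the binary response
positives = score(response);    % sample given by the 1 values
negatives = score(~response);   % sample given by the 0 values

% Unique scores in descending order act as thresholds
thresholds = sort(unique(score),"descend");

% Assumed convention: fraction of each sample at or above the threshold
tpr = arrayfun(@(t) mean(positives >= t),thresholds);
fpr = arrayfun(@(t) mean(negatives >= t),thresholds);

% Largest absolute gap between the two rate curves
ksSketch = max(abs(tpr - fpr));
```

    For this toy data, the largest gap is 2/3. The Metrics table returned by the function might include extra rows, such as an initial row of zeros, that this sketch does not reproduce.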

    Sample data for Sample1 and Sample2, specified as two numeric vectors.

    Example: normrnd(0,1,1,100),normrnd(5,2,1,100)

    Data Types: single | double

    Sorting direction of the distribution variable, specified as one of the following:

    • "descending" — Default value when you specify Score and BinaryResponse. Descending sorting direction is well suited for binary classifiers. Models that use probability of default data, for example, typically use a descending sorting direction because higher values correspond to higher risk. In this case, a descending sorting direction ensures that TruePositiveRate represents the proportion of defaulters.

    • "ascending" — Default value when you specify Sample1,Sample2. An ascending sorting direction is well suited for comparing samples. Models that use credit scores, for example, typically use an ascending sorting direction because lower values correspond to higher risk.

    Example: SortDirection="descending"
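
    As a hedged illustration of the "descending" setting, the following sketch passes the PD column of ScorecardValidationData, shown in the first example, instead of the CreditScore column. It assumes that CreditValidationData.mat is on the MATLAB path. The variable names PD and Default are chosen here for illustration. No output is shown.

```matlab
% Assumes CreditValidationData.mat is available, as in the first example
load CreditValidationData.mat

% Higher PD values correspond to higher risk, so sort in descending order
PD      = ScorecardValidationData.PD;
Default = ScorecardValidationData.Default;

ksValue = risk.validation.kolmogorovSmirnov(PD,Default,SortDirection="descending");
```

    Because "descending" is the default when you specify Score and BinaryResponse, omitting the name-value argument in this call should give the same result.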

    Output Arguments

    KS statistic for the data in Score or in Sample1 and Sample2, returned as a numeric scalar. For a binary classification model, you can use the KS statistic to quantify how well the model differentiates between lower risk and higher risk customers.

    Output metrics, returned as a structure containing the following fields:

    • KolmogorovSmirnovStatistic — KS statistic, equal to ksValue.

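    • KolmogorovSmirnovScore — Value in Score that attains the KS statistic. When you specify Sample1,Sample2, the example output for the RTPL and HPL data shows the corresponding field as KolmogorovSmirnovPoint.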

    When you specify Score and BinaryResponse, Output includes a Metrics field, which is a table with the following columns.

      • Threshold — Unique score values sorted according to the value of sortdir.

      • TruePositiveRate — True positive rate values corresponding to the unique scores in the Threshold column. For credit scoring models, this column represents the proportion of defaulters.

      • FalsePositiveRate — False positive rate values corresponding to the unique scores in the Threshold column. For credit scoring models, this column represents the proportion of nondefaulters.

    When you specify Sample1,Sample2, Output includes a field Distributions, which is a table with the following columns.

    • EvaluationPoint — Evaluation points for the CDFs

    • EmpiricalCDF1 — Values of the Sample1 CDF, evaluated at the points in EvaluationPoint

    • EmpiricalCDF2 — Values of the Sample2 CDF, evaluated at the points in EvaluationPoint

    Algorithms

    The risk.validation.kolmogorovSmirnov function calculates the KS statistic by taking the largest absolute difference between the empirical cumulative distribution functions (CDFs) for two samples.

    • When you specify Sample1,Sample2, the function calculates the empirical CDFs using the data in the samples.

    • When you specify Score and BinaryResponse, risk.validation.kolmogorovSmirnov uses BinaryResponse to create two samples from the data in Score and then calculates the empirical CDFs using the data in the samples. The sample given by the 0 values in BinaryResponse corresponds to the FalsePositiveRate column of the Output.Metrics table, and the sample given by the 1 values corresponds to the TruePositiveRate column.
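
    The two-sample calculation can be sketched with base MATLAB functions as follows. This is a minimal illustration under stated assumptions, not the function's actual implementation. The toy data and the names sample1, sample2, x, cdf1, cdf2, ksSketch, and ksPoint are hypothetical, and the sketch evaluates the empirical CDFs only at the unique observed values.

```matlab
% Hypothetical toy samples (not from PandLValues.mat)
sample1 = [1 2 3 4 5];
sample2 = [3 4 5 6 7];

% Evaluate both empirical CDFs at every unique observed value
x    = unique([sample1(:); sample2(:)]);
cdf1 = arrayfun(@(t) mean(sample1 <= t),x);
cdf2 = arrayfun(@(t) mean(sample2 <= t),x);

% KS statistic: largest absolute gap between the two empirical CDFs
[ksSketch,idx] = max(abs(cdf1 - cdf2));
ksPoint = x(idx);   % first evaluation point that attains the gap
```

    For these samples, the largest gap is 0.4. When several evaluation points attain the same gap, max returns the first one, which might differ from the point that the function reports.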

    Alternative Functionality

    You can calculate and visualize the KS statistic by using the risk.validation.kolmogorovSmirnovPlot function. risk.validation.kolmogorovSmirnovPlot displays the KS statistic and the empirical cumulative distribution functions (CDFs) for the samples, and allows you to plot a difference profile for the empirical CDFs. You can also perform a two-sample KS test by using kstest2.
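
    As a brief sketch of the kstest2 alternative, the following code runs a two-sample KS test on simulated data modeled on the Sample1,Sample2 example values above. It assumes that Statistics and Machine Learning Toolbox is installed, because kstest2 and normrnd belong to that toolbox. The variable names x1 and x2 are chosen here for illustration.

```matlab
rng(0)                      % fix the seed so the draw is repeatable

% Simulated samples from two different normal distributions
x1 = normrnd(0,1,1,100);
x2 = normrnd(5,2,1,100);

% h is the test decision at the default 5% significance level,
% p is the p-value, and ks2stat is the two-sample KS test statistic
[h,p,ks2stat] = kstest2(x1,x2);
```

    Unlike risk.validation.kolmogorovSmirnov, kstest2 also returns a hypothesis test decision and a p-value, which can be useful when you need a formal test rather than only the statistic.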

    Version History

    Introduced in R2025a