Main Content

fillmissing

Replace missing values for credit scorecard predictors

Description

example

sc = fillmissing(sc,PredictorNames,Statistics) replaces missing values of the predictor PredictorNames with values defined by Statistics and returns an updated credit scorecard object (sc). Standard missing data is defined as follows:

  • NaN for numeric arrays

  • <undefined> for categorical arrays

Note

If you run fillmissing after binning a predictor, the existing cutpoints and bin edges are preserved and the "Good" and "Bad" counts from the <missing> bin are added to the corresponding bin.

example

sc = fillmissing(___,ConstantValue) uses arguments from the previous syntax and a value for a ConstantValue to replace missing values.

Examples

collapse all

This example shows how to use fillmissing to replace missing values in the CustAge and ResStatus predictors with user-defined values. For additional information on alternative approaches for "treating" missing data, see Credit Scorecard Modeling with Missing Values.

Load the credit scorecard data and use dataMissing for the training data.

load CreditCardData.mat
disp(head(dataMissing));
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
      6          65          13         Home Owner     Employed       48000         59         Yes       968.18        0.15        0   
      7          34          32         Home Owner     Unknown        32000         26         Yes       717.82        0.02        1   
      8          50          57         Other          Employed       51000         33         No        3041.2        0.13        0   

Create a creditscorecard object with 'BinMissingData' set to true.

sc = creditscorecard(dataMissing,'BinMissingData',true);
sc = autobinning(sc);

Use bininfo and plotbins to display the CustAge and ResStatus predictors with missing data.

bininfo(sc,'CustAge')
ans=10×6 table
         Bin         Good    Bad     Odds       WOE       InfoValue 
    _____________    ____    ___    ______    ________    __________

    {'[-Inf,33)'}     69      52    1.3269    -0.42156      0.018993
    {'[33,37)'  }     63      45       1.4    -0.36795      0.012839
    {'[37,40)'  }     72      47    1.5319     -0.2779     0.0079824
    {'[40,46)'  }    172      89    1.9326    -0.04556     0.0004549
    {'[46,48)'  }     59      25      2.36     0.15424     0.0016199
    {'[48,51)'  }     99      41    2.4146     0.17713     0.0035449
    {'[51,58)'  }    157      62    2.5323     0.22469     0.0088407
    {'[58,Inf]' }     93      25      3.72     0.60931      0.032198
    {'<missing>'}     19      11    1.7273    -0.15787    0.00063885
    {'Totals'   }    803     397    2.0227         NaN      0.087112

plotbins(sc,'CustAge');

bininfo(sc,'ResStatus')
ans=5×6 table
         Bin          Good    Bad     Odds        WOE       InfoValue 
    ______________    ____    ___    ______    _________    __________

    {'Tenant'    }    296     161    1.8385    -0.095463     0.0035249
    {'Home Owner'}    352     171    2.0585     0.017549    0.00013382
    {'Other'     }    128      52    2.4615      0.19637     0.0055808
    {'<missing>' }     27      13    2.0769     0.026469    2.3248e-05
    {'Totals'    }    803     397    2.0227          NaN     0.0092627

plotbins(sc,'ResStatus');

Use fillmissing to replace NaN values in CustAge with the median value and to replace the <missing> values in ResStatus with 'Tenant'. Use predictorinfo to verify the filled values.

sc = fillmissing(sc,{'CustAge'},'median');
sc = fillmissing(sc,{'ResStatus'},'constant','Tenant');
predictorinfo(sc,'CustAge')
ans=1×4 table
               PredictorType         LatestBinning          LatestFillMissingType    LatestFillMissingValue
               _____________    ________________________    _____________________    ______________________

    CustAge     {'Numeric'}     {'Automatic / Monotone'}         {'Median'}                  {[45]}        

predictorinfo(sc,'ResStatus')
ans=1×5 table
                  PredictorType     Ordinal         LatestBinning          LatestFillMissingType    LatestFillMissingValue
                 _______________    _______    ________________________    _____________________    ______________________

    ResStatus    {'Categorical'}     false     {'Automatic / Monotone'}        {'Constant'}               {'Tenant'}      

Use bininfo and plotbins to display the CustAge and ResStatus predictors to verify that the missing data has been replaced with the values defined by fillmissing.

bininfo(sc,'CustAge')
ans=9×6 table
         Bin         Good    Bad     Odds        WOE       InfoValue
    _____________    ____    ___    ______    _________    _________

    {'[-Inf,33)'}     69      52    1.3269     -0.42156     0.018993
    {'[33,37)'  }     63      45       1.4     -0.36795     0.012839
    {'[37,40)'  }     72      47    1.5319      -0.2779    0.0079824
    {'[40,46)'  }    191     100      1.91    -0.057315    0.0008042
    {'[46,48)'  }     59      25      2.36      0.15424    0.0016199
    {'[48,51)'  }     99      41    2.4146      0.17713    0.0035449
    {'[51,58)'  }    157      62    2.5323      0.22469    0.0088407
    {'[58,Inf]' }     93      25      3.72      0.60931     0.032198
    {'Totals'   }    803     397    2.0227          NaN     0.086822

plotbins(sc,'CustAge');

bininfo(sc,'ResStatus')
ans=4×6 table
         Bin          Good    Bad     Odds        WOE       InfoValue 
    ______________    ____    ___    ______    _________    __________

    {'Tenant'    }    323     174    1.8563    -0.085821     0.0030935
    {'Home Owner'}    352     171    2.0585     0.017549    0.00013382
    {'Other'     }    128      52    2.4615      0.19637     0.0055808
    {'Totals'    }    803     397    2.0227          NaN     0.0088081

plotbins(sc,'ResStatus');

Use fitmodel and then run formatpoints, displaypoints, and score.

sc = fitmodel(sc,'Display','off');
sc = formatpoints(sc,'WorstAndBest',[300 800]);
t = displaypoints(sc)
t=31×3 table
      Predictors             Bin           Points
    ______________    _________________    ______

    {'CustAge'   }    {'[-Inf,33)'    }    72.565
    {'CustAge'   }    {'[33,37)'      }    76.588
    {'CustAge'   }    {'[37,40)'      }    83.346
    {'CustAge'   }    {'[40,46)'      }    99.902
    {'CustAge'   }    {'[46,48)'      }    115.78
    {'CustAge'   }    {'[48,51)'      }     117.5
    {'CustAge'   }    {'[51,58)'      }    121.07
    {'CustAge'   }    {'[58,Inf]'     }    149.93
    {'CustAge'   }    {'<missing>'    }    99.902
    {'EmpStatus' }    {'Unknown'      }     79.64
    {'EmpStatus' }    {'Employed'     }    133.98
    {'EmpStatus' }    {'<missing>'    }       NaN
    {'CustIncome'}    {'[-Inf,29000)' }    21.926
    {'CustIncome'}    {'[29000,33000)'}    73.949
    {'CustIncome'}    {'[33000,35000)'}    97.117
    {'CustIncome'}    {'[35000,40000)'}    101.44
      ⋮

When a validation data set has missing values and you use fillmissing with the training dataset, the missing values in the validation data set are assigned the same points as the corresponding bins containing the filled values.

As the table shows, the '<missing>' bin for the CustAge predictor is assigned the same points as the '[40,46)' bin because the missing data is filled with the median value 45.

The points assigned to the '<missing>' bin for the EmpStatus predictor are NaN because fillmissing is not used for that predictor. The assigned points are decided by the default 'NoScore' for the 'Missing' name-value pair argument in formatpoints.

Create a test validation data set (tdata) and add missing values.

tdata = data(1:10,:);
tdata.CustAge(1) = NaN;
tdata.ResStatus(2) = '<undefined>';
[scr,pts] = score(sc,tdata)
scr = 10×1

  566.7335
  611.2547
  584.5130
  628.7876
  609.7148
  671.1048
  403.6413
  551.9461
  575.9874
  524.4789

pts=10×5 table
    CustAge    EmpStatus    CustIncome    TmWBank    AMBalance
    _______    _________    __________    _______    _________

    99.902       79.64        153.88      145.38      87.933  
    149.93      133.98        153.88      85.531      87.933  
    115.78      133.98        101.44      145.38      87.933  
     117.5      133.98        153.88      83.991      139.44  
    149.93      133.98        153.88      83.991      87.933  
    149.93      133.98        153.88      145.38      87.933  
    76.588       79.64        73.949      85.531      87.933  
     117.5      133.98        153.88      85.531       61.06  
     117.5       79.64        153.88      85.531      139.44  
     117.5       79.64        153.88      85.531      87.933  

This example shows different possibilities for handling missing data in validation data.

When scoring data from a validation data set, you have several options. If you choose to do nothing, the points assigned to the missing data are NaN, which comes from the default 'NoScore' for the 'Missing' name-value pair argument in formatpoints.

If you want to score missing values of all the predictors with one consistent metric, you can use the options 'ZeroWOE', 'MinPoints', or 'MaxPoints' for the 'Missing' name-value pair argument in formatpoints.

load CreditCardData.mat
sc = creditscorecard(data);
predictorinfo(sc,'CustAge')
ans=1×4 table
               PredictorType      LatestBinning      LatestFillMissingType    LatestFillMissingValue
               _____________    _________________    _____________________    ______________________

    CustAge     {'Numeric'}     {'Original Data'}        {'Original'}              {0x0 double}     

predictorinfo(sc,'ResStatus')
ans=1×5 table
                  PredictorType     Ordinal      LatestBinning      LatestFillMissingType    LatestFillMissingValue
                 _______________    _______    _________________    _____________________    ______________________

    ResStatus    {'Categorical'}     false     {'Original Data'}        {'Original'}              {0x0 double}     

sc = autobinning(sc);
sc = fitmodel(sc,'display','off');

displaypoints(sc)
ans=37×3 table
      Predictors            Bin            Points  
    ______________    ________________    _________

    {'CustAge'   }    {'[-Inf,33)'   }     -0.15894
    {'CustAge'   }    {'[33,37)'     }     -0.14036
    {'CustAge'   }    {'[37,40)'     }    -0.060323
    {'CustAge'   }    {'[40,46)'     }     0.046408
    {'CustAge'   }    {'[46,48)'     }      0.21445
    {'CustAge'   }    {'[48,58)'     }      0.23039
    {'CustAge'   }    {'[58,Inf]'    }        0.479
    {'CustAge'   }    {'<missing>'   }          NaN
    {'ResStatus' }    {'Tenant'      }    -0.031252
    {'ResStatus' }    {'Home Owner'  }      0.12696
    {'ResStatus' }    {'Other'       }      0.37641
    {'ResStatus' }    {'<missing>'   }          NaN
    {'EmpStatus' }    {'Unknown'     }    -0.076317
    {'EmpStatus' }    {'Employed'    }      0.31449
    {'EmpStatus' }    {'<missing>'   }          NaN
    {'CustIncome'}    {'[-Inf,29000)'}     -0.45716
      ⋮

sc = formatpoints(sc,'Missing','minpoints','WorstAndBestScores',[300 850]);
displaypoints(sc)
ans=37×3 table
      Predictors            Bin           Points
    ______________    ________________    ______

    {'CustAge'   }    {'[-Inf,33)'   }    46.396
    {'CustAge'   }    {'[33,37)'     }    48.727
    {'CustAge'   }    {'[37,40)'     }    58.772
    {'CustAge'   }    {'[40,46)'     }    72.167
    {'CustAge'   }    {'[46,48)'     }    93.256
    {'CustAge'   }    {'[48,58)'     }    95.256
    {'CustAge'   }    {'[58,Inf]'    }    126.46
    {'CustAge'   }    {'<missing>'   }    46.396
    {'ResStatus' }    {'Tenant'      }    62.421
    {'ResStatus' }    {'Home Owner'  }    82.276
    {'ResStatus' }    {'Other'       }    113.58
    {'ResStatus' }    {'<missing>'   }    62.421
    {'EmpStatus' }    {'Unknown'     }    56.765
    {'EmpStatus' }    {'Employed'    }    105.81
    {'EmpStatus' }    {'<missing>'   }    56.765
    {'CustIncome'}    {'[-Inf,29000)'}    8.9706
      ⋮

The value of -32.5389 for the <missing> bin of 'CustAge' comes from the 'minPoints' argument for formatpoints.

[scr,pts] = score(sc,dataMissing(1:5,:))
scr = 5×1

  602.0394
  648.1988
  560.5569
  613.5595
  646.8109

pts=5×7 table
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

    95.256      62.421       56.765        121.18      116.05     86.224       64.15  
    126.46      82.276       105.81        121.18      62.107     86.224       64.15  
    93.256      62.421       105.81        76.585      116.05     42.287       64.15  
    46.396      82.276       105.81        121.18      60.719     86.224      110.96  
    126.46      82.276       105.81        121.18      60.719     86.224       64.15  

Alternatively, you can score missing data for each individual predictor with a different statistic based on that predictor's information. To do so, use fillmissing for a creditscorecard object.

load CreditCardData.mat
sc = creditscorecard(data);
sc = fillmissing(sc,'CustAge','constant',35);
predictorinfo(sc,'CustAge')
ans=1×4 table
               PredictorType      LatestBinning      LatestFillMissingType    LatestFillMissingValue
               _____________    _________________    _____________________    ______________________

    CustAge     {'Numeric'}     {'Original Data'}        {'Constant'}                 {[35]}        

sc = fillmissing(sc,'ResStatus','Mode');
predictorinfo(sc,'ResStatus')
ans=1×5 table
                  PredictorType     Ordinal      LatestBinning      LatestFillMissingType    LatestFillMissingValue
                 _______________    _______    _________________    _____________________    ______________________

    ResStatus    {'Categorical'}     false     {'Original Data'}          {'Mode'}               {'Home Owner'}    

sc = autobinning(sc);
sc = fitmodel(sc,'display','off');
sc = formatpoints(sc,'Missing','minpoints','WorstAndBestScores',[300 850]);
 
displaypoints(sc)
ans=37×3 table
      Predictors            Bin           Points
    ______________    ________________    ______

    {'CustAge'   }    {'[-Inf,33)'   }    46.396
    {'CustAge'   }    {'[33,37)'     }    48.727
    {'CustAge'   }    {'[37,40)'     }    58.772
    {'CustAge'   }    {'[40,46)'     }    72.167
    {'CustAge'   }    {'[46,48)'     }    93.256
    {'CustAge'   }    {'[48,58)'     }    95.256
    {'CustAge'   }    {'[58,Inf]'    }    126.46
    {'CustAge'   }    {'<missing>'   }    48.727
    {'ResStatus' }    {'Tenant'      }    62.421
    {'ResStatus' }    {'Home Owner'  }    82.276
    {'ResStatus' }    {'Other'       }    113.58
    {'ResStatus' }    {'<missing>'   }    82.276
    {'EmpStatus' }    {'Unknown'     }    56.765
    {'EmpStatus' }    {'Employed'    }    105.81
    {'EmpStatus' }    {'<missing>'   }    56.765
    {'CustIncome'}    {'[-Inf,29000)'}    8.9706
      ⋮

The value of <missing> for 'CustAge' comes from the fill value of 35 even though the training data has no missing values.

disp(dataMissing(1:5,:));
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
[scr,pts] = score(sc,dataMissing(1:5,:))
scr = 5×1

  621.8943
  648.1988
  560.5569
  615.8904
  646.8109

pts=5×7 table
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

    95.256      82.276       56.765        121.18      116.05     86.224       64.15  
    126.46      82.276       105.81        121.18      62.107     86.224       64.15  
    93.256      62.421       105.81        76.585      116.05     42.287       64.15  
    48.727      82.276       105.81        121.18      60.719     86.224      110.96  
    126.46      82.276       105.81        121.18      60.719     86.224       64.15  

Input Arguments

collapse all

Credit scorecard model, specified as a creditscorecard object.

Name of creditscorecard predictor to fill missing data for, specified as a scalar character vector, scalar string, cell array of character vectors, or string array.

Data Types: char | string | cell

Statistic to use to fill missing data for the predictors, specified as a character vector or string with one of the following values.

  • 'mean' — Replace missing data with the average or mean value. The option is valid only for numeric data. The 'mean' calculates the weighted mean of the predictor by referring to the predictor column and the Weights column from the creditscorecard object. For more information, see Weighted Mean.

  • 'median' — Replace missing data with the median value. Valid for numeric and ordinal data. The 'median' calculates the weighted median of the predictor by referring to the predictor column and the Weights column from the creditscorecard object. For more information, see Weighted Median.

  • 'mode' — Replace missing data with the mode. Valid for numeric and both nominal and ordinal categorical data. The 'mode' calculates the weighted mode of the predictor by referring to the predictor column and the Weights column from the creditscorecard object. For more information, see Weighted Mode.

  • 'original' — Set the missing data for numeric and categorical predictors back to its original value: NaN if numeric, <undefined> or <missing> if categorical.

  • 'constant' — Set the missing data for numeric and categorical predictors to a constant value that you specify in the optional argument for ConstantValue.

Data Types: char | string

(Optional) Value to fill missing entries in predictors specified in PredictorNames, specified as a numeric value, character vector, string, or cell array of character vectors.

Note

You can use ConstantValue only if you set the Statistics argument to 'constant'.

Data Types: char | double | string | cell

Output Arguments

collapse all

Updated creditscorecard object, returned as an object.

More About

collapse all

Weighted Mean

The weighted mean is similar to an ordinary mean except that instead of each of the data points contributing equally to the final average, some data points contribute more than others.

The weighted mean for a nonempty finite multiset of data (x) with corresponding nonnegative weights (w) is

x¯=i=1nwixii=1nwi

Weighted Median

The weighted median is the 50% weighted percentile, where the percentage in the total weight is counted instead of the total number.

For n distinct ordered elements (x) positive weights (w) such that i=1nwi=1, the weighted median is the element xk:

i=1k1wi12 and i=k+1nwi12

In the case where the respective weights of both elements border the midpoint of the set of weights without encapsulating it, each element defines a partition equal to 1/2. These elements are referred to as the lower weighted median and upper weighted median. The weighted median is chosen based on which element keeps the partitions most equal. This median is always the weighted median with the lowest weight. In the event that the upper and lower weighted medians are equal, the lower weighted median is accepted.

Weighted Mode

The weighted mode of a set of weighted data values is the value that appears most often.

The mode of a sample is the element that occurs most often in the collection. For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.

References

[1] “Basel Committee on Banking Supervision: Studies on the Validation of Internal Rating Systems.” Working Paper No. 14, February 2005.

[2] Refaat, M. Credit Risk Scorecards: Development and Implementation Using SAS. lulu.com, 2011.

[3] Loeffler, G. and Posch, P. N. Credit Risk Modeling Using Excel and VBA. Wiley Finance, 2007.

Introduced in R2020a