fillmissing
Description
replaces missing values of the predictor sc
= fillmissing(sc
,PredictorNames
,Statistics
)PredictorNames
with values defined by Statistics
and returns an updated
credit scorecard object (sc
). Standard missing data is
defined as follows:
NaN
for numeric arrays<undefined>
for categorical arrays
Note
If you run fillmissing
after binning a predictor,
the existing cutpoints and bin edges are preserved and the "Good" and
"Bad" counts from the <missing>
bin are added to
the corresponding bin.
uses arguments from the previous syntax and a value for a
sc
= fillmissing(___,ConstantValue
)ConstantValue
to replace missing values.
Examples
Fill Missing Data in a creditscorecard
Object
This example shows how to use fillmissing
to replace missing values in the CustAge
and ResStatus
predictors with user-defined values. For additional information on alternative approaches for "treating" missing data, see Credit Scorecard Modeling with Missing Values.
Load the credit scorecard data and use dataMissing
for the training data.
load CreditCardData.mat
disp(head(dataMissing));
CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status ______ _______ ___________ ___________ _________ __________ _______ _______ _________ ________ ______ 1 53 62 <undefined> Unknown 50000 55 Yes 1055.9 0.22 0 2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0 3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0 4 NaN 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0 5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0 6 65 13 Home Owner Employed 48000 59 Yes 968.18 0.15 0 7 34 32 Home Owner Unknown 32000 26 Yes 717.82 0.02 1 8 50 57 Other Employed 51000 33 No 3041.2 0.13 0
Create a creditscorecard
object with 'BinMissingData'
set to true
.
sc = creditscorecard(dataMissing,'BinMissingData',true);
sc = autobinning(sc);
Use bininfo
and plotbins
to display the CustAge
and ResStatus
predictors with missing data.
bininfo(sc,'CustAge')
ans=10×6 table
Bin Good Bad Odds WOE InfoValue
_____________ ____ ___ ______ ________ __________
{'[-Inf,33)'} 69 52 1.3269 -0.42156 0.018993
{'[33,37)' } 63 45 1.4 -0.36795 0.012839
{'[37,40)' } 72 47 1.5319 -0.2779 0.0079824
{'[40,46)' } 172 89 1.9326 -0.04556 0.0004549
{'[46,48)' } 59 25 2.36 0.15424 0.0016199
{'[48,51)' } 99 41 2.4146 0.17713 0.0035449
{'[51,58)' } 157 62 2.5323 0.22469 0.0088407
{'[58,Inf]' } 93 25 3.72 0.60931 0.032198
{'<missing>'} 19 11 1.7273 -0.15787 0.00063885
{'Totals' } 803 397 2.0227 NaN 0.087112
plotbins(sc,'CustAge');
bininfo(sc,'ResStatus')
ans=5×6 table
Bin Good Bad Odds WOE InfoValue
______________ ____ ___ ______ _________ __________
{'Tenant' } 296 161 1.8385 -0.095463 0.0035249
{'Home Owner'} 352 171 2.0585 0.017549 0.00013382
{'Other' } 128 52 2.4615 0.19637 0.0055808
{'<missing>' } 27 13 2.0769 0.026469 2.3248e-05
{'Totals' } 803 397 2.0227 NaN 0.0092627
plotbins(sc,'ResStatus');
Use fillmissing
to replace NaN
values in CustAge
with the median value and to replace the <missing>
values in ResStatus
with 'Tenant'
. Use predictorinfo
to verify the filled values.
sc = fillmissing(sc,{'CustAge'},'median'); sc = fillmissing(sc,{'ResStatus'},'constant','Tenant'); predictorinfo(sc,'CustAge')
ans=1×4 table
PredictorType LatestBinning LatestFillMissingType LatestFillMissingValue
_____________ ________________________ _____________________ ______________________
CustAge {'Numeric'} {'Automatic / Monotone'} {'Median'} {[45]}
predictorinfo(sc,'ResStatus')
ans=1×5 table
PredictorType Ordinal LatestBinning LatestFillMissingType LatestFillMissingValue
_______________ _______ ________________________ _____________________ ______________________
ResStatus {'Categorical'} false {'Automatic / Monotone'} {'Constant'} {'Tenant'}
Use bininfo
and plotbins
to display the CustAge
and ResStatus
predictors to verify that the missing data has been replaced with the values defined by fillmissing
.
bininfo(sc,'CustAge')
ans=9×6 table
Bin Good Bad Odds WOE InfoValue
_____________ ____ ___ ______ _________ _________
{'[-Inf,33)'} 69 52 1.3269 -0.42156 0.018993
{'[33,37)' } 63 45 1.4 -0.36795 0.012839
{'[37,40)' } 72 47 1.5319 -0.2779 0.0079824
{'[40,46)' } 191 100 1.91 -0.057315 0.0008042
{'[46,48)' } 59 25 2.36 0.15424 0.0016199
{'[48,51)' } 99 41 2.4146 0.17713 0.0035449
{'[51,58)' } 157 62 2.5323 0.22469 0.0088407
{'[58,Inf]' } 93 25 3.72 0.60931 0.032198
{'Totals' } 803 397 2.0227 NaN 0.086822
plotbins(sc,'CustAge');
bininfo(sc,'ResStatus')
ans=4×6 table
Bin Good Bad Odds WOE InfoValue
______________ ____ ___ ______ _________ __________
{'Tenant' } 323 174 1.8563 -0.085821 0.0030935
{'Home Owner'} 352 171 2.0585 0.017549 0.00013382
{'Other' } 128 52 2.4615 0.19637 0.0055808
{'Totals' } 803 397 2.0227 NaN 0.0088081
plotbins(sc,'ResStatus');
Use fitmodel
and then run formatpoints
, displaypoints
, and score
.
sc = fitmodel(sc,'Display','off'); sc = formatpoints(sc,'WorstAndBest',[300 800]); t = displaypoints(sc)
t=31×3 table
Predictors Bin Points
______________ _________________ ______
{'CustAge' } {'[-Inf,33)' } 72.565
{'CustAge' } {'[33,37)' } 76.588
{'CustAge' } {'[37,40)' } 83.346
{'CustAge' } {'[40,46)' } 99.902
{'CustAge' } {'[46,48)' } 115.78
{'CustAge' } {'[48,51)' } 117.5
{'CustAge' } {'[51,58)' } 121.07
{'CustAge' } {'[58,Inf]' } 149.93
{'CustAge' } {'<missing>' } 99.902
{'EmpStatus' } {'Unknown' } 79.64
{'EmpStatus' } {'Employed' } 133.98
{'EmpStatus' } {'<missing>' } NaN
{'CustIncome'} {'[-Inf,29000)' } 21.926
{'CustIncome'} {'[29000,33000)'} 73.949
{'CustIncome'} {'[33000,35000)'} 97.117
{'CustIncome'} {'[35000,40000)'} 101.44
⋮
When a validation data set has missing values and you use fillmissing
with the training dataset, the missing values in the validation data set are assigned the same points as the corresponding bins containing the filled values.
As the table shows, the '<missing>'
bin for the CustAge
predictor is assigned the same points as the '[40,46)'
bin because the missing data is filled with the median value 45
.
The points assigned to the '<missing>'
bin for the EmpStatus
predictor are NaN
because fillmissing
is not used for that predictor. The assigned points are decided by the default 'NoScore'
for the 'Missing'
name-value pair argument in formatpoints
.
Create a test validation data set (tdata
) and add missing values.
tdata = data(1:10,:); tdata.CustAge(1) = NaN; tdata.ResStatus(2) = missing; [scr,pts] = score(sc,tdata)
scr = 10×1
566.7335
611.2547
584.5130
628.7876
609.7148
671.1048
403.6413
551.9461
575.9874
524.4789
pts=10×5 table
CustAge EmpStatus CustIncome TmWBank AMBalance
_______ _________ __________ _______ _________
99.902 79.64 153.88 145.38 87.933
149.93 133.98 153.88 85.531 87.933
115.78 133.98 101.44 145.38 87.933
117.5 133.98 153.88 83.991 139.44
149.93 133.98 153.88 83.991 87.933
149.93 133.98 153.88 145.38 87.933
76.588 79.64 73.949 85.531 87.933
117.5 133.98 153.88 85.531 61.06
117.5 79.64 153.88 85.531 139.44
117.5 79.64 153.88 85.531 87.933
Handling of Missing Values in Validation Data Sets
This example shows different possibilities for handling missing data in validation data.
When scoring data from a validation data set, you have several options. If you choose to do nothing, the points assigned to the missing data are NaN
, which comes from the default 'NoScore'
for the 'Missing'
name-value pair argument in formatpoints
.
If you want to score missing values of all the predictors with one consistent metric, you can use the options 'ZeroWOE'
, 'MinPoints'
, or 'MaxPoints'
for the 'Missing'
name-value pair argument in formatpoints
.
load CreditCardData.mat sc = creditscorecard(data); predictorinfo(sc,'CustAge')
ans=1×4 table
PredictorType LatestBinning LatestFillMissingType LatestFillMissingValue
_____________ _________________ _____________________ ______________________
CustAge {'Numeric'} {'Original Data'} {'Original'} {0x0 double}
predictorinfo(sc,'ResStatus')
ans=1×5 table
PredictorType Ordinal LatestBinning LatestFillMissingType LatestFillMissingValue
_______________ _______ _________________ _____________________ ______________________
ResStatus {'Categorical'} false {'Original Data'} {'Original'} {0x0 double}
sc = autobinning(sc); sc = fitmodel(sc,'display','off'); displaypoints(sc)
ans=37×3 table
Predictors Bin Points
______________ ________________ _________
{'CustAge' } {'[-Inf,33)' } -0.15894
{'CustAge' } {'[33,37)' } -0.14036
{'CustAge' } {'[37,40)' } -0.060323
{'CustAge' } {'[40,46)' } 0.046408
{'CustAge' } {'[46,48)' } 0.21445
{'CustAge' } {'[48,58)' } 0.23039
{'CustAge' } {'[58,Inf]' } 0.479
{'CustAge' } {'<missing>' } NaN
{'ResStatus' } {'Tenant' } -0.031252
{'ResStatus' } {'Home Owner' } 0.12696
{'ResStatus' } {'Other' } 0.37641
{'ResStatus' } {'<missing>' } NaN
{'EmpStatus' } {'Unknown' } -0.076317
{'EmpStatus' } {'Employed' } 0.31449
{'EmpStatus' } {'<missing>' } NaN
{'CustIncome'} {'[-Inf,29000)'} -0.45716
⋮
sc = formatpoints(sc,'Missing','minpoints','WorstAndBestScores',[300 850]); displaypoints(sc)
ans=37×3 table
Predictors Bin Points
______________ ________________ ______
{'CustAge' } {'[-Inf,33)' } 46.396
{'CustAge' } {'[33,37)' } 48.727
{'CustAge' } {'[37,40)' } 58.772
{'CustAge' } {'[40,46)' } 72.167
{'CustAge' } {'[46,48)' } 93.256
{'CustAge' } {'[48,58)' } 95.256
{'CustAge' } {'[58,Inf]' } 126.46
{'CustAge' } {'<missing>' } 46.396
{'ResStatus' } {'Tenant' } 62.421
{'ResStatus' } {'Home Owner' } 82.276
{'ResStatus' } {'Other' } 113.58
{'ResStatus' } {'<missing>' } 62.421
{'EmpStatus' } {'Unknown' } 56.765
{'EmpStatus' } {'Employed' } 105.81
{'EmpStatus' } {'<missing>' } 56.765
{'CustIncome'} {'[-Inf,29000)'} 8.9706
⋮
The value of -32.5389
for the <missing>
bin of 'CustAge'
comes from the 'minPoints'
argument for formatpoints
.
[scr,pts] = score(sc,dataMissing(1:5,:))
scr = 5×1
602.0394
648.1988
560.5569
613.5595
646.8109
pts=5×7 table
CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance
_______ _________ _________ __________ _______ _______ _________
95.256 62.421 56.765 121.18 116.05 86.224 64.15
126.46 82.276 105.81 121.18 62.107 86.224 64.15
93.256 62.421 105.81 76.585 116.05 42.287 64.15
46.396 82.276 105.81 121.18 60.719 86.224 110.96
126.46 82.276 105.81 121.18 60.719 86.224 64.15
Alternatively, you can score missing data for each individual predictor with a different statistic based on that predictor's information. To do so, use fillmissing
for a creditscorecard
object.
load CreditCardData.mat sc = creditscorecard(data); sc = fillmissing(sc,'CustAge','constant',35); predictorinfo(sc,'CustAge')
ans=1×4 table
PredictorType LatestBinning LatestFillMissingType LatestFillMissingValue
_____________ _________________ _____________________ ______________________
CustAge {'Numeric'} {'Original Data'} {'Constant'} {[35]}
sc = fillmissing(sc,'ResStatus','Mode'); predictorinfo(sc,'ResStatus')
ans=1×5 table
PredictorType Ordinal LatestBinning LatestFillMissingType LatestFillMissingValue
_______________ _______ _________________ _____________________ ______________________
ResStatus {'Categorical'} false {'Original Data'} {'Mode'} {'Home Owner'}
sc = autobinning(sc); sc = fitmodel(sc,'display','off'); sc = formatpoints(sc,'Missing','minpoints','WorstAndBestScores',[300 850]); displaypoints(sc)
ans=37×3 table
Predictors Bin Points
______________ ________________ ______
{'CustAge' } {'[-Inf,33)' } 46.396
{'CustAge' } {'[33,37)' } 48.727
{'CustAge' } {'[37,40)' } 58.772
{'CustAge' } {'[40,46)' } 72.167
{'CustAge' } {'[46,48)' } 93.256
{'CustAge' } {'[48,58)' } 95.256
{'CustAge' } {'[58,Inf]' } 126.46
{'CustAge' } {'<missing>' } 48.727
{'ResStatus' } {'Tenant' } 62.421
{'ResStatus' } {'Home Owner' } 82.276
{'ResStatus' } {'Other' } 113.58
{'ResStatus' } {'<missing>' } 82.276
{'EmpStatus' } {'Unknown' } 56.765
{'EmpStatus' } {'Employed' } 105.81
{'EmpStatus' } {'<missing>' } 56.765
{'CustIncome'} {'[-Inf,29000)'} 8.9706
⋮
The value of <missing>
for 'CustAge'
comes from the fill value of 35
even though the training data has no missing values.
disp(dataMissing(1:5,:));
CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status ______ _______ ___________ ___________ _________ __________ _______ _______ _________ ________ ______ 1 53 62 <undefined> Unknown 50000 55 Yes 1055.9 0.22 0 2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0 3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0 4 NaN 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0 5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0
[scr,pts] = score(sc,dataMissing(1:5,:))
scr = 5×1
621.8943
648.1988
560.5569
615.8904
646.8109
pts=5×7 table
CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance
_______ _________ _________ __________ _______ _______ _________
95.256 82.276 56.765 121.18 116.05 86.224 64.15
126.46 82.276 105.81 121.18 62.107 86.224 64.15
93.256 62.421 105.81 76.585 116.05 42.287 64.15
48.727 82.276 105.81 121.18 60.719 86.224 110.96
126.46 82.276 105.81 121.18 60.719 86.224 64.15
Input Arguments
sc
— Credit scorecard model
creditscorecard
object
Credit scorecard model, specified as a creditscorecard
object.
PredictorNames
— Name of creditscorecard
predictor whose missing data is filled
character vector | string | cell array of character vectors | string array
Name of creditscorecard
predictor to fill missing data
for, specified as a scalar character vector, scalar string, cell array of
character vectors, or string array.
Data Types: char
| string
| cell
Statistics
— Statistic to use to fill missing data for predictors
character vector with a value of 'mean'
,
'median'
, 'mode'
,
'original'
, or 'constant'
| string with a value of "mean"
,
"median"
, "mode"
,
"original"
, or "constant"
Statistic to use to fill missing data for the predictors, specified as a character vector or string with one of the following values.
'mean'
— Replace missing data with the average or mean value. The option is valid only for numeric data. The'mean'
calculates the weighted mean of the predictor by referring to the predictor column and theWeights
column from thecreditscorecard
object. For more information, see Weighted Mean.'median'
— Replace missing data with the median value. Valid for numeric and ordinal data. The'median'
calculates the weighted median of the predictor by referring to the predictor column and theWeights
column from thecreditscorecard
object. For more information, see Weighted Median.'mode'
— Replace missing data with the mode. Valid for numeric and both nominal and ordinal categorical data. The'mode'
calculates the weightedmode
of the predictor by referring to the predictor column and theWeights
column from thecreditscorecard
object. For more information, see Weighted Mode.'original'
— Set the missing data for numeric and categorical predictors back to its original value:NaN
if numeric,<undefined>
or<missing>
if categorical.'constant'
— Set the missing data for numeric and categorical predictors to a constant value that you specify in the optional argument forConstantValue
.
Data Types: char
| string
ConstantValue
— Value to fill missing entries in predictors specified in PredictorNames
[ ]
(default) | numeric | character vector | string | cell array of character vectors | string array
(Optional) Value to fill missing entries in predictors specified in
PredictorNames
, specified as a numeric value,
character vector, string, or cell array of character vectors.
Note
You can use ConstantValue
only if you set the
Statistics
argument to
'constant'
.
Data Types: char
| double
| string
| cell
Output Arguments
sc
— Updated creditscorecard
creditscorecard
object
Updated creditscorecard
object, returned as an
object.
More About
Weighted Mean
The weighted mean is similar to an ordinary mean except that instead of each of the data points contributing equally to the final average, some data points contribute more than others.
The weighted mean for a nonempty finite multiset of data (x) with corresponding nonnegative weights (w) is
Weighted Median
The weighted median is the 50% weighted percentile, where the percentage in the total weight is counted instead of the total number.
For n distinct ordered elements (x) positive weights (w) such that , the weighted median is the element xk:
In the case where the respective weights of both elements border the midpoint of the set of weights without encapsulating it, each element defines a partition equal to 1/2. These elements are referred to as the lower weighted median and upper weighted median. The weighted median is chosen based on which element keeps the partitions most equal. This median is always the weighted median with the lowest weight. In the event that the upper and lower weighted medians are equal, the lower weighted median is accepted.
Weighted Mode
The weighted mode of a set of weighted data values is the value that appears most often.
The mode of a sample is the element that occurs most often in the collection. For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
References
[1] “Basel Committee on Banking Supervision: Studies on the Validation of Internal Rating Systems.” Working Paper No. 14, February 2005.
[2] Refaat, M. Credit Risk Scorecards: Development and Implementation Using SAS. lulu.com, 2011.
[3] Loeffler, G. and Posch, P. N. Credit Risk Modeling Using Excel and VBA. Wiley Finance, 2007.
Version History
Introduced in R2020a
See Also
creditscorecard
| bininfo
| predictorinfo
| modifypredictor
| modifybins
| bindata
| plotbins
| fitmodel
| displaypoints
| formatpoints
| score
| setmodel
| probdefault
| validatemodel
| table
MATLAB Command
You clicked a link that corresponds to this MATLAB command:
Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list:
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)