shapley
Description
The Shapley value of a feature for a query point explains the deviation of the prediction for the query point from the average prediction, due to the feature. For each query point, the sum of the Shapley values for all features corresponds to the total deviation of the prediction from the average.
You can create a shapley object for a machine learning model with a specified query point or query points (queryPoints). The software creates an object and computes the Shapley values of all features for the query points.
Use the Shapley values to explain the contribution of individual features to a prediction at each specified query point. Use the plot function to display a bar graph of the Shapley values for one query point or the mean absolute Shapley values averaged across multiple query points. If you have multiple query points, you can use the boxchart, plotDependence, and swarmchart functions to visualize Shapley values. You can compute the Shapley values for new query points by using the fit function.
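The additivity property described above can be checked numerically. The following is a minimal sketch, not an official example: the toy data X and the linear scoring function f are hypothetical, and the code assumes Statistics and Machine Learning Toolbox R2024b or later (earlier releases name the property ShapleyValues and the argument QueryPoint). X has fewer than 100 rows so that shapley uses every observation.

```matlab
% Hypothetical toy setup: a linear scoring function used as the blackbox.
rng("default")                      % for reproducibility
X = rand(50,2);                     % 50 observations, 2 numeric predictors
f = @(X) 2*X(:,1) + 3*X(:,2);       % function handle returning a column vector

q = X(1,:);                         % explain the first observation
explainer = shapley(f,X,QueryPoints=q);

phi = explainer.Shapley{:,2};       % Shapley values, one per predictor
totalDeviation = explainer.BlackboxFitted - explainer.Intercept;

% The two quantities should agree up to numerical tolerance.
disp([sum(phi) totalDeviation])
```

With only two predictors, the software can evaluate every predictor subset, so the sum of the Shapley values is expected to match the deviation from the average prediction closely.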
Creation
Syntax
Description
explainer = shapley(___,QueryPoints=queryPoints) also computes the Shapley values for the query points queryPoints and stores the computed Shapley values in the Shapley property of explainer. You can specify queryPoints in addition to any of the input argument combinations in the previous syntaxes.
explainer = shapley(___,Name=Value) specifies additional options using one or more name-value arguments. For example, specify UseParallel=true to compute Shapley values in parallel.
Input Arguments
blackbox
— Machine learning model to be interpreted
regression model object | classification model object | function handle
Machine learning model to be interpreted, specified as a full or compact regression or classification model object or a function handle.
Full or compact model object — You can specify a full or compact regression or classification model object, which has a predict object function. The software uses the predict function to compute Shapley values.
If you specify a model object that does not contain predictor data (for example, a compact model), then you must provide the predictor data using X.
When you train a model, use a numeric matrix or table for the predictor data where rows correspond to individual observations.
shapley does not support a model object trained with more than one response variable.
Regression Model Object
Supported Model | Full or Compact Regression Model Object |
---|---|
Ensemble of regression models | RegressionEnsemble, RegressionBaggedEnsemble, CompactRegressionEnsemble |
Gaussian kernel regression model using random feature expansion | RegressionKernel |
Gaussian process regression | RegressionGP, CompactRegressionGP |
Generalized additive model | RegressionGAM, CompactRegressionGAM |
Linear regression for high-dimensional data | RegressionLinear |
Neural network regression model | RegressionNeuralNetwork, CompactRegressionNeuralNetwork |
Regression tree | RegressionTree, CompactRegressionTree |
Support vector machine regression | RegressionSVM, CompactRegressionSVM |
Classification Model Object
Supported Model | Full or Compact Classification Model Object |
---|---|
Discriminant analysis classifier | ClassificationDiscriminant, CompactClassificationDiscriminant |
Multiclass model for support vector machines or other classifiers | ClassificationECOC, CompactClassificationECOC |
Ensemble of learners for classification | ClassificationEnsemble, CompactClassificationEnsemble, ClassificationBaggedEnsemble |
Gaussian kernel classification model using random feature expansion | ClassificationKernel |
Generalized additive model | ClassificationGAM, CompactClassificationGAM |
k-nearest neighbor classifier | ClassificationKNN |
Linear classification model | ClassificationLinear |
Multiclass naive Bayes model | ClassificationNaiveBayes, CompactClassificationNaiveBayes |
Neural network classifier | ClassificationNeuralNetwork, CompactClassificationNeuralNetwork |
Support vector machine classifier for one-class and binary classification | ClassificationSVM, CompactClassificationSVM |
Binary decision tree for multiclass classification | ClassificationTree, CompactClassificationTree |
Function handle — You can specify a function handle that accepts predictor data and returns a column vector containing a prediction for each observation in the predictor data. The prediction is a predicted response for regression or a predicted score of a single class for classification. You must provide the predictor data using X.
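As an illustration of the function handle form for classification, the sketch below wraps a classifier so that the handle returns the score of a single class. The helper classScore and the choice of class index are hypothetical, and the property name Shapley assumes R2024b or later; treat this as one possible arrangement rather than the documented approach.

```matlab
load fisheriris                         % meas (150-by-4 numeric), species (labels)
mdl = fitctree(meas,species);           % any classifier with a predict function

classIdx = 2;                           % column of the score matrix to explain
f = @(X) classScore(mdl,X,classIdx);    % returns one score per observation

% A function handle carries no training data, so pass the predictor data.
explainer = shapley(f,meas,QueryPoints=meas(1,:));
disp(explainer.Shapley)

function s = classScore(mdl,X,classIdx)
    % Hypothetical helper: return the predicted score of a single class.
    [~,scores] = predict(mdl,X);
    s = scores(:,classIdx);
end
```

Because the handle returns a single column, the resulting table has one column of Shapley values, as it would for a regression model.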
X
— Predictor data
numeric matrix | table
Predictor data, specified as a numeric matrix or table. Each row of X corresponds to one observation, and each column corresponds to one variable.
For a numeric matrix:
The variables that make up the columns of X must have the same order as the predictor variables that trained blackbox, stored in blackbox.X.
If you trained blackbox using a table, then X can be a numeric matrix if the table contains all numeric predictor variables.
For a table:
If you trained blackbox using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those in Tbl. However, the column order of X does not need to correspond to the column order of Tbl.
If you trained blackbox using a numeric matrix, then the predictor names in blackbox.PredictorNames and the corresponding predictor variable names in X must be the same. To specify predictor names during training, use the PredictorNames name-value argument. All predictor variables in X must be numeric vectors.
X can contain additional variables (response variables, observation weights, and so on), but shapley ignores them.
shapley does not support multicolumn variables or cell arrays other than cell arrays of character vectors.
If blackbox is a function handle or a model object that does not contain predictor data, you must provide X. If blackbox is a full machine learning model object and you specify this argument, then shapley does not use the predictor data in blackbox; it uses the specified predictor data only.
Data Types: single | double
queryPoints
— Query points
numeric matrix | table
Query points at which shapley explains a prediction, specified as a numeric matrix or a table. Each row of queryPoints corresponds to one query point.
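For a numeric matrix:
The variables that make up the columns of queryPoints must have the same order as the predictor variables that trained blackbox.
If you trained blackbox using a table, then queryPoints can be a numeric matrix if the table contains all numeric predictor variables.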
For a table:
If you trained blackbox using a table (for example, Tbl), then all predictor variables in queryPoints must have the same variable names and data types as those in Tbl. However, the column order of queryPoints does not need to correspond to the column order of Tbl.
If you trained blackbox using a numeric matrix, then the predictor names in blackbox.PredictorNames and the corresponding predictor variable names in queryPoints must be the same. To specify predictor names during training, use the PredictorNames name-value argument. All predictor variables in queryPoints must be numeric vectors.
queryPoints can contain additional variables (response variables, observation weights, and so on), but shapley ignores them.
shapley does not support multicolumn variables or cell arrays other than cell arrays of character vectors.
If queryPoints contains NaNs for continuous predictors and Method is "conditional", then the Shapley values (Shapley) in the returned object are NaNs. If you use a regression model that is a Gaussian process regression (GPR), kernel, linear, neural network, or support vector machine (SVM) model, then shapley returns NaN Shapley values for query points that contain missing predictor values or categories not seen during training. For all other models, shapley handles missing values in queryPoints in the same way as blackbox (that is, the predict object function of blackbox or the function handle specified by blackbox).
Before R2024a: You can specify only one query point using QueryPoint=queryPoint, where queryPoint is a row vector of numeric values or a single-row table.
Example: blackbox.X(1,:) specifies the query point as the first observation of the predictor data in the full machine learning model blackbox.
Data Types: single | double | table
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: shapley(blackbox,QueryPoints=q,Method="conditional") creates a shapley object and computes the Shapley values for the query point q using the extension to the Kernel SHAP algorithm.
CategoricalPredictors
— Categorical predictors list
vector of positive integers | logical vector | character matrix | string array | cell array of character vectors | "all"
Categorical predictors list, specified as one of the values in this table.
Value | Description |
---|---|
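Vector of positive integers | Each entry in the vector is an index value indicating that the corresponding predictor is categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. |
Logical vector | A true entry means that the corresponding predictor is categorical. The length of the vector is p. |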
Character matrix | Each row of the matrix is the name of a predictor variable. The names must match the variable names of the predictor data in the form of a table. Pad the names with extra blanks so each row of the character matrix has the same length. |
String array or cell array of character vectors | Each element in the array is the name of a predictor variable. The names must match the variable names of the predictor data in the form of a table. |
"all" | All predictors are categorical. |
If you specify blackbox as a function handle, then shapley identifies categorical predictors from the predictor data X. If the predictor data is in a table, shapley assumes that a variable is categorical if it is a logical vector, unordered categorical vector, character array, string array, or cell array of character vectors. If the predictor data is a matrix, shapley assumes that all predictors are continuous. To identify any other predictors as categorical predictors, specify them by using the CategoricalPredictors name-value argument.
If you specify blackbox as a regression or classification model object, then shapley identifies categorical predictors by using the CategoricalPredictors property of the model object.
shapley supports an ordered categorical predictor when blackbox supports ordered categorical predictors and you specify Method as "interventional".
Example: CategoricalPredictors="all"
Data Types: single | double | logical | char | string | cell
MaxNumSubsets
— Maximum number of predictor subsets
min(2^M,1024) where M is the number of predictors (default) | positive integer
Maximum number of predictor subsets to use for Shapley value computations, specified as a positive integer.
For details on how shapley chooses the subsets to use, see Computational Cost.
This argument is valid only when shapley uses the Kernel SHAP algorithm or the extension to the Kernel SHAP algorithm. If you set the MaxNumSubsets argument when Method is "interventional", the software uses the Kernel SHAP algorithm. For more information, see Algorithms.
Example: MaxNumSubsets=100
Data Types: single | double
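The following sketch shows one way to cap the number of predictor subsets and then inspect which algorithm and how many subsets the software used. The model, the query point, and the cap of 8 are illustrative choices, not recommendations; the code assumes the carsmall sample data set and R2024a or later for the QueryPoints argument.

```matlab
load carsmall
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,"MPG");

% Cap the number of predictor subsets. Per the description above, setting
% MaxNumSubsets with the interventional method selects the Kernel SHAP algorithm.
explainer = shapley(mdl,QueryPoints=tbl(1,:),MaxNumSubsets=8);

disp(explainer.Method)       % algorithm that the software selected
disp(explainer.NumSubsets)   % number of predictor subsets in use
```

A smaller cap trades accuracy of the Shapley value estimates for computation time, so it is mainly useful when the number of predictors is large.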
Method
— Shapley value computation algorithm
"interventional" (default) | "conditional"
Shapley value computation algorithm, specified as "interventional" or "conditional".
"interventional" (default) — shapley computes the Shapley values with an interventional value function. shapley offers three interventional algorithms: Kernel SHAP [1], Linear SHAP [1], and Tree SHAP [2]. The software selects an algorithm based on the machine learning model blackbox and other specified options. For details, see Interventional Algorithms.
"conditional" — shapley uses the extension to the Kernel SHAP algorithm [3] with a conditional value function.
The Method property stores the name of the selected algorithm. For more information, see Algorithms.
Before R2023a: You can specify this argument as "interventional-kernel" or "conditional-kernel". shapley supports the Kernel SHAP algorithm and the extension of the Kernel SHAP algorithm.
Example: Method="conditional"
Data Types: char | string
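To see which algorithm the software selects for a given model, you can create the object and read its Method property. The sketch below is illustrative only: the two models are arbitrary choices, and the algorithm chosen for the tree model can depend on the release and the data.

```matlab
load carsmall
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));

treeMdl = fitrtree(tbl,"MPG");
svmMdl  = fitrsvm(tbl,"MPG",Standardize=true);

% No query points are specified, so the objects are created without
% computing Shapley values. The Method property is still populated.
treeExplainer = shapley(treeMdl);
svmExplainer  = shapley(svmMdl,Method="conditional");

disp(treeExplainer.Method)   % one of the interventional algorithms
disp(svmExplainer.Method)    % the extension to Kernel SHAP
```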
NumObservationsToSample
— Number of observations to sample from predictor data
100 (default) | "all" | positive integer scalar
Since R2024b
Number of observations to sample from the predictor data, specified as "all" or a positive integer scalar. A value of "all" indicates to use all observations in the predictor data X to compute Shapley values. A value of n indicates to use at most n observations randomly sampled from X. To see the sampled observations, use the SampledObservationIndices property.
Example: NumObservationsToSample="all"
Data Types: single | double | char | string
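As a rough sketch of how the sampling argument and the SampledObservationIndices property relate, the code below samples 25 observations from a hypothetical data matrix. The function handle f and the data X are made up for illustration, and the code requires R2024b or later.

```matlab
rng("default")                       % the sampling is random
X = rand(500,3);                     % hypothetical predictor data
f = @(X) X(:,1) + X(:,2).*X(:,3);    % hypothetical blackbox function

explainer = shapley(f,X,NumObservationsToSample=25);

idx = explainer.SampledObservationIndices;
disp(numel(idx))                     % at most 25
sampledX = explainer.X(idx,:);       % the sampled observations
disp(size(explainer.X))              % X still stores all observations
```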
OutputFcn
— Function called after each query point evaluation
[] (default) | function handle
Since R2024a
Function called after each query point evaluation, specified as a function handle. An output function can perform various tasks, such as stopping Shapley value computations, creating variables, or plotting results. For details and examples on how to write your own output functions, see Shapley Output Functions.
This argument is valid only when the shapley function computes Shapley values for multiple query points.
Data Types: function_handle
UseParallel
— Flag to run in parallel
false (default) | true
Flag to run in parallel, specified as a numeric or logical 1 (true) or 0 (false). If you specify UseParallel=true, the shapley function executes for-loop iterations by using parfor. The loop runs in parallel when you have Parallel Computing Toolbox™.
This argument is valid only when the shapley function computes Shapley values for multiple query points, or computes Shapley values for one query point by using the Tree SHAP algorithm for an ensemble of trees, the Kernel SHAP algorithm, or the extension to the Kernel SHAP algorithm.
Example: UseParallel=true
Data Types: logical
Properties
BlackboxModel
— Machine learning model to be interpreted
regression model object | classification model object | function handle
This property is read-only.
Machine learning model to be interpreted, specified as a regression or classification model object or a function handle.
The blackbox argument sets this property.
BlackboxFitted
— Predictions for query points computed by machine learning model
vector | []
This property is read-only.
Predictions for the query points computed by the machine learning model (BlackboxModel), specified as a vector.
If BlackboxModel is a model object, then BlackboxFitted contains predicted responses for regression or classified labels for classification.
If BlackboxModel is a function handle, then BlackboxFitted contains values returned by the function handle, either predicted responses for regression or predicted scores for classification.
The BlackboxFitted property is empty if you do not specify query points.
CategoricalPredictors
— Categorical predictor indices
vector of positive integers | []
This property is read-only.
Categorical predictor indices, specified as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).
If you specify blackbox using a function handle, then shapley identifies categorical predictors from the predictor data X. If you specify the CategoricalPredictors name-value argument, then the argument sets this property.
If you specify blackbox as a regression or classification model object, then shapley determines this property by using the CategoricalPredictors property of the model object.
shapley supports an ordered categorical predictor when blackbox supports ordered categorical predictors and when you specify Method as "interventional".
Intercept
— Average prediction
numeric vector | numeric scalar
This property is read-only.
Average prediction, averaged over the predictor data X, specified as a numeric vector or numeric scalar.
If BlackboxModel is a classification model object, then Intercept is a vector of the average classification scores for each class.
If BlackboxModel is a regression model object, then Intercept is a scalar of the average response.
If BlackboxModel is a function handle, then Intercept is a scalar of the average function evaluation.
For a query point, the sum of the Shapley values for all features corresponds to the total deviation of the prediction from the average (Intercept).
MeanAbsoluteShapley
— Mean absolute Shapley values
table | []
Since R2024a
This property is read-only.
Mean absolute Shapley values, specified as a table. The mean is taken over all query points (QueryPoints).
For regression, the table has two columns. The first column contains the predictor variable names, and the second column contains the mean absolute Shapley values of the predictors.
For classification, the table has two or more columns, depending on the number of classes in BlackboxModel. The first column contains the predictor variable names, and the rest of the columns contain the mean absolute Shapley values of the predictors for each class.
The MeanAbsoluteShapley property is empty if you do not specify query points.
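The relationship between the Shapley and MeanAbsoluteShapley properties can be sketched as follows. This example rests on two assumptions that are not stated in the text above: that for multiple query points the second variable of the Shapley table holds one column per query point, and that both tables list the predictors in the same order. The data X and function handle f are hypothetical, and the code requires R2024b or later.

```matlab
rng("default")
X = rand(60,3);                              % hypothetical predictor data
f = @(X) X(:,1) - 2*X(:,2) + 0.5*X(:,3);     % hypothetical regression-style blackbox

explainer = shapley(f,X,QueryPoints=X(1:10,:));

% Assumed layout: rows are predictors, columns are query points.
vals = explainer.Shapley{:,2};
manualMean = mean(abs(vals),2);              % average |Shapley value| per predictor

% Compare with the values that the object stores.
disp([manualMean explainer.MeanAbsoluteShapley{:,2}])
```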
Method
— Shapley value computation algorithm
"interventional-linear" | "interventional-tree" | "interventional-kernel" | "interventional-mix" | "conditional-kernel"
This property is read-only.
Shapley value computation algorithm, specified as "interventional-linear", "interventional-tree", "interventional-kernel", "interventional-mix", or "conditional-kernel".
"interventional-linear" — shapley uses the Linear SHAP algorithm [1] with an interventional value function. That is, shapley computes interventional Shapley values using the estimated coefficients for linear models.
"interventional-tree" — shapley uses the Tree SHAP algorithm [2] with an interventional value function.
"interventional-kernel" — shapley uses the Kernel SHAP algorithm [1] with an interventional value function.
"interventional-mix" — shapley might not use the same Shapley value computation algorithm for all query points. That is, shapley might use the Tree SHAP algorithm with an interventional value function to compute Shapley values for some query points, and use the Kernel SHAP algorithm with an interventional value function to compute Shapley values for other query points. (since R2024a) For an example that shows how to find the method information for specific query points, see Find Method Used for Individual Shapley Value Computations.
"conditional-kernel" — shapley uses the extension to the Kernel SHAP algorithm [3] with a conditional value function.
The Method argument of shapley or the Method argument of fit sets this property.
For more information, see Algorithms.
NumSubsets
— Number of predictor subsets
positive integer
This property is read-only.
Number of predictor subsets to use for Shapley value computations, specified as a positive integer.
The MaxNumSubsets argument of shapley or the MaxNumSubsets argument of fit sets this property.
For details on how shapley chooses the subsets to use, see Computational Cost.
QueryPoints
— Query points
numeric matrix | table | []
This property is read-only.
Query points at which shapley explains predictions using the Shapley values (Shapley), specified as a numeric matrix or a table.
The QueryPoints=queryPoints name-value argument of shapley or the queryPoints argument of fit sets this property.
SampledObservationIndices
— Indices of observations sampled from predictor data
numeric vector
Since R2024b
This property is read-only.
Indices of the observations sampled from the predictor data X, specified as a numeric vector. To see the sampled observations, use explainer.X(explainer.SampledObservationIndices,:).
The NumObservationsToSample name-value argument of shapley sets this property.
Shapley
— Shapley values for query points
table | []
This property is read-only.
Shapley values for the query points (QueryPoints), specified as a table.
For regression, the table has two columns. The first column contains the predictor variable names, and the second column contains the Shapley values of the predictors.
For classification, the table has two or more columns, depending on the number of classes in BlackboxModel. The first column contains the predictor variable names, and the rest of the columns contain the Shapley values of the predictors for each class.
The Shapley property is empty if you do not specify query points.
For an example that shows how to find Shapley values for one query point after fitting multiple query points, see Investigate One Query Point After Fitting Multiple Query Points.
Before R2024b: The Shapley property is named ShapleyValues.
X
— Predictor data
numeric matrix | table
This property is read-only.
Predictor data, specified as a numeric matrix or table.
Each row of X corresponds to one observation, and each column corresponds to one variable.
If an observation contains NaNs for continuous predictors and Method is "conditional-kernel", then shapley does not use the observation for the Shapley value computation. Similarly, if an observation contains missing predictor values or categories not seen during training, and BlackboxModel is a regression model of type GPR, kernel, linear, neural network, or SVM, then shapley omits the observation from the Shapley value computation. Otherwise, shapley handles missing values in X in the same way as BlackboxModel (that is, the predict object function of BlackboxModel or the function handle specified by BlackboxModel).
shapley stores all observations, including the rows with missing values, in this property.
Object Functions
fit | Compute Shapley values for query points |
plot | Plot Shapley values using bar graphs |
plotDependence | Plot dependence of Shapley values on predictor values |
boxchart | Visualize Shapley values using box charts (box plots) |
swarmchart | Visualize Shapley values using swarm scatter charts |
Examples
Compute Shapley Values When Creating shapley Object
Train a classification model and create a shapley object. When you create a shapley object, specify a query point so that the software computes the Shapley values for the query point. Then create a bar graph of the Shapley values by using the object function plot.
Load the CreditRating_Historical data set. The data set contains customer IDs and their financial ratios, industry labels, and credit ratings.
tbl = readtable("CreditRating_Historical.dat");
Display the first three rows of the table.
head(tbl,3)
     ID      WC_TA    RE_TA    EBIT_TA    MVE_BVTD    S_TA     Industry    Rating
    _____    _____    _____    _______    ________    _____    ________    ______
    62394    0.013    0.104     0.036      0.447      0.142       3        {'BB'}
    48608    0.232    0.335     0.062      1.969      0.281       8        {'A' }
    42444    0.311    0.367     0.074      1.935      0.366       1        {'A' }
Train a blackbox model of credit ratings by using the fitcecoc function. Use the variables from the second through seventh columns in tbl as the predictor variables. A recommended practice is to specify the class names to set the order of the classes.
blackbox = fitcecoc(tbl,"Rating", ...
    PredictorNames=tbl.Properties.VariableNames(2:7), ...
    CategoricalPredictors="Industry", ...
    ClassNames={'AAA','AA','A','BBB','BB','B','CCC'});
Create a shapley object that explains the prediction for the last observation. Specify a query point so that the software computes Shapley values and stores them in the Shapley property.
queryPoint = tbl(end,:)
queryPoint=1×8 table
ID WC_TA RE_TA EBIT_TA MVE_BVTD S_TA Industry Rating
_____ _____ _____ _______ ________ ____ ________ ______
73104 0.239 0.463 0.065 2.924 0.34 2 {'AA'}
explainer = shapley(blackbox,QueryPoints=queryPoint)
explainer = 
  shapley explainer with the following local Shapley values:

    Predictor        AAA          AA           A            BBB           BB            B           CCC
    __________    _________    __________    __________    __________    ___________    __________    __________
    "WC_TA"        0.054853      0.022849     0.0082629     3.418e-07      -0.031172     -0.045745     -0.044031
    "RE_TA"         0.17254      0.093639      0.048798     -0.015662      -0.097291      -0.22498      -0.31434
    "EBIT_TA"     0.0012558     0.0005285    0.00038919    5.0004e-05    -0.00076196    -0.0014544    -0.0012907
    "MVE_BVTD"       1.3942        1.3051       0.53214      -0.27713       -0.88189       -1.1197      -0.87933
    "S_TA"        -0.012379    -0.0080417    0.00013755    -0.0020191    -0.00019923     0.0018047    -0.0026414
    "Industry"      -0.1102     -0.057898    -0.0019888       0.08099       0.097352       0.11483       0.16764

  Properties, Methods
By default, shapley subsamples 100 observations from the data in blackbox.X to compute the Shapley values. For faster computation, use a smaller sample of the training set or specify UseParallel as true.
For a classification model, shapley computes Shapley values using the predicted class score for each class. Display the values in the Shapley property.
explainer.Shapley
ans=6×8 table
Predictor AAA AA A BBB BB B CCC
__________ _________ __________ __________ __________ ___________ __________ __________
"WC_TA" 0.054853 0.022849 0.0082629 3.418e-07 -0.031172 -0.045745 -0.044031
"RE_TA" 0.17254 0.093639 0.048798 -0.015662 -0.097291 -0.22498 -0.31434
"EBIT_TA" 0.0012558 0.0005285 0.00038919 5.0004e-05 -0.00076196 -0.0014544 -0.0012907
"MVE_BVTD" 1.3942 1.3051 0.53214 -0.27713 -0.88189 -1.1197 -0.87933
"S_TA" -0.012379 -0.0080417 0.00013755 -0.0020191 -0.00019923 0.0018047 -0.0026414
"Industry" -0.1102 -0.057898 -0.0019888 0.08099 0.097352 0.11483 0.16764
The Shapley property contains the Shapley values of all features for each class.
Plot the Shapley values for the predicted class by using the plot function.
plot(explainer)
The horizontal bar graph shows the Shapley values for all variables, sorted by their absolute values. Each Shapley value explains the deviation of the score for the query point from the average score of the predicted class, due to the corresponding variable.
Create shapley Object and Compute Shapley Values Using fit
Train a regression model and create a shapley object. When you create a shapley object, if you do not specify query points, then the software does not compute Shapley values. Use the object function fit to compute the Shapley values for a specified query point. Then create a bar graph of the Shapley values by using the object function plot.
Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s.
load carbig
Create a table containing the predictor variables Acceleration, Cylinders, and so on, as well as the response variable MPG.
tbl = table(Acceleration,Cylinders,Displacement, ...
Horsepower,Model_Year,Weight,MPG);
Removing missing values in a training set can help reduce memory consumption and speed up training for the fitrkernel function. Remove missing values in tbl.
tbl = rmmissing(tbl);
Train a blackbox model of MPG by using the fitrkernel function. Specify the Cylinders and Model_Year variables as categorical predictors. Standardize the remaining predictors.
rng("default") % For reproducibility
mdl = fitrkernel(tbl,"MPG",CategoricalPredictors=[2 5], ...
    Standardize=true);
Create a shapley object. Specify the data set tbl, because mdl does not contain training data.
explainer = shapley(mdl,tbl)
explainer = 
            BlackboxModel: [1×1 RegressionKernel]
              QueryPoints: []
           BlackboxFitted: []
                  Shapley: []
                        X: [392×7 table]
    CategoricalPredictors: [2 5]
                   Method: "interventional-kernel"
                Intercept: 23.2474
               NumSubsets: 64
explainer stores the training data tbl in the X property. By default, shapley subsamples 100 observations from the data in X and stores their indices in the SampledObservationIndices property.
Compute the Shapley values of all predictor variables for the first observation in tbl. The fit object function uses the sampled observations rather than all of X to compute the Shapley values.
queryPoint = tbl(1,:)
queryPoint=1×7 table
Acceleration Cylinders Displacement Horsepower Model_Year Weight MPG
____________ _________ ____________ __________ __________ ______ ___
12 8 307 130 70 3504 18
explainer = fit(explainer,queryPoint);
For a regression model, fit computes Shapley values using the predicted response, and stores them in the Shapley property of the shapley object. Display the values in the Shapley property.
explainer.Shapley
ans=6×2 table
Predictor Value
______________ ________
"Acceleration" -0.33821
"Cylinders" -0.97631
"Displacement" -1.1425
"Horsepower" -0.62927
"Model_Year" -0.17268
"Weight" -0.87595
Plot the Shapley values for the query point by using the plot function.
plot(explainer)
The horizontal bar graph shows the Shapley values for all variables, sorted by their absolute values. Each Shapley value explains the deviation of the prediction for the query point from the average, due to the corresponding variable.
Investigate One Query Point After Fitting Multiple Query Points
Train a classification model and create a shapley object. Visualize the Shapley values for multiple query points by using the swarmchart object function. Find the Shapley values for any query points of interest.
Load the fisheriris data set, which contains measurements for 150 irises, and create a table. SepalLength, SepalWidth, PetalLength, and PetalWidth are the predictor variables, and Species is the response variable.
fisheriris = readtable("fisheriris.csv");
Partition the data into training and test sets. Use 75% of the observations to create the training set and 25% of the observations to create the test set.
rng("default")
c = cvpartition(fisheriris.Species,"Holdout",0.25);
trainTbl = fisheriris(training(c),:);
testTbl = fisheriris(test(c),:);
Train a classification model by using the fitcnet function. Standardize the predictor variables, and specify the order of the classes.
mdl = fitcnet(trainTbl,"Species",Standardize=true, ...
    ClassNames={'setosa','versicolor','virginica'});
Create a shapley object that explains the predictions for multiple query points. Use the test set data to compute the Shapley values, and specify the observations in the test set as the query points.
explainer = shapley(mdl,testTbl,QueryPoints=testTbl)
explainer = 
  shapley explainer with the following mean absolute Shapley values:

      Predictor       setosa     versicolor    virginica
    _____________    ________    __________    _________
    "SepalLength"     0.12466      0.12539      0.066055
    "SepalWidth"     0.027488      0.03004      0.016665
    "PetalLength"     0.17226      0.14164       0.18777
    "PetalWidth"      0.11795      0.17135       0.23687

  Properties, Methods
For a classification model, shapley computes the Shapley values using the predicted class scores, and stores them in the Shapley property. Because explainer contains Shapley values for multiple query points, the object display shows the mean absolute Shapley values by default.
For each predictor and class, the mean absolute Shapley value is the absolute value of the Shapley values, averaged across all query points.
Visualize the distribution of the Shapley values for the default class (setosa) by using the swarmchart object function.
swarmchart(explainer)
For each predictor, the function displays the Shapley values for the query points. The corresponding swarm chart shows the distribution of the Shapley values. The function determines the order of the predictors by using the mean absolute Shapley values.
Find the observation with the lowest SepalWidth Shapley value for class setosa. Use data tips to find the index of the observation in the set of query points.
The query point is the 17th observation in the set of query points.
Find the observation's Shapley values in the Shapley property of explainer.
First, define a custom function named localShapley that returns a table of Shapley values for the observation with the specified query point index (queryPointIndex) in the specified shapley object (explainer).
function queryPointTbl = localShapley(explainer,queryPointIndex)
    tbl = explainer.Shapley(:,2:end);
    queryPointTbl = varfun(@(x)x(:,queryPointIndex),tbl);
    queryPointTbl.Properties.VariableNames = tbl.Properties.VariableNames;
    queryPointTbl = [explainer.Shapley(:,1) queryPointTbl];
end
Return the Shapley values for the query point with index 17.
results = localShapley(explainer,17)
results=4×4 table
Predictor setosa versicolor virginica
_____________ ________ __________ _________
"SepalLength" 0.06193 -0.028438 -0.033492
"SepalWidth" -0.1135 0.088441 0.02506
"PetalLength" -0.1543 0.31506 -0.16076
"PetalWidth" -0.11846 0.35579 -0.23734
Plot the query point Shapley values using the plot object function.
plot(explainer,QueryPointIndices=17)
By default, the function plots the Shapley values for the versicolor class because it is the predicted class for the query point.
Specify Blackbox Model Using Function Handle
Train a regression model and create a shapley object using a function handle to the predict function of the model. Use the object function fit to compute the Shapley values for the specified query point. Then plot the Shapley values by using the object function plot.
Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s.
load carbig
Create a table containing the predictor variables Acceleration, Cylinders, and so on.
tbl = table(Acceleration,Cylinders,Displacement, ...
Horsepower,Model_Year,Weight);
Train a blackbox model of MPG by using the TreeBagger function.
rng("default") % For reproducibility
Mdl = TreeBagger(100,tbl,MPG,Method="regression", ...
    CategoricalPredictors=[2 5]);
shapley does not support a TreeBagger object directly, so you cannot specify the first input argument (blackbox model) of shapley as a TreeBagger object. Instead, you can use a function handle to the predict function. You can also specify options of the predict function using name-value arguments of the function.
Create the function handle to the predict function of the TreeBagger object Mdl. Specify the array of tree indices to use as 1:50.
f = @(tbl) predict(Mdl,tbl,Trees=1:50);
Create a shapley object using the function handle f. When you specify a blackbox model as a function handle, you must provide the predictor data. tbl includes categorical predictors (Cylinders and Model_Year) with the double data type. By default, shapley does not treat variables with the double data type as categorical predictors. Specify the second (Cylinders) and fifth (Model_Year) variables as categorical predictors.
explainer = shapley(f,tbl,CategoricalPredictors=[2 5]);
explainer = fit(explainer,tbl(1,:));
Plot the Shapley values.
plot(explainer)
More About
Shapley Values
In game theory, the Shapley value of a player is the average marginal contribution of the player in a cooperative game. In the context of machine learning prediction, the Shapley value of a feature for a query point explains the contribution of the feature to a prediction (response for regression or score of each class for classification) at the specified query point.
The Shapley value of a feature for a query point is the contribution of the feature to the deviation from the average prediction. For a query point, the sum of the Shapley values for all features corresponds to the total deviation of the prediction from the average. That is, the sum of the average prediction and the Shapley values for all features corresponds to the prediction for the query point.
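The additivity property described above can be checked numerically on a toy problem. The following is a minimal sketch, assuming a hypothetical three-feature model f, a small background data set X, and a query point x, none of which come from the examples on this page. It enumerates every feature coalition and uses an interventional-style value function (features outside the coalition take their values from the rows of X). This brute-force enumeration is for illustration only and is not necessarily how the shapley function computes its values.

```matlab
% Hypothetical model and data (illustrative only).
f = @(Z) 2*Z(:,1) + Z(:,2).*Z(:,3);   % stand-in for a blackbox predict function
X = [0 1 2; 1 0 1; 2 2 0; 1 1 1];     % background (predictor) data
x = [1 2 3];                          % query point
M = numel(x);                         % number of features

% Value of a coalition S: average prediction when the features in S are fixed
% at the query point values and the remaining features come from the rows of X.
inS = @(S) ismember(1:M,S);
v = @(S) mean(f(X.*(~inS(S)) + x.*inS(S)));

phi = zeros(1,M);
for i = 1:M
    others = setdiff(1:M,i);
    for k = 0:M-1
        if k == 0
            subsets = zeros(1,0);         % the empty coalition
        else
            subsets = nchoosek(others,k); % all coalitions of size k without i
        end
        % Weight of a coalition of size k in the Shapley average
        w = factorial(k)*factorial(M-k-1)/factorial(M);
        for r = 1:size(subsets,1)
            S = subsets(r,:);
            % Marginal contribution of feature i when it joins coalition S
            phi(i) = phi(i) + w*(v([S i]) - v(S));
        end
    end
end

averagePrediction = mean(f(X));
deviation = f(x) - averagePrediction;
fprintf("Sum of Shapley values: %.4f, deviation: %.4f\n",sum(phi),deviation)
```

The two printed numbers agree up to floating-point rounding, which is the additivity property: the average prediction plus the sum of the Shapley values reproduces the prediction at the query point. The enumeration visits every coalition, so its cost grows exponentially with the number of features and it is practical only for very small problems.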
For more details, see Shapley Values for Machine Learning Model.
References
[1] Lundberg, Scott M., and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems 30 (2017): 4765–4774.
[2] Lundberg, Scott M., G. Erion, H. Chen, et al. "From Local Explanations to Global Understanding with Explainable AI for Trees." Nature Machine Intelligence 2 (January 2020): 56–67.
[3] Aas, Kjersti, Martin Jullum, and Anders Løland. "Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values." Artificial Intelligence 298 (September 2021).
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, set the UseParallel name-value argument to true in the call to this function.
For more general information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
Version History
Introduced in R2021a
R2024b: Speed up Shapley value computations
To speed up Shapley value computations, the shapley function now uses a default maximum of 100 observations from the predictor data set X to compute Shapley values. You can set the number of predictor data observations to use by specifying the NumObservationsToSample name-value argument. You can find the indices of the sampled observations in the SampledObservationIndices property of the shapley object.
In previous releases, shapley uses the entire predictor data set to compute Shapley values. To replicate this behavior, set the NumObservationsToSample value to "all".
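The following sketch contrasts the default sampling behavior with the earlier behavior. It assumes R2024b or later with Statistics and Machine Learning Toolbox, and it uses hypothetical random data and a regression tree that are not part of the examples on this page.

```matlab
% Hypothetical data and model (illustrative only).
rng("default") % For reproducibility
X = rand(500,4);
Y = X(:,1) + 2*X(:,2) + 0.1*randn(500,1);
mdl = fitrtree(X,Y);

% Default behavior in R2024b and later: use at most 100 observations of X.
explainerDefault = shapley(mdl,QueryPoints=X(1,:));

% Earlier behavior: use every observation in the predictor data.
explainerAll = shapley(mdl,QueryPoints=X(1,:),NumObservationsToSample="all");

% Indices of the observations sampled in the default case.
idx = explainerDefault.SampledObservationIndices;
```

Using all observations can give values that are less dependent on the particular sample, at the cost of longer computation on large data sets.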
You can also run Shapley value computations in parallel when using the OutputFcn name-value argument by setting the UseParallel value to true. To parallelize Shapley value computations, you must have Parallel Computing Toolbox.
R2024b: Shapley values property is now named Shapley
The Shapley property of the shapley object contains the Shapley values for the query points. In previous releases, the Shapley property is named ShapleyValues.
When you use a regression model (blackbox) to compute Shapley values, the Shapley property contains a table whose second column is Value. In previous releases, the column name is ShapleyValue.
R2024a: Compute Shapley values for multiple query points
You can now compute Shapley values for multiple query points by using the QueryPoints=queryPoints name-value argument. Before R2024a, you could specify only one query point using QueryPoint=queryPoint, where queryPoint is a row vector of numeric values or a single-row table.
The shapley object contains a new property MeanAbsoluteShapley, which contains the absolute Shapley values, averaged across all query points. Additionally, the Method property can now have the value "interventional-mix". This value indicates that the software might not use the same Shapley value computation algorithm for all query points.
When computing Shapley values for multiple query points, you can use an output function to perform various tasks, such as stopping Shapley value computations, creating variables, or plotting results. To do so, use the OutputFcn name-value argument.
R2023b: Interventional Tree SHAP algorithm supports data with missing predictor values
When observations in the input predictor data (X or blackbox.X) or values in the query point (queryPoint) contain missing values and the Method value is "interventional", the shapley function can use the Tree SHAP algorithm for tree models and ensemble models of tree learners. In previous releases, under these conditions, the shapley function always used the Kernel SHAP algorithm for tree-based models. For more information, including cases where the software still uses Kernel SHAP instead of Tree SHAP for tree-based models, see Interventional Algorithms.
R2023b: Improved performance of Tree SHAP algorithm for tree-based models
The shapley function shows improved performance when computing Shapley values for tree models and ensemble models of tree learners by using the Tree SHAP algorithm with an interventional value function (see Method). The performance increase is sensitive to the values of the shapley input arguments (such as the predictor data, machine learning model, and query point). For example, the Shapley value computation in this code is about 36x faster than in the previous release:
function timingTest
% Generate data set
rng("default")
numObservations = 1e5;
numPredictors = 10;
X = rand(numObservations,numPredictors);
Y = rand(numObservations,1);
% Train model
mdl = fitrensemble(X,Y,Learners="tree");
% Compute Shapley value
tic
shapley(mdl,"QueryPoint",X(50,:),Method="interventional");
toc
end
The approximate execution times are:
R2023b: 3s
R2023a: 107s
The code was timed on a Windows® 10, Intel® Xeon® CPU E5-1650 v4 @ 3.60GHz test system by calling the function timingTest.
R2023a: shapley supports the Linear SHAP and Tree SHAP algorithms
shapley supports the Linear SHAP [1] algorithm for linear models and the Tree SHAP [2] algorithm for tree models and ensemble models of tree learners.
If you specify the Method name-value argument as 'interventional' (default), shapley selects an algorithm based on the machine learning model type of blackbox. The Method property stores the name of the selected algorithm.
R2023a: Values of the Method name-value argument have changed
The supported values of the Method name-value argument have changed from 'interventional-kernel' and 'conditional-kernel' to 'interventional' and 'conditional', respectively.