Get Shapley values for the whole model

18 views (last 30 days)
Thomas Kirsh
Thomas Kirsh on 18 Apr 2023
Edited: Drew on 18 Apr 2023
I'm trying to get the Shapley values for my classification model so that I can estimate the feature importances. However, based on the documentation, MATLAB's implementation only lets me get the values for a single row in my dataset. But to get the values for the entire dataset and model, I need to loop over the rows like this
explainer = shapley(mdl, data);
% Create matrix of zeros to store the values for -1 and 1 classes
values = zeros(size(data, 2), 2, size(data, 1));
for row=1:size(data,1)
temp_explainer = fit(explainer, data(row,:));
svalues = temp_explainer.ShapleyValues;
values(:,:,row) = table2array(svalues(:, 2:3));
end
% Average Shapley values across rows for each feature
feature_importances = mean(values, 3);
This is pretty slow. Is there a faster way to get the Shapley values for the entire model and dataset?

Accepted Answer

Drew
Drew on 18 Apr 2023
Edited: Drew on 18 Apr 2023
For Shapley-based feature importance, it is recommended to use the mean absolute Shapley values. So, insert an absolute value function in the last line: mean(abs(values), 3).
In general, you can reduce Shapley computational cost through some combination of these methods:
  1. Use parallel processing over the query points by using a "parfor" loop over the query points.
  2. Use parallel processing within the processing for one query point by setting 'UseParallel' to 'true' in the call to 'fit'. If you have already implemented parallelization over the query points with parfor, then requesting parallelization within the computation for each query point may have little effect and it could make the computation slower in some cases where the overhead from parallelization at this level is greater than the gains. See doc links below.
  3. Reduce the number of background samples. It is often sufficient to choose 100 random background samples from the predictor data. For example, that would make your first line something like "explainer = shapley(mdl, data( randsample(size(data,1),100) , :) );" See doc links below.
  4. Reduce the MaxNumSubsets parameter. If you have M predictors, the default MaxNumSubsets is min( 2^M, 2^10). You could reduce MaxNumSubsets to make the computation faster, at the expense of introducing some approximation into the calculation. See doc links below. Note, in R2023a, if you are using a tree model, and if the software is using the interventional-tree method, then it is recommended to not set "MaxNumSubsets" because it is not applicable to interventional-tree, and the software will switch to using interventional-kernel because the "MaxNumSubsets" is specified.
  5. R2023a has some Shapley computation improvements which can speed-up the computation for linear or tree models. You can try R2023a at matlab.mathworks.com
  6. Reduce the number of query points. If your dataset has a large number of observations, then observing the Shapley values over 1000 random query points may be sufficient for the purpose.
If this answer helps you, please remember to accept the answer.

More Answers (0)

Tags

Products


Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!