Materials for Homework 2: SHAP with XGBoost & SVM

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.0: 2022-10-13

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW2

0. Import packages

In [1]:
import dalex as dx
import xgboost
import shap

import sklearn
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
In [2]:
import platform
print(f'Python {platform.python_version()}')
Python 3.9.2
In [3]:
{package.__name__: package.__version__ for package in [dx, xgboost, shap, sklearn, pd, np]}
Out[3]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'shap': '0.41.0',
 'sklearn': '1.0.2',
 'pandas': '1.3.5',
 'numpy': '1.22.4'}

1. Load data

In [4]:
df = dx.datasets.load_titanic()
In [5]:
df
Out[5]:
gender age class embarked fare sibsp parch survived
0 male 42.0 3rd Southampton 7.11 0 0 0
1 male 13.0 3rd Southampton 20.05 0 2 0
2 male 16.0 3rd Southampton 20.05 1 1 0
3 female 39.0 3rd Southampton 20.05 1 1 1
4 female 16.0 3rd Southampton 7.13 0 0 1
... ... ... ... ... ... ... ... ...
2202 male 41.0 deck crew Belfast 0.00 0 0 1
2203 male 40.0 victualling crew Southampton 0.00 0 0 1
2204 male 32.0 engineering crew Southampton 0.00 0 0 0
2205 male 20.0 restaurant staff Southampton 0.00 0 0 0
2206 male 26.0 restaurant staff Southampton 0.00 0 0 0

2207 rows × 8 columns

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   gender    2207 non-null   object 
 1   age       2207 non-null   float64
 2   class     2207 non-null   object 
 3   embarked  2207 non-null   object 
 4   fare      2207 non-null   float64
 5   sibsp     2207 non-null   int64  
 6   parch     2207 non-null   int64  
 7   survived  2207 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 138.1+ KB

Let's convert the object columns to pandas' category dtype, which XGBoost can consume natively

In [7]:
df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   gender    2207 non-null   category
 1   age       2207 non-null   float64 
 2   class     2207 non-null   category
 3   embarked  2207 non-null   category
 4   fare      2207 non-null   float64 
 5   sibsp     2207 non-null   int64   
 6   parch     2207 non-null   int64   
 7   survived  2207 non-null   int64   
dtypes: category(3), float64(2), int64(3)
memory usage: 93.5 KB
In [9]:
X = df.drop(columns='survived')
# convert gender to a single binary column up front, because the `max_cat_to_onehot`
# parameter in XGBoost does not yet work reliably
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived

2. Model

How to deal with categorical variables in XGBoost?

In [10]:
categorical_variables = ['class', 'embarked']

# NOTE: ColumnTransformer drops the remaining columns by default (remainder='drop'),
# so this pipeline uses only 'class' and 'embarked' -- which is why the two predictions
# below come out identical. Pass remainder='passthrough' to keep the numeric features.
preprocessor = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_variables)
])

model = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss"
)

model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
])
In [11]:
model_pipeline.fit(X, y)
Out[11]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(),
                                                  ['class', 'embarked'])])),
                ('model',
                 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, early_stopping_rounds=None,
                               enable_categorical=False, eval_metric='logloss',
                               gamma=0, gpu_id=-1, grow_policy='depthwise',
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.300000012, max_bin=256,
                               max_cat_to_onehot=4, max_delta_step=0,
                               max_depth=4, max_leaves=0, min_child_weight=1,
                               missing=nan, monotone_constraints='()',
                               n_estimators=200, n_jobs=0, num_parallel_tree=1,
                               predictor='auto', random_state=0, reg_alpha=0,
                               reg_lambda=1, ...))])
In [12]:
model_pipeline.predict_proba(X.iloc[0:2])
Out[12]:
array([[0.7894829 , 0.21051711],
       [0.7894829 , 0.21051711]], dtype=float32)
In [13]:
model_categorical = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss",
    # native categorical support requires a compatible tree method, e.g. "hist"
    enable_categorical=True,
    tree_method="hist"
)
In [14]:
model_categorical.fit(X, y)
Out[14]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [15]:
model_categorical.predict_proba(X.iloc[0:2])
Out[15]:
array([[0.97659236, 0.02340767],
       [0.91492   , 0.08508006]], dtype=float32)

3. Explain with dalex

https://python.drwhy.ai

(Choose one of the two explainer setups below: the Pipeline model or the natively categorical model.)

In [16]:
model = model_pipeline

pf_xgboost_classifier_default = lambda m, d: m.predict_proba(d)[:, 1]

explainer = dx.Explainer(model, X, y, predict_function=pf_xgboost_classifier_default, label="GBM")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : GBM
  -> predict function  : <function <lambda> at 0x00000272608A7B80> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.00882, mean = 0.322, max = 0.706
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.706, mean = 2.65e-06, max = 0.956
  -> model_info        : package sklearn

A new explainer has been created!
In [17]:
model = model_categorical

def pf_xgboost_classifier_categorical(model, df):
    # cast any 'object' columns back to 'category', since the model was trained
    # with enable_categorical=True and expects pandas categorical dtypes
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X, y, predict_function=pf_xgboost_classifier_categorical, label="GBM")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : GBM
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x000002720B6953A0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.000164, mean = 0.322, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.792, mean = -8.12e-06, max = 0.934
  -> model_info        : package xgboost

A new explainer has been created!

Check performance

In [18]:
explainer.model_performance()
Out[18]:
recall precision f1 accuracy auc
GBM 0.715893 0.92714 0.807937 0.890349 0.943476
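For reference, these metrics can be recomputed by hand with scikit-learn (a minimal sketch, not run in the original notebook; it assumes the same in-sample data and the default 0.5 cutoff used by model_performance()):

from sklearn.metrics import accuracy_score, roc_auc_score

# predicted survival probabilities from whichever explainer was chosen above
y_prob = explainer.predict(X)
y_pred = (y_prob > 0.5).astype(int)

accuracy_score(y, y_pred), roc_auc_score(y, y_prob)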

Explain predictions

In [19]:
explainer.predict(X.iloc[0:10])
Out[19]:
array([0.02340767, 0.08508006, 0.01295335, 0.8908476 , 0.9421018 ,
       0.38336068, 0.10652091, 0.9748429 , 0.52645004, 0.36709288],
      dtype=float32)
In [20]:
shap_attributions = [explainer.predict_parts(X.iloc[[i]], type="shap", label=f'passenger {i}') for i in range(5)]
In [21]:
shap_attributions[0].plot(shap_attributions[1::])
In [22]:
bd_attributions = [explainer.predict_parts(X.iloc[[i]], type="break_down", label=f'passenger {i}') for i in range(5)]
In [23]:
bd_attributions[0].plot(bd_attributions[1::])

Unfortunately, shap (0.41.0) does not work with scikit-learn Pipelines out of the box; a possible model-agnostic workaround is sketched after the error below.

In [24]:
_ = shap.Explainer(model_pipeline)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 _ = shap.Explainer(model_pipeline)

File c:\users\hbani\appdata\local\programs\python\python39\lib\site-packages\shap\explainers\_explainer.py:173, in Explainer.__init__(self, model, masker, link, algorithm, output_names, feature_names, linearize_link, seed, **kwargs)
    169             algorithm = "permutation"
    171     # if we get here then we don't know how to handle what was given to us
    172     else:
--> 173         raise TypeError("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
    175 # build the right subclass
    176 if algorithm == "exact":

TypeError: The passed model is not callable and cannot be analyzed directly with the given masker! Model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(),
                                                  ['class', 'embarked'])])),
                ('model',
                 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, early_stopping_rounds=None,
                               enable_categorical=False, eval_metric='logloss',
                               gamma=0, gpu_id=-1, grow_policy='depthwise',
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.300000012, max_bin=256,
                               max_cat_to_onehot=4, max_delta_step=0,
                               max_depth=4, max_leaves=0, min_child_weight=1,
                               missing=nan, monotone_constraints='()',
                               n_estimators=200, n_jobs=0, num_parallel_tree=1,
                               predictor='auto', random_state=0, reg_alpha=0,
                               reg_lambda=1, ...))])
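A possible workaround (a sketch, not executed here) is to give shap a plain callable plus a background masker, so that it falls back to the model-agnostic Permutation explainer instead of trying to inspect the Pipeline object:

# shap's maskers yield numpy arrays, so rebuild a DataFrame with the original
# column names before calling the fitted pipeline
def pipeline_proba(data):
    data = pd.DataFrame(data, columns=X.columns)
    return model_pipeline.predict_proba(data)[:, 1]

masker = shap.maskers.Independent(X, max_samples=100)
explainer_pipeline = shap.Explainer(pipeline_proba, masker, algorithm="permutation")
# shap_values_pipeline = explainer_pipeline(X.iloc[:5])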

Unfortunately, shap also does not yet handle XGBoost models trained on native categorical features; the standard workaround, used in the next section, is to one-hot encode the data manually.

In [25]:
_ = shap.Explainer(model_categorical)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [25], in <cell line: 1>()
----> 1 _ = shap.Explainer(model_categorical)

File c:\users\hbani\appdata\local\programs\python\python39\lib\site-packages\shap\explainers\_explainer.py:173, in Explainer.__init__(self, model, masker, link, algorithm, output_names, feature_names, linearize_link, seed, **kwargs)
    169             algorithm = "permutation"
    171     # if we get here then we don't know how to handle what was given to us
    172     else:
--> 173         raise TypeError("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
    175 # build the right subclass
    176 if algorithm == "exact":

TypeError: The passed model is not callable and cannot be analyzed directly with the given masker! Model: XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

4. Explain with shap: the classic workaround (manual one-hot encoding)

In [26]:
X_ohe = pd.get_dummies(X, columns=categorical_variables, drop_first=True)
In [27]:
X_ohe
Out[27]:
age fare sibsp parch gender_male class_2nd class_3rd class_deck crew class_engineering crew class_restaurant staff class_victualling crew embarked_Cherbourg embarked_Queenstown embarked_Southampton
0 42.0 7.11 0 0 1 0 1 0 0 0 0 0 0 1
1 13.0 20.05 0 2 1 0 1 0 0 0 0 0 0 1
2 16.0 20.05 1 1 1 0 1 0 0 0 0 0 0 1
3 39.0 20.05 1 1 0 0 1 0 0 0 0 0 0 1
4 16.0 7.13 0 0 0 0 1 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2202 41.0 0.00 0 0 1 0 0 1 0 0 0 0 0 0
2203 40.0 0.00 0 0 1 0 0 0 0 0 1 0 0 1
2204 32.0 0.00 0 0 1 0 0 0 1 0 0 0 0 1
2205 20.0 0.00 0 0 1 0 0 0 0 1 0 0 0 1
2206 26.0 0.00 0 0 1 0 0 0 0 1 0 0 0 1

2207 rows × 14 columns

In [28]:
model_ohe = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss"
)
In [29]:
model_ohe.fit(X_ohe, y)
Out[29]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

Explain predictions

In [30]:
shap_explainer = shap.explainers.Tree(model_ohe, data=X_ohe, model_output="probability")
In [31]:
shap_values = shap_explainer(X_ohe)
 85%|=================   | 1871/2207 [00:19<00:03]       
In [32]:
shap_values
Out[32]:
.values =
array([[-1.09719525e-01, -2.79006932e-02,  1.60395517e-03, ...,
        -5.08379334e-03,  2.12194232e-03, -7.09204016e-04],
       [ 1.69173342e-01, -3.95339226e-02,  9.14733980e-03, ...,
        -1.52571276e-02,  3.71611602e-04, -3.57598798e-03],
       [-8.55709581e-02, -1.19821621e-02, -3.38644340e-02, ...,
        -9.85605786e-03,  1.42319896e-04, -2.44987522e-03],
       ...,
       [ 6.07675872e-02, -1.25123512e-02,  1.19170255e-02, ...,
        -4.53929730e-03,  3.81392396e-03,  1.95979771e-02],
       [-2.13001836e-02, -7.24321584e-03,  4.94635106e-03, ...,
        -7.85947583e-03,  1.01644440e-03,  3.76637438e-03],
       [-2.54657124e-02, -1.92159124e-02,  4.19007206e-03, ...,
        -6.80822057e-03,  1.53616686e-03,  2.12301124e-03]])

.base_values =
array([0.34674458, 0.34674458, 0.34674458, ..., 0.34674458, 0.34674458,
       0.34674458])

.data =
array([[42.  ,  7.11,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [13.  , 20.05,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [16.  , 20.05,  1.  , ...,  0.  ,  0.  ,  1.  ],
       ...,
       [32.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [20.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [26.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ]])
In [33]:
for i in range(5):
    shap.plots.waterfall(shap_values[i])
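As a sanity check of SHAP's local accuracy property, the base value plus the sum of the attributions should reconstruct each predicted probability of survival (a small sketch, not part of the original notebook):

# base value + sum of SHAP values should equal the model's predicted probability
reconstructed = shap_values.base_values + shap_values.values.sum(axis=1)
predicted = model_ohe.predict_proba(X_ohe)[:, 1]
np.abs(reconstructed - predicted).max()  # expected to be close to 0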

Global SHAP for a tree-based model

In [34]:
shap.plots.beeswarm(shap_values, max_display=10, plot_size=(9, 6))
In [35]:
import matplotlib.pyplot as plt
# plots.bar() has no plot_size parameter
shap.plots.bar(shap_values, max_display=10, show=False) 
plt.gcf().set_size_inches(9, 6)
plt.show()
In [36]:
# plots.scatter() has no plot_size parameter
shap.plots.scatter(shap_values[:, "age"], show=False)
plt.gcf().set_size_inches(9, 6)
plt.show()

shap.plots.scatter(shap_values[:, "age"], color=shap_values[:, "gender_male"], show=False)
plt.gcf().set_size_inches(11, 6)
plt.show()

shap.plots.scatter(shap_values[:, "age"], color=shap_values[:, "class_3rd"], show=False)
plt.gcf().set_size_inches(11, 6)
plt.show()

5. Compare XGBoost to SVM

Train the SVM model

In [37]:
from sklearn.svm import SVC
In [38]:
svm_ohe = SVC(probability=True)
svm_ohe.fit(X_ohe, y)
Out[38]:
SVC(probability=True)
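Note that an RBF-kernel SVM is sensitive to feature scales, and here age and fare dwarf the 0/1 dummy columns, so a scaled variant may be preferable (a sketch under that assumption; it is not used in the comparison below):

from sklearn.preprocessing import StandardScaler

# standardize all columns (including the dummies) before the RBF kernel
svm_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(probability=True))
])
svm_scaled.fit(X_ohe, y)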

Explain predictions with KernelSHAP

In [39]:
# sample background data for KernelSHAP
X_subset = X_ohe.sample(111, random_state=0)

exp_svm = dx.Explainer(svm_ohe, X_subset, label="SVM", verbose=False)
exp_xgboost = dx.Explainer(model_ohe, X_subset, label="GBM", verbose=False)
In [40]:
sv_svm = [exp_svm.predict_parts(X_ohe.iloc[[i]], type="shap_wrapper") for i in range(5)]
Using 111 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples. (warning repeated 5 times, once per explained passenger)
In [41]:
sv_xgboost = [exp_xgboost.predict_parts(X_ohe.iloc[[i]], type="shap_wrapper") for i in range(5)]
Using 111 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples. (warning repeated 5 times, once per explained passenger)
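The warning above suggests summarizing the background data; when calling KernelSHAP directly, that could look like the sketch below (K=20 centroids is an arbitrary illustrative choice, and this cell is not part of the original analysis):

# summarize the 111 background rows into 20 weighted centroids, as shap suggests
background = shap.kmeans(X_subset, 20)

kernel_explainer = shap.KernelExplainer(
    lambda d: svm_ohe.predict_proba(d)[:, 1],  # probability of survival
    background
)
# sv_example = kernel_explainer.shap_values(X_ohe.iloc[[0]])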

Compare explanations

In [42]:
for i in range(5):
    print(f'====  Passenger {i}  ====')
    print('----  XGBoost:')
    sv_xgboost[i].plot()
    print('---- SVM:')
    sv_svm[i].plot()
====  Passenger 0  ====
----  XGBoost:
---- SVM:
====  Passenger 1  ====
----  XGBoost:
---- SVM:
====  Passenger 2  ====
----  XGBoost:
---- SVM:
====  Passenger 3  ====
----  XGBoost:
---- SVM:
====  Passenger 4  ====
----  XGBoost:
---- SVM:

For a theoretical introduction and more examples, see: https://ema.drwhy.ai