Materials for Homework 2: SHAP with XGBoost & SVM

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.0: 2022-10-13

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW2

0. Import packages

In [1]:
import dalex as dx
import xgboost
import shap

import sklearn
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
In [2]:
import platform
print(f'Python {platform.python_version()}')
Python 3.9.2
In [3]:
{package.__name__: package.__version__ for package in [dx, xgboost, shap, sklearn, pd, np]}
Out[3]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'shap': '0.41.0',
 'sklearn': '1.0.2',
 'pandas': '1.3.5',
 'numpy': '1.22.4'}

1. Load data

In [4]:
df = dx.datasets.load_titanic()
In [5]:
df
Out[5]:
gender age class embarked fare sibsp parch survived
0 male 42.0 3rd Southampton 7.11 0 0 0
1 male 13.0 3rd Southampton 20.05 0 2 0
2 male 16.0 3rd Southampton 20.05 1 1 0
3 female 39.0 3rd Southampton 20.05 1 1 1
4 female 16.0 3rd Southampton 7.13 0 0 1
... ... ... ... ... ... ... ... ...
2202 male 41.0 deck crew Belfast 0.00 0 0 1
2203 male 40.0 victualling crew Southampton 0.00 0 0 1
2204 male 32.0 engineering crew Southampton 0.00 0 0 0
2205 male 20.0 restaurant staff Southampton 0.00 0 0 0
2206 male 26.0 restaurant staff Southampton 0.00 0 0 0

2207 rows × 8 columns

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   gender    2207 non-null   object 
 1   age       2207 non-null   float64
 2   class     2207 non-null   object 
 3   embarked  2207 non-null   object 
 4   fare      2207 non-null   float64
 5   sibsp     2207 non-null   int64  
 6   parch     2207 non-null   int64  
 7   survived  2207 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 138.1+ KB

Let's convert the object columns to pandas' category dtype, which XGBoost can consume natively

In [7]:
df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   gender    2207 non-null   category
 1   age       2207 non-null   float64 
 2   class     2207 non-null   category
 3   embarked  2207 non-null   category
 4   fare      2207 non-null   float64 
 5   sibsp     2207 non-null   int64   
 6   parch     2207 non-null   int64   
 7   survived  2207 non-null   int64   
dtypes: category(3), float64(2), int64(3)
memory usage: 93.5 KB
In [9]:
X = df.drop(columns='survived')
# convert gender to a single binary column up front, because the `max_cat_to_onehot`
# parameter in XGBoost does not yet work reliably
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived

2. Model

How to deal with categorical variables in XGBoost?

In [10]:
categorical_variables = ['class', 'embarked']

# NOTE: ColumnTransformer drops the remaining columns by default (remainder='drop'),
# so this pipeline uses only 'class' and 'embarked' -- which is why the two predictions
# below come out identical. Pass remainder='passthrough' to keep the numeric features.
preprocessor = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_variables)
])

model = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss"
)

model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
])
In [11]:
model_pipeline.fit(X, y)
Out[11]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(),
                                                  ['class', 'embarked'])])),
                ('model',
                 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, early_stopping_rounds=None,
                               enable_categorical=False, eval_metric='logloss',
                               gamma=0, gpu_id=-1, grow_policy='depthwise',
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.300000012, max_bin=256,
                               max_cat_to_onehot=4, max_delta_step=0,
                               max_depth=4, max_leaves=0, min_child_weight=1,
                               missing=nan, monotone_constraints='()',
                               n_estimators=200, n_jobs=0, num_parallel_tree=1,
                               predictor='auto', random_state=0, reg_alpha=0,
                               reg_lambda=1, ...))])
In [12]:
model_pipeline.predict_proba(X.iloc[0:2])
Out[12]:
array([[0.7894829 , 0.21051711],
       [0.7894829 , 0.21051711]], dtype=float32)
In [13]:
model_categorical = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss",
    # native categorical support requires a compatible tree method, e.g. "hist"
    enable_categorical=True,
    tree_method="hist"
)
In [14]:
model_categorical.fit(X, y)
Out[14]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [15]:
model_categorical.predict_proba(X.iloc[0:2])
Out[15]:
array([[0.97659236, 0.02340767],
       [0.91492   , 0.08508006]], dtype=float32)

3. Explain with dalex

https://python.drwhy.ai

(Choose one of the two explainer setups below: the Pipeline model or the natively categorical model.)

In [16]:
model = model_pipeline

pf_xgboost_classifier_default = lambda m, d: m.predict_proba(d)[:, 1]

explainer = dx.Explainer(model, X, y, predict_function=pf_xgboost_classifier_default, label="GBM")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : GBM
  -> predict function  : <function <lambda> at 0x00000272608A7B80> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.00882, mean = 0.322, max = 0.706
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.706, mean = 2.65e-06, max = 0.956
  -> model_info        : package sklearn

A new explainer has been created!
In [17]:
model = model_categorical

def pf_xgboost_classifier_categorical(model, df):
    # cast any 'object' columns back to 'category', since the model was trained
    # with enable_categorical=True and expects pandas categorical dtypes
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X, y, predict_function=pf_xgboost_classifier_categorical, label="GBM")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : GBM
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x000002720B6953A0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.000164, mean = 0.322, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.792, mean = -8.12e-06, max = 0.934
  -> model_info        : package xgboost

A new explainer has been created!

Check performance

In [18]:
explainer.model_performance()
Out[18]:
recall precision f1 accuracy auc
GBM 0.715893 0.92714 0.807937 0.890349 0.943476
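For reference, these metrics can be recomputed by hand with scikit-learn (a minimal sketch, not run in the original notebook; it assumes the same in-sample data and the default 0.5 cutoff used by model_performance()):

from sklearn.metrics import accuracy_score, roc_auc_score

# predicted survival probabilities from whichever explainer was chosen above
y_prob = explainer.predict(X)
y_pred = (y_prob > 0.5).astype(int)

accuracy_score(y, y_pred), roc_auc_score(y, y_prob)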

Explain predictions

In [19]:
explainer.predict(X.iloc[0:10])
Out[19]:
array([0.02340767, 0.08508006, 0.01295335, 0.8908476 , 0.9421018 ,
       0.38336068, 0.10652091, 0.9748429 , 0.52645004, 0.36709288],
      dtype=float32)
In [20]:
shap_attributions = [explainer.predict_parts(X.iloc[[i]], type="shap", label=f'passenger {i}') for i in range(5)]
In [21]:
shap_attributions[0].plot(shap_attributions[1::])
In [22]:
bd_attributions = [explainer.predict_parts(X.iloc[[i]], type="break_down", label=f'passenger {i}') for i in range(5)]
In [23]:
bd_attributions[0].plot(bd_attributions[1::])

Unfortunately, shap (0.41.0) does not work with scikit-learn Pipelines out of the box; a possible model-agnostic workaround is sketched after the error below.

In [24]:
_ = shap.Explainer(model_pipeline)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 _ = shap.Explainer(model_pipeline)

File c:\users\hbani\appdata\local\programs\python\python39\lib\site-packages\shap\explainers\_explainer.py:173, in Explainer.__init__(self, model, masker, link, algorithm, output_names, feature_names, linearize_link, seed, **kwargs)
    169             algorithm = "permutation"
    171     # if we get here then we don't know how to handle what was given to us
    172     else:
--> 173         raise TypeError("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
    175 # build the right subclass
    176 if algorithm == "exact":

TypeError: The passed model is not callable and cannot be analyzed directly with the given masker! Model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(),
                                                  ['class', 'embarked'])])),
                ('model',
                 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, early_stopping_rounds=None,
                               enable_categorical=False, eval_metric='logloss',
                               gamma=0, gpu_id=-1, grow_policy='depthwise',
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.300000012, max_bin=256,
                               max_cat_to_onehot=4, max_delta_step=0,
                               max_depth=4, max_leaves=0, min_child_weight=1,
                               missing=nan, monotone_constraints='()',
                               n_estimators=200, n_jobs=0, num_parallel_tree=1,
                               predictor='auto', random_state=0, reg_alpha=0,
                               reg_lambda=1, ...))])
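A possible workaround (a sketch, not executed here) is to give shap a plain callable plus a background masker, so that it falls back to the model-agnostic Permutation explainer instead of trying to inspect the Pipeline object:

# shap's maskers yield numpy arrays, so rebuild a DataFrame with the original
# column names before calling the fitted pipeline
def pipeline_proba(data):
    data = pd.DataFrame(data, columns=X.columns)
    return model_pipeline.predict_proba(data)[:, 1]

masker = shap.maskers.Independent(X, max_samples=100)
explainer_pipeline = shap.Explainer(pipeline_proba, masker, algorithm="permutation")
# shap_values_pipeline = explainer_pipeline(X.iloc[:5])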

Unfortunately, shap also does not yet handle XGBoost models trained on native categorical features; the standard workaround, used in the next section, is to one-hot encode the data manually.

In [25]:
_ = shap.Explainer(model_categorical)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [25], in <cell line: 1>()
----> 1 _ = shap.Explainer(model_categorical)

File c:\users\hbani\appdata\local\programs\python\python39\lib\site-packages\shap\explainers\_explainer.py:173, in Explainer.__init__(self, model, masker, link, algorithm, output_names, feature_names, linearize_link, seed, **kwargs)
    169             algorithm = "permutation"
    171     # if we get here then we don't know how to handle what was given to us
    172     else:
--> 173         raise TypeError("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
    175 # build the right subclass
    176 if algorithm == "exact":

TypeError: The passed model is not callable and cannot be analyzed directly with the given masker! Model: XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

4. Explain with shap: the classic workaround (manual one-hot encoding)

In [26]:
X_ohe = pd.get_dummies(X, columns=categorical_variables, drop_first=True)
In [27]:
X_ohe
Out[27]:
age fare sibsp parch gender_male class_2nd class_3rd class_deck crew class_engineering crew class_restaurant staff class_victualling crew embarked_Cherbourg embarked_Queenstown embarked_Southampton
0 42.0 7.11 0 0 1 0 1 0 0 0 0 0 0 1
1 13.0 20.05 0 2 1 0 1 0 0 0 0 0 0 1
2 16.0 20.05 1 1 1 0 1 0 0 0 0 0 0 1
3 39.0 20.05 1 1 0 0 1 0 0 0 0 0 0 1
4 16.0 7.13 0 0 0 0 1 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2202 41.0 0.00 0 0 1 0 0 1 0 0 0 0 0 0
2203 40.0 0.00 0 0 1 0 0 0 0 0 1 0 0 1
2204 32.0 0.00 0 0 1 0 0 0 1 0 0 0 0 1
2205 20.0 0.00 0 0 1 0 0 0 0 1 0 0 0 1
2206 26.0 0.00 0 0 1 0 0 0 0 1 0 0 0 1

2207 rows × 14 columns

In [28]:
model_ohe = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss"
)
In [29]:
model_ohe.fit(X_ohe, y)
Out[29]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

Explain predictions

In [30]:
shap_explainer = shap.explainers.Tree(model_ohe, data=X_ohe, model_output="probability")
In [31]:
shap_values = shap_explainer(X_ohe)
 85%|=================   | 1871/2207 [00:19<00:03]       
In [32]:
shap_values
Out[32]:
.values =
array([[-1.09719525e-01, -2.79006932e-02,  1.60395517e-03, ...,
        -5.08379334e-03,  2.12194232e-03, -7.09204016e-04],
       [ 1.69173342e-01, -3.95339226e-02,  9.14733980e-03, ...,
        -1.52571276e-02,  3.71611602e-04, -3.57598798e-03],
       [-8.55709581e-02, -1.19821621e-02, -3.38644340e-02, ...,
        -9.85605786e-03,  1.42319896e-04, -2.44987522e-03],
       ...,
       [ 6.07675872e-02, -1.25123512e-02,  1.19170255e-02, ...,
        -4.53929730e-03,  3.81392396e-03,  1.95979771e-02],
       [-2.13001836e-02, -7.24321584e-03,  4.94635106e-03, ...,
        -7.85947583e-03,  1.01644440e-03,  3.76637438e-03],
       [-2.54657124e-02, -1.92159124e-02,  4.19007206e-03, ...,
        -6.80822057e-03,  1.53616686e-03,  2.12301124e-03]])

.base_values =
array([0.34674458, 0.34674458, 0.34674458, ..., 0.34674458, 0.34674458,
       0.34674458])

.data =
array([[42.  ,  7.11,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [13.  , 20.05,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [16.  , 20.05,  1.  , ...,  0.  ,  0.  ,  1.  ],
       ...,
       [32.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [20.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ],
       [26.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  1.  ]])
In [33]:
for i in range(5):
    shap.plots.waterfall(shap_values[i])
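As a sanity check of SHAP's local accuracy property, the base value plus the sum of the attributions should reconstruct each predicted probability of survival (a small sketch, not part of the original notebook):

# base value + sum of SHAP values should equal the model's predicted probability
reconstructed = shap_values.base_values + shap_values.values.sum(axis=1)
predicted = model_ohe.predict_proba(X_ohe)[:, 1]
np.abs(reconstructed - predicted).max()  # expected to be close to 0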

Global SHAP for a tree-based model

In [34]:
shap.plots.beeswarm(shap_values, max_display=10, plot_size=(9, 6))
In [35]:
import matplotlib.pyplot as plt
# plots.bar() has no plot_size parameter
shap.plots.bar(shap_values, max_display=10, show=False) 
plt.gcf().set_size_inches(9, 6)
plt.show()
In [36]:
# plots.scatter() has no plot_size parameter
shap.plots.scatter(shap_values[:, "age"], show=False)
plt.gcf().set_size_inches(9, 6)
plt.show()

shap.plots.scatter(shap_values[:, "age"], color=shap_values[:, "gender_male"], show=False)
plt.gcf().set_size_inches(11, 6)
plt.show()

shap.plots.scatter(shap_values[:, "age"], color=shap_values[:, "class_3rd"], show=False)
plt.gcf().set_size_inches(11, 6)
plt.show()

5. Compare XGBoost to SVM

Train the SVM model

In [37]:
from sklearn.svm import SVC
In [38]:
svm_ohe = SVC(probability=True)
svm_ohe.fit(X_ohe, y)
Out[38]:
SVC(probability=True)
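Note that an RBF-kernel SVM is sensitive to feature scales, and here age and fare dwarf the 0/1 dummy columns, so a scaled variant may be preferable (a sketch under that assumption; it is not used in the comparison below):

from sklearn.preprocessing import StandardScaler

# standardize all columns (including the dummies) before the RBF kernel
svm_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(probability=True))
])
svm_scaled.fit(X_ohe, y)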

Explain predictions with KernelSHAP

In [39]:
# sample background data for KernelSHAP
X_subset = X_ohe.sample(111, random_state=0)

exp_svm = dx.Explainer(svm_ohe, X_subset, label="SVM", verbose=False)
exp_xgboost = dx.Explainer(model_ohe, X_subset, label="GBM", verbose=False)
In [40]:
sv_svm = [exp_svm.predict_parts(X_ohe.iloc[[i]], type="shap_wrapper") for i in range(5)]
Using 111 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples. (warning repeated 5 times, once per explained passenger)
In [41]:
sv_xgboost = [exp_xgboost.predict_parts(X_ohe.iloc[[i]], type="shap_wrapper") for i in range(5)]
Using 111 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples. (warning repeated 5 times, once per explained passenger)
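The warning above suggests summarizing the background data; when calling KernelSHAP directly, that could look like the sketch below (K=20 centroids is an arbitrary illustrative choice, and this cell is not part of the original analysis):

# summarize the 111 background rows into 20 weighted centroids, as shap suggests
background = shap.kmeans(X_subset, 20)

kernel_explainer = shap.KernelExplainer(
    lambda d: svm_ohe.predict_proba(d)[:, 1],  # probability of survival
    background
)
# sv_example = kernel_explainer.shap_values(X_ohe.iloc[[0]])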

Compare explanations

In [42]:
for i in range(5):
    print(f'====  Passenger {i}  ====')
    print('----  XGBoost:')
    sv_xgboost[i].plot()
    print('---- SVM:')
    sv_svm[i].plot()
====  Passenger 0  ====
----  XGBoost:
---- SVM:
====  Passenger 1  ====
----  XGBoost:
---- SVM:
====  Passenger 2  ====
----  XGBoost:
---- SVM:
====  Passenger 3  ====
----  XGBoost:
---- SVM:
====  Passenger 4  ====
----  XGBoost:
---- SVM:

For a theoretical introduction and more examples, see: https://ema.drwhy.ai