Materials towards Homework 3: LIME with XGBoost & SVM

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.1: 2022-10-20

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW3

0. Import packages

In [1]:
import dalex as dx
import xgboost
import lime

import sklearn

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
In [2]:
import platform
print(f'Python {platform.python_version()}')
Python 3.9.2
In [3]:
{package.__name__: package.__version__ for package in [dx, xgboost, sklearn, pd, np]} | {lime.__name__: "0.2.0.1"}
Out[3]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'sklearn': '1.0.2',
 'pandas': '1.3.5',
 'numpy': '1.22.4',
 'lime': '0.2.0.1'}

1. Load and preprocess data

lime.lime_tabular.LimeTabularExplainer assumes integer-encoded categorical variables, configured through the following parameters:

categorical_features – list of indices (ints) corresponding to the categorical columns. Everything else will be considered continuous. Values in these columns MUST be integers.

categorical_names – map from int to list of names, where categorical_names[x][y] represents the name of the yth value of column x.

XGBoost, on the other hand, expects categorical variables to be of the pandas category dtype.
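
For intuition, a minimal sketch of the two representations for a single hypothetical column (the values mimic the embarked variable used later):

emb = pd.Series(["Southampton", "Cherbourg", "Belfast"], dtype="category")

# the pandas 'category' dtype that XGBoost (with enable_categorical=True) expects
emb_for_xgboost = emb

# the integer codes plus a categorical_names mapping that lime expects
emb_for_lime = emb.cat.codes                        # integer codes, here [2, 1, 0]
categorical_names = {0: list(emb.cat.categories)}   # column index -> list of value names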

The challenge is to make one work with the other. First, let's use one-hot encoding.

In [4]:
df = dx.datasets.load_titanic()

X = df.drop(columns='survived')
X = pd.get_dummies(X, columns=['gender', 'class', 'embarked'], drop_first=True) 
y = df.survived

2. Model

We use the same XGBoost classifier as in the previous materials towards Homework 2.

In [5]:
model = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss"
)

model.fit(X, y)
Out[5]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

3. Explain with dalex & lime

dalex uses the original lime package to estimate LIME explanations under a unified API.

dalex aims to improve the user's convenience by:

  1. combining the use of LimeTabularExplainer and explain_instance() into a single predict_surrogate() method,
  2. automatically setting some of the lime parameters based on explainer.data, explainer.model_type, etc.
In [6]:
explainer = dx.Explainer(model, X, y)
Preparation of a new explainer is initiated

  -> data              : 2207 rows 14 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000002A96356B940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00022, mean = 0.322, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.809, mean = -8.27e-05, max = 0.959
  -> model_info        : package xgboost

A new explainer has been created!

Check performance on the training data

In [7]:
explainer.model_performance(cutoff=y.mean())
Out[7]:
                 recall  precision        f1  accuracy       auc
XGBClassifier  0.787623   0.851064  0.818115  0.887177  0.944357

Explain a prediction of interest

In [8]:
observation = X.iloc[[0]]
explainer.predict(observation)
Out[8]:
array([0.01038458], dtype=float32)
In [9]:
explanation = explainer.predict_surrogate(observation)
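
Roughly, this single call corresponds to the two-step lime workflow shown later in Section 4. A sketch (not the exact dalex internals), using the attributes dalex sets automatically:

surrogate_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=explainer.data.values,
    feature_names=explainer.data.columns,
    mode=explainer.model_type            # 'classification'
)
surrogate_explanation = surrogate_explainer.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=lambda d: explainer.model.predict_proba(d)
)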

In the dalex API, the estimated explanation can be accessed via the result attribute

In [10]:
explanation.result
Out[10]:
variable effect
0 class_deck crew <= 0.00 -0.234861
1 class_restaurant staff <= 0.00 0.144448
2 0.00 < class_3rd <= 1.00 -0.132443
3 embarked_Cherbourg <= 0.00 -0.094520
4 age > 38.00 -0.083834
5 sibsp <= 0.00 0.074970
6 embarked_Queenstown <= 0.00 0.052523
7 class_victualling crew <= 0.00 0.038948
8 class_2nd <= 0.00 0.033706
9 class_engineering crew <= 0.00 -0.011709

Analogously, the estimated explanation can be visualized using the plot() method, which uses as_pyplot_figure() from lime

In [11]:
explanation.plot()

Be careful! The LIME algorithm, like many other explanation methods, involves randomness (the surrogate is fitted on randomly sampled perturbations of the observation)

In [12]:
import random
import matplotlib.pyplot as plt

for seed in range(4):
    random.seed(seed)
    np.random.seed(seed)
    exp = explainer.predict_surrogate(observation)
    exp.plot(return_figure=True)
    plt.title(f'Explanation for observation id0 assuming random seed is {seed}')
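
If a reproducible explanation is needed, lime's random_state parameter (an argument of LimeTabularExplainer) can presumably be forwarded through predict_surrogate() in the same way as the other lime parameters passed in the next cell; a hedged one-liner sketch:

exp_fixed = explainer.predict_surrogate(observation, random_state=0)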

Both LimeTabularExplainer and explain_instance() have many parameters, which can be passed jointly through the predict_surrogate() interface

In [13]:
random.seed(0)
np.random.seed(0)
exp_manhattan = explainer.predict_surrogate(observation, num_samples=1000, distance_metric="manhattan", kernel_width=1)
random.seed(0)
np.random.seed(0)
exp_euclidean = explainer.predict_surrogate(observation, num_samples=1000, distance_metric="euclidean", kernel_width=1)
exp_manhattan.plot()
exp_euclidean.plot()

4. Explain with lime

An example of the same process using the lime package

Note that training_data needs to be a numpy.ndarray

In [14]:
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X.values,  
    feature_names=X.columns,
    mode="classification"
)
In [15]:
lime_explanation = lime_explainer.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=lambda d: model.predict_proba(d)
)   
In [16]:
lime_explanation.as_list()
Out[16]:
[('class_deck crew <= 0.00', -0.23918294815608596),
 ('class_restaurant staff <= 0.00', 0.162568079624797),
 ('0.00 < class_3rd <= 1.00', -0.1435698690099299),
 ('age > 38.00', -0.07811232716634249),
 ('class_2nd <= 0.00', 0.07459643879945656),
 ('embarked_Cherbourg <= 0.00', -0.06029276614220415),
 ('0.00 < fare <= 7.15', -0.03669565753466026),
 ('embarked_Queenstown <= 0.00', 0.02902459257869579),
 ('sibsp <= 0.00', 0.02600488136435894),
 ('0.00 < embarked_Southampton <= 1.00', 0.020576109367271223)]
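
To compare these values side by side with dalex's explanation.result above, the list of tuples can be wrapped into a data frame; a minimal sketch:

pd.DataFrame(lime_explanation.as_list(), columns=["variable", "effect"])
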
In [17]:
_ = lime_explanation.as_pyplot_figure()
In [18]:
_ = lime_explanation.show_in_notebook()

Now, let's try combining categorical encoding in xgboost with an integer encoding in lime.

We start by creating two datasets of different types: X_cat with pandas category columns for XGBoost, and X_fact with integer codes for lime

In [19]:
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not work properly yet

X_cat = pd.get_dummies(df.drop(columns='survived'), columns=["gender"], drop_first=True)
X_fact = pd.get_dummies(df.drop(columns='survived'), columns=["gender"], drop_first=True)

# X_cat: convert the remaining object columns to the pandas 'category' dtype (for XGBoost)
X_cat.loc[:, X_cat.dtypes == 'object'] =\
    X_cat.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

# X_fact: replace the same columns with their integer category codes (for lime)
X_fact.loc[:, X_fact.dtypes == 'object'] =\
    X_cat.select_dtypes(['category'])\
        .apply(lambda x: x.cat.codes)

y = df.survived

We use the first dataset (X_cat) to fit the model

In [20]:
model_categorical = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss",
    
    enable_categorical=True,
    tree_method="hist"
)

model_categorical.fit(X_cat, y)
Out[20]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

We use the second dataset (X_fact) to explain

In [21]:
categorical_features = [1, 2]  # column indices of 'class' and 'embarked' in X_fact
categorical_names = {id: X_cat.iloc[:, id].cat.categories for id in categorical_features}

print(categorical_names)

lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_fact.values,  
    feature_names=X_fact.columns,
    mode="classification",
    categorical_features=categorical_features,
    categorical_names=categorical_names
)
{1: Index(['1st', '2nd', '3rd', 'deck crew', 'engineering crew',
       'restaurant staff', 'victualling crew'],
      dtype='object'), 2: Index(['Belfast', 'Cherbourg', 'Queenstown', 'Southampton'], dtype='object')}
In [22]:
lime_explanation = lime_explainer.explain_instance(
    data_row=X_fact.values[0],
    predict_fn=lambda d: model_categorical.predict_proba(d)
)   

lime_explanation.as_list()
Out[22]:
[('class=3rd', -0.25814334719328047),
 ('sibsp <= 0.00', 0.04341223784846835),
 ('age > 38.00', -0.03630740220045356),
 ('0.00 < fare <= 7.15', 0.022511203740082565),
 ('parch <= 0.00', 0.014509941048029574),
 ('embarked=Southampton', -0.010480402013100477),
 ('gender_male <= 1.00', 0.0)]
In [23]:
lime_explanation.show_in_notebook()

Note that XGBoost allows predicting on both datasets without raising any error.

In [24]:
(model_categorical.predict_proba(X_cat) == model_categorical.predict_proba(X_fact)).mean()
Out[24]:
1.0
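
The predictions agree here, most likely because the integer codes in X_fact coincide, by construction, with the category codes of the corresponding columns in X_cat; a quick check of that correspondence:

(X_cat["class"].cat.codes == X_fact["class"]).all()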

5. Compare with SVM

Working with other models is similar in both packages; the key is to provide a proper predict_function in dalex and predict_fn in lime

In [25]:
from sklearn.svm import SVC

svm_ohe = SVC(probability=True)

svm_ohe.fit(X, y)

explainer_svm = dx.Explainer(svm_ohe, X, label="SVM", verbose=False)

explanation_svm = explainer_svm.predict_surrogate(observation)
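
For reference, the corresponding raw-lime call for the SVM follows the same pattern as in Section 4, only swapping predict_fn; a sketch (the one-hot encoded X is used because the SVM was trained on it):

lime_explainer_ohe = lime.lime_tabular.LimeTabularExplainer(
    training_data=X.values,
    feature_names=X.columns,
    mode="classification"
)
lime_explanation_svm = lime_explainer_ohe.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=svm_ohe.predict_proba
)
lime_explanation_svm.as_list()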

Compare predictions and their explanations

In [26]:
explanation_svm.plot(return_figure=True)
_ = plt.title(f'Explaining SVM predicting {np.round(explainer_svm.predict(observation).item(), 4)} for observation id0')

explanation.plot(return_figure=True)
_ = plt.title(f'Explaining XGBoost predicting {np.round(explainer.predict(observation).item(), 4)} for observation id0')

For a theoretical introduction and more examples, see: https://ema.drwhy.ai