Materials towards Homework 3: LIME with XGBoost & SVM

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.1: 2022-10-20

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW3

0. Import packages

In [1]:
import dalex as dx
import xgboost
import lime

import sklearn

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
In [2]:
import platform
print(f'Python {platform.python_version()}')
Python 3.9.2
In [3]:
{package.__name__: package.__version__ for package in [dx, xgboost, sklearn, pd, np]} | {lime.__name__: "0.2.0.1"}
Out[3]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'sklearn': '1.0.2',
 'pandas': '1.3.5',
 'numpy': '1.22.4',
 'lime': '0.2.0.1'}

1. Load and preprocess data

lime.lime_tabular.LimeTabularExplainer assumes integer-encoded categorical variables, configured through the following parameters:

categorical_features – list of indices (ints) corresponding to the categorical columns. Everything else will be considered continuous. Values in these columns MUST be integers.

categorical_names – map from int to list of names, where categorical_names[x][y] represents the name of the yth value of column x.

XGBoost, on the other hand, expects categorical variables to be of the pandas category dtype.
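
For intuition, a minimal sketch of the two representations for a single hypothetical column (the values mimic the embarked variable used later):

emb = pd.Series(["Southampton", "Cherbourg", "Belfast"], dtype="category")

# the pandas 'category' dtype that XGBoost (with enable_categorical=True) expects
emb_for_xgboost = emb

# the integer codes plus a categorical_names mapping that lime expects
emb_for_lime = emb.cat.codes                        # integer codes, here [2, 1, 0]
categorical_names = {0: list(emb.cat.categories)}   # column index -> list of value names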

The challenge is to make one work with the other. First, let's use one-hot encoding.

In [4]:
df = dx.datasets.load_titanic()

X = df.drop(columns='survived')
X = pd.get_dummies(X, columns=['gender', 'class', 'embarked'], drop_first=True) 
y = df.survived

2. Model

We use the same XGBoost classifier as in the previous materials towards Homework 2.

In [5]:
model = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss"
)

model.fit(X, y)
Out[5]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

3. Explain with dalex & lime

dalex uses the original lime package to estimate LIME explanations under a unified API.

dalex aims to improve the user's convenience by:

  1. combining the use of LimeTabularExplainer and explain_instance() into a single predict_surrogate() method,
  2. automatically setting some of the lime parameters based on explainer.data, explainer.model_type, etc.
In [6]:
explainer = dx.Explainer(model, X, y)
Preparation of a new explainer is initiated

  -> data              : 2207 rows 14 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000002A96356B940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00022, mean = 0.322, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.809, mean = -8.27e-05, max = 0.959
  -> model_info        : package xgboost

A new explainer has been created!

Check performance on the training data

In [7]:
explainer.model_performance(cutoff=y.mean())
Out[7]:
                 recall  precision        f1  accuracy       auc
XGBClassifier  0.787623   0.851064  0.818115  0.887177  0.944357

Explain a prediction of interest

In [8]:
observation = X.iloc[[0]]
explainer.predict(observation)
Out[8]:
array([0.01038458], dtype=float32)
In [9]:
explanation = explainer.predict_surrogate(observation)
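
Roughly, this single call corresponds to the two-step lime workflow shown later in Section 4. A sketch (not the exact dalex internals), using the attributes dalex sets automatically:

surrogate_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=explainer.data.values,
    feature_names=explainer.data.columns,
    mode=explainer.model_type            # 'classification'
)
surrogate_explanation = surrogate_explainer.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=lambda d: explainer.model.predict_proba(d)
)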

In the dalex API, the estimated explanation can be accessed via the result attribute

In [10]:
explanation.result
Out[10]:
variable effect
0 class_deck crew <= 0.00 -0.234861
1 class_restaurant staff <= 0.00 0.144448
2 0.00 < class_3rd <= 1.00 -0.132443
3 embarked_Cherbourg <= 0.00 -0.094520
4 age > 38.00 -0.083834
5 sibsp <= 0.00 0.074970
6 embarked_Queenstown <= 0.00 0.052523
7 class_victualling crew <= 0.00 0.038948
8 class_2nd <= 0.00 0.033706
9 class_engineering crew <= 0.00 -0.011709

Analogously, the estimated explanation can be visualized using the plot() method, which uses as_pyplot_figure() from lime

In [11]:
explanation.plot()

Be careful! The LIME algorithm, like many other explanation methods, involves randomness (the surrogate is fitted on randomly sampled perturbations of the observation)

In [12]:
import random
import matplotlib.pyplot as plt

for seed in range(4):
    random.seed(seed)
    np.random.seed(seed)
    exp = explainer.predict_surrogate(observation)
    exp.plot(return_figure=True)
    plt.title(f'Explanation for observation id0 assuming random seed is {seed}')
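
If a reproducible explanation is needed, lime's random_state parameter (an argument of LimeTabularExplainer) can presumably be forwarded through predict_surrogate() in the same way as the other lime parameters passed in the next cell; a hedged one-liner sketch:

exp_fixed = explainer.predict_surrogate(observation, random_state=0)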

Both LimeTabularExplainer and explain_instance() have many parameters, which can be passed jointly through the predict_surrogate() interface

In [13]:
random.seed(0)
np.random.seed(0)
exp_manhattan = explainer.predict_surrogate(observation, num_samples=1000, distance_metric="manhattan", kernel_width=1)
random.seed(0)
np.random.seed(0)
exp_euclidean = explainer.predict_surrogate(observation, num_samples=1000, distance_metric="euclidean", kernel_width=1)
exp_manhattan.plot()
exp_euclidean.plot()

4. Explain with lime

An example of the same process using the lime package

Note that training_data needs to be a numpy.ndarray

In [14]:
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X.values,  
    feature_names=X.columns,
    mode="classification"
)
In [15]:
lime_explanation = lime_explainer.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=lambda d: model.predict_proba(d)
)   
In [16]:
lime_explanation.as_list()
Out[16]:
[('class_deck crew <= 0.00', -0.23918294815608596),
 ('class_restaurant staff <= 0.00', 0.162568079624797),
 ('0.00 < class_3rd <= 1.00', -0.1435698690099299),
 ('age > 38.00', -0.07811232716634249),
 ('class_2nd <= 0.00', 0.07459643879945656),
 ('embarked_Cherbourg <= 0.00', -0.06029276614220415),
 ('0.00 < fare <= 7.15', -0.03669565753466026),
 ('embarked_Queenstown <= 0.00', 0.02902459257869579),
 ('sibsp <= 0.00', 0.02600488136435894),
 ('0.00 < embarked_Southampton <= 1.00', 0.020576109367271223)]
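
To compare these values side by side with dalex's explanation.result above, the list of tuples can be wrapped into a data frame; a minimal sketch:

pd.DataFrame(lime_explanation.as_list(), columns=["variable", "effect"])
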
In [17]:
_ = lime_explanation.as_pyplot_figure()
In [18]:
_ = lime_explanation.show_in_notebook()

Now, let's try combining categorical encoding in xgboost with an integer encoding in lime.

We start by creating two datasets of different types: X_cat with pandas category columns for XGBoost, and X_fact with integer codes for lime

In [19]:
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not work properly yet

X_cat = pd.get_dummies(df.drop(columns='survived'), columns=["gender"], drop_first=True)
X_fact = pd.get_dummies(df.drop(columns='survived'), columns=["gender"], drop_first=True)

# X_cat: convert the remaining object columns to the pandas 'category' dtype (for XGBoost)
X_cat.loc[:, X_cat.dtypes == 'object'] =\
    X_cat.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

# X_fact: replace the same columns with their integer category codes (for lime)
X_fact.loc[:, X_fact.dtypes == 'object'] =\
    X_cat.select_dtypes(['category'])\
        .apply(lambda x: x.cat.codes)

y = df.survived

We use the first dataset (X_cat) to fit the model

In [20]:
model_categorical = xgboost.XGBClassifier(
    n_estimators=200, 
    max_depth=4, 
    use_label_encoder=False, 
    eval_metric="logloss",
    
    enable_categorical=True,
    tree_method="hist"
)

model_categorical.fit(X_cat, y)
Out[20]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=4,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

We use the second dataset (X_fact) to explain

In [21]:
categorical_features = [1, 2]  # column indices of 'class' and 'embarked' in X_fact
categorical_names = {id: X_cat.iloc[:, id].cat.categories for id in categorical_features}

print(categorical_names)

lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_fact.values,  
    feature_names=X_fact.columns,
    mode="classification",
    categorical_features=categorical_features,
    categorical_names=categorical_names
)
{1: Index(['1st', '2nd', '3rd', 'deck crew', 'engineering crew',
       'restaurant staff', 'victualling crew'],
      dtype='object'), 2: Index(['Belfast', 'Cherbourg', 'Queenstown', 'Southampton'], dtype='object')}
In [22]:
lime_explanation = lime_explainer.explain_instance(
    data_row=X_fact.values[0],
    predict_fn=lambda d: model_categorical.predict_proba(d)
)   

lime_explanation.as_list()
Out[22]:
[('class=3rd', -0.25814334719328047),
 ('sibsp <= 0.00', 0.04341223784846835),
 ('age > 38.00', -0.03630740220045356),
 ('0.00 < fare <= 7.15', 0.022511203740082565),
 ('parch <= 0.00', 0.014509941048029574),
 ('embarked=Southampton', -0.010480402013100477),
 ('gender_male <= 1.00', 0.0)]
In [23]:
lime_explanation.show_in_notebook()

Note that XGBoost allows predicting on both datasets without raising any error.

In [24]:
(model_categorical.predict_proba(X_cat) == model_categorical.predict_proba(X_fact)).mean()
Out[24]:
1.0
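
The predictions agree here, most likely because the integer codes in X_fact coincide, by construction, with the category codes of the corresponding columns in X_cat; a quick check of that correspondence:

(X_cat["class"].cat.codes == X_fact["class"]).all()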

5. Compare with SVM

Working with other models is similar in both packages; the key is to provide a proper predict_function in dalex and predict_fn in lime

In [25]:
from sklearn.svm import SVC

svm_ohe = SVC(probability=True)

svm_ohe.fit(X, y)

explainer_svm = dx.Explainer(svm_ohe, X, label="SVM", verbose=False)

explanation_svm = explainer_svm.predict_surrogate(observation)
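
For reference, the corresponding raw-lime call for the SVM follows the same pattern as in Section 4, only swapping predict_fn; a sketch (the one-hot encoded X is used because the SVM was trained on it):

lime_explainer_ohe = lime.lime_tabular.LimeTabularExplainer(
    training_data=X.values,
    feature_names=X.columns,
    mode="classification"
)
lime_explanation_svm = lime_explainer_ohe.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=svm_ohe.predict_proba
)
lime_explanation_svm.as_list()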

Compare predictions and their explanations

In [26]:
explanation_svm.plot(return_figure=True)
_ = plt.title(f'Explaining SVM predicting {np.round(explainer_svm.predict(observation).item(), 4)} for observation id0')

explanation.plot(return_figure=True)
_ = plt.title(f'Explaining XGBoost predicting {np.round(explainer.predict(observation).item(), 4)} for observation id0')

For a theoretical introduction and more examples, see: https://ema.drwhy.ai