Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki
v0.1.1: 2022-10-20
https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW3
import dalex as dx
import xgboost
import lime
import sklearn
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import platform
print(f'Python {platform.python_version()}')
{package.__name__: package.__version__ for package in [dx, xgboost, sklearn, pd, np]} | {lime.__name__: "0.2.0.1"}
lime.lime_tabular.LimeTabularExplainer assumes integer-encoded categorical variables, controlled by the following parameters:
categorical_features – list of indices (ints) corresponding to the categorical columns. Everything else will be considered continuous. Values in these columns MUST be integers.
categorical_names – map from int to list of names, where categorical_names[x][y] represents the name of the yth value of column x.
XGBoost, on the other hand, assumes categorical variables of the pandas category dtype.
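To make the mismatch concrete, here is a minimal sketch (not part of the modelling below; the toy column is purely illustrative) of the same variable in the two encodings:
demo = pd.Series(["Southampton", "Cherbourg", "Southampton"], name="embarked")
demo_categorical = demo.astype("category")  # the 'category' dtype expected by xgboost
demo_integer = demo_categorical.cat.codes   # the integer codes expected by lime
print(demo_categorical.dtype, list(demo_integer))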
The challenge is to make one work with the other. First, let's use one-hot encoding.
df = dx.datasets.load_titanic()
X = df.drop(columns='survived')
X = pd.get_dummies(X, columns=['gender', 'class', 'embarked'], drop_first=True)
y = df.survived
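A quick, optional sanity check of what one-hot encoding produced (every column should now be numeric):
# optional: inspect the columns created by get_dummies
print(X.dtypes)
X.head()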
We use the same XGBoost classifier as in the previous materials for Homework 2.
model = xgboost.XGBClassifier(
n_estimators=200,
max_depth=4,
use_label_encoder=False,
eval_metric="logloss"
)
model.fit(X, y)
dalex uses the original lime package to estimate LIME under a unified API.
dalex aims to improve the user's convenience by combining LimeTabularExplainer and explain_instance() into one predict_surrogate() method, and by setting lime parameters based on explainer.data, explainer.model_type, etc.
explainer = dx.Explainer(model, X, y)
Check performance on the training data
explainer.model_performance(cutoff=y.mean())
Explain a prediction of interest
observation = X.iloc[[0]]
explainer.predict(observation)
explanation = explainer.predict_surrogate(observation)
In the dalex API, the estimated explanation can be accessed via the result attribute
explanation.result
Analogously, the estimated explanation can be visualized using the plot() method, which uses as_pyplot_figure() from lime
explanation.plot()
Be careful! The LIME algorithm, like many other explanation methods, involves randomness
import random
import matplotlib.pyplot as plt
# re-estimate the explanation under several random seeds to see how much it varies
for seed in range(4):
    random.seed(seed)
    np.random.seed(seed)
    exp = explainer.predict_surrogate(observation)
    exp.plot(return_figure=True)
    plt.title(f'Explanation for observation id0 assuming random seed is {seed}')
Both LimeTabularExplainer and explain_instance() have many parameters, which can be jointly passed to the predict_surrogate() interface
random.seed(0)
np.random.seed(0)
exp_manhattan = explainer.predict_surrogate(observation, num_samples=1000, distance_metric="manhattan", kernel_width=1)
random.seed(0)
np.random.seed(0)
exp_euclidean = explainer.predict_surrogate(observation, num_samples=1000, distance_metric="euclidean", kernel_width=1)
exp_manhattan.plot()
exp_euclidean.plot()
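If reproducibility matters, LimeTabularExplainer also accepts a random_state argument; assuming predict_surrogate() forwards it the same way as kernel_width above, this should fix the sampling without touching the global seeds (a sketch, not verified against every dalex version):
# sketch: random_state is a LimeTabularExplainer argument; assuming predict_surrogate()
# forwards it like kernel_width above, two calls should return the same explanation
exp_a = explainer.predict_surrogate(observation, random_state=0)
exp_b = explainer.predict_surrogate(observation, random_state=0)
exp_a.result.equals(exp_b.result)  # expected: True if the forwarding works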
An example of the same process using the lime package.
Note that training_data needs to be a numpy array
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
training_data=X.values,
feature_names=X.columns,
mode="classification"
)
lime_explanation = lime_explainer.explain_instance(
data_row=observation.iloc[0],
predict_fn=lambda d: model.predict_proba(d)
)
lime_explanation.as_list()
_ = lime_explanation.as_pyplot_figure()
_ = lime_explanation.show_in_notebook()
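Besides rendering in the notebook, the lime explanation can also be saved as a standalone HTML file for sharing (the file name below is arbitrary):
# save the interactive explanation to a standalone HTML file (arbitrary file name)
lime_explanation.save_to_file("lime_explanation_id0.html")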
xgboost with an integer encoding in lime
We start by creating two datasets of different types
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not yet work properly
X_cat = pd.get_dummies(df.drop(columns='survived'), columns=["gender"], drop_first=True)
X_fact = pd.get_dummies(df.drop(columns='survived'), columns=["gender"], drop_first=True)
# X_cat: convert the remaining object columns to the pandas 'category' dtype (for xgboost)
X_cat.loc[:, X_cat.dtypes == 'object'] =\
X_cat.select_dtypes(['object'])\
.apply(lambda x: x.astype('category'))
# X_fact: store the same columns as integer codes taken from X_cat (for lime)
X_fact.loc[:, X_fact.dtypes == 'object'] =\
X_cat.select_dtypes(['category'])\
.apply(lambda x: x.cat.codes)
y = df.survived
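Before modelling, it is worth confirming that the two frames differ only in how the categorical columns are stored; a quick, purely illustrative check:
# X_cat keeps the 'category' dtype for xgboost, X_fact stores integer codes for lime
print(X_cat.dtypes)
print(X_fact.dtypes)
X_cat.head(3)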
We use the first dataset to fit the model
model_categorical = xgboost.XGBClassifier(
n_estimators=200,
max_depth=4,
use_label_encoder=False,
eval_metric="logloss",
enable_categorical=True,
tree_method="hist"
)
model_categorical.fit(X_cat, y)
We use the second dataset to explain its predictions
categorical_features = [1, 2]
categorical_names = {id: X_cat.iloc[:, id].cat.categories for id in categorical_features}
print(categorical_names)
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
training_data=X_fact.values,
feature_names=X_fact.columns,
mode="classification",
categorical_features=categorical_features,
categorical_names=categorical_names
)
lime_explanation = lime_explainer.explain_instance(
data_row=X_fact.values[0],
predict_fn=lambda d: model_categorical.predict_proba(d)
)
lime_explanation.as_list()
lime_explanation.show_in_notebook()
Note that XGBoost allows predicting on both datasets without raising any error.
(model_categorical.predict_proba(X_cat) == model_categorical.predict_proba(X_fact)).mean()
Working with other models is similar in both packages; the key is a proper predict_function in dalex and predict_fn in lime (a sketch with custom functions follows the SVM example below)
from sklearn.svm import SVC
svm_ohe = SVC(probability=True)
svm_ohe.fit(X, y)
explainer_svm = dx.Explainer(svm_ohe, X, label="SVM", verbose=False)
explanation_svm = explainer_svm.predict_surrogate(observation)
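If a model's default interface does not match, both packages accept an explicit prediction function: dalex takes a predict_function(model, data) returning a vector of scores, while lime takes a predict_fn(data) returning class probabilities. A sketch reusing the SVM above (the labels and variable names here are only illustrative):
# dalex: predict_function takes (model, data) and returns a 1D array of scores
explainer_svm_custom = dx.Explainer(
    svm_ohe, X, y, label="SVM (custom predict_function)",
    predict_function=lambda m, d: m.predict_proba(d)[:, 1],
    verbose=False,
)
explainer_svm_custom.predict_surrogate(observation).result

# lime: predict_fn maps a numpy array to an (n, n_classes) probability matrix
lime_explainer_ohe = lime.lime_tabular.LimeTabularExplainer(
    training_data=X.values, feature_names=X.columns, mode="classification"
)
lime_explainer_ohe.explain_instance(
    data_row=observation.iloc[0],
    predict_fn=svm_ohe.predict_proba,
).as_list()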
Compare predictions and their explanations
explanation_svm.plot(return_figure=True)
_ = plt.title(f'Explaining SVM predicting {np.round(explainer_svm.predict(observation).item(), 4)} for observation id0')
explanation.plot(return_figure=True)
_ = plt.title(f'Explaining XGBoost predicting {np.round(explainer.predict(observation).item(), 4)} for observation id0')