Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki
v0.1.0: 2022-11-16
https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW5
import dalex as dx
import xgboost
import shap
import sklearn
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import platform
print(f'Python {platform.python_version()}')
{package.__name__: package.__version__ for package in [dx, xgboost, shap, sklearn, pd, np]}
We use the same XGBoost classifier trained on the Titanic dataset as in the previous materials towards Homework 2, Homework 3 \& Homework 4.
df = dx.datasets.load_titanic()
df.loc[:, df.dtypes == 'object'] =\
df.select_dtypes(['object'])\
.apply(lambda x: x.astype('category'))
X = df.drop(columns='survived')
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost is yet to be working properly..
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
model = xgboost.XGBClassifier(
n_estimators=50,
max_depth=2,
use_label_encoder=False,
eval_metric="logloss",
enable_categorical=True,
tree_method="hist"
)
model.fit(X_train, y_train)
def pf_xgboost_classifier_categorical(model, df):
df.loc[:, df.dtypes == 'object'] =\
df.select_dtypes(['object'])\
.apply(lambda x: x.astype('category'))
return model.predict_proba(df)[:, 1]
explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
explainer.model_performance()
See the API documentation for all the possible values of the specific parameters:
Explainer.model_parts()
, which returns an object of class
VariableImportance
, which can be visualized using
pvi = explainer.model_parts(random_state=0)
pvi.result
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
pvi.plot(
max_vars=3,
digits=4,
bar_width=40,
title="Permutation-based Variable Importance (Top 3)",
show=False
).update_layout(width=600)
PVI depends on the loss function used for drop-out computation.
For binary classification, it's 1-AUC
by default.
def loss_one_minus_auc(observed, predicted):
return 1 - auc(y_true=observed, y_pred=predicted)
Let's create our own loss function for comparison.
def loss_one_minus_acc(observed, predicted, th=0.5):
predicted_class = predicted > th
acc = np.mean(observed == predicted_class)
return 1 - acc
pvi_acc = explainer.model_parts(
loss_function=loss_one_minus_acc,
label="XGBoost [loss 1-ACC]",
random_state=0
)
pvi.plot(
pvi_acc,
max_vars=6,
title="Permutation-based Variable Importance (Top 6)",
show=False
).update_layout(width=600)
In this case, the ranking diverges only at the 6th most important variable.
In many cases, it is more beneficial to report the ratio of variable importance relative to the baseline value.
pvi_ratio = explainer.model_parts(
type="ratio",
label="XGBoost [loss 1-AUC]",
random_state=0
)
pvi_acc_ratio = explainer.model_parts(
type="ratio",
loss_function=loss_one_minus_acc,
label="XGBoost [loss 1-ACC]",
random_state=0
)
pvi_ratio.plot(
pvi_acc_ratio,
title="Comparison of perm-based VI ratios for different loss functions",
show=False
).update_layout(width=900)
Note that the variables in both plots are in the same order, which is based on their average importance.
When there are many variables and/or they are correlated, we might be interested in grouping them together for calculating variable importance.
pvi_grouped = explainer.model_parts(
variable_groups={
'personal': ['gender_male', 'age', 'sibsp', 'parch'],
'wealth': ['class', 'fare']
}, random_state=0)
pvi_grouped.plot(
title="Grouped Permutation-based Variable Importance",
show=False
).update_layout(autosize=False, width=600, height=250)
Last but not least, one can increase the N
and B
parameters for lower estimation error, at a cost computation time.
pvi_accurate = explainer.model_parts(N=None, B=25, random_state=0) # None means all data
pd.concat([
pvi_accurate.result.drop("label", axis=1).set_index("variable").rename(columns={"dropout_loss": "accurate PVI"}),
pvi.result.drop("label", axis=1).set_index("variable").rename(columns={"dropout_loss": "default PVI"})
], axis=1)
Tree-based models have variable importance measured by-design.
Many widely-used implementations provide attributes like feature_importances_
.
model.feature_importances_
pd.DataFrame({'variable': X_test.columns, 'importance': model.feature_importances_})
import matplotlib.pyplot as plt
plt.bar(X_test.columns, model.feature_importances_)
plt.title("Gini-based Variable Importance")
plt.show()
See the API documentation for all the possible values of the specific parameters:
Explainer.model_parts(type="shap_wrapper")
, which returns an object of class
ShapWrapper
, which can be visualized using
and API of the shap
package.
Unfortunately, TreeSHAP works only with numerical variables.
X_ohe = pd.get_dummies(X, columns=['class', 'embarked'], drop_first=True)
model_ohe = xgboost.XGBClassifier(
n_estimators=100,
max_depth=3,
use_label_encoder=False,
eval_metric="logloss"
)
X_ohe_train, X_ohe_test, y_ohe_train, y_ohe_test =\
sklearn.model_selection.train_test_split(X_ohe, y, test_size=0.33, random_state=42)
model_ohe.fit(X_ohe_train, y_ohe_train)
explainer_ohe = dx.Explainer(model_ohe, X_ohe_test, y_ohe_test, label="XGBoost with OHE")
shap_vi = explainer_ohe.model_parts(type="shap_wrapper", shap_explainer_type="TreeExplainer")
shap_vi.plot()