Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki
v0.1.0: 2022-10-26
https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW4
import dalex as dx
import xgboost
import alibi
import sklearn
import sklearn.model_selection
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import platform
print(f'Python {platform.python_version()}')
{package.__name__: package.__version__ for package in [dx, xgboost, alibi, sklearn, pd, np]}
We use the same XGBoost classifier trained on the Titanic dataset as in the previous materials for Homework 2 & Homework 3.
Unfortunately, at this moment, we can't compare dalex to alibi v0.8.0, because the latter does not seem to support categorical variables (tested using code from this notebook).
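Still, alibi's ALE can be tried on a numeric-only view of the data. Below is a speculative sketch, assuming a hand-picked set of numeric columns and a separate XGBoost model fitted only on them; this is a workaround for the limitation, not a full comparison with the model used in the rest of the notebook.
from alibi.explainers import ALE, plot_ale

# numeric-only subset of the Titanic data (illustrative choice of columns)
num_cols = ["age", "fare", "sibsp", "parch"]
df_num = dx.datasets.load_titanic()
X_num = df_num[num_cols].to_numpy()
y_num = df_num.survived.to_numpy()

# small XGBoost model trained on the numeric columns only
clf_num = xgboost.XGBClassifier(n_estimators=50, max_depth=2, eval_metric="logloss")
clf_num.fit(X_num, y_num)

# ALE for the predicted probability of survival
ale_exp = ALE(lambda x: clf_num.predict_proba(x)[:, 1], feature_names=num_cols).explain(X_num)
plot_ale(ale_exp, features=[0, 1])  # ALE curves for age and fare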
df = dx.datasets.load_titanic()
df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))
X = df.drop(columns='survived')
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not yet work properly
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
model = xgboost.XGBClassifier(
    n_estimators=50,
    max_depth=2,
    use_label_encoder=False,
    eval_metric="logloss",
    enable_categorical=True,
    tree_method="hist"
)
model.fit(X_train, y_train)
# custom predict function: cast object columns back to 'category' so the
# categorical-aware XGBoost model accepts the data passed around by dalex
def pf_xgboost_classifier_categorical(model, df):
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]
explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
explainer.model_performance()
Which variables are important?
explainer.model_parts().result
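The permutation-based importance computed above can also be plotted; a minimal sketch using the standard dalex plotting call:
mp = explainer.model_parts()  # permutation-based variable importance
mp.plot()                     # bar plot of drop-out loss per variable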
Let's analyze variable effects with both PDP and ALE (an ALE example follows the PDP plots below).
See the API documentation for all the possible values of the specific parameters: `Explainer.predict_profile()`, which returns an object of class `CeterisParibus` that can be visualized using its `plot()` method.
cp = explainer.predict_profile(new_observation=X.iloc[[400]])
We can visualize a what-if analysis for a single observation of interest...
cp.plot(variables=["age", "sibsp"])
...and for many observations at the same time.
cp_10 = explainer.predict_profile(new_observation=X.iloc[400:410])
cp_10.plot(variables=["age", "sibsp"])
An average of many single profiles (local explanations) estimates the partial dependence (global explanation).
See the API documentation for all the possible values of the specific parameters: `Explainer.model_profile()`, which returns an object of class `AggregatedProfiles` that can be visualized using its `plot()` method.
pdp = explainer.model_profile() # defaults: type="partial" (PDP), N=300
pdp.result
pdp.plot(variables=["age", "fare"])
Explanations seem to indicate a high importance of age.
pdp.plot(variables=["age", "fare"], geom="profiles", title="Partial Dependence Plot with individual profiles")
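As mentioned earlier, variable effects can also be analyzed with ALE in dalex. A minimal sketch, assuming the same explainer; in dalex, accumulated local effects correspond to type="accumulated" in model_profile():
ale = explainer.model_profile(type="accumulated")  # ALE instead of PDP
ale.plot(variables=["age", "fare"], title="Accumulated Local Effects")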