Materials towards Homework 5: PVI with XGBoost

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.0: 2022-11-16

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW5

0. Import packages

In [1]:
import dalex as dx
import xgboost
import shap

import sklearn

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import platform
print(f'Python {platform.python_version()}')

{package.__name__: package.__version__ for package in [dx, xgboost, shap, sklearn, pd, np]}
Python 3.9.2
Out[1]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'shap': '0.41.0',
 'sklearn': '1.1.3',
 'pandas': '1.3.5',
 'numpy': '1.23.4'}

We use the same XGBoost classifier trained on the Titanic dataset as in the previous materials towards Homework 2, Homework 3 & Homework 4.

1. Load and preprocess data

In [2]:
df = dx.datasets.load_titanic()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

X = df.drop(columns='survived')
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not yet work properly
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

2. Model

In [3]:
model = xgboost.XGBClassifier(
    n_estimators=50,
    max_depth=2,
    use_label_encoder=False,
    eval_metric="logloss",
    # enable native handling of categorical features
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X_train, y_train)
Out[3]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=2,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [4]:
def pf_xgboost_classifier_categorical(model, df):
    # cast object columns to category so that XGBoost can handle them natively
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated

  -> data              : 729 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 729 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x000001D3817B2CA0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0258, mean = 0.333, max = 0.99
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.987, mean = -0.00781, max = 0.936
  -> model_info        : package xgboost

A new explainer has been created!

3. Evaluate!

In [5]:
explainer.model_performance()
Out[5]:
recall precision f1 accuracy auc
XGBClassifier 0.582278 0.730159 0.647887 0.794239 0.809509

Permutation-based Variable Importance

See the dalex API documentation for all the possible values of the specific parameters.

In [6]:
pvi = explainer.model_parts(random_state=0)
In [7]:
pvi.result
Out[7]:
variable dropout_loss label
0 parch 0.189701 XGBClassifier
1 _full_model_ 0.190491 XGBClassifier
2 embarked 0.195568 XGBClassifier
3 sibsp 0.197910 XGBClassifier
4 fare 0.198708 XGBClassifier
5 age 0.240732 XGBClassifier
6 class 0.285486 XGBClassifier
7 gender_male 0.378925 XGBClassifier
8 _baseline_ 0.510619 XGBClassifier
In [8]:
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
In [9]:
pvi.plot(
    max_vars=3, 
    digits=4, 
    bar_width=40, 
    title="Permutation-based Variable Importance (Top 3)", 
    show=False
).update_layout(width=600)

PVI depends on the loss function used to compute the dropout loss.

For binary classification, the default is 1-AUC, equivalent to the following (a sketch using scikit-learn's `roc_auc_score`):

from sklearn.metrics import roc_auc_score

def loss_one_minus_auc(observed, predicted):
    return 1 - roc_auc_score(y_true=observed, y_score=predicted)

Let's create our own loss function for comparison.

In [10]:
def loss_one_minus_acc(observed, predicted, th=0.5):
    predicted_class = predicted > th
    acc = np.mean(observed == predicted_class)
    return 1 - acc
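
As a quick sanity check (a sketch), this loss evaluated on the test-set predictions should match, up to rounding, one minus the accuracy reported in the evaluation above:

# Sketch: explainer.predict() applies our custom predict function under the hood.
loss_one_minus_acc(y_test.values, explainer.predict(X_test))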
In [11]:
pvi_acc = explainer.model_parts(
    loss_function=loss_one_minus_acc, 
    label="XGBoost [loss 1-ACC]", 
    random_state=0
)
In [12]:
pvi.plot(
    pvi_acc, 
    max_vars=6, 
    title="Permutation-based Variable Importance (Top 6)", 
    show=False
).update_layout(width=600)

In this case, the ranking diverges only at the 6th most important variable.
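
We can verify this by comparing the two rankings side by side (a minimal sketch based on the `result` DataFrames shown above):

# Sketch: rank variables by dropout loss under each loss function (1 = most important).
aux = ["_full_model_", "_baseline_"]
rank_auc = pvi.result.set_index("variable")["dropout_loss"].drop(aux).rank(ascending=False)
rank_acc = pvi_acc.result.set_index("variable")["dropout_loss"].drop(aux).rank(ascending=False)
pd.concat([rank_auc.rename("rank [1-AUC]"), rank_acc.rename("rank [1-ACC]")], axis=1)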

In many cases, it is more informative to report variable importance as a ratio relative to the loss of the full model.

In [13]:
pvi_ratio = explainer.model_parts(
    type="ratio", 
    label="XGBoost [loss 1-AUC]", 
    random_state=0
)
pvi_acc_ratio = explainer.model_parts(
    type="ratio", 
    loss_function=loss_one_minus_acc, 
    label="XGBoost [loss 1-ACC]", 
    random_state=0
)
In [14]:
pvi_ratio.plot(
    pvi_acc_ratio, 
    title="Comparison of perm-based VI ratios for different loss functions", 
    show=False
).update_layout(width=900)
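
As a sanity check (a sketch), the ratio values can be reproduced by dividing each raw dropout loss by the loss of the full model:

# Sketch: type="ratio" divides each dropout loss by the `_full_model_` loss.
losses = pvi.result.set_index("variable")["dropout_loss"]
losses / losses["_full_model_"]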

Note that the variables in both plots are in the same order, which is based on their average importance.

When there are many variables and/or they are correlated, we may want to group them when calculating variable importance.

In [15]:
pvi_grouped = explainer.model_parts(
    variable_groups={
        'personal': ['gender_male', 'age', 'sibsp', 'parch'], 
        'wealth': ['class', 'fare']
    }, random_state=0)
In [16]:
pvi_grouped.plot(
    title="Grouped Permutation-based Variable Importance", 
    show=False
).update_layout(autosize=False, width=600, height=250)

Last but not least, one can increase the `N` (number of sampled observations) and `B` (number of permutation rounds) parameters for a lower estimation error, at the cost of longer computation time.

In [17]:
pvi_accurate = explainer.model_parts(N=None, B=25, random_state=0) # None means all data
In [18]:
pd.concat([
    pvi_accurate.result.drop("label", axis=1).set_index("variable").rename(columns={"dropout_loss": "accurate PVI"}), 
    pvi.result.drop("label", axis=1).set_index("variable").rename(columns={"dropout_loss": "default PVI"})
], axis=1)
Out[18]:
accurate PVI default PVI
variable
parch 0.189974 0.189701
_full_model_ 0.190491 0.190491
embarked 0.193976 0.195568
sibsp 0.197059 0.197910
fare 0.199214 0.198708
age 0.242790 0.240732
class 0.279159 0.285486
gender_male 0.378469 0.378925
_baseline_ 0.493551 0.510619

Gini-based Variable Importance

Tree-based models provide a variable importance measure by design.

Many widely-used implementations expose it through attributes like `feature_importances_`.

In [19]:
model.feature_importances_
Out[19]:
array([0.05879982, 0.1634799 , 0.04673264, 0.04072754, 0.05098921,
       0.02352506, 0.6157458 ], dtype=float32)
In [20]:
pd.DataFrame({'variable': X_test.columns, 'importance': model.feature_importances_})
Out[20]:
variable importance
0 age 0.058800
1 class 0.163480
2 embarked 0.046733
3 fare 0.040728
4 sibsp 0.050989
5 parch 0.023525
6 gender_male 0.615746
In [21]:
import matplotlib.pyplot as plt
plt.bar(X_test.columns, model.feature_importances_)
plt.title("Gini-based Variable Importance")
plt.show()
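
Note that for XGBoost tree boosters, `feature_importances_` is gain-based by default (the "Gini" naming comes from scikit-learn forests); other importance types can be retrieved from the underlying booster (a sketch using `Booster.get_score`):

# Sketch: alternative built-in importance types from the underlying booster.
booster = model.get_booster()
booster.get_score(importance_type="weight")      # number of splits using each feature
booster.get_score(importance_type="total_gain")  # total loss reduction from those splits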

SHAP-based Variable Importance

See the dalex API documentation for all the possible values of the specific parameters, as well as the API of the shap package.

Unfortunately, TreeSHAP works only with numerical variables, so we first one-hot encode the remaining categorical ones.

In [22]:
X_ohe = pd.get_dummies(X, columns=['class', 'embarked'], drop_first=True)

model_ohe = xgboost.XGBClassifier(
    n_estimators=100, 
    max_depth=3, 
    use_label_encoder=False, 
    eval_metric="logloss"
)

X_ohe_train, X_ohe_test, y_ohe_train, y_ohe_test =\
    sklearn.model_selection.train_test_split(X_ohe, y, test_size=0.33, random_state=42)

model_ohe.fit(X_ohe_train, y_ohe_train)

explainer_ohe = dx.Explainer(model_ohe, X_ohe_test, y_ohe_test, label="XGBoost with OHE")
Preparation of a new explainer is initiated

  -> data              : 729 rows 14 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 729 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : XGBoost with OHE
  -> predict function  : <function yhat_proba_default at 0x000001D3E59D7700> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00631, mean = 0.336, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.999, mean = -0.0106, max = 0.98
  -> model_info        : package xgboost

A new explainer has been created!
In [23]:
shap_vi = explainer_ohe.model_parts(type="shap_wrapper", shap_explainer_type="TreeExplainer")
In [24]:
shap_vi.plot()
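
The same ranking can be obtained with the shap package directly (a minimal sketch bypassing the dalex wrapper):

# Sketch: compute TreeSHAP values with shap and plot mean |SHAP| per variable.
tree_explainer = shap.TreeExplainer(model_ohe)
shap_values = tree_explainer.shap_values(X_ohe_test)
shap.summary_plot(shap_values, X_ohe_test, plot_type="bar")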

For a theoretical introduction and more examples, see: https://ema.drwhy.ai