Materials towards Homework 5: PVI with XGBoost

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.0: 2022-11-16

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW5

0. Import packages

In [1]:
import dalex as dx
import xgboost
import shap

import sklearn

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import platform
print(f'Python {platform.python_version()}')

{package.__name__: package.__version__ for package in [dx, xgboost, shap, sklearn, pd, np]}
Python 3.9.2
Out[1]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'shap': '0.41.0',
 'sklearn': '1.1.3',
 'pandas': '1.3.5',
 'numpy': '1.23.4'}

We use the same XGBoost classifier trained on the Titanic dataset as in the previous materials towards Homework 2, Homework 3 & Homework 4.

1. Load and preprocess data

In [2]:
df = dx.datasets.load_titanic()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

X = df.drop(columns='survived')
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not yet work properly
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

2. Model

In [3]:
model = xgboost.XGBClassifier(
    n_estimators=50,
    max_depth=2,
    use_label_encoder=False,
    eval_metric="logloss",
    # enable native handling of categorical features
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X_train, y_train)
Out[3]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=2,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [4]:
def pf_xgboost_classifier_categorical(model, df):
    # cast object columns to category so that XGBoost can handle them natively
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated

  -> data              : 729 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 729 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x000001D3817B2CA0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0258, mean = 0.333, max = 0.99
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.987, mean = -0.00781, max = 0.936
  -> model_info        : package xgboost

A new explainer has been created!

3. Evaluate!

In [5]:
explainer.model_performance()
Out[5]:
recall precision f1 accuracy auc
XGBClassifier 0.582278 0.730159 0.647887 0.794239 0.809509

Permutation-based Variable Importance

See the dalex API documentation for all the possible values of the specific parameters.

In [6]:
pvi = explainer.model_parts(random_state=0)
In [7]:
pvi.result
Out[7]:
variable dropout_loss label
0 parch 0.189701 XGBClassifier
1 _full_model_ 0.190491 XGBClassifier
2 embarked 0.195568 XGBClassifier
3 sibsp 0.197910 XGBClassifier
4 fare 0.198708 XGBClassifier
5 age 0.240732 XGBClassifier
6 class 0.285486 XGBClassifier
7 gender_male 0.378925 XGBClassifier
8 _baseline_ 0.510619 XGBClassifier
In [8]:
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
In [9]:
pvi.plot(
    max_vars=3, 
    digits=4, 
    bar_width=40, 
    title="Permutation-based Variable Importance (Top 3)", 
    show=False
).update_layout(width=600)

PVI depends on the loss function used to compute the dropout loss.

For binary classification, the default is 1-AUC, equivalent to the following (a sketch using scikit-learn's `roc_auc_score`):

from sklearn.metrics import roc_auc_score

def loss_one_minus_auc(observed, predicted):
    return 1 - roc_auc_score(y_true=observed, y_score=predicted)

Let's create our own loss function for comparison.

In [10]:
def loss_one_minus_acc(observed, predicted, th=0.5):
    predicted_class = predicted > th
    acc = np.mean(observed == predicted_class)
    return 1 - acc
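
As a quick sanity check (a sketch), this loss evaluated on the test-set predictions should match, up to rounding, one minus the accuracy reported in the evaluation above:

# Sketch: explainer.predict() applies our custom predict function under the hood.
loss_one_minus_acc(y_test.values, explainer.predict(X_test))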
In [11]:
pvi_acc = explainer.model_parts(
    loss_function=loss_one_minus_acc, 
    label="XGBoost [loss 1-ACC]", 
    random_state=0
)
In [12]:
pvi.plot(
    pvi_acc, 
    max_vars=6, 
    title="Permutation-based Variable Importance (Top 6)", 
    show=False
).update_layout(width=600)

In this case, the ranking diverges only at the 6th most important variable.
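
We can verify this by comparing the two rankings side by side (a minimal sketch based on the `result` DataFrames shown above):

# Sketch: rank variables by dropout loss under each loss function (1 = most important).
aux = ["_full_model_", "_baseline_"]
rank_auc = pvi.result.set_index("variable")["dropout_loss"].drop(aux).rank(ascending=False)
rank_acc = pvi_acc.result.set_index("variable")["dropout_loss"].drop(aux).rank(ascending=False)
pd.concat([rank_auc.rename("rank [1-AUC]"), rank_acc.rename("rank [1-ACC]")], axis=1)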

In many cases, it is more informative to report variable importance as a ratio relative to the loss of the full model.

In [13]:
pvi_ratio = explainer.model_parts(
    type="ratio", 
    label="XGBoost [loss 1-AUC]", 
    random_state=0
)
pvi_acc_ratio = explainer.model_parts(
    type="ratio", 
    loss_function=loss_one_minus_acc, 
    label="XGBoost [loss 1-ACC]", 
    random_state=0
)
In [14]:
pvi_ratio.plot(
    pvi_acc_ratio, 
    title="Comparison of perm-based VI ratios for different loss functions", 
    show=False
).update_layout(width=900)
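
As a sanity check (a sketch), the ratio values can be reproduced by dividing each raw dropout loss by the loss of the full model:

# Sketch: type="ratio" divides each dropout loss by the `_full_model_` loss.
losses = pvi.result.set_index("variable")["dropout_loss"]
losses / losses["_full_model_"]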

Note that the variables in both plots are in the same order, which is based on their average importance.

When there are many variables and/or they are correlated, we may want to group them when calculating variable importance.

In [15]:
pvi_grouped = explainer.model_parts(
    variable_groups={
        'personal': ['gender_male', 'age', 'sibsp', 'parch'], 
        'wealth': ['class', 'fare']
    }, random_state=0)
In [16]:
pvi_grouped.plot(
    title="Grouped Permutation-based Variable Importance", 
    show=False
).update_layout(autosize=False, width=600, height=250)

Last but not least, one can increase the `N` (number of sampled observations) and `B` (number of permutation rounds) parameters for a lower estimation error, at the cost of longer computation time.

In [17]:
pvi_accurate = explainer.model_parts(N=None, B=25, random_state=0) # None means all data
In [18]:
pd.concat([
    pvi_accurate.result.drop("label", axis=1).set_index("variable").rename(columns={"dropout_loss": "accurate PVI"}), 
    pvi.result.drop("label", axis=1).set_index("variable").rename(columns={"dropout_loss": "default PVI"})
], axis=1)
Out[18]:
accurate PVI default PVI
variable
parch 0.189974 0.189701
_full_model_ 0.190491 0.190491
embarked 0.193976 0.195568
sibsp 0.197059 0.197910
fare 0.199214 0.198708
age 0.242790 0.240732
class 0.279159 0.285486
gender_male 0.378469 0.378925
_baseline_ 0.493551 0.510619

Gini-based Variable Importance

Tree-based models provide a variable importance measure by design.

Many widely-used implementations expose it through attributes like `feature_importances_`.

In [19]:
model.feature_importances_
Out[19]:
array([0.05879982, 0.1634799 , 0.04673264, 0.04072754, 0.05098921,
       0.02352506, 0.6157458 ], dtype=float32)
In [20]:
pd.DataFrame({'variable': X_test.columns, 'importance': model.feature_importances_})
Out[20]:
variable importance
0 age 0.058800
1 class 0.163480
2 embarked 0.046733
3 fare 0.040728
4 sibsp 0.050989
5 parch 0.023525
6 gender_male 0.615746
In [21]:
import matplotlib.pyplot as plt
plt.bar(X_test.columns, model.feature_importances_)
plt.title("Gini-based Variable Importance")
plt.show()
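
Note that for XGBoost tree boosters, `feature_importances_` is gain-based by default (the "Gini" naming comes from scikit-learn forests); other importance types can be retrieved from the underlying booster (a sketch using `Booster.get_score`):

# Sketch: alternative built-in importance types from the underlying booster.
booster = model.get_booster()
booster.get_score(importance_type="weight")      # number of splits using each feature
booster.get_score(importance_type="total_gain")  # total loss reduction from those splits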

SHAP-based Variable Importance

See the dalex API documentation for all the possible values of the specific parameters, as well as the API of the shap package.

Unfortunately, TreeSHAP works only with numerical variables, so we first one-hot encode the remaining categorical ones.

In [22]:
X_ohe = pd.get_dummies(X, columns=['class', 'embarked'], drop_first=True)

model_ohe = xgboost.XGBClassifier(
    n_estimators=100, 
    max_depth=3, 
    use_label_encoder=False, 
    eval_metric="logloss"
)

X_ohe_train, X_ohe_test, y_ohe_train, y_ohe_test =\
    sklearn.model_selection.train_test_split(X_ohe, y, test_size=0.33, random_state=42)

model_ohe.fit(X_ohe_train, y_ohe_train)

explainer_ohe = dx.Explainer(model_ohe, X_ohe_test, y_ohe_test, label="XGBoost with OHE")
Preparation of a new explainer is initiated

  -> data              : 729 rows 14 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 729 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : XGBoost with OHE
  -> predict function  : <function yhat_proba_default at 0x000001D3E59D7700> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00631, mean = 0.336, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.999, mean = -0.0106, max = 0.98
  -> model_info        : package xgboost

A new explainer has been created!
In [23]:
shap_vi = explainer_ohe.model_parts(type="shap_wrapper", shap_explainer_type="TreeExplainer")
In [24]:
shap_vi.plot()
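
The same ranking can be obtained with the shap package directly (a minimal sketch bypassing the dalex wrapper):

# Sketch: compute TreeSHAP values with shap and plot mean |SHAP| per variable.
tree_explainer = shap.TreeExplainer(model_ohe)
shap_values = tree_explainer.shap_values(X_ohe_test)
shap.summary_plot(shap_values, X_ohe_test, plot_type="bar")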

For a theoretical introduction and more examples, see: https://ema.drwhy.ai