Materials towards Homework 4: CP and PDP with XGBoost

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.0: 2022-10-26

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW4

0. Import packages

In [2]:
import dalex as dx
import xgboost
import alibi

import sklearn

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import platform
print(f'Python {platform.python_version()}')

{package.__name__: package.__version__ for package in [dx, xgboost, alibi, sklearn, pd, np]}
Python 3.9.2
Out[2]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'alibi': '0.8.0',
 'sklearn': '1.1.3',
 'pandas': '1.3.5',
 'numpy': '1.19.5'}

We use the same XGBoost classifier trained on the Titanic dataset as in the previous materials towards Homework 2 & Homework 3.

Unfortunately, at this moment, we can't compare dalex to alibi v0.8.0 because the latter seems not to support categorical variables (tested using code from this notebook).

1. Load and preprocess data

In [3]:
df = dx.datasets.load_titanic()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

X = df.drop(columns='survived')
# convert gender to binary only because the `max_cat_to_onehot` parameter in XGBoost does not work properly yet
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

2. Model

In [4]:
model = xgboost.XGBClassifier(
    n_estimators=50, 
    max_depth=2, 
    use_label_encoder=False, 
    eval_metric="logloss",
    
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X_train, y_train)
Out[4]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=2,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [5]:
def pf_xgboost_classifier_categorical(model, df):
    # XGBoost with enable_categorical=True expects 'category' dtypes, while the
    # data passed by the explainer may arrive with 'object' columns, so convert them back
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated

  -> data              : 729 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 729 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x000001EC83778280> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0258, mean = 0.333, max = 0.99
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.987, mean = -0.00781, max = 0.936
  -> model_info        : package xgboost

A new explainer has been created!

3. Evaluate!

In [6]:
explainer.model_performance()
Out[6]:
recall precision f1 accuracy auc
XGBClassifier 0.582278 0.730159 0.647887 0.794239 0.809509

Which variables are important?

In [7]:
explainer.model_parts().result
Out[7]:
variable dropout_loss label
0 parch 0.189779 XGBClassifier
1 _full_model_ 0.190491 XGBClassifier
2 embarked 0.194837 XGBClassifier
3 sibsp 0.196293 XGBClassifier
4 fare 0.197369 XGBClassifier
5 age 0.243815 XGBClassifier
6 class 0.274981 XGBClassifier
7 gender_male 0.378696 XGBClassifier
8 _baseline_ 0.495097 XGBClassifier

Let's analyze variable effects with both PDP and ALE.

Ceteris Paribus

See the API documentation for all available parameters and their possible values:

In [8]:
cp = explainer.predict_profile(new_observation=X.iloc[[400]])
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:00<00:00, 89.74it/s]

We can visualize the what-if analysis for a single observation of interest...

In [9]:
cp.plot(variables=["age", "sibsp"])

...and for many observations at the same time.

In [10]:
cp_10 = explainer.predict_profile(new_observation=X.iloc[400:410])
cp_10.plot(variables=["age", "sibsp"])
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:00<00:00, 82.35it/s]

An average of many single profiles (local explanations) estimates the partial dependence (global explanation).
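
To see this directly, we can average many ceteris paribus profiles by hand and compare the curve with the PDP computed below. A minimal sketch, assuming the first 100 observations, the age variable, and that the profiled grid values are stored in the age column of the ceteris paribus result:

# Sketch: the mean of many ceteris paribus profiles approximates the PDP.
cp_many = explainer.predict_profile(new_observation=X.iloc[:100], verbose=False)
age_profiles = cp_many.result[cp_many.result["_vname_"] == "age"]
# for each grid value of age, average the predictions over the 100 observations
manual_pdp_age = age_profiles.groupby("age")["_yhat_"].mean()
manual_pdp_age.head()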

Partial Dependence Plots

See the API documentation for all available parameters and their possible values:

In [11]:
pdp = explainer.model_profile() # defaults: type="pdp", N=300 
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:00<00:00, 42.68it/s]
In [12]:
pdp.result
Out[12]:
_vname_ _label_ _x_ _yhat_ _ids_
0 age XGBClassifier 0.166667 0.787982 0
1 age XGBClassifier 0.905000 0.787982 0
2 age XGBClassifier 1.643333 0.787982 0
3 age XGBClassifier 2.381667 0.787982 0
4 age XGBClassifier 3.120000 0.787982 0
... ... ... ... ... ...
500 gender_male XGBClassifier 0.960000 0.770857 0
501 gender_male XGBClassifier 0.970000 0.770857 0
502 gender_male XGBClassifier 0.980000 0.770857 0
503 gender_male XGBClassifier 0.990000 0.770857 0
504 gender_male XGBClassifier 1.000000 0.227313 0

505 rows × 5 columns

In [13]:
pdp.plot(variables=["age", "fare"])

The explanations seem to indicate that age is an important variable.

In [14]:
pdp.plot(variables=["age", "fare"], geom="profiles", title="Partial Dependence Plot with individual profiles")

How about categorical variables?

In [15]:
pdp.plot(variables=["class"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [15], in <cell line: 1>()
----> 1 pdp.plot(variables=["class"])

File c:\users\hbani\appdata\local\programs\python\python39\lib\site-packages\dalex\model_explanations\_aggregated_profiles\object.py:239, in AggregatedProfiles.plot(self, objects, geom, variables, center, size, alpha, color, facet_ncol, facet_scales, title, y_title, horizontal_spacing, vertical_spacing, show)
    237     all_variables = _global_utils.intersect_unsorted(variables, all_variables)
    238     if len(all_variables) == 0:
--> 239         raise TypeError("variables do not overlap with " + ''.join(variables))
    241     _result_df = _result_df.loc[_result_df['_vname_'].isin(all_variables), :]
    243 #  calculate y axis range to allow for fixedrange True

TypeError: variables do not overlap with class

model_profile uses variable_type="numerical" by default. Let's fix this.

In [16]:
pdp_categorical = explainer.model_profile(variable_type="categorical")
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:00<00:00, 44.30it/s]
In [17]:
pdp_categorical.plot(variables="class", title="PDP")

For example, these numerical plots might not look as intended.

In [18]:
pdp.plot(variables=["gender_male", "sibsp"])

We can fix this by computing the variable effects at the points of interest.

In [19]:
explainer.model_profile(
    variable_splits={'gender_male': X.gender_male.unique(), 'sibsp': X.sibsp.unique()}, 
    verbose=False)\
.plot(geom="bars", title="Partial Dependence Plot")

We can also combine numerical and categorical variables by grouping the profiles with respect to a categorical variable.

In [20]:
pdp_grouped = explainer.model_profile(groups="gender_male")
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:00<00:00, 47.62it/s]
In [21]:
pdp_grouped.plot(variables=["age", "fare"], title="PDP")

Note that the plots depend on the grid of split points. By default, the grid is uniformly distributed; we can change it to quantiles.
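
To see how the two grids differ, here is a small sketch for age; it assumes 101 grid points, which we believe is the dalex default:

# Sketch: uniform splits are evenly spaced over the range of the variable,
# while quantile splits follow the density of the data.
probs = np.linspace(0, 1, 101)
uniform_splits = np.linspace(X_test.age.min(), X_test.age.max(), 101)
quantile_splits = np.quantile(X_test.age.dropna(), probs)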

In [22]:
# default: variable_splits_type="uniform"
pdp_splits = explainer.model_profile(groups="gender_male", variable_splits_type="quantile")
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:00<00:00, 68.63it/s]
In [23]:
pdp_splits.plot(variables=["age", "fare"])

Accumulated Local Effects

Now, let's change the algorithm for estimating variable effects by setting type="ale".

In [24]:
ale_num = explainer.model_profile(type="ale", variables=["age", "fare"], center=False)
Calculating ceteris paribus: 100%|███████████████████| 2/2 [00:00<00:00, 35.09it/s]
Calculating accumulated dependency: 100%|████████████| 2/2 [00:00<00:00, 12.66it/s]
In [25]:
ale_num.plot(title="Accumulated Local Effects")

4. Set proper labels to compare the two explanations!

In [26]:
ale_num = explainer.model_profile(type="ale", N=None, variables=["age", "fare"], label="ALE", verbose=False)
pdp_num = explainer.model_profile(type="pdp", N=None, variables=["age", "fare"], label="PDP", verbose=False)
In [27]:
ale_num.plot(pdp_num)

Note that ALE has an arbitrary vertical baseline (mean of the model's prediction by default).

It seems that there is not much difference in the shape of the two profiles in the case of this model.
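
To compare the shapes directly, one can remove the vertical offset from both results before plotting. A minimal sketch operating on the result data frames, whose column layout was shown earlier:

# Sketch: center both profiles at zero mean so that only the shapes are compared.
ale_centered = ale_num.result.copy()
pdp_centered = pdp_num.result.copy()
for res in (ale_centered, pdp_centered):
    res["_yhat_"] -= res.groupby("_vname_")["_yhat_"].transform("mean")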

5. Compare models

Let's create a larger XGBoost classifier.

In [28]:
model_large = xgboost.XGBClassifier(
    n_estimators=1000, 
    max_depth=30, 
    use_label_encoder=False, 
    eval_metric="logloss",
    
    enable_categorical=True,
    tree_method="hist"
)

model_large.fit(X_train, y_train)
Out[28]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=30,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=1000, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

Remember to change the default label value.

In [29]:
explainer_large = dx.Explainer(
    model_large, 
    X_test, 
    y_test, 
    predict_function=pf_xgboost_classifier_categorical,
    label="Larger XGBoost",
    verbose=False
)

Compare the performance.

In [30]:
pd.concat([explainer.model_performance().result, explainer_large.model_performance().result]) 
Out[30]:
recall precision f1 accuracy auc
XGBClassifier 0.582278 0.730159 0.647887 0.794239 0.809509
Larger XGBoost 0.548523 0.646766 0.593607 0.755830 0.783592

The larger model seems overfitted. Can we observe that in the explanations?
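
Before looking at the profiles, a quick train/test comparison supports the overfitting suspicion. A sketch reusing the splits and models defined above:

from sklearn.metrics import accuracy_score

# Sketch: a large gap between train and test accuracy hints at overfitting.
for name, m in [("XGBClassifier", model), ("Larger XGBoost", model_large)]:
    train_acc = accuracy_score(y_train, m.predict(X_train))
    test_acc = accuracy_score(y_test, m.predict(X_test))
    print(f"{name}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")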

In [31]:
pdp_large = explainer_large.model_profile()
Calculating ceteris paribus: 100%|███████████████████| 7/7 [00:02<00:00,  3.46it/s]
In [32]:
pdp_large.plot(pdp, variables=["age", "fare"], title="PDP")

Note that the simpler model has smoother variable effects. Would you trust the larger model based on these explanations?


For a theoretical introduction and more examples, see: https://ema.drwhy.ai