Materials towards Homework 6: Fairness with XGBoost

Part of the eXplainable Machine Learning course for Machine Learning (MSc) studies at the University of Warsaw. @pbiecek @hbaniecki

v0.1.0: 2022-11-28

https://github.com/mim-uw/eXplainableMachineLearning-2023/tree/main/Homeworks/HW6

0. Import packages

In [1]:
import dalex as dx
import xgboost

import sklearn
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import platform
print(f'Python {platform.python_version()}')

{package.__name__: package.__version__ for package in [dx, xgboost, sklearn, pd, np]}
Python 3.9.12
Out[1]:
{'dalex': '1.5.0',
 'xgboost': '1.6.2',
 'sklearn': '1.1.3',
 'pandas': '1.3.5',
 'numpy': '1.23.4'}

We use the same XGBoost classifier trained on the Titanic dataset as in the previous materials towards Homework 3, Homework 4, and Homework 5.

1. Load and preprocess data

In [2]:
df = dx.datasets.load_titanic()

# cast object columns to category to use XGBoost's native categorical support
df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

X = df.drop(columns='survived')
# convert gender to a binary dummy only because the `max_cat_to_onehot` parameter in XGBoost does not work properly yet
X = pd.get_dummies(X, columns=["gender"], drop_first=True)
y = df.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

2. Model

In [3]:
model = xgboost.XGBClassifier(
    n_estimators=50, 
    max_depth=2, 
    use_label_encoder=False, 
    eval_metric="logloss",
    
    # native categorical support (requires a histogram-based tree method)
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X_train, y_train)
Out[3]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric='logloss', gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=2,
              max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [4]:
def pf_xgboost_classifier_categorical(model, df):
    # cast object columns to category so that XGBoost (enable_categorical=True) accepts them,
    # then return the predicted probability of the positive class
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated

  -> data              : 729 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 729 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x00000191B9757040> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0258, mean = 0.333, max = 0.99
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.987, mean = -0.00781, max = 0.936
  -> model_info        : package xgboost

A new explainer has been created!

3. Evaluate!

In [5]:
explainer.model_performance()
Out[5]:
                 recall  precision        f1  accuracy       auc
XGBClassifier  0.582278   0.730159  0.647887  0.794239  0.809509

Fairness

See the dalex API documentation for all the possible values of the specific parameters.

To compute group fairness metrics, we need to choose a protected variable and a privileged group.

Note that the protected variable does not need to be contained in the data. It is sometimes advised not to use sensitive attributes in modelling, but to still check the model for bias with respect to the privileged group.

In [6]:
protected_variable = X_test.gender_male.apply(lambda x: "male" if x else "female")
privileged_group = "male"

fobject = explainer.model_fairness(
    protected=protected_variable,
    privileged=privileged_group
)

Bias detection

Fairness objects provide a convenient way of describing model bias via the fairness_check() method.

Several metrics are computed and checked automatically (a minimal sketch of how their ratios are formed is given after the references below):

  1. TPR - True positive rate / Equal opportunity
  2. PPV - Positive predictive value / Predictive parity
  3. FPR - False positive rate / Predictive equality
  4. STP - Statistical parity

For a broad description of these metrics, consider referring to the following article and its references:

J. Wiśniewski & P. Biecek. fairmodels: a Flexible Tool for Bias Detection, Visualization, and Mitigation in Binary Classification Models. The R Journal, 2022.

More resources are available at https://fairmodels.drwhy.ai and specifically for Python at https://dalex.drwhy.ai/python#fairness.
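
For intuition, here is a minimal sketch of how the four metrics listed above and their ratios can be computed by hand with scikit-learn, assuming the default 0.5 cutoff; it is an illustration only, not the exact dalex implementation:

from sklearn.metrics import confusion_matrix

# hard predictions at an assumed 0.5 cutoff
y_pred = (explainer.predict(X_test) > 0.5).astype(int)

def group_metrics(y_true, y_hat, mask):
    # confusion matrix restricted to one group of the protected variable
    tn, fp, fn, tp = confusion_matrix(y_true[mask], y_hat[mask]).ravel()
    return {
        "TPR": tp / (tp + fn),          # equal opportunity
        "PPV": tp / (tp + fp),          # predictive parity
        "FPR": fp / (fp + tn),          # predictive equality
        "STP": (tp + fp) / mask.sum(),  # statistical parity (rate of positive decisions)
    }

is_male = (protected_variable == "male").to_numpy()
metrics_female = group_metrics(y_test.to_numpy(), y_pred, ~is_male)
metrics_male = group_metrics(y_test.to_numpy(), y_pred, is_male)

# fairness_check() reports unprivileged-to-privileged ratios of such metrics
{k: metrics_female[k] / metrics_male[k] for k in metrics_female}

With the default cutoff, these ratios should roughly match the fairness_check() printout below.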

In [7]:
fobject.fairness_check()
Bias detected in 4 metrics: TPR, PPV, FPR, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'male'. Parameter 'epsilon' was set to 0.8 and therefore metrics should be within (0.8, 1.25)
             TPR       ACC       PPV        FPR   STP
female  4.561905  0.958853  1.292437  17.102564  11.6

Bias visualization

In [8]:
fobject.plot()

We clearly observe that the model is highly biased with respect to the protected attribute. Let's construct a model without the protected variable.

In [9]:
X_train_without_prot, X_test_without_prot = X_train.drop("gender_male", axis=1), X_test.drop("gender_male", axis=1)

model_without_prot = xgboost.XGBClassifier(
    n_estimators=50, 
    max_depth=2, 
    use_label_encoder=False, 
    eval_metric="logloss",
    enable_categorical=True,
    tree_method="hist"
)

model_without_prot.fit(X_train_without_prot, y_train)

explainer_without_prot = dx.Explainer(
    model_without_prot, 
    X_test_without_prot, 
    y_test,
    predict_function=pf_xgboost_classifier_categorical,
    label="XGBClassifier without the protected attribute",
    verbose=False
)

fobject_without_prot = explainer_without_prot.model_fairness(protected_variable, privileged_group)

Now compare the two models.

In [10]:
fobject.plot(fobject_without_prot, show=False).\
    update_layout(autosize=False, width=800, height=450, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99))

We managed to improve on 3 fairness metrics (TPR, FPR, STP) at the cost of a worse Predictive parity (PPV) ratio.

Removing the protected attribute also comes at the cost of model performance:

In [11]:
pd.concat([explainer.model_performance().result, explainer_without_prot.model_performance().result], axis=0)
Out[11]:
                                                 recall  precision        f1  accuracy       auc
XGBClassifier                                  0.582278   0.730159  0.647887  0.794239  0.809509
XGBClassifier without the protected attribute  0.421941   0.689655  0.523560  0.750343  0.735138

Bias mitigation

Can we decrease model bias without decreasing model performance?

This is the goal of bias mitigation methods:

  • resample - returns indices that can be used to pick a (re)sampled subset of the training data
  • reweight - returns sample (case) weights for model training (a sketch of the underlying idea follows this list)
  • roc_pivot - returns the Explainer with y_hat predictions changed for observations close to the decision cutoff (Reject Option based Classification)
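
To give a flavour of what reweight produces, here is a minimal sketch of the classic reweighing idea (Kamiran & Calders): each (group, label) cell receives the weight P(group) * P(label) / P(group, label), which makes the protected attribute and the target independent in the weighted training data. The names reweigh_sketch and prot_train are illustrative, and the actual dalex implementation may differ in details:

import numpy as np
import pandas as pd

def reweigh_sketch(protected, y):
    # weight each (group, label) cell with P(group) * P(label) / P(group, label)
    data = pd.DataFrame({"g": np.asarray(protected), "y": np.asarray(y)})
    n = len(data)
    weights = np.empty(n)
    for (g, label), idx in data.groupby(["g", "y"]).groups.items():
        expected = (data.g == g).mean() * (data.y == label).mean()  # P(group) * P(label)
        observed = len(idx) / n                                     # P(group, label)
        weights[np.asarray(idx)] = expected / observed
    return weights

# e.g. on the training data (the protected vector is constructed as in the next cell)
prot_train = X_train.gender_male.apply(lambda x: "male" if x else "female")
w = reweigh_sketch(prot_train, y_train)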

Let's compare all three.

In [12]:
from dalex.fairness import resample, reweight, roc_pivot
from copy import copy

protected_variable_train = X_train.gender_male.apply(lambda x: "male" if x else "female")

# resample
indices_resample = resample(
    protected_variable_train, 
    y_train, 
    type='preferential',  # the alternative type is 'uniform'
    probs=model_without_prot.predict_proba(X_train_without_prot)[:, 1],  # 'preferential' resampling requires predicted probabilities
    verbose=False
)
model_resample = copy(model_without_prot)
model_resample.fit(X_train_without_prot.iloc[indices_resample, :], y_train.iloc[indices_resample])
explainer_resample = dx.Explainer(
    model_resample, 
    X_test_without_prot, 
    y_test, 
    label='XGBClassifier with Resample mitigation',
    verbose=False
)
fobject_resample = explainer_resample.model_fairness(
    protected_variable, 
    privileged_group
)

# reweight
sample_weight = reweight(
    protected_variable_train, 
    y_train, 
    verbose=False
)
model_reweight = copy(model_without_prot)
model_reweight.fit(X_train_without_prot, y_train, sample_weight=sample_weight)
explainer_reweight = dx.Explainer(
    model_reweight, 
    X_test_without_prot, 
    y_test, 
    label='XGBClassifier with Reweight mitigation',
    verbose=False
)
fobject_reweight = explainer_reweight.model_fairness(
    protected_variable, 
    privileged_group
)

# roc_pivot
explainer_roc_pivot = roc_pivot(
    copy(explainer_without_prot), 
    protected_variable, 
    privileged_group,
    verbose=False
)
explainer_roc_pivot.label = 'XGBClassifier with ROC pivot mitigation'
fobject_roc_pivot = explainer_roc_pivot.model_fairness(
    protected_variable, 
    privileged_group
)
In [13]:
fobject_without_prot.plot([fobject_resample, fobject_reweight, fobject_roc_pivot], show=False).\
    update_layout(autosize=False, width=800, height=450, legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99))

We can see the tradeoff between different fairness metrics.

Final conclusions will differ depending on the importance of a given metric and the epsilon threshold.
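
Note that the acceptable interval reported by fairness_check() is (epsilon, 1/epsilon), which is why epsilon=0.8 gives (0.8, 1.25) and epsilon=0.66 gives roughly (0.66, 1.515). A tiny sketch of this rule (the helper name within_limits is hypothetical):

def within_limits(ratio, epsilon=0.8):
    # a metric ratio passes the check if it lies inside (epsilon, 1/epsilon)
    return epsilon < ratio < 1 / epsilon

within_limits(1.29, epsilon=0.8)    # False -> PPV was flagged in the first check
within_limits(1.51, epsilon=0.66)   # True  -> just inside the wider interval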

In [14]:
for fobj in [fobject_without_prot, fobject_resample, fobject_reweight, fobject_roc_pivot]:
    print("\n========== " + fobj.label + " ==========")
    fobj.fairness_check(epsilon=0.66)
========== XGBClassifier without the protected attribute ==========
Bias detected in 3 metrics: TPR, PPV, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'male'. Parameter 'epsilon' was set to 0.66 and therefore metrics should be within (0.66, 1.515)
             TPR       ACC       PPV       FPR  STP
female  1.788779  0.834615  1.957806  1.076923  3.0

========== XGBClassifier with Resample mitigation ==========
Bias detected in 3 metrics: TPR, PPV, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'male'. Parameter 'epsilon' was set to 0.66 and therefore metrics should be within (0.66, 1.515)
            TPR       ACC       PPV       FPR       STP
female  1.69906  0.843389  2.006397  0.795918  2.772414

========== XGBClassifier with Reweight mitigation ==========
Bias detected in 2 metrics: PPV, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'male'. Parameter 'epsilon' was set to 0.66 and therefore metrics should be within (0.66, 1.515)
            TPR       ACC      PPV       FPR  STP
female  1.51049  0.744544  1.98927  0.886364  2.5

========== XGBClassifier with ROC pivot mitigation ==========
Bias detected in 4 metrics: TPR, PPV, FPR, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'male'. Parameter 'epsilon' was set to 0.66 and therefore metrics should be within (0.66, 1.515)
             TPR       ACC       PPV       FPR    STP
female  2.237918  0.875161  1.991247  1.593023  3.696

Finally, let's check the bias-performance tradeoff.

In [15]:
pd.concat([
    explainer_without_prot.model_performance().result, 
    explainer_resample.model_performance().result,
    explainer_reweight.model_performance().result,
    explainer_roc_pivot.model_performance().result
], axis=0)
Out[15]:
                                                 recall  precision        f1  accuracy       auc
XGBClassifier without the protected attribute  0.421941   0.689655  0.523560  0.750343  0.735138
XGBClassifier with Resample mitigation         0.430380   0.684564  0.528497  0.750343  0.692549
XGBClassifier with Reweight mitigation         0.358650   0.664062  0.465753  0.732510  0.687554
XGBClassifier with ROC pivot mitigation        0.434599   0.695946  0.535065  0.754458  0.735086

For a theoretical introduction and more examples, see: https://ema.drwhy.ai