
Benchmark XGBoost explanations

These benchmark notebooks compare different types of explainers across a variety of metrics. They are all generated from Jupyter notebooks available on GitHub.

  • Model: XGBoost
  • Dataset: Boston Housing (Tabular)

Build Explainers
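The code below assumes a trained model and a training split already exist. A minimal setup sketch mirroring the notebook (shap.datasets.boston is available in older shap releases; the split and default hyperparameters are assumptions):

import xgboost
import shap
from sklearn.model_selection import train_test_split

# load the Boston housing data (shap.datasets.boston exists in older shap releases)
X, y = shap.datasets.boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the XGBoost regressor that the explainers below wrap
model = xgboost.XGBRegressor().fit(X_train, y_train)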

# build an independent masker and a partition masker over the training data
masker = shap.maskers.Independent(X_train)
pmasker = shap.maskers.Partition(X_train)

# build the explainers
explainers = [
    ("Permutation", shap.explainers.Permutation(model.predict, masker)),
    ("Permutation part.", shap.explainers.Permutation(model.predict, pmasker)),
    ("Partition", shap.explainers.Partition(model.predict, pmasker)),
    ("Tree", shap.explainers.Tree(model, masker)),
    ("Tree approx.", shap.explainers.Tree(model, masker, approximate=True)),
    ("Exact", shap.explainers.Exact(model.predict, masker)),
    ("Random", shap.explainers.other.Random(model.predict, masker))
]
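Each (name, explainer) pair can then be called on held-out data to produce attributions; a minimal sketch (X_test is assumed from the setup above):

# calling an explainer on data returns a shap.Explanation object
# holding per-feature attribution values
name, explainer = explainers[0]
shap_values = explainer(X_test)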

shap.maskers

# shap/maskers/__init__.py
from ._masker import Masker
from ._tabular import Independent, Partition, Impute
from ._image import Image
from ._text import Text
from ._fixed import Fixed
from ._composite import Composite
from ._fixed_composite import FixedComposite
from ._output_composite import OutputComposite

The two maskers used when building the explainers:

  • masker: Independent masks out tabular features by integrating over the given background dataset.
  • pmasker: Partition, unlike Independent, respects a hierarchical structure of the data.
    • param clustering: string (a distance metric used to build a clustering of the features) or numpy.ndarray (a precomputed clustering of the features).
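For example, a correlation-based clustering can be requested directly through that parameter (a minimal sketch; X_train is assumed as above):

# group correlated features into a hierarchy so they are masked together
pmasker_corr = shap.maskers.Partition(X_train, clustering="correlation")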

The following two maskers are used during benchmarking:

  • cmasker: Composite merges several maskers for different inputs into a single composite masker.
  • Fixed leaves the input unchanged during masking; it is used for things like scoring labels.
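A hedged sketch of how the two combine (pairing an Independent masker with Fixed is an assumption about the benchmark setup, not code from the notebook):

# merge a tabular masker with a Fixed masker so that extra inputs
# (e.g. labels passed through for scoring) are never perturbed
cmasker = shap.maskers.Composite(shap.maskers.Independent(X_train),
                                 shap.maskers.Fixed())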


Background - IG and SmoothGrad

Expected gradients combines ideas from Integrated Gradients, SHAP, and SmoothGrad into a single expected value equation.
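In shap, expected gradients is implemented by GradientExplainer; a minimal usage sketch (the model and background/sample data are assumed to be defined, e.g. a TensorFlow or PyTorch model):

# expected gradients: attributions are gradient integrals averaged over
# samples drawn from the background distribution
explainer = shap.GradientExplainer(model, background_data)
shap_values = explainer.shap_values(X_sample)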

Integrated Gradients (IG)

IG explains a model with an integral of gradients, which fixes the problem that the gradient becomes 0 once a feature's contribution has saturated. It requires a baseline image and integrates the gradients over multiple linear interpolations between the baseline and the original image.

Figure: IG resolves gradient saturation by integrating gradients along a linear interpolation path.
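For reference, the standard IG attribution for feature i, given input x and baseline x', is:

\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha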

SmoothGrad

The core idea is "removing noise by adding noise". It stems from the observation that adding a tiny perturbation to an image can make its gradient explanation unstable; the fix is to average the gradients over n perturbed copies of the image.
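In the notation of the SmoothGrad paper, the smoothed sensitivity map for class c averages n noisy gradients:

\hat{M}_c(x) = \frac{1}{n} \sum_{k=1}^{n} M_c\big(x + \mathcal{N}(0, \sigma^2)\big)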


Experiment code reference: Interpretability: LIME and SHAP in prose and code

Setup

Experiment dataset (Kaggle: Telco Customer Churn)

Telecom-carrier customer churn prediction: fields 0-19 are customer attributes and field 20 is the label (Churn: True means the customer churned).

Available features:  ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges']

Label Balance - [No Churn, Churn] :  [5163, 1869]
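A minimal sketch that reproduces the two printouts above (the CSV file name follows the Kaggle download; treat the path as an assumption):

import pandas as pd

# load the Kaggle Telco churn data (assumed local path)
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

features = [c for c in df.columns if c not in ("customerID", "Churn")]
print("Available features: ", features)
print("Label Balance - [No Churn, Churn] : ",
      [(df["Churn"] == "No").sum(), (df["Churn"] == "Yes").sum()])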

The dataset contains 7,043 customers, of whom about 25% churned. Each customer's 20 features include intrinsic attributes (gender, SeniorCitizen: whether the customer is a senior, Partner: whether the customer has a partner, etc.), as well as information on subscribed services (PhoneService, MultipleLines, InternetService, etc.) and on the customer's account (Contract: contract type, PaperlessBilling: electronic billing, MonthlyCharges: monthly fee, etc.).

  • The dataset's features include both continuous and categorical data;
  • Categorical fields can be represented differently depending on the model type. For example, tree-based models can be trained directly on label-encoded categories, while other models (linear regression, neural networks, etc.) usually perform better with one-hot encoded categorical variables; see the sketch below.
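A minimal sketch of the two encodings on a single column (the column choice is illustrative; the CSV path is assumed as above):

import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")  # assumed path

# label encoding: map each category to an integer code (fine for tree models)
contract_codes = df["Contract"].astype("category").cat.codes

# one-hot encoding: one indicator column per category (better for linear
# models and neural networks)
contract_onehot = pd.get_dummies(df["Contract"], prefix="Contract")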