9.1 模型解释概述

9.1.1 为什么需要模型解释

在机器学习项目中,模型的可解释性变得越来越重要,特别是在以下场景:

  • 监管要求:金融、医疗等行业需要解释模型决策
  • 业务理解:帮助业务人员理解模型的工作原理
  • 模型调试:发现模型的偏见和错误
  • 建立信任:增强用户对模型的信任度
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

class ModelInterpretationOverview:
    def __init__(self):
        self.models = {}
        self.datasets = {}
        
    def demonstrate_interpretability_importance(self):
        """演示模型可解释性的重要性"""
        print("=== 模型可解释性重要性演示 ===")
        
        # 创建一个简单的分类数据集
        X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, 
                                 n_redundant=2, n_clusters_per_class=1, random_state=42)
        
        feature_names = [f'特征_{i+1}' for i in range(X.shape[1])]
        df = pd.DataFrame(X, columns=feature_names)
        df['target'] = y
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练不同复杂度的模型
        models = {
            '逻辑回归': LogisticRegression(random_state=42),
            '决策树': DecisionTreeClassifier(max_depth=5, random_state=42),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        results = {}
        
        for name, model in models.items():
            # 训练模型
            model.fit(X_train, y_train)
            
            # 预测
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            
            results[name] = {
                'model': model,
                'accuracy': accuracy,
                'predictions': y_pred
            }
            
            print(f"{name} 准确率: {accuracy:.3f}")
        
        # 可视化模型复杂度与可解释性的权衡
        self.visualize_interpretability_tradeoff(results, feature_names)
        
        return results, df
        
    def visualize_interpretability_tradeoff(self, results, feature_names):
        """可视化模型复杂度与可解释性的权衡"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 1. 模型准确率比较
        models = list(results.keys())
        accuracies = [results[model]['accuracy'] for model in models]
        
        axes[0, 0].bar(models, accuracies, color=['skyblue', 'lightgreen', 'lightcoral'])
        axes[0, 0].set_title('不同模型的准确率')
        axes[0, 0].set_ylabel('准确率')
        axes[0, 0].set_ylim(0.8, 1.0)
        for i, acc in enumerate(accuracies):
            axes[0, 0].text(i, acc + 0.01, f'{acc:.3f}', ha='center')
        axes[0, 0].grid(True, alpha=0.3)
        
        # 2. 决策树可视化(最可解释的模型)
        tree_model = results['决策树']['model']
        # feature_names 必须与特征总数一致,否则 plot_tree 会报错
        plot_tree(tree_model, max_depth=3, feature_names=feature_names,
                 class_names=['类别0', '类别1'], filled=True, ax=axes[0, 1])
        axes[0, 1].set_title('决策树结构(部分)')
        
        # 3. 特征重要性比较
        models_with_importance = ['决策树', '随机森林']
        colors = ['lightgreen', 'lightcoral']
        
        for i, model_name in enumerate(models_with_importance):
            model = results[model_name]['model']
            importance = model.feature_importances_
            
            x_pos = np.arange(len(feature_names)) + i * 0.35
            axes[1, 0].bar(x_pos, importance, width=0.35, 
                          label=model_name, color=colors[i], alpha=0.7)
        
        axes[1, 0].set_title('不同模型的特征重要性')
        axes[1, 0].set_xlabel('特征')
        axes[1, 0].set_ylabel('重要性')
        axes[1, 0].set_xticks(np.arange(len(feature_names)) + 0.175)
        axes[1, 0].set_xticklabels(feature_names, rotation=45)
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # 4. 可解释性vs复杂度权衡图
        interpretability_scores = [0.9, 0.7, 0.3]  # 主观评分
        complexity_scores = [0.2, 0.5, 0.8]  # 主观评分
        
        scatter = axes[1, 1].scatter(complexity_scores, interpretability_scores, 
                                   s=[acc*1000 for acc in accuracies], 
                                   c=['skyblue', 'lightgreen', 'lightcoral'], 
                                   alpha=0.7, edgecolors='black')
        
        for i, model in enumerate(models):
            axes[1, 1].annotate(model, (complexity_scores[i], interpretability_scores[i]),
                              xytext=(5, 5), textcoords='offset points')
        
        axes[1, 1].set_xlabel('模型复杂度')
        axes[1, 1].set_ylabel('可解释性')
        axes[1, 1].set_title('可解释性 vs 复杂度权衡\n(气泡大小表示准确率)')
        axes[1, 1].grid(True, alpha=0.3)
        axes[1, 1].set_xlim(0, 1)
        axes[1, 1].set_ylim(0, 1)
        
        plt.tight_layout()
        plt.show()

# 演示模型解释概述
print("=== 模型解释概述 ===")
overview = ModelInterpretationOverview()
model_results, demo_data = overview.demonstrate_interpretability_importance()

9.2 特征重要性分析

9.2.1 基于模型的特征重要性

不同的机器学习模型提供了不同的特征重要性度量:树模型基于分裂带来的不纯度下降,线性模型基于(标准化后的)系数大小,而排列重要性通过打乱单个特征、观察性能下降来评估,适用于任意模型。
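
在逐一演示之前,先看一个手工实现排列重要性的最小示例(仅作原理演示:只打乱一次、只看前3个特征都是这里的简化假设,实际项目建议直接使用 sklearn.inspection.permutation_importance):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 原理:打乱某一列会破坏该特征与标签的关联,模型分数下降越多,该特征越重要
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
baseline = model.score(X_test, y_test)

rng = np.random.RandomState(42)
for j in range(3):  # 仅演示前3个特征
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, j])  # 只打乱第j列
    print(f'特征{j}: 分数下降 {baseline - model.score(X_perm, y_test):.4f}')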

class FeatureImportanceAnalysis:
    def __init__(self):
        self.importance_methods = {}
        
    def demonstrate_tree_based_importance(self):
        """演示基于树的特征重要性"""
        print("=== 基于树的特征重要性 ===")
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        feature_names = data.feature_names
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练不同的树模型
        models = {
            '决策树': DecisionTreeClassifier(max_depth=10, random_state=42),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        importance_results = {}
        
        for name, model in models.items():
            print(f"\n训练 {name}...")
            model.fit(X_train, y_train)
            
            # 获取特征重要性
            importance = model.feature_importances_
            
            # 创建特征重要性DataFrame
            importance_df = pd.DataFrame({
                'feature': feature_names,
                'importance': importance
            }).sort_values('importance', ascending=False)
            
            importance_results[name] = importance_df
            
            print(f"{name} 准确率: {model.score(X_test, y_test):.3f}")
            print(f"Top 5 重要特征:")
            for i, row in importance_df.head().iterrows():
                print(f"  {row['feature']}: {row['importance']:.4f}")
        
        # 可视化特征重要性
        self.visualize_tree_importance(importance_results)
        
        return importance_results
        
    def demonstrate_linear_model_importance(self):
        """演示线性模型的特征重要性"""
        print("\n=== 线性模型特征重要性 ===")
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        feature_names = data.feature_names
        
        # 标准化特征
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
        
        # 训练逻辑回归
        lr_model = LogisticRegression(random_state=42, max_iter=1000)
        lr_model.fit(X_train, y_train)
        
        # 获取系数(特征重要性)
        coefficients = lr_model.coef_[0]
        
        # 创建特征重要性DataFrame
        importance_df = pd.DataFrame({
            'feature': feature_names,
            'coefficient': coefficients,
            'abs_coefficient': np.abs(coefficients)
        }).sort_values('abs_coefficient', ascending=False)
        
        print(f"逻辑回归准确率: {lr_model.score(X_test, y_test):.3f}")
        print(f"Top 5 重要特征(按系数绝对值):")
        for i, row in importance_df.head().iterrows():
            print(f"  {row['feature']}: {row['coefficient']:.4f}")
        
        # 可视化线性模型重要性
        self.visualize_linear_importance(importance_df)
        
        return importance_df
        
    def demonstrate_permutation_importance(self):
        """演示排列重要性"""
        print("\n=== 排列重要性 ===")
        
        from sklearn.inspection import permutation_importance
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        feature_names = data.feature_names
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练随机森林
        rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_model.fit(X_train, y_train)
        
        # 计算排列重要性
        perm_importance = permutation_importance(rf_model, X_test, y_test, 
                                               n_repeats=10, random_state=42)
        
        # 创建排列重要性DataFrame
        perm_df = pd.DataFrame({
            'feature': feature_names,
            'importance_mean': perm_importance.importances_mean,
            'importance_std': perm_importance.importances_std
        }).sort_values('importance_mean', ascending=False)
        
        print(f"Top 5 重要特征(排列重要性):")
        for i, row in perm_df.head().iterrows():
            print(f"  {row['feature']}: {row['importance_mean']:.4f} ± {row['importance_std']:.4f}")
        
        # 比较不同重要性方法
        tree_importance = rf_model.feature_importances_
        self.compare_importance_methods(feature_names, tree_importance, perm_df)
        
        return perm_df
        
    def visualize_tree_importance(self, importance_results):
        """可视化基于树的特征重要性"""
        fig, axes = plt.subplots(1, 2, figsize=(16, 8))
        
        colors = ['skyblue', 'lightcoral']
        
        for i, (model_name, importance_df) in enumerate(importance_results.items()):
            # 选择Top 15特征
            top_features = importance_df.head(15)
            
            y_pos = np.arange(len(top_features))
            axes[i].barh(y_pos, top_features['importance'], color=colors[i], alpha=0.7)
            axes[i].set_yticks(y_pos)
            axes[i].set_yticklabels(top_features['feature'], fontsize=10)
            axes[i].set_xlabel('特征重要性')
            axes[i].set_title(f'{model_name} - Top 15 特征重要性')
            axes[i].grid(True, alpha=0.3)
            
            # 添加数值标签
            for j, importance in enumerate(top_features['importance']):
                axes[i].text(importance + 0.001, j, f'{importance:.3f}', 
                           va='center', fontsize=8)
        
        plt.tight_layout()
        plt.show()
        
    def visualize_linear_importance(self, importance_df):
        """可视化线性模型特征重要性"""
        fig, axes = plt.subplots(1, 2, figsize=(16, 8))
        
        # 1. 系数值(包含正负)
        top_features = importance_df.head(15)
        y_pos = np.arange(len(top_features))
        
        colors = ['red' if coef < 0 else 'blue' for coef in top_features['coefficient']]
        axes[0].barh(y_pos, top_features['coefficient'], color=colors, alpha=0.7)
        axes[0].set_yticks(y_pos)
        axes[0].set_yticklabels(top_features['feature'], fontsize=10)
        axes[0].set_xlabel('系数值')
        axes[0].set_title('逻辑回归系数(Top 15)')
        axes[0].axvline(x=0, color='black', linestyle='--', alpha=0.5)
        axes[0].grid(True, alpha=0.3)
        
        # 2. 系数绝对值
        axes[1].barh(y_pos, top_features['abs_coefficient'], color='green', alpha=0.7)
        axes[1].set_yticks(y_pos)
        axes[1].set_yticklabels(top_features['feature'], fontsize=10)
        axes[1].set_xlabel('系数绝对值')
        axes[1].set_title('逻辑回归系数绝对值(Top 15)')
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def compare_importance_methods(self, feature_names, tree_importance, perm_df):
        """比较不同的特征重要性方法"""
        fig, axes = plt.subplots(1, 2, figsize=(16, 8))
        
        # 创建比较DataFrame
        comparison_df = pd.DataFrame({
            'feature': feature_names,
            'tree_importance': tree_importance,
            'perm_importance': perm_df['importance_mean']
        })
        
        # 1. 散点图比较
        axes[0].scatter(comparison_df['tree_importance'], comparison_df['perm_importance'], 
                       alpha=0.7, s=50)
        
        # 添加对角线
        max_val = max(comparison_df['tree_importance'].max(), comparison_df['perm_importance'].max())
        axes[0].plot([0, max_val], [0, max_val], 'r--', alpha=0.5)
        
        axes[0].set_xlabel('树模型特征重要性')
        axes[0].set_ylabel('排列重要性')
        axes[0].set_title('不同重要性方法比较')
        axes[0].grid(True, alpha=0.3)
        
        # 添加相关系数
        correlation = comparison_df['tree_importance'].corr(comparison_df['perm_importance'])
        axes[0].text(0.05, 0.95, f'相关系数: {correlation:.3f}', 
                    transform=axes[0].transAxes, bbox=dict(boxstyle="round", facecolor='wheat'))
        
        # 2. Top 10特征比较
        tree_top10 = comparison_df.nlargest(10, 'tree_importance')['feature'].tolist()
        perm_top10 = comparison_df.nlargest(10, 'perm_importance')['feature'].tolist()
        
        # 计算重叠
        overlap = set(tree_top10) & set(perm_top10)
        
        # 可视化Top 10特征
        x = np.arange(10)
        width = 0.35
        
        tree_ranks = [tree_top10.index(f) + 1 if f in tree_top10 else 11 for f in perm_top10]
        perm_ranks = list(range(1, 11))
        
        axes[1].bar(x - width/2, [11 - r for r in tree_ranks], width, 
                   label='树模型排名', alpha=0.7, color='skyblue')
        axes[1].bar(x + width/2, [11 - r for r in perm_ranks], width, 
                   label='排列重要性排名', alpha=0.7, color='lightcoral')
        
        axes[1].set_xlabel('特征(按排列重要性排序)')
        axes[1].set_ylabel('重要性排名(倒序)')
        axes[1].set_title(f'Top 10特征排名比较\n重叠特征数: {len(overlap)}/10')
        axes[1].set_xticks(x)
        axes[1].set_xticklabels([f.split()[-1] for f in perm_top10], rotation=45)
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print(f"\n重要性方法比较结果:")
        print(f"相关系数: {correlation:.3f}")
        print(f"Top 10特征重叠数: {len(overlap)}/10")
        print(f"重叠特征: {list(overlap)}")

# 演示特征重要性分析
print("=== 特征重要性分析 ===")
importance_analyzer = FeatureImportanceAnalysis()
tree_importance = importance_analyzer.demonstrate_tree_based_importance()
linear_importance = importance_analyzer.demonstrate_linear_model_importance()
perm_importance = importance_analyzer.demonstrate_permutation_importance()

9.3 SHAP (SHapley Additive exPlanations)

9.3.1 SHAP原理与应用

SHAP是一种基于合作博弈论中Shapley值的模型解释方法:把一次预测看作各特征"合作"产生的收益,按每个特征在所有特征子集中的平均边际贡献为其分配一个带符号的贡献值,且所有特征的贡献值之和等于预测值与基准值之差。
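
为了直观理解Shapley值的定义,下面按公式对一个3特征的玩具线性模型穷举所有特征子集(仅作原理演示;"缺失特征用训练集均值代替"是这里采用的近似假设,SHAP库实际使用更高效的算法):

from itertools import combinations
from math import factorial
import numpy as np
from sklearn.linear_model import LinearRegression

# 玩具数据:已知系数的线性模型,便于核对结果
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + 1 * X[:, 1] - 3 * X[:, 2]
model = LinearRegression().fit(X, y)

x = X[0]                     # 待解释样本
background = X.mean(axis=0)  # 缺失特征用均值代替(近似假设)

def value(subset):
    """子集内的特征取样本值,其余取背景均值,返回模型预测"""
    z = background.copy()
    for j in subset:
        z[j] = x[j]
    return model.predict(z.reshape(1, -1))[0]

n = 3
for i in range(n):
    phi = 0.0
    others = [j for j in range(n) if j != i]
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            # Shapley权重: |S|! * (n-|S|-1)! / n!
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (value(S + (i,)) - value(S))
    print(f'特征{i}的Shapley值: {phi:.3f}')

对线性模型,结果应接近 系数 × (样本值 - 背景均值),可以据此验证实现是否正确。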

# 注意:需要安装shap库
# pip install shap

try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("警告: SHAP库未安装,将跳过SHAP相关演示")
    print("请运行: pip install shap")

class SHAPAnalysis:
    def __init__(self):
        self.explainers = {}
        self.shap_values = {}
        
    def demonstrate_shap_tree_explainer(self):
        """演示SHAP树解释器"""
        if not SHAP_AVAILABLE:
            print("SHAP库未安装,跳过演示")
            return None
            
        print("=== SHAP树解释器演示 ===")
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        feature_names = data.feature_names
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练随机森林
        rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_model.fit(X_train, y_train)
        
        print(f"模型准确率: {rf_model.score(X_test, y_test):.3f}")
        
        # 创建SHAP解释器
        explainer = shap.TreeExplainer(rf_model)
        
        # 计算SHAP值(使用部分测试数据以节省时间)
        shap_values = explainer.shap_values(X_test[:100])
        
        # 如果是二分类,取正类的SHAP值
        # 兼容不同版本的SHAP:旧版返回长度为2的列表,新版可能返回形如(样本, 特征, 类别)的三维数组
        if isinstance(shap_values, list) and len(shap_values) == 2:
            shap_values_class1 = shap_values[1]
        elif isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
            shap_values_class1 = shap_values[:, :, 1]
        else:
            shap_values_class1 = shap_values
        
        # 可视化SHAP结果
        self.visualize_shap_results(shap_values_class1, X_test[:100], feature_names, explainer.expected_value)
        
        return explainer, shap_values
        
    def demonstrate_shap_linear_explainer(self):
        """演示SHAP线性解释器"""
        if not SHAP_AVAILABLE:
            print("SHAP库未安装,跳过演示")
            return None
            
        print("\n=== SHAP线性解释器演示 ===")
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        feature_names = data.feature_names
        
        # 标准化特征
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
        
        # 训练逻辑回归
        lr_model = LogisticRegression(random_state=42, max_iter=1000)
        lr_model.fit(X_train, y_train)
        
        print(f"模型准确率: {lr_model.score(X_test, y_test):.3f}")
        
        # 创建SHAP解释器
        explainer = shap.LinearExplainer(lr_model, X_train)
        
        # 计算SHAP值
        shap_values = explainer.shap_values(X_test[:100])
        
        # 可视化SHAP结果
        self.visualize_shap_linear_results(shap_values, X_test[:100], feature_names)
        
        return explainer, shap_values
        
    def visualize_shap_results(self, shap_values, X_test, feature_names, expected_value=0):
        """可视化SHAP结果"""
        if not SHAP_AVAILABLE:
            return
            
        print("生成SHAP可视化图表...")
        
        # 1. SHAP摘要图
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_values, X_test, feature_names=feature_names, show=False)
        plt.title('SHAP摘要图 - 特征重要性与影响')
        plt.tight_layout()
        plt.show()
        
        # 2. SHAP条形图
        plt.figure(figsize=(10, 8))
        shap.summary_plot(shap_values, X_test, feature_names=feature_names, 
                         plot_type="bar", show=False)
        plt.title('SHAP条形图 - 平均特征重要性')
        plt.tight_layout()
        plt.show()
        
        # 3. 单个预测的SHAP瀑布图
        if hasattr(shap, 'waterfall_plot'):
            # 二分类时 expected_value 可能是长度为2的数组,取正类的基准值
            base_value = expected_value[1] if np.ndim(expected_value) > 0 else expected_value
            plt.figure(figsize=(12, 8))
            shap.waterfall_plot(shap.Explanation(values=shap_values[0],
                                                 base_values=base_value,
                                                 data=X_test[0],
                                                 feature_names=list(feature_names)),
                                show=False)
            plt.title('SHAP瀑布图 - 单个预测解释')
            plt.tight_layout()
            plt.show()
        
    def visualize_shap_linear_results(self, shap_values, X_test, feature_names):
        """可视化线性模型SHAP结果"""
        if not SHAP_AVAILABLE:
            return
            
        print("生成线性模型SHAP可视化图表...")
        
        # SHAP摘要图
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_values, X_test, feature_names=feature_names, show=False)
        plt.title('线性模型SHAP摘要图')
        plt.tight_layout()
        plt.show()
        
    def demonstrate_shap_interaction(self):
        """演示SHAP交互效应"""
        if not SHAP_AVAILABLE:
            print("SHAP库未安装,跳过演示")
            return None
            
        print("\n=== SHAP交互效应演示 ===")
        
        # 创建一个简单的回归数据集
        X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
        feature_names = [f'特征_{i+1}' for i in range(X.shape[1])]
        
        # 添加交互效应:把特征1替换为特征1与特征2的乘积
        # 由于y是用原始特征生成的,模型必须同时利用这两列才能还原预测,从而产生可检测的交互
        X[:, 0] = X[:, 0] * X[:, 1]
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练随机森林回归器
        rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
        rf_model.fit(X_train, y_train)
        
        print(f"模型R²分数: {rf_model.score(X_test, y_test):.3f}")
        
        # 创建SHAP解释器
        explainer = shap.TreeExplainer(rf_model)
        
        # 计算SHAP交互值
        shap_interaction_values = explainer.shap_interaction_values(X_test[:50])
        
        # 可视化交互效应
        self.visualize_shap_interactions(shap_interaction_values, X_test[:50], feature_names)
        
        return shap_interaction_values

    def visualize_shap_interactions(self, shap_interaction_values, X_test, feature_names):
        """可视化SHAP交互效应"""
        if not SHAP_AVAILABLE:
            return
            
        print("生成SHAP交互效应可视化...")
        
        # 计算平均交互强度
        mean_interaction = np.abs(shap_interaction_values).mean(0)
        
        # 创建交互矩阵热图
        plt.figure(figsize=(10, 8))
        mask = np.triu(np.ones_like(mean_interaction, dtype=bool))
        sns.heatmap(mean_interaction, mask=mask, annot=True, fmt='.3f', 
                   xticklabels=feature_names, yticklabels=feature_names,
                   cmap='coolwarm', center=0)
        plt.title('SHAP特征交互强度矩阵')
        plt.tight_layout()
        plt.show()

# 演示SHAP分析
if SHAP_AVAILABLE:
    print("=== SHAP分析 ===")
    shap_analyzer = SHAPAnalysis()
    tree_explainer, tree_shap_values = shap_analyzer.demonstrate_shap_tree_explainer()
    linear_explainer, linear_shap_values = shap_analyzer.demonstrate_shap_linear_explainer()
    interaction_values = shap_analyzer.demonstrate_shap_interaction()
else:
    print("跳过SHAP演示,请安装shap库")

9.4 LIME (Local Interpretable Model-agnostic Explanations)

9.4.1 LIME原理与应用

LIME在待解释样本的邻域内生成扰动样本,用黑盒模型对这些样本打分,再按与原样本的距离加权拟合一个简单的可解释模型(通常是稀疏线性模型),以局部线性近似来解释任意模型的单个预测。
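
LIME的核心流程可以用几行代码勾勒出来(仅作原理示意:高斯扰动的幅度和指数核带宽都是此处随意选定的假设,lime库的实际实现还包含特征离散化、采样策略等更多细节):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_s = StandardScaler().fit_transform(X)
black_box = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_s, y)

x0 = X_s[0]  # 待解释样本
rng = np.random.RandomState(42)
Z = x0 + rng.normal(scale=0.5, size=(500, X_s.shape[1]))  # 在x0附近生成扰动样本

p = black_box.predict_proba(Z)[:, 1]  # 黑盒模型对扰动样本的正类概率
dist = np.linalg.norm(Z - x0, axis=1)
weights = np.exp(-dist ** 2 / 25.0)   # 指数核:离x0越近的扰动点权重越大

surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=weights)  # 加权线性代理模型
top = np.argsort(np.abs(surrogate.coef_))[-5:][::-1]
print('局部最重要的5个特征索引:', top)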

# 注意:需要安装lime库
# pip install lime

try:
    import lime
    import lime.lime_tabular
    LIME_AVAILABLE = True
except ImportError:
    LIME_AVAILABLE = False
    print("警告: LIME库未安装,将跳过LIME相关演示")
    print("请运行: pip install lime")

class LIMEAnalysis:
    def __init__(self):
        self.explainers = {}
        
    def demonstrate_lime_tabular(self):
        """演示LIME表格数据解释"""
        if not LIME_AVAILABLE:
            print("LIME库未安装,跳过演示")
            return None
            
        print("=== LIME表格数据解释演示 ===")
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        feature_names = data.feature_names
        class_names = data.target_names
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练随机森林
        rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_model.fit(X_train, y_train)
        
        print(f"模型准确率: {rf_model.score(X_test, y_test):.3f}")
        
        # 创建LIME解释器
        explainer = lime.lime_tabular.LimeTabularExplainer(
            X_train,
            feature_names=feature_names,
            class_names=class_names,
            mode='classification'
        )
        
        # 解释几个预测
        explanations = []
        for i in range(3):
            exp = explainer.explain_instance(X_test[i], rf_model.predict_proba, num_features=10)
            explanations.append(exp)
            
            print(f"\n样本 {i+1} 预测解释:")
            print(f"真实标签: {class_names[y_test[i]]}")
            print(f"预测标签: {class_names[rf_model.predict([X_test[i]])[0]]}")
            print(f"预测概率: {rf_model.predict_proba([X_test[i]])[0]}")
            
            # 显示特征重要性
            feature_importance = exp.as_list()
            print("Top 5 重要特征:")
            for feature, importance in feature_importance[:5]:
                print(f"  {feature}: {importance:.4f}")
        
        # 可视化LIME结果
        self.visualize_lime_results(explanations, X_test[:3], y_test[:3], class_names, rf_model)
        
        return explainer, explanations
        
    def demonstrate_lime_comparison(self):
        """演示LIME与其他解释方法的比较"""
        if not LIME_AVAILABLE:
            print("LIME库未安装,跳过演示")
            return None
            
        print("\n=== LIME与其他方法比较 ===")
        
        # 使用葡萄酒数据集
        data = load_wine()
        X, y = data.data, data.target
        feature_names = data.feature_names
        class_names = data.target_names
        
        # 标准化特征
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
        
        # 训练不同模型
        models = {
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42),
            '逻辑回归': LogisticRegression(random_state=42, max_iter=1000)
        }
        
        lime_results = {}
        
        for model_name, model in models.items():
            print(f"\n分析 {model_name}...")
            model.fit(X_train, y_train)
            
            # 创建LIME解释器
            explainer = lime.lime_tabular.LimeTabularExplainer(
                X_train,
                feature_names=feature_names,
                class_names=class_names,
                mode='classification'
            )
            
            # 解释一个样本
            sample_idx = 0
            exp = explainer.explain_instance(X_test[sample_idx], model.predict_proba, num_features=len(feature_names))
            
            lime_results[model_name] = {
                'explanation': exp,
                'model': model,
                'accuracy': model.score(X_test, y_test)
            }
            
            print(f"准确率: {model.score(X_test, y_test):.3f}")
        
        # 比较不同模型的LIME解释
        self.compare_lime_explanations(lime_results, X_test[0], y_test[0], feature_names, class_names)
        
        return lime_results
        
    def visualize_lime_results(self, explanations, X_samples, y_true, class_names, model):
        """可视化LIME结果"""
        if not LIME_AVAILABLE:
            return
            
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        for i, (exp, sample, true_label) in enumerate(zip(explanations, X_samples, y_true)):
            # 获取特征重要性
            feature_importance = exp.as_list()
            
            # 分离特征名和重要性值
            features = [item[0] for item in feature_importance[:10]]
            importances = [item[1] for item in feature_importance[:10]]
            
            # 创建颜色(正值为蓝色,负值为红色)
            colors = ['blue' if imp > 0 else 'red' for imp in importances]
            
            # 绘制水平条形图
            y_pos = np.arange(len(features))
            axes[i].barh(y_pos, importances, color=colors, alpha=0.7)
            axes[i].set_yticks(y_pos)
            axes[i].set_yticklabels([f.split('<=')[0].strip() for f in features], fontsize=8)
            axes[i].set_xlabel('LIME重要性')
            
            # 预测信息
            pred_proba = model.predict_proba([sample])[0]
            pred_class = model.predict([sample])[0]
            
            axes[i].set_title(f'样本 {i+1}\n真实: {class_names[true_label]}\n'
                            f'预测: {class_names[pred_class]} ({pred_proba[pred_class]:.3f})')
            axes[i].grid(True, alpha=0.3)
            axes[i].axvline(x=0, color='black', linestyle='-', alpha=0.5)
        
        plt.tight_layout()
        plt.show()
        
    def compare_lime_explanations(self, lime_results, sample, true_label, feature_names, class_names):
        """比较不同模型的LIME解释"""
        if not LIME_AVAILABLE:
            return
            
        fig, axes = plt.subplots(1, 2, figsize=(16, 8))
        
        model_names = list(lime_results.keys())
        
        for i, model_name in enumerate(model_names):
            exp = lime_results[model_name]['explanation']
            model = lime_results[model_name]['model']
            
            # 获取特征重要性
            feature_importance = exp.as_list()
            
            # 分离特征名和重要性值
            features = [item[0] for item in feature_importance[:10]]
            importances = [item[1] for item in feature_importance[:10]]
            
            # 创建颜色
            colors = ['blue' if imp > 0 else 'red' for imp in importances]
            
            # 绘制条形图
            y_pos = np.arange(len(features))
            axes[i].barh(y_pos, importances, color=colors, alpha=0.7)
            axes[i].set_yticks(y_pos)
            axes[i].set_yticklabels([f.split('<=')[0].strip() for f in features], fontsize=10)
            axes[i].set_xlabel('LIME重要性')
            
            # 预测信息
            pred_proba = model.predict_proba([sample])[0]
            pred_class = model.predict([sample])[0]
            
            axes[i].set_title(f'{model_name}\n预测: {class_names[pred_class]} ({pred_proba[pred_class]:.3f})\n'
                            f'准确率: {lime_results[model_name]["accuracy"]:.3f}')
            axes[i].grid(True, alpha=0.3)
            axes[i].axvline(x=0, color='black', linestyle='-', alpha=0.5)
        
        plt.suptitle(f'不同模型的LIME解释比较\n真实标签: {class_names[true_label]}', fontsize=14)
        plt.tight_layout()
        plt.show()

# 演示LIME分析
if LIME_AVAILABLE:
    print("=== LIME分析 ===")
    lime_analyzer = LIMEAnalysis()
    lime_explainer, lime_explanations = lime_analyzer.demonstrate_lime_tabular()
    lime_comparison = lime_analyzer.demonstrate_lime_comparison()
else:
    print("跳过LIME演示,请安装lime库")

9.5 模型可视化技术

9.5.1 决策边界可视化

决策边界可视化把分类器在二维特征空间中划分出的决策区域直接画出来,便于比较不同模型对同一份数据的拟合方式与复杂度。

class ModelVisualization:
    def __init__(self):
        self.visualization_methods = {}
        
    def demonstrate_decision_boundary(self):
        """演示决策边界可视化"""
        print("=== 决策边界可视化演示 ===")
        
        # 创建2D分类数据集
        X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, 
                                 n_informative=2, n_clusters_per_class=1, random_state=42)
        
        # 训练不同的分类器
        classifiers = {
            '逻辑回归': LogisticRegression(random_state=42),
            '决策树': DecisionTreeClassifier(max_depth=5, random_state=42),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        # 可视化决策边界
        self.plot_decision_boundaries(X, y, classifiers)
        
        return X, y, classifiers
        
    def plot_decision_boundaries(self, X, y, classifiers):
        """绘制决策边界"""
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        # 创建网格点
        h = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                           np.arange(y_min, y_max, h))
        
        for i, (name, clf) in enumerate(classifiers.items()):
            # 训练分类器
            clf.fit(X, y)
            
            # 预测网格点
            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            
            # 绘制决策边界
            axes[i].contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdYlBu)
            
            # 绘制数据点
            scatter = axes[i].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='black')
            
            axes[i].set_title(f'{name}\n准确率: {clf.score(X, y):.3f}')
            axes[i].set_xlabel('特征 1')
            axes[i].set_ylabel('特征 2')
            axes[i].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def demonstrate_learning_curves(self):
        """演示学习曲线可视化"""
        print("\n=== 学习曲线可视化演示 ===")
        
        from sklearn.model_selection import learning_curve
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        
        # 不同复杂度的模型
        models = {
            '决策树(深度3)': DecisionTreeClassifier(max_depth=3, random_state=42),
            '决策树(深度10)': DecisionTreeClassifier(max_depth=10, random_state=42),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        # 绘制学习曲线
        self.plot_learning_curves(X, y, models)
        
    def plot_learning_curves(self, X, y, models):
        """绘制学习曲线"""
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        train_sizes = np.linspace(0.1, 1.0, 10)
        
        for i, (name, model) in enumerate(models.items()):
            # 计算学习曲线
            train_sizes_abs, train_scores, val_scores = learning_curve(
                model, X, y, train_sizes=train_sizes, cv=5, random_state=42
            )
            
            # 计算均值和标准差
            train_mean = np.mean(train_scores, axis=1)
            train_std = np.std(train_scores, axis=1)
            val_mean = np.mean(val_scores, axis=1)
            val_std = np.std(val_scores, axis=1)
            
            # 绘制学习曲线
            axes[i].plot(train_sizes_abs, train_mean, 'o-', color='blue', label='训练分数')
            axes[i].fill_between(train_sizes_abs, train_mean - train_std, 
                               train_mean + train_std, alpha=0.1, color='blue')
            
            axes[i].plot(train_sizes_abs, val_mean, 'o-', color='red', label='验证分数')
            axes[i].fill_between(train_sizes_abs, val_mean - val_std, 
                               val_mean + val_std, alpha=0.1, color='red')
            
            axes[i].set_title(f'{name}学习曲线')
            axes[i].set_xlabel('训练样本数')
            axes[i].set_ylabel('准确率')
            axes[i].legend()
            axes[i].grid(True, alpha=0.3)
            axes[i].set_ylim(0.8, 1.02)
        
        plt.tight_layout()
        plt.show()
        
    def demonstrate_validation_curves(self):
        """演示验证曲线可视化"""
        print("\n=== 验证曲线可视化演示 ===")
        
        from sklearn.model_selection import validation_curve
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        
        # 不同参数的验证曲线
        param_configs = [
            {
                'model': DecisionTreeClassifier(random_state=42),
                'param_name': 'max_depth',
                'param_range': range(1, 21),
                'title': '决策树最大深度'
            },
            {
                'model': RandomForestClassifier(random_state=42),
                'param_name': 'n_estimators',
                'param_range': [10, 50, 100, 200, 300],
                'title': '随机森林树的数量'
            }
        ]
        
        # 绘制验证曲线
        self.plot_validation_curves(X, y, param_configs)
        
    def plot_validation_curves(self, X, y, param_configs):
        """绘制验证曲线"""
        fig, axes = plt.subplots(1, 2, figsize=(16, 6))
        
        for i, config in enumerate(param_configs):
            # 计算验证曲线
            # validation_curve 没有 random_state 参数,随机性由模型自身的 random_state 控制
            train_scores, val_scores = validation_curve(
                config['model'], X, y,
                param_name=config['param_name'],
                param_range=config['param_range'],
                cv=5
            )
            
            # 计算均值和标准差
            train_mean = np.mean(train_scores, axis=1)
            train_std = np.std(train_scores, axis=1)
            val_mean = np.mean(val_scores, axis=1)
            val_std = np.std(val_scores, axis=1)
            
            # 绘制验证曲线
            axes[i].plot(config['param_range'], train_mean, 'o-', color='blue', label='训练分数')
            axes[i].fill_between(config['param_range'], train_mean - train_std, 
                               train_mean + train_std, alpha=0.1, color='blue')
            
            axes[i].plot(config['param_range'], val_mean, 'o-', color='red', label='验证分数')
            axes[i].fill_between(config['param_range'], val_mean - val_std, 
                               val_mean + val_std, alpha=0.1, color='red')
            
            axes[i].set_title(f'{config["title"]}验证曲线')
            axes[i].set_xlabel(config['param_name'])
            axes[i].set_ylabel('准确率')
            axes[i].legend()
            axes[i].grid(True, alpha=0.3)
            axes[i].set_ylim(0.8, 1.02)
        
        plt.tight_layout()
        plt.show()
        
    def demonstrate_confusion_matrix_visualization(self):
        """演示混淆矩阵可视化"""
        print("\n=== 混淆矩阵可视化演示 ===")
        
        # 使用乳腺癌数据集
        data = load_breast_cancer()
        X, y = data.data, data.target
        class_names = data.target_names
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练不同模型
        models = {
            '逻辑回归': LogisticRegression(random_state=42, max_iter=1000),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        # 可视化混淆矩阵
        self.plot_confusion_matrices(models, X_train, X_test, y_train, y_test, class_names)
        
    def plot_confusion_matrices(self, models, X_train, X_test, y_train, y_test, class_names):
        """绘制混淆矩阵"""
        fig, axes = plt.subplots(1, 2, figsize=(12, 5))
        
        for i, (name, model) in enumerate(models.items()):
            # 训练模型
            model.fit(X_train, y_train)
            
            # 预测
            y_pred = model.predict(X_test)
            
            # 计算混淆矩阵
            cm = confusion_matrix(y_test, y_pred)
            
            # 绘制混淆矩阵
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                       xticklabels=class_names, yticklabels=class_names, ax=axes[i])
            
            axes[i].set_title(f'{name}\n准确率: {accuracy_score(y_test, y_pred):.3f}')
            axes[i].set_xlabel('预测标签')
            axes[i].set_ylabel('真实标签')
        
        plt.tight_layout()
        plt.show()

# 演示模型可视化
print("=== 模型可视化技术 ===")
visualizer = ModelVisualization()
X_2d, y_2d, boundary_classifiers = visualizer.demonstrate_decision_boundary()
visualizer.demonstrate_learning_curves()
visualizer.demonstrate_validation_curves()
visualizer.demonstrate_confusion_matrix_visualization()

9.6 综合案例:信用评分模型解释

9.6.1 项目背景与目标

在这个综合案例中,我们将构建一个信用评分模型,并使用多种解释技术来理解模型的决策过程。

class CreditScoringInterpretation:
    def __init__(self):
        self.models = {}
        self.explanations = {}
        
    def create_credit_dataset(self):
        """创建信用评分数据集"""
        print("=== 创建信用评分数据集 ===")
        
        np.random.seed(42)
        n_samples = 2000
        
        # 基础特征
        data = {
            # 个人信息
            'age': np.random.normal(40, 12, n_samples),
            'income': np.random.lognormal(10, 0.5, n_samples),
            'employment_years': np.random.exponential(5, n_samples),
            
            # 信贷历史
            'credit_history_length': np.random.exponential(8, n_samples),
            'num_credit_accounts': np.random.poisson(3, n_samples),
            'credit_utilization': np.random.beta(2, 5, n_samples),
            
            # 贷款信息
            'loan_amount': np.random.lognormal(9, 0.8, n_samples),
            'loan_term': np.random.choice([12, 24, 36, 48, 60], n_samples),
            
            # 类别特征
            'education': np.random.choice(['高中', '本科', '硕士', '博士'], n_samples, 
                                       p=[0.3, 0.5, 0.15, 0.05]),
            'employment_type': np.random.choice(['全职', '兼职', '自雇', '失业'], n_samples,
                                             p=[0.7, 0.15, 0.1, 0.05]),
            'housing_status': np.random.choice(['自有', '租赁', '其他'], n_samples,
                                             p=[0.6, 0.35, 0.05])
        }
        
        # 创建DataFrame
        df = pd.DataFrame(data)
        
        # 数据清理
        df['age'] = np.clip(df['age'], 18, 80)
        df['income'] = np.clip(df['income'], 20000, 200000)
        df['employment_years'] = np.clip(df['employment_years'], 0, 40)
        df['credit_history_length'] = np.clip(df['credit_history_length'], 0, 30)
        df['num_credit_accounts'] = np.clip(df['num_credit_accounts'], 0, 10)
        df['credit_utilization'] = np.clip(df['credit_utilization'], 0, 1)
        df['loan_amount'] = np.clip(df['loan_amount'], 5000, 100000)
        
        # 生成目标变量(违约概率)
        # 基础违约概率
        base_default_prob = 0.1
        
        # 年龄影响(年轻人和老年人风险较高)
        age_effect = np.where((df['age'] < 25) | (df['age'] > 65), 0.05, -0.02)
        
        # 收入影响(收入越高,违约风险越低)
        income_effect = -0.1 * (df['income'] - df['income'].mean()) / df['income'].std()
        
        # 信用利用率影响(利用率越高,风险越高)
        utilization_effect = 0.2 * df['credit_utilization']
        
        # 教育影响
        education_effect = df['education'].map({
            '高中': 0.03, '本科': 0, '硕士': -0.02, '博士': -0.03
        })
        
        # 就业状态影响
        employment_effect = df['employment_type'].map({
            '全职': -0.02, '兼职': 0.02, '自雇': 0.01, '失业': 0.15
        })
        
        # 计算违约概率
        default_prob = (base_default_prob + age_effect + income_effect + 
                       utilization_effect + education_effect + employment_effect)
        default_prob = np.clip(default_prob, 0.01, 0.8)
        
        # 生成违约标签
        df['default'] = np.random.binomial(1, default_prob)
        
        print(f"数据集形状: {df.shape}")
        print(f"违约率: {df['default'].mean():.3f}")
        
        # 可视化数据集
        self.visualize_credit_dataset(df)
        
        return df
        
    def comprehensive_model_interpretation(self, df):
        """综合模型解释分析"""
        print("\n=== 综合模型解释分析 ===")
        
        # 特征工程
        df_processed = self.preprocess_credit_data(df)
        
        # 准备数据
        feature_columns = [col for col in df_processed.columns if col != 'default']
        X = df_processed[feature_columns]
        y = df_processed['default']
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 训练模型
        models = {
            '逻辑回归': LogisticRegression(random_state=42, max_iter=1000),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        interpretation_results = {}
        
        for name, model in models.items():
            print(f"\n分析 {name}...")
            model.fit(X_train, y_train)
            
            # 模型性能
            train_score = model.score(X_train, y_train)
            test_score = model.score(X_test, y_test)
            
            print(f"训练准确率: {train_score:.3f}")
            print(f"测试准确率: {test_score:.3f}")
            
            # 特征重要性分析
            if hasattr(model, 'feature_importances_'):
                feature_importance = model.feature_importances_
            elif hasattr(model, 'coef_'):
                feature_importance = np.abs(model.coef_[0])
            else:
                feature_importance = None
            
            interpretation_results[name] = {
                'model': model,
                'train_score': train_score,
                'test_score': test_score,
                'feature_importance': feature_importance,
                'feature_names': feature_columns
            }
        
        # 可视化模型解释结果
        self.visualize_comprehensive_interpretation(interpretation_results, X_test, y_test)
        
        return interpretation_results, X_test, y_test
        
    def preprocess_credit_data(self, df):
        """预处理信用数据"""
        df_processed = df.copy()
        
        # 类别特征编码
        categorical_features = ['education', 'employment_type', 'housing_status']
        df_encoded = pd.get_dummies(df_processed, columns=categorical_features, prefix=categorical_features)
        
        # 特征缩放
        numerical_features = ['age', 'income', 'employment_years', 'credit_history_length', 
                            'num_credit_accounts', 'credit_utilization', 'loan_amount', 'loan_term']
        
        scaler = StandardScaler()
        df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
        
        return df_encoded
        
    def visualize_credit_dataset(self, df):
        """可视化信用数据集"""
        fig, axes = plt.subplots(3, 3, figsize=(18, 15))
        
        # 1. 违约率分布
        default_rate = df['default'].value_counts()
        axes[0, 0].pie(default_rate.values, labels=['正常', '违约'], autopct='%1.1f%%', 
                      colors=['lightgreen', 'lightcoral'])
        axes[0, 0].set_title('违约率分布')
        
        # 2. 年龄vs违约率
        age_bins = pd.cut(df['age'], bins=5)
        age_default_rate = df.groupby(age_bins)['default'].mean()
        axes[0, 1].bar(range(len(age_default_rate)), age_default_rate.values, alpha=0.7)
        axes[0, 1].set_title('年龄段vs违约率')
        axes[0, 1].set_xlabel('年龄段')
        axes[0, 1].set_ylabel('违约率')
        axes[0, 1].set_xticks(range(len(age_default_rate)))
        axes[0, 1].set_xticklabels([f'{int(interval.left)}-{int(interval.right)}' 
                                   for interval in age_default_rate.index], rotation=45)
        axes[0, 1].grid(True, alpha=0.3)
        
        # 3. 收入vs违约率
        income_bins = pd.qcut(df['income'], q=5)
        income_default_rate = df.groupby(income_bins)['default'].mean()
        axes[0, 2].bar(range(len(income_default_rate)), income_default_rate.values, alpha=0.7)
        axes[0, 2].set_title('收入段vs违约率')
        axes[0, 2].set_xlabel('收入段')
        axes[0, 2].set_ylabel('违约率')
        axes[0, 2].set_xticks(range(len(income_default_rate)))
        axes[0, 2].set_xticklabels([f'{int(interval.left/1000)}k-{int(interval.right/1000)}k' 
                                   for interval in income_default_rate.index], rotation=45)
        axes[0, 2].grid(True, alpha=0.3)
        
        # 4. 教育水平vs违约率
        education_default_rate = df.groupby('education')['default'].mean().sort_values(ascending=False)
        axes[1, 0].bar(range(len(education_default_rate)), education_default_rate.values, alpha=0.7)
        axes[1, 0].set_title('教育水平vs违约率')
        axes[1, 0].set_xlabel('教育水平')
        axes[1, 0].set_ylabel('违约率')
        axes[1, 0].set_xticks(range(len(education_default_rate)))
        axes[1, 0].set_xticklabels(education_default_rate.index, rotation=45)
        axes[1, 0].grid(True, alpha=0.3)
        
        # 5. 就业状态vs违约率
        employment_default_rate = df.groupby('employment_type')['default'].mean().sort_values(ascending=False)
        axes[1, 1].bar(range(len(employment_default_rate)), employment_default_rate.values, alpha=0.7)
        axes[1, 1].set_title('就业状态vs违约率')
        axes[1, 1].set_xlabel('就业状态')
        axes[1, 1].set_ylabel('违约率')
        axes[1, 1].set_xticks(range(len(employment_default_rate)))
        axes[1, 1].set_xticklabels(employment_default_rate.index, rotation=45)
        axes[1, 1].grid(True, alpha=0.3)
        
        # 6. 信用利用率vs违约率
        utilization_bins = pd.cut(df['credit_utilization'], bins=5)
        utilization_default_rate = df.groupby(utilization_bins)['default'].mean()
        axes[1, 2].bar(range(len(utilization_default_rate)), utilization_default_rate.values, alpha=0.7)
        axes[1, 2].set_title('信用利用率vs违约率')
        axes[1, 2].set_xlabel('信用利用率')
        axes[1, 2].set_ylabel('违约率')
        axes[1, 2].set_xticks(range(len(utilization_default_rate)))
        axes[1, 2].set_xticklabels([f'{interval.left:.1f}-{interval.right:.1f}' 
                                   for interval in utilization_default_rate.index], rotation=45)
        axes[1, 2].grid(True, alpha=0.3)
        
        # 7. 相关性热图
        numerical_features = ['age', 'income', 'employment_years', 'credit_history_length', 
                            'num_credit_accounts', 'credit_utilization', 'loan_amount', 'default']
        corr_matrix = df[numerical_features].corr()
        
        im = axes[2, 0].imshow(corr_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
        axes[2, 0].set_xticks(range(len(corr_matrix.columns)))
        axes[2, 0].set_yticks(range(len(corr_matrix.columns)))
        axes[2, 0].set_xticklabels([col.replace('_', '\n') for col in corr_matrix.columns], rotation=45)
        axes[2, 0].set_yticklabels([col.replace('_', '\n') for col in corr_matrix.columns])
        axes[2, 0].set_title('特征相关性')
        
        # 8. 贷款金额vs违约率
        loan_bins = pd.qcut(df['loan_amount'], q=5)
        loan_default_rate = df.groupby(loan_bins)['default'].mean()
        axes[2, 1].bar(range(len(loan_default_rate)), loan_default_rate.values, alpha=0.7)
        axes[2, 1].set_title('贷款金额vs违约率')
        axes[2, 1].set_xlabel('贷款金额')
        axes[2, 1].set_ylabel('违约率')
        axes[2, 1].set_xticks(range(len(loan_default_rate)))
        axes[2, 1].set_xticklabels([f'{int(interval.left/1000)}k-{int(interval.right/1000)}k' 
                                   for interval in loan_default_rate.index], rotation=45)
        axes[2, 1].grid(True, alpha=0.3)
        
        # 9. 特征分布对比(以age为例)
        feature = 'age'
        for default_status in [0, 1]:
            subset = df[df['default'] == default_status][feature]
            axes[2, 2].hist(subset, alpha=0.5, label=f'违约={default_status}', bins=20)
        
        axes[2, 2].set_title(f'{feature}分布对比')
        axes[2, 2].set_xlabel(feature)
        axes[2, 2].set_ylabel('频率')
        axes[2, 2].legend()
        axes[2, 2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def visualize_comprehensive_interpretation(self, interpretation_results, X_test, y_test):
        """可视化综合解释结果"""
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        model_names = list(interpretation_results.keys())
        
        # 1. 模型性能比较
        train_scores = [interpretation_results[name]['train_score'] for name in model_names]
        test_scores = [interpretation_results[name]['test_score'] for name in model_names]
        
        x = np.arange(len(model_names))
        width = 0.35
        
        axes[0, 0].bar(x - width/2, train_scores, width, label='训练准确率', alpha=0.7)
        axes[0, 0].bar(x + width/2, test_scores, width, label='测试准确率', alpha=0.7)
        axes[0, 0].set_title('模型性能比较')
        axes[0, 0].set_xlabel('模型')
        axes[0, 0].set_ylabel('准确率')
        axes[0, 0].set_xticks(x)
        axes[0, 0].set_xticklabels(model_names)
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # 2. 特征重要性比较
        for i, model_name in enumerate(model_names):
            if interpretation_results[model_name]['feature_importance'] is not None:
                importance = interpretation_results[model_name]['feature_importance']
                feature_names = interpretation_results[model_name]['feature_names']
                
                # 选择Top 10特征
                top_indices = np.argsort(importance)[-10:]
                top_importance = importance[top_indices]
                top_features = [feature_names[idx] for idx in top_indices]
                
                y_pos = np.arange(len(top_features))
                axes[0, i+1].barh(y_pos, top_importance, alpha=0.7)
                axes[0, i+1].set_yticks(y_pos)
                axes[0, i+1].set_yticklabels([f.replace('_', '\n') for f in top_features], fontsize=8)
                axes[0, i+1].set_xlabel('重要性')
                axes[0, i+1].set_title(f'{model_name}\nTop 10特征重要性')
                axes[0, i+1].grid(True, alpha=0.3)
        
        # 3. 混淆矩阵
        for i, model_name in enumerate(model_names):
            model = interpretation_results[model_name]['model']
            y_pred = model.predict(X_test)
            cm = confusion_matrix(y_test, y_pred)
            
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                       xticklabels=['正常', '违约'], yticklabels=['正常', '违约'], 
                       ax=axes[1, i])
            axes[1, i].set_title(f'{model_name}混淆矩阵')
            axes[1, i].set_xlabel('预测标签')
            axes[1, i].set_ylabel('真实标签')
        
        # 4. ROC曲线比较
        from sklearn.metrics import roc_curve, auc
        
        for model_name in model_names:
            model = interpretation_results[model_name]['model']
            y_proba = model.predict_proba(X_test)[:, 1]
            fpr, tpr, _ = roc_curve(y_test, y_proba)
            roc_auc = auc(fpr, tpr)
            
            axes[1, 2].plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.3f})')
        
        axes[1, 2].plot([0, 1], [0, 1], 'k--', alpha=0.5)
        axes[1, 2].set_xlabel('假正率')
        axes[1, 2].set_ylabel('真正率')
        axes[1, 2].set_title('ROC曲线比较')
        axes[1, 2].legend()
        axes[1, 2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# 演示综合案例
print("=== 信用评分模型解释综合案例 ===")
credit_interpreter = CreditScoringInterpretation()
credit_data = credit_interpreter.create_credit_dataset()
interpretation_results, X_test, y_test = credit_interpreter.comprehensive_model_interpretation(credit_data)

9.7 本章小结

9.7.1 模型解释核心概念回顾

本章我们深入学习了模型解释与可视化的各个方面:

  1. 模型解释概述

    • 模型可解释性的重要性
    • 可解释性与复杂度的权衡
    • 不同解释方法的适用场景
  2. 特征重要性分析

    • 基于树的特征重要性
    • 线性模型特征重要性
    • 排列重要性
    • 不同方法的比较
  3. SHAP分析

    • SHAP的基本原理
    • 树解释器和线性解释器
    • SHAP可视化技术
    • 特征交互效应分析
  4. LIME分析

    • LIME的局部解释原理
    • 表格数据解释
    • 不同模型的LIME比较
  5. 模型可视化技术

    • 决策边界可视化
    • 学习曲线和验证曲线
    • 混淆矩阵可视化
  6. 综合案例

    • 信用评分模型解释实战

9.7.2 模型解释最佳实践

  1. 选择合适的解释方法

    • 根据模型类型选择解释技术
    • 考虑解释的目标受众
    • 平衡解释的准确性和可理解性
  2. 多角度解释

    • 结合全局和局部解释
    • 使用多种解释方法验证结果
    • 关注特征交互效应
  3. 可视化设计

    • 选择合适的图表类型
    • 确保可视化的清晰性
    • 添加必要的说明和标注

9.7.3 常见陷阱与注意事项

  1. 解释方法的局限性

    • 理解每种方法的假设和限制
    • 避免过度解释模型行为
    • 注意解释的稳定性
  2. 可视化误区

    • 避免误导性的图表设计
    • 确保比例和尺度的准确性
    • 考虑色彩和布局的影响
  3. 业务理解

    • 将技术解释转化为业务洞察
    • 考虑解释的实际应用价值
    • 与业务专家协作验证解释

9.7.4 进阶学习方向

  1. 高级解释技术

    • 学习更多解释方法(如Anchors、Counterfactual Explanations)
    • 研究深度学习模型解释
    • 探索因果推理在模型解释中的应用
  2. 自动化解释

    • 开发自动化解释流程
    • 构建解释报告生成系统
    • 研究解释质量评估方法
  3. 领域特定解释

    • 医疗AI的可解释性
    • 金融风控模型解释
    • 自动驾驶决策解释

9.7.5 练习题

理论题

  1. 解释SHAP和LIME的主要区别,以及它们各自的优缺点。
  2. 为什么在某些行业(如医疗、金融)中模型可解释性特别重要?
  3. 如何评估模型解释的质量和可靠性?

实践题

  1. 使用本章学到的技术,对一个复杂模型进行全面的解释分析。
  2. 比较不同特征重要性方法在同一数据集上的结果差异。
  3. 设计并实现一个模型解释的可视化仪表板。

项目题

  1. 选择一个实际的业务问题,构建模型并提供完整的解释报告。
  2. 研究某个特定领域的模型解释需求,开发相应的解释工具。
  3. 实现一个自动化的模型解释流程,包括多种解释方法的集成。

第9章完结

模型解释与可视化是现代机器学习不可或缺的重要组成部分。通过本章的学习,你已经掌握了从基础的特征重要性分析到高级的SHAP和LIME解释技术的完整技能体系。记住,好的模型解释不仅能帮助我们理解模型的工作原理,更能建立用户对模型的信任,并为模型的改进提供有价值的洞察。在实际项目中,要根据具体的业务需求和目标受众,选择合适的解释方法和可视化技术。