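6.1 Overview of Model Evaluation

6.1.1 Why Model Evaluation Is Needed

Model evaluation is a key stage of any machine learning project. It helps us:
- estimate how well a model generalizes to unseen data
- compare the performance of different algorithms
- select the best hyperparameters
- detect overfitting and underfitting
- provide a reliable basis for deploying the model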

6.1.2 Basic Principles of Evaluation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load_boston was removed in scikit-learn 1.2 and is not used below, so it is not imported
from sklearn.datasets import make_classification, make_regression, load_iris
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold, KFold, LeaveOneOut, validation_curve, learning_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, precision_recall_curve
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Configure a font that can render Chinese plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

class ModelEvaluationBasics:
    def __init__(self):
        self.models = {}
        self.results = {}
        
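    def evaluation_principles(self):
        """Basic principles of model evaluation"""
        print("=== Basic Principles of Model Evaluation ===")
        print("1. Independence:")
        print("   - The training and test sets must be independent")
        print("   - Avoid data leakage")
        print("   - Respect temporal order for time series data")
        print("2. Representativeness:")
        print("   - The test set should reflect the real application scenario")
        print("   - Keep the data distribution consistent")
        print("   - Account for class imbalance")
        print("3. Stability:")
        print("   - Use cross-validation to assess stability")
        print("   - Average over multiple runs")
        print("   - Report confidence intervals")
        print("4. Comprehensiveness:")
        print("   - Use several evaluation metrics")
        print("   - Keep the business objective in mind")
        print("   - Analyze the misclassified cases")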
        
    def create_sample_datasets(self):
        """创建示例数据集"""
        # 分类数据集
        X_clf, y_clf = make_classification(
            n_samples=1000, n_features=20, n_informative=10,
            n_redundant=5, n_classes=3, random_state=42
        )
        
        # 回归数据集
        X_reg, y_reg = make_regression(
            n_samples=1000, n_features=10, noise=0.1, random_state=42
        )
        
        print("=== 示例数据集信息 ===")
        print(f"分类数据集: {X_clf.shape}, 类别数: {len(np.unique(y_clf))}")
        print(f"回归数据集: {X_reg.shape}")
        
        return (X_clf, y_clf), (X_reg, y_reg)
        
    def train_test_split_demo(self):
        """训练测试集划分演示"""
        (X_clf, y_clf), (X_reg, y_reg) = self.create_sample_datasets()
        
        print("=== 训练测试集划分 ===")
        
        # 简单划分
        X_train, X_test, y_train, y_test = train_test_split(
            X_clf, y_clf, test_size=0.2, random_state=42
        )
        print(f"简单划分 - 训练集: {X_train.shape}, 测试集: {X_test.shape}")
        
        # 分层划分(保持类别比例)
        X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
            X_clf, y_clf, test_size=0.2, stratify=y_clf, random_state=42
        )
        
        # 比较类别分布
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        # 原始数据分布
        unique, counts = np.unique(y_clf, return_counts=True)
        axes[0].bar(unique, counts / len(y_clf))
        axes[0].set_title('原始数据类别分布')
        axes[0].set_xlabel('类别')
        axes[0].set_ylabel('比例')
        
        # 简单划分测试集分布
        unique, counts = np.unique(y_test, return_counts=True)
        axes[1].bar(unique, counts / len(y_test))
        axes[1].set_title('简单划分测试集分布')
        axes[1].set_xlabel('类别')
        axes[1].set_ylabel('比例')
        
        # 分层划分测试集分布
        unique, counts = np.unique(y_test_strat, return_counts=True)
        axes[2].bar(unique, counts / len(y_test_strat))
        axes[2].set_title('分层划分测试集分布')
        axes[2].set_xlabel('类别')
        axes[2].set_ylabel('比例')
        
        plt.tight_layout()
        plt.show()
        
        return (X_train_strat, X_test_strat, y_train_strat, y_test_strat)

# 演示模型评估基础
print("=== 模型评估基础演示 ===")
eval_basics = ModelEvaluationBasics()
eval_basics.evaluation_principles()
eval_basics.train_test_split_demo()
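
The independence principle above also applies to preprocessing: if a scaler is fitted on all of the data before cross-validation, every validation fold has already influenced the features. The following is a minimal sketch of one way to avoid this, assuming a scikit-learn `Pipeline` is acceptable for the workflow. The dataset sizes and the name `leak_free_model` are illustrative choices, not part of the original text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; sizes are arbitrary and only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaler and classifier form a single estimator, so cross_val_score
# re-fits the scaler on the training part of every fold only
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(leak_free_model, X, y, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because the scaler lives inside the estimator, the same object can later be passed to a hyperparameter search without changing how the folds are handled.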

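6.2 Cross-Validation

6.2.1 How Cross-Validation Works

Cross-validation is an important model evaluation technique: by splitting the data several times and averaging the results, it yields a more stable performance estimate than a single train/test split.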
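
To make the mechanics concrete, here is a small sketch that writes the K-fold loop by hand and compares it with `cross_val_score`. The data sizes and variable names (`manual_scores`, `library_scores`) are assumptions made for the illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: train on K-1 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    fold_pred = fold_model.predict(X[test_idx])
    manual_scores.append(accuracy_score(y[test_idx], fold_pred))
manual_scores = np.array(manual_scores)

# The library helper performs the same procedure in one call
library_scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print(f"manual : {manual_scores.mean():.4f} ± {manual_scores.std():.4f}")
print(f"library: {library_scores.mean():.4f} ± {library_scores.std():.4f}")
```

With a fixed `random_state` the splitter produces the same folds on every call, which is why the two sets of scores can be compared fold by fold.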

class CrossValidationDemo:
    def __init__(self):
        self.cv_methods = {}
        self.results = {}
        
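    def cv_theory_explanation(self):
        """Explain the theory behind cross-validation"""
        print("=== Cross-Validation Theory ===")
        print("1. K-fold cross-validation:")
        print("   - Split the data into K subsets (folds)")
        print("   - In turn, train on K-1 folds and test on the remaining one")
        print("   - Average the K results")
        print("2. Stratified K-fold cross-validation:")
        print("   - Keeps class proportions the same in every fold")
        print("   - Suited to imbalanced datasets")
        print("3. Leave-one-out cross-validation:")
        print("   - K equals the number of samples")
        print("   - Expensive to compute; low bias, but the estimate can have high variance")
        print("4. Time series cross-validation:")
        print("   - Preserves temporal order")
        print("   - Prevents leakage of future information")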
        
    def k_fold_demo(self):
        """K折交叉验证演示"""
        # 创建数据
        X, y = make_classification(n_samples=200, n_features=10, n_classes=2, random_state=42)
        
        # 不同的K值
        k_values = [3, 5, 10]
        models = {
            'Logistic Regression': LogisticRegression(random_state=42),
            'Decision Tree': DecisionTreeClassifier(random_state=42),
            'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42)
        }
        
        results = {}
        
        for k in k_values:
            results[f'{k}-Fold'] = {}
            kfold = KFold(n_splits=k, shuffle=True, random_state=42)
            
            for name, model in models.items():
                scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
                results[f'{k}-Fold'][name] = {
                    'mean': scores.mean(),
                    'std': scores.std(),
                    'scores': scores
                }
        
        # 可视化结果
        fig, axes = plt.subplots(1, len(k_values), figsize=(15, 5))
        
        for i, k in enumerate(k_values):
            model_names = list(models.keys())
            means = [results[f'{k}-Fold'][name]['mean'] for name in model_names]
            stds = [results[f'{k}-Fold'][name]['std'] for name in model_names]
            
            x_pos = np.arange(len(model_names))
            axes[i].bar(x_pos, means, yerr=stds, capsize=5, alpha=0.7)
            axes[i].set_title(f'{k}-Fold 交叉验证')
            axes[i].set_xlabel('模型')
            axes[i].set_ylabel('准确率')
            axes[i].set_xticks(x_pos)
            axes[i].set_xticklabels(model_names, rotation=45)
            axes[i].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # 打印详细结果
        print("=== K折交叉验证结果 ===")
        for k in k_values:
            print(f"\n{k}-Fold 交叉验证:")
            for name in models.keys():
                mean_score = results[f'{k}-Fold'][name]['mean']
                std_score = results[f'{k}-Fold'][name]['std']
                print(f"  {name}: {mean_score:.4f} ± {std_score:.4f}")
        
        return results
        
    def stratified_cv_demo(self):
        """分层交叉验证演示"""
        # 创建不平衡数据集
        X, y = make_classification(
            n_samples=1000, n_features=10, n_classes=3,
            weights=[0.7, 0.2, 0.1], random_state=42
        )
        
        print("=== 分层交叉验证演示 ===")
        print(f"原始数据类别分布: {np.bincount(y)}")
        
        # 普通K折 vs 分层K折
        cv_methods = {
            'K-Fold': KFold(n_splits=5, shuffle=True, random_state=42),
            'Stratified K-Fold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        }
        
        model = LogisticRegression(random_state=42)
        
        for cv_name, cv in cv_methods.items():
            scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
            print(f"{cv_name}: {scores.mean():.4f} ± {scores.std():.4f}")
            
            # 检查每个折的类别分布
            fold_distributions = []
            for train_idx, test_idx in cv.split(X, y):
                # minlength keeps the vector at 3 entries even if a fold misses a class
                test_distribution = np.bincount(y[test_idx], minlength=3)
                fold_distributions.append(test_distribution / len(test_idx))
            
            fold_distributions = np.array(fold_distributions)
            
            # 可视化类别分布
            plt.figure(figsize=(12, 4))
            
            plt.subplot(1, 2, 1)
            for i in range(len(fold_distributions)):
                plt.bar(range(3), fold_distributions[i], alpha=0.6, label=f'Fold {i+1}')
            plt.title(f'{cv_name} - 各折类别分布')
            plt.xlabel('类别')
            plt.ylabel('比例')
            plt.legend()
            
            plt.subplot(1, 2, 2)
            plt.boxplot([fold_distributions[:, i] for i in range(3)], labels=['类别0', '类别1', '类别2'])
            plt.title(f'{cv_name} - 类别分布变异性')
            plt.ylabel('比例')
            
            plt.tight_layout()
            plt.show()
        
    def leave_one_out_demo(self):
        """留一交叉验证演示"""
        # 使用小数据集演示
        X, y = make_classification(n_samples=50, n_features=5, n_classes=2, random_state=42)
        
        print("=== 留一交叉验证演示 ===")
        
        # 比较不同交叉验证方法
        cv_methods = {
            '5-Fold': KFold(n_splits=5, shuffle=True, random_state=42),
            '10-Fold': KFold(n_splits=10, shuffle=True, random_state=42),
            'Leave-One-Out': LeaveOneOut()
        }
        
        model = LogisticRegression(random_state=42)
        
        results = {}
        for cv_name, cv in cv_methods.items():
            scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
            results[cv_name] = {
                'mean': scores.mean(),
                'std': scores.std(),
                'n_splits': len(scores)
            }
            print(f"{cv_name}: {scores.mean():.4f} ± {scores.std():.4f} ({len(scores)} splits)")
        
        # 可视化比较
        methods = list(results.keys())
        means = [results[method]['mean'] for method in methods]
        stds = [results[method]['std'] for method in methods]
        
        plt.figure(figsize=(10, 6))
        x_pos = np.arange(len(methods))
        plt.bar(x_pos, means, yerr=stds, capsize=5, alpha=0.7)
        plt.title('不同交叉验证方法比较')
        plt.xlabel('交叉验证方法')
        plt.ylabel('准确率')
        plt.xticks(x_pos, methods)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
        
        return results
        
    def time_series_cv_demo(self):
        """时间序列交叉验证演示"""
        from sklearn.model_selection import TimeSeriesSplit
        
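        # Create time series data (seeded generator so the results are reproducible)
        n_samples = 100
        rng = np.random.default_rng(42)
        X = rng.standard_normal((n_samples, 5))
        # Add a time trend
        time_trend = np.linspace(0, 1, n_samples)
        y = X.sum(axis=1) + 2 * time_trend + rng.standard_normal(n_samples) * 0.1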
        
        print("=== 时间序列交叉验证演示 ===")
        
        # 时间序列分割
        tscv = TimeSeriesSplit(n_splits=5)
        
        # 可视化分割方式
        plt.figure(figsize=(12, 8))
        
        for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
            plt.subplot(3, 2, i+1)
            plt.plot(range(len(y)), y, 'b-', alpha=0.3, label='全部数据')
            plt.plot(train_idx, y[train_idx], 'g-', label='训练集')
            plt.plot(test_idx, y[test_idx], 'r-', label='测试集')
            plt.title(f'时间序列分割 {i+1}')
            plt.xlabel('时间')
            plt.ylabel('值')
            plt.legend()
            plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # 比较普通交叉验证和时间序列交叉验证
        model = Ridge(random_state=42)
        
        # 普通K折(错误的做法)
        kfold_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
        
        # 时间序列交叉验证(正确的做法)
        ts_scores = cross_val_score(model, X, y, cv=tscv, scoring='r2')
        
        print(f"普通K折交叉验证: {kfold_scores.mean():.4f} ± {kfold_scores.std():.4f}")
        print(f"时间序列交叉验证: {ts_scores.mean():.4f} ± {ts_scores.std():.4f}")
        
        return kfold_scores, ts_scores

# 演示交叉验证
print("=== 交叉验证演示 ===")
cv_demo = CrossValidationDemo()
cv_demo.cv_theory_explanation()
cv_demo.k_fold_demo()
cv_demo.stratified_cv_demo()
cv_demo.leave_one_out_demo()
cv_demo.time_series_cv_demo()
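
The key property of time series cross-validation is that every training index precedes every test index. The sketch below prints the index sets produced by `TimeSeriesSplit` on a tiny array so the expanding-window pattern is visible. The eight-sample array is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight observations in chronological order
X = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train_idx.tolist(), test_idx.tolist())
          for train_idx, test_idx in tscv.split(X)]

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx} test={test_idx}")

# The training window grows with each fold and never reaches into the future
no_future_leak = all(max(train) < min(test) for train, test in splits)
print("training always precedes testing:", no_future_leak)
```

A shuffled `KFold` on the same data would mix later observations into the training folds, which is the leakage this splitter is designed to prevent.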

6.3 Evaluation Metrics for Classification Models

6.3.1 Classification Metrics in Detail

class ClassificationMetricsDemo:
    def __init__(self):
        self.metrics = {}
        self.results = {}
        
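    def metrics_theory(self):
        """Theory behind classification evaluation metrics"""
        print("=== Classification Metrics: Theory ===")
        print("1. Confusion matrix:")
        print("   - TP: true positive (predicted positive, actually positive)")
        print("   - TN: true negative (predicted negative, actually negative)")
        print("   - FP: false positive (predicted positive, actually negative)")
        print("   - FN: false negative (predicted negative, actually positive)")
        print("2. Basic metrics:")
        print("   - Accuracy = (TP + TN) / (TP + TN + FP + FN)")
        print("   - Precision = TP / (TP + FP)")
        print("   - Recall = TP / (TP + FN)")
        print("   - F1 score = 2 * (precision * recall) / (precision + recall)")
        print("3. Advanced metrics:")
        print("   - ROC AUC: area under the receiver operating characteristic curve")
        print("   - PR AUC: area under the precision-recall curve")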
        
    def confusion_matrix_demo(self):
        """混淆矩阵演示"""
        # 创建数据
        X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, 
                                 n_informative=5, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                          stratify=y, random_state=42)
        
        # 训练模型
        models = {
            'Logistic Regression': LogisticRegression(random_state=42),
            'Decision Tree': DecisionTreeClassifier(random_state=42),
            'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
        }
        
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        
        for i, (name, model) in enumerate(models.items()):
            # 训练和预测
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            
            # 混淆矩阵
            cm = confusion_matrix(y_test, y_pred)
            
            # 可视化混淆矩阵
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
            axes[i].set_title(f'{name}\n混淆矩阵')
            axes[i].set_xlabel('预测标签')
            axes[i].set_ylabel('真实标签')
            
            # 计算指标
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='weighted')
            recall = recall_score(y_test, y_pred, average='weighted')
            f1 = f1_score(y_test, y_pred, average='weighted')
            
            print(f"\n{name} 性能指标:")
            print(f"  准确率: {accuracy:.4f}")
            print(f"  精确率: {precision:.4f}")
            print(f"  召回率: {recall:.4f}")
            print(f"  F1分数: {f1:.4f}")
        
        plt.tight_layout()
        plt.show()
        
    def binary_classification_metrics(self):
        """二分类详细指标分析"""
        # 创建二分类数据
        X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                                 weights=[0.7, 0.3], random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                          stratify=y, random_state=42)
        
        # 训练模型
        model = LogisticRegression(random_state=42)
        model.fit(X_train, y_train)
        
        # 预测
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        
        # 计算各种指标
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_pred_proba)
        
        print("=== 二分类详细指标 ===")
        print(f"准确率: {accuracy:.4f}")
        print(f"精确率: {precision:.4f}")
        print(f"召回率: {recall:.4f}")
        print(f"F1分数: {f1:.4f}")
        print(f"ROC AUC: {roc_auc:.4f}")
        
        # 混淆矩阵
        cm = confusion_matrix(y_test, y_pred)
        tn, fp, fn, tp = cm.ravel()
        
        print(f"\n混淆矩阵分解:")
        print(f"真负例(TN): {tn}")
        print(f"假正例(FP): {fp}")
        print(f"假负例(FN): {fn}")
        print(f"真正例(TP): {tp}")
        
        # 可视化
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 混淆矩阵
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
        axes[0, 0].set_title('混淆矩阵')
        axes[0, 0].set_xlabel('预测标签')
        axes[0, 0].set_ylabel('真实标签')
        
        # ROC曲线
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        axes[0, 1].plot(fpr, tpr, label=f'ROC曲线 (AUC = {roc_auc:.3f})')
        axes[0, 1].plot([0, 1], [0, 1], 'k--', label='随机分类器')
        axes[0, 1].set_xlabel('假正例率')
        axes[0, 1].set_ylabel('真正例率')
        axes[0, 1].set_title('ROC曲线')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        
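        # PR curve
        # recall_curve is returned in decreasing order, so np.trapz(precision, recall)
        # yields a negative area; sklearn.metrics.auc accepts x that is monotonic in either direction
        from sklearn.metrics import auc
        precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_pred_proba)
        pr_auc = auc(recall_curve, precision_curve)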
        axes[1, 0].plot(recall_curve, precision_curve, label=f'PR曲线 (AUC = {pr_auc:.3f})')
        axes[1, 0].set_xlabel('召回率')
        axes[1, 0].set_ylabel('精确率')
        axes[1, 0].set_title('精确率-召回率曲线')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
        
        # Threshold analysis
        thresholds = np.linspace(0, 1, 100)
        precisions, recalls, f1s = [], [], []
        
        for threshold in thresholds:
            y_pred_thresh = (y_pred_proba >= threshold).astype(int)
            # zero_division=0 only applies when a metric is undefined (no predicted positives);
            # at low thresholds, where everything is predicted positive, recall is correctly 1
            precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
            recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))
            f1s.append(f1_score(y_test, y_pred_thresh, zero_division=0))
        
        axes[1, 1].plot(thresholds, precisions, label='精确率')
        axes[1, 1].plot(thresholds, recalls, label='召回率')
        axes[1, 1].plot(thresholds, f1s, label='F1分数')
        axes[1, 1].set_xlabel('阈值')
        axes[1, 1].set_ylabel('指标值')
        axes[1, 1].set_title('阈值对指标的影响')
        axes[1, 1].legend()
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'roc_auc': roc_auc,
            'pr_auc': pr_auc
        }
        
    def multiclass_metrics_demo(self):
        """多分类指标演示"""
        # 创建多分类数据
        X, y = make_classification(n_samples=1000, n_features=10, n_classes=4,
                                 n_informative=8, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                          stratify=y, random_state=42)
        
        # 训练模型
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        print("=== 多分类指标演示 ===")
        
        # 整体指标
        accuracy = accuracy_score(y_test, y_pred)
        print(f"整体准确率: {accuracy:.4f}")
        
        # 不同平均方式的指标
        avg_methods = ['macro', 'micro', 'weighted']
        
        for avg in avg_methods:
            precision = precision_score(y_test, y_pred, average=avg)
            recall = recall_score(y_test, y_pred, average=avg)
            f1 = f1_score(y_test, y_pred, average=avg)
            
            print(f"\n{avg.upper()} 平均:")
            print(f"  精确率: {precision:.4f}")
            print(f"  召回率: {recall:.4f}")
            print(f"  F1分数: {f1:.4f}")
        
        # 每个类别的详细报告
        print("\n=== 分类报告 ===")
        print(classification_report(y_test, y_pred))
        
        # 可视化混淆矩阵
        cm = confusion_matrix(y_test, y_pred)
        
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title('多分类混淆矩阵')
        plt.xlabel('预测标签')
        plt.ylabel('真实标签')
        plt.tight_layout()
        plt.show()
        
        # 每个类别的ROC曲线
        y_pred_proba = model.predict_proba(X_test)
        
        plt.figure(figsize=(12, 8))
        
        for i in range(len(np.unique(y))):
            # 将多分类转为二分类(一对其余)
            y_test_binary = (y_test == i).astype(int)
            y_pred_proba_binary = y_pred_proba[:, i]
            
            fpr, tpr, _ = roc_curve(y_test_binary, y_pred_proba_binary)
            roc_auc = roc_auc_score(y_test_binary, y_pred_proba_binary)
            
            plt.plot(fpr, tpr, label=f'类别 {i} (AUC = {roc_auc:.3f})')
        
        plt.plot([0, 1], [0, 1], 'k--', label='随机分类器')
        plt.xlabel('假正例率')
        plt.ylabel('真正例率')
        plt.title('多分类ROC曲线(一对其余)')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

# 演示分类评估指标
print("=== 分类评估指标演示 ===")
clf_metrics = ClassificationMetricsDemo()
clf_metrics.metrics_theory()
clf_metrics.confusion_matrix_demo()
clf_metrics.binary_classification_metrics()
clf_metrics.multiclass_metrics_demo()
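
The formulas listed in the theory part of this section can be checked by hand on a small example. The sketch below derives accuracy, precision, recall and F1 from the four confusion-matrix cells and compares them with the scikit-learn functions. The ten labels are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made labels: 4 positives and 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# Cross-check against the library implementations
matches_sklearn = (
    np.isclose(accuracy, accuracy_score(y_true, y_pred))
    and np.isclose(precision, precision_score(y_true, y_pred))
    and np.isclose(recall, recall_score(y_true, y_pred))
    and np.isclose(f1, f1_score(y_true, y_pred))
)
print("matches scikit-learn:", matches_sklearn)
```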

6.4 Evaluation Metrics for Regression Models

6.4.1 Regression Metrics in Detail

class RegressionMetricsDemo:
    def __init__(self):
        self.metrics = {}
        self.results = {}
        
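    def regression_metrics_theory(self):
        """Theory behind regression evaluation metrics"""
        print("=== Regression Metrics: Theory ===")
        print("1. Mean squared error (MSE):")
        print("   - MSE = Σ(y_true - y_pred)² / n")
        print("   - Sensitive to large errors; expressed in squared units of the target")
        print("2. Root mean squared error (RMSE):")
        print("   - RMSE = √MSE")
        print("   - Same units as the target, so easier to interpret")
        print("3. Mean absolute error (MAE):")
        print("   - MAE = Σ|y_true - y_pred| / n")
        print("   - Linear penalty; less sensitive to outliers than MSE")
        print("4. Coefficient of determination (R²):")
        print("   - R² = 1 - SS_res / SS_tot")
        print("   - Fraction of variance explained; at most 1, and negative when the model is worse than predicting the mean")
        print("5. Mean absolute percentage error (MAPE):")
        print("   - MAPE = (100% / n) * Σ|(y_true - y_pred) / y_true|")
        print("   - Relative error, handy for comparing data of different scales; undefined when y_true = 0")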
        
    def basic_regression_metrics(self):
        """基础回归指标演示"""
        # 创建回归数据
        X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        
        # 训练不同模型
        models = {
            'Ridge Regression': Ridge(alpha=0.1),
            'Decision Tree': DecisionTreeRegressor(random_state=42),
            'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
        }
        
        results = {}
        
        for name, model in models.items():
            # 训练和预测
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            
            # 计算指标
            mse = mean_squared_error(y_test, y_pred)
            rmse = np.sqrt(mse)
            mae = mean_absolute_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)
            
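            # Compute MAPE; zeros in y_test are replaced by 1 to avoid division by zero.
            # Targets from make_regression are centered on 0, so MAPE can be very large here.
            mape = np.mean(np.abs((y_test - y_pred) / np.where(y_test != 0, y_test, 1))) * 100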
            
            results[name] = {
                'MSE': mse,
                'RMSE': rmse,
                'MAE': mae,
                'R²': r2,
                'MAPE': mape,
                'predictions': y_pred
            }
            
            print(f"\n{name} 性能指标:")
            print(f"  MSE: {mse:.4f}")
            print(f"  RMSE: {rmse:.4f}")
            print(f"  MAE: {mae:.4f}")
            print(f"  R²: {r2:.4f}")
            print(f"  MAPE: {mape:.2f}%")
        
        # 可视化预测结果
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 指标比较
        metrics = ['MSE', 'RMSE', 'MAE', 'R²']
        for i, metric in enumerate(metrics):
            ax = axes[i//2, i%2]
            model_names = list(models.keys())
            values = [results[name][metric] for name in model_names]
            
            bars = ax.bar(model_names, values, alpha=0.7)
            ax.set_title(f'{metric} 比较')
            ax.set_ylabel(metric)
            ax.tick_params(axis='x', rotation=45)
            
            # 添加数值标签
            for bar, value in zip(bars, values):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height,
                       f'{value:.3f}', ha='center', va='bottom')
            
            ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # 预测vs真实值散点图
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        
        for i, (name, result) in enumerate(results.items()):
            y_pred = result['predictions']
            r2 = result['R²']
            
            axes[i].scatter(y_test, y_pred, alpha=0.6)
            axes[i].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
            axes[i].set_xlabel('真实值')
            axes[i].set_ylabel('预测值')
            axes[i].set_title(f'{name}\nR² = {r2:.3f}')
            axes[i].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        return results
        
    def residual_analysis(self):
        """残差分析"""
        # 创建数据
        X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
        
        # 训练模型
        model = Ridge(alpha=0.1)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        # 计算残差
        residuals = y_test - y_pred
        
        print("=== 残差分析 ===")
        print(f"残差均值: {residuals.mean():.6f}")
        print(f"残差标准差: {residuals.std():.4f}")
        
        # 可视化残差分析
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 残差vs预测值
        axes[0, 0].scatter(y_pred, residuals, alpha=0.6)
        axes[0, 0].axhline(y=0, color='r', linestyle='--')
        axes[0, 0].set_xlabel('预测值')
        axes[0, 0].set_ylabel('残差')
        axes[0, 0].set_title('残差 vs 预测值')
        axes[0, 0].grid(True, alpha=0.3)
        
        # 残差直方图
        axes[0, 1].hist(residuals, bins=30, alpha=0.7, edgecolor='black')
        axes[0, 1].set_xlabel('残差')
        axes[0, 1].set_ylabel('频次')
        axes[0, 1].set_title('残差分布')
        axes[0, 1].grid(True, alpha=0.3)
        
        # Q-Q图
        from scipy import stats
        stats.probplot(residuals, dist="norm", plot=axes[1, 0])
        axes[1, 0].set_title('残差Q-Q图')
        axes[1, 0].grid(True, alpha=0.3)
        
        # 残差vs拟合值(检查异方差性)
        axes[1, 1].scatter(y_pred, np.abs(residuals), alpha=0.6)
        axes[1, 1].set_xlabel('预测值')
        axes[1, 1].set_ylabel('|残差|')
        axes[1, 1].set_title('绝对残差 vs 预测值')
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # 残差统计检验
        from scipy.stats import shapiro, jarque_bera
        
        # 正态性检验
        shapiro_stat, shapiro_p = shapiro(residuals)
        jb_stat, jb_p = jarque_bera(residuals)
        
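        print("\nNormality tests on the residuals:")
        print(f"Shapiro-Wilk: statistic={shapiro_stat:.4f}, p-value={shapiro_p:.4f}")
        print(f"Jarque-Bera: statistic={jb_stat:.4f}, p-value={jb_p:.4f}")
        
        # A large p-value means the test failed to reject normality; it does not prove it
        if shapiro_p > 0.05:
            print("No evidence against normality of the residuals (p > 0.05)")
        else:
            print("Normality of the residuals is rejected at the 5% level (p ≤ 0.05)")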
        
        return residuals
        
    def cross_validation_regression(self):
        """回归模型交叉验证"""
        # 创建数据
        X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=42)
        
        # 不同模型
        models = {
            'Ridge': Ridge(alpha=1.0),
            'Lasso': Lasso(alpha=0.1),
            'Decision Tree': DecisionTreeRegressor(random_state=42),
            'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42)
        }
        
        # 不同评估指标
        scoring_metrics = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']
        
        results = {}
        
        for name, model in models.items():
            results[name] = {}
            for metric in scoring_metrics:
                scores = cross_val_score(model, X, y, cv=5, scoring=metric)
                if metric.startswith('neg_'):
                    scores = -scores  # 转换为正值
                    metric_name = metric[4:]  # 去掉'neg_'前缀
                else:
                    metric_name = metric
                
                results[name][metric_name] = {
                    'mean': scores.mean(),
                    'std': scores.std(),
                    'scores': scores
                }
        
        # 可视化结果
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        metric_names = ['mean_squared_error', 'mean_absolute_error', 'r2']
        metric_labels = ['MSE', 'MAE', 'R²']
        
        for i, (metric, label) in enumerate(zip(metric_names, metric_labels)):
            model_names = list(models.keys())
            means = [results[name][metric]['mean'] for name in model_names]
            stds = [results[name][metric]['std'] for name in model_names]
            
            x_pos = np.arange(len(model_names))
            bars = axes[i].bar(x_pos, means, yerr=stds, capsize=5, alpha=0.7)
            axes[i].set_title(f'{label} 交叉验证结果')
            axes[i].set_xlabel('模型')
            axes[i].set_ylabel(label)
            axes[i].set_xticks(x_pos)
            axes[i].set_xticklabels(model_names, rotation=45)
            axes[i].grid(True, alpha=0.3)
            
            # 添加数值标签
            for bar, mean, std in zip(bars, means, stds):
                height = bar.get_height()
                axes[i].text(bar.get_x() + bar.get_width()/2., height,
                           f'{mean:.3f}±{std:.3f}', ha='center', va='bottom', fontsize=8)
        
        plt.tight_layout()
        plt.show()
        
        # 打印详细结果
        print("=== 回归模型交叉验证结果 ===")
        for name in models.keys():
            print(f"\n{name}:")
            for metric in metric_names:
                mean_score = results[name][metric]['mean']
                std_score = results[name][metric]['std']
                print(f"  {metric}: {mean_score:.4f} ± {std_score:.4f}")
        
        return results

# 演示回归评估指标
print("=== 回归评估指标演示 ===")
reg_metrics = RegressionMetricsDemo()
reg_metrics.regression_metrics_theory()
reg_metrics.basic_regression_metrics()
reg_metrics.residual_analysis()
reg_metrics.cross_validation_regression()
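
The regression formulas from the theory part of this section can be verified on a four-point example. The sketch below computes MSE, RMSE, MAE and R² with plain NumPy, compares them with scikit-learn, and shows that R² becomes negative for a prediction worse than the mean. The numbers and the name `y_bad` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))

# R² = 1 - SS_res / SS_tot
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R²={r2:.4f}")

# Cross-check against the library implementations
matches_sklearn = (
    np.isclose(mse, mean_squared_error(y_true, y_pred))
    and np.isclose(mae, mean_absolute_error(y_true, y_pred))
    and np.isclose(r2, r2_score(y_true, y_pred))
)
print("matches scikit-learn:", matches_sklearn)

# A deliberately poor constant prediction: worse than predicting the mean,
# so R² drops below zero
y_bad = np.full_like(y_true, 10.0)
r2_bad = r2_score(y_true, y_bad)
print(f"R² of the constant prediction: {r2_bad:.4f}")
```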

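6.5 Hyperparameter Tuning

6.5.1 Overview of Hyperparameter Tuning

Hyperparameter tuning is an important part of machine learning: it improves model performance by systematically searching for the best combination of hyperparameter values.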
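
As a compact illustration of such a search, the sketch below runs a random search over two SVM hyperparameters. It assumes the scaler and classifier are wrapped in a `Pipeline` so that scaling is re-fitted inside each cross-validation fold. The search ranges, `n_iter=8` and the step names `scaler` and `svc` are arbitrary choices for the example, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step names become prefixes of the hyperparameter names ("svc__C")
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Log-uniform ranges spread samples evenly across orders of magnitude
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=8, cv=3,
    scoring="accuracy", random_state=42
)
search.fit(X_train, y_train)

# The held-out test set is touched only once, after the search has finished
test_accuracy = search.score(X_test, y_test)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.4f}")
print(f"test accuracy: {test_accuracy:.4f}")
```

Random search evaluates a fixed number of candidates regardless of how many hyperparameters are tuned, which is one reason it is often tried before an exhaustive grid.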

class HyperparameterTuningDemo:
    def __init__(self):
        self.best_models = {}
        self.tuning_results = {}
        
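    def tuning_theory(self):
        """Theory behind hyperparameter tuning"""
        print("=== Hyperparameter Tuning: Theory ===")
        print("1. Hyperparameters vs. parameters:")
        print("   - Parameters: weights learned during training")
        print("   - Hyperparameters: configuration values set before training")
        print("2. Tuning methods:")
        print("   - Grid search: exhaustively tries every combination in the grid")
        print("   - Random search: samples combinations at random")
        print("   - Bayesian optimization: fits a surrogate model to past trials to pick the next candidate")
        print("   - Genetic algorithms: search that mimics evolution")
        print("3. Search strategies:")
        print("   - Coarse search followed by fine search")
        print("   - Multi-stage search")
        print("   - Early stopping")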
        
    def grid_search_demo(self):
        """网格搜索演示"""
        # 创建数据
        X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                          stratify=y, random_state=42)
        
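        # Standardize features. Note: fitting the scaler on all of X_train before
        # GridSearchCV lets each validation fold influence the scaling statistics;
        # wrapping the scaler and SVC in a Pipeline would avoid this mild leakage.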
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        print("=== 网格搜索演示 ===")
        
        # SVM网格搜索
        svm_param_grid = {
            'C': [0.1, 1, 10, 100],
            'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1],
            'kernel': ['rbf', 'poly', 'sigmoid']
        }
        
        svm = SVC(random_state=42)
        svm_grid_search = GridSearchCV(
            svm, svm_param_grid, cv=5, scoring='accuracy', 
            n_jobs=-1, verbose=1
        )
        
        print("开始SVM网格搜索...")
        svm_grid_search.fit(X_train_scaled, y_train)
        
        print(f"SVM最佳参数: {svm_grid_search.best_params_}")
        print(f"SVM最佳交叉验证分数: {svm_grid_search.best_score_:.4f}")
        
        # 随机森林网格搜索
        rf_param_grid = {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
        
        rf = RandomForestClassifier(random_state=42)
        rf_grid_search = GridSearchCV(
            rf, rf_param_grid, cv=5, scoring='accuracy', 
            n_jobs=-1, verbose=1
        )
        
        print("\n开始随机森林网格搜索...")
        rf_grid_search.fit(X_train, y_train)
        
        print(f"随机森林最佳参数: {rf_grid_search.best_params_}")
        print(f"随机森林最佳交叉验证分数: {rf_grid_search.best_score_:.4f}")
        
        # 测试集性能比较
        svm_test_score = svm_grid_search.score(X_test_scaled, y_test)
        rf_test_score = rf_grid_search.score(X_test, y_test)
        
        print(f"\n测试集性能:")
        print(f"SVM: {svm_test_score:.4f}")
        print(f"随机森林: {rf_test_score:.4f}")
        
        # 可视化网格搜索结果
        self.visualize_grid_search_results(svm_grid_search, 'SVM')
        self.visualize_grid_search_results(rf_grid_search, '随机森林')
        
        self.best_models['SVM'] = svm_grid_search.best_estimator_
        self.best_models['RandomForest'] = rf_grid_search.best_estimator_
        
        return svm_grid_search, rf_grid_search
        
    def visualize_grid_search_results(self, grid_search, model_name):
        """可视化网格搜索结果"""
        results_df = pd.DataFrame(grid_search.cv_results_)
        
        # 选择前10个最佳结果
        top_results = results_df.nlargest(10, 'mean_test_score')
        
        plt.figure(figsize=(12, 8))
        
        # 参数组合性能
        plt.subplot(2, 2, 1)
        plt.bar(range(len(top_results)), top_results['mean_test_score'])
        plt.title(f'{model_name} - 前10个最佳参数组合')
        plt.xlabel('参数组合排名')
        plt.ylabel('交叉验证分数')
        plt.grid(True, alpha=0.3)
        
        # 训练时间 vs 性能
        plt.subplot(2, 2, 2)
        plt.scatter(results_df['mean_fit_time'], results_df['mean_test_score'], alpha=0.6)
        plt.xlabel('平均训练时间 (秒)')
        plt.ylabel('交叉验证分数')
        plt.title(f'{model_name} - 训练时间 vs 性能')
        plt.grid(True, alpha=0.3)
        
        # 分数分布
        plt.subplot(2, 2, 3)
        plt.hist(results_df['mean_test_score'], bins=20, alpha=0.7, edgecolor='black')
        plt.xlabel('交叉验证分数')
        plt.ylabel('频次')
        plt.title(f'{model_name} - 分数分布')
        plt.grid(True, alpha=0.3)
        
        # 分数 vs 标准差
        plt.subplot(2, 2, 4)
        plt.scatter(results_df['mean_test_score'], results_df['std_test_score'], alpha=0.6)
        plt.xlabel('平均交叉验证分数')
        plt.ylabel('分数标准差')
        plt.title(f'{model_name} - 性能 vs 稳定性')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def random_search_demo(self):
        """随机搜索演示"""
        # 创建数据
        X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                          stratify=y, random_state=42)
        
        print("=== 随机搜索演示 ===")
        
        # 随机森林随机搜索
        from scipy.stats import randint, uniform
        
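        rf_param_dist = {
            'n_estimators': randint(50, 300),
            # Seed the one-off draw so the candidate depths are reproducible
            'max_depth': [None] + list(randint(10, 50).rvs(10, random_state=42)),
            'min_samples_split': randint(2, 20),
            'min_samples_leaf': randint(1, 10),
            # uniform(loc, scale) samples from [loc, loc + scale] = [0.1, 1.0]
            'max_features': uniform(0.1, 0.9)
        }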
        
        rf = RandomForestClassifier(random_state=42)
        rf_random_search = RandomizedSearchCV(
            rf, rf_param_dist, n_iter=100, cv=5, 
            scoring='accuracy', n_jobs=-1, random_state=42, verbose=1
        )
        
        print("开始随机森林随机搜索...")
        rf_random_search.fit(X_train, y_train)
        
        print(f"随机搜索最佳参数: {rf_random_search.best_params_}")
        print(f"随机搜索最佳分数: {rf_random_search.best_score_:.4f}")
        
        # 比较网格搜索和随机搜索
        if 'RandomForest' in self.best_models:
            grid_score = self.best_models['RandomForest'].score(X_test, y_test)
            random_score = rf_random_search.score(X_test, y_test)
            
            print(f"\n测试集性能比较:")
            print(f"网格搜索: {grid_score:.4f}")
            print(f"随机搜索: {random_score:.4f}")
        
        # 可视化搜索过程
        self.visualize_search_progress(rf_random_search, '随机搜索')
        
        return rf_random_search
        
    def visualize_search_progress(self, search_cv, search_type):
        """可视化搜索进度"""
        results_df = pd.DataFrame(search_cv.cv_results_)
        
        plt.figure(figsize=(15, 5))
        
        # 搜索进度
        plt.subplot(1, 3, 1)
        plt.plot(range(len(results_df)), results_df['mean_test_score'], 'b-', alpha=0.7)
        plt.axhline(y=search_cv.best_score_, color='r', linestyle='--', 
                   label=f'最佳分数: {search_cv.best_score_:.4f}')
        plt.xlabel('迭代次数')
        plt.ylabel('交叉验证分数')
        plt.title(f'{search_type} - 搜索进度')
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        # 累积最佳分数
        plt.subplot(1, 3, 2)
        cumulative_best = np.maximum.accumulate(results_df['mean_test_score'])
        plt.plot(range(len(results_df)), cumulative_best, 'g-', linewidth=2)
        plt.xlabel('迭代次数')
        plt.ylabel('累积最佳分数')
        plt.title(f'{search_type} - 累积最佳分数')
        plt.grid(True, alpha=0.3)
        
        # 分数分布
        plt.subplot(1, 3, 3)
        plt.hist(results_df['mean_test_score'], bins=20, alpha=0.7, edgecolor='black')
        plt.axvline(x=search_cv.best_score_, color='r', linestyle='--', 
                   label=f'最佳分数: {search_cv.best_score_:.4f}')
        plt.xlabel('交叉验证分数')
        plt.ylabel('频次')
        plt.title(f'{search_type} - 分数分布')
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def bayesian_optimization_demo(self):
        """贝叶斯优化演示(使用scikit-optimize)"""
        try:
            from skopt import BayesSearchCV
            from skopt.space import Real, Integer, Categorical
            
            # 创建数据
            X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=42)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                              stratify=y, random_state=42)
            
            print("=== 贝叶斯优化演示 ===")
            
            # 定义搜索空间
            search_space = {
                'n_estimators': Integer(50, 300),
                'max_depth': Integer(3, 50),
                'min_samples_split': Integer(2, 20),
                'min_samples_leaf': Integer(1, 10),
                'max_features': Real(0.1, 1.0)
            }
            
            rf = RandomForestClassifier(random_state=42)
            bayes_search = BayesSearchCV(
                rf, search_space, n_iter=50, cv=5,
                scoring='accuracy', n_jobs=-1, random_state=42
            )
            
            print("开始贝叶斯优化...")
            bayes_search.fit(X_train, y_train)
            
            print(f"贝叶斯优化最佳参数: {bayes_search.best_params_}")
            print(f"贝叶斯优化最佳分数: {bayes_search.best_score_:.4f}")
            
            # 测试集性能
            bayes_score = bayes_search.score(X_test, y_test)
            print(f"测试集性能: {bayes_score:.4f}")
            
            # 可视化优化过程
            self.visualize_search_progress(bayes_search, '贝叶斯优化')
            
            return bayes_search
            
        except ImportError:
            print("scikit-optimize未安装,跳过贝叶斯优化演示")
            print("可以通过 'pip install scikit-optimize' 安装")
            return None
            
    def hyperparameter_importance_analysis(self):
        """超参数重要性分析"""
        # 创建数据
        X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=42)
        
        # 随机森林参数重要性分析
        param_grid = {
            'n_estimators': [50, 100, 200],
            'max_depth': [10, 20, None],
            'min_samples_split': [2, 5, 10]
        }
        
        rf = RandomForestClassifier(random_state=42)
        grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
        grid_search.fit(X, y)
        
        results_df = pd.DataFrame(grid_search.cv_results_)
        
        print("=== 超参数重要性分析 ===")
        
        # 分析每个参数的影响
        params = ['param_n_estimators', 'param_max_depth', 'param_min_samples_split']
        param_names = ['n_estimators', 'max_depth', 'min_samples_split']
        
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        
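        for i, (param, param_name) in enumerate(zip(params, param_names)):
            # Average the score per parameter value. Cast to str first because
            # groupby drops None keys (max_depth=None); sort=False keeps grid order
            param_performance = (results_df
                                 .groupby(results_df[param].astype(str), sort=False)['mean_test_score']
                                 .agg(['mean', 'std']))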
            
            x_values = range(len(param_performance))
            axes[i].bar(x_values, param_performance['mean'], 
                       yerr=param_performance['std'], capsize=5, alpha=0.7)
            axes[i].set_title(f'{param_name} 对性能的影响')
            axes[i].set_xlabel(param_name)
            axes[i].set_ylabel('交叉验证分数')
            axes[i].set_xticks(x_values)
            axes[i].set_xticklabels(param_performance.index, rotation=45)
            axes[i].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # 参数交互效应分析
        self.analyze_parameter_interactions(results_df)
        
    def analyze_parameter_interactions(self, results_df):
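        """Analyze parameter interaction effects"""
        # Cast max_depth to str so that max_depth=None keeps its own row/column;
        # pivot_table would otherwise drop the None entries
        results_df = results_df.assign(
            param_max_depth=results_df['param_max_depth'].astype(str))
        plt.figure(figsize=(15, 10))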
        
        # n_estimators vs max_depth
        plt.subplot(2, 2, 1)
        pivot_table = results_df.pivot_table(
            values='mean_test_score', 
            index='param_n_estimators', 
            columns='param_max_depth'
        )
        sns.heatmap(pivot_table, annot=True, fmt='.3f', cmap='viridis')
        plt.title('n_estimators vs max_depth')
        
        # n_estimators vs min_samples_split
        plt.subplot(2, 2, 2)
        pivot_table = results_df.pivot_table(
            values='mean_test_score', 
            index='param_n_estimators', 
            columns='param_min_samples_split'
        )
        sns.heatmap(pivot_table, annot=True, fmt='.3f', cmap='viridis')
        plt.title('n_estimators vs min_samples_split')
        
        # max_depth vs min_samples_split
        plt.subplot(2, 2, 3)
        pivot_table = results_df.pivot_table(
            values='mean_test_score', 
            index='param_max_depth', 
            columns='param_min_samples_split'
        )
        sns.heatmap(pivot_table, annot=True, fmt='.3f', cmap='viridis')
        plt.title('max_depth vs min_samples_split')
        
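        # 2D view of the parameter space (n_estimators vs. min_samples_split),
        # colored by score; points for different max_depth values overlap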
        plt.subplot(2, 2, 4)
        scatter = plt.scatter(
            results_df['param_n_estimators'].astype(int), 
            results_df['param_min_samples_split'].astype(int),
            c=results_df['mean_test_score'], 
            cmap='viridis', s=100, alpha=0.7
        )
        plt.colorbar(scatter, label='交叉验证分数')
        plt.xlabel('n_estimators')
        plt.ylabel('min_samples_split')
        plt.title('参数空间可视化')
        
        plt.tight_layout()
        plt.show()

# 演示超参数调优
print("=== 超参数调优演示 ===")
tuning_demo = HyperparameterTuningDemo()
tuning_demo.tuning_theory()
tuning_demo.grid_search_demo()
tuning_demo.random_search_demo()
tuning_demo.bayesian_optimization_demo()
tuning_demo.hyperparameter_importance_analysis()
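
One way to keep preprocessing honest during a search is to put the scaler and the model into a single `Pipeline` and tune that. The scaler is then refit on the training part of every fold. The following is a minimal sketch under assumed settings (a small synthetic dataset and a reduced grid); the step names `scaler` and `svc` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Scaler and classifier are tuned as one estimator
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=42))])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

# The raw test set can be passed directly; the pipeline scales it internally
test_score = search.score(X_test, y_test)
print(search.best_params_)
print(f"CV score: {search.best_score_:.4f}, test score: {test_score:.4f}")
```

The test set is touched only once, after the search has finished, so the reported test score is not used for selecting parameters.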

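6.6 Learning Curves and Validation Curves

6.6.1 Learning Curve Analysis

Learning curves show how a model's training and validation scores change with the amount of training data, which reveals both its capacity to learn and how much data it needs.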
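
As a quick orientation, the sketch below calls `learning_curve` on a small synthetic dataset and prints the gap between the mean training and validation scores at each training set size. The dataset, the model, and the five size steps are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring='accuracy')

# Score arrays have one row per training set size and one column per CV fold
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, g in zip(train_sizes, gap):
    print(f"n_train={n:4d}  train-val gap={g:+.3f}")
```

With 5-fold cross-validation on 500 samples, the largest training set holds 400 samples, since one fold is always held out for validation.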

class LearningCurveDemo:
    def __init__(self):
        self.curves = {}
        
    def learning_curve_theory(self):
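        """Learning curve theory"""
        print("=== Learning Curve Theory ===")
        print("1. Definition:")
        print("   - Model performance as a function of training set size")
        print("   - Helps judge whether the amount of data is sufficient")
        print("2. Reading the curve shapes:")
        print("   - Training and validation scores both low and close together: high bias (underfitting)")
        print("   - Training score high with a large gap to the validation score: high variance (overfitting)")
        print("   - Validation score still rising at the largest size: more data is likely to help")
        print("   - Training and validation scores converge at a high level: good fit")
        print("3. Use cases:")
        print("   - Choosing a sufficient training set size")
        print("   - Diagnosing bias-variance problems")
        print("   - Estimating the value of collecting more data")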
        
    def plot_learning_curve(self, estimator, X, y, title="学习曲线"):
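        """Plot a learning curve"""
        from sklearn.base import is_classifier

        train_sizes, train_scores, val_scores = learning_curve(
            estimator, X, y, cv=5, n_jobs=-1,
            train_sizes=np.linspace(0.1, 1.0, 10),
            # is_classifier also covers classifiers without predict_proba (e.g. SVC)
            scoring='accuracy' if is_classifier(estimator) else 'r2'
        )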
        
        # 计算均值和标准差
        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        val_mean = np.mean(val_scores, axis=1)
        val_std = np.std(val_scores, axis=1)
        
        # 绘制学习曲线
        plt.figure(figsize=(10, 6))
        
        plt.plot(train_sizes, train_mean, 'o-', color='blue', label='训练分数')
        plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                        alpha=0.1, color='blue')
        
        plt.plot(train_sizes, val_mean, 'o-', color='red', label='验证分数')
        plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                        alpha=0.1, color='red')
        
        plt.xlabel('训练集大小')
        plt.ylabel('分数')
        plt.title(title)
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
        
        return train_sizes, train_scores, val_scores
        
    def compare_learning_curves(self):
        """比较不同模型的学习曲线"""
        # 创建数据
        X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
        
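        # Models of different complexity
        models = {
            '线性模型': LogisticRegression(random_state=42),
            # Depth-limited tree; with default settings it would be identical
            # to the unconstrained tree at the end of this dict
            '决策树': DecisionTreeClassifier(max_depth=5, random_state=42),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42),
            '过拟合决策树': DecisionTreeClassifier(max_depth=None, min_samples_split=2, 
                                           min_samples_leaf=1, random_state=42)
        }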
        
        print("=== 不同模型学习曲线比较 ===")
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.ravel()
        
        for i, (name, model) in enumerate(models.items()):
            train_sizes, train_scores, val_scores = learning_curve(
                model, X, y, cv=5, n_jobs=-1,
                train_sizes=np.linspace(0.1, 1.0, 10),
                scoring='accuracy'
            )
            
            train_mean = np.mean(train_scores, axis=1)
            train_std = np.std(train_scores, axis=1)
            val_mean = np.mean(val_scores, axis=1)
            val_std = np.std(val_scores, axis=1)
            
            axes[i].plot(train_sizes, train_mean, 'o-', color='blue', label='训练分数')
            axes[i].fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                               alpha=0.1, color='blue')
            
            axes[i].plot(train_sizes, val_mean, 'o-', color='red', label='验证分数')
            axes[i].fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                               alpha=0.1, color='red')
            
            axes[i].set_xlabel('训练集大小')
            axes[i].set_ylabel('准确率')
            axes[i].set_title(f'{name}学习曲线')
            axes[i].legend()
            axes[i].grid(True, alpha=0.3)
            
            # 分析学习曲线特征
            final_gap = train_mean[-1] - val_mean[-1]
            if final_gap > 0.1:
                bias_variance = "高方差(过拟合)"
            elif val_mean[-1] < 0.8:
                bias_variance = "高偏差(欠拟合)"
            else:
                bias_variance = "良好平衡"
            
            print(f"{name}: 最终训练分数={train_mean[-1]:.3f}, "
                  f"验证分数={val_mean[-1]:.3f}, 诊断={bias_variance}")
        
        plt.tight_layout()
        plt.show()
        
    def validation_curve_demo(self):
        """验证曲线演示"""
        # 创建数据
        X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=42)
        
        print("=== 验证曲线演示 ===")
        
        # 随机森林 n_estimators 验证曲线
        param_range = [10, 20, 50, 100, 200, 300]
        train_scores, val_scores = validation_curve(
            RandomForestClassifier(random_state=42), X, y,
            param_name='n_estimators', param_range=param_range,
            cv=5, scoring='accuracy', n_jobs=-1
        )
        
        self.plot_validation_curve(param_range, train_scores, val_scores, 
                                 'n_estimators', '随机森林 n_estimators 验证曲线')
        
        # Validation curve for the SVM C parameter.
        # Scaling lives inside a pipeline so the scaler is refit on each
        # training fold and the validation folds do not leak into it
        from sklearn.pipeline import make_pipeline
        svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=42))
        
        param_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
        train_scores, val_scores = validation_curve(
            svm_pipeline, X, y,
            param_name='svc__C', param_range=param_range,
            cv=5, scoring='accuracy', n_jobs=-1
        )
        
        self.plot_validation_curve(param_range, train_scores, val_scores, 
                                 'C', 'SVM C参数验证曲线', log_scale=True)
        
        # 决策树 max_depth 验证曲线
        param_range = range(1, 21)
        train_scores, val_scores = validation_curve(
            DecisionTreeClassifier(random_state=42), X, y,
            param_name='max_depth', param_range=param_range,
            cv=5, scoring='accuracy', n_jobs=-1
        )
        
        self.plot_validation_curve(param_range, train_scores, val_scores, 
                                 'max_depth', '决策树 max_depth 验证曲线')
        
    def plot_validation_curve(self, param_range, train_scores, val_scores, 
                            param_name, title, log_scale=False):
        """绘制验证曲线"""
        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        val_mean = np.mean(val_scores, axis=1)
        val_std = np.std(val_scores, axis=1)
        
        plt.figure(figsize=(10, 6))
        
        plt.plot(param_range, train_mean, 'o-', color='blue', label='训练分数')
        plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, 
                        alpha=0.1, color='blue')
        
        plt.plot(param_range, val_mean, 'o-', color='red', label='验证分数')
        plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, 
                        alpha=0.1, color='red')
        
        if log_scale:
            plt.xscale('log')
        
        plt.xlabel(param_name)
        plt.ylabel('准确率')
        plt.title(title)
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        # 找到最佳参数
        best_idx = np.argmax(val_mean)
        best_param = param_range[best_idx]
        best_score = val_mean[best_idx]
        
        plt.axvline(x=best_param, color='green', linestyle='--', 
                   label=f'最佳{param_name}={best_param}')
        plt.legend()
        
        print(f"最佳{param_name}: {best_param}, 验证分数: {best_score:.4f}")
        
        plt.tight_layout()
        plt.show()
        
    def bias_variance_analysis(self):
        """Bias-variance analysis"""
        from sklearn.preprocessing import PolynomialFeatures
        
        # Create data; coef=True also returns the true slope, so the
        # noise-free target function y = coef * x is known exactly
        X, y, coef = make_regression(n_samples=200, n_features=1, noise=0.3,
                                     coef=True, random_state=42)
        rng = np.random.default_rng(42)  # reproducible bootstrap resampling
        
        print("=== Bias-Variance Analysis ===")
        
        # Model name -> (polynomial degree, Ridge alpha)
        models = {
            'Underfit (linear, alpha=100)': (1, 100),
            'Moderate fit (degree 2, alpha=1)': (2, 1),
            'Overfit (degree 15, alpha=0.001)': (15, 0.001)
        }
        
        # Evaluation grid and the true function on that grid
        X_test_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
        true_function = X_test_range.ravel() * float(coef)
        
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        
        for i, (name, (degree, alpha)) in enumerate(models.items()):
            # Fit the same model on many bootstrap training sets
            n_experiments = 100
            predictions = []
            
            for _ in range(n_experiments):
                # Resample the training set with replacement
                indices = rng.choice(len(X), size=int(0.8*len(X)), replace=True)
                X_sample = X[indices]
                y_sample = y[indices]
                
                # Polynomial features
                poly = PolynomialFeatures(degree=degree)
                X_poly = poly.fit_transform(X_sample)
                
                # Train a fresh model on this sample
                model = Ridge(alpha=alpha)
                model.fit(X_poly, y_sample)
                
                # Predict on the evaluation grid
                y_pred = model.predict(poly.transform(X_test_range))
                predictions.append(y_pred)
            
            predictions = np.array(predictions)
            
            # Variance of the predictions and squared bias of their mean
            mean_prediction = np.mean(predictions, axis=0)
            variance = np.var(predictions, axis=0)
            bias_squared = (mean_prediction - true_function) ** 2
            
            # Visualization
            axes[i].scatter(X, y, alpha=0.3, color='gray', label='Training data')
            axes[i].plot(X_test_range, true_function, 'g-', linewidth=2, label='True function')
            axes[i].plot(X_test_range, mean_prediction, 'r-', linewidth=2, label='Mean prediction')
            
            # Show a few individual predictions
            for j in range(0, n_experiments, 20):
                axes[i].plot(X_test_range, predictions[j], 'b-', alpha=0.1)
            
            axes[i].set_title(f'{name}\nmean bias²={np.mean(bias_squared):.3f}, '
                            f'mean variance={np.mean(variance):.3f}')
            axes[i].set_xlabel('X')
            axes[i].set_ylabel('y')
            axes[i].legend()
            axes[i].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# 演示学习曲线
print("=== 学习曲线演示 ===")
lc_demo = LearningCurveDemo()
lc_demo.learning_curve_theory()
lc_demo.compare_learning_curves()
lc_demo.validation_curve_demo()
lc_demo.bias_variance_analysis()
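
The bias-variance analysis rests on the identity that, at each input point, the mean squared error against the noise-free target equals squared bias plus variance. The sketch below checks this numerically with NumPy only. The ground-truth function `true_f`, the noise level, and the deliberately too-simple constant model are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function for this sketch
    return 2.0 * x + 1.0

x_grid = np.linspace(-1.0, 1.0, 50)
n_experiments = 200
predictions = np.empty((n_experiments, x_grid.size))

for k in range(n_experiments):
    # Draw a fresh noisy training set and fit a constant (degree-0) model,
    # which is too simple for a sloped target and therefore biased
    x_train = rng.uniform(-1.0, 1.0, size=30)
    y_train = true_f(x_train) + rng.normal(scale=0.3, size=30)
    coeffs = np.polyfit(x_train, y_train, deg=0)
    predictions[k] = np.polyval(coeffs, x_grid)

mean_pred = predictions.mean(axis=0)
bias_sq = (mean_pred - true_f(x_grid)) ** 2
variance = predictions.var(axis=0)
mse = ((predictions - true_f(x_grid)) ** 2).mean(axis=0)

print(f"bias^2={bias_sq.mean():.4f}  variance={variance.mean():.4f}  mse={mse.mean():.4f}")
```

Raising `deg` in `np.polyfit` should shift the error from the bias term toward the variance term, which is the trade-off the plots in this section visualize.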

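6.7 Model Selection Strategies

6.7.1 A Framework for Model Selection

Model selection is a key decision in a machine learning project and has to weigh several factors together: predictive performance, computational cost, complexity, and robustness.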
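
When several criteria are compared at once, `cross_validate` can collect multiple metrics and the fit times in a single pass, instead of rerunning cross-validation once per metric. This is a minimal sketch on an assumed synthetic dataset with an assumed choice of three metrics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

scoring = ['accuracy', 'f1_macro', 'roc_auc']
cv_out = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=5, scoring=scoring)

# Each metric is returned under the key 'test_<metric>', one value per fold
for metric in scoring:
    scores = cv_out[f'test_{metric}']
    print(f"{metric:10s}: {scores.mean():.3f} ± {scores.std():.3f}")
print(f"mean fit time: {cv_out['fit_time'].mean():.4f} s")
```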

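# Imports used by this section that the chapter's import block does not include
import time
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier

class ModelSelectionDemo: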
    def __init__(self):
        self.models = {}
        self.results = {}
        
    def model_selection_theory(self):
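        """Model selection theory"""
        print("=== Model Selection Theory ===")
        print("1. Selection criteria:")
        print("   - Predictive performance: accuracy, precision, recall, etc.")
        print("   - Computational cost: training time, prediction time")
        print("   - Model complexity: number of parameters, interpretability")
        print("   - Robustness: sensitivity to noise and outliers")
        print("2. Selection workflow:")
        print("   - Problem analysis: classification vs. regression, data characteristics")
        print("   - Candidate models: chosen to fit the problem")
        print("   - Performance evaluation: cross-validation with multiple metrics")
        print("   - Final decision: weigh the metrics against each other")
        print("3. Common strategies:")
        print("   - Simplicity first: Occam's razor")
        print("   - Ensemble methods: combine the strengths of several models")
        print("   - Domain knowledge: draw on subject-matter expertise")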
        
    def comprehensive_model_comparison(self):
        """综合模型比较"""
        # 创建多样化数据集
        datasets = {
            '线性可分': make_classification(n_samples=1000, n_features=10, n_redundant=0, 
                                      n_informative=10, n_clusters_per_class=1, random_state=42),
            '非线性': make_classification(n_samples=1000, n_features=10, n_redundant=0,
                                    n_informative=5, n_clusters_per_class=2, random_state=42),
            '高维稀疏': make_classification(n_samples=500, n_features=100, n_informative=10,
                                     n_redundant=0, random_state=42),
            '不平衡': make_classification(n_samples=1000, n_features=10, n_classes=2,
                                    weights=[0.9, 0.1], random_state=42)
        }
        
        # 候选模型
        models = {
            '逻辑回归': LogisticRegression(random_state=42, max_iter=1000),
            '决策树': DecisionTreeClassifier(random_state=42),
            '随机森林': RandomForestClassifier(n_estimators=100, random_state=42),
            'SVM': SVC(random_state=42, probability=True),
            '朴素贝叶斯': GaussianNB(),
            'KNN': KNeighborsClassifier(),
            '梯度提升': GradientBoostingClassifier(random_state=42)
        }
        
        print("=== 综合模型比较 ===")
        
        # 评估指标
        scoring_metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro', 'roc_auc']
        
        results_summary = {}
        
        for dataset_name, (X, y) in datasets.items():
            print(f"\n数据集: {dataset_name}")
            print(f"样本数: {X.shape[0]}, 特征数: {X.shape[1]}")
            
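            # Scaling is done inside a pipeline (SVM only, as before) so the scaler
            # is refit on each training fold and cannot leak validation data
            from sklearn.pipeline import make_pipeline
            
            dataset_results = {}
            
            for model_name, base_model in models.items():
                if model_name == 'SVM':
                    model = make_pipeline(StandardScaler(), base_model)
                else:
                    model = base_model
                X_use = X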
                
                # 交叉验证评估
                cv_results = {}
                for metric in scoring_metrics:
                    try:
                        scores = cross_val_score(model, X_use, y, cv=5, scoring=metric, n_jobs=-1)
                        cv_results[metric] = {
                            'mean': scores.mean(),
                            'std': scores.std()
                        }
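                    except Exception as e:
                        # Report the failure instead of silently recording a score of 0
                        print(f"  {model_name} / {metric} failed: {e}")
                        cv_results[metric] = {'mean': np.nan, 'std': np.nan}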
                
                # 训练时间评估
                start_time = time.time()
                model.fit(X_use, y)
                train_time = time.time() - start_time
                
                # 预测时间评估
                start_time = time.time()
                model.predict(X_use[:100])  # 预测100个样本
                predict_time = (time.time() - start_time) * 10  # 换算为1000个样本的时间
                
                dataset_results[model_name] = {
                    'metrics': cv_results,
                    'train_time': train_time,
                    'predict_time': predict_time
                }
                
                print(f"{model_name:12s}: 准确率={cv_results['accuracy']['mean']:.3f}±{cv_results['accuracy']['std']:.3f}, "
                      f"F1={cv_results['f1_macro']['mean']:.3f}±{cv_results['f1_macro']['std']:.3f}")
            
            results_summary[dataset_name] = dataset_results
        
        # 可视化比较结果
        self.visualize_model_comparison(results_summary)
        
        return results_summary
        
    def visualize_model_comparison(self, results_summary):
        """可视化模型比较结果"""
        datasets = list(results_summary.keys())
        models = list(results_summary[datasets[0]].keys())
        
        # 准确率比较
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))
        
        # 1. 准确率热力图
        accuracy_matrix = np.zeros((len(models), len(datasets)))
        for i, model in enumerate(models):
            for j, dataset in enumerate(datasets):
                accuracy_matrix[i, j] = results_summary[dataset][model]['metrics']['accuracy']['mean']
        
        im1 = axes[0, 0].imshow(accuracy_matrix, cmap='viridis', aspect='auto')
        axes[0, 0].set_xticks(range(len(datasets)))
        axes[0, 0].set_xticklabels(datasets, rotation=45)
        axes[0, 0].set_yticks(range(len(models)))
        axes[0, 0].set_yticklabels(models)
        axes[0, 0].set_title('模型准确率热力图')
        
        # 添加数值标注
        for i in range(len(models)):
            for j in range(len(datasets)):
                axes[0, 0].text(j, i, f'{accuracy_matrix[i, j]:.3f}', 
                               ha='center', va='center', color='white')
        
        plt.colorbar(im1, ax=axes[0, 0])
        
        # 2. 训练时间比较
        train_times = []
        model_names = []
        for model in models:
            avg_time = np.mean([results_summary[dataset][model]['train_time'] 
                              for dataset in datasets])
            train_times.append(avg_time)
            model_names.append(model)
        
        bars = axes[0, 1].bar(model_names, train_times, alpha=0.7)
        axes[0, 1].set_title('平均训练时间比较')
        axes[0, 1].set_ylabel('时间 (秒)')
        axes[0, 1].tick_params(axis='x', rotation=45)
        
        # Add value labels (loop variable named t to avoid shadowing the time module)
        for bar, t in zip(bars, train_times):
            axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                           f'{t:.3f}', ha='center', va='bottom')
        
        # 3. 性能稳定性(标准差)
        stability_data = []
        for model in models:
            avg_std = np.mean([results_summary[dataset][model]['metrics']['accuracy']['std'] 
                             for dataset in datasets])
            stability_data.append(avg_std)
        
        bars = axes[1, 0].bar(model_names, stability_data, alpha=0.7, color='orange')
        axes[1, 0].set_title('模型稳定性(准确率标准差)')
        axes[1, 0].set_ylabel('标准差')
        axes[1, 0].tick_params(axis='x', rotation=45)
        
        # 4. 综合性能雷达图(选择一个数据集)
        dataset_name = datasets[0]  # 选择第一个数据集
        metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
        
        # 选择前4个模型进行雷达图比较
        selected_models = models[:4]
        
        angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
        angles += angles[:1]  # 闭合图形
        
        # Replace the Cartesian axes in the bottom-right cell with a polar one
        axes[1, 1].remove()
        ax = fig.add_subplot(2, 2, 4, projection='polar')
        
        for model in selected_models:
            values = [results_summary[dataset_name][model]['metrics'][metric]['mean'] 
                     for metric in metrics]
            values += values[:1]  # 闭合图形
            
            ax.plot(angles, values, 'o-', linewidth=2, label=model)
            ax.fill(angles, values, alpha=0.1)
        
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metrics)
        ax.set_ylim(0, 1)
        ax.set_title(f'{dataset_name}数据集性能雷达图')
        ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
        
        plt.tight_layout()
        plt.show()
        
    def model_selection_decision_tree(self):
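        """Model selection decision tree"""
        print("=== Model Selection Decision Tree ===")
        
        decision_tree = """
        Model selection decision flow:
        
        1. Data characteristics (sample size)
           ├── Fewer than 1,000 samples → simple models (logistic regression, naive Bayes)
           ├── More than 10,000 samples → complex models become viable (deep learning, ensembles)
           └── In between → classical machine learning algorithms
        
        2. Feature dimensionality
           ├── High-dimensional, sparse → linear models (logistic regression, linear SVM)
           ├── Low-dimensional, dense → nonlinear models (RBF SVM, random forest)
           └── Medium dimensionality → several model families are reasonable
        
        3. Problem type
           ├── Linearly separable → linear models
           ├── Complex, nonlinear → tree models, SVM, neural networks
           └── Unknown → try several models
        
        4. Performance requirements
           ├── High accuracy → ensembles, deep learning
           ├── Fast prediction → linear models, naive Bayes
           ├── Interpretability → decision trees, linear models
           └── Balanced trade-off → random forest
        
        5. Data quality
           ├── Noisy → robust models (random forest, SVM)
           ├── Missing values → tree models (where the implementation supports them; otherwise impute first)
           └── Clean → several model families are reasonable
        """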
        
        print(decision_tree)
        
        # 实际决策示例
        self.practical_selection_example()
        
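    def practical_selection_example(self):
        """Check the recommendations against quick cross-validation runs."""
        # Imported here in case it was not imported earlier in the chapter
        from sklearn.naive_bayes import GaussianNB
        
        print("\n=== Practical selection example ===")
        
        # Simulated data for different scenarios
        scenarios = {
            'Small sample, high dimension': {
                'data': make_classification(n_samples=200, n_features=50, n_informative=10, random_state=42),
                'recommended': ['Logistic Regression', 'Naive Bayes'],
                'reason': 'Few samples and many features: linear models are more stable'
            },
            'Large sample, low dimension': {
                # n_redundant=0 is required: the default n_redundant=2 plus
                # n_informative=5 would exceed n_features=5 and raise a ValueError
                'data': make_classification(n_samples=5000, n_features=5, n_informative=5,
                                            n_redundant=0, random_state=42),
                'recommended': ['Random Forest', 'SVM', 'Gradient Boosting'],
                'reason': 'Many samples: complex models can capture nonlinear relationships'
            },
            'Imbalanced data': {
                'data': make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42),
                'recommended': ['Random Forest', 'Gradient Boosting'],
                'reason': 'Imbalanced classes: ensemble methods usually perform better'
            }
        }
        
        for scenario_name, scenario_info in scenarios.items():
            X, y = scenario_info['data']
            print(f"\nScenario: {scenario_name}")
            print(f"Data shape: {X.shape}")
            print(f"Class distribution: {np.bincount(y)}")
            print(f"Recommended models: {', '.join(scenario_info['recommended'])}")
            print(f"Reason: {scenario_info['reason']}")
            
            # Quick check of the recommendation (only three models are tested here)
            models_to_test = {
                'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
                'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
                'Naive Bayes': GaussianNB()
            }
            
            best_score = -np.inf
            best_model = None
            
            for model_name, model in models_to_test.items():
                # Macro-averaged F1 weights both classes equally
                score = cross_val_score(model, X, y, cv=3, scoring='f1_macro').mean()
                print(f"  {model_name}: F1={score:.3f}")
                
                if score > best_score:
                    best_score = score
                    best_model = model_name
            
            print(f"  Best in practice: {best_model}")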

# Model selection demo
print("=== Model selection demo ===")
ms_demo = ModelSelectionDemo()
ms_demo.model_selection_theory()
ms_demo.comprehensive_model_comparison()
ms_demo.model_selection_decision_tree()
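
The first steps of the decision flow printed above can also be written as a small function. The sketch below is only an illustration: `suggest_models` and its parameters `high_dim_sparse` and `need_interpretability` are hypothetical names that are not part of the chapter's classes, and the thresholds are the same rules of thumb as in the printed flow, not hard limits.

```python
def suggest_models(n_samples, high_dim_sparse=False, need_interpretability=False):
    """Return candidate model families following the rules of thumb above."""
    suggestions = []

    # Step 1: sample size
    if n_samples < 1000:
        suggestions += ["logistic regression", "naive Bayes"]
    elif n_samples > 10000:
        suggestions += ["ensemble methods", "deep learning"]
    else:
        suggestions += ["classical ML algorithms"]

    # Step 2: feature dimensionality
    if high_dim_sparse:
        suggestions += ["logistic regression", "linear SVM"]

    # Step 4: performance requirements (interpretability only, for brevity)
    if need_interpretability:
        suggestions += ["decision tree", "logistic regression"]

    # Remove duplicates while keeping the order of first appearance
    return list(dict.fromkeys(suggestions))


print(suggest_models(200, high_dim_sparse=True))
# → ['logistic regression', 'naive Bayes', 'linear SVM']
```

Such a function only narrows the candidate list; the final choice should still come from cross-validated comparison, as in `practical_selection_example`.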

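6.8 Case Study: Credit Card Fraud Detection

6.8.1 Project Background and Data Preparation

This section works through a complete credit card fraud detection project that applies the evaluation and selection techniques covered in this chapter.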

# Imports used by this case study (repeated here in case they were not
# imported earlier in the chapter)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB

class CreditCardFraudDetection:
    def __init__(self):
        self.models = {}
        self.results = {}
        self.best_model = None
        
    def create_fraud_dataset(self):
        """Create a simulated credit card fraud dataset."""
        print("=== Creating the credit card fraud dataset ===")
        
        # Create an imbalanced binary classification dataset
        X, y = make_classification(
            n_samples=10000,
            n_features=20,
            n_informative=15,
            n_redundant=5,
            n_clusters_per_class=1,
            weights=[0.99, 0.01],  # about 1% fraudulent transactions
            random_state=42
        )
        
        # Attach feature names. The values are synthetic continuous features;
        # the names are illustrative labels only.
        feature_names = [
            'transaction_amount', 'account_age', 'transaction_hour',
            'merchant_category', 'location_risk', 'payment_method',
            'transaction_frequency', 'account_balance', 'credit_limit',
            'previous_fraud', 'device_id', 'ip_address_risk',
            'transaction_velocity', 'merchant_risk', 'card_type',
            'international_transaction', 'weekend_transaction', 'night_transaction',
            'high_amount_flag', 'suspicious_pattern'
        ]
        
        # Build the DataFrame
        df = pd.DataFrame(X, columns=feature_names)
        df['is_fraud'] = y
        
        print(f"Dataset shape: {df.shape}")
        print(f"Fraud ratio: {y.mean():.3f}")
        print(f"Normal transactions: {(y==0).sum()}, fraudulent transactions: {(y==1).sum()}")
        
        # Visualize the data distribution
        self.visualize_dataset(df)
        
        return df
        
    def visualize_dataset(self, df):
        """Visualize the dataset."""
        fig, axes = plt.subplots(2, 3, figsize=(18, 10))
        
        # 1. Class distribution
        axes[0, 0].pie([len(df[df['is_fraud']==0]), len(df[df['is_fraud']==1])], 
                      labels=['Normal', 'Fraud'], autopct='%1.1f%%', startangle=90)
        axes[0, 0].set_title('Transaction class distribution')
        
        # 2. Per-class feature distributions (a few selected features)
        important_features = ['transaction_amount', 'account_age', 'location_risk', 'transaction_frequency']
        
        for i, feature in enumerate(important_features):
            # Panels (0,1), (0,2), (1,0), (1,1); (0,0) holds the pie chart
            row = (i + 1) // 3
            col = (i + 1) % 3
            
            # Split the feature by class
            normal_data = df[df['is_fraud']==0][feature]
            fraud_data = df[df['is_fraud']==1][feature]
            
            axes[row, col].hist(normal_data, bins=30, alpha=0.7, label='Normal', density=True)
            axes[row, col].hist(fraud_data, bins=30, alpha=0.7, label='Fraud', density=True)
            axes[row, col].set_title(f'{feature} distribution by class')
            axes[row, col].legend()
            axes[row, col].grid(True, alpha=0.3)
        
        # 3. Correlation heatmap, drawn in the remaining free panel (1,2)
        #    so that it does not overwrite the histogram in panel (0,2)
        correlation_matrix = df.select_dtypes(include=[np.number]).corr()
        im = axes[1, 2].imshow(correlation_matrix.values, cmap='coolwarm', aspect='auto')
        axes[1, 2].set_title('Feature correlation heatmap')
        axes[1, 2].set_xticks(range(len(correlation_matrix.columns)))
        axes[1, 2].set_xticklabels(correlation_matrix.columns, rotation=90)
        axes[1, 2].set_yticks(range(len(correlation_matrix.columns)))
        axes[1, 2].set_yticklabels(correlation_matrix.columns)
        fig.colorbar(im, ax=axes[1, 2])
        
        plt.tight_layout()
        plt.show()
        
    def comprehensive_evaluation(self, df):
        """Full evaluation workflow."""
        print("=== Full evaluation workflow ===")
        
        # Prepare the data
        X = df.drop('is_fraud', axis=1)
        y = df['is_fraud']
        
        # Split into training and test sets (stratified to keep the fraud ratio)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42
        )
        
        # Standardize, fitting the scaler on the training set only
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Candidate models
        models = {
            'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
            'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
            # scikit-learn's gradient boosting, not the separate xgboost library
            'Gradient Boosting': GradientBoostingClassifier(random_state=42),
            'SVM': SVC(random_state=42, probability=True),
            'Naive Bayes': GaussianNB()
        }
        
        # Evaluation metrics (accuracy alone is misleading on imbalanced data)
        scoring_metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
        
        print("1. Baseline model evaluation")
        basic_results = {}
        
        for model_name, model in models.items():
            print(f"\nEvaluating {model_name}...")
            
            # The SVM uses the standardized features; other models use the raw ones
            X_train_use = X_train_scaled if model_name == 'SVM' else X_train
            X_test_use = X_test_scaled if model_name == 'SVM' else X_test
            
            # Cross-validation on the training set, one run per metric
            cv_results = {}
            for metric in scoring_metrics:
                scores = cross_val_score(model, X_train_use, y_train, cv=5, 
                                       scoring=metric, n_jobs=-1)
                cv_results[metric] = scores.mean()
            
            # Fit on the full training set
            model.fit(X_train_use, y_train)
            
            # Predict on the test set
            y_pred = model.predict(X_test_use)
            y_pred_proba = model.predict_proba(X_test_use)[:, 1] if hasattr(model, 'predict_proba') else None
            
            # Detailed test-set metrics
            test_results = {
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred),
                'recall': recall_score(y_test, y_pred),
                'f1': f1_score(y_test, y_pred),
                'roc_auc': roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else 0
            }
            
            basic_results[model_name] = {
                'cv_results': cv_results,
                'test_results': test_results,
                'model': model,
                'predictions': y_pred,
                'probabilities': y_pred_proba
            }
            
            print(f"  Cross-validation F1: {cv_results['f1']:.3f}")
            print(f"  Test F1: {test_results['f1']:.3f}")
            print(f"  Test AUC: {test_results['roc_auc']:.3f}")
        
        # 2. Hyperparameter tuning (pick the most promising model)
        print("\n2. Hyperparameter tuning")
        # Only models with a parameter grid in hyperparameter_tuning() are
        # eligible; selection uses cross-validation F1, never the test set
        tunable_models = ['Logistic Regression', 'Random Forest', 'Gradient Boosting']
        best_model_name = max(
            tunable_models,
            key=lambda name: basic_results[name]['cv_results']['f1']
        )
        
        print(f"Selected {best_model_name} for hyperparameter tuning")
        
        # Tune the selected model
        tuned_model = self.hyperparameter_tuning(best_model_name, X_train, y_train)
        
        # 3. Final evaluation
        print("\n3. Final model evaluation")
        self.final_evaluation(tuned_model, X_test, y_test, scaler, best_model_name)
        
        # 4. Visualize the results
        self.visualize_results(basic_results, X_test, y_test)
        
        return basic_results, tuned_model
        
    def hyperparameter_tuning(self, model_name, X_train, y_train):
        """Grid-search the hyperparameters of the selected model."""
        if model_name == 'Random Forest':
            param_grid = {
                'n_estimators': [100, 200, 300],
                'max_depth': [10, 20, None],
                'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 2, 4]
            }
            model = RandomForestClassifier(random_state=42)
            
        elif model_name == 'Gradient Boosting':
            param_grid = {
                'n_estimators': [100, 200],
                'max_depth': [3, 6, 9],
                'learning_rate': [0.01, 0.1, 0.2],
                'subsample': [0.8, 1.0]
            }
            model = GradientBoostingClassifier(random_state=42)
            
        elif model_name == 'Logistic Regression':
            param_grid = {
                'C': [0.1, 1, 10, 100],
                'penalty': ['l1', 'l2'],
                'solver': ['liblinear', 'saga']
            }
            model = LogisticRegression(random_state=42, max_iter=1000)
            
        else:
            raise ValueError(f"No parameter grid defined for {model_name}")
        
        # Tune on F1, which suits imbalanced data better than accuracy
        grid_search = GridSearchCV(
            model, param_grid, cv=5, scoring='f1', 
            n_jobs=-1, verbose=1
        )
        
        grid_search.fit(X_train, y_train)
        
        print(f"Best parameters: {grid_search.best_params_}")
        print(f"Best F1 score: {grid_search.best_score_:.3f}")
        
        return grid_search.best_estimator_
        
    def final_evaluation(self, model, X_test, y_test, scaler, model_name):
        """Final evaluation on the held-out test set."""
        # Predict (only the SVM expects standardized features)
        X_test_use = scaler.transform(X_test) if model_name == 'SVM' else X_test
        y_pred = model.predict(X_test_use)
        y_pred_proba = model.predict_proba(X_test_use)[:, 1]
        
        # Detailed evaluation report
        print("=== Final evaluation report ===")
        print(f"Model: {model_name}")
        print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
        print(f"Precision: {precision_score(y_test, y_pred):.3f}")
        print(f"Recall: {recall_score(y_test, y_pred):.3f}")
        print(f"F1 score: {f1_score(y_test, y_pred):.3f}")
        print(f"AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")
        
        # Confusion matrix (rows: true class, columns: predicted class)
        cm = confusion_matrix(y_test, y_pred)
        print(f"\nConfusion matrix:")
        print(f"True negatives: {cm[0,0]}, false positives: {cm[0,1]}")
        print(f"False negatives: {cm[1,0]}, true positives: {cm[1,1]}")
        
        # Classification report
        print(f"\nClassification report:")
        print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))
        
        self.best_model = model
        
    def visualize_results(self, results, X_test, y_test):
        """Visualize the evaluation results."""
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        # 1. Model performance comparison
        models = list(results.keys())
        f1_scores = [results[model]['test_results']['f1'] for model in models]
        auc_scores = [results[model]['test_results']['roc_auc'] for model in models]
        
        x = np.arange(len(models))
        width = 0.35
        
        axes[0, 0].bar(x - width/2, f1_scores, width, label='F1 Score', alpha=0.7)
        axes[0, 0].bar(x + width/2, auc_scores, width, label='AUC Score', alpha=0.7)
        axes[0, 0].set_xlabel('Model')
        axes[0, 0].set_ylabel('Score')
        axes[0, 0].set_title('Model performance comparison')
        axes[0, 0].set_xticks(x)
        axes[0, 0].set_xticklabels(models, rotation=45)
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # 2. ROC curves
        for model_name, result in results.items():
            if result['probabilities'] is not None:
                fpr, tpr, _ = roc_curve(y_test, result['probabilities'])
                auc = result['test_results']['roc_auc']
                axes[0, 1].plot(fpr, tpr, label=f'{model_name} (AUC={auc:.3f})')
        
        axes[0, 1].plot([0, 1], [0, 1], 'k--', alpha=0.5)
        axes[0, 1].set_xlabel('False positive rate')
        axes[0, 1].set_ylabel('True positive rate')
        axes[0, 1].set_title('ROC curves')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        
        # 3. Precision-recall curves
        for model_name, result in results.items():
            if result['probabilities'] is not None:
                precision, recall, _ = precision_recall_curve(y_test, result['probabilities'])
                axes[0, 2].plot(recall, precision, label=model_name)
        
        axes[0, 2].set_xlabel('Recall')
        axes[0, 2].set_ylabel('Precision')
        axes[0, 2].set_title('Precision-recall curves')
        axes[0, 2].legend()
        axes[0, 2].grid(True, alpha=0.3)
        
        # 4. Confusion matrix of the model with the highest test F1
        best_model_name = max(results.keys(), key=lambda x: results[x]['test_results']['f1'])
        best_predictions = results[best_model_name]['predictions']
        cm = confusion_matrix(y_test, best_predictions)
        
        im = axes[1, 0].imshow(cm, interpolation='nearest', cmap='Blues')
        axes[1, 0].set_title(f'{best_model_name} confusion matrix')
        tick_marks = np.arange(2)
        axes[1, 0].set_xticks(tick_marks)
        axes[1, 0].set_xticklabels(['Normal', 'Fraud'])
        axes[1, 0].set_yticks(tick_marks)
        axes[1, 0].set_yticklabels(['Normal', 'Fraud'])
        
        # Annotate each cell with its count (nested loops avoid an itertools import)
        cm_thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                axes[1, 0].text(j, i, format(cm[i, j], 'd'),
                               horizontalalignment="center",
                               color="white" if cm[i, j] > cm_thresh else "black")
        
        # 5. Feature importances (if the model exposes them)
        if hasattr(results[best_model_name]['model'], 'feature_importances_'):
            importances = results[best_model_name]['model'].feature_importances_
            # Use the real column names when X_test is a DataFrame
            if hasattr(X_test, 'columns'):
                feature_names = list(X_test.columns)
            else:
                feature_names = [f'feature_{i}' for i in range(len(importances))]
            
            # Keep the ten most important features (fewer if the model has fewer)
            n_top = min(10, len(importances))
            indices = np.argsort(importances)[::-1][:n_top]
            
            axes[1, 1].bar(range(n_top), importances[indices])
            axes[1, 1].set_title(f'{best_model_name} feature importances')
            axes[1, 1].set_xlabel('Feature')
            axes[1, 1].set_ylabel('Importance')
            axes[1, 1].set_xticks(range(n_top))
            axes[1, 1].set_xticklabels([feature_names[i] for i in indices], rotation=45)
        
        # 6. Threshold sweep
        # Note: this sweep runs on the test set and is for illustration only.
        # In practice, choose the threshold on validation data so that the
        # test-set estimate stays unbiased.
        if results[best_model_name]['probabilities'] is not None:
            thresholds = np.linspace(0, 1, 100)
            f1_scores_thresh = []
            
            for thresh in thresholds:
                y_pred_thresh = (results[best_model_name]['probabilities'] > thresh).astype(int)
                f1 = f1_score(y_test, y_pred_thresh)
                f1_scores_thresh.append(f1)
            
            best_thresh_idx = np.argmax(f1_scores_thresh)
            best_thresh = thresholds[best_thresh_idx]
            
            axes[1, 2].plot(thresholds, f1_scores_thresh)
            axes[1, 2].axvline(x=best_thresh, color='r', linestyle='--', 
                              label=f'Best threshold: {best_thresh:.3f}')
            axes[1, 2].set_xlabel('Threshold')
            axes[1, 2].set_ylabel('F1 score')
            axes[1, 2].set_title('Threshold sweep')
            axes[1, 2].legend()
            axes[1, 2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Credit card fraud detection demo
print("=== Credit card fraud detection case study ===")
fraud_detector = CreditCardFraudDetection()
fraud_data = fraud_detector.create_fraud_dataset()
results, best_model = fraud_detector.comprehensive_evaluation(fraud_data)
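
The case study above trains its models without class weighting and sweeps the decision threshold directly on the test set for its last plot. The sketch below shows one possible alternative, under stated assumptions: it uses `class_weight="balanced"` on a logistic regression and picks the threshold from out-of-fold predictions on the training set, so the test set is used only once. The dataset sizes and variable names here are illustrative and independent of the `CreditCardFraudDetection` class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import cross_val_predict, train_test_split

# Illustrative imbalanced dataset (about 5% positives)
X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" reweights errors inversely to class frequency
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Out-of-fold probabilities: every training sample is scored by a model
# that did not see it during fitting
oof_proba = cross_val_predict(model, X_train, y_train, cv=5,
                              method="predict_proba")[:, 1]

# precision and recall have one more entry than thresholds; drop the last point
precision, recall, thresholds = precision_recall_curve(y_train, oof_proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Refit on the full training set and evaluate once on the test set
model.fit(X_train, y_train)
test_proba = model.predict_proba(X_test)[:, 1]
y_pred = (test_proba >= best_threshold).astype(int)
test_f1 = f1_score(y_test, y_pred)

print(f"Threshold chosen on training folds: {best_threshold:.3f}")
print(f"Test F1 at that threshold: {test_f1:.3f}")
```

Because the threshold never sees the test labels, the reported test F1 remains an honest estimate of performance on new data.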

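6.9 Chapter Summary

6.9.1 Review of Core Content

This chapter gave a systematic introduction to model evaluation and selection in machine learning:

  1. Overview of model evaluation

    • Why evaluation is needed and its basic principles
    • Strategies for splitting data into training, validation, and test sets
  2. Cross-validation techniques

    • K-fold and stratified cross-validation
    • Leave-one-out and time series cross-validation
    • Strengths, weaknesses, and suitable use cases of cross-validation
  3. Evaluation metrics for classification

    • Confusion matrix, accuracy, precision, recall
    • F1 score, ROC curve, AUC
    • Multiclass metrics and visualization methods
  4. Evaluation metrics for regression

    • Basic metrics such as MAE, MSE, RMSE, and R²
    • Residual analysis and diagnostics
    • Cross-validation for regression
  5. Hyperparameter tuning

    • Grid search, random search, Bayesian optimization
    • Hyperparameter importance analysis
    • Parameter interaction effects
  6. Learning curves and validation curves

    • Plotting and interpreting learning curves
    • Applications of validation curves
    • Bias-variance analysis
  7. Model selection strategies

    • A framework for comprehensive model comparison
    • The model selection decision flow
    • Selection strategies in practice
  8. Case study

    • The complete credit card fraud detection workflow
    • Strategies for handling imbalanced data
    • Multi-model comparison and final selection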

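6.9.2 Best Practices

  1. Evaluation strategy

    • Always evaluate models with cross-validation
    • Choose evaluation metrics that fit the problem
    • Keep an independent test set for the final evaluation
  2. Hyperparameter tuning

    • Search coarsely first, then refine
    • Let an appropriate evaluation metric guide the tuning
    • Account for compute and time costs
  3. Model selection

    • Start with simple models
    • Weigh performance, complexity, and interpretability together
    • Make the final decision in light of business requirements
  4. Practical application

    • Pay attention to data quality and feature engineering
    • Address class imbalance
    • Build a complete evaluation workflow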

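6.9.3 Common Pitfalls

  1. Data leakage

    • Do not use future information during feature engineering
    • Keep the training and test sets strictly separate
  2. Over-tuning

    • Do not tune repeatedly against the test set
    • Use a validation set for model selection
  3. Evaluation bias

    • Check that the sample distribution is representative
    • Consider how time affects model performance
  4. Metric choice

    • Choose metrics according to the business goal
    • Do not rely on a single metric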
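6.9.4 Directions for Further Study

  1. Advanced evaluation techniques

    • Nested cross-validation
    • Evaluation methods specific to time series
    • Evaluation strategies for online learning
  2. Automated machine learning

    • Using AutoML frameworks
    • Neural architecture search
    • Automated feature engineering
  3. Model interpretability

    • Explanation methods such as SHAP and LIME
    • Model visualization techniques
    • Fairness and bias detection
  4. Production deployment

    • Model monitoring and maintenance
    • A/B testing frameworks
    • Model version management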
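
As a pointer for the first item above, nested cross-validation can be sketched with tools already used in this chapter: an inner `GridSearchCV` selects hyperparameters, and an outer `cross_val_score` estimates how well the whole tuning procedure generalizes. The dataset, grid, and fold counts below are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: hyperparameter selection
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, None]},
    cv=inner_cv,
)

# Outer loop: each outer fold reruns the full grid search on its training
# part, then scores the selected model on data the search never saw
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The score reported by `GridSearchCV.best_score_` alone tends to be optimistic because the same folds both choose and score the parameters; the outer loop removes that bias.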

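6.10 Exercises

Basic exercises

  1. Cross-validation practice

    • Compare 3-fold, 5-fold, and 10-fold cross-validation on the Iris dataset
    • Analyze how the number of folds affects the stability of the estimates
  2. Computing evaluation metrics

    • Compute each metric by hand from a given confusion matrix
    • Explain which metric should take priority in different business scenarios
  3. Hyperparameter tuning

    • Tune a decision tree with grid search
    • Compare the efficiency of grid search and random search

Intermediate exercises

  1. Learning curve analysis

    • Plot learning curves for different models on the same dataset
    • Identify which models overfit or underfit
  2. Handling imbalanced data

    • Create an imbalanced dataset and compare how different metrics behave
    • Try different sampling strategies and evaluation methods
  3. Model selection project

    • Pick a real dataset and carry out a complete model selection workflow
    • Include data preprocessing, model comparison, hyperparameter tuning, and final evaluation

Challenge exercises

  1. Custom evaluation metric

    • Design a custom metric tied to a business goal
    • Use that metric for model selection
  2. Time series cross-validation

    • Implement a cross-validation method specific to time series
    • Compare it with standard cross-validation
  3. Integrated evaluation framework

    • Build an automated framework for model evaluation and selection
    • Support batch comparison across multiple models and metrics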

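End of Chapter 6

In this chapter you learned the core techniques for evaluating and selecting machine learning models. These skills will help you make well-founded model choices in real projects and build high-quality machine learning systems. The next chapter covers ensemble learning and explores how combining multiple models can further improve predictive performance.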