本章概述

在前面的章节中,我们系统学习了Scikit-learn的各个组件和技术。本章将通过四个完整的实战项目,将所学知识综合运用到实际问题中。每个项目都包含完整的数据科学流程:问题定义、数据探索、特征工程、模型构建、评估优化和结果解释。

本章学习目标

  • 掌握完整的机器学习项目流程
  • 学会处理不同类型的实际问题
  • 理解项目中的关键决策点
  • 培养端到端的项目实施能力

项目列表

  1. 房价预测项目 - 回归问题实战
  2. 客户分类项目 - 分类问题实战
  3. 推荐系统项目 - 协同过滤实战
  4. 时间序列预测项目 - 时序分析实战

10.1 项目一:房价预测系统

10.1.1 项目背景与目标

房价预测是一个经典的回归问题,对于房地产行业、金融机构和个人购房者都具有重要意义。

项目目标:

  • 构建准确的房价预测模型
  • 识别影响房价的关键因素
  • 提供可解释的预测结果

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')

# 设置中文字体和图表样式(SimHei 为 Windows 常见中文字体;
# macOS/Linux 下可改用 'PingFang SC'、'WenQuanYi Zen Hei' 等已安装字体)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
sns.set_style("whitegrid")

class HousePricePrediction:
    def __init__(self):
        self.models = {}
        self.scaler = StandardScaler()
        self.feature_names = []
        self.best_model = None
        
    def create_house_dataset(self):
        """创建房价数据集"""
        print("=== 创建房价数据集 ===")
        
        np.random.seed(42)
        n_samples = 3000
        
        # 基础特征
        data = {
            # 房屋基本信息
            'area': np.random.normal(120, 40, n_samples),  # 面积
            'bedrooms': np.random.poisson(3, n_samples),   # 卧室数
            'bathrooms': np.random.poisson(2, n_samples),  # 浴室数
            'floors': np.random.choice([1, 2, 3], n_samples, p=[0.3, 0.6, 0.1]),  # 楼层数
            'age': np.random.exponential(10, n_samples),   # 房龄
            
            # 位置特征
            'distance_to_center': np.random.exponential(15, n_samples),  # 距市中心距离
            'distance_to_subway': np.random.exponential(2, n_samples),   # 距地铁距离
            'distance_to_school': np.random.exponential(3, n_samples),   # 距学校距离
            
            # 类别特征
            'district': np.random.choice(['市中心', '新区', '郊区'], n_samples, p=[0.2, 0.5, 0.3]),
            'property_type': np.random.choice(['公寓', '别墅', '联排'], n_samples, p=[0.7, 0.2, 0.1]),
            'decoration': np.random.choice(['毛坯', '简装', '精装'], n_samples, p=[0.3, 0.4, 0.3]),
            
            # 环境特征
            'green_ratio': np.random.beta(3, 2, n_samples),  # 绿化率
            'parking_ratio': np.random.beta(2, 3, n_samples),  # 停车位比例
        }
        
        # 创建DataFrame
        df = pd.DataFrame(data)
        
        # 数据清理
        df['area'] = np.clip(df['area'], 50, 300)
        df['bedrooms'] = np.clip(df['bedrooms'], 1, 6)
        df['bathrooms'] = np.clip(df['bathrooms'], 1, 4)
        df['age'] = np.clip(df['age'], 0, 50)
        df['distance_to_center'] = np.clip(df['distance_to_center'], 1, 50)
        df['distance_to_subway'] = np.clip(df['distance_to_subway'], 0.1, 10)
        df['distance_to_school'] = np.clip(df['distance_to_school'], 0.1, 15)
        
        # 生成房价(目标变量)
        # 基础价格
        base_price = 50000  # 每平米基础价格
        
        # 面积影响(非线性)
        area_effect = df['area'] * (1 + 0.001 * df['area'])
        
        # 位置影响
        district_effect = df['district'].map({'市中心': 1.5, '新区': 1.0, '郊区': 0.7})
        distance_effect = 1 / (1 + 0.05 * df['distance_to_center'])
        subway_effect = 1 / (1 + 0.1 * df['distance_to_subway'])
        
        # 房屋特征影响
        room_effect = 1 + 0.1 * df['bedrooms'] + 0.05 * df['bathrooms']
        floor_effect = df['floors'].map({1: 0.9, 2: 1.0, 3: 1.1})
        age_effect = np.exp(-0.02 * df['age'])  # 房龄影响
        
        # 装修影响
        decoration_effect = df['decoration'].map({'毛坯': 0.8, '简装': 1.0, '精装': 1.3})
        
        # 物业类型影响
        property_effect = df['property_type'].map({'公寓': 1.0, '别墅': 1.8, '联排': 1.4})
        
        # 环境影响
        environment_effect = 1 + 0.2 * df['green_ratio'] + 0.1 * df['parking_ratio']
        
        # 计算总价
        total_price = (base_price * area_effect * district_effect * distance_effect * 
                      subway_effect * room_effect * floor_effect * age_effect * 
                      decoration_effect * property_effect * environment_effect)
        
        # 添加噪声
        noise = np.random.normal(1, 0.1, n_samples)
        df['price'] = total_price * noise
        
        # 确保价格合理
        df['price'] = np.clip(df['price'], 100000, 50000000)
        
        print(f"数据集形状: {df.shape}")
        print(f"房价统计:")
        print(df['price'].describe())
        
        # 可视化数据集
        self.visualize_house_dataset(df)
        
        return df
        
    def visualize_house_dataset(self, df):
        """可视化房价数据集"""
        fig, axes = plt.subplots(3, 4, figsize=(20, 15))
        
        # 1. 房价分布
        axes[0, 0].hist(df['price']/10000, bins=50, alpha=0.7, color='skyblue')
        axes[0, 0].set_title('房价分布(万元)')
        axes[0, 0].set_xlabel('房价(万元)')
        axes[0, 0].set_ylabel('频率')
        axes[0, 0].grid(True, alpha=0.3)
        
        # 2. 面积vs房价
        axes[0, 1].scatter(df['area'], df['price']/10000, alpha=0.5)
        axes[0, 1].set_title('面积vs房价')
        axes[0, 1].set_xlabel('面积(平米)')
        axes[0, 1].set_ylabel('房价(万元)')
        axes[0, 1].grid(True, alpha=0.3)
        
        # 3. 区域vs房价
        district_price = df.groupby('district')['price'].mean() / 10000
        axes[0, 2].bar(district_price.index, district_price.values, alpha=0.7)
        axes[0, 2].set_title('区域vs平均房价')
        axes[0, 2].set_xlabel('区域')
        axes[0, 2].set_ylabel('平均房价(万元)')
        axes[0, 2].grid(True, alpha=0.3)
        
        # 4. 房龄vs房价
        axes[0, 3].scatter(df['age'], df['price']/10000, alpha=0.5)
        axes[0, 3].set_title('房龄vs房价')
        axes[0, 3].set_xlabel('房龄(年)')
        axes[0, 3].set_ylabel('房价(万元)')
        axes[0, 3].grid(True, alpha=0.3)
        
        # 5. 卧室数vs房价
        bedroom_price = df.groupby('bedrooms')['price'].mean() / 10000
        axes[1, 0].bar(bedroom_price.index, bedroom_price.values, alpha=0.7)
        axes[1, 0].set_title('卧室数vs平均房价')
        axes[1, 0].set_xlabel('卧室数')
        axes[1, 0].set_ylabel('平均房价(万元)')
        axes[1, 0].grid(True, alpha=0.3)
        
        # 6. 距市中心距离vs房价
        axes[1, 1].scatter(df['distance_to_center'], df['price']/10000, alpha=0.5)
        axes[1, 1].set_title('距市中心距离vs房价')
        axes[1, 1].set_xlabel('距离(公里)')
        axes[1, 1].set_ylabel('房价(万元)')
        axes[1, 1].grid(True, alpha=0.3)
        
        # 7. 装修情况vs房价
        decoration_price = df.groupby('decoration')['price'].mean() / 10000
        axes[1, 2].bar(decoration_price.index, decoration_price.values, alpha=0.7)
        axes[1, 2].set_title('装修情况vs平均房价')
        axes[1, 2].set_xlabel('装修情况')
        axes[1, 2].set_ylabel('平均房价(万元)')
        axes[1, 2].grid(True, alpha=0.3)
        
        # 8. 物业类型vs房价
        property_price = df.groupby('property_type')['price'].mean() / 10000
        axes[1, 3].bar(property_price.index, property_price.values, alpha=0.7)
        axes[1, 3].set_title('物业类型vs平均房价')
        axes[1, 3].set_xlabel('物业类型')
        axes[1, 3].set_ylabel('平均房价(万元)')
        axes[1, 3].grid(True, alpha=0.3)
        
        # 9. 相关性热图
        numerical_features = ['area', 'bedrooms', 'bathrooms', 'floors', 'age',
                            'distance_to_center', 'distance_to_subway', 'distance_to_school',
                            'green_ratio', 'parking_ratio', 'price']
        corr_matrix = df[numerical_features].corr()
        
        im = axes[2, 0].imshow(corr_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
        axes[2, 0].set_xticks(range(len(corr_matrix.columns)))
        axes[2, 0].set_yticks(range(len(corr_matrix.columns)))
        axes[2, 0].set_xticklabels([col.replace('_', '\n') for col in corr_matrix.columns], 
                                  rotation=45, fontsize=8)
        axes[2, 0].set_yticklabels([col.replace('_', '\n') for col in corr_matrix.columns], 
                                  fontsize=8)
        axes[2, 0].set_title('特征相关性热图')
        
        # 10. 绿化率vs房价
        axes[2, 1].scatter(df['green_ratio'], df['price']/10000, alpha=0.5)
        axes[2, 1].set_title('绿化率vs房价')
        axes[2, 1].set_xlabel('绿化率')
        axes[2, 1].set_ylabel('房价(万元)')
        axes[2, 1].grid(True, alpha=0.3)
        
        # 11. 停车位比例vs房价
        axes[2, 2].scatter(df['parking_ratio'], df['price']/10000, alpha=0.5)
        axes[2, 2].set_title('停车位比例vs房价')
        axes[2, 2].set_xlabel('停车位比例')
        axes[2, 2].set_ylabel('房价(万元)')
        axes[2, 2].grid(True, alpha=0.3)
        
        # 12. 楼层数vs房价
        floor_price = df.groupby('floors')['price'].mean() / 10000
        axes[2, 3].bar(floor_price.index, floor_price.values, alpha=0.7)
        axes[2, 3].set_title('楼层数vs平均房价')
        axes[2, 3].set_xlabel('楼层数')
        axes[2, 3].set_ylabel('平均房价(万元)')
        axes[2, 3].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def feature_engineering(self, df):
        """特征工程"""
        print("\n=== 特征工程 ===")
        
        df_processed = df.copy()
        
        # 1. 创建新特征
        # 房屋总房间数
        df_processed['total_rooms'] = df_processed['bedrooms'] + df_processed['bathrooms']
        
        # 每房间平均面积
        df_processed['area_per_room'] = df_processed['area'] / df_processed['total_rooms']
        
        # 便利性评分(距离的倒数)
        df_processed['convenience_score'] = (1 / (1 + df_processed['distance_to_subway']) + 
                                           1 / (1 + df_processed['distance_to_school']) + 
                                           1 / (1 + df_processed['distance_to_center']))
        
        # 环境评分
        df_processed['environment_score'] = (df_processed['green_ratio'] + 
                                           df_processed['parking_ratio']) / 2
        
        # 房屋新旧程度(年龄分组)
        df_processed['age_group'] = pd.cut(df_processed['age'], 
                                         bins=[0, 5, 15, 30, 50], 
                                         labels=['新房', '次新', '中等', '老房'])
        
        # 面积分组
        df_processed['area_group'] = pd.cut(df_processed['area'], 
                                          bins=[0, 80, 120, 180, 300], 
                                          labels=['小户型', '中户型', '大户型', '豪宅'])
        
        # 2. 类别特征编码
        categorical_features = ['district', 'property_type', 'decoration', 'age_group', 'area_group']
        df_encoded = pd.get_dummies(df_processed, columns=categorical_features, prefix=categorical_features)
        
        # 3. 数值特征的标准化推迟到模型训练阶段(train_models)统一进行,
        #    以便先划分训练/测试集,避免标准化参数泄露测试集信息
        
        # 保存特征名称
        self.feature_names = [col for col in df_encoded.columns if col != 'price']
        
        print(f"原始特征数: {len(df.columns) - 1}")
        print(f"工程后特征数: {len(self.feature_names)}")
        print(f"新增特征: {len(self.feature_names) - len(df.columns) + 1}")
        
        return df_encoded
        
    def train_models(self, df):
        """训练多个模型"""
        print("\n=== 模型训练与比较 ===")
        
        # 准备数据
        X = df[self.feature_names]
        y = df['price']
        
        # 划分数据集
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 特征标准化
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        
        # 定义模型
        models = {
            '线性回归': LinearRegression(),
            'Ridge回归': Ridge(alpha=1.0),
            'Lasso回归': Lasso(alpha=1.0),
            '弹性网络': ElasticNet(alpha=1.0, l1_ratio=0.5),
            '随机森林': RandomForestRegressor(n_estimators=100, random_state=42),
            '梯度提升': GradientBoostingRegressor(n_estimators=100, random_state=42),
            'SVR': SVR(kernel='rbf', C=1.0)
        }
        
        # 训练和评估模型
        results = {}
        
        for name, model in models.items():
            print(f"\n训练 {name}...")
            
            # 对于需要标准化的模型使用标准化数据
            if name in ['线性回归', 'Ridge回归', 'Lasso回归', '弹性网络', 'SVR']:
                model.fit(X_train_scaled, y_train)
                y_pred = model.predict(X_test_scaled)
                
                # 交叉验证
                cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, 
                                          scoring='neg_mean_squared_error')
            else:
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                
                # 交叉验证
                cv_scores = cross_val_score(model, X_train, y_train, cv=5, 
                                          scoring='neg_mean_squared_error')
            
            # 计算评估指标
            mse = mean_squared_error(y_test, y_pred)
            rmse = np.sqrt(mse)
            mae = mean_absolute_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)
            cv_rmse = np.sqrt(-cv_scores.mean())
            
            results[name] = {
                'model': model,
                'mse': mse,
                'rmse': rmse,
                'mae': mae,
                'r2': r2,
                'cv_rmse': cv_rmse,
                'y_pred': y_pred
            }
            
            print(f"RMSE: {rmse:.2f}")
            print(f"MAE: {mae:.2f}")
            print(f"R²: {r2:.4f}")
            print(f"CV RMSE: {cv_rmse:.2f}")
        
        # 保存结果
        self.models = results
        
        # 可视化模型比较
        self.visualize_model_comparison(results, y_test)
        
        return results, X_test, y_test
        
    def hyperparameter_tuning(self, X_train, y_train):
        """超参数调优"""
        print("\n=== 超参数调优 ===")
        
        # 对表现最好的几个模型进行调优
        tuning_configs = {
            'RandomForest': {
                'model': RandomForestRegressor(random_state=42),
                'params': {
                    'n_estimators': [50, 100, 200],
                    'max_depth': [10, 20, None],
                    'min_samples_split': [2, 5, 10],
                    'min_samples_leaf': [1, 2, 4]
                }
            },
            'GradientBoosting': {
                'model': GradientBoostingRegressor(random_state=42),
                'params': {
                    'n_estimators': [50, 100, 200],
                    'learning_rate': [0.05, 0.1, 0.2],
                    'max_depth': [3, 5, 7],
                    'subsample': [0.8, 0.9, 1.0]
                }
            }
        }
        
        best_models = {}
        
        for name, config in tuning_configs.items():
            print(f"\n调优 {name}...")
            
            grid_search = GridSearchCV(
                config['model'], 
                config['params'],
                cv=5,
                scoring='neg_mean_squared_error',
                n_jobs=-1,
                verbose=1
            )
            
            grid_search.fit(X_train, y_train)
            
            best_models[name] = {
                'model': grid_search.best_estimator_,
                'best_params': grid_search.best_params_,
                'best_score': -grid_search.best_score_
            }
            
            print(f"最佳参数: {grid_search.best_params_}")
            print(f"最佳CV RMSE: {np.sqrt(-grid_search.best_score_):.2f}")
        
        return best_models
        
    def model_interpretation(self, df):
        """模型解释"""
        print("\n=== 模型解释 ===")
        
        # 使用最佳模型进行解释
        best_model_name = min(self.models.keys(), 
                             key=lambda x: self.models[x]['rmse'])
        best_model = self.models[best_model_name]['model']
        
        print(f"最佳模型: {best_model_name}")
        
        # 准备数据
        X = df[self.feature_names]
        y = df['price']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # 特征重要性分析
        importance_df = None
        
        if hasattr(best_model, 'feature_importances_'):
            # 基于树的模型:内置特征重要性
            feature_importance = best_model.feature_importances_
        elif hasattr(best_model, 'coef_'):
            # 线性模型:用系数绝对值衡量重要性
            feature_importance = np.abs(best_model.coef_)
        else:
            # 如 RBF 核的 SVR,既无 feature_importances_ 也无 coef_
            feature_importance = None
        
        if feature_importance is not None:
            importance_df = pd.DataFrame({
                'feature': self.feature_names,
                'importance': feature_importance
            }).sort_values('importance', ascending=False)
            
            print("\nTop 15 重要特征:")
            print(importance_df.head(15))
        
        # 排列重要性(只对在原始特征上训练的树模型计算;
        # 线性模型与SVR在标准化数据上训练,直接用原始 X_test 评估会失真)
        if best_model_name not in ['线性回归', 'Ridge回归', 'Lasso回归', '弹性网络', 'SVR']:
            perm_importance = permutation_importance(best_model, X_test, y_test, 
                                                     n_repeats=10, random_state=42)
            
            perm_importance_df = pd.DataFrame({
                'feature': self.feature_names,
                'importance': perm_importance.importances_mean
            }).sort_values('importance', ascending=False)
            
            print("\n排列重要性 Top 15:")
            print(perm_importance_df.head(15))
            
            # 若内置重要性不可用,退而使用排列重要性
            if importance_df is None:
                importance_df = perm_importance_df
        
        # 可视化特征重要性
        if importance_df is not None:
            self.visualize_feature_importance(importance_df)
        
        # 预测示例
        self.prediction_examples(best_model, X_test, y_test)
        
        return best_model, importance_df
        
    def visualize_model_comparison(self, results, y_test):
        """可视化模型比较"""
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        model_names = list(results.keys())
        
        # 1. RMSE比较
        rmse_values = [results[name]['rmse'] for name in model_names]
        axes[0, 0].bar(range(len(model_names)), rmse_values, alpha=0.7)
        axes[0, 0].set_title('模型RMSE比较')
        axes[0, 0].set_xlabel('模型')
        axes[0, 0].set_ylabel('RMSE')
        axes[0, 0].set_xticks(range(len(model_names)))
        axes[0, 0].set_xticklabels(model_names, rotation=45)
        axes[0, 0].grid(True, alpha=0.3)
        
        # 2. R²比较
        r2_values = [results[name]['r2'] for name in model_names]
        axes[0, 1].bar(range(len(model_names)), r2_values, alpha=0.7)
        axes[0, 1].set_title('模型R²比较')
        axes[0, 1].set_xlabel('模型')
        axes[0, 1].set_ylabel('R²')
        axes[0, 1].set_xticks(range(len(model_names)))
        axes[0, 1].set_xticklabels(model_names, rotation=45)
        axes[0, 1].grid(True, alpha=0.3)
        
        # 3. MAE比较
        mae_values = [results[name]['mae'] for name in model_names]
        axes[0, 2].bar(range(len(model_names)), mae_values, alpha=0.7)
        axes[0, 2].set_title('模型MAE比较')
        axes[0, 2].set_xlabel('模型')
        axes[0, 2].set_ylabel('MAE')
        axes[0, 2].set_xticks(range(len(model_names)))
        axes[0, 2].set_xticklabels(model_names, rotation=45)
        axes[0, 2].grid(True, alpha=0.3)
        
        # 4. 预测vs实际(最佳模型)
        best_model_name = min(model_names, key=lambda x: results[x]['rmse'])
        best_predictions = results[best_model_name]['y_pred']
        
        axes[1, 0].scatter(y_test, best_predictions, alpha=0.5)
        axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
        axes[1, 0].set_title(f'{best_model_name} - 预测vs实际')
        axes[1, 0].set_xlabel('实际价格')
        axes[1, 0].set_ylabel('预测价格')
        axes[1, 0].grid(True, alpha=0.3)
        
        # 5. 残差分布(最佳模型)
        residuals = y_test - best_predictions
        axes[1, 1].hist(residuals, bins=30, alpha=0.7)
        axes[1, 1].set_title(f'{best_model_name} - 残差分布')
        axes[1, 1].set_xlabel('残差')
        axes[1, 1].set_ylabel('频率')
        axes[1, 1].grid(True, alpha=0.3)
        
        # 6. 残差vs预测值(最佳模型)
        axes[1, 2].scatter(best_predictions, residuals, alpha=0.5)
        axes[1, 2].axhline(y=0, color='r', linestyle='--')
        axes[1, 2].set_title(f'{best_model_name} - 残差vs预测值')
        axes[1, 2].set_xlabel('预测价格')
        axes[1, 2].set_ylabel('残差')
        axes[1, 2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
    def visualize_feature_importance(self, importance_df):
        """可视化特征重要性"""
        plt.figure(figsize=(12, 8))
        
        # 选择Top 15特征
        top_features = importance_df.head(15)
        
        plt.barh(range(len(top_features)), top_features['importance'], alpha=0.7)
        plt.yticks(range(len(top_features)), top_features['feature'])
        plt.xlabel('重要性')
        plt.title('Top 15 特征重要性')
        plt.gca().invert_yaxis()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
        
    def prediction_examples(self, model, X_test, y_test):
        """预测示例"""
        print("\n=== 预测示例 ===")
        
        # 随机选择几个样本进行预测
        sample_indices = np.random.choice(len(X_test), 5, replace=False)
        
        for i, idx in enumerate(sample_indices):
            sample = X_test.iloc[idx:idx+1]
            actual_price = y_test.iloc[idx]
            
            # 树模型在原始特征上训练,可直接预测;
            # 线性模型/SVR在标准化特征上训练,需先做相同的标准化
            if hasattr(model, 'feature_importances_'):
                predicted_price = model.predict(sample)[0]
            else:
                sample_scaled = self.scaler.transform(sample)
                predicted_price = model.predict(sample_scaled)[0]
            
            error = abs(actual_price - predicted_price)
            error_rate = error / actual_price * 100
            
            print(f"\n样本 {i+1}:")
            print(f"实际价格: {actual_price:,.0f} 元")
            print(f"预测价格: {predicted_price:,.0f} 元")
            print(f"误差: {error:,.0f} 元 ({error_rate:.1f}%)")

# 演示房价预测项目
print("=== 房价预测项目实战 ===")
house_predictor = HousePricePrediction()

# 1. 创建数据集
house_data = house_predictor.create_house_dataset()

# 2. 特征工程
processed_data = house_predictor.feature_engineering(house_data)

# 3. 模型训练与比较
model_results, X_test, y_test = house_predictor.train_models(processed_data)
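
为了更直观地对比各模型,可以把 train_models 返回的指标整理成一张汇总表(以下为示意代码,summary 为此处新增的变量名):

# 汇总各模型的测试集指标,按RMSE升序排列
summary = pd.DataFrame({
    name: {'RMSE': res['rmse'], 'MAE': res['mae'],
           'R2': res['r2'], 'CV_RMSE': res['cv_rmse']}
    for name, res in model_results.items()
}).T
print(summary.sort_values('RMSE'))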

# 4. 模型解释
best_model, feature_importance = house_predictor.model_interpretation(processed_data)
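
前面定义的 hyperparameter_tuning 方法在主流程中并未被调用。如果希望进一步压低误差,可以按下面的方式对树模型做网格搜索(示意代码:沿用 train_models 中相同的 test_size 与 random_state 重新划分,以保证训练集一致;完整网格搜索在本数据集上可能需要运行数分钟):

# (可选)5. 超参数调优
X_all = processed_data[house_predictor.feature_names]
y_all = processed_data['price']
X_tr, _, y_tr, _ = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
tuned_models = house_predictor.hyperparameter_tuning(X_tr, y_tr)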

10.2 项目二:客户分类系统

10.2.1 项目背景与目标

客户分类是企业进行精准营销和客户管理的重要手段。通过分析客户的行为特征,可以将客户分为不同类别,制定针对性的营销策略。

项目目标:

  • 构建客户分类模型
  • 识别不同类型客户的特征
  • 为营销策略提供数据支持

10.2.2 数据集创建与探索

class CustomerSegmentation:
    def __init__(self):
        self.models = {}
        self.scaler = StandardScaler()
        self.label_encoders = {}
        self.feature_names = []
        
    def create_customer_dataset(self):
        """创建客户数据集"""
        print("=== 创建客户数据集 ===")
        
        np.random.seed(42)
        n_samples = 5000
        
        # 定义客户类型(隐藏标签,用于生成数据)
        customer_types = np.random.choice(['高价值', '中价值', '低价值', '流失风险'], 
                                        n_samples, p=[0.15, 0.35, 0.35, 0.15])
        
        # 基础特征
        data = {
            # 人口统计特征
            'age': np.random.normal(40, 15, n_samples),
            'gender': np.random.choice(['男', '女'], n_samples),
            'education': np.random.choice(['高中', '本科', '硕士', '博士'], n_samples, 
                                       p=[0.3, 0.5, 0.15, 0.05]),
            'income': np.random.lognormal(10, 0.8, n_samples),
            'city_tier': np.random.choice(['一线', '二线', '三线'], n_samples, p=[0.3, 0.4, 0.3]),
            
            # 行为特征
            'tenure_months': np.random.exponential(24, n_samples),  # 客户生命周期
            'total_purchases': np.random.poisson(10, n_samples),    # 总购买次数
            'avg_order_value': np.random.lognormal(6, 0.5, n_samples),  # 平均订单价值
            'last_purchase_days': np.random.exponential(30, n_samples),  # 距上次购买天数
            
            # 偏好特征
            'preferred_category': np.random.choice(['电子产品', '服装', '家居', '美妆', '运动'], 
                                                 n_samples, p=[0.25, 0.25, 0.2, 0.15, 0.15]),
            'channel_preference': np.random.choice(['线上', '线下', '混合'], n_samples, 
                                                 p=[0.5, 0.3, 0.2]),
            
            # 互动特征
            'website_visits': np.random.poisson(15, n_samples),     # 网站访问次数
            'email_opens': np.random.poisson(8, n_samples),        # 邮件打开次数
            'customer_service_calls': np.random.poisson(2, n_samples),  # 客服电话次数
            'social_media_engagement': np.random.beta(2, 5, n_samples),  # 社交媒体参与度
        }
        
        # 创建DataFrame
        df = pd.DataFrame(data)
        
        # 数据清理
        df['age'] = np.clip(df['age'], 18, 80)
        df['income'] = np.clip(df['income'], 20000, 500000)
        df['tenure_months'] = np.clip(df['tenure_months'], 1, 120)
        df['total_purchases'] = np.clip(df['total_purchases'], 0, 100)
        df['avg_order_value'] = np.clip(df['avg_order_value'], 50, 5000)
        df['last_purchase_days'] = np.clip(df['last_purchase_days'], 0, 365)
        df['website_visits'] = np.clip(df['website_visits'], 0, 100)
        df['email_opens'] = np.clip(df['email_opens'], 0, 50)
        df['customer_service_calls'] = np.clip(df['customer_service_calls'], 0, 20)
        
        # 根据客户类型调整特征(向量化实现,模拟真实关系)
        # 先将计数类整数列转换为浮点型,避免倍数调整时的 dtype 问题
        for col in ['total_purchases', 'website_visits', 'email_opens']:
            df[col] = df[col].astype(float)
        
        high = customer_types == '高价值'
        df.loc[high, 'income'] *= 1.5
        df.loc[high, 'avg_order_value'] *= 1.8
        df.loc[high, 'total_purchases'] *= 1.5
        df.loc[high, 'tenure_months'] *= 1.3
        df.loc[high, 'last_purchase_days'] *= 0.5
        
        mid = customer_types == '中价值'
        df.loc[mid, 'income'] *= 1.1
        df.loc[mid, 'avg_order_value'] *= 1.2
        df.loc[mid, 'total_purchases'] *= 1.1
        
        low = customer_types == '低价值'
        df.loc[low, 'income'] *= 0.8
        df.loc[low, 'avg_order_value'] *= 0.7
        df.loc[low, 'total_purchases'] *= 0.8
        
        churn = customer_types == '流失风险'
        df.loc[churn, 'last_purchase_days'] *= 3
        df.loc[churn, 'website_visits'] *= 0.3
        df.loc[churn, 'email_opens'] *= 0.2
        df.loc[churn, 'social_media_engagement'] *= 0.3
        
        # 重新应用数据范围限制
        df['income'] = np.clip(df['income'], 20000, 500000)
        df['avg_order_value'] = np.clip(df['avg_order_value'], 50, 5000)
        df['total_purchases'] = np.clip(df['total_purchases'], 0, 100)
        df['tenure_months'] = np.clip(df['tenure_months'], 1, 120)
        df['last_purchase_days'] = np.clip(df['last_purchase_days'], 0, 365)
        df['website_visits'] = np.clip(df['website_visits'], 0, 100)
        df['email_opens'] = np.clip(df['email_opens'], 0, 50)
        
        # 添加目标变量
        df['customer_type'] = customer_types
        
        print(f"数据集形状: {df.shape}")
        print(f"客户类型分布:")
        print(df['customer_type'].value_counts())
        
        # 可视化数据集
        self.visualize_customer_dataset(df)
        
        return df
    
    def visualize_customer_dataset(self, df):
        """可视化客户数据集"""
        fig, axes = plt.subplots(3, 3, figsize=(18, 15))
        fig.suptitle('客户数据集探索性分析', fontsize=16, fontweight='bold')
        
        # 1. 客户类型分布
        df['customer_type'].value_counts().plot(kind='bar', ax=axes[0,0], color='skyblue')
        axes[0,0].set_title('客户类型分布')
        axes[0,0].set_xlabel('客户类型')
        axes[0,0].set_ylabel('数量')
        axes[0,0].tick_params(axis='x', rotation=45)
        
        # 2. 收入分布
        for i, customer_type in enumerate(df['customer_type'].unique()):
            subset = df[df['customer_type'] == customer_type]['income']
            axes[0,1].hist(subset, alpha=0.7, label=customer_type, bins=20)
        axes[0,1].set_title('不同客户类型的收入分布')
        axes[0,1].set_xlabel('收入')
        axes[0,1].set_ylabel('频次')
        axes[0,1].legend()
        
        # 3. 平均订单价值分布
        sns.boxplot(data=df, x='customer_type', y='avg_order_value', ax=axes[0,2])
        axes[0,2].set_title('平均订单价值分布')
        axes[0,2].tick_params(axis='x', rotation=45)
        
        # 4. 总购买次数分布
        sns.boxplot(data=df, x='customer_type', y='total_purchases', ax=axes[1,0])
        axes[1,0].set_title('总购买次数分布')
        axes[1,0].tick_params(axis='x', rotation=45)
        
        # 5. 客户任期分布
        sns.violinplot(data=df, x='customer_type', y='tenure_months', ax=axes[1,1])
        axes[1,1].set_title('客户任期分布')
        axes[1,1].tick_params(axis='x', rotation=45)
        
        # 6. 最后购买天数分布
        sns.boxplot(data=df, x='customer_type', y='last_purchase_days', ax=axes[1,2])
        axes[1,2].set_title('最后购买天数分布')
        axes[1,2].tick_params(axis='x', rotation=45)
        
        # 7. 网站访问次数分布
        sns.boxplot(data=df, x='customer_type', y='website_visits', ax=axes[2,0])
        axes[2,0].set_title('网站访问次数分布')
        axes[2,0].tick_params(axis='x', rotation=45)
        
        # 8. 邮件打开次数分布
        sns.boxplot(data=df, x='customer_type', y='email_opens', ax=axes[2,1])
        axes[2,1].set_title('邮件打开次数分布')
        axes[2,1].tick_params(axis='x', rotation=45)
        
        # 9. 社交媒体参与度分布
        sns.boxplot(data=df, x='customer_type', y='social_media_engagement', ax=axes[2,2])
        axes[2,2].set_title('社交媒体参与度分布')
        axes[2,2].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()
        
        # 相关性热力图
        plt.figure(figsize=(12, 10))
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        correlation_matrix = df[numeric_cols].corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                   square=True, linewidths=0.5)
        plt.title('特征相关性热力图', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

# 创建客户分类项目实例
customer_project = CustomerSegmentation()

# 创建数据集
print("=== 创建客户数据集 ===")
customer_data = customer_project.create_customer_dataset()

10.2.3 数据预处理与特征工程

def preprocess_customer_data(self, df):
    """客户数据预处理"""
    print("=== 数据预处理 ===")
    
    # 分离特征和目标变量
    # 为保持示例简洁,这里仅使用数值特征;
    # 类别特征(性别、学历等)可用 pd.get_dummies 编码后一并加入
    X = df.drop('customer_type', axis=1).select_dtypes(include=[np.number])
    y = df['customer_type']
    
    # 保存特征名称,供后续特征重要性分析和预测示例使用
    self.feature_names = X.columns.tolist()
    
    # 编码目标变量
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)
    
    print(f"特征数量: {X.shape[1]}")
    print(f"样本数量: {X.shape[0]}")
    print(f"类别数量: {len(le.classes_)}")
    print(f"类别映射: {dict(zip(le.classes_, range(len(le.classes_))))}")
    
    # 划分训练集和测试集
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
    )
    
    # 特征标准化
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    print(f"训练集形状: {X_train_scaled.shape}")
    print(f"测试集形状: {X_test_scaled.shape}")
    
    return X_train_scaled, X_test_scaled, y_train, y_test, le, scaler

# 将方法绑定到类,并执行数据预处理
CustomerSegmentation.preprocess_customer_data = preprocess_customer_data

X_train, X_test, y_train, y_test, label_encoder, scaler = customer_project.preprocess_customer_data(customer_data)
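
上面的流程需要手动管理 scaler 对象,部署时要同时保存缩放器和模型。一个更稳健的做法是用 sklearn 的 Pipeline 把标准化与分类器串联成单一对象(以下仅为示意,clf_pipeline 为新增变量名,不属于本项目的主流程):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 标准化与分类器封装在一起,交叉验证时可避免缩放参数泄露测试折信息
# 注意:Pipeline 应当接收未标准化的原始划分数据
clf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])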

10.2.4 模型训练与比较

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report)

def train_classification_models(self, X_train, X_test, y_train, y_test):
    """训练多种分类模型"""
    print("=== 模型训练与比较 ===")
    
    # 定义模型
    models = {
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(random_state=42),
        'SVM': SVC(random_state=42, probability=True),
        'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
    }
    
    results = {}
    
    for name, model in models.items():
        print(f"\n训练 {name}...")
        
        # 训练模型
        model.fit(X_train, y_train)
        
        # 预测
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test) if hasattr(model, 'predict_proba') else None
        
        # 评估指标
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted')
        recall = recall_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')
        
        # 交叉验证
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        
        results[name] = {
            'model': model,
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'y_pred': y_pred,
            'y_pred_proba': y_pred_proba
        }
        
        print(f"准确率: {accuracy:.4f}")
        print(f"精确率: {precision:.4f}")
        print(f"召回率: {recall:.4f}")
        print(f"F1分数: {f1:.4f}")
        print(f"交叉验证: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    return results

# 将方法绑定到类,并训练模型
CustomerSegmentation.train_classification_models = train_classification_models

model_results = customer_project.train_classification_models(X_train, X_test, y_train, y_test)

10.2.5 模型评估与可视化

def visualize_classification_results(self, results, y_test, label_encoder):
    """可视化分类结果"""
    print("=== 模型评估可视化 ===")
    
    # 1. 模型性能比较
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('客户分类模型性能比较', fontsize=16, fontweight='bold')
    
    # 准确率比较
    models = list(results.keys())
    accuracies = [results[model]['accuracy'] for model in models]
    axes[0,0].bar(models, accuracies, color='skyblue')
    axes[0,0].set_title('模型准确率比较')
    axes[0,0].set_ylabel('准确率')
    axes[0,0].tick_params(axis='x', rotation=45)
    
    # F1分数比较
    f1_scores = [results[model]['f1_score'] for model in models]
    axes[0,1].bar(models, f1_scores, color='lightgreen')
    axes[0,1].set_title('模型F1分数比较')
    axes[0,1].set_ylabel('F1分数')
    axes[0,1].tick_params(axis='x', rotation=45)
    
    # 交叉验证分数比较
    cv_means = [results[model]['cv_mean'] for model in models]
    cv_stds = [results[model]['cv_std'] for model in models]
    axes[1,0].bar(models, cv_means, yerr=cv_stds, capsize=5, color='orange')
    axes[1,0].set_title('交叉验证分数比较')
    axes[1,0].set_ylabel('CV准确率')
    axes[1,0].tick_params(axis='x', rotation=45)
    
    # 综合指标雷达图
    metrics = ['accuracy', 'precision', 'recall', 'f1_score']
    best_model = max(results.keys(), key=lambda x: results[x]['accuracy'])
    values = [results[best_model][metric] for metric in metrics]
    
    angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False)
    values += values[:1]  # 闭合图形
    angles = np.concatenate((angles, [angles[0]]))
    
    axes[1,1].plot(angles, values, 'o-', linewidth=2, label=best_model)
    axes[1,1].fill(angles, values, alpha=0.25)
    axes[1,1].set_xticks(angles[:-1])
    axes[1,1].set_xticklabels(metrics)
    axes[1,1].set_ylim(0, 1)
    axes[1,1].set_title(f'最佳模型 ({best_model}) 性能雷达图')
    axes[1,1].grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # 2. 混淆矩阵
    best_model_name = max(results.keys(), key=lambda x: results[x]['accuracy'])
    y_pred_best = results[best_model_name]['y_pred']
    
    plt.figure(figsize=(10, 8))
    cm = confusion_matrix(y_test, y_pred_best)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=label_encoder.classes_,
                yticklabels=label_encoder.classes_)
    plt.title(f'混淆矩阵 - {best_model_name}', fontsize=14, fontweight='bold')
    plt.xlabel('预测类别')
    plt.ylabel('真实类别')
    plt.tight_layout()
    plt.show()
    
    # 3. 分类报告
    print(f"\n=== {best_model_name} 详细分类报告 ===")
    print(classification_report(y_test, y_pred_best, 
                              target_names=label_encoder.classes_))

# 将方法绑定到类,并可视化结果
CustomerSegmentation.visualize_classification_results = visualize_classification_results

customer_project.visualize_classification_results(model_results, y_test, label_encoder)

10.2.6 特征重要性分析

def analyze_feature_importance(self, results, feature_names):
    """分析特征重要性"""
    print("=== 特征重要性分析 ===")
    
    # 获取随机森林的特征重要性
    rf_model = results['Random Forest']['model']
    rf_importance = rf_model.feature_importances_
    
    # 获取梯度提升的特征重要性
    gb_model = results['Gradient Boosting']['model']
    gb_importance = gb_model.feature_importances_
    
    # 创建特征重要性DataFrame
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Random_Forest': rf_importance,
        'Gradient_Boosting': gb_importance
    })
    
    # 计算平均重要性
    importance_df['Average'] = (importance_df['Random_Forest'] + 
                               importance_df['Gradient_Boosting']) / 2
    
    # 按平均重要性排序
    importance_df = importance_df.sort_values('Average', ascending=False)
    
    print("特征重要性排名:")
    print(importance_df)
    
    # 可视化特征重要性
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # 随机森林特征重要性
    top_features_rf = importance_df.head(10)
    axes[0].barh(range(len(top_features_rf)), top_features_rf['Random_Forest'])
    axes[0].set_yticks(range(len(top_features_rf)))
    axes[0].set_yticklabels(top_features_rf['Feature'])
    axes[0].set_title('随机森林 - 特征重要性 (Top 10)')
    axes[0].set_xlabel('重要性')
    
    # 梯度提升特征重要性
    top_features_gb = importance_df.head(10)
    axes[1].barh(range(len(top_features_gb)), top_features_gb['Gradient_Boosting'])
    axes[1].set_yticks(range(len(top_features_gb)))
    axes[1].set_yticklabels(top_features_gb['Feature'])
    axes[1].set_title('梯度提升 - 特征重要性 (Top 10)')
    axes[1].set_xlabel('重要性')
    
    plt.tight_layout()
    plt.show()
    
    return importance_df

# 将方法绑定到类,并分析特征重要性
CustomerSegmentation.analyze_feature_importance = analyze_feature_importance

# 使用预处理阶段保存的数值特征名称,保证与模型的特征顺序一致
feature_names = customer_project.feature_names
importance_analysis = customer_project.analyze_feature_importance(model_results, feature_names)

10.2.7 客户分类预测示例

def predict_customer_type(self, model, scaler, label_encoder, customer_features):
    """预测新客户类型"""
    print("=== 客户类型预测示例 ===")
    
    # 标准化特征(带上训练时的特征名称,避免 sklearn 的特征名不一致警告)
    features_df = pd.DataFrame([customer_features], columns=self.feature_names)
    customer_features_scaled = scaler.transform(features_df)
    
    # 预测
    prediction = model.predict(customer_features_scaled)[0]
    prediction_proba = model.predict_proba(customer_features_scaled)[0]
    
    # 解码预测结果
    predicted_type = label_encoder.inverse_transform([prediction])[0]
    
    print(f"客户特征: {customer_features}")
    print(f"预测客户类型: {predicted_type}")
    print(f"预测概率:")
    for i, class_name in enumerate(label_encoder.classes_):
        print(f"  {class_name}: {prediction_proba[i]:.4f}")
    
    return predicted_type, prediction_proba

# 将方法绑定到类
CustomerSegmentation.predict_customer_type = predict_customer_type

# 预测示例
best_model = model_results['Random Forest']['model']

# 示例客户1:高价值客户特征(依次为:年龄、收入、客户生命周期、总购买次数、
# 平均订单价值、距上次购买天数、网站访问、邮件打开、客服电话、社交媒体参与度;
# 数值仅为示意)
example_customer_1 = [45, 120000, 36, 25, 2500, 15, 25, 15, 2, 0.6]
customer_project.predict_customer_type(best_model, scaler, label_encoder, example_customer_1)

print("\n" + "="*50)

# 示例客户2:流失风险客户特征(数值仅为示意)
example_customer_2 = [35, 45000, 8, 3, 150, 180, 5, 2, 1, 0.05]
customer_project.predict_customer_type(best_model, scaler, label_encoder, example_customer_2)

10.3 项目三:推荐系统

10.3.1 项目背景与目标

推荐系统是现代互联网应用的核心组件,广泛应用于电商、视频、音乐等平台。

项目目标:

  • 构建基于协同过滤的推荐系统
  • 实现基于矩阵分解(SVD)的推荐算法
  • 评估推荐系统性能

class RecommendationSystem:
    def __init__(self):
        self.user_item_matrix = None
        self.item_features = None
        self.user_similarity = None
        self.item_similarity = None
        self.reconstructed_matrix = None
    
    def create_recommendation_dataset(self, n_users=1000, n_items=500, n_ratings=50000):
        """创建推荐系统数据集"""
        print("=== 创建推荐系统数据集 ===")
        
        np.random.seed(42)
        
        # 生成用户ID和物品ID
        user_ids = np.random.choice(range(1, n_users + 1), n_ratings)
        item_ids = np.random.choice(range(1, n_items + 1), n_ratings)
        
        # 生成评分(1-5分)
        ratings = np.random.choice([1, 2, 3, 4, 5], n_ratings, 
                                 p=[0.1, 0.15, 0.25, 0.35, 0.15])
        
        # 创建评分数据框
        ratings_df = pd.DataFrame({
            'user_id': user_ids,
            'item_id': item_ids,
            'rating': ratings
        })
        
        # 去除重复的用户-物品对,保留最后一次评分
        ratings_df = ratings_df.drop_duplicates(subset=['user_id', 'item_id'], keep='last')
        
        # 创建物品特征
        item_features_df = pd.DataFrame({
            'item_id': range(1, n_items + 1),
            'category': np.random.choice(['电子产品', '服装', '书籍', '家居', '运动'], n_items),
            'price': np.random.uniform(10, 1000, n_items),
            'brand_popularity': np.random.uniform(0, 1, n_items),
            'release_year': np.random.choice(range(2015, 2024), n_items)
        })
        
        print(f"评分数据形状: {ratings_df.shape}")
        print(f"物品特征形状: {item_features_df.shape}")
        print(f"用户数量: {ratings_df['user_id'].nunique()}")
        print(f"物品数量: {ratings_df['item_id'].nunique()}")
        print(f"评分分布:")
        print(ratings_df['rating'].value_counts().sort_index())
        
        # 可视化数据集
        self.visualize_recommendation_dataset(ratings_df, item_features_df)
        
        return ratings_df, item_features_df
    
    def visualize_recommendation_dataset(self, ratings_df, item_features_df):
        """可视化推荐系统数据集"""
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        fig.suptitle('推荐系统数据集分析', fontsize=16, fontweight='bold')
        
        # 1. 评分分布
        ratings_df['rating'].value_counts().sort_index().plot(kind='bar', ax=axes[0,0], color='skyblue')
        axes[0,0].set_title('评分分布')
        axes[0,0].set_xlabel('评分')
        axes[0,0].set_ylabel('数量')
        
        # 2. 用户评分数量分布
        user_rating_counts = ratings_df['user_id'].value_counts()
        axes[0,1].hist(user_rating_counts, bins=30, color='lightgreen', alpha=0.7)
        axes[0,1].set_title('用户评分数量分布')
        axes[0,1].set_xlabel('评分数量')
        axes[0,1].set_ylabel('用户数量')
        
        # 3. 物品评分数量分布
        item_rating_counts = ratings_df['item_id'].value_counts()
        axes[0,2].hist(item_rating_counts, bins=30, color='orange', alpha=0.7)
        axes[0,2].set_title('物品评分数量分布')
        axes[0,2].set_xlabel('评分数量')
        axes[0,2].set_ylabel('物品数量')
        
        # 4. 物品类别分布
        item_features_df['category'].value_counts().plot(kind='bar', ax=axes[1,0], color='purple')
        axes[1,0].set_title('物品类别分布')
        axes[1,0].set_xlabel('类别')
        axes[1,0].set_ylabel('数量')
        axes[1,0].tick_params(axis='x', rotation=45)
        
        # 5. 物品价格分布
        axes[1,1].hist(item_features_df['price'], bins=30, color='red', alpha=0.7)
        axes[1,1].set_title('物品价格分布')
        axes[1,1].set_xlabel('价格')
        axes[1,1].set_ylabel('数量')
        
        # 6. 发布年份分布
        item_features_df['release_year'].value_counts().sort_index().plot(kind='bar', ax=axes[1,2], color='brown')
        axes[1,2].set_title('物品发布年份分布')
        axes[1,2].set_xlabel('年份')
        axes[1,2].set_ylabel('数量')
        
        plt.tight_layout()
        plt.show()

# 创建推荐系统实例
rec_system = RecommendationSystem()

# 创建数据集
print("=== 创建推荐系统数据集 ===")
ratings_data, item_features = rec_system.create_recommendation_dataset()

10.3.2 协同过滤算法实现

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

def build_user_item_matrix(self, ratings_df):
    """构建用户-物品评分矩阵"""
    print("=== 构建用户-物品矩阵 ===")
    
    # 创建用户-物品矩阵
    user_item_matrix = ratings_df.pivot(index='user_id', columns='item_id', values='rating')
    user_item_matrix = user_item_matrix.fillna(0)
    
    print(f"用户-物品矩阵形状: {user_item_matrix.shape}")
    print(f"稀疏度: {(user_item_matrix == 0).sum().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]):.4f}")
    
    self.user_item_matrix = user_item_matrix
    return user_item_matrix
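
pivot 得到的稠密矩阵在用户与物品规模增大时会迅速耗尽内存。对于真实规模的数据,可以改用 scipy 稀疏矩阵只存储非零评分(以下为示意实现,build_sparse_matrix 为新增的辅助函数,假设环境中已安装 scipy):

from scipy.sparse import csr_matrix

def build_sparse_matrix(ratings_df):
    """用CSR稀疏格式存储评分矩阵,只保存非零元素"""
    user_codes = ratings_df['user_id'].astype('category').cat.codes
    item_codes = ratings_df['item_id'].astype('category').cat.codes
    return csr_matrix((ratings_df['rating'], (user_codes, item_codes)))

sklearn 的 cosine_similarity 可以直接接受这类稀疏矩阵作为输入。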

def compute_user_similarity(self, user_item_matrix):
    """计算用户相似度"""
    print("=== 计算用户相似度 ===")
    
    # 计算用户相似度矩阵
    user_similarity = cosine_similarity(user_item_matrix)
    user_similarity_df = pd.DataFrame(user_similarity, 
                                    index=user_item_matrix.index,
                                    columns=user_item_matrix.index)
    
    self.user_similarity = user_similarity_df
    print(f"用户相似度矩阵形状: {user_similarity_df.shape}")
    
    return user_similarity_df

def compute_item_similarity(self, user_item_matrix):
    """计算物品相似度"""
    print("=== 计算物品相似度 ===")
    
    # 计算物品相似度矩阵
    item_similarity = cosine_similarity(user_item_matrix.T)
    item_similarity_df = pd.DataFrame(item_similarity,
                                    index=user_item_matrix.columns,
                                    columns=user_item_matrix.columns)
    
    self.item_similarity = item_similarity_df
    print(f"物品相似度矩阵形状: {item_similarity_df.shape}")
    
    return item_similarity_df

def user_based_recommendation(self, user_id, n_recommendations=10):
    """基于用户的协同过滤推荐"""
    if self.user_similarity is None or self.user_item_matrix is None:
        raise ValueError("请先计算用户相似度和构建用户-物品矩阵")
    
    # 获取目标用户的相似用户
    user_similarities = self.user_similarity.loc[user_id].sort_values(ascending=False)
    similar_users = user_similarities.iloc[1:11]  # 排除自己,取前10个相似用户
    
    # 获取目标用户已评分的物品
    user_ratings = self.user_item_matrix.loc[user_id]
    rated_items = user_ratings[user_ratings > 0].index
    
    # 计算推荐分数
    recommendations = {}
    for item_id in self.user_item_matrix.columns:
        if item_id not in rated_items:  # 只推荐未评分的物品
            score = 0
            similarity_sum = 0
            
            for similar_user_id, similarity in similar_users.items():
                if self.user_item_matrix.loc[similar_user_id, item_id] > 0:
                    score += similarity * self.user_item_matrix.loc[similar_user_id, item_id]
                    similarity_sum += abs(similarity)
            
            if similarity_sum > 0:
                recommendations[item_id] = score / similarity_sum
    
    # 排序并返回前N个推荐
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return sorted_recommendations[:n_recommendations]

def item_based_recommendation(self, user_id, n_recommendations=10):
    """基于物品的协同过滤推荐"""
    if self.item_similarity is None or self.user_item_matrix is None:
        raise ValueError("请先计算物品相似度和构建用户-物品矩阵")
    
    # 获取目标用户的评分
    user_ratings = self.user_item_matrix.loc[user_id]
    rated_items = user_ratings[user_ratings > 0]
    
    # 计算推荐分数
    recommendations = {}
    for item_id in self.user_item_matrix.columns:
        if item_id not in rated_items.index:  # 只推荐未评分的物品
            score = 0
            similarity_sum = 0
            
            for rated_item_id, rating in rated_items.items():
                similarity = self.item_similarity.loc[item_id, rated_item_id]
                score += similarity * rating
                similarity_sum += abs(similarity)
            
            if similarity_sum > 0:
                recommendations[item_id] = score / similarity_sum
    
    # 排序并返回前N个推荐
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return sorted_recommendations[:n_recommendations]

# 添加方法到类
RecommendationSystem.build_user_item_matrix = build_user_item_matrix
RecommendationSystem.compute_user_similarity = compute_user_similarity
RecommendationSystem.compute_item_similarity = compute_item_similarity
RecommendationSystem.user_based_recommendation = user_based_recommendation
RecommendationSystem.item_based_recommendation = item_based_recommendation

# 构建推荐系统
user_item_matrix = rec_system.build_user_item_matrix(ratings_data)
user_similarity = rec_system.compute_user_similarity(user_item_matrix)
item_similarity = rec_system.compute_item_similarity(user_item_matrix)
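
相似度矩阵构建完成后,可以先抽查其内容是否符合直觉,例如查看与某个用户或物品最相似的若干条目(示意代码,这里以编号 1 为例):

# 与用户 1 最相似的 5 个用户(iloc[1:6] 跳过其自身,对角线相似度恒为 1)
print(user_similarity.loc[1].sort_values(ascending=False).iloc[1:6])

# 与物品 1 最相似的 5 个物品
print(item_similarity.loc[1].sort_values(ascending=False).iloc[1:6])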

10.3.3 矩阵分解推荐算法

def matrix_factorization_recommendation(self, ratings_df, n_components=50):
    """基于矩阵分解的推荐算法"""
    print("=== 矩阵分解推荐算法 ===")
    
    # 准备数据
    user_item_matrix = self.user_item_matrix.values
    
    # 使用SVD进行矩阵分解
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    user_factors = svd.fit_transform(user_item_matrix)
    item_factors = svd.components_.T
    
    # 重构评分矩阵
    reconstructed_matrix = np.dot(user_factors, svd.components_)
    
    print(f"原始矩阵形状: {user_item_matrix.shape}")
    print(f"用户因子矩阵形状: {user_factors.shape}")
    print(f"物品因子矩阵形状: {item_factors.shape}")
    print(f"重构矩阵形状: {reconstructed_matrix.shape}")
    
    # 转换为DataFrame并保存到实例上,供后续推荐与可视化使用
    reconstructed_df = pd.DataFrame(reconstructed_matrix,
                                    index=self.user_item_matrix.index,
                                    columns=self.user_item_matrix.columns)
    self.reconstructed_matrix = reconstructed_df
    
    return reconstructed_df, svd

def svd_recommendation(self, user_id, reconstructed_matrix, n_recommendations=10):
    """基于SVD的推荐"""
    # 获取用户的原始评分和预测评分
    original_ratings = self.user_item_matrix.loc[user_id]
    predicted_ratings = reconstructed_matrix.loc[user_id]
    
    # 只推荐未评分的物品
    unrated_items = original_ratings[original_ratings == 0].index
    recommendations = predicted_ratings[unrated_items].sort_values(ascending=False)
    
    return recommendations.head(n_recommendations)

# 添加方法到类
RecommendationSystem.matrix_factorization_recommendation = matrix_factorization_recommendation
RecommendationSystem.svd_recommendation = svd_recommendation

# 矩阵分解推荐
reconstructed_matrix, svd_model = rec_system.matrix_factorization_recommendation(ratings_data)
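
矩阵分解完成后,可以先检查保留的隐因子解释了多少方差,再为单个用户生成推荐(示意代码):

# 50 个隐因子累计解释的方差比例
print(f"SVD 累计解释方差比例: {svd_model.explained_variance_ratio_.sum():.4f}")

# 为用户 1 生成 Top-5 推荐
print(rec_system.svd_recommendation(1, reconstructed_matrix, n_recommendations=5))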

10.3.4 推荐系统评估

def evaluate_recommendation_system(self, ratings_df, test_size=0.2):
    """评估推荐系统性能"""
    print("=== 推荐系统评估 ===")
    
    # 划分训练集和测试集
    train_data, test_data = train_test_split(ratings_df, test_size=test_size, random_state=42)
    
    # 构建训练集的用户-物品矩阵
    train_matrix = train_data.pivot(index='user_id', columns='item_id', values='rating').fillna(0)
    
    # 计算用户相似度(此处只评估基于用户的协同过滤方法)
    user_sim = cosine_similarity(train_matrix)
    
    # 评估指标
    mae_scores = []
    rmse_scores = []
    
    # 对测试集中的每个评分进行预测
    for _, row in test_data.iterrows():
        user_id, item_id, true_rating = row['user_id'], row['item_id'], row['rating']
        
        if user_id in train_matrix.index and item_id in train_matrix.columns:
            # 基于用户的协同过滤预测
            user_idx = train_matrix.index.get_loc(user_id)
            item_idx = train_matrix.columns.get_loc(item_id)
            
            # 只让对该物品有过评分的用户参与加权
            user_similarities = user_sim[user_idx]
            item_ratings = train_matrix.iloc[:, item_idx].values
            rated_mask = item_ratings > 0
            
            # 加权平均预测
            numerator = np.sum(user_similarities[rated_mask] * item_ratings[rated_mask])
            denominator = np.sum(np.abs(user_similarities[rated_mask]))
            
            if denominator > 0:
                predicted_rating = numerator / denominator
            else:
                # 无可参考用户时,退回全局平均评分
                predicted_rating = np.mean(train_matrix.values[train_matrix.values > 0])
            
            # 计算误差
            mae_scores.append(abs(true_rating - predicted_rating))
            rmse_scores.append((true_rating - predicted_rating) ** 2)
    
    # 计算最终指标
    mae = np.mean(mae_scores)
    rmse = np.sqrt(np.mean(rmse_scores))
    
    print(f"平均绝对误差 (MAE): {mae:.4f}")
    print(f"均方根误差 (RMSE): {rmse:.4f}")
    
    return mae, rmse

def visualize_recommendation_results(self, user_id=1):
    """可视化推荐结果"""
    print(f"=== 用户 {user_id} 推荐结果可视化 ===")
    
    # 获取不同算法的推荐结果
    user_based_recs = self.user_based_recommendation(user_id, 10)
    item_based_recs = self.item_based_recommendation(user_id, 10)
    svd_recs = self.svd_recommendation(user_id, self.reconstructed_matrix, 10)
    
    # 创建可视化
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle(f'用户 {user_id} 推荐结果比较', fontsize=16, fontweight='bold')
    
    # 1. 基于用户的协同过滤推荐
    if user_based_recs:
        items, scores = zip(*user_based_recs)
        axes[0,0].bar(range(len(items)), scores, color='skyblue')
        axes[0,0].set_title('基于用户的协同过滤推荐')
        axes[0,0].set_xlabel('推荐物品')
        axes[0,0].set_ylabel('推荐分数')
        axes[0,0].set_xticks(range(len(items)))
        axes[0,0].set_xticklabels([f'Item {item}' for item in items], rotation=45)
    
    # 2. 基于物品的协同过滤推荐
    if item_based_recs:
        items, scores = zip(*item_based_recs)
        axes[0,1].bar(range(len(items)), scores, color='lightgreen')
        axes[0,1].set_title('基于物品的协同过滤推荐')
        axes[0,1].set_xlabel('推荐物品')
        axes[0,1].set_ylabel('推荐分数')
        axes[0,1].set_xticks(range(len(items)))
        axes[0,1].set_xticklabels([f'Item {item}' for item in items], rotation=45)
    
    # 3. SVD推荐
    axes[1,0].bar(range(len(svd_recs)), svd_recs.values, color='orange')
    axes[1,0].set_title('SVD矩阵分解推荐')
    axes[1,0].set_xlabel('推荐物品')
    axes[1,0].set_ylabel('预测评分')
    axes[1,0].set_xticks(range(len(svd_recs)))
    axes[1,0].set_xticklabels([f'Item {item}' for item in svd_recs.index], rotation=45)
    
    # 4. 用户评分历史
    user_ratings = self.user_item_matrix.loc[user_id]
    rated_items = user_ratings[user_ratings > 0]
    
    if len(rated_items) > 0:
        axes[1,1].bar(range(len(rated_items)), rated_items.values, color='red', alpha=0.7)
        axes[1,1].set_title('用户历史评分')
        axes[1,1].set_xlabel('已评分物品')
        axes[1,1].set_ylabel('评分')
        axes[1,1].set_xticks(range(len(rated_items)))
        axes[1,1].set_xticklabels([f'Item {item}' for item in rated_items.index], rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # 打印推荐结果
    print(f"\n基于用户的协同过滤推荐 (Top 5):")
    for i, (item, score) in enumerate(user_based_recs[:5], 1):
        print(f"{i}. 物品 {item}: {score:.4f}")
    
    print(f"\n基于物品的协同过滤推荐 (Top 5):")
    for i, (item, score) in enumerate(item_based_recs[:5], 1):
        print(f"{i}. 物品 {item}: {score:.4f}")
    
    print(f"\nSVD矩阵分解推荐 (Top 5):")
    for i, (item, score) in enumerate(svd_recs.head(5).items(), 1):
        print(f"{i}. 物品 {item}: {score:.4f}")

# 添加方法到类
RecommendationSystem.evaluate_recommendation_system = evaluate_recommendation_system
RecommendationSystem.visualize_recommendation_results = visualize_recommendation_results

# 评估推荐系统
mae, rmse = rec_system.evaluate_recommendation_system(ratings_data)
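
单看 MAE/RMSE 的绝对数值意义有限,通常还应与一个朴素基线对比,例如"始终预测训练集的全局平均评分"(示意代码,baseline_* 为新增变量名;这里沿用与 evaluate_recommendation_system 相同的划分参数,保证结果可比):

# 基线:始终预测训练集平均评分
train_part, test_part = train_test_split(ratings_data, test_size=0.2, random_state=42)
global_mean = train_part['rating'].mean()
baseline_mae = np.mean(np.abs(test_part['rating'] - global_mean))
baseline_rmse = np.sqrt(np.mean((test_part['rating'] - global_mean) ** 2))
print(f"基线 MAE: {baseline_mae:.4f}, 基线 RMSE: {baseline_rmse:.4f}")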

# 可视化推荐结果
rec_system.visualize_recommendation_results(user_id=1)

10.4 项目四:时间序列预测

10.4.1 项目背景与目标

时间序列预测在金融、零售、制造等领域有广泛应用,如股价预测、销量预测、设备故障预测等。

项目目标:

  • 构建时间序列预测模型
  • 处理季节性和趋势性
  • 评估预测性能

class TimeSeriesForecasting:
    def __init__(self):
        self.data = None
        self.models = {}
        self.predictions = {}
    
    def create_time_series_dataset(self, n_points=1000, start_date='2020-01-01'):
        """创建时间序列数据集"""
        print("=== 创建时间序列数据集 ===")
        
        # 固定随机种子,保证数据可复现
        np.random.seed(42)
        
        # 创建日期范围
        dates = pd.date_range(start=start_date, periods=n_points, freq='D')
        
        # 生成基础趋势
        trend = np.linspace(100, 200, n_points)
        
        # 添加季节性(年度和周度)
        annual_seasonality = 20 * np.sin(2 * np.pi * np.arange(n_points) / 365.25)
        weekly_seasonality = 10 * np.sin(2 * np.pi * np.arange(n_points) / 7)
        
        # 添加噪声
        noise = np.random.normal(0, 5, n_points)
        
        # 组合所有组件
        values = trend + annual_seasonality + weekly_seasonality + noise
        
        # 添加一些异常值
        anomaly_indices = np.random.choice(n_points, size=int(n_points * 0.02), replace=False)
        values[anomaly_indices] += np.random.normal(0, 30, len(anomaly_indices))
        
        # 创建DataFrame
        ts_data = pd.DataFrame({
            'date': dates,
            'value': values,
            'trend': trend,
            'annual_seasonality': annual_seasonality,
            'weekly_seasonality': weekly_seasonality,
            'noise': noise
        })
        
        ts_data.set_index('date', inplace=True)
        
        print(f"时间序列数据形状: {ts_data.shape}")
        print(f"日期范围: {ts_data.index.min()} 到 {ts_data.index.max()}")
        print(f"数值范围: {ts_data['value'].min():.2f} 到 {ts_data['value'].max():.2f}")
        
        self.data = ts_data
        
        # 可视化时间序列
        self.visualize_time_series(ts_data)
        
        return ts_data
    
    def visualize_time_series(self, ts_data):
        """可视化时间序列数据"""
        fig, axes = plt.subplots(3, 2, figsize=(16, 12))
        fig.suptitle('时间序列数据分析', fontsize=16, fontweight='bold')
        
        # 1. 原始时间序列
        axes[0,0].plot(ts_data.index, ts_data['value'], color='blue', alpha=0.7)
        axes[0,0].set_title('原始时间序列')
        axes[0,0].set_xlabel('日期')
        axes[0,0].set_ylabel('数值')
        axes[0,0].grid(True, alpha=0.3)
        
        # 2. 趋势组件
        axes[0,1].plot(ts_data.index, ts_data['trend'], color='red', linewidth=2)
        axes[0,1].set_title('趋势组件')
        axes[0,1].set_xlabel('日期')
        axes[0,1].set_ylabel('趋势值')
        axes[0,1].grid(True, alpha=0.3)
        
        # 3. 年度季节性
        axes[1,0].plot(ts_data.index[:365], ts_data['annual_seasonality'][:365], color='green')
        axes[1,0].set_title('年度季节性 (前365天)')
        axes[1,0].set_xlabel('日期')
        axes[1,0].set_ylabel('季节性值')
        axes[1,0].grid(True, alpha=0.3)
        
        # 4. 周度季节性
        axes[1,1].plot(ts_data.index[:28], ts_data['weekly_seasonality'][:28], color='orange')
        axes[1,1].set_title('周度季节性 (前4周)')
        axes[1,1].set_xlabel('日期')
        axes[1,1].set_ylabel('季节性值')
        axes[1,1].grid(True, alpha=0.3)
        
        # 5. 数值分布
        axes[2,0].hist(ts_data['value'], bins=50, color='purple', alpha=0.7)
        axes[2,0].set_title('数值分布')
        axes[2,0].set_xlabel('数值')
        axes[2,0].set_ylabel('频次')
        
        # 6. 自相关图(一次性计算到最大滞后期,避免在循环中反复调用 acf)
        from statsmodels.tsa.stattools import acf
        max_lag = 49
        autocorr = acf(ts_data['value'], nlags=max_lag)[1:]  # 去掉滞后0(恒为1)
        axes[2,1].plot(range(1, max_lag + 1), autocorr, color='brown')
        axes[2,1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
        axes[2,1].set_title('自相关函数')
        axes[2,1].set_xlabel('滞后期')
        axes[2,1].set_ylabel('自相关系数')
        axes[2,1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# 创建时间序列预测实例
ts_forecaster = TimeSeriesForecasting()

# 创建数据集(方法内部会打印数据概要并绘制分解图)
ts_data = ts_forecaster.create_time_series_dataset()
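
本例的趋势和季节性是人工合成、直接可见的;面对真实序列时,可以用 statsmodels 的 seasonal_decompose 做加法分解来估计这些组件(示意,period=7 对应周度季节性,也可改为 365 观察年度模式):

from statsmodels.tsa.seasonal import seasonal_decompose

# 加法分解:value ≈ 趋势 + 季节项 + 残差
decomposition = seasonal_decompose(ts_data['value'], model='additive', period=7)
decomposition.plot()
plt.show()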

10.4.2 特征工程与数据预处理

时间序列建模的核心思路是把序列转化为监督学习问题:用该时刻已知的历史信息(滞后值、滑动窗口统计、日期属性、差分等)作为特征,去预测未来的值。

def create_time_series_features(self, ts_data, window_sizes=(7, 14, 30)):
    """创建时间序列特征"""
    print("=== 时间序列特征工程 ===")
    
    # 复制数据
    features_df = ts_data.copy()
    
    # 1. 滞后特征
    for lag in [1, 2, 3, 7, 14, 30]:
        features_df[f'lag_{lag}'] = features_df['value'].shift(lag)
    
    # 2. 滑动窗口统计特征
    for window in window_sizes:
        features_df[f'rolling_mean_{window}'] = features_df['value'].rolling(window=window).mean()
        features_df[f'rolling_std_{window}'] = features_df['value'].rolling(window=window).std()
        features_df[f'rolling_min_{window}'] = features_df['value'].rolling(window=window).min()
        features_df[f'rolling_max_{window}'] = features_df['value'].rolling(window=window).max()
    
    # 3. 时间特征
    features_df['year'] = features_df.index.year
    features_df['month'] = features_df.index.month
    features_df['day'] = features_df.index.day
    features_df['dayofweek'] = features_df.index.dayofweek
    features_df['dayofyear'] = features_df.index.dayofyear
    features_df['quarter'] = features_df.index.quarter
    features_df['is_weekend'] = (features_df.index.dayofweek >= 5).astype(int)
    
    # 4. 差分特征
    features_df['diff_1'] = features_df['value'].diff(1)
    features_df['diff_7'] = features_df['value'].diff(7)
    features_df['diff_30'] = features_df['value'].diff(30)
    
    # 5. 变化率特征
    features_df['pct_change_1'] = features_df['value'].pct_change(1)
    features_df['pct_change_7'] = features_df['value'].pct_change(7)
    
    # 6. 周期性特征
    features_df['sin_dayofyear'] = np.sin(2 * np.pi * features_df['dayofyear'] / 365.25)
    features_df['cos_dayofyear'] = np.cos(2 * np.pi * features_df['dayofyear'] / 365.25)
    features_df['sin_dayofweek'] = np.sin(2 * np.pi * features_df['dayofweek'] / 7)
    features_df['cos_dayofweek'] = np.cos(2 * np.pi * features_df['dayofweek'] / 7)
    
    # 删除包含NaN的行
    features_df = features_df.dropna()
    
    print(f"特征工程后数据形状: {features_df.shape}")
    print(f"特征列数: {len(features_df.columns)}")
    
    return features_df
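
shift 和 rolling 的语义值得单独确认:shift(k) 取 k 期之前的值,rolling(w) 统计包含当前行在内的最近 w 个观测。一个玩具示例:

# 滞后与滑动窗口特征的玩具示例
s = pd.Series([1, 2, 3, 4, 5, 6],
              index=pd.date_range('2020-01-01', periods=6, freq='D'))
print(s.shift(2))                  # lag_2:前两行为 NaN,其余取 2 天前的值
print(s.rolling(window=3).mean())  # 最近 3 天均值,前两行为 NaN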

def prepare_supervised_data(self, features_df, target_col='value', forecast_horizon=1):
    """准备监督学习数据"""
    print("=== 准备监督学习数据 ===")
    
    # 创建目标变量(未来值)
    y = features_df[target_col].shift(-forecast_horizon)
    
    # 特征变量(排除目标变量和其组件)
    exclude_cols = ['value', 'trend', 'annual_seasonality', 'weekly_seasonality', 'noise']
    X = features_df.drop(columns=[col for col in exclude_cols if col in features_df.columns])
    
    # 删除包含NaN的行
    valid_indices = ~y.isna()
    X = X[valid_indices]
    y = y[valid_indices]
    
    print(f"特征矩阵形状: {X.shape}")
    print(f"目标变量形状: {y.shape}")
    
    return X, y
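
shift(-forecast_horizon) 的作用是把"未来第 h 期的值"对齐到当前行,保证每一行的特征都只包含该时刻已知的信息。玩具示例:

# 目标变量对齐示意:shift(-1) 把"明天的值"放到今天这一行
s = pd.Series([10, 11, 12, 13])
print(s.shift(-1))  # 输出 [11.0, 12.0, 13.0, NaN],末尾无未来值的行需丢弃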

# 添加方法到类
TimeSeriesForecasting.create_time_series_features = create_time_series_features
TimeSeriesForecasting.prepare_supervised_data = prepare_supervised_data

# 特征工程
features_data = ts_forecaster.create_time_series_features(ts_data)
X, y = ts_forecaster.prepare_supervised_data(features_data)

10.4.3 时间序列预测模型

特征就绪后,建模与普通回归问题类似;关键区别在于必须按时间顺序划分训练集和测试集,防止用"未来"的信息去拟合"过去"。

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_forecasting_models(self, X, y, test_size=0.2):
    """训练多种时间序列预测模型"""
    print("=== 训练时间序列预测模型 ===")
    
    # 时间序列数据按时间顺序划分训练集和测试集
    split_index = int(len(X) * (1 - test_size))
    X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
    y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]
    
    print(f"训练集大小: {X_train.shape[0]}")
    print(f"测试集大小: {X_test.shape[0]}")
    
    # 定义模型(SVR 对特征尺度敏感,这里用管道先做标准化)
    models = {
        'Linear Regression': LinearRegression(),
        'Ridge Regression': Ridge(alpha=1.0),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
        'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, gamma='scale'))
    }
    
    # 训练和评估模型
    results = {}
    predictions = {}
    
    for name, model in models.items():
        print(f"\n训练 {name}...")
        
        # 训练模型
        model.fit(X_train, y_train)
        
        # 预测
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        
        # 评估指标
        train_mae = mean_absolute_error(y_train, y_pred_train)
        test_mae = mean_absolute_error(y_test, y_pred_test)
        train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
        test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
        train_r2 = r2_score(y_train, y_pred_train)
        test_r2 = r2_score(y_test, y_pred_test)
        
        results[name] = {
            'train_mae': train_mae,
            'test_mae': test_mae,
            'train_rmse': train_rmse,
            'test_rmse': test_rmse,
            'train_r2': train_r2,
            'test_r2': test_r2
        }
        
        predictions[name] = {
            'train_pred': y_pred_train,
            'test_pred': y_pred_test
        }
        
        print(f"训练 MAE: {train_mae:.4f}, 测试 MAE: {test_mae:.4f}")
        print(f"训练 RMSE: {train_rmse:.4f}, 测试 RMSE: {test_rmse:.4f}")
        print(f"训练 R²: {train_r2:.4f}, 测试 R²: {test_r2:.4f}")
    
    self.models = models
    self.predictions = predictions
    self.results = results  # 保存评估结果,供后续方法(如未来预测)使用
    
    return results, X_train, X_test, y_train, y_test
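
除了单次按时间切分,也可以用 sklearn 的 TimeSeriesSplit 做滚动交叉验证,得到更稳健的性能估计。下面是一个最小示意,沿用上面准备好的 X 和 y:

from sklearn.model_selection import TimeSeriesSplit

# 滚动交叉验证:每一折都用较早的数据训练、随后的数据验证
tscv = TimeSeriesSplit(n_splits=5)
cv_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=tscv,
                            scoring='neg_mean_absolute_error')
print(f"TimeSeriesSplit 交叉验证 MAE: {-cv_scores.mean():.4f} ± {cv_scores.std():.4f}")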

def visualize_forecasting_results(self, results, X_train, X_test, y_train, y_test):
    """可视化预测结果"""
    print("=== 可视化预测结果 ===")
    
    # 创建图形
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('时间序列预测结果', fontsize=16, fontweight='bold')
    
    # 1. 模型性能比较
    model_names = list(results.keys())
    test_maes = [results[name]['test_mae'] for name in model_names]
    test_rmses = [results[name]['test_rmse'] for name in model_names]
    test_r2s = [results[name]['test_r2'] for name in model_names]
    
    axes[0,0].bar(model_names, test_maes, color='skyblue')
    axes[0,0].set_title('测试集 MAE 比较')
    axes[0,0].set_ylabel('MAE')
    axes[0,0].tick_params(axis='x', rotation=45)
    
    axes[0,1].bar(model_names, test_rmses, color='lightgreen')
    axes[0,1].set_title('测试集 RMSE 比较')
    axes[0,1].set_ylabel('RMSE')
    axes[0,1].tick_params(axis='x', rotation=45)
    
    axes[0,2].bar(model_names, test_r2s, color='orange')
    axes[0,2].set_title('测试集 R² 比较')
    axes[0,2].set_ylabel('R²')
    axes[0,2].tick_params(axis='x', rotation=45)
    
    # 2. 预测结果可视化(选择最佳模型)
    best_model = min(results.keys(), key=lambda x: results[x]['test_mae'])
    
    # 训练集预测
    train_dates = X_train.index
    test_dates = X_test.index
    
    axes[1,0].plot(train_dates, y_train, label='真实值', color='blue', alpha=0.7)
    axes[1,0].plot(train_dates, self.predictions[best_model]['train_pred'], 
                   label='预测值', color='red', alpha=0.7)
    axes[1,0].set_title(f'{best_model} - 训练集预测')
    axes[1,0].set_xlabel('日期')
    axes[1,0].set_ylabel('数值')
    axes[1,0].legend()
    axes[1,0].grid(True, alpha=0.3)
    
    # 测试集预测
    axes[1,1].plot(test_dates, y_test, label='真实值', color='blue', alpha=0.7)
    axes[1,1].plot(test_dates, self.predictions[best_model]['test_pred'], 
                   label='预测值', color='red', alpha=0.7)
    axes[1,1].set_title(f'{best_model} - 测试集预测')
    axes[1,1].set_xlabel('日期')
    axes[1,1].set_ylabel('数值')
    axes[1,1].legend()
    axes[1,1].grid(True, alpha=0.3)
    
    # 残差分析
    residuals = y_test.values - self.predictions[best_model]['test_pred']
    axes[1,2].scatter(self.predictions[best_model]['test_pred'], residuals, alpha=0.6)
    axes[1,2].axhline(y=0, color='red', linestyle='--')
    axes[1,2].set_title(f'{best_model} - 残差分析')
    axes[1,2].set_xlabel('预测值')
    axes[1,2].set_ylabel('残差')
    axes[1,2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # 打印最佳模型信息
    print(f"\n最佳模型: {best_model}")
    print(f"测试集 MAE: {results[best_model]['test_mae']:.4f}")
    print(f"测试集 RMSE: {results[best_model]['test_rmse']:.4f}")
    print(f"测试集 R²: {results[best_model]['test_r2']:.4f}")

def feature_importance_analysis(self, model_name='Random Forest'):
    """特征重要性分析"""
    print(f"=== {model_name} 特征重要性分析 ===")
    
    if model_name not in self.models:
        print(f"模型 {model_name} 不存在")
        return
    
    model = self.models[model_name]
    
    # 获取特征重要性
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        feature_names = model.feature_names_in_  # 取模型记录的特征名,避免依赖模块级变量 X
        
        # 创建特征重要性DataFrame
        importance_df = pd.DataFrame({
            'feature': feature_names,
            'importance': importances
        }).sort_values('importance', ascending=False)
        
        # 可视化特征重要性
        plt.figure(figsize=(12, 8))
        top_features = importance_df.head(15)
        plt.barh(range(len(top_features)), top_features['importance'], color='skyblue')
        plt.yticks(range(len(top_features)), top_features['feature'])
        plt.xlabel('特征重要性')
        plt.title(f'{model_name} - Top 15 特征重要性')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
        
        print("Top 10 重要特征:")
        for i, (_, row) in enumerate(importance_df.head(10).iterrows(), 1):
            print(f"{i}. {row['feature']}: {row['importance']:.4f}")
    
    else:
        print(f"模型 {model_name} 不支持特征重要性分析")

# 添加方法到类
TimeSeriesForecasting.train_forecasting_models = train_forecasting_models
TimeSeriesForecasting.visualize_forecasting_results = visualize_forecasting_results
TimeSeriesForecasting.feature_importance_analysis = feature_importance_analysis

# 训练模型
results, X_train, X_test, y_train, y_test = ts_forecaster.train_forecasting_models(X, y)

# 可视化结果
ts_forecaster.visualize_forecasting_results(results, X_train, X_test, y_train, y_test)

# 特征重要性分析
ts_forecaster.feature_importance_analysis('Random Forest')
ts_forecaster.feature_importance_analysis('Gradient Boosting')
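
对于没有 feature_importances_ 属性的模型(如线性回归和 SVR),可以改用与模型无关的置换重要性。下面是一个示意,沿用上面训练得到的 Ridge 模型与测试集:

from sklearn.inspection import permutation_importance

# 置换重要性:逐个打乱特征并观察测试误差的恶化程度
perm = permutation_importance(ts_forecaster.models['Ridge Regression'],
                              X_test, y_test, n_repeats=10, random_state=42,
                              scoring='neg_mean_absolute_error')
perm_series = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_series.sort_values(ascending=False).head(10))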

10.4.4 预测示例

训练好的模型可以用递归方式做多步预测:每预测一步,就把预测值回填为下一步的滞后特征,再继续向前滚动。

def make_future_predictions(self, n_steps=30):
    """进行未来预测"""
    print(f"=== 未来 {n_steps} 天预测 ===")
    
    # 选择最佳模型(使用训练阶段保存在 self.results 中的评估结果)
    best_model_name = min(self.results.keys(), key=lambda x: self.results[x]['test_mae'])
    best_model = self.models[best_model_name]
    
    print(f"使用最佳模型: {best_model_name}")
    
    # 获取最新的特征数据
    last_features = X.iloc[-1:].copy()
    
    # 存储预测结果
    future_predictions = []
    current_features = last_features.copy()
    
    for step in range(n_steps):
        # 进行预测
        pred = best_model.predict(current_features)[0]
        future_predictions.append(pred)
        
        # 更新特征(简化版本:只回填 lag_1;滑动统计、日期等特征保持不变,
        # 误差会随预测步数累积,实际应用中需要在每一步重算全部特征)
        if 'lag_1' in current_features.columns:
            current_features['lag_1'] = pred
        
        print(f"第 {step+1} 天预测值: {pred:.2f}")
    
    # 可视化预测结果
    plt.figure(figsize=(14, 8))
    
    # 绘制历史数据
    historical_dates = y.index[-100:]  # 最近100天
    historical_values = y.iloc[-100:]
    
    plt.plot(historical_dates, historical_values, label='历史数据', color='blue', alpha=0.7)
    
    # 绘制未来预测
    future_dates = pd.date_range(start=y.index[-1] + pd.Timedelta(days=1), 
                                periods=n_steps, freq='D')
    plt.plot(future_dates, future_predictions, label='未来预测', 
             color='red', linestyle='--', marker='o', markersize=4)
    
    plt.title(f'时间序列预测 - 未来 {n_steps} 天')
    plt.xlabel('日期')
    plt.ylabel('数值')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    return future_predictions, future_dates

# 添加方法到类
TimeSeriesForecasting.make_future_predictions = make_future_predictions

# 进行未来预测
future_preds, future_dates = ts_forecaster.make_future_predictions(30)
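
递归预测的误差会随步数累积;另一种常见策略是"直接多步预测":对每个预测步长分别构造目标、训练独立模型。下面给出一个最小示意(最后 200 天作留出集只是演示用的假设):

# 直接多步预测:为每个步长 h 单独训练一个模型,避免递归误差累积
direct_models = {}
for h in [1, 7, 30]:
    X_h, y_h = ts_forecaster.prepare_supervised_data(features_data, forecast_horizon=h)
    model_h = GradientBoostingRegressor(n_estimators=100, random_state=42)
    model_h.fit(X_h.iloc[:-200], y_h.iloc[:-200])
    holdout_mae = mean_absolute_error(y_h.iloc[-200:], model_h.predict(X_h.iloc[-200:]))
    direct_models[h] = model_h
    print(f"步长 h={h}: 留出集 MAE = {holdout_mae:.4f}")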

10.5 本章小结

10.5.1 项目总结

本章通过四个完整的实战项目,展示了Scikit-learn在不同领域的应用:

  1. 房价预测系统:回归问题的完整解决方案
  2. 客户分类与营销策略:分类问题的业务应用
  3. 推荐系统:协同过滤和矩阵分解技术
  4. 时间序列预测:时序数据的特征工程和预测

10.5.2 核心技能

通过这些项目,你掌握了:

  • 数据预处理:缺失值处理、特征缩放、编码
  • 特征工程:特征选择、构造、变换
  • 模型选择:多种算法比较和评估
  • 超参数调优:网格搜索和随机搜索
  • 模型解释:特征重要性和可视化
  • 业务应用:将技术转化为业务价值

10.5.3 最佳实践

  1. 数据质量:始终关注数据质量和完整性
  2. 特征工程:投入足够时间进行特征工程
  3. 模型验证:使用适当的验证策略
  4. 可解释性:确保模型结果可解释
  5. 业务理解:深入理解业务需求和约束

10.5.4 进阶学习

  • 深度学习:TensorFlow、PyTorch
  • 大数据处理:Spark MLlib、Dask
  • 模型部署:Flask、FastAPI、Docker
  • MLOps:模型版本控制、监控、自动化

10.5.5 练习建议

  1. 尝试其他数据集和问题类型
  2. 实现更复杂的特征工程
  3. 探索集成学习方法
  4. 学习模型部署和监控
  5. 参与Kaggle竞赛实践

通过这些实战项目的学习,你已经具备了使用Scikit-learn解决实际机器学习问题的能力。继续实践和探索,将帮助你成为更优秀的数据科学家!