本章概述
在前面的章节中,我们系统学习了Scikit-learn的各个组件和技术。本章将通过四个完整的实战项目,将所学知识综合运用到实际问题中。每个项目都包含完整的数据科学流程:问题定义、数据探索、特征工程、模型构建、评估优化和结果解释。
本章学习目标
- 掌握完整的机器学习项目流程
- 学会处理不同类型的实际问题
- 理解项目中的关键决策点
- 培养端到端的项目实施能力
项目列表
- 房价预测项目 - 回归问题实战
- 客户分类项目 - 分类问题实战
- 推荐系统项目 - 协同过滤实战
- 时间序列预测项目 - 时序分析实战
10.1 项目一:房价预测系统
10.1.1 项目背景与目标
房价预测是一个经典的回归问题,对于房地产行业、金融机构和个人购房者都具有重要意义。
项目目标:
- 构建准确的房价预测模型
- 识别影响房价的关键因素
- 提供可解释的预测结果
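在进入完整项目之前,下面先给出一个最小的回归流水线草图,演示用Pipeline与ColumnTransformer把数值特征标准化、类别特征独热编码后接入回归器的惯用写法。其中的小数据集与列名均为演示假设,与下文项目数据无关。
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# 演示用的假想小数据集(price单位:万元)
demo = pd.DataFrame({
    'area': [80, 120, 95, 150, 60, 110],
    'district': ['市中心', '新区', '郊区', '市中心', '郊区', '新区'],
    'price': [320, 400, 250, 560, 180, 380],
})
X_demo, y_demo = demo[['area', 'district']], demo['price']

# 数值列标准化、类别列独热编码,再接随机森林回归
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['area']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['district']),
])
pipe = Pipeline([('prep', preprocess),
                 ('model', RandomForestRegressor(n_estimators=50, random_state=42))])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
pipe.fit(X_tr, y_tr)
print(pipe.predict(X_te))  # 对测试样本输出预测房价
Pipeline的好处是把预处理与模型绑定在一起,交叉验证和调参时不会把测试集信息泄漏进scaler。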
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')
# 设置中文字体和图表样式
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
sns.set_style("whitegrid")
class HousePricePrediction:
def __init__(self):
self.models = {}
self.scaler = StandardScaler()
self.feature_names = []
self.best_model = None
def create_house_dataset(self):
"""创建房价数据集"""
print("=== 创建房价数据集 ===")
np.random.seed(42)
n_samples = 3000
# 基础特征
data = {
# 房屋基本信息
'area': np.random.normal(120, 40, n_samples), # 面积
'bedrooms': np.random.poisson(3, n_samples), # 卧室数
'bathrooms': np.random.poisson(2, n_samples), # 浴室数
'floors': np.random.choice([1, 2, 3], n_samples, p=[0.3, 0.6, 0.1]), # 楼层数
'age': np.random.exponential(10, n_samples), # 房龄
# 位置特征
'distance_to_center': np.random.exponential(15, n_samples), # 距市中心距离
'distance_to_subway': np.random.exponential(2, n_samples), # 距地铁距离
'distance_to_school': np.random.exponential(3, n_samples), # 距学校距离
# 类别特征
'district': np.random.choice(['市中心', '新区', '郊区'], n_samples, p=[0.2, 0.5, 0.3]),
'property_type': np.random.choice(['公寓', '别墅', '联排'], n_samples, p=[0.7, 0.2, 0.1]),
'decoration': np.random.choice(['毛坯', '简装', '精装'], n_samples, p=[0.3, 0.4, 0.3]),
# 环境特征
'green_ratio': np.random.beta(3, 2, n_samples), # 绿化率
'parking_ratio': np.random.beta(2, 3, n_samples), # 停车位比例
}
# 创建DataFrame
df = pd.DataFrame(data)
# 数据清理
df['area'] = np.clip(df['area'], 50, 300)
df['bedrooms'] = np.clip(df['bedrooms'], 1, 6)
df['bathrooms'] = np.clip(df['bathrooms'], 1, 4)
df['age'] = np.clip(df['age'], 0, 50)
df['distance_to_center'] = np.clip(df['distance_to_center'], 1, 50)
df['distance_to_subway'] = np.clip(df['distance_to_subway'], 0.1, 10)
df['distance_to_school'] = np.clip(df['distance_to_school'], 0.1, 15)
# 生成房价(目标变量)
# 基础价格
base_price = 50000 # 每平米基础价格
# 面积影响(非线性)
area_effect = df['area'] * (1 + 0.001 * df['area'])
# 位置影响
district_effect = df['district'].map({'市中心': 1.5, '新区': 1.0, '郊区': 0.7})
distance_effect = 1 / (1 + 0.05 * df['distance_to_center'])
subway_effect = 1 / (1 + 0.1 * df['distance_to_subway'])
# 房屋特征影响
room_effect = 1 + 0.1 * df['bedrooms'] + 0.05 * df['bathrooms']
floor_effect = df['floors'].map({1: 0.9, 2: 1.0, 3: 1.1})
age_effect = np.exp(-0.02 * df['age']) # 房龄影响
# 装修影响
decoration_effect = df['decoration'].map({'毛坯': 0.8, '简装': 1.0, '精装': 1.3})
# 物业类型影响
property_effect = df['property_type'].map({'公寓': 1.0, '别墅': 1.8, '联排': 1.4})
# 环境影响
environment_effect = 1 + 0.2 * df['green_ratio'] + 0.1 * df['parking_ratio']
# 计算总价
total_price = (base_price * area_effect * district_effect * distance_effect *
subway_effect * room_effect * floor_effect * age_effect *
decoration_effect * property_effect * environment_effect)
# 添加噪声
noise = np.random.normal(1, 0.1, n_samples)
df['price'] = total_price * noise
# 确保价格合理
df['price'] = np.clip(df['price'], 100000, 50000000)
print(f"数据集形状: {df.shape}")
print(f"房价统计:")
print(df['price'].describe())
# 可视化数据集
self.visualize_house_dataset(df)
return df
def visualize_house_dataset(self, df):
"""可视化房价数据集"""
fig, axes = plt.subplots(3, 4, figsize=(20, 15))
# 1. 房价分布
axes[0, 0].hist(df['price']/10000, bins=50, alpha=0.7, color='skyblue')
axes[0, 0].set_title('房价分布(万元)')
axes[0, 0].set_xlabel('房价(万元)')
axes[0, 0].set_ylabel('频率')
axes[0, 0].grid(True, alpha=0.3)
# 2. 面积vs房价
axes[0, 1].scatter(df['area'], df['price']/10000, alpha=0.5)
axes[0, 1].set_title('面积vs房价')
axes[0, 1].set_xlabel('面积(平米)')
axes[0, 1].set_ylabel('房价(万元)')
axes[0, 1].grid(True, alpha=0.3)
# 3. 区域vs房价
district_price = df.groupby('district')['price'].mean() / 10000
axes[0, 2].bar(district_price.index, district_price.values, alpha=0.7)
axes[0, 2].set_title('区域vs平均房价')
axes[0, 2].set_xlabel('区域')
axes[0, 2].set_ylabel('平均房价(万元)')
axes[0, 2].grid(True, alpha=0.3)
# 4. 房龄vs房价
axes[0, 3].scatter(df['age'], df['price']/10000, alpha=0.5)
axes[0, 3].set_title('房龄vs房价')
axes[0, 3].set_xlabel('房龄(年)')
axes[0, 3].set_ylabel('房价(万元)')
axes[0, 3].grid(True, alpha=0.3)
# 5. 卧室数vs房价
bedroom_price = df.groupby('bedrooms')['price'].mean() / 10000
axes[1, 0].bar(bedroom_price.index, bedroom_price.values, alpha=0.7)
axes[1, 0].set_title('卧室数vs平均房价')
axes[1, 0].set_xlabel('卧室数')
axes[1, 0].set_ylabel('平均房价(万元)')
axes[1, 0].grid(True, alpha=0.3)
# 6. 距市中心距离vs房价
axes[1, 1].scatter(df['distance_to_center'], df['price']/10000, alpha=0.5)
axes[1, 1].set_title('距市中心距离vs房价')
axes[1, 1].set_xlabel('距离(公里)')
axes[1, 1].set_ylabel('房价(万元)')
axes[1, 1].grid(True, alpha=0.3)
# 7. 装修情况vs房价
decoration_price = df.groupby('decoration')['price'].mean() / 10000
axes[1, 2].bar(decoration_price.index, decoration_price.values, alpha=0.7)
axes[1, 2].set_title('装修情况vs平均房价')
axes[1, 2].set_xlabel('装修情况')
axes[1, 2].set_ylabel('平均房价(万元)')
axes[1, 2].grid(True, alpha=0.3)
# 8. 物业类型vs房价
property_price = df.groupby('property_type')['price'].mean() / 10000
axes[1, 3].bar(property_price.index, property_price.values, alpha=0.7)
axes[1, 3].set_title('物业类型vs平均房价')
axes[1, 3].set_xlabel('物业类型')
axes[1, 3].set_ylabel('平均房价(万元)')
axes[1, 3].grid(True, alpha=0.3)
# 9. 相关性热图
numerical_features = ['area', 'bedrooms', 'bathrooms', 'floors', 'age',
'distance_to_center', 'distance_to_subway', 'distance_to_school',
'green_ratio', 'parking_ratio', 'price']
corr_matrix = df[numerical_features].corr()
im = axes[2, 0].imshow(corr_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)
axes[2, 0].set_xticks(range(len(corr_matrix.columns)))
axes[2, 0].set_yticks(range(len(corr_matrix.columns)))
axes[2, 0].set_xticklabels([col.replace('_', '\n') for col in corr_matrix.columns],
rotation=45, fontsize=8)
axes[2, 0].set_yticklabels([col.replace('_', '\n') for col in corr_matrix.columns],
fontsize=8)
axes[2, 0].set_title('特征相关性热图')
# 10. 绿化率vs房价
axes[2, 1].scatter(df['green_ratio'], df['price']/10000, alpha=0.5)
axes[2, 1].set_title('绿化率vs房价')
axes[2, 1].set_xlabel('绿化率')
axes[2, 1].set_ylabel('房价(万元)')
axes[2, 1].grid(True, alpha=0.3)
# 11. 停车位比例vs房价
axes[2, 2].scatter(df['parking_ratio'], df['price']/10000, alpha=0.5)
axes[2, 2].set_title('停车位比例vs房价')
axes[2, 2].set_xlabel('停车位比例')
axes[2, 2].set_ylabel('房价(万元)')
axes[2, 2].grid(True, alpha=0.3)
# 12. 楼层数vs房价
floor_price = df.groupby('floors')['price'].mean() / 10000
axes[2, 3].bar(floor_price.index, floor_price.values, alpha=0.7)
axes[2, 3].set_title('楼层数vs平均房价')
axes[2, 3].set_xlabel('楼层数')
axes[2, 3].set_ylabel('平均房价(万元)')
axes[2, 3].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def feature_engineering(self, df):
"""特征工程"""
print("\n=== 特征工程 ===")
df_processed = df.copy()
# 1. 创建新特征
# 房屋总房间数
df_processed['total_rooms'] = df_processed['bedrooms'] + df_processed['bathrooms']
# 每房间平均面积
df_processed['area_per_room'] = df_processed['area'] / df_processed['total_rooms']
# 便利性评分(距离的倒数)
df_processed['convenience_score'] = (1 / (1 + df_processed['distance_to_subway']) +
1 / (1 + df_processed['distance_to_school']) +
1 / (1 + df_processed['distance_to_center']))
# 环境评分
df_processed['environment_score'] = (df_processed['green_ratio'] +
df_processed['parking_ratio']) / 2
# 房屋新旧程度(年龄分组)
df_processed['age_group'] = pd.cut(df_processed['age'],
bins=[0, 5, 15, 30, 50],
labels=['新房', '次新', '中等', '老房'])
# 面积分组
df_processed['area_group'] = pd.cut(df_processed['area'],
bins=[0, 80, 120, 180, 300],
labels=['小户型', '中户型', '大户型', '豪宅'])
# 2. 类别特征编码
categorical_features = ['district', 'property_type', 'decoration', 'age_group', 'area_group']
df_encoded = pd.get_dummies(df_processed, columns=categorical_features, prefix=categorical_features)
# 3. 数值特征的标准化统一放在train_models中进行,
# 以便只用训练集拟合scaler、避免测试集信息泄漏,这里不再单独处理
# 保存特征名称
self.feature_names = [col for col in df_encoded.columns if col != 'price']
print(f"原始特征数: {len(df.columns) - 1}")
print(f"工程后特征数: {len(self.feature_names)}")
print(f"新增特征: {len(self.feature_names) - len(df.columns) + 1}")
return df_encoded
def train_models(self, df):
"""训练多个模型"""
print("\n=== 模型训练与比较 ===")
# 准备数据
X = df[self.feature_names]
y = df['price']
# 划分数据集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 特征标准化
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
# 定义模型
models = {
'线性回归': LinearRegression(),
'Ridge回归': Ridge(alpha=1.0),
'Lasso回归': Lasso(alpha=1.0),
'弹性网络': ElasticNet(alpha=1.0, l1_ratio=0.5),
'随机森林': RandomForestRegressor(n_estimators=100, random_state=42),
'梯度提升': GradientBoostingRegressor(n_estimators=100, random_state=42),
'SVR': SVR(kernel='rbf', C=1.0)
}
# 训练和评估模型
results = {}
for name, model in models.items():
print(f"\n训练 {name}...")
# 对于需要标准化的模型使用标准化数据
if name in ['线性回归', 'Ridge回归', 'Lasso回归', '弹性网络', 'SVR']:
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
# 交叉验证
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5,
scoring='neg_mean_squared_error')
else:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# 交叉验证
cv_scores = cross_val_score(model, X_train, y_train, cv=5,
scoring='neg_mean_squared_error')
# 计算评估指标
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
cv_rmse = np.sqrt(-cv_scores.mean())
results[name] = {
'model': model,
'mse': mse,
'rmse': rmse,
'mae': mae,
'r2': r2,
'cv_rmse': cv_rmse,
'y_pred': y_pred
}
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R²: {r2:.4f}")
print(f"CV RMSE: {cv_rmse:.2f}")
# 保存结果
self.models = results
# 可视化模型比较
self.visualize_model_comparison(results, y_test)
return results, X_test, y_test
def hyperparameter_tuning(self, X_train, y_train):
"""超参数调优"""
print("\n=== 超参数调优 ===")
# 对表现最好的几个模型进行调优
tuning_configs = {
'RandomForest': {
'model': RandomForestRegressor(random_state=42),
'params': {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
},
'GradientBoosting': {
'model': GradientBoostingRegressor(random_state=42),
'params': {
'n_estimators': [50, 100, 200],
'learning_rate': [0.05, 0.1, 0.2],
'max_depth': [3, 5, 7],
'subsample': [0.8, 0.9, 1.0]
}
}
}
best_models = {}
for name, config in tuning_configs.items():
print(f"\n调优 {name}...")
grid_search = GridSearchCV(
config['model'],
config['params'],
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
best_models[name] = {
'model': grid_search.best_estimator_,
'best_params': grid_search.best_params_,
'best_score': -grid_search.best_score_
}
print(f"最佳参数: {grid_search.best_params_}")
print(f"最佳CV RMSE: {np.sqrt(-grid_search.best_score_):.2f}")
return best_models
def model_interpretation(self, df):
"""模型解释"""
print("\n=== 模型解释 ===")
# 使用最佳模型进行解释
best_model_name = min(self.models.keys(),
key=lambda x: self.models[x]['rmse'])
best_model = self.models[best_model_name]['model']
print(f"最佳模型: {best_model_name}")
# 准备数据
X = df[self.feature_names]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 特征重要性分析
if hasattr(best_model, 'feature_importances_'):
# 基于树的模型
feature_importance = best_model.feature_importances_
importance_df = pd.DataFrame({
'feature': self.feature_names,
'importance': feature_importance
}).sort_values('importance', ascending=False)
print("\nTop 15 重要特征:")
print(importance_df.head(15))
elif hasattr(best_model, 'coef_'):
# 线性模型
feature_importance = np.abs(best_model.coef_)
importance_df = pd.DataFrame({
'feature': self.feature_names,
'importance': feature_importance
}).sort_values('importance', ascending=False)
print("\nTop 15 重要特征:")
print(importance_df.head(15))
# 排列重要性
if best_model_name not in ['线性回归', 'Ridge回归', 'Lasso回归', '弹性网络', 'SVR']:
perm_importance = permutation_importance(best_model, X_test, y_test,
n_repeats=10, random_state=42)
perm_importance_df = pd.DataFrame({
'feature': self.feature_names,
'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)
print("\n排列重要性 Top 15:")
print(perm_importance_df.head(15))
# 可视化特征重要性
self.visualize_feature_importance(importance_df)
# 预测示例
self.prediction_examples(best_model, X_test, y_test)
return best_model, importance_df
def visualize_model_comparison(self, results, y_test):
"""可视化模型比较"""
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
model_names = list(results.keys())
# 1. RMSE比较
rmse_values = [results[name]['rmse'] for name in model_names]
axes[0, 0].bar(range(len(model_names)), rmse_values, alpha=0.7)
axes[0, 0].set_title('模型RMSE比较')
axes[0, 0].set_xlabel('模型')
axes[0, 0].set_ylabel('RMSE')
axes[0, 0].set_xticks(range(len(model_names)))
axes[0, 0].set_xticklabels(model_names, rotation=45)
axes[0, 0].grid(True, alpha=0.3)
# 2. R²比较
r2_values = [results[name]['r2'] for name in model_names]
axes[0, 1].bar(range(len(model_names)), r2_values, alpha=0.7)
axes[0, 1].set_title('模型R²比较')
axes[0, 1].set_xlabel('模型')
axes[0, 1].set_ylabel('R²')
axes[0, 1].set_xticks(range(len(model_names)))
axes[0, 1].set_xticklabels(model_names, rotation=45)
axes[0, 1].grid(True, alpha=0.3)
# 3. MAE比较
mae_values = [results[name]['mae'] for name in model_names]
axes[0, 2].bar(range(len(model_names)), mae_values, alpha=0.7)
axes[0, 2].set_title('模型MAE比较')
axes[0, 2].set_xlabel('模型')
axes[0, 2].set_ylabel('MAE')
axes[0, 2].set_xticks(range(len(model_names)))
axes[0, 2].set_xticklabels(model_names, rotation=45)
axes[0, 2].grid(True, alpha=0.3)
# 4. 预测vs实际(最佳模型)
best_model_name = min(model_names, key=lambda x: results[x]['rmse'])
best_predictions = results[best_model_name]['y_pred']
axes[1, 0].scatter(y_test, best_predictions, alpha=0.5)
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1, 0].set_title(f'{best_model_name} - 预测vs实际')
axes[1, 0].set_xlabel('实际价格')
axes[1, 0].set_ylabel('预测价格')
axes[1, 0].grid(True, alpha=0.3)
# 5. 残差分布(最佳模型)
residuals = y_test - best_predictions
axes[1, 1].hist(residuals, bins=30, alpha=0.7)
axes[1, 1].set_title(f'{best_model_name} - 残差分布')
axes[1, 1].set_xlabel('残差')
axes[1, 1].set_ylabel('频率')
axes[1, 1].grid(True, alpha=0.3)
# 6. 残差vs预测值(最佳模型)
axes[1, 2].scatter(best_predictions, residuals, alpha=0.5)
axes[1, 2].axhline(y=0, color='r', linestyle='--')
axes[1, 2].set_title(f'{best_model_name} - 残差vs预测值')
axes[1, 2].set_xlabel('预测价格')
axes[1, 2].set_ylabel('残差')
axes[1, 2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def visualize_feature_importance(self, importance_df):
"""可视化特征重要性"""
plt.figure(figsize=(12, 8))
# 选择Top 15特征
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'], alpha=0.7)
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('重要性')
plt.title('Top 15 特征重要性')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def prediction_examples(self, model, X_test, y_test):
"""预测示例"""
print("\n=== 预测示例 ===")
# 随机选择几个样本进行预测
sample_indices = np.random.choice(len(X_test), 5, replace=False)
for i, idx in enumerate(sample_indices):
sample = X_test.iloc[idx:idx+1]
actual_price = y_test.iloc[idx]
if hasattr(model, 'predict'):
if hasattr(model, 'feature_importances_'):
# 树模型,直接预测
predicted_price = model.predict(sample)[0]
else:
# 线性模型,需要标准化
sample_scaled = self.scaler.transform(sample)
predicted_price = model.predict(sample_scaled)[0]
error = abs(actual_price - predicted_price)
error_rate = error / actual_price * 100
print(f"\n样本 {i+1}:")
print(f"实际价格: {actual_price:,.0f} 元")
print(f"预测价格: {predicted_price:,.0f} 元")
print(f"误差: {error:,.0f} 元 ({error_rate:.1f}%)")
# 演示房价预测项目
print("=== 房价预测项目实战 ===")
house_predictor = HousePricePrediction()
# 1. 创建数据集
house_data = house_predictor.create_house_dataset()
# 2. 特征工程
processed_data = house_predictor.feature_engineering(house_data)
# 3. 模型训练与比较
model_results, X_test, y_test = house_predictor.train_models(processed_data)
# 4. 模型解释
best_model, feature_importance = house_predictor.model_interpretation(processed_data)
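上面定义的hyperparameter_tuning方法未在主流程中被调用,下面补充一个调用示例草图(数据划分方式与train_models保持一致;网格搜索较耗时,可按需运行):
# 5. 超参数调优(可选,耗时较长)
X_all = processed_data[house_predictor.feature_names]
y_all = processed_data['price']
X_tr, _, y_tr, _ = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
tuned_models = house_predictor.hyperparameter_tuning(X_tr, y_tr)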
10.2 项目二:客户分类系统
10.2.1 项目背景与目标
客户分类是企业进行精准营销和客户管理的重要手段。通过分析客户的行为特征,可以将客户分为不同类别,制定针对性的营销策略。
项目目标:
- 构建客户分类模型
- 识别不同类型客户的特征
- 为营销策略提供数据支持
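在构建监督分类模型之前,客户分层常用的一个起点是RFM打分(最近一次消费Recency、消费频率Frequency、消费金额Monetary)。下面是一个基于假想字段的打分草图,字段取值与分箱数均为演示假设:
import pandas as pd

# 假想的客户行为表(字段含义见注释)
rfm = pd.DataFrame({
    'last_purchase_days': [5, 40, 200, 15, 90],    # R:距上次购买天数,越小越好
    'total_purchases': [30, 12, 2, 25, 6],         # F:购买次数,越大越好
    'avg_order_value': [800, 300, 60, 500, 150],   # M:平均订单金额,越大越好
})
# 按分位数把每个维度切成1~3分(R为反向计分;rank避免重复分箱边界)
rfm['R'] = pd.qcut(rfm['last_purchase_days'], 3, labels=[3, 2, 1]).astype(int)
rfm['F'] = pd.qcut(rfm['total_purchases'].rank(method='first'), 3, labels=[1, 2, 3]).astype(int)
rfm['M'] = pd.qcut(rfm['avg_order_value'], 3, labels=[1, 2, 3]).astype(int)
rfm['RFM_score'] = rfm['R'] + rfm['F'] + rfm['M']
print(rfm)  # 总分越高,客户价值越高
下面的项目则更进一步,用带标签的数据训练监督分类模型。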
10.2.2 数据集创建与探索
class CustomerSegmentation:
def __init__(self):
self.models = {}
self.scaler = StandardScaler()
self.label_encoders = {}
self.feature_names = []
def create_customer_dataset(self):
"""创建客户数据集"""
print("=== 创建客户数据集 ===")
np.random.seed(42)
n_samples = 5000
# 定义客户类型(隐藏标签,用于生成数据)
customer_types = np.random.choice(['高价值', '中价值', '低价值', '流失风险'],
n_samples, p=[0.15, 0.35, 0.35, 0.15])
# 基础特征
data = {
# 人口统计特征
'age': np.random.normal(40, 15, n_samples),
'gender': np.random.choice(['男', '女'], n_samples),
'education': np.random.choice(['高中', '本科', '硕士', '博士'], n_samples,
p=[0.3, 0.5, 0.15, 0.05]),
'income': np.random.lognormal(10, 0.8, n_samples),
'city_tier': np.random.choice(['一线', '二线', '三线'], n_samples, p=[0.3, 0.4, 0.3]),
# 行为特征
'tenure_months': np.random.exponential(24, n_samples), # 客户生命周期
'total_purchases': np.random.poisson(10, n_samples), # 总购买次数
'avg_order_value': np.random.lognormal(6, 0.5, n_samples), # 平均订单价值
'last_purchase_days': np.random.exponential(30, n_samples), # 距上次购买天数
# 偏好特征
'preferred_category': np.random.choice(['电子产品', '服装', '家居', '美妆', '运动'],
n_samples, p=[0.25, 0.25, 0.2, 0.15, 0.15]),
'channel_preference': np.random.choice(['线上', '线下', '混合'], n_samples,
p=[0.5, 0.3, 0.2]),
# 互动特征
'website_visits': np.random.poisson(15, n_samples), # 网站访问次数
'email_opens': np.random.poisson(8, n_samples), # 邮件打开次数
'customer_service_calls': np.random.poisson(2, n_samples), # 客服电话次数
'social_media_engagement': np.random.beta(2, 5, n_samples), # 社交媒体参与度
}
# 创建DataFrame
df = pd.DataFrame(data)
# 数据清理
df['age'] = np.clip(df['age'], 18, 80)
df['income'] = np.clip(df['income'], 20000, 500000)
df['tenure_months'] = np.clip(df['tenure_months'], 1, 120)
df['total_purchases'] = np.clip(df['total_purchases'], 0, 100)
df['avg_order_value'] = np.clip(df['avg_order_value'], 50, 5000)
df['last_purchase_days'] = np.clip(df['last_purchase_days'], 0, 365)
df['website_visits'] = np.clip(df['website_visits'], 0, 100)
df['email_opens'] = np.clip(df['email_opens'], 0, 50)
df['customer_service_calls'] = np.clip(df['customer_service_calls'], 0, 20)
# 根据客户类型调整特征(模拟真实关系)
for i, customer_type in enumerate(customer_types):
if customer_type == '高价值':
df.loc[i, 'income'] *= 1.5
df.loc[i, 'avg_order_value'] *= 1.8
df.loc[i, 'total_purchases'] *= 1.5
df.loc[i, 'tenure_months'] *= 1.3
df.loc[i, 'last_purchase_days'] *= 0.5
elif customer_type == '中价值':
df.loc[i, 'income'] *= 1.1
df.loc[i, 'avg_order_value'] *= 1.2
df.loc[i, 'total_purchases'] *= 1.1
elif customer_type == '低价值':
df.loc[i, 'income'] *= 0.8
df.loc[i, 'avg_order_value'] *= 0.7
df.loc[i, 'total_purchases'] *= 0.8
elif customer_type == '流失风险':
df.loc[i, 'last_purchase_days'] *= 3
df.loc[i, 'website_visits'] *= 0.3
df.loc[i, 'email_opens'] *= 0.2
df.loc[i, 'social_media_engagement'] *= 0.3
# 重新应用数据范围限制
df['income'] = np.clip(df['income'], 20000, 500000)
df['avg_order_value'] = np.clip(df['avg_order_value'], 50, 5000)
df['total_purchases'] = np.clip(df['total_purchases'], 0, 100)
df['tenure_months'] = np.clip(df['tenure_months'], 1, 120)
df['last_purchase_days'] = np.clip(df['last_purchase_days'], 0, 365)
df['website_visits'] = np.clip(df['website_visits'], 0, 100)
df['email_opens'] = np.clip(df['email_opens'], 0, 50)
# 添加目标变量
df['customer_type'] = customer_types
print(f"数据集形状: {df.shape}")
print(f"客户类型分布:")
print(df['customer_type'].value_counts())
# 可视化数据集
self.visualize_customer_dataset(df)
return df
def visualize_customer_dataset(self, df):
"""可视化客户数据集"""
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
fig.suptitle('客户数据集探索性分析', fontsize=16, fontweight='bold')
# 1. 客户类型分布
df['customer_type'].value_counts().plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('客户类型分布')
axes[0,0].set_xlabel('客户类型')
axes[0,0].set_ylabel('数量')
axes[0,0].tick_params(axis='x', rotation=45)
# 2. 收入分布
for i, customer_type in enumerate(df['customer_type'].unique()):
subset = df[df['customer_type'] == customer_type]['income']
axes[0,1].hist(subset, alpha=0.7, label=customer_type, bins=20)
axes[0,1].set_title('不同客户类型的收入分布')
axes[0,1].set_xlabel('收入')
axes[0,1].set_ylabel('频次')
axes[0,1].legend()
# 3. 平均订单价值分布
sns.boxplot(data=df, x='customer_type', y='avg_order_value', ax=axes[0,2])
axes[0,2].set_title('平均订单价值分布')
axes[0,2].tick_params(axis='x', rotation=45)
# 4. 总购买次数分布
sns.boxplot(data=df, x='customer_type', y='total_purchases', ax=axes[1,0])
axes[1,0].set_title('总购买次数分布')
axes[1,0].tick_params(axis='x', rotation=45)
# 5. 客户任期分布
sns.violinplot(data=df, x='customer_type', y='tenure_months', ax=axes[1,1])
axes[1,1].set_title('客户任期分布')
axes[1,1].tick_params(axis='x', rotation=45)
# 6. 最后购买天数分布
sns.boxplot(data=df, x='customer_type', y='last_purchase_days', ax=axes[1,2])
axes[1,2].set_title('最后购买天数分布')
axes[1,2].tick_params(axis='x', rotation=45)
# 7. 网站访问次数分布
sns.boxplot(data=df, x='customer_type', y='website_visits', ax=axes[2,0])
axes[2,0].set_title('网站访问次数分布')
axes[2,0].tick_params(axis='x', rotation=45)
# 8. 邮件打开次数分布
sns.boxplot(data=df, x='customer_type', y='email_opens', ax=axes[2,1])
axes[2,1].set_title('邮件打开次数分布')
axes[2,1].tick_params(axis='x', rotation=45)
# 9. 社交媒体参与度分布
sns.boxplot(data=df, x='customer_type', y='social_media_engagement', ax=axes[2,2])
axes[2,2].set_title('社交媒体参与度分布')
axes[2,2].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
# 相关性热力图
plt.figure(figsize=(12, 10))
numeric_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5)
plt.title('特征相关性热力图', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# 创建客户分类项目实例
customer_project = CustomerSegmentation()
# 创建数据集
print("=== 创建客户数据集 ===")
customer_data = customer_project.create_customer_dataset()
10.2.3 数据预处理与特征工程
def preprocess_customer_data(self, df):
"""客户数据预处理"""
print("=== 数据预处理 ===")
# 分离特征和目标变量
# 注意:gender等类别特征是字符串,无法直接送入StandardScaler,
# 这里的简化处理是只保留数值特征;若要利用类别特征,应先做独热编码
X = df.drop('customer_type', axis=1).select_dtypes(include=[np.number])
y = df['customer_type']
# 编码目标变量
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(f"特征数量: {X.shape[1]}")
print(f"样本数量: {X.shape[0]}")
print(f"类别数量: {len(le.classes_)}")
print(f"类别映射: {dict(zip(le.classes_, range(len(le.classes_))))}")
# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)
# 特征标准化
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"训练集形状: {X_train_scaled.shape}")
print(f"测试集形状: {X_test_scaled.shape}")
return X_train_scaled, X_test_scaled, y_train, y_test, le, scaler
# 添加方法到类
CustomerSegmentation.preprocess_customer_data = preprocess_customer_data
# 数据预处理
X_train, X_test, y_train, y_test, label_encoder, scaler = customer_project.preprocess_customer_data(customer_data)
10.2.4 模型训练与比较
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)
def train_classification_models(self, X_train, X_test, y_train, y_test):
"""训练多种分类模型"""
print("=== 模型训练与比较 ===")
# 定义模型
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'SVM': SVC(random_state=42, probability=True),
'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}
results = {}
for name, model in models.items():
print(f"\n训练 {name}...")
# 训练模型
model.fit(X_train, y_train)
# 预测
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test) if hasattr(model, 'predict_proba') else None
# 评估指标
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
# 交叉验证
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
results[name] = {
'model': model,
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1_score': f1,
'cv_mean': cv_scores.mean(),
'cv_std': cv_scores.std(),
'y_pred': y_pred,
'y_pred_proba': y_pred_proba
}
print(f"准确率: {accuracy:.4f}")
print(f"精确率: {precision:.4f}")
print(f"召回率: {recall:.4f}")
print(f"F1分数: {f1:.4f}")
print(f"交叉验证: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
return results
# 添加方法到类
CustomerSegmentation.train_classification_models = train_classification_models
# 训练模型
model_results = customer_project.train_classification_models(X_train, X_test, y_train, y_test)
10.2.5 模型评估与可视化
def visualize_classification_results(self, results, y_test, label_encoder):
"""可视化分类结果"""
print("=== 模型评估可视化 ===")
# 1. 模型性能比较
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('客户分类模型性能比较', fontsize=16, fontweight='bold')
# 准确率比较
models = list(results.keys())
accuracies = [results[model]['accuracy'] for model in models]
axes[0,0].bar(models, accuracies, color='skyblue')
axes[0,0].set_title('模型准确率比较')
axes[0,0].set_ylabel('准确率')
axes[0,0].tick_params(axis='x', rotation=45)
# F1分数比较
f1_scores = [results[model]['f1_score'] for model in models]
axes[0,1].bar(models, f1_scores, color='lightgreen')
axes[0,1].set_title('模型F1分数比较')
axes[0,1].set_ylabel('F1分数')
axes[0,1].tick_params(axis='x', rotation=45)
# 交叉验证分数比较
cv_means = [results[model]['cv_mean'] for model in models]
cv_stds = [results[model]['cv_std'] for model in models]
axes[1,0].bar(models, cv_means, yerr=cv_stds, capsize=5, color='orange')
axes[1,0].set_title('交叉验证分数比较')
axes[1,0].set_ylabel('CV准确率')
axes[1,0].tick_params(axis='x', rotation=45)
# 综合指标雷达图(雷达图需要极坐标,这里把axes[1,1]替换为极坐标子图)
metrics = ['accuracy', 'precision', 'recall', 'f1_score']
best_model = max(results.keys(), key=lambda x: results[x]['accuracy'])
values = [results[best_model][metric] for metric in metrics]
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False)
values += values[:1] # 闭合图形
angles = np.concatenate((angles, [angles[0]]))
axes[1,1].remove()
ax_radar = fig.add_subplot(2, 2, 4, projection='polar')
ax_radar.plot(angles, values, 'o-', linewidth=2, label=best_model)
ax_radar.fill(angles, values, alpha=0.25)
ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(metrics)
ax_radar.set_ylim(0, 1)
ax_radar.set_title(f'最佳模型 ({best_model}) 性能雷达图')
ax_radar.grid(True)
plt.tight_layout()
plt.show()
# 2. 混淆矩阵
best_model_name = max(results.keys(), key=lambda x: results[x]['accuracy'])
y_pred_best = results[best_model_name]['y_pred']
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=label_encoder.classes_,
yticklabels=label_encoder.classes_)
plt.title(f'混淆矩阵 - {best_model_name}', fontsize=14, fontweight='bold')
plt.xlabel('预测类别')
plt.ylabel('真实类别')
plt.tight_layout()
plt.show()
# 3. 分类报告
print(f"\n=== {best_model_name} 详细分类报告 ===")
print(classification_report(y_test, y_pred_best,
target_names=label_encoder.classes_))
# 添加方法到类
CustomerSegmentation.visualize_classification_results = visualize_classification_results
# 可视化结果
customer_project.visualize_classification_results(model_results, y_test, label_encoder)
10.2.6 特征重要性分析
def analyze_feature_importance(self, results, feature_names):
"""分析特征重要性"""
print("=== 特征重要性分析 ===")
# 获取随机森林的特征重要性
rf_model = results['Random Forest']['model']
rf_importance = rf_model.feature_importances_
# 获取梯度提升的特征重要性
gb_model = results['Gradient Boosting']['model']
gb_importance = gb_model.feature_importances_
# 创建特征重要性DataFrame
importance_df = pd.DataFrame({
'Feature': feature_names,
'Random_Forest': rf_importance,
'Gradient_Boosting': gb_importance
})
# 计算平均重要性
importance_df['Average'] = (importance_df['Random_Forest'] +
importance_df['Gradient_Boosting']) / 2
# 按平均重要性排序
importance_df = importance_df.sort_values('Average', ascending=False)
print("特征重要性排名:")
print(importance_df)
# 可视化特征重要性
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# 随机森林特征重要性
top_features_rf = importance_df.head(10)
axes[0].barh(range(len(top_features_rf)), top_features_rf['Random_Forest'])
axes[0].set_yticks(range(len(top_features_rf)))
axes[0].set_yticklabels(top_features_rf['Feature'])
axes[0].set_title('随机森林 - 特征重要性 (Top 10)')
axes[0].set_xlabel('重要性')
# 梯度提升特征重要性
top_features_gb = importance_df.head(10)
axes[1].barh(range(len(top_features_gb)), top_features_gb['Gradient_Boosting'])
axes[1].set_yticks(range(len(top_features_gb)))
axes[1].set_yticklabels(top_features_gb['Feature'])
axes[1].set_title('梯度提升 - 特征重要性 (Top 10)')
axes[1].set_xlabel('重要性')
plt.tight_layout()
plt.show()
return importance_df
# 添加方法到类
CustomerSegmentation.analyze_feature_importance = analyze_feature_importance
# 特征重要性分析(与训练时保持一致,只取数值特征)
feature_names = customer_data.drop('customer_type', axis=1).select_dtypes(include=[np.number]).columns.tolist()
importance_analysis = customer_project.analyze_feature_importance(model_results, feature_names)
10.2.7 客户分类预测示例
def predict_customer_type(self, model, scaler, label_encoder, customer_features):
"""预测新客户类型"""
print("=== 客户类型预测示例 ===")
# 标准化特征
customer_features_scaled = scaler.transform([customer_features])
# 预测
prediction = model.predict(customer_features_scaled)[0]
prediction_proba = model.predict_proba(customer_features_scaled)[0]
# 解码预测结果
predicted_type = label_encoder.inverse_transform([prediction])[0]
print(f"客户特征: {customer_features}")
print(f"预测客户类型: {predicted_type}")
print(f"预测概率:")
for i, class_name in enumerate(label_encoder.classes_):
print(f" {class_name}: {prediction_proba[i]:.4f}")
return predicted_type, prediction_proba
# 添加方法到类
CustomerSegmentation.predict_customer_type = predict_customer_type
# 预测示例
best_model = model_results['Random Forest']['model']
# 特征顺序与训练数据的数值列一致:
# [age, income, tenure_months, total_purchases, avg_order_value,
#  last_purchase_days, website_visits, email_opens,
#  customer_service_calls, social_media_engagement]
# 示例客户1:高价值客户特征(数值为演示假设)
example_customer_1 = [45, 120000, 36, 25, 2500, 15, 25, 15, 2, 0.85]
customer_project.predict_customer_type(best_model, scaler, label_encoder, example_customer_1)
print("\n" + "="*50)
# 示例客户2:流失风险客户特征(数值为演示假设)
example_customer_2 = [35, 45000, 8, 3, 150, 180, 5, 2, 1, 0.12]
customer_project.predict_customer_type(best_model, scaler, label_encoder, example_customer_2)
10.3 项目三:推荐系统
10.3.1 项目背景与目标
推荐系统是现代互联网应用的核心组件,广泛应用于电商、视频、音乐等平台。
项目目标:
- 构建基于协同过滤的推荐系统
- 实现基于矩阵分解(SVD)的推荐算法
- 评估推荐系统性能
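在实现完整系统之前,先用一个极小的评分矩阵直观感受协同过滤的核心——用户(或物品)向量之间的余弦相似度;矩阵数值为演示假设:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 3个用户对4个物品的评分,0表示未评分(演示数据)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 5, 4]])
print(np.round(cosine_similarity(R), 3))  # 用户两两相似度矩阵
# 用户0与用户1相似度高、与用户2低,
# 因此可把用户1喜欢而用户0未评分的物品优先推荐给用户0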
class RecommendationSystem:
def __init__(self):
self.user_item_matrix = None
self.item_features = None
self.user_similarity = None
self.item_similarity = None
def create_recommendation_dataset(self, n_users=1000, n_items=500, n_ratings=50000):
"""创建推荐系统数据集"""
print("=== 创建推荐系统数据集 ===")
np.random.seed(42)
# 生成用户ID和物品ID
user_ids = np.random.choice(range(1, n_users + 1), n_ratings)
item_ids = np.random.choice(range(1, n_items + 1), n_ratings)
# 生成评分(1-5分)
ratings = np.random.choice([1, 2, 3, 4, 5], n_ratings,
p=[0.1, 0.15, 0.25, 0.35, 0.15])
# 创建评分数据框
ratings_df = pd.DataFrame({
'user_id': user_ids,
'item_id': item_ids,
'rating': ratings
})
# 去除重复的用户-物品对,保留最后一次评分
ratings_df = ratings_df.drop_duplicates(subset=['user_id', 'item_id'], keep='last')
# 创建物品特征
item_features_df = pd.DataFrame({
'item_id': range(1, n_items + 1),
'category': np.random.choice(['电子产品', '服装', '书籍', '家居', '运动'], n_items),
'price': np.random.uniform(10, 1000, n_items),
'brand_popularity': np.random.uniform(0, 1, n_items),
'release_year': np.random.choice(range(2015, 2024), n_items)
})
print(f"评分数据形状: {ratings_df.shape}")
print(f"物品特征形状: {item_features_df.shape}")
print(f"用户数量: {ratings_df['user_id'].nunique()}")
print(f"物品数量: {ratings_df['item_id'].nunique()}")
print(f"评分分布:")
print(ratings_df['rating'].value_counts().sort_index())
# 可视化数据集
self.visualize_recommendation_dataset(ratings_df, item_features_df)
return ratings_df, item_features_df
def visualize_recommendation_dataset(self, ratings_df, item_features_df):
"""可视化推荐系统数据集"""
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('推荐系统数据集分析', fontsize=16, fontweight='bold')
# 1. 评分分布
ratings_df['rating'].value_counts().sort_index().plot(kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('评分分布')
axes[0,0].set_xlabel('评分')
axes[0,0].set_ylabel('数量')
# 2. 用户评分数量分布
user_rating_counts = ratings_df['user_id'].value_counts()
axes[0,1].hist(user_rating_counts, bins=30, color='lightgreen', alpha=0.7)
axes[0,1].set_title('用户评分数量分布')
axes[0,1].set_xlabel('评分数量')
axes[0,1].set_ylabel('用户数量')
# 3. 物品评分数量分布
item_rating_counts = ratings_df['item_id'].value_counts()
axes[0,2].hist(item_rating_counts, bins=30, color='orange', alpha=0.7)
axes[0,2].set_title('物品评分数量分布')
axes[0,2].set_xlabel('评分数量')
axes[0,2].set_ylabel('物品数量')
# 4. 物品类别分布
item_features_df['category'].value_counts().plot(kind='bar', ax=axes[1,0], color='purple')
axes[1,0].set_title('物品类别分布')
axes[1,0].set_xlabel('类别')
axes[1,0].set_ylabel('数量')
axes[1,0].tick_params(axis='x', rotation=45)
# 5. 物品价格分布
axes[1,1].hist(item_features_df['price'], bins=30, color='red', alpha=0.7)
axes[1,1].set_title('物品价格分布')
axes[1,1].set_xlabel('价格')
axes[1,1].set_ylabel('数量')
# 6. 发布年份分布
item_features_df['release_year'].value_counts().sort_index().plot(kind='bar', ax=axes[1,2], color='brown')
axes[1,2].set_title('物品发布年份分布')
axes[1,2].set_xlabel('年份')
axes[1,2].set_ylabel('数量')
plt.tight_layout()
plt.show()
# 创建推荐系统实例
rec_system = RecommendationSystem()
# 创建数据集
print("=== 创建推荐系统数据集 ===")
ratings_data, item_features = rec_system.create_recommendation_dataset()
10.3.2 协同过滤算法实现
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
def build_user_item_matrix(self, ratings_df):
"""构建用户-物品评分矩阵"""
print("=== 构建用户-物品矩阵 ===")
# 创建用户-物品矩阵
user_item_matrix = ratings_df.pivot(index='user_id', columns='item_id', values='rating')
user_item_matrix = user_item_matrix.fillna(0)
print(f"用户-物品矩阵形状: {user_item_matrix.shape}")
print(f"稀疏度: {(user_item_matrix == 0).sum().sum() / (user_item_matrix.shape[0] * user_item_matrix.shape[1]):.4f}")
self.user_item_matrix = user_item_matrix
return user_item_matrix
def compute_user_similarity(self, user_item_matrix):
"""计算用户相似度"""
print("=== 计算用户相似度 ===")
# 计算用户相似度矩阵
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(user_similarity,
index=user_item_matrix.index,
columns=user_item_matrix.index)
self.user_similarity = user_similarity_df
print(f"用户相似度矩阵形状: {user_similarity_df.shape}")
return user_similarity_df
def compute_item_similarity(self, user_item_matrix):
"""计算物品相似度"""
print("=== 计算物品相似度 ===")
# 计算物品相似度矩阵
item_similarity = cosine_similarity(user_item_matrix.T)
item_similarity_df = pd.DataFrame(item_similarity,
index=user_item_matrix.columns,
columns=user_item_matrix.columns)
self.item_similarity = item_similarity_df
print(f"物品相似度矩阵形状: {item_similarity_df.shape}")
return item_similarity_df
def user_based_recommendation(self, user_id, n_recommendations=10):
"""基于用户的协同过滤推荐"""
if self.user_similarity is None or self.user_item_matrix is None:
raise ValueError("请先计算用户相似度和构建用户-物品矩阵")
# 获取目标用户的相似用户
user_similarities = self.user_similarity.loc[user_id].sort_values(ascending=False)
similar_users = user_similarities.iloc[1:11] # 排除自己,取前10个相似用户
# 获取目标用户已评分的物品
user_ratings = self.user_item_matrix.loc[user_id]
rated_items = user_ratings[user_ratings > 0].index
# 计算推荐分数
recommendations = {}
for item_id in self.user_item_matrix.columns:
if item_id not in rated_items: # 只推荐未评分的物品
score = 0
similarity_sum = 0
for similar_user_id, similarity in similar_users.items():
if self.user_item_matrix.loc[similar_user_id, item_id] > 0:
score += similarity * self.user_item_matrix.loc[similar_user_id, item_id]
similarity_sum += abs(similarity)
if similarity_sum > 0:
recommendations[item_id] = score / similarity_sum
# 排序并返回前N个推荐
sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
return sorted_recommendations[:n_recommendations]
def item_based_recommendation(self, user_id, n_recommendations=10):
"""基于物品的协同过滤推荐"""
if self.item_similarity is None or self.user_item_matrix is None:
raise ValueError("请先计算物品相似度和构建用户-物品矩阵")
# 获取目标用户的评分
user_ratings = self.user_item_matrix.loc[user_id]
rated_items = user_ratings[user_ratings > 0]
# 计算推荐分数
recommendations = {}
for item_id in self.user_item_matrix.columns:
if item_id not in rated_items.index: # 只推荐未评分的物品
score = 0
similarity_sum = 0
for rated_item_id, rating in rated_items.items():
similarity = self.item_similarity.loc[item_id, rated_item_id]
score += similarity * rating
similarity_sum += abs(similarity)
if similarity_sum > 0:
recommendations[item_id] = score / similarity_sum
# 排序并返回前N个推荐
sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
return sorted_recommendations[:n_recommendations]
# 添加方法到类
RecommendationSystem.build_user_item_matrix = build_user_item_matrix
RecommendationSystem.compute_user_similarity = compute_user_similarity
RecommendationSystem.compute_item_similarity = compute_item_similarity
RecommendationSystem.user_based_recommendation = user_based_recommendation
RecommendationSystem.item_based_recommendation = item_based_recommendation
# 构建推荐系统
user_item_matrix = rec_system.build_user_item_matrix(ratings_data)
user_similarity = rec_system.compute_user_similarity(user_item_matrix)
item_similarity = rec_system.compute_item_similarity(user_item_matrix)
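相似度矩阵就绪后,即可调用上面定义的两种协同过滤方法。下面以用户1为例各取Top-5,输出为(物品ID, 推荐分数)列表:
print("User-based Top-5:", rec_system.user_based_recommendation(1, 5))
print("Item-based Top-5:", rec_system.item_based_recommendation(1, 5))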
10.3.3 矩阵分解推荐算法
def matrix_factorization_recommendation(self, ratings_df, n_components=50):
"""基于矩阵分解的推荐算法"""
print("=== 矩阵分解推荐算法 ===")
# 准备数据
user_item_matrix = self.user_item_matrix.values
# 使用SVD进行矩阵分解
svd = TruncatedSVD(n_components=n_components, random_state=42)
user_factors = svd.fit_transform(user_item_matrix)
item_factors = svd.components_.T
# 重构评分矩阵
reconstructed_matrix = np.dot(user_factors, svd.components_)
print(f"原始矩阵形状: {user_item_matrix.shape}")
print(f"用户因子矩阵形状: {user_factors.shape}")
print(f"物品因子矩阵形状: {item_factors.shape}")
print(f"重构矩阵形状: {reconstructed_matrix.shape}")
# 转换为DataFrame
reconstructed_df = pd.DataFrame(reconstructed_matrix,
index=self.user_item_matrix.index,
columns=self.user_item_matrix.columns)
return reconstructed_df, svd
def svd_recommendation(self, user_id, reconstructed_matrix, n_recommendations=10):
"""基于SVD的推荐"""
# 获取用户的原始评分和预测评分
original_ratings = self.user_item_matrix.loc[user_id]
predicted_ratings = reconstructed_matrix.loc[user_id]
# 只推荐未评分的物品
unrated_items = original_ratings[original_ratings == 0].index
recommendations = predicted_ratings[unrated_items].sort_values(ascending=False)
return recommendations.head(n_recommendations)
# 添加方法到类
RecommendationSystem.matrix_factorization_recommendation = matrix_factorization_recommendation
RecommendationSystem.svd_recommendation = svd_recommendation
# 矩阵分解推荐
reconstructed_matrix, svd_model = rec_system.matrix_factorization_recommendation(ratings_data)
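得到重构评分矩阵后,可以直接查看某个用户的SVD推荐结果,例如用户1的Top-5:
print(rec_system.svd_recommendation(1, reconstructed_matrix, 5))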
10.3.4 推荐系统评估
def evaluate_recommendation_system(self, ratings_df, test_size=0.2):
"""评估推荐系统性能"""
print("=== 推荐系统评估 ===")
# 划分训练集和测试集
train_data, test_data = train_test_split(ratings_df, test_size=test_size, random_state=42)
# 构建训练集的用户-物品矩阵
train_matrix = train_data.pivot(index='user_id', columns='item_id', values='rating').fillna(0)
# 计算相似度
user_sim = cosine_similarity(train_matrix)
item_sim = cosine_similarity(train_matrix.T)
# 评估指标
mae_scores = []
rmse_scores = []
# 对测试集中的每个评分进行预测
for _, row in test_data.iterrows():
user_id, item_id, true_rating = row['user_id'], row['item_id'], row['rating']
if user_id in train_matrix.index and item_id in train_matrix.columns:
# 基于用户的协同过滤预测
user_idx = list(train_matrix.index).index(user_id)
item_idx = list(train_matrix.columns).index(item_id)
# 计算预测评分
user_similarities = user_sim[user_idx]
user_ratings = train_matrix.iloc[:, item_idx]
# 加权平均预测
numerator = np.sum(user_similarities * user_ratings)
denominator = np.sum(np.abs(user_similarities))
if denominator > 0:
predicted_rating = numerator / denominator
else:
predicted_rating = np.mean(train_matrix.values[train_matrix.values > 0])
# 计算误差
mae_scores.append(abs(true_rating - predicted_rating))
rmse_scores.append((true_rating - predicted_rating) ** 2)
# 计算最终指标
mae = np.mean(mae_scores)
rmse = np.sqrt(np.mean(rmse_scores))
print(f"平均绝对误差 (MAE): {mae:.4f}")
print(f"均方根误差 (RMSE): {rmse:.4f}")
return mae, rmse
def visualize_recommendation_results(self, user_id=1):
"""可视化推荐结果"""
print(f"=== 用户 {user_id} 推荐结果可视化 ===")
# 获取不同算法的推荐结果
user_based_recs = self.user_based_recommendation(user_id, 10)
item_based_recs = self.item_based_recommendation(user_id, 10)
# 注意:这里依赖前面在全局作用域中计算好的reconstructed_matrix
svd_recs = self.svd_recommendation(user_id, reconstructed_matrix, 10)
# 创建可视化
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle(f'用户 {user_id} 推荐结果比较', fontsize=16, fontweight='bold')
# 1. 基于用户的协同过滤推荐
if user_based_recs:
items, scores = zip(*user_based_recs)
axes[0,0].bar(range(len(items)), scores, color='skyblue')
axes[0,0].set_title('基于用户的协同过滤推荐')
axes[0,0].set_xlabel('推荐物品')
axes[0,0].set_ylabel('推荐分数')
axes[0,0].set_xticks(range(len(items)))
axes[0,0].set_xticklabels([f'Item {item}' for item in items], rotation=45)
# 2. 基于物品的协同过滤推荐
if item_based_recs:
items, scores = zip(*item_based_recs)
axes[0,1].bar(range(len(items)), scores, color='lightgreen')
axes[0,1].set_title('基于物品的协同过滤推荐')
axes[0,1].set_xlabel('推荐物品')
axes[0,1].set_ylabel('推荐分数')
axes[0,1].set_xticks(range(len(items)))
axes[0,1].set_xticklabels([f'Item {item}' for item in items], rotation=45)
# 3. SVD推荐
axes[1,0].bar(range(len(svd_recs)), svd_recs.values, color='orange')
axes[1,0].set_title('SVD矩阵分解推荐')
axes[1,0].set_xlabel('推荐物品')
axes[1,0].set_ylabel('预测评分')
axes[1,0].set_xticks(range(len(svd_recs)))
axes[1,0].set_xticklabels([f'Item {item}' for item in svd_recs.index], rotation=45)
# 4. 用户评分历史
user_ratings = self.user_item_matrix.loc[user_id]
rated_items = user_ratings[user_ratings > 0]
if len(rated_items) > 0:
axes[1,1].bar(range(len(rated_items)), rated_items.values, color='red', alpha=0.7)
axes[1,1].set_title('用户历史评分')
axes[1,1].set_xlabel('已评分物品')
axes[1,1].set_ylabel('评分')
axes[1,1].set_xticks(range(len(rated_items)))
axes[1,1].set_xticklabels([f'Item {item}' for item in rated_items.index], rotation=45)
plt.tight_layout()
plt.show()
# 打印推荐结果
print(f"\n基于用户的协同过滤推荐 (Top 5):")
for i, (item, score) in enumerate(user_based_recs[:5], 1):
print(f"{i}. 物品 {item}: {score:.4f}")
print(f"\n基于物品的协同过滤推荐 (Top 5):")
for i, (item, score) in enumerate(item_based_recs[:5], 1):
print(f"{i}. 物品 {item}: {score:.4f}")
print(f"\nSVD矩阵分解推荐 (Top 5):")
for i, (item, score) in enumerate(svd_recs.head(5).items(), 1):
print(f"{i}. 物品 {item}: {score:.4f}")
# 添加方法到类
RecommendationSystem.evaluate_recommendation_system = evaluate_recommendation_system
RecommendationSystem.visualize_recommendation_results = visualize_recommendation_results
# 评估推荐系统
mae, rmse = rec_system.evaluate_recommendation_system(ratings_data)
# 可视化推荐结果
rec_system.visualize_recommendation_results(user_id=1)
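MAE/RMSE衡量的是评分预测误差;若更关心Top-N推荐的命中情况,可以补充Precision@K这类排序指标。下面是一个简化草图:把测试集中评分不低于阈值的物品视为"相关",统计推荐列表中相关物品的占比。注意,严格评估时相似度矩阵应只用训练集重建,否则测试物品会被"已评分"规则过滤掉;阈值、K值与抽样用户数均为演示假设。
def precision_at_k(rec_fn, test_df, k=10, like_threshold=4, n_users=50):
    """简化的Precision@K草图:rec_fn(user_id, k)需返回[(item, score), ...]"""
    hits, total = 0, 0
    test_likes = test_df[test_df['rating'] >= like_threshold]
    for user_id in test_likes['user_id'].unique()[:n_users]:
        liked = set(test_likes.loc[test_likes['user_id'] == user_id, 'item_id'])
        try:
            recs = rec_fn(user_id, k)
        except KeyError:  # 训练集中不存在的冷启动用户
            continue
        hits += sum(1 for item, _ in recs if item in liked)
        total += k
    return hits / total if total else 0.0

# 用法示意:先仅用训练集重建user_item_matrix与相似度,再传入对应的推荐函数
# train_part, test_part = train_test_split(ratings_data, test_size=0.2, random_state=42)
# print(f"Precision@10: {precision_at_k(rec_system.item_based_recommendation, test_part):.4f}")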
10.4 项目四:时间序列预测
10.4.1 项目背景与目标
时间序列预测在金融、零售、制造等领域有广泛应用,如股价预测、销量预测、设备故障预测等。
项目目标:
- 构建时间序列预测模型
- 处理季节性和趋势性
- 评估预测性能
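处理季节性和趋势性的第一步,通常是把序列分解开来观察。下面用statsmodels的seasonal_decompose给出一个分解草图,数据为合成的演示序列,周期参数假设为7天:
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# 合成一条带线性趋势和周季节性的日度序列(演示数据)
idx = pd.date_range('2023-01-01', periods=120, freq='D')
series = pd.Series(np.linspace(0, 10, 120)
                   + 5 * np.sin(2 * np.pi * np.arange(120) / 7)
                   + np.random.normal(0, 0.5, 120), index=idx)

# 加法模型:observed = trend + seasonal + resid
result = seasonal_decompose(series, model='additive', period=7)
print(result.seasonal.head(7))  # 一个完整周期的季节性分量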
class TimeSeriesForecasting:
def __init__(self):
self.data = None
self.models = {}
self.predictions = {}
def create_time_series_dataset(self, n_points=1000, start_date='2020-01-01'):
"""创建时间序列数据集"""
print("=== 创建时间序列数据集 ===")
# 创建日期范围
dates = pd.date_range(start=start_date, periods=n_points, freq='D')
# 生成基础趋势
trend = np.linspace(100, 200, n_points)
# 添加季节性(年度和周度)
annual_seasonality = 20 * np.sin(2 * np.pi * np.arange(n_points) / 365.25)
weekly_seasonality = 10 * np.sin(2 * np.pi * np.arange(n_points) / 7)
# 添加噪声
noise = np.random.normal(0, 5, n_points)
# 组合所有组件
values = trend + annual_seasonality + weekly_seasonality + noise
# 添加一些异常值
anomaly_indices = np.random.choice(n_points, size=int(n_points * 0.02), replace=False)
values[anomaly_indices] += np.random.normal(0, 30, len(anomaly_indices))
# 创建DataFrame
ts_data = pd.DataFrame({
'date': dates,
'value': values,
'trend': trend,
'annual_seasonality': annual_seasonality,
'weekly_seasonality': weekly_seasonality,
'noise': noise
})
ts_data.set_index('date', inplace=True)
print(f"时间序列数据形状: {ts_data.shape}")
print(f"日期范围: {ts_data.index.min()} 到 {ts_data.index.max()}")
print(f"数值范围: {ts_data['value'].min():.2f} 到 {ts_data['value'].max():.2f}")
self.data = ts_data
# 可视化时间序列
self.visualize_time_series(ts_data)
return ts_data
def visualize_time_series(self, ts_data):
"""可视化时间序列数据"""
fig, axes = plt.subplots(3, 2, figsize=(16, 12))
fig.suptitle('时间序列数据分析', fontsize=16, fontweight='bold')
# 1. 原始时间序列
axes[0,0].plot(ts_data.index, ts_data['value'], color='blue', alpha=0.7)
axes[0,0].set_title('原始时间序列')
axes[0,0].set_xlabel('日期')
axes[0,0].set_ylabel('数值')
axes[0,0].grid(True, alpha=0.3)
# 2. 趋势组件
axes[0,1].plot(ts_data.index, ts_data['trend'], color='red', linewidth=2)
axes[0,1].set_title('趋势组件')
axes[0,1].set_xlabel('日期')
axes[0,1].set_ylabel('趋势值')
axes[0,1].grid(True, alpha=0.3)
# 3. 年度季节性
axes[1,0].plot(ts_data.index[:365], ts_data['annual_seasonality'][:365], color='green')
axes[1,0].set_title('年度季节性 (前365天)')
axes[1,0].set_xlabel('日期')
axes[1,0].set_ylabel('季节性值')
axes[1,0].grid(True, alpha=0.3)
# 4. 周度季节性
axes[1,1].plot(ts_data.index[:28], ts_data['weekly_seasonality'][:28], color='orange')
axes[1,1].set_title('周度季节性 (前4周)')
axes[1,1].set_xlabel('日期')
axes[1,1].set_ylabel('季节性值')
axes[1,1].grid(True, alpha=0.3)
# 5. 数值分布
axes[2,0].hist(ts_data['value'], bins=50, color='purple', alpha=0.7)
axes[2,0].set_title('数值分布')
axes[2,0].set_xlabel('数值')
axes[2,0].set_ylabel('频次')
# 6. 自相关图
from statsmodels.tsa.stattools import acf
lags = range(1, 50)
autocorr = acf(ts_data['value'], nlags=49)[1:]  # 一次性计算1~49阶自相关,避免循环中重复计算
axes[2,1].plot(lags, autocorr, color='brown')
axes[2,1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[2,1].set_title('自相关函数')
axes[2,1].set_xlabel('滞后期')
axes[2,1].set_ylabel('自相关系数')
axes[2,1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# 创建时间序列预测实例
ts_forecaster = TimeSeriesForecasting()
# 创建数据集
print("=== 创建时间序列数据集 ===")
ts_data = ts_forecaster.create_time_series_dataset()
10.4.2 特征工程与数据预处理
def create_time_series_features(self, ts_data, window_sizes=[7, 14, 30]):
"""创建时间序列特征"""
print("=== 时间序列特征工程 ===")
# 复制数据
features_df = ts_data.copy()
# 1. 滞后特征
for lag in [1, 2, 3, 7, 14, 30]:
features_df[f'lag_{lag}'] = features_df['value'].shift(lag)
# 2. 滑动窗口统计特征
for window in window_sizes:
features_df[f'rolling_mean_{window}'] = features_df['value'].rolling(window=window).mean()
features_df[f'rolling_std_{window}'] = features_df['value'].rolling(window=window).std()
features_df[f'rolling_min_{window}'] = features_df['value'].rolling(window=window).min()
features_df[f'rolling_max_{window}'] = features_df['value'].rolling(window=window).max()
# 3. 时间特征
features_df['year'] = features_df.index.year
features_df['month'] = features_df.index.month
features_df['day'] = features_df.index.day
features_df['dayofweek'] = features_df.index.dayofweek
features_df['dayofyear'] = features_df.index.dayofyear
features_df['quarter'] = features_df.index.quarter
features_df['is_weekend'] = (features_df.index.dayofweek >= 5).astype(int)
# 4. 差分特征
features_df['diff_1'] = features_df['value'].diff(1)
features_df['diff_7'] = features_df['value'].diff(7)
features_df['diff_30'] = features_df['value'].diff(30)
# 5. 变化率特征
features_df['pct_change_1'] = features_df['value'].pct_change(1)
features_df['pct_change_7'] = features_df['value'].pct_change(7)
# 6. 周期性特征
features_df['sin_dayofyear'] = np.sin(2 * np.pi * features_df['dayofyear'] / 365.25)
features_df['cos_dayofyear'] = np.cos(2 * np.pi * features_df['dayofyear'] / 365.25)
features_df['sin_dayofweek'] = np.sin(2 * np.pi * features_df['dayofweek'] / 7)
features_df['cos_dayofweek'] = np.cos(2 * np.pi * features_df['dayofweek'] / 7)
# 删除包含NaN的行
features_df = features_df.dropna()
print(f"特征工程后数据形状: {features_df.shape}")
print(f"特征列数: {len(features_df.columns)}")
return features_df
def prepare_supervised_data(self, features_df, target_col='value', forecast_horizon=1):
"""准备监督学习数据"""
print("=== 准备监督学习数据 ===")
# 创建目标变量(未来值)
y = features_df[target_col].shift(-forecast_horizon)
# 特征变量(排除目标变量和其组件)
exclude_cols = ['value', 'trend', 'annual_seasonality', 'weekly_seasonality', 'noise']
X = features_df.drop(columns=[col for col in exclude_cols if col in features_df.columns])
# 删除包含NaN的行
valid_indices = ~y.isna()
X = X[valid_indices]
y = y[valid_indices]
print(f"特征矩阵形状: {X.shape}")
print(f"目标变量形状: {y.shape}")
return X, y
# 添加方法到类
TimeSeriesForecasting.create_time_series_features = create_time_series_features
TimeSeriesForecasting.prepare_supervised_data = prepare_supervised_data
# 特征工程
features_data = ts_forecaster.create_time_series_features(ts_data)
X, y = ts_forecaster.prepare_supervised_data(features_data)
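在训练模型之前需要强调:时间序列不能随机打乱做交叉验证,否则会用未来信息预测过去。Scikit-learn的TimeSeriesSplit提供按时间顺序的滚动验证,下面是一个草图(折数与模型参数为演示假设):
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 每一折的训练样本都在验证样本之前,模拟真实的"用过去预测未来"
tscv = TimeSeriesSplit(n_splits=5)
rf = RandomForestRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(rf, X, y, cv=tscv, scoring='neg_mean_absolute_error')
print("各折MAE:", np.round(-scores, 4))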
10.4.3 时间序列预测模型
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def train_forecasting_models(self, X, y, test_size=0.2):
"""训练多种时间序列预测模型"""
print("=== 训练时间序列预测模型 ===")
# 时间序列数据按时间顺序划分训练集和测试集
split_index = int(len(X) * (1 - test_size))
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]
print(f"训练集大小: {X_train.shape[0]}")
print(f"测试集大小: {X_test.shape[0]}")
# 定义模型
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
'SVR': SVR(kernel='rbf', C=1.0, gamma='scale')
}
# 训练和评估模型
results = {}
predictions = {}
for name, model in models.items():
print(f"\n训练 {name}...")
# 训练模型
model.fit(X_train, y_train)
# 预测
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# 评估指标
train_mae = mean_absolute_error(y_train, y_pred_train)
test_mae = mean_absolute_error(y_test, y_pred_test)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
results[name] = {
'train_mae': train_mae,
'test_mae': test_mae,
'train_rmse': train_rmse,
'test_rmse': test_rmse,
'train_r2': train_r2,
'test_r2': test_r2
}
predictions[name] = {
'train_pred': y_pred_train,
'test_pred': y_pred_test
}
print(f"训练 MAE: {train_mae:.4f}, 测试 MAE: {test_mae:.4f}")
print(f"训练 RMSE: {train_rmse:.4f}, 测试 RMSE: {test_rmse:.4f}")
print(f"训练 R²: {train_r2:.4f}, 测试 R²: {test_r2:.4f}")
self.models = models
self.predictions = predictions
return results, X_train, X_test, y_train, y_test
def visualize_forecasting_results(self, results, X_train, X_test, y_train, y_test):
"""可视化预测结果"""
print("=== 可视化预测结果 ===")
# 创建图形
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('时间序列预测结果', fontsize=16, fontweight='bold')
# 1. 模型性能比较
model_names = list(results.keys())
test_maes = [results[name]['test_mae'] for name in model_names]
test_rmses = [results[name]['test_rmse'] for name in model_names]
test_r2s = [results[name]['test_r2'] for name in model_names]
axes[0,0].bar(model_names, test_maes, color='skyblue')
axes[0,0].set_title('测试集 MAE 比较')
axes[0,0].set_ylabel('MAE')
axes[0,0].tick_params(axis='x', rotation=45)
axes[0,1].bar(model_names, test_rmses, color='lightgreen')
axes[0,1].set_title('测试集 RMSE 比较')
axes[0,1].set_ylabel('RMSE')
axes[0,1].tick_params(axis='x', rotation=45)
axes[0,2].bar(model_names, test_r2s, color='orange')
axes[0,2].set_title('测试集 R² 比较')
axes[0,2].set_ylabel('R²')
axes[0,2].tick_params(axis='x', rotation=45)
# 2. 预测结果可视化(选择最佳模型)
best_model = min(results.keys(), key=lambda x: results[x]['test_mae'])
# 训练集预测
train_dates = X_train.index
test_dates = X_test.index
axes[1,0].plot(train_dates, y_train, label='真实值', color='blue', alpha=0.7)
axes[1,0].plot(train_dates, self.predictions[best_model]['train_pred'],
label='预测值', color='red', alpha=0.7)
axes[1,0].set_title(f'{best_model} - 训练集预测')
axes[1,0].set_xlabel('日期')
axes[1,0].set_ylabel('数值')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)
# 测试集预测
axes[1,1].plot(test_dates, y_test, label='真实值', color='blue', alpha=0.7)
axes[1,1].plot(test_dates, self.predictions[best_model]['test_pred'],
label='预测值', color='red', alpha=0.7)
axes[1,1].set_title(f'{best_model} - 测试集预测')
axes[1,1].set_xlabel('日期')
axes[1,1].set_ylabel('数值')
axes[1,1].legend()
axes[1,1].grid(True, alpha=0.3)
# 残差分析
residuals = y_test.values - self.predictions[best_model]['test_pred']
axes[1,2].scatter(self.predictions[best_model]['test_pred'], residuals, alpha=0.6)
axes[1,2].axhline(y=0, color='red', linestyle='--')
axes[1,2].set_title(f'{best_model} - 残差分析')
axes[1,2].set_xlabel('预测值')
axes[1,2].set_ylabel('残差')
axes[1,2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# 打印最佳模型信息
print(f"\n最佳模型: {best_model}")
print(f"测试集 MAE: {results[best_model]['test_mae']:.4f}")
print(f"测试集 RMSE: {results[best_model]['test_rmse']:.4f}")
print(f"测试集 R²: {results[best_model]['test_r2']:.4f}")
def feature_importance_analysis(self, model_name='Random Forest'):
"""特征重要性分析"""
print(f"=== {model_name} 特征重要性分析 ===")
if model_name not in self.models:
print(f"模型 {model_name} 不存在")
return
model = self.models[model_name]
# 获取特征重要性
if hasattr(model, 'feature_importances_'):
importances = model.feature_importances_
feature_names = X.columns  # 注意:这里依赖全局作用域中的特征矩阵X
# 创建特征重要性DataFrame
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': importances
}).sort_values('importance', ascending=False)
# 可视化特征重要性
plt.figure(figsize=(12, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'], color='skyblue')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('特征重要性')
plt.title(f'{model_name} - Top 15 特征重要性')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print("Top 10 重要特征:")
for i, (_, row) in enumerate(importance_df.head(10).iterrows(), 1):
print(f"{i}. {row['feature']}: {row['importance']:.4f}")
else:
print(f"模型 {model_name} 不支持特征重要性分析")
# 添加方法到类
TimeSeriesForecasting.train_forecasting_models = train_forecasting_models
TimeSeriesForecasting.visualize_forecasting_results = visualize_forecasting_results
TimeSeriesForecasting.feature_importance_analysis = feature_importance_analysis
# 训练模型
results, X_train, X_test, y_train, y_test = ts_forecaster.train_forecasting_models(X, y)
# 可视化结果
ts_forecaster.visualize_forecasting_results(results, X_train, X_test, y_train, y_test)
# 特征重要性分析
ts_forecaster.feature_importance_analysis('Random Forest')
ts_forecaster.feature_importance_analysis('Gradient Boosting')
10.4.4 预测示例
def make_future_predictions(self, n_steps=30):
"""进行未来预测"""
print(f"=== 未来 {n_steps} 天预测 ===")
# 选择最佳模型(依赖前面在全局作用域中保存的results、X与y)
best_model_name = min(results.keys(), key=lambda x: results[x]['test_mae'])
best_model = self.models[best_model_name]
print(f"使用最佳模型: {best_model_name}")
# 获取最新的特征数据
last_features = X.iloc[-1:].copy()
# 存储预测结果
future_predictions = []
current_features = last_features.copy()
for step in range(n_steps):
# 进行预测
pred = best_model.predict(current_features)[0]
future_predictions.append(pred)
# 更新特征(简化版本,实际应用中需要更复杂的特征更新逻辑)
# 这里只更新滞后特征作为示例
if 'lag_1' in current_features.columns:
current_features['lag_1'] = pred
print(f"第 {step+1} 天预测值: {pred:.2f}")
# 可视化预测结果
plt.figure(figsize=(14, 8))
# 绘制历史数据
historical_dates = y.index[-100:] # 最近100天
historical_values = y.iloc[-100:]
plt.plot(historical_dates, historical_values, label='历史数据', color='blue', alpha=0.7)
# 绘制未来预测
future_dates = pd.date_range(start=y.index[-1] + pd.Timedelta(days=1),
periods=n_steps, freq='D')
plt.plot(future_dates, future_predictions, label='未来预测',
color='red', linestyle='--', marker='o', markersize=4)
plt.title(f'时间序列预测 - 未来 {n_steps} 天')
plt.xlabel('日期')
plt.ylabel('数值')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
return future_predictions, future_dates
# 添加方法到类
TimeSeriesForecasting.make_future_predictions = make_future_predictions
# 进行未来预测
future_preds, future_dates = ts_forecaster.make_future_predictions(30)
10.5 本章小结
10.5.1 项目总结
本章通过四个完整的实战项目,展示了Scikit-learn在不同领域的应用:
- 房价预测系统:回归问题的完整解决方案
- 客户分类系统:分类问题的业务应用
- 推荐系统:协同过滤和矩阵分解技术
- 时间序列预测:时序数据的特征工程和预测
10.5.2 核心技能
通过这些项目,你掌握了:
- 数据预处理:缺失值处理、特征缩放、编码
- 特征工程:特征选择、构造、变换
- 模型选择:多种算法比较和评估
- 超参数调优:网格搜索和随机搜索(随机搜索见下方草图)
- 模型解释:特征重要性和可视化
- 业务应用:将技术转化为业务价值
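本章示例中的调优使用的是GridSearchCV;作为补充,下面给出RandomizedSearchCV的一个最小草图,它从参数分布中随机采样,在大参数空间下通常更高效(参数分布与迭代次数为演示假设):
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={'n_estimators': randint(50, 300),
                         'max_depth': randint(3, 20)},
    n_iter=20, cv=5, scoring='neg_mean_squared_error',
    random_state=42, n_jobs=-1,
)
# 用法与GridSearchCV一致:search.fit(X_train, y_train)后查看search.best_params_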
10.5.3 最佳实践
- 数据质量:始终关注数据质量和完整性
- 特征工程:投入足够时间进行特征工程
- 模型验证:使用适当的验证策略
- 可解释性:确保模型结果可解释
- 业务理解:深入理解业务需求和约束
10.5.4 进阶学习
- 深度学习:TensorFlow、PyTorch
- 大数据处理:Spark MLlib、Dask
- 模型部署:Flask、FastAPI、Docker
- MLOps:模型版本控制、监控、自动化
10.5.5 练习建议
- 尝试其他数据集和问题类型
- 实现更复杂的特征工程
- 探索集成学习方法
- 学习模型部署和监控
- 参与Kaggle竞赛实践
通过这些实战项目的学习,你已经具备了使用Scikit-learn解决实际机器学习问题的能力。继续实践和探索,将帮助你成为更优秀的数据科学家!