Chapter 5: Data Selection and Filtering

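5.1 Chapter Overview

Data selection and filtering are among the most fundamental and important operations in data analysis. This chapter covers the Pandas methods for selecting and filtering data in detail, including position-based selection, label-based selection, and conditional filtering.

5.1.1 Learning Objectives

Master indexing operations on DataFrame and Series
Learn to select data with loc and iloc
Understand how boolean indexing works and where it applies
Master multi-condition queries and complex filters
Learn data slicing and sampling techniques
Understand how to optimize query performance

5.1.2 Overview of Data Selection Methods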

graph TD
    A[Data selection] --> B[Position-based]
    A --> C[Label-based]
    A --> D[Conditional filtering]
    
    B --> B1[iloc]
    B --> B2[Slicing]
    
    C --> C1[loc]
    C --> C2[Column-name selection]
    
    D --> D1[Boolean indexing]
    D --> D2[query method]
    D --> D3[Combined conditions]

5.2 Basic Indexing Operations

5.2.1 Creating Sample Data

import pandas as pd
import numpy as np

# Create the sample data
np.random.seed(42)
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'age': [25, 30, 35, 28, 32, 45, 29, 38],
    'salary': [50000, 60000, 75000, 55000, 68000, 90000, 52000, 72000],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing', 'Finance', 'HR', 'IT'],
    'experience': [2, 5, 8, 3, 6, 15, 4, 10],
    'performance': [8.5, 7.2, 9.1, 8.0, 8.8, 9.5, 7.8, 8.9]
}

df = pd.DataFrame(data)
print("Sample data:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Index: {df.index}")
print(f"Columns: {df.columns.tolist()}")

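5.2.2 Column Selection

# Select a single column (returns a Series)
print("Select a single column - name:")
print(df['name'])
print(f"Type: {type(df['name'])}")

# Select multiple columns (returns a DataFrame)
print("\nSelect multiple columns - name and salary:")
selected_cols = df[['name', 'salary']]
print(selected_cols)
print(f"Type: {type(selected_cols)}")

# Dot access (only works when the column name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method)
print("\nAccess the age column with dot notation:")
print(df.age.head())

# Select a slice of columns
print("\nColumn slice (from name to salary):")
print(df.loc[:, 'name':'salary'])

# Select columns of a given data type
print("\nSelect numeric columns:")
numeric_cols = df.select_dtypes(include=[np.number])
print(numeric_cols.columns.tolist())
print(numeric_cols.head())

# Select string columns
print("\nSelect string columns:")
string_cols = df.select_dtypes(include=['object'])
print(string_cols.columns.tolist())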

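5.2.3 Row Selection

# Select a row by integer position
print("Select the first row:")
print(df.iloc[0])
print(f"Type: {type(df.iloc[0])}")

# Select several rows
print("\nSelect the first three rows:")
print(df.iloc[0:3])

# Select specific rows
print("\nSelect rows 1, 3, and 5:")
print(df.iloc[[0, 2, 4]])

# Use a negative index
print("\nSelect the last row:")
print(df.iloc[-1])

# Select the last three rows
print("\nSelect the last three rows:")
print(df.iloc[-3:])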

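5.3 loc and iloc in Detail

5.3.1 iloc - Position-Based Indexing

# Basic iloc usage
print("Basic iloc examples:")

# Select a single element
print(f"Value at row 2, column 3: {df.iloc[1, 2]}")

# Select a range of rows and columns
print("\nSelect the first 3 rows and first 4 columns:")
print(df.iloc[0:3, 0:4])

# Select specific rows and columns
print("\nSelect columns 2 and 4 of rows 1 and 3:")
print(df.iloc[[0, 2], [1, 3]])

# Select specific columns for all rows
print("\nSelect columns 2 and 3 for all rows:")
print(df.iloc[:, [1, 2]])

# Select all columns of a specific row
print("\nSelect all columns of row 2:")
print(df.iloc[1, :])

# Use a step
print("\nSelect every other row:")
print(df.iloc[::2])

# Reverse the order
print("\nSelect rows in reverse order:")
print(df.iloc[::-1])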

5.3.2 loc - Label-Based Indexing

# Set a meaningful index
df_indexed = df.set_index('name')
print("Data after setting name as the index:")
print(df_indexed)

# Basic loc usage
print("\nBasic loc examples:")

# Select a single row
print("Select Alice's record:")
print(df_indexed.loc['Alice'])

# Select several rows
print("\nSelect the records for Alice and Charlie:")
print(df_indexed.loc[['Alice', 'Charlie']])

# Select rows and columns
print("\nSelect Alice's age and salary:")
print(df_indexed.loc['Alice', ['age', 'salary']])

# Use a label slice (both endpoints are included)
print("\nAll records from Alice to Charlie:")
print(df_indexed.loc['Alice':'Charlie'])

# Select a column range for specific rows
print("\nSelect columns age to department for Bob to David:")
print(df_indexed.loc['Bob':'David', 'age':'department'])

# Select rows with a boolean array
high_salary = df_indexed['salary'] > 60000
print("\nHigh-salary employees:")
print(df_indexed.loc[high_salary])

5.3.3 loc vs iloc Comparison

# Build a side-by-side example
print("loc vs iloc comparison:")

# Reset the index so both accessors can be compared on the same rows
df_reset = df_indexed.reset_index()

print("First 3 rows and first 3 columns with iloc:")
print(df_reset.iloc[0:3, 0:3])

print("\nSame selection with loc (label-based, so the end label 2 is included):")
print(df_reset.loc[0:2, 'name':'salary'])

# Performance comparison
import time

# Create a larger dataset for the timing test
large_df = pd.DataFrame({
    'A': np.random.randn(100000),
    'B': np.random.randn(100000),
    'C': np.random.randn(100000)
})

# Time iloc (perf_counter has better resolution than time.time for short runs)
start_time = time.perf_counter()
for _ in range(1000):
    _ = large_df.iloc[0:100, 0:2]
iloc_time = time.perf_counter() - start_time

# Time loc
start_time = time.perf_counter()
for _ in range(1000):
    _ = large_df.loc[0:99, 'A':'B']
loc_time = time.perf_counter() - start_time

print(f"\niloc elapsed time: {iloc_time:.4f} s")
print(f"loc elapsed time: {loc_time:.4f} s")
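The comparison above uses the default integer index, where labels and positions coincide and the difference is easy to miss. The following minimal sketch uses a hypothetical frame named scores whose index labels differ from the row positions, to show that an iloc slice excludes its stop position while a loc slice includes its stop label.

```python
import pandas as pd

# Hypothetical toy frame: the index labels deliberately differ from positions.
scores = pd.DataFrame(
    {"score": [90, 85, 70, 60]},
    index=[10, 20, 30, 40],
)

# iloc slices by position and excludes the stop position, like a Python list.
by_position = scores.iloc[0:2]

# loc slices by label and includes the stop label.
by_label = scores.loc[10:30]

print(by_position.index.tolist())  # [10, 20]
print(by_label.index.tolist())     # [10, 20, 30]
```

When the index is the default 0..n-1 range, iloc[0:3] and loc[0:2] return the same rows, which is why the earlier example uses those two different bounds.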

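5.4 Boolean Indexing

5.4.1 Basic Boolean Indexing

# Build a boolean condition
print("Boolean indexing examples:")

# Filter on a single condition
high_salary_condition = df['salary'] > 60000
print("High-salary condition:")
print(high_salary_condition)

print("\nHigh-salary employees:")
high_salary_employees = df[high_salary_condition]
print(high_salary_employees)

# Put the condition directly inside the brackets
print("\nEmployees older than 30:")
print(df[df['age'] > 30])

# String condition
print("\nIT department employees:")
print(df[df['department'] == 'IT'])

# Use the isin method
print("\nIT or HR department employees:")
print(df[df['department'].isin(['IT', 'HR'])])

# Substring match
print("\nEmployees whose name contains 'a' (case-insensitive):")
print(df[df['name'].str.contains('a', case=False)])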

5.4.2 Multi-Condition Boolean Indexing

# Combine several conditions
print("Multi-condition boolean indexing:")

# AND condition (&)
print("Employees with a high salary and age above 30:")
condition_and = (df['salary'] > 60000) & (df['age'] > 30)
print(df[condition_and])

# OR condition (|)
print("\nEmployees with a high salary or high performance:")
condition_or = (df['salary'] > 70000) | (df['performance'] > 9.0)
print(df[condition_or])

# NOT condition (~)
print("\nEmployees outside the IT department:")
condition_not = ~(df['department'] == 'IT')
print(df[condition_not])

# Nested combination
print("\nComplex condition: IT department and (high salary or high performance):")
complex_condition = (df['department'] == 'IT') & ((df['salary'] > 60000) | (df['performance'] > 8.5))
print(df[complex_condition])

# Range condition
print("\nEmployees aged 25 to 35:")
age_range = (df['age'] >= 25) & (df['age'] <= 35)
print(df[age_range])

# Use the between method (both bounds are inclusive by default)
print("\nEmployees with a salary between 50000 and 70000:")
salary_between = df['salary'].between(50000, 70000)
print(df[salary_between])
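Every condition above is wrapped in parentheses, and the reason is Python operator precedence: & binds more tightly than comparison operators such as >. The sketch below, built on a hypothetical three-row frame named staff, shows the correct form and what happens when the parentheses are left out.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
staff = pd.DataFrame({"age": [25, 40, 33], "salary": [50000, 90000, 61000]})

# Correct: each comparison is parenthesized before combining with &.
mask = (staff["age"] > 30) & (staff["salary"] > 60000)
print(staff[mask])

# Without parentheses, & binds tighter than >, so Python evaluates
# staff["age"] > (30 & staff["salary"]) > 60000. That is a chained
# comparison, which calls bool() on a Series and raises ValueError.
raised = False
try:
    staff[staff["age"] > 30 & staff["salary"] > 60000]
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```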

5.4.3 Conditional Assignment

# Modify data based on a condition
df_modified = df.copy()

# Simple conditional assignment
df_modified.loc[df_modified['age'] > 35, 'category'] = 'Senior'
df_modified.loc[df_modified['age'] <= 35, 'category'] = 'Junior'

print("After adding the age category:")
print(df_modified[['name', 'age', 'category']])

# Conditional assignment with numpy.where
df_modified['salary_level'] = np.where(df_modified['salary'] > 65000, 'High', 'Normal')

print("\nAfter adding the salary level:")
print(df_modified[['name', 'salary', 'salary_level']])

# Assignment with several conditions (np.select uses the first condition that matches)
conditions = [
    df_modified['performance'] >= 9.0,
    df_modified['performance'] >= 8.0,
    df_modified['performance'] >= 7.0
]
choices = ['Excellent', 'Good', 'Average']
df_modified['performance_grade'] = np.select(conditions, choices, default='Below Average')

print("\nAfter adding the performance grade:")
print(df_modified[['name', 'performance', 'performance_grade']])
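Because np.select returns the choice for the first condition that is true, the order of the conditions above matters. When the conditions are plain numeric thresholds, pd.cut is one alternative that states the same boundaries as bins. The sketch below is an illustration on a hypothetical Series named perf, not a replacement prescribed by this chapter, and it checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Hypothetical performance scores for illustration.
perf = pd.Series([8.5, 7.2, 9.1, 8.0, 6.4], name="performance")

# np.select returns the choice of the first condition that is True,
# so conditions must be ordered from most to least restrictive.
conditions = [perf >= 9.0, perf >= 8.0, perf >= 7.0]
choices = ["Excellent", "Good", "Average"]
grade_select = np.select(conditions, choices, default="Below Average")

# pd.cut expresses the same thresholds as bins; right=False makes each
# interval closed on the left, e.g. [8.0, 9.0).
grade_cut = pd.cut(
    perf,
    bins=[-np.inf, 7.0, 8.0, 9.0, np.inf],
    labels=["Below Average", "Average", "Good", "Excellent"],
    right=False,
)

print(pd.DataFrame({"performance": perf, "select": grade_select, "cut": grade_cut}))
```

pd.cut returns an ordered categorical, which can be convenient for sorting grades; np.select stays the more general tool when conditions involve several columns.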

5.5 The query Method

5.5.1 Basic query Usage

# query basics
print("query method examples:")

# Simple query
print("Employees with a salary above 60000:")
print(df.query('salary > 60000'))

# String query
print("\nIT department employees:")
print(df.query('department == "IT"'))

# Multi-condition query
print("\nEmployees older than 30 with a salary above 60000:")
print(df.query('age > 30 and salary > 60000'))

# Reference a Python variable with @
min_salary = 65000
print(f"\nEmployees with a salary above {min_salary}:")
print(df.query('salary > @min_salary'))

# Range query
print("\nEmployees aged 28 to 35:")
print(df.query('28 <= age <= 35'))

# List membership query
departments = ['IT', 'Finance']
print(f"\nEmployees in departments {departments}:")
print(df.query('department in @departments'))
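The query examples above all use column names that are valid Python identifiers. For a column name that contains a space, query accepts the name wrapped in backticks. The sketch below assumes a hypothetical orders frame with a column called unit price; the names are for illustration only.

```python
import pandas as pd

# Hypothetical data: "unit price" is not a valid Python identifier.
orders = pd.DataFrame({
    "unit price": [12.5, 8.0, 30.0],
    "qty": [4, 10, 1],
})

min_price = 10

# Backticks quote the column name; @ still references a local variable.
expensive = orders.query("`unit price` > @min_price and qty < 5")
print(expensive)
```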

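5.5.2 Advanced query Usage

# More complex query expressions
print("Advanced query usage:")

# Call a method inside the expression
# (engine='python' is the conservative choice for .str methods)
print("Employees whose name is longer than 4 characters:")
print(df.query('name.str.len() > 4', engine='python'))

# Regular expression query
print("\nEmployees whose name starts with 'A' or 'B':")
print(df.query('name.str.match("^[AB]")', engine='python'))

# Combined conditions
print("\nHigh-performing, experienced employees:")
print(df.query('performance > 8.5 and experience > 5'))

# Query on the index (when one has been set)
df_indexed = df.set_index('name')
print("\nQuery on the index (Alice or Bob):")
print(df_indexed.query('index in ["Alice", "Bob"]'))

# Performance comparison: query vs boolean indexing
import time

# Create a large dataset
large_df = pd.DataFrame({
    'A': np.random.randn(1000000),
    'B': np.random.randn(1000000),
    'C': np.random.choice(['X', 'Y', 'Z'], 1000000)
})

# Time boolean indexing
start_time = time.perf_counter()
result1 = large_df[(large_df['A'] > 0) & (large_df['B'] < 0)]
bool_time = time.perf_counter() - start_time

# Time query
start_time = time.perf_counter()
result2 = large_df.query('A > 0 and B < 0')
query_time = time.perf_counter() - start_time

print("\nPerformance comparison (1 million rows):")
print(f"Boolean indexing time: {bool_time:.4f} s")
print(f"query method time: {query_time:.4f} s")
print(f"Same number of result rows: {len(result1) == len(result2)}")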

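5.6 Data Slicing and Sampling

5.6.1 Data Slicing

# Data slicing techniques
print("Data slicing examples:")

# Row slices
print("First 5 rows:")
print(df.head())

print("\nLast 3 rows:")
print(df.tail(3))

# Random selection
print("\nRandomly select 3 rows:")
print(df.sample(n=3, random_state=42))

# Sample by fraction
print("\nRandomly select 50% of the data:")
print(df.sample(frac=0.5, random_state=42))

# Stratified sampling
print("\nStratified sampling by department:")
stratified_sample = df.groupby('department').apply(
    lambda x: x.sample(n=min(2, len(x)), random_state=42)
).reset_index(drop=True)
print(stratified_sample)

# Time series slicing (build a time series first)
dates = pd.date_range('2023-01-01', periods=100, freq='D')
ts_data = pd.DataFrame({
    'date': dates,
    'value': np.random.randn(100).cumsum()
})
ts_data.set_index('date', inplace=True)

print("\nTime series data (first 5 rows):")
print(ts_data.head())

# Slice by date range
# (use .loc for partial-string row selection; ts_data['2023-01'] is treated
# as a column lookup and raises KeyError in pandas 2.x)
print("\nData for January 2023:")
print(ts_data.loc['2023-01'])

print("\nA specific date range:")
print(ts_data.loc['2023-01-10':'2023-01-20'])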
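The stratified sample above relies on groupby().apply() with a lambda. One alternative, sketched here on a hypothetical people frame that mirrors the chapter's departments, is to shuffle all rows once and then keep the first rows of each group with head(). This is a different technique from the one used in this section; it is offered as an option because it keeps the grouping column in the result and caps small groups at their own size without a lambda.

```python
import pandas as pd

# Hypothetical frame reusing the chapter's names and departments.
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Henry"],
    "department": ["IT", "HR", "Finance", "IT", "Marketing", "Finance", "HR", "IT"],
})

# Shuffle all rows once, then keep the first two rows seen in each group.
shuffled = people.sample(frac=1, random_state=42)
per_dept = shuffled.groupby("department").head(2)

print(per_dept.sort_values("department"))
print(per_dept["department"].value_counts())
```

Groups with fewer than two rows (Marketing here) simply contribute every row they have.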

5.6.2 Advanced Sampling Techniques

# Advanced sampling methods
print("Advanced sampling techniques:")

# Systematic sampling
print("Sample every second row:")
systematic_sample = df.iloc[::2]
print(systematic_sample)

# Weighted sampling (weights derived from a column)
weights = df['salary'] / df['salary'].sum()
print("\nWeighted sampling based on salary:")
weighted_sample = df.sample(n=3, weights=weights, random_state=42)
print(weighted_sample)

# Cluster-based sampling (simple example; requires scikit-learn)
from sklearn.cluster import KMeans

# Cluster on the numeric features
numeric_features = df[['age', 'salary', 'experience', 'performance']]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(numeric_features)

print("\nClustering result:")
print(df[['name', 'cluster']])

# Sample from each cluster
cluster_sample = df.groupby('cluster').apply(
    lambda x: x.sample(n=min(2, len(x)), random_state=42)
).reset_index(drop=True)
print("\nCluster sampling result:")
print(cluster_sample[['name', 'department', 'cluster']])

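5.7 Optimizing Conditional Queries

5.7.1 Query Performance Optimization

# Query performance optimization tips
import time

print("Query performance optimization:")

# Create a large dataset for testing
np.random.seed(42)
large_data = pd.DataFrame({
    'id': range(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000000),
    'value': np.random.randn(1000000),
    'flag': np.random.choice([True, False], 1000000)
})

print(f"Large dataset shape: {large_data.shape}")

# 1. Use an index
print("\n1. Index optimization:")
start_time = time.perf_counter()
result1 = large_data[large_data['category'] == 'A']
no_index_time = time.perf_counter() - start_time

# Set and sort the index (label lookups on a sorted index can use binary search)
large_data_indexed = large_data.set_index('category').sort_index()
start_time = time.perf_counter()
result2 = large_data_indexed.loc['A']
index_time = time.perf_counter() - start_time

print(f"Query time without index: {no_index_time:.4f} s")
print(f"Query time with index: {index_time:.4f} s")
print(f"Speedup: {no_index_time/index_time:.2f}x")

# 2. Optimize data types
print("\n2. Data type optimization:")
# Convert to the categorical type
large_data['category'] = large_data['category'].astype('category')
print("Memory usage after converting to categorical:")
print(large_data.memory_usage(deep=True))

# 3. Chunked processing
print("\n3. Processing large data in chunks:")
def process_chunk(chunk):
    return chunk[chunk['value'] > 0]

# Save the data first so there is a CSV file to read back in chunks
large_data.to_csv('large_data.csv', index=False)

chunk_size = 100000
results = []
start_time = time.perf_counter()

for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    processed_chunk = process_chunk(chunk)
    results.append(processed_chunk)

# Combine the filtered chunks and report the elapsed time
filtered = pd.concat(results, ignore_index=True)
chunk_time = time.perf_counter() - start_time
print(f"Rows kept: {len(filtered)}, elapsed: {chunk_time:.4f} s")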

5.7.2 Complex Query Examples

# Complex query scenarios
print("Complex query examples:")

# Create a more complex dataset
complex_data = pd.DataFrame({
    'employee_id': range(1, 1001),
    'name': [f'Employee_{i}' for i in range(1, 1001)],
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing', 'Sales'], 1000),
    'level': np.random.choice(['Junior', 'Mid', 'Senior', 'Lead'], 1000),
    'salary': np.random.normal(60000, 20000, 1000),
    'bonus': np.random.normal(5000, 2000, 1000),
    'start_date': pd.date_range('2020-01-01', periods=1000, freq='D'),
    'performance_score': np.random.normal(7.5, 1.5, 1000)
})

# Make sure salary and bonus are non-negative
complex_data['salary'] = np.abs(complex_data['salary'])
complex_data['bonus'] = np.abs(complex_data['bonus'])

print("Complex dataset overview:")
print(complex_data.head())

# Complex query 1: layered conditions
print("\nQuery 1: senior IT employees in the top 20% by salary with excellent performance:")
salary_threshold = complex_data['salary'].quantile(0.8)
performance_threshold = complex_data['performance_score'].quantile(0.8)

complex_query1 = complex_data[
    (complex_data['department'] == 'IT') &
    (complex_data['level'].isin(['Senior', 'Lead'])) &
    (complex_data['salary'] > salary_threshold) &
    (complex_data['performance_score'] > performance_threshold)
]
print(f"Matching employees: {len(complex_query1)}")
print(complex_query1[['name', 'level', 'salary', 'performance_score']].head())

# Complex query 2: date range plus several conditions
print("\nQuery 2: mid-level and senior employees who joined in 2022, top 30% by total compensation:")
complex_data['total_compensation'] = complex_data['salary'] + complex_data['bonus']
compensation_threshold = complex_data['total_compensation'].quantile(0.7)

complex_query2 = complex_data[
    (complex_data['start_date'].dt.year == 2022) &
    (complex_data['level'].isin(['Mid', 'Senior', 'Lead'])) &
    (complex_data['total_compensation'] > compensation_threshold)
]
print(f"Matching employees: {len(complex_query2)}")

# Complex query 3: the same style of filter written with query
print("\nQuery 3: a complex filter using the query method:")
min_salary = 70000
target_departments = ['IT', 'Finance']
min_performance = 8.0

complex_query3 = complex_data.query(
    'salary > @min_salary and '
    'department in @target_departments and '
    'performance_score > @min_performance and '
    'level != "Junior"'
)
print(f"Matching employees: {len(complex_query3)}")
print(complex_query3[['name', 'department', 'level', 'salary', 'performance_score']].head())

5.8 Practical Case Studies

5.8.1 Sales Data Analysis

# Sales data analysis case study
print("Sales data analysis case study:")

# Create the sales data
# (freq='h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas)
np.random.seed(42)
sales_data = pd.DataFrame({
    'order_id': range(1, 10001),
    'customer_id': np.random.randint(1, 1001, 10000),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home', 'Sports'], 10000),
    'sales_amount': np.random.exponential(100, 10000),
    'order_date': pd.date_range('2023-01-01', periods=10000, freq='h'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 10000),
    'sales_rep': np.random.choice([f'Rep_{i}' for i in range(1, 21)], 10000)
})

print("Sales data overview:")
print(sales_data.head())

# Analysis 1: high-value orders
print("\nAnalysis 1: high-value orders (>500):")
high_value_orders = sales_data[sales_data['sales_amount'] > 500]
print(f"Number of high-value orders: {len(high_value_orders)}")
print(f"Share of high-value orders: {len(high_value_orders)/len(sales_data)*100:.2f}%")

# Analysis 2: sales in a specific period
print("\nAnalysis 2: sales in Q1 2023:")
q1_sales = sales_data[
    (sales_data['order_date'] >= '2023-01-01') &
    (sales_data['order_date'] < '2023-04-01')
]
print(f"Q1 order count: {len(q1_sales)}")
print(f"Q1 total sales: ${q1_sales['sales_amount'].sum():.2f}")

# Analysis 3: region and category combined
print("\nAnalysis 3: electronics sales in the North region:")
north_electronics = sales_data[
    (sales_data['region'] == 'North') &
    (sales_data['product_category'] == 'Electronics')
]
print(f"North electronics order count: {len(north_electronics)}")
print(f"Average order amount: ${north_electronics['sales_amount'].mean():.2f}")

# Analysis 4: sales representative performance
print("\nAnalysis 4: top sales representatives (top 5 by total sales):")
rep_performance = sales_data.groupby('sales_rep')['sales_amount'].agg(['count', 'sum', 'mean'])
top_reps = rep_performance.nlargest(5, 'sum')
print(top_reps)

5.8.2 User Behavior Analysis

# User behavior analysis case study
print("\nUser behavior analysis case study:")

# Create the user behavior data
user_behavior = pd.DataFrame({
    'user_id': np.random.randint(1, 5001, 50000),
    'session_id': range(1, 50001),
    'page_views': np.random.poisson(5, 50000),
    'time_spent': np.random.exponential(300, 50000),  # seconds
    'device_type': np.random.choice(['Desktop', 'Mobile', 'Tablet'], 50000),
    'browser': np.random.choice(['Chrome', 'Firefox', 'Safari', 'Edge'], 50000),
    'conversion': np.random.choice([True, False], 50000, p=[0.05, 0.95]),
    'timestamp': pd.date_range('2023-01-01', periods=50000, freq='min')
})

print("User behavior data overview:")
print(user_behavior.head())

# Analysis 1: identify highly active sessions
print("\nAnalysis 1: highly active sessions (page views > 10 and time spent > 600 seconds):")
high_activity = user_behavior[
    (user_behavior['page_views'] > 10) &
    (user_behavior['time_spent'] > 600)
]
print(f"Highly active sessions: {len(high_activity)}")
print(f"Conversion rate of highly active sessions: {high_activity['conversion'].mean()*100:.2f}%")

# Analysis 2: device type and conversion
print("\nAnalysis 2: conversion for mobile users:")
mobile_users = user_behavior[user_behavior['device_type'] == 'Mobile']
mobile_conversion_rate = mobile_users['conversion'].mean()
print(f"Mobile conversion rate: {mobile_conversion_rate*100:.2f}%")

# Compare all device types
device_conversion = user_behavior.groupby('device_type')['conversion'].mean()
print("\nConversion rate by device type:")
print(device_conversion.sort_values(ascending=False))

# Analysis 3: time-of-day analysis
print("\nAnalysis 3: conversion rate during vs outside working hours:")
user_behavior['hour'] = user_behavior['timestamp'].dt.hour
work_hours = user_behavior[
    (user_behavior['hour'] >= 9) & (user_behavior['hour'] <= 17)
]
non_work_hours = user_behavior[
    (user_behavior['hour'] < 9) | (user_behavior['hour'] > 17)
]

print(f"Working-hours conversion rate: {work_hours['conversion'].mean()*100:.2f}%")
print(f"Non-working-hours conversion rate: {non_work_hours['conversion'].mean()*100:.2f}%")

5.9 Performance Optimization Techniques

5.9.1 Query Optimization Strategies

# Query performance optimization strategies
import time

print("Query performance optimization strategies:")

# Create the test data
test_data = pd.DataFrame({
    'id': range(1000000),
    'category': np.random.choice(['A', 'B', 'C'], 1000000),
    'value': np.random.randn(1000000),
    'date': pd.date_range('2020-01-01', periods=1000000, freq='min')
})

# Strategy 1: use the categorical data type
print("Strategy 1: categorical data type")
start_time = time.perf_counter()
result1 = test_data[test_data['category'] == 'A']
original_time = time.perf_counter() - start_time

test_data['category'] = test_data['category'].astype('category')
start_time = time.perf_counter()
result2 = test_data[test_data['category'] == 'A']
category_time = time.perf_counter() - start_time

print(f"Original query time: {original_time:.4f} s")
print(f"Categorical query time: {category_time:.4f} s")
print(f"Speedup: {original_time/category_time:.2f}x")

# Strategy 2: precompute frequently used conditions
print("\nStrategy 2: precomputed condition")
test_data['is_category_A'] = test_data['category'] == 'A'
start_time = time.perf_counter()
result3 = test_data[test_data['is_category_A']]
precomputed_time = time.perf_counter() - start_time
print(f"Precomputed condition query time: {precomputed_time:.4f} s")

# Strategy 3: evaluate complex expressions with query (built on pandas.eval)
print("\nStrategy 3: complex expressions with query/eval")
start_time = time.perf_counter()
result4 = test_data[(test_data['value'] > 0) & (test_data['id'] % 2 == 0)]
normal_time = time.perf_counter() - start_time

start_time = time.perf_counter()
result5 = test_data.query('value > 0 and id % 2 == 0')
eval_time = time.perf_counter() - start_time

print(f"Plain boolean query time: {normal_time:.4f} s")
print(f"query/eval time: {eval_time:.4f} s")
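The timings above each come from a single run, so caching, garbage collection, or other processes can skew them. A minimal sketch of a steadier measurement, using the standard-library timeit module on a smaller hypothetical frame, is shown below. The function names with_mask and with_query and the repeat counts are illustrative choices, not values from this chapter.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame, smaller than the chapter's to keep the run short.
rng = np.random.default_rng(42)
frame = pd.DataFrame({
    "A": rng.standard_normal(200_000),
    "B": rng.standard_normal(200_000),
})

def with_mask():
    return frame[(frame["A"] > 0) & (frame["B"] < 0)]

def with_query():
    return frame.query("A > 0 and B < 0")

# Run each variant several times and keep the best (least disturbed) run.
mask_best = min(timeit.repeat(with_mask, number=5, repeat=3)) / 5
query_best = min(timeit.repeat(with_query, number=5, repeat=3)) / 5

print(f"boolean mask: {mask_best * 1000:.2f} ms per call")
print(f"query:        {query_best * 1000:.2f} ms per call")
print(f"same rows: {with_mask().equals(with_query())}")
```

Which variant wins depends on the data size, the expression, and whether numexpr is installed, so it is worth measuring on data that resembles the real workload.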

5.9.2 Memory Optimization

# Memory usage optimization
print("\nMemory usage optimization:")

# Check memory usage
def check_memory_usage(df, name):
    memory_usage = df.memory_usage(deep=True).sum() / 1024**2  # MB
    print(f"{name} memory usage: {memory_usage:.2f} MB")
    return memory_usage

# Memory usage of the original data
original_memory = check_memory_usage(test_data, "Original data")

# Optimize the data types
optimized_data = test_data.copy()

# Downcast the integer column
optimized_data['id'] = pd.to_numeric(optimized_data['id'], downcast='integer')

# Downcast the float column
optimized_data['value'] = pd.to_numeric(optimized_data['value'], downcast='float')

# Check memory usage after optimization
optimized_memory = check_memory_usage(optimized_data, "Optimized data")

print(f"Memory saved: {((original_memory - optimized_memory) / original_memory * 100):.2f}%")

# Data type information
print("\nData type comparison:")
print("Original data types:")
print(test_data.dtypes)
print("\nOptimized data types:")
print(optimized_data.dtypes)

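5.10 Chapter Summary

5.10.1 Key Points

Basic indexing operations
- Column selection: df['col'], df[['col1', 'col2']]
- Row selection: df.iloc[0], df.iloc[0:3]
- Selection by data type: select_dtypes()
loc and iloc
- iloc: integer indexing based on position
- loc: indexing based on labels
- Both support selecting rows and columns at the same time
Boolean indexing
- Single condition: df[df['col'] > value]
- Multiple conditions: use the &, |, and ~ operators
- Conditional assignment: df.loc[condition, 'col'] = value
The query method
- Queries written as string expressions
- Supports variable references: @variable
- Combines complex conditions
Data slicing and sampling
- Random sampling: sample()
- Stratified sampling: groupby().apply()
- Time series slicing
Performance optimization
- Use the categorical data type
- Set an appropriate index
- Precompute frequently used conditions
- Optimize memory usage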

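5.10.2 Best Practices

Choose the indexing method that fits the characteristics of the data
For repeated lookups, consider setting an index
Use the categorical data type for columns of repeated strings
For complex conditions, consider the query method first
Process large datasets in chunks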

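5.10.3 Common Pitfalls

Chained indexing warning: avoid df['A']['B']; use a single call such as df.loc['B', 'A'] (row label first, then column)
Boolean conditions need parentheses: (condition1) & (condition2)
The difference between iloc and loc: position vs label
Performance: avoid running complex queries inside loops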
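The chained-indexing pitfall can be illustrated with a short sketch. The frame named table below is hypothetical; it has a column A and a row labeled B so that both forms of lookup refer to the same cell.

```python
import pandas as pd

# Hypothetical frame with a column "A" and a row labeled "B".
table = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "B", "z"])

# Reading works either way, but the chained form does two separate lookups.
chained_value = table["A"]["B"]
direct_value = table.loc["B", "A"]
print(chained_value, direct_value)

# For assignment, use a single .loc call so pandas writes to the original frame.
table.loc[table["A"] > 1, "A"] = 0
print(table)
```

Assigning through the chained form, for example table[table["A"] > 1]["A"] = 0, may modify a temporary copy instead of the original frame, which is why the single .loc call is the safer habit.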

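5.10.4 Next Steps

In the next chapter, we will cover:
- Grouping operations (groupby)
- Using aggregation functions
- Transforming data after grouping
- Multi-level grouping and pivot tables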

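Exercises

Select specific rows and columns of a DataFrame using several different methods
Implement a complex multi-condition data filter
Compare the performance of different query methods
Design a data sampling strategy
Optimize query performance on a large dataset

Remember: fluency in data selection and filtering is the foundation of effective data analysis!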
