3.1 数据仓库基础概念
3.1.1 事实表与维度表
事实表(Fact Table): - 存储业务过程中的度量数据 - 通常包含数值型字段(如销售额、数量) - 通过外键关联到维度表 - 数据量大,更新频繁
维度表(Dimension Table): - 存储描述性信息 - 提供分析的角度和上下文 - 相对稳定,变化较少 - 支持层次结构
示例数据模型:
-- 销售事实表
CREATE TABLE sales_fact (
sale_id BIGINT,
date_key INT, -- 关联日期维度
product_key INT, -- 关联产品维度
customer_key INT, -- 关联客户维度
store_key INT, -- 关联门店维度
quantity INT, -- 度量:数量
unit_price DECIMAL(10,2), -- 度量:单价
total_amount DECIMAL(12,2), -- 度量:总金额
discount_amount DECIMAL(10,2) -- 度量:折扣金额
);
-- 产品维度表
CREATE TABLE product_dim (
product_key INT PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
brand VARCHAR(100),
supplier VARCHAR(100)
);
-- 日期维度表
CREATE TABLE date_dim (
date_key INT PRIMARY KEY,
full_date DATE,
year INT,
quarter INT,
month INT,
week INT,
day_of_week INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
3.1.2 星型模式与雪花模式
星型模式(Star Schema):
┌─────────────┐
│ Date Dim │
└──────┬──────┘
│
┌─────────────┐│┌─────────────┐
│ Product Dim ├┼┤ Sales Fact │
└─────────────┘││└─────────────┘
│
┌──────┴──────┐
│Customer Dim │
└─────────────┘
雪花模式(Snowflake Schema):
┌─────────────┐
│ Brand │
└──────┬──────┘
│
┌──────┴──────┐ ┌─────────────┐
│ Category ├────┤ Product Dim │
└─────────────┘ └──────┬──────┘
│
┌──────┴──────┐
│ Sales Fact │
└─────────────┘
对比分析:
特性 | 星型模式 | 雪花模式 |
---|---|---|
复杂度 | 简单 | 复杂 |
查询性能 | 高 | 中等 |
存储空间 | 较大 | 较小 |
维护成本 | 低 | 高 |
数据冗余 | 有 | 少 |
3.2 Kylin核心概念
3.2.1 数据源(Data Source)
定义: Kylin中的数据源是指存储原始数据的系统,主要是Hive表。
数据源类型: - Hive表:最常用的数据源 - Kafka流:实时数据源 - JDBC数据源:关系型数据库
数据源配置示例:
{
"name": "sales_hive_source",
"type": "hive",
"connection": {
"database": "sales_db",
"table": "sales_fact",
"location": "hdfs://namenode:9000/warehouse/sales_db.db/sales_fact"
},
"schema": {
"columns": [
{"name": "sale_date", "type": "date"},
{"name": "product_id", "type": "string"},
{"name": "customer_id", "type": "string"},
{"name": "quantity", "type": "int"},
{"name": "amount", "type": "decimal"}
]
}
}
3.2.2 数据模型(Data Model)
定义: 数据模型定义了事实表和维度表之间的关系,是构建Cube的基础。
模型组成: - 事实表:主要的数据表 - 维度表:描述性数据表 - 连接关系:表之间的关联 - 分区信息:数据分区策略
数据模型示例:
{
"name": "sales_model",
"description": "销售数据模型",
"fact_table": "sales_fact",
"lookups": [
{
"table": "product_dim",
"kind": "LOOKUP",
"alias": "product",
"join": {
"type": "inner",
"primary_key": ["product_key"],
"foreign_key": ["product_key"]
}
},
{
"table": "customer_dim",
"kind": "LOOKUP",
"alias": "customer",
"join": {
"type": "inner",
"primary_key": ["customer_key"],
"foreign_key": ["customer_key"]
}
}
],
"partition_desc": {
"partition_date_column": "sale_date",
"partition_date_start": "2023-01-01",
"partition_type": "APPEND"
}
}
3.2.3 维度(Dimension)
定义: 维度是分析数据的角度,提供了查询和聚合的分组依据。
维度类型:
普通维度(Normal Dimension)
-- 示例:按产品类别分析 SELECT product.category, SUM(sales.amount) FROM sales_fact sales JOIN product_dim product ON sales.product_key = product.product_key GROUP BY product.category;
- 层次维度(Hierarchy Dimension)
sql -- 示例:时间层次 年->季度->月->日 SELECT date.year, date.quarter, date.month, SUM(sales.amount) FROM sales_fact sales JOIN date_dim date ON sales.date_key = date.date_key GROUP BY date.year, date.quarter, date.month;
- 层次维度(Hierarchy Dimension)
派生维度(Derived Dimension)
-- 示例:从事实表派生的维度 SELECT CASE WHEN amount > 1000 THEN 'High' WHEN amount > 500 THEN 'Medium' ELSE 'Low' END as amount_level, COUNT(*) FROM sales_fact GROUP BY amount_level;
维度配置:
{ "dimensions": [ { "name": "product_category", "table": "product", "column": "category", "derived": null }, { "name": "sale_date", "table": "sales_fact", "column": "sale_date", "derived": null }, { "name": "customer_segment", "table": "customer", "column": "segment", "derived": null } ] }
3.2.4 度量(Measure)
定义: 度量是需要聚合计算的数值字段,是分析的目标指标。
度量类型:
SUM(求和)
SELECT SUM(amount) as total_sales FROM sales_fact;
- COUNT(计数)
sql SELECT COUNT(*) as order_count FROM sales_fact;
- COUNT(计数)
MAX/MIN(最大值/最小值)
SELECT MAX(amount) as max_sale, MIN(amount) as min_sale FROM sales_fact;
- COUNT_DISTINCT(去重计数)
sql SELECT COUNT(DISTINCT customer_id) as unique_customers FROM sales_fact;
- COUNT_DISTINCT(去重计数)
TOP_N(前N个)
-- 获取销售额前10的产品 SELECT product_id, SUM(amount) as total FROM sales_fact GROUP BY product_id ORDER BY total DESC LIMIT 10;
度量配置:
{ "measures": [ { "name": "total_amount", "function": { "expression": "SUM", "parameter": { "type": "column", "value": "amount" }, "returntype": "decimal(19,4)" } }, { "name": "order_count", "function": { "expression": "COUNT", "parameter": { "type": "constant", "value": "1" }, "returntype": "bigint" } }, { "name": "unique_customers", "function": { "expression": "COUNT_DISTINCT", "parameter": { "type": "column", "value": "customer_id" }, "returntype": "hllc(10)" } } ] }
3.2.5 立方体(Cube)
定义: Cube是Kylin的核心概念,它是多维数据的预聚合结果,包含了所有维度组合的聚合数据。
Cube结构:
维度组合示例(3个维度:A, B, C):
┌─────────────────────────────────────┐
│ Cube结构 │
├─────────────────────────────────────┤
│ 0维: [] (总计) │
│ 1维: [A], [B], [C] │
│ 2维: [A,B], [A,C], [B,C] │
│ 3维: [A,B,C] │
└─────────────────────────────────────┘
总共 2^3 = 8 个Cuboid
Cube配置示例:
{
"cube_name": "sales_cube",
"model_name": "sales_model",
"description": "销售数据立方体",
"dimensions": [
{
"name": "date",
"table": "sales_fact",
"column": "sale_date"
},
{
"name": "product",
"table": "product",
"column": "category"
},
{
"name": "customer",
"table": "customer",
"column": "segment"
}
],
"measures": [
{
"name": "total_sales",
"function": {
"expression": "SUM",
"parameter": {
"type": "column",
"value": "amount"
}
}
}
],
"aggregation_groups": [
{
"includes": ["date", "product", "customer"],
"select_rule": {
"hierarchy_dims": [["date"]],
"mandatory_dims": ["date"],
"joint_dims": [["product", "customer"]]
}
}
]
}
3.3 聚合组(Aggregation Group)
3.3.1 聚合组概念
定义: 聚合组是对Cube中维度组合的优化配置,用于减少不必要的Cuboid,提高构建效率和查询性能。
作用: - 减少Cube大小 - 提高构建速度 - 优化存储空间 - 提升查询性能
3.3.2 聚合组规则
1. 强制维度(Mandatory Dimensions)
{
"mandatory_dims": ["date"],
"description": "所有Cuboid都必须包含date维度"
}
效果:
原始Cuboid: [], [A], [B], [C], [A,B], [A,C], [B,C], [A,B,C]
强制维度A后: [A], [A,B], [A,C], [A,B,C]
减少了4个Cuboid
2. 层次维度(Hierarchy Dimensions)
{
"hierarchy_dims": [["year", "quarter", "month", "day"]],
"description": "时间维度具有层次关系"
}
效果:
层次关系: year -> quarter -> month -> day
有效组合: [year], [year,quarter], [year,quarter,month], [year,quarter,month,day]
无效组合: [quarter], [month], [day], [year,month] 等
3. 联合维度(Joint Dimensions)
{
"joint_dims": [["province", "city"]],
"description": "省份和城市总是一起出现"
}
效果:
联合维度: [province, city] 作为一个整体
有效组合: [], [province,city], [date], [date,province,city]
无效组合: [province], [city], [date,province], [date,city]
3.3.3 聚合组设计示例
场景:电商销售分析
{
"aggregation_groups": [
{
"includes": ["date", "product_category", "customer_segment", "region"],
"select_rule": {
"mandatory_dims": ["date"],
"hierarchy_dims": [],
"joint_dims": []
}
},
{
"includes": ["date", "product_brand", "product_type"],
"select_rule": {
"mandatory_dims": ["date"],
"hierarchy_dims": [["product_brand", "product_type"]],
"joint_dims": []
}
}
]
}
优化效果分析:
# 计算Cuboid数量
def calculate_cuboids(dimensions, rules):
"""
计算聚合组规则下的Cuboid数量
"""
total_dims = len(dimensions)
# 无优化情况
original_count = 2 ** total_dims
# 应用强制维度
if rules.get('mandatory_dims'):
mandatory_count = len(rules['mandatory_dims'])
optimized_count = 2 ** (total_dims - mandatory_count)
else:
optimized_count = original_count
# 应用层次维度
if rules.get('hierarchy_dims'):
for hierarchy in rules['hierarchy_dims']:
hierarchy_count = len(hierarchy)
# 层次维度只有 n 个有效组合,而不是 2^n
reduction = 2 ** hierarchy_count - hierarchy_count
optimized_count -= reduction
return original_count, optimized_count
# 示例计算
dimensions = ['date', 'product', 'customer', 'region']
rules = {
'mandatory_dims': ['date'],
'hierarchy_dims': [],
'joint_dims': []
}
original, optimized = calculate_cuboids(dimensions, rules)
print(f"原始Cuboid数量: {original}")
print(f"优化后Cuboid数量: {optimized}")
print(f"减少比例: {(original - optimized) / original * 100:.1f}%")
3.4 构建与刷新
3.4.1 构建类型
1. 全量构建(Full Build)
# 构建整个时间范围的数据
kylin.sh org.apache.kylin.tool.CubeMigrationCLI \
-cube sales_cube \
-startTime 0 \
-endTime 9223372036854775807
特点: - 处理所有历史数据 - 构建时间长 - 资源消耗大 - 适合初始构建
2. 增量构建(Incremental Build)
# 构建指定时间段的数据
kylin.sh org.apache.kylin.tool.CubeMigrationCLI \
-cube sales_cube \
-startTime 1672531200000 \
-endTime 1672617600000
特点: - 只处理新增数据 - 构建时间短 - 资源消耗小 - 适合日常更新
3. 刷新构建(Refresh Build)
# 重新构建指定时间段
kylin.sh org.apache.kylin.tool.CubeMigrationCLI \
-cube sales_cube \
-startTime 1672531200000 \
-endTime 1672617600000 \
-refresh true
特点: - 重新处理已有数据 - 用于数据修正 - 覆盖原有结果
3.4.2 构建流程
构建步骤:
graph TD
A[开始构建] --> B[资源检查]
B --> C[创建临时表]
C --> D[数据抽取]
D --> E[维度字典构建]
E --> F[事实表处理]
F --> G[Cuboid计算]
G --> H[数据编码]
H --> I[HFile生成]
I --> J[数据加载]
J --> K[索引更新]
K --> L[构建完成]
详细流程说明:
Step 1: 资源检查 “`bash
检查YARN资源
yarn application -list
# 检查HDFS空间 hadoop fs -df -h
# 检查HBase状态 echo “status” | hbase shell
2. **Step 2: 创建临时表**
```sql
-- 创建中间表
CREATE TABLE kylin_intermediate_sales_cube_xxx (
date_key INT,
product_key INT,
customer_key INT,
amount DECIMAL(12,2)
)
STORED AS PARQUET;
- Step 3: 数据抽取
sql -- 抽取并关联数据 INSERT INTO kylin_intermediate_sales_cube_xxx SELECT f.date_key, f.product_key, f.customer_key, f.amount FROM sales_fact f JOIN product_dim p ON f.product_key = p.product_key JOIN customer_dim c ON f.customer_key = c.customer_key WHERE f.sale_date >= '2023-01-01' AND f.sale_date < '2023-01-02';
4. Step 4: Cuboid计算sql -- 计算各个维度组合的聚合结果 -- Cuboid [date, product] SELECT date_key, product_key, SUM(amount) FROM kylin_intermediate_sales_cube_xxx GROUP BY date_key, product_key; -- Cuboid [date, customer] SELECT date_key, customer_key, SUM(amount) FROM kylin_intermediate_sales_cube_xxx GROUP BY date_key, customer_key;
3.4.3 构建优化
Spark构建配置:
# kylin.properties中的Spark配置
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.driver.memory=4g
kylin.engine.spark-conf.spark.driver.cores=2
kylin.engine.spark-conf.spark.executor.memory=8g
kylin.engine.spark-conf.spark.executor.cores=4
kylin.engine.spark-conf.spark.executor.instances=10
kylin.engine.spark-conf.spark.sql.adaptive.enabled=true
kylin.engine.spark-conf.spark.sql.adaptive.coalescePartitions.enabled=true
构建监控脚本:
#!/bin/bash
# monitor_build.sh - 构建监控脚本
CUBE_NAME="$1"
BUILD_ID="$2"
if [ -z "$CUBE_NAME" ] || [ -z "$BUILD_ID" ]; then
echo "用法: $0 <cube_name> <build_id>"
exit 1
fi
echo "监控Cube构建: $CUBE_NAME (Build ID: $BUILD_ID)"
while true; do
# 获取构建状态
status=$(curl -s -X GET \
-H "Authorization: Basic QURNSU46S1lMSU4=" \
"http://localhost:7070/kylin/api/jobs/$BUILD_ID" | \
jq -r '.job_status')
echo "$(date): 构建状态 - $status"
case $status in
"FINISHED")
echo "构建成功完成!"
break
;;
"ERROR")
echo "构建失败!"
# 获取错误信息
curl -s -X GET \
-H "Authorization: Basic QURNSU46S1lMSU4=" \
"http://localhost:7070/kylin/api/jobs/$BUILD_ID" | \
jq -r '.steps[] | select(.status=="ERROR") | .cmd_output'
break
;;
"RUNNING")
# 获取进度信息
progress=$(curl -s -X GET \
-H "Authorization: Basic QURNSU46S1lMSU4=" \
"http://localhost:7070/kylin/api/jobs/$BUILD_ID" | \
jq -r '.progress')
echo "构建进度: $progress%"
;;
esac
sleep 30
done
3.5 查询与路由
3.5.1 查询流程
查询处理流程:
graph TD
A[SQL查询] --> B[SQL解析]
B --> C[查询优化]
C --> D[Cube选择]
D --> E[查询重写]
E --> F[Cuboid匹配]
F --> G[数据读取]
G --> H[结果聚合]
H --> I[返回结果]
查询路由逻辑:
class QueryRouter:
def __init__(self):
self.cubes = self.load_available_cubes()
def route_query(self, sql_query):
"""
查询路由逻辑
"""
# 1. 解析SQL
parsed_query = self.parse_sql(sql_query)
# 2. 提取查询要素
dimensions = self.extract_dimensions(parsed_query)
measures = self.extract_measures(parsed_query)
filters = self.extract_filters(parsed_query)
# 3. 选择最优Cube
best_cube = self.select_optimal_cube(
dimensions, measures, filters
)
if best_cube:
# 4. 重写查询
rewritten_query = self.rewrite_query(
sql_query, best_cube
)
return self.execute_cube_query(rewritten_query)
else:
# 5. 回退到原始数据源
return self.execute_raw_query(sql_query)
def select_optimal_cube(self, dimensions, measures, filters):
"""
选择最优Cube
"""
candidates = []
for cube in self.cubes:
# 检查维度覆盖
if not self.check_dimension_coverage(cube, dimensions):
continue
# 检查度量覆盖
if not self.check_measure_coverage(cube, measures):
continue
# 计算匹配度
score = self.calculate_match_score(
cube, dimensions, measures, filters
)
candidates.append((cube, score))
# 返回得分最高的Cube
if candidates:
return max(candidates, key=lambda x: x[1])[0]
return None
3.5.2 Cuboid选择
Cuboid匹配规则:
def find_best_cuboid(cube, query_dimensions):
"""
查找最佳Cuboid
"""
available_cuboids = cube.get_cuboids()
# 1. 精确匹配
exact_match = find_exact_match(available_cuboids, query_dimensions)
if exact_match:
return exact_match
# 2. 超集匹配(包含所有查询维度的最小Cuboid)
superset_matches = find_superset_matches(
available_cuboids, query_dimensions
)
if superset_matches:
return min(superset_matches, key=lambda x: len(x.dimensions))
# 3. 基础Cuboid(包含所有维度)
return cube.get_base_cuboid()
def find_exact_match(cuboids, query_dims):
"""
查找精确匹配的Cuboid
"""
query_set = set(query_dims)
for cuboid in cuboids:
if set(cuboid.dimensions) == query_set:
return cuboid
return None
def find_superset_matches(cuboids, query_dims):
"""
查找超集匹配的Cuboid
"""
query_set = set(query_dims)
matches = []
for cuboid in cuboids:
if query_set.issubset(set(cuboid.dimensions)):
matches.append(cuboid)
return matches
查询示例:
-- 原始查询
SELECT
p.category,
d.year,
SUM(f.amount) as total_sales
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN date_dim d ON f.date_key = d.date_key
WHERE d.year = 2023
GROUP BY p.category, d.year;
-- Kylin重写后的查询
SELECT
category,
year,
SUM(total_sales) as total_sales
FROM kylin_sales_cube
WHERE year = 2023
GROUP BY category, year;
3.6 本章小结
本章详细介绍了Apache Kylin的核心概念和术语:
核心概念: 1. 数据仓库基础:事实表、维度表、星型模式、雪花模式 2. Kylin核心概念:数据源、数据模型、维度、度量、Cube 3. 聚合组优化:强制维度、层次维度、联合维度 4. 构建与刷新:全量构建、增量构建、刷新构建 5. 查询与路由:查询流程、Cube选择、Cuboid匹配
关键要点: 1. 理解多维数据模型是使用Kylin的基础 2. 合理设计聚合组可以显著优化性能 3. 选择合适的构建策略平衡效率和资源消耗 4. 查询路由机制保证了查询的高效执行
下一章预告: 下一章将介绍数据源管理与连接,包括如何配置和管理各种类型的数据源。
3.7 练习与思考
理论练习
- 设计一个电商场景的星型模式数据模型
- 计算给定维度组合下的Cuboid数量
- 分析不同聚合组规则的优化效果
实践练习
- 创建示例数据模型和Cube定义
- 编写查询路由的伪代码
- 设计构建监控和告警机制
思考题
- 如何平衡Cube的完整性和构建效率?
- 在什么情况下应该使用雪花模式而不是星型模式?
- 如何根据查询模式优化聚合组设计?