3.1 数据仓库基础概念

3.1.1 事实表与维度表

事实表(Fact Table): - 存储业务过程中的度量数据 - 通常包含数值型字段(如销售额、数量) - 通过外键关联到维度表 - 数据量大,更新频繁

维度表(Dimension Table): - 存储描述性信息 - 提供分析的角度和上下文 - 相对稳定,变化较少 - 支持层次结构

示例数据模型:

-- 销售事实表
CREATE TABLE sales_fact (
    sale_id BIGINT,
    date_key INT,           -- 关联日期维度
    product_key INT,        -- 关联产品维度
    customer_key INT,       -- 关联客户维度
    store_key INT,          -- 关联门店维度
    quantity INT,           -- 度量:数量
    unit_price DECIMAL(10,2), -- 度量:单价
    total_amount DECIMAL(12,2), -- 度量:总金额
    discount_amount DECIMAL(10,2) -- 度量:折扣金额
);

-- 产品维度表
CREATE TABLE product_dim (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(50),
    product_name VARCHAR(200),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    supplier VARCHAR(100)
);

-- 日期维度表
CREATE TABLE date_dim (
    date_key INT PRIMARY KEY,
    full_date DATE,
    year INT,
    quarter INT,
    month INT,
    week INT,
    day_of_week INT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN
);

3.1.2 星型模式与雪花模式

星型模式(Star Schema):

        ┌─────────────┐
        │  Date Dim   │
        └──────┬──────┘
               │
┌─────────────┐│┌─────────────┐
│ Product Dim ├┼┤ Sales Fact  │
└─────────────┘││└─────────────┘
               │
        ┌──────┴──────┐
        │Customer Dim │
        └─────────────┘

雪花模式(Snowflake Schema):

    ┌─────────────┐
    │   Brand     │
    └──────┬──────┘
           │
    ┌──────┴──────┐    ┌─────────────┐
    │  Category   ├────┤ Product Dim │
    └─────────────┘    └──────┬──────┘
                              │
                       ┌──────┴──────┐
                       │ Sales Fact  │
                       └─────────────┘

对比分析:

特性 星型模式 雪花模式
复杂度 简单 复杂
查询性能 中等
存储空间 较大 较小
维护成本
数据冗余

3.2 Kylin核心概念

3.2.1 数据源(Data Source)

定义: Kylin中的数据源是指存储原始数据的系统,主要是Hive表。

数据源类型: - Hive表:最常用的数据源 - Kafka流:实时数据源 - JDBC数据源:关系型数据库

数据源配置示例:

{
  "name": "sales_hive_source",
  "type": "hive",
  "connection": {
    "database": "sales_db",
    "table": "sales_fact",
    "location": "hdfs://namenode:9000/warehouse/sales_db.db/sales_fact"
  },
  "schema": {
    "columns": [
      {"name": "sale_date", "type": "date"},
      {"name": "product_id", "type": "string"},
      {"name": "customer_id", "type": "string"},
      {"name": "quantity", "type": "int"},
      {"name": "amount", "type": "decimal"}
    ]
  }
}

3.2.2 数据模型(Data Model)

定义: 数据模型定义了事实表和维度表之间的关系,是构建Cube的基础。

模型组成: - 事实表:主要的数据表 - 维度表:描述性数据表 - 连接关系:表之间的关联 - 分区信息:数据分区策略

数据模型示例:

{
  "name": "sales_model",
  "description": "销售数据模型",
  "fact_table": "sales_fact",
  "lookups": [
    {
      "table": "product_dim",
      "kind": "LOOKUP",
      "alias": "product",
      "join": {
        "type": "inner",
        "primary_key": ["product_key"],
        "foreign_key": ["product_key"]
      }
    },
    {
      "table": "customer_dim",
      "kind": "LOOKUP",
      "alias": "customer",
      "join": {
        "type": "inner",
        "primary_key": ["customer_key"],
        "foreign_key": ["customer_key"]
      }
    }
  ],
  "partition_desc": {
    "partition_date_column": "sale_date",
    "partition_date_start": "2023-01-01",
    "partition_type": "APPEND"
  }
}

3.2.3 维度(Dimension)

定义: 维度是分析数据的角度,提供了查询和聚合的分组依据。

维度类型:

  1. 普通维度(Normal Dimension)

    -- 示例:按产品类别分析
    SELECT product.category, SUM(sales.amount)
    FROM sales_fact sales
    JOIN product_dim product ON sales.product_key = product.product_key
    GROUP BY product.category;
    
    1. 层次维度(Hierarchy Dimension) sql -- 示例:时间层次 年->季度->月->日 SELECT date.year, date.quarter, date.month, SUM(sales.amount) FROM sales_fact sales JOIN date_dim date ON sales.date_key = date.date_key GROUP BY date.year, date.quarter, date.month;
  2. 派生维度(Derived Dimension)

    -- 示例:从事实表派生的维度
    SELECT 
       CASE 
           WHEN amount > 1000 THEN 'High'
           WHEN amount > 500 THEN 'Medium'
           ELSE 'Low'
       END as amount_level,
       COUNT(*)
    FROM sales_fact
    GROUP BY amount_level;
    

    维度配置:

    {
    "dimensions": [
    {
      "name": "product_category",
      "table": "product",
      "column": "category",
      "derived": null
    },
    {
      "name": "sale_date",
      "table": "sales_fact",
      "column": "sale_date",
      "derived": null
    },
    {
      "name": "customer_segment",
      "table": "customer",
      "column": "segment",
      "derived": null
    }
    ]
    }
    

3.2.4 度量(Measure)

定义: 度量是需要聚合计算的数值字段,是分析的目标指标。

度量类型:

  1. SUM(求和)

    SELECT SUM(amount) as total_sales
    FROM sales_fact;
    
    1. COUNT(计数) sql SELECT COUNT(*) as order_count FROM sales_fact;
  2. MAX/MIN(最大值/最小值)

    SELECT MAX(amount) as max_sale, MIN(amount) as min_sale
    FROM sales_fact;
    
    1. COUNT_DISTINCT(去重计数) sql SELECT COUNT(DISTINCT customer_id) as unique_customers FROM sales_fact;
  3. TOP_N(前N个)

    -- 获取销售额前10的产品
    SELECT product_id, SUM(amount) as total
    FROM sales_fact
    GROUP BY product_id
    ORDER BY total DESC
    LIMIT 10;
    

    度量配置:

    {
    "measures": [
    {
      "name": "total_amount",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "amount"
        },
        "returntype": "decimal(19,4)"
      }
    },
    {
      "name": "order_count",
      "function": {
        "expression": "COUNT",
        "parameter": {
          "type": "constant",
          "value": "1"
        },
        "returntype": "bigint"
      }
    },
    {
      "name": "unique_customers",
      "function": {
        "expression": "COUNT_DISTINCT",
        "parameter": {
          "type": "column",
          "value": "customer_id"
        },
        "returntype": "hllc(10)"
      }
    }
    ]
    }
    

3.2.5 立方体(Cube)

定义: Cube是Kylin的核心概念,它是多维数据的预聚合结果,包含了所有维度组合的聚合数据。

Cube结构:

维度组合示例(3个维度:A, B, C):

┌─────────────────────────────────────┐
│              Cube结构                │
├─────────────────────────────────────┤
│ 0维: [] (总计)                      │
│ 1维: [A], [B], [C]                 │
│ 2维: [A,B], [A,C], [B,C]           │
│ 3维: [A,B,C]                       │
└─────────────────────────────────────┘

总共 2^3 = 8 个Cuboid

Cube配置示例:

{
  "cube_name": "sales_cube",
  "model_name": "sales_model",
  "description": "销售数据立方体",
  "dimensions": [
    {
      "name": "date",
      "table": "sales_fact",
      "column": "sale_date"
    },
    {
      "name": "product",
      "table": "product",
      "column": "category"
    },
    {
      "name": "customer",
      "table": "customer",
      "column": "segment"
    }
  ],
  "measures": [
    {
      "name": "total_sales",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "amount"
        }
      }
    }
  ],
  "aggregation_groups": [
    {
      "includes": ["date", "product", "customer"],
      "select_rule": {
        "hierarchy_dims": [["date"]],
        "mandatory_dims": ["date"],
        "joint_dims": [["product", "customer"]]
      }
    }
  ]
}

3.3 聚合组(Aggregation Group)

3.3.1 聚合组概念

定义: 聚合组是对Cube中维度组合的优化配置,用于减少不必要的Cuboid,提高构建效率和查询性能。

作用: - 减少Cube大小 - 提高构建速度 - 优化存储空间 - 提升查询性能

3.3.2 聚合组规则

1. 强制维度(Mandatory Dimensions)

{
  "mandatory_dims": ["date"],
  "description": "所有Cuboid都必须包含date维度"
}

效果:

原始Cuboid: [], [A], [B], [C], [A,B], [A,C], [B,C], [A,B,C]
强制维度A后: [A], [A,B], [A,C], [A,B,C]
减少了4个Cuboid

2. 层次维度(Hierarchy Dimensions)

{
  "hierarchy_dims": [["year", "quarter", "month", "day"]],
  "description": "时间维度具有层次关系"
}

效果:

层次关系: year -> quarter -> month -> day
有效组合: [year], [year,quarter], [year,quarter,month], [year,quarter,month,day]
无效组合: [quarter], [month], [day], [year,month] 等

3. 联合维度(Joint Dimensions)

{
  "joint_dims": [["province", "city"]],
  "description": "省份和城市总是一起出现"
}

效果:

联合维度: [province, city] 作为一个整体
有效组合: [], [province,city], [date], [date,province,city]
无效组合: [province], [city], [date,province], [date,city]

3.3.3 聚合组设计示例

场景:电商销售分析

{
  "aggregation_groups": [
    {
      "includes": ["date", "product_category", "customer_segment", "region"],
      "select_rule": {
        "mandatory_dims": ["date"],
        "hierarchy_dims": [],
        "joint_dims": []
      }
    },
    {
      "includes": ["date", "product_brand", "product_type"],
      "select_rule": {
        "mandatory_dims": ["date"],
        "hierarchy_dims": [["product_brand", "product_type"]],
        "joint_dims": []
      }
    }
  ]
}

优化效果分析:

# 计算Cuboid数量
def calculate_cuboids(dimensions, rules):
    """
    计算聚合组规则下的Cuboid数量
    """
    total_dims = len(dimensions)
    
    # 无优化情况
    original_count = 2 ** total_dims
    
    # 应用强制维度
    if rules.get('mandatory_dims'):
        mandatory_count = len(rules['mandatory_dims'])
        optimized_count = 2 ** (total_dims - mandatory_count)
    else:
        optimized_count = original_count
    
    # 应用层次维度
    if rules.get('hierarchy_dims'):
        for hierarchy in rules['hierarchy_dims']:
            hierarchy_count = len(hierarchy)
            # 层次维度只有 n 个有效组合,而不是 2^n
            reduction = 2 ** hierarchy_count - hierarchy_count
            optimized_count -= reduction
    
    return original_count, optimized_count

# 示例计算
dimensions = ['date', 'product', 'customer', 'region']
rules = {
    'mandatory_dims': ['date'],
    'hierarchy_dims': [],
    'joint_dims': []
}

original, optimized = calculate_cuboids(dimensions, rules)
print(f"原始Cuboid数量: {original}")
print(f"优化后Cuboid数量: {optimized}")
print(f"减少比例: {(original - optimized) / original * 100:.1f}%")

3.4 构建与刷新

3.4.1 构建类型

1. 全量构建(Full Build)

# 构建整个时间范围的数据
kylin.sh org.apache.kylin.tool.CubeMigrationCLI \
  -cube sales_cube \
  -startTime 0 \
  -endTime 9223372036854775807

特点: - 处理所有历史数据 - 构建时间长 - 资源消耗大 - 适合初始构建

2. 增量构建(Incremental Build)

# 构建指定时间段的数据
kylin.sh org.apache.kylin.tool.CubeMigrationCLI \
  -cube sales_cube \
  -startTime 1672531200000 \
  -endTime 1672617600000

特点: - 只处理新增数据 - 构建时间短 - 资源消耗小 - 适合日常更新

3. 刷新构建(Refresh Build)

# 重新构建指定时间段
kylin.sh org.apache.kylin.tool.CubeMigrationCLI \
  -cube sales_cube \
  -startTime 1672531200000 \
  -endTime 1672617600000 \
  -refresh true

特点: - 重新处理已有数据 - 用于数据修正 - 覆盖原有结果

3.4.2 构建流程

构建步骤:

graph TD
    A[开始构建] --> B[资源检查]
    B --> C[创建临时表]
    C --> D[数据抽取]
    D --> E[维度字典构建]
    E --> F[事实表处理]
    F --> G[Cuboid计算]
    G --> H[数据编码]
    H --> I[HFile生成]
    I --> J[数据加载]
    J --> K[索引更新]
    K --> L[构建完成]

详细流程说明:

  1. Step 1: 资源检查 “`bash

    检查YARN资源

    yarn application -list

# 检查HDFS空间 hadoop fs -df -h

# 检查HBase状态 echo “status” | hbase shell


2. **Step 2: 创建临时表**
   ```sql
   -- 创建中间表
   CREATE TABLE kylin_intermediate_sales_cube_xxx (
       date_key INT,
       product_key INT,
       customer_key INT,
       amount DECIMAL(12,2)
   )
   STORED AS PARQUET;
  1. Step 3: 数据抽取 sql -- 抽取并关联数据 INSERT INTO kylin_intermediate_sales_cube_xxx SELECT f.date_key, f.product_key, f.customer_key, f.amount FROM sales_fact f JOIN product_dim p ON f.product_key = p.product_key JOIN customer_dim c ON f.customer_key = c.customer_key WHERE f.sale_date >= '2023-01-01' AND f.sale_date < '2023-01-02'; 4. Step 4: Cuboid计算 sql -- 计算各个维度组合的聚合结果 -- Cuboid [date, product] SELECT date_key, product_key, SUM(amount) FROM kylin_intermediate_sales_cube_xxx GROUP BY date_key, product_key; -- Cuboid [date, customer] SELECT date_key, customer_key, SUM(amount) FROM kylin_intermediate_sales_cube_xxx GROUP BY date_key, customer_key;

3.4.3 构建优化

Spark构建配置:

# kylin.properties中的Spark配置
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.driver.memory=4g
kylin.engine.spark-conf.spark.driver.cores=2
kylin.engine.spark-conf.spark.executor.memory=8g
kylin.engine.spark-conf.spark.executor.cores=4
kylin.engine.spark-conf.spark.executor.instances=10
kylin.engine.spark-conf.spark.sql.adaptive.enabled=true
kylin.engine.spark-conf.spark.sql.adaptive.coalescePartitions.enabled=true

构建监控脚本:

#!/bin/bash
# monitor_build.sh - 构建监控脚本

CUBE_NAME="$1"
BUILD_ID="$2"

if [ -z "$CUBE_NAME" ] || [ -z "$BUILD_ID" ]; then
    echo "用法: $0 <cube_name> <build_id>"
    exit 1
fi

echo "监控Cube构建: $CUBE_NAME (Build ID: $BUILD_ID)"

while true; do
    # 获取构建状态
    status=$(curl -s -X GET \
        -H "Authorization: Basic QURNSU46S1lMSU4=" \
        "http://localhost:7070/kylin/api/jobs/$BUILD_ID" | \
        jq -r '.job_status')
    
    echo "$(date): 构建状态 - $status"
    
    case $status in
        "FINISHED")
            echo "构建成功完成!"
            break
            ;;
        "ERROR")
            echo "构建失败!"
            # 获取错误信息
            curl -s -X GET \
                -H "Authorization: Basic QURNSU46S1lMSU4=" \
                "http://localhost:7070/kylin/api/jobs/$BUILD_ID" | \
                jq -r '.steps[] | select(.status=="ERROR") | .cmd_output'
            break
            ;;
        "RUNNING")
            # 获取进度信息
            progress=$(curl -s -X GET \
                -H "Authorization: Basic QURNSU46S1lMSU4=" \
                "http://localhost:7070/kylin/api/jobs/$BUILD_ID" | \
                jq -r '.progress')
            echo "构建进度: $progress%"
            ;;
    esac
    
    sleep 30
done

3.5 查询与路由

3.5.1 查询流程

查询处理流程:

graph TD
    A[SQL查询] --> B[SQL解析]
    B --> C[查询优化]
    C --> D[Cube选择]
    D --> E[查询重写]
    E --> F[Cuboid匹配]
    F --> G[数据读取]
    G --> H[结果聚合]
    H --> I[返回结果]

查询路由逻辑:

class QueryRouter:
    def __init__(self):
        self.cubes = self.load_available_cubes()
    
    def route_query(self, sql_query):
        """
        查询路由逻辑
        """
        # 1. 解析SQL
        parsed_query = self.parse_sql(sql_query)
        
        # 2. 提取查询要素
        dimensions = self.extract_dimensions(parsed_query)
        measures = self.extract_measures(parsed_query)
        filters = self.extract_filters(parsed_query)
        
        # 3. 选择最优Cube
        best_cube = self.select_optimal_cube(
            dimensions, measures, filters
        )
        
        if best_cube:
            # 4. 重写查询
            rewritten_query = self.rewrite_query(
                sql_query, best_cube
            )
            return self.execute_cube_query(rewritten_query)
        else:
            # 5. 回退到原始数据源
            return self.execute_raw_query(sql_query)
    
    def select_optimal_cube(self, dimensions, measures, filters):
        """
        选择最优Cube
        """
        candidates = []
        
        for cube in self.cubes:
            # 检查维度覆盖
            if not self.check_dimension_coverage(cube, dimensions):
                continue
            
            # 检查度量覆盖
            if not self.check_measure_coverage(cube, measures):
                continue
            
            # 计算匹配度
            score = self.calculate_match_score(
                cube, dimensions, measures, filters
            )
            
            candidates.append((cube, score))
        
        # 返回得分最高的Cube
        if candidates:
            return max(candidates, key=lambda x: x[1])[0]
        return None

3.5.2 Cuboid选择

Cuboid匹配规则:

def find_best_cuboid(cube, query_dimensions):
    """
    查找最佳Cuboid
    """
    available_cuboids = cube.get_cuboids()
    
    # 1. 精确匹配
    exact_match = find_exact_match(available_cuboids, query_dimensions)
    if exact_match:
        return exact_match
    
    # 2. 超集匹配(包含所有查询维度的最小Cuboid)
    superset_matches = find_superset_matches(
        available_cuboids, query_dimensions
    )
    if superset_matches:
        return min(superset_matches, key=lambda x: len(x.dimensions))
    
    # 3. 基础Cuboid(包含所有维度)
    return cube.get_base_cuboid()

def find_exact_match(cuboids, query_dims):
    """
    查找精确匹配的Cuboid
    """
    query_set = set(query_dims)
    for cuboid in cuboids:
        if set(cuboid.dimensions) == query_set:
            return cuboid
    return None

def find_superset_matches(cuboids, query_dims):
    """
    查找超集匹配的Cuboid
    """
    query_set = set(query_dims)
    matches = []
    for cuboid in cuboids:
        if query_set.issubset(set(cuboid.dimensions)):
            matches.append(cuboid)
    return matches

查询示例:

-- 原始查询
SELECT 
    p.category,
    d.year,
    SUM(f.amount) as total_sales
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN date_dim d ON f.date_key = d.date_key
WHERE d.year = 2023
GROUP BY p.category, d.year;

-- Kylin重写后的查询
SELECT 
    category,
    year,
    SUM(total_sales) as total_sales
FROM kylin_sales_cube
WHERE year = 2023
GROUP BY category, year;

3.6 本章小结

本章详细介绍了Apache Kylin的核心概念和术语:

核心概念: 1. 数据仓库基础:事实表、维度表、星型模式、雪花模式 2. Kylin核心概念:数据源、数据模型、维度、度量、Cube 3. 聚合组优化:强制维度、层次维度、联合维度 4. 构建与刷新:全量构建、增量构建、刷新构建 5. 查询与路由:查询流程、Cube选择、Cuboid匹配

关键要点: 1. 理解多维数据模型是使用Kylin的基础 2. 合理设计聚合组可以显著优化性能 3. 选择合适的构建策略平衡效率和资源消耗 4. 查询路由机制保证了查询的高效执行

下一章预告: 下一章将介绍数据源管理与连接,包括如何配置和管理各种类型的数据源。

3.7 练习与思考

理论练习

  1. 设计一个电商场景的星型模式数据模型
  2. 计算给定维度组合下的Cuboid数量
  3. 分析不同聚合组规则的优化效果

实践练习

  1. 创建示例数据模型和Cube定义
  2. 编写查询路由的伪代码
  3. 设计构建监控和告警机制

思考题

  1. 如何平衡Cube的完整性和构建效率?
  2. 在什么情况下应该使用雪花模式而不是星型模式?
  3. 如何根据查询模式优化聚合组设计?