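4.1 Routing System Overview

How Routing Works

Alertmanager's routing system is a tree of routes that decides which receiver each alert is sent to. Every alert enters at the root route and descends depth-first through the child routes. The overall workflow is as follows: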

flowchart TD
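    A[Receive alert] --> B[Match root route]
    B --> C{Child route matches?}
    C -->|Yes| D[Child route handles alert]
    C -->|No| E[Use current route's receiver]
    D --> F{continue=true?}
    F -->|Yes| G[Keep matching sibling routes]
    F -->|No| H[Stop matching]
    G --> I[Send to multiple receivers]
    H --> J[Send to a single receiver]
    E --> J
    I --> K[Group alerts]
    J --> K
    K --> L[Apply timing settings]
    L --> M[Send notification]

Route Matching Order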

# Route matching example
route:
  receiver: 'default'  # fallback receiver when no child route matches
  group_by: ['alertname']
  
  routes:
  # 1. Critical alerts are checked first
  - match:
      severity: critical
    receiver: 'critical-team'
    routes:
    # 1.1 Database alerts among the critical alerts
    - match:
        team: database
      receiver: 'dba-critical'
    # 1.2 Infrastructure alerts among the critical alerts
    - match:
        team: infrastructure
      receiver: 'infra-critical'
  
  # 2. Team alerts are checked next
  - match:
      team: web
    receiver: 'web-team'
    continue: true  # keep matching the sibling routes below
  
  # 3. Service alerts are checked last
  - match_re:
      service: '^(api|frontend).*'
    receiver: 'service-team'
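
The traversal described above can be modeled in a few lines of Python. This is a minimal sketch of the depth-first matching rule, not Alertmanager's actual code: the `Route` class, the `cont` field (standing in for `continue`) and the `resolve` function are names made up for this illustration, and only `match` and `match_re` are modeled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)      # exact label matches
    match_re: dict = field(default_factory=dict)   # fully anchored regex matches
    cont: bool = False                             # models the `continue` flag
    routes: list = field(default_factory=list)     # child routes

    def matches(self, labels: dict) -> bool:
        exact = all(labels.get(k, "") == v for k, v in self.match.items())
        regex = all(re.fullmatch(p, labels.get(k, "")) for k, p in self.match_re.items())
        return exact and regex


def resolve(route: Route, labels: dict) -> list:
    """Return the receivers that an alert with `labels` is routed to."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        hits = resolve(child, labels)
        found.extend(hits)
        if hits and not child.cont:
            break  # the first matching child wins unless it sets continue
    # If no child matched, the current route handles the alert itself.
    return found or [route.receiver]


# The example tree from this section.
tree = Route("default", routes=[
    Route("critical-team", match={"severity": "critical"}, routes=[
        Route("dba-critical", match={"team": "database"}),
        Route("infra-critical", match={"team": "infrastructure"}),
    ]),
    Route("web-team", match={"team": "web"}, cont=True),
    Route("service-team", match_re={"service": "^(api|frontend).*"}),
])

print(resolve(tree, {"severity": "critical", "team": "database"}))  # ['dba-critical']
print(resolve(tree, {"team": "web", "service": "api-gateway"}))     # ['web-team', 'service-team']
print(resolve(tree, {"severity": "critical", "team": "web"}))       # ['critical-team']
print(resolve(tree, {"alertname": "Unknown"}))                      # ['default']
```

The third call shows a detail that is easy to miss: a critical alert from the web team stops at 'critical-team' and never reaches the 'web-team' route, because the critical route does not set continue.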

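4.2 Route Configuration in Detail

Basic Route Configuration

route:
  # Grouping labels: alerts with identical values for these labels share one group
  group_by: ['alertname', 'cluster', 'service']
  
  # Timing settings
  group_wait: 10s      # how long to buffer a new group before its first notification
  group_interval: 10s  # how long to wait before notifying about changes to an already-notified group
  repeat_interval: 1h  # how long to wait before re-sending a notification that is still firing
  
  # Default receiver
  receiver: 'default'
  
  # Whether to keep matching sibling routes after this route matches.
  # Only meaningful on child routes; shown here with its default value.
  continue: false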

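Matcher Types

All matchers listed on a route must hold for the route to match. Regular expressions are fully anchored, so a pattern has to match the entire label value. The match and match_re fields are deprecated in favor of matchers.

1. Exact match (match)

routes:
- match:
    severity: critical      # severity must equal critical
    alertname: HighCPUUsage # alertname must equal HighCPUUsage
    team: database         # team must equal database
  receiver: 'db-critical'

2. Regular-expression match (match_re)

routes:
- match_re:
    instance: '^prod-.*'           # instance name starts with prod-
    service: '(web|api|frontend)'  # service is exactly one of the listed values
    alertname: '.*CPU.*'           # alert name contains CPU
  receiver: 'prod-team'

3. Unified matchers (matchers), recommended

routes:
- matchers:
  - alertname = "HighCPUUsage"        # equals
  - severity =~ "warning|critical"    # matches the regex
  - instance !~ "test.*"              # does not match the regex
  - team != "development"             # does not equal
  - environment =~ "prod|staging"     # regex with several alternatives
  receiver: 'production-team'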
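
To make the four operators concrete, here is a minimal Python sketch that parses and evaluates matcher strings of the form shown above. It is an illustration under simplifying assumptions, not Alertmanager's parser: `parse_matcher`, `matcher_holds` and `route_matches` are made-up names, values must be double-quoted, and escaping inside values is not handled. A missing label is treated as an empty string, and regexes are matched against the whole value.

```python
import re

MATCHER_RE = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)\s*(=~|!~|!=|=)\s*"(.*)"\s*$')


def parse_matcher(text: str):
    """Split 'name op "value"' into (name, op, value)."""
    m = MATCHER_RE.match(text)
    if m is None:
        raise ValueError(f"unsupported matcher syntax: {text!r}")
    return m.group(1), m.group(2), m.group(3)


def matcher_holds(text: str, labels: dict) -> bool:
    name, op, value = parse_matcher(text)
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    hit = re.fullmatch(value, actual) is not None  # fully anchored regex
    return hit if op == "=~" else not hit


def route_matches(matchers: list, labels: dict) -> bool:
    """A route matches only if every one of its matchers holds."""
    return all(matcher_holds(m, labels) for m in matchers)


matchers = [
    'alertname = "HighCPUUsage"',
    'severity =~ "warning|critical"',
    'instance !~ "test.*"',
    'team != "development"',
    'environment =~ "prod|staging"',
]

alert = {"alertname": "HighCPUUsage", "severity": "warning",
         "instance": "prod-test-1", "team": "web", "environment": "prod"}
print(route_matches(matchers, alert))                            # True
print(route_matches(matchers, {**alert, "instance": "test-1"}))  # False
```

The instance value "prod-test-1" passes the `!~ "test.*"` matcher because the anchored pattern would have to match the whole value, not just a substring.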

Advanced Route Configuration

Multi-Level Route Structure

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  
  routes:
  # Level 1: classify by severity
  - match:
      severity: critical
    receiver: 'critical-default'
    group_wait: 5s
    repeat_interval: 30m
    routes:
    # Level 2: classify by team
    - match:
        team: database
      receiver: 'dba-critical'
      group_by: ['alertname', 'instance']
      routes:
      # Level 3: classify by database type
      - match:
          db_type: mysql
        receiver: 'mysql-dba'
      - match:
          db_type: postgresql
        receiver: 'postgres-dba'
    
    - match:
        team: infrastructure
      receiver: 'infra-critical'
      group_by: ['alertname', 'datacenter']
      routes:
      - match:
          datacenter: us-west
        receiver: 'infra-west'
      - match:
          datacenter: us-east
        receiver: 'infra-east'
  
  # Warning-level alerts
  - match:
      severity: warning
    receiver: 'warning-default'
    repeat_interval: 2h
    routes:
    - match:
        team: application
      receiver: 'app-warnings'
      group_by: ['alertname', 'service']
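
In a multi-level tree, a child route inherits any grouping and timing settings it does not set itself from its parent. The following Python sketch resolves the effective settings for one path through the tree above. `effective_settings` and the dictionary layout are assumptions made for this illustration.

```python
INHERITED = ("group_by", "group_wait", "group_interval", "repeat_interval")


def effective_settings(path: list) -> dict:
    """Merge the routes on the path from root to leaf; deeper routes override."""
    settings = {}
    for route in path:
        for key in INHERITED:
            if key in route:
                settings[key] = route[key]
        # A route without its own receiver keeps the parent's receiver.
        settings["receiver"] = route.get("receiver", settings.get("receiver"))
    return settings


# The path root -> critical -> database -> mysql from the example above.
root = {"receiver": "default", "group_by": ["alertname"], "group_wait": "10s",
        "group_interval": "10s", "repeat_interval": "1h"}
critical = {"receiver": "critical-default", "group_wait": "5s", "repeat_interval": "30m"}
database = {"receiver": "dba-critical", "group_by": ["alertname", "instance"]}
mysql = {"receiver": "mysql-dba"}

print(effective_settings([root, critical, database, mysql]))
```

For the 'mysql-dba' route this yields group_by from the database level, group_wait and repeat_interval from the critical level, and group_interval from the root.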

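Conditional Routing and continue

route:
  receiver: 'default'
  routes:
  # Every production alert goes to the operations team
  - match:
      environment: production
    receiver: 'ops-team'
    continue: true  # keep matching the sibling routes below
  
  # Critical alerts are additionally sent to management
  - match:
      severity: critical
    receiver: 'management'
    continue: true
  
  # Database alerts go to the DBA team
  - match:
      team: database
    receiver: 'dba-team'
    # continue: false (the default): matching stops here, so a database
    # alert is never checked against the security route below
  
  # Security alerts go to the security team
  - match:
      category: security
    receiver: 'security-team'
    group_wait: 0s  # send security alerts immediately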
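
The effect of continue on sibling routes is easiest to see by running a few label sets through the configuration above. The sketch below models only one level of siblings with exact matches. `receivers_for` and the list-of-dicts layout are hypothetical names for this illustration.

```python
def receivers_for(labels: dict, routes: list, default: str) -> list:
    """Scan sibling routes in order, honouring each route's continue flag."""
    hits = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            hits.append(route["receiver"])
            if not route.get("continue", False):
                break  # matching stops at the first route without continue
    return hits or [default]


routes = [
    {"match": {"environment": "production"}, "receiver": "ops-team", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "management", "continue": True},
    {"match": {"team": "database"}, "receiver": "dba-team"},
    {"match": {"category": "security"}, "receiver": "security-team"},
]

alert = {"environment": "production", "severity": "critical",
         "team": "database", "category": "security"}
print(receivers_for(alert, routes, "default"))
# ['ops-team', 'management', 'dba-team']
print(receivers_for({"category": "security"}, routes, "default"))
# ['security-team']
```

The first alert carries category=security but never reaches 'security-team', because the database route ends the scan. If such alerts must reach the security team as well, the database route would need continue: true or the security route would have to come earlier.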

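Time-Based Routing

# Times are interpreted in UTC by default.
time_intervals:
- name: 'business-hours'
  time_intervals:
  - times:
    - start_time: '09:00'
      end_time: '17:00'
    weekdays: ['monday:friday']

- name: 'after-hours'
  time_intervals:
  # Weekday nights: a time range cannot wrap past midnight, so it is split in two.
  # end_time is exclusive, so 09:00 and 17:00 leave no gap next to business-hours.
  - times:
    - start_time: '00:00'
      end_time: '09:00'
    - start_time: '17:00'
      end_time: '24:00'
    weekdays: ['monday:friday']
  # Weekends: the whole day
  - times:
    - start_time: '00:00'
      end_time: '24:00'
    weekdays: ['saturday', 'sunday']

# Example maintenance window referenced below (the schedule is an assumption)
- name: 'maintenance-window'
  time_intervals:
  - times:
    - start_time: '02:00'
      end_time: '04:00'
    weekdays: ['sunday']

route:
  receiver: 'default'
  routes:
  # Critical alerts during business hours
  - match:
      severity: critical
    receiver: 'oncall-business'
    active_time_intervals:
    - 'business-hours'
    group_wait: 5s
    repeat_interval: 15m
    # Time intervals only suppress notifications; they do not affect matching.
    # Without continue, this route would capture critical alerts around the
    # clock and the after-hours route below would never be evaluated.
    continue: true
  
  # Critical alerts outside business hours
  - match:
      severity: critical
    receiver: 'oncall-afterhours'
    active_time_intervals:
    - 'after-hours'
    group_wait: 2s
    repeat_interval: 10m
  
  # Mute notifications during the maintenance window
  - match:
      team: infrastructure
    receiver: 'infra-team'
    mute_time_intervals:
    - 'maintenance-window'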
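
The following Python sketch shows how a moment in time can be checked against named intervals like the ones above. It is a simplified model under stated assumptions: the helper names are made up, weekdays are indexed from Monday, only the `weekdays` and `times` fields are handled, and start_time is treated as inclusive while end_time is exclusive. Because a single time range cannot wrap past midnight, the after-hours interval in the sketch uses two ranges per weekday.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def expand_weekdays(spec: str) -> set:
    """Expand 'monday:friday' or 'sunday' into weekday indexes (Monday = 0)."""
    first, _, last = spec.partition(":")
    start = WEEKDAYS.index(first)
    end = WEEKDAYS.index(last) if last else start
    return set(range(start, end + 1))


def parse_hhmm(text: str) -> int:
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)  # minutes since midnight, '24:00' is 1440


def interval_contains(interval: dict, moment: datetime) -> bool:
    days = set()
    for spec in interval.get("weekdays", ["monday:sunday"]):
        days |= expand_weekdays(spec)
    if moment.weekday() not in days:
        return False
    ranges = interval.get("times")
    if not ranges:
        return True  # no time ranges listed: the whole day is covered
    now = moment.hour * 60 + moment.minute
    return any(parse_hhmm(r["start_time"]) <= now < parse_hhmm(r["end_time"]) for r in ranges)


def is_active(intervals: list, moment: datetime) -> bool:
    """A named time interval is active if any of its entries contains the moment."""
    return any(interval_contains(i, moment) for i in intervals)


business_hours = [{"weekdays": ["monday:friday"],
                   "times": [{"start_time": "09:00", "end_time": "17:00"}]}]
after_hours = [
    {"weekdays": ["monday:friday"],
     "times": [{"start_time": "00:00", "end_time": "09:00"},
               {"start_time": "17:00", "end_time": "24:00"}]},
    {"weekdays": ["saturday", "sunday"]},
]

monday_morning = datetime(2024, 1, 1, 10, 0)   # 1 January 2024 was a Monday
monday_evening = datetime(2024, 1, 1, 20, 0)
print(is_active(business_hours, monday_morning), is_active(after_hours, monday_morning))
print(is_active(business_hours, monday_evening), is_active(after_hours, monday_evening))
```

With complementary ranges like these, exactly one of the two intervals is active at any moment, which is the property the business-hours and after-hours routes rely on.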

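4.3 Alert Grouping Strategies

How Grouping Works

Grouping aggregates related alerts into a single notification, which helps prevent alert storms. Alerts that resolve to the same route and share the same values for the labels listed in group_by belong to the same group.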

flowchart LR
    A[Alert 1: alertname=HighCPU, instance=web1] --> D[Group 1]
    B[Alert 2: alertname=HighCPU, instance=web2] --> D
    C[Alert 3: alertname=HighCPU, instance=web3] --> D
    E[Alert 4: alertname=HighMemory, instance=web1] --> F[Group 2]
    G[Alert 5: alertname=HighMemory, instance=web2] --> F
    
    D --> H[group_by: alertname]
    F --> H
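
The grouping rule in the diagram amounts to building a key from the group_by label values. The Python sketch below is a minimal model of that idea. `group_alerts` is a made-up name, and the handling of the special values follows the documented behavior as understood here: `['...']` groups by all labels, which effectively disables aggregation, while an empty list puts every alert of the route into one group.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        if group_by == ["..."]:
            key = tuple(sorted(labels.items()))  # every label is part of the key
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# The five alerts from the diagram.
alerts = [
    {"alertname": "HighCPU", "instance": "web1"},
    {"alertname": "HighCPU", "instance": "web2"},
    {"alertname": "HighCPU", "instance": "web3"},
    {"alertname": "HighMemory", "instance": "web1"},
    {"alertname": "HighMemory", "instance": "web2"},
]

for group_by in (["alertname"], ["alertname", "instance"], ["..."], []):
    sizes = sorted(len(g) for g in group_alerts(alerts, group_by).values())
    print(group_by, "->", sizes)
```

Each group produces its own notification, so the number of keys is a direct estimate of how many messages a burst of alerts will generate.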

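Basic Grouping Configuration

# Group by alert name
route:
  group_by: ['alertname']
  receiver: 'default'

# Group by alert name and cluster
route:
  group_by: ['alertname', 'cluster']
  receiver: 'default'

# Group by alert name, cluster, and service
route:
  group_by: ['alertname', 'cluster', 'service']
  receiver: 'default'

# Put all alerts into a single group (empty label list on the root route)
route:
  group_by: []
  receiver: 'default'

# Disable aggregation: group by all labels, so each distinct alert is sent on its own
route:
  group_by: ['...']
  receiver: 'default'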

Advanced Grouping Strategies

1. Grouping by service layer

route:
  receiver: 'default'
  routes:
  # Infrastructure-layer alerts are grouped per node
  - match:
      layer: infrastructure
    receiver: 'infra-team'
    group_by: ['alertname', 'instance']
    group_wait: 30s  # wait for alerts from more nodes
    group_interval: 5m
  
  # Application-layer alerts are grouped per service
  - match:
      layer: application
    receiver: 'app-team'
    group_by: ['alertname', 'service']
    group_wait: 10s
    group_interval: 2m
  
  # Database alerts are grouped per database instance
  - match:
      layer: database
    receiver: 'dba-team'
    group_by: ['alertname', 'database', 'instance']
    group_wait: 15s
    group_interval: 3m

2. Grouping by severity

route:
  receiver: 'default'
  routes:
  # Critical alerts: group quickly
  - match:
      severity: critical
    receiver: 'critical-team'
    group_by: ['alertname']
    group_wait: 5s   # fast response
    group_interval: 1m
    repeat_interval: 15m
  
  # Warning alerts: delay grouping
  - match:
      severity: warning
    receiver: 'warning-team'
    group_by: ['alertname', 'service']
    group_wait: 2m   # wait for more related alerts
    group_interval: 10m
    repeat_interval: 2h
  
  # Informational alerts: batch in large groups
  - match:
      severity: info
    receiver: 'info-team'
    group_by: ['alertname']
    group_wait: 10m  # long wait
    group_interval: 1h
    repeat_interval: 24h

3. Dynamic grouping

route:
  receiver: 'default'
  routes:
  # Group dynamically by team
  - matchers:
    - team =~ ".+"  # alerts that carry a non-empty team label
    receiver: 'team-router'
    group_by: ['team', 'alertname']
    routes:
    - match:
        team: web
      receiver: 'web-team'
      group_by: ['alertname', 'service', 'environment']
    - match:
        team: mobile
      receiver: 'mobile-team'
      group_by: ['alertname', 'platform', 'version']
    - match:
        team: data
      receiver: 'data-team'
      group_by: ['alertname', 'pipeline', 'stage']
  
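  # Group by location. Reached only by alerts without a team label,
  # because the team route above does not set continue.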
  - matchers:
    - datacenter =~ ".+"
    receiver: 'regional-router'
    group_by: ['datacenter', 'alertname']
    routes:
    - match:
        datacenter: us-west
      receiver: 'west-team'
    - match:
        datacenter: us-east
      receiver: 'east-team'
    - match:
        datacenter: eu-central
      receiver: 'eu-team'

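Group Timing Configuration

Timing Parameters in Detail

route:
  # group_wait: how long to buffer alerts of a newly created group before the first notification
  # Use case: let alerts that fire almost simultaneously arrive in one notification
  group_wait: 10s
  
  # group_interval: how long to wait before notifying about changes to a group that has already been notified
  # Use case: limit how often updates for the same group are sent
  group_interval: 5m
  
  # repeat_interval: how long to wait before re-sending a notification for a group that is unchanged and still firing
  # Use case: periodic reminders about unresolved alerts
  repeat_interval: 1h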
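
How the three timers interact can be sketched as a small simulation. This is a deliberately simplified model, not Alertmanager's implementation: it assumes the group is flushed once after group_wait and then every group_interval, and that a flush sends a notification only if the set of alerts changed or repeat_interval has elapsed since the last send. Resolved alerts are ignored, and `notification_times` is a made-up name.

```python
def notification_times(group_wait, group_interval, repeat_interval, arrivals, horizon):
    """Return the times (in seconds) at which one alert group would notify.

    arrivals: times at which alerts join the group; the earliest one creates it.
    """
    created = min(arrivals)
    sent_at = []
    last_sent_set = None
    tick = created + group_wait
    while tick <= horizon:
        current = frozenset(t for t in arrivals if t <= tick)
        changed = current != last_sent_set
        overdue = bool(sent_at) and tick - sent_at[-1] >= repeat_interval
        if changed or overdue:
            sent_at.append(tick)
            last_sent_set = current
        tick += group_interval
    return sent_at


# group_wait=10s, group_interval=5m, repeat_interval=1h, as in the block above.
print(notification_times(10, 300, 3600, arrivals=[0, 5], horizon=4000))
# [10, 3610]
print(notification_times(10, 300, 3600, arrivals=[0, 60], horizon=4000))
# [10, 310, 3910]
```

In the first run both alerts arrive within group_wait and share the first notification. In the second run the late alert has to wait for the next group_interval flush, and the repeat reminder is then counted from that later notification.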

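Timing Settings for Different Scenarios

route:
  receiver: 'default'
  routes:
  # Critical alerts: fast response
  - match:
      severity: critical
    receiver: 'critical-team'
    group_wait: 5s      # group quickly
    group_interval: 1m  # frequent updates
    repeat_interval: 15m # frequent reminders
  
  # Warning alerts: balanced response
  - match:
      severity: warning
    receiver: 'warning-team'
    group_wait: 30s     # moderate wait
    group_interval: 5m  # moderate update frequency
    repeat_interval: 1h # moderate reminders
  
  # Informational alerts: batch processing
  - match:
      severity: info
    receiver: 'info-team'
    group_wait: 5m      # long wait
    group_interval: 30m # infrequent updates
    repeat_interval: 12h # infrequent reminders
  
  # Test environment: low priority.
  # Reached only by alerts whose severity matched none of the routes above.
  - match:
      environment: test
    receiver: 'test-team'
    group_wait: 2m
    group_interval: 15m
    repeat_interval: 6h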

4.4 Testing and Debugging Routes

Testing Routes with amtool

# Basic route test: prints the receiver(s) the label set resolves to
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=HighCPUUsage severity=critical team=web

# Test with more labels
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=DatabaseDown severity=critical team=database instance=db-01

# Test a route that uses regex matching
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=ServiceError service=api-gateway environment=production

# Show the routing tree
amtool config routes show --config.file=alertmanager.yml

Route Test Script

#!/bin/bash
# route-test.sh

CONFIG_FILE="alertmanager.yml"

echo "=== Alertmanager route tests ==="

# Test cases: each entry is a space-separated label set
declare -a test_cases=(
    "alertname=HighCPUUsage severity=critical team=web environment=production"
    "alertname=DatabaseDown severity=critical team=database instance=db-01"
    "alertname=ServiceError severity=warning service=api-gateway"
    "alertname=NodeDown severity=critical instance=node-01 datacenter=us-west"
    "alertname=DiskFull severity=warning instance=storage-01 team=infrastructure"
    "alertname=SecurityAlert severity=critical category=security source=firewall"
    "alertname=BackupFailed severity=warning service=backup team=data"
    "alertname=HighMemoryUsage severity=warning instance=app-01 environment=staging"
)

# Run the tests
for i in "${!test_cases[@]}"; do
    # printf is used because plain echo does not expand \n in bash
    printf '\n--- Test %d: %s ---\n' "$((i+1))" "${test_cases[i]}"
    
    # ${test_cases[i]} is deliberately unquoted so that each label=value
    # pair is passed to amtool as a separate argument.
    if result=$(amtool config routes test --config.file="$CONFIG_FILE" ${test_cases[i]} 2>&1); then
        echo "✅ Route resolved to receiver(s):"
        echo "$result"
    else
        echo "❌ amtool failed (for example, the configuration could not be loaded)"
        echo "$result"
    fi
done

printf '\n=== Route tree ===\n'
amtool config routes show --config.file="$CONFIG_FILE"

printf '\n=== Tests complete ===\n'
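
A zero exit status from the route test only shows that amtool could load the configuration and resolve the label set; it does not show that the alert reached the intended receiver. The Python sketch below builds a command line that also asserts the expected receivers. It assumes that the installed amtool supports the `--verify.receivers` flag of `amtool config routes test`; the helper name `amtool_route_test` and the expected receiver are illustrative.

```python
import shlex


def amtool_route_test(config_file: str, labels: dict, expected: list) -> list:
    """Build the argv for `amtool config routes test` with a receiver assertion."""
    argv = ["amtool", "config", "routes", "test", f"--config.file={config_file}"]
    if expected:
        # Assumed flag: makes amtool fail when the resolved receivers differ.
        argv.append("--verify.receivers=" + ",".join(expected))
    argv.extend(f"{name}={value}" for name, value in labels.items())
    return argv


cmd = amtool_route_test(
    "alertmanager.yml",
    {"alertname": "DatabaseDown", "severity": "critical", "team": "database"},
    expected=["dba-critical"],
)
print(shlex.join(cmd))
```

To execute the check, pass `cmd` to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` when amtool exits with a non-zero status.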

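Route Debugging Tips

1. Verify routes with labels

# Add a debugging label to the configuration
route:
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical-team'
    # Grouping by a debug label makes it visible in the group key
    group_by: ['alertname', 'debug_route']
    routes:
    - match:
        team: database
      receiver: 'dba-critical'
      # Send a test alert with debug_route=dba-critical to verify this route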

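2. Route log analysis

# Routing details are mostly logged at debug level, so start Alertmanager
# with --log.level=debug before relying on these filters.

# Search the Alertmanager logs for routing information
docker logs alertmanager 2>&1 | grep -i route

# Look at how a specific alert was handled
docker logs alertmanager 2>&1 | grep "HighCPUUsage"

# Follow routing activity in real time
docker logs -f alertmanager 2>&1 | grep -E "(route|receiver|group)"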

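3. API debugging

# The v1 API has been removed from current Alertmanager releases; use v2.

# List the currently active alert groups
curl -s http://localhost:9093/api/v2/alerts/groups | jq .

# List the alert groups for a specific receiver (the value is a regex)
curl -s "http://localhost:9093/api/v2/alerts/groups?receiver=critical-team" | jq .

# Show the loaded configuration, including the routing tree
curl -s http://localhost:9093/api/v2/status | jq -r .config.original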
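
The JSON returned by the alert-groups endpoint can be summarized with a short script to see at a glance which receiver got how many alerts. The sketch below assumes the v2 API (`/api/v2/alerts/groups`) and its response layout, where each group carries a `receiver.name` and a list of `alerts`. The sample payload is hypothetical and heavily trimmed, and the helper names are made up for this illustration.

```python
import json
from collections import Counter
from urllib.parse import urlencode


def groups_url(base: str, receiver: str = "") -> str:
    """Build the alert-groups URL, optionally filtered by receiver."""
    query = "?" + urlencode({"receiver": receiver}) if receiver else ""
    return f"{base}/api/v2/alerts/groups{query}"


def alerts_per_receiver(groups: list) -> Counter:
    """Count alerts per receiver name in a decoded alert-groups response."""
    counts = Counter()
    for group in groups:
        counts[group["receiver"]["name"]] += len(group["alerts"])
    return counts


# Hypothetical response body, for illustration only.
sample = json.loads("""
[
  {"labels": {"alertname": "HighCPU"}, "receiver": {"name": "critical-team"},
   "alerts": [{"labels": {"instance": "web1"}}, {"labels": {"instance": "web2"}}]},
  {"labels": {"alertname": "DiskFull"}, "receiver": {"name": "default"},
   "alerts": [{"labels": {"instance": "db1"}}]}
]
""")

print(groups_url("http://localhost:9093", "critical-team"))
print(dict(alerts_per_receiver(sample)))
```

A large count under the default receiver is often a sign that some alerts are missing the labels the routes expect.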

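4.5 Routing Performance Optimization

Route Matching Optimization

1. Optimize route order

# Before: a broad route placed first
route:
  receiver: 'default'
  routes:
  # The broad regex matches every alert, and continue: true forces
  # every alert to be checked against the remaining routes as well
  - match_re:
      alertname: '.*'
    receiver: 'catch-all'
    continue: true
  
  # Specific route placed last
  - match:
      severity: critical
    receiver: 'critical-team'

# After: specific routes first
# Note: this also changes behavior. Critical alerts now go only to
# 'critical-team' instead of to both receivers.
route:
  receiver: 'default'
  routes:
  # Specific route first: matching stops here for critical alerts
  - match:
      severity: critical
    receiver: 'critical-team'
  
  # Broad route last
  - match_re:
      alertname: '.*'
    receiver: 'catch-all'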

2. Reduce route depth

# Before: routes nested too deeply
route:
  receiver: 'default'
  routes:
  - match:
      environment: production
    routes:
    - match:
        severity: critical
      routes:
      - match:
          team: database
        routes:
        - match:
            db_type: mysql
          receiver: 'mysql-critical'

# After: flattened route structure
route:
  receiver: 'default'
  routes:
  - matchers:
    - environment = "production"
    - severity = "critical"
    - team = "database"
    - db_type = "mysql"
    receiver: 'mysql-critical'
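
The flattening step can be automated for simple trees. The Python sketch below walks a nested route that uses only `match` and emits one flat route per leaf with the combined matchers. The `flatten` helper and the dictionary layout are assumptions for this illustration. It also assumes that intermediate routes are pure filters: any receiver or setting defined on an intermediate level is dropped, so the result is equivalent only when those levels add nothing beyond their match conditions.

```python
def flatten(route: dict, inherited=None) -> list:
    """Turn nested `match` routes into flat routes with combined matchers."""
    inherited = list(inherited or [])
    matchers = inherited + [f'{k} = "{v}"' for k, v in route.get("match", {}).items()]
    children = route.get("routes", [])
    if not children:
        return [{"matchers": matchers, "receiver": route["receiver"]}]
    flat = []
    for child in children:
        flat.extend(flatten(child, matchers))
    return flat


# The deeply nested route from the example above.
nested = {
    "match": {"environment": "production"},
    "routes": [{
        "match": {"severity": "critical"},
        "routes": [{
            "match": {"team": "database"},
            "routes": [{"match": {"db_type": "mysql"}, "receiver": "mysql-critical"}],
        }],
    }],
}

for flat_route in flatten(nested):
    print(flat_route)
```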

3. Use unified matchers

# Before: separate match and match_re blocks
route:
  receiver: 'default'
  routes:
  - match:
      severity: critical
    match_re:
      team: '^(web|api|mobile)$'
      environment: '^(prod|staging)$'
    receiver: 'app-critical'

# After: a single matchers block
route:
  receiver: 'default'
  routes:
  - matchers:
    - severity = "critical"
    - team =~ "^(web|api|mobile)$"
    - environment =~ "^(prod|staging)$"
    receiver: 'app-critical'

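Grouping Performance Optimization

1. Choose a sensible grouping strategy

# Avoid over-grouping
route:
  receiver: 'default'
  # Over-grouped: nearly every alert ends up in its own group
  # group_by: ['alertname', 'instance', 'job', 'severity', 'team']
  
  # Reasonable: group by the core dimensions only
  group_by: ['alertname', 'cluster']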

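2. Tune the timing settings

route:
  receiver: 'default'
  routes:
  # High-frequency alerts: short wait times
  # (routes without a receiver inherit the parent's receiver)
  - match:
      frequency: high
    group_wait: 5s
    group_interval: 30s
    repeat_interval: 5m
  
  # Low-frequency alerts: long wait times
  - match:
      frequency: low
    group_wait: 2m
    group_interval: 10m
    repeat_interval: 1h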

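4.6 Case Studies

Case 1: Routing Configuration for a Microservices Architecture

# Routing configuration for a complex microservices environment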
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  
  routes:
  # Infrastructure layer
  - matchers:
    - layer = "infrastructure"
    receiver: 'infra-team'
    group_by: ['alertname', 'datacenter']
    routes:
    # Network alerts
    - matchers:
      - component = "network"
      - severity = "critical"
      receiver: 'network-oncall'
      group_wait: 0s
    
    # Storage alerts
    - matchers:
      - component = "storage"
      receiver: 'storage-team'
      group_by: ['alertname', 'volume']
    
    # Compute alerts
    - matchers:
      - component = "compute"
      receiver: 'compute-team'
      group_by: ['alertname', 'node']
  
  # Platform layer
  - matchers:
    - layer = "platform"
    receiver: 'platform-team'
    group_by: ['alertname', 'service']
    routes:
    # Kubernetes alerts
    - matchers:
      - platform = "kubernetes"
      receiver: 'k8s-team'
      group_by: ['alertname', 'namespace']
    
    # Database alerts
    - matchers:
      - platform = "database"
      receiver: 'dba-team'
      group_by: ['alertname', 'database']
      routes:
      - matchers:
        - db_type = "mysql"
        - severity = "critical"
        receiver: 'mysql-oncall'
      - matchers:
        - db_type = "postgresql"
        - severity = "critical"
        receiver: 'postgres-oncall'
    
    # Message queue alerts
    - matchers:
      - platform = "messaging"
      receiver: 'messaging-team'
      group_by: ['alertname', 'queue']
  
  # Application layer
  - matchers:
    - layer = "application"
    receiver: 'app-team'
    group_by: ['alertname', 'service']
    routes:
    # Frontend applications
    - matchers:
      - app_type = "frontend"
      receiver: 'frontend-team'
      group_by: ['alertname', 'service', 'environment']
    
    # Backend APIs
    - matchers:
      - app_type = "backend"
      receiver: 'backend-team'
      group_by: ['alertname', 'service', 'version']
      routes:
      # User service
      - matchers:
        - service = "user-service"
        receiver: 'user-team'
      # Order service
      - matchers:
        - service = "order-service"
        receiver: 'order-team'
      # Payment service
      - matchers:
        - service = "payment-service"
        receiver: 'payment-team'
        group_wait: 0s  # send payment alerts immediately
    
    # Mobile applications
    - matchers:
      - app_type = "mobile"
      receiver: 'mobile-team'
      group_by: ['alertname', 'platform', 'version']
  
  # Business layer
  - matchers:
    - layer = "business"
    receiver: 'business-team'
    group_by: ['alertname', 'business_unit']
    routes:
    # Business-critical metrics
    - matchers:
      - metric_type = "business_critical"
      receiver: 'business-critical'
      group_wait: 0s
      repeat_interval: 30m
    
    # User-experience metrics
    - matchers:
      - metric_type = "user_experience"
      receiver: 'ux-team'
      group_by: ['alertname', 'user_segment']
  
  # Security alerts
  - matchers:
    - category = "security"
    receiver: 'security-team'
    group_wait: 0s
    group_interval: 1m
    repeat_interval: 15m
    routes:
    # High-severity security incidents
    - matchers:
      - severity = "critical"
      - threat_level = "high"
      receiver: 'security-incident'
      continue: true
    
    # Compliance alerts
    - matchers:
      - compliance = "required"
      receiver: 'compliance-team'
  
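  # Environment-specific routes. The layer and security routes above do not set
  # continue, so only alerts that matched none of them reach these routes.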
  - matchers:
    - environment = "production"
    receiver: 'prod-oncall'
    continue: true
    group_wait: 5s
    repeat_interval: 30m
  
  - matchers:
    - environment = "staging"
    receiver: 'staging-team'
    continue: true
    repeat_interval: 2h
  
  - matchers:
    - environment = "development"
    receiver: 'dev-team'
    repeat_interval: 6h

Case 2: Routing Configuration for a Multi-Tenant Environment

# Routing configuration for a multi-tenant SaaS platform
route:
  group_by: ['tenant', 'alertname']
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'platform-default'
  
  routes:
  # Platform-level alerts
  - matchers:
    - scope = "platform"
    receiver: 'platform-team'
    group_by: ['alertname', 'component']
    routes:
    # Critical platform alerts
    - matchers:
      - severity = "critical"
      receiver: 'platform-oncall'
      group_wait: 0s
      repeat_interval: 15m
      continue: true
    
    # Platform capacity alerts
    - matchers:
      - alert_type = "capacity"
      receiver: 'capacity-team'
      group_by: ['alertname', 'resource_type']
  
  # Tenant-specific alerts
  - matchers:
    - scope = "tenant"
    receiver: 'tenant-router'
    group_by: ['tenant', 'alertname']
    routes:
    # Enterprise customers (high priority)
    - matchers:
      - tenant_tier = "enterprise"
      receiver: 'enterprise-support'
      group_wait: 5s
      repeat_interval: 30m
      routes:
      # Critical alerts for enterprise customers
      - matchers:
        - severity = "critical"
        receiver: 'enterprise-oncall'
        group_wait: 0s
        repeat_interval: 10m
    
    # Professional customers (medium priority)
    - matchers:
      - tenant_tier = "professional"
      receiver: 'professional-support'
      group_wait: 10s
      repeat_interval: 1h
    
    # Basic customers (low priority)
    - matchers:
      - tenant_tier = "basic"
      receiver: 'basic-support'
      group_wait: 30s
      repeat_interval: 4h
    
    # Trial customers (lowest priority)
    - matchers:
      - tenant_tier = "trial"
      receiver: 'trial-support'
      group_wait: 2m
      repeat_interval: 12h
  
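  # Route by geographic region. Reached only by alerts that matched neither scope route above.
  # The *-business-hours intervals referenced below must be defined under time_intervals.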
  - matchers:
    - region =~ ".+"
    receiver: 'regional-router'
    continue: true
    routes:
    - matchers:
      - region = "us-west"
      receiver: 'us-west-team'
      active_time_intervals:
      - 'us-west-business-hours'
    
    - matchers:
      - region = "us-east"
      receiver: 'us-east-team'
      active_time_intervals:
      - 'us-east-business-hours'
    
    - matchers:
      - region = "eu-central"
      receiver: 'eu-team'
      active_time_intervals:
      - 'eu-business-hours'
    
    - matchers:
      - region = "asia-pacific"
      receiver: 'apac-team'
      active_time_intervals:
      - 'apac-business-hours'
  
  # SLA-related alerts
  - matchers:
    - alert_type = "sla"
    receiver: 'sla-team'
    group_by: ['tenant', 'sla_metric']
    routes:
    # SLA violation alerts
    - matchers:
      - sla_status = "violated"
      receiver: 'sla-violation'
      group_wait: 0s
      repeat_interval: 15m
      continue: true
    
    # SLA at-risk alerts
    - matchers:
      - sla_status = "at_risk"
      receiver: 'sla-risk'
      group_wait: 5s
      repeat_interval: 30m

Case 3: Routing Configuration for the Financial Industry

# Strict routing configuration for a financial-services environment
route:
  group_by: ['system', 'alertname']
  group_wait: 5s
  group_interval: 2m
  repeat_interval: 30m
  receiver: 'financial-default'
  
  routes:
  # Trading system alerts (highest priority)
  - matchers:
    - system = "trading"
    receiver: 'trading-team'
    group_wait: 0s
    group_interval: 30s
    repeat_interval: 5m
    routes:
    # Trade execution alerts
    - matchers:
      - component = "execution"
      receiver: 'trading-execution'
      continue: true
    
    # Risk management alerts
    - matchers:
      - component = "risk"
      receiver: 'risk-management'
      continue: true
    
    # Market data alerts
    - matchers:
      - component = "market_data"
      receiver: 'market-data-team'
  
  # Payment system alerts
  - matchers:
    - system = "payment"
    receiver: 'payment-team'
    group_wait: 2s
    repeat_interval: 10m
    routes:
    # Critical payment-processing alerts
    - matchers:
      - severity = "critical"
      receiver: 'payment-critical'
      continue: true
    
    # Fraud detection alerts
    - matchers:
      - component = "fraud_detection"
      receiver: 'fraud-team'
      group_wait: 0s
  
  # Core banking system alerts
  - matchers:
    - system = "core_banking"
    receiver: 'core-banking-team'
    group_wait: 1s
    repeat_interval: 15m
    routes:
    # Account service alerts
    - matchers:
      - service = "account"
      receiver: 'account-team'
    
    # Lending service alerts
    - matchers:
      - service = "lending"
      receiver: 'lending-team'
  
  # Compliance and regulatory alerts
  - matchers:
    - category = "compliance"
    receiver: 'compliance-team'
    group_wait: 0s
    repeat_interval: 5m
    routes:
    # Regulatory reporting alerts
    - matchers:
      - report_type = "regulatory"
      receiver: 'regulatory-team'
      continue: true
    
    # AML alerts
    - matchers:
      - compliance_type = "aml"
      receiver: 'aml-team'
      continue: true
  
  # Security alerts (financial grade)
  - matchers:
    - category = "security"
    receiver: 'financial-security'
    group_wait: 0s
    group_interval: 30s
    repeat_interval: 5m
    routes:
    # Network security incidents
    - matchers:
      - security_type = "network"
      receiver: 'network-security'
      continue: true
    
    # Data breach alerts
    - matchers:
      - security_type = "data_breach"
      receiver: 'data-security'
      continue: true
    
    # Authentication alerts
    - matchers:
      - security_type = "authentication"
      receiver: 'identity-team'
  
  # Business continuity alerts
  - matchers:
    - category = "business_continuity"
    receiver: 'bcp-team'
    group_wait: 0s
    repeat_interval: 10m
    routes:
    # Disaster recovery alerts
    - matchers:
      - event_type = "disaster_recovery"
      receiver: 'dr-team'
      continue: true
    
    # Backup system alerts
    - matchers:
      - component = "backup"
      receiver: 'backup-team'
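
Several child routes in this case set continue: true. That flag only allows later siblings to be evaluated, so it has no effect on the last route of a sibling list. The Python sketch below flags such routes so they can be reviewed. `trailing_continues` and the dictionary layout are hypothetical, and the excerpt mirrors the compliance branch above.

```python
def trailing_continues(route: dict, path: str = "root") -> list:
    """List child routes that set continue: true but have no later sibling."""
    findings = []
    children = route.get("routes", [])
    for index, child in enumerate(children):
        name = f"{path} > {child.get('receiver', '?')}"
        if child.get("continue") and index == len(children) - 1:
            findings.append(name)
        findings.extend(trailing_continues(child, name))
    return findings


# Excerpt of the compliance branch from the configuration above.
config = {
    "receiver": "financial-default",
    "routes": [{
        "receiver": "compliance-team",
        "routes": [
            {"receiver": "regulatory-team", "continue": True},
            {"receiver": "aml-team", "continue": True},
        ],
    }],
}

print(trailing_continues(config))
# ['root > compliance-team > aml-team']
```

A finding is not an error, since the configuration still loads, but it usually means that either the flag can be removed or a sibling route that was meant to follow is missing.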

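Chapter Summary

This chapter covered Alertmanager's routing and grouping system in depth:

Core Concepts

  1. Routing system: a tree-structured mechanism for dispatching alerts
  2. Matchers: exact match, regex match, and the unified matchers syntax
  3. Grouping strategy: label-based aggregation of alerts
  4. Timing settings: control when and how often notifications are sent

Configuration Essentials

  1. Route design: a sensible hierarchy and matching order
  2. Grouping strategy: balance alert aggregation against response speed
  3. Timing settings: adjust the timing parameters to the severity
  4. Performance optimization: reduce matching complexity and route depth

Best Practices

  1. Test-driven: verify the routing configuration with amtool
  2. Incremental refinement: start with a simple configuration and improve it step by step
  3. Monitoring and debugging: keep observing how routing performs and behaves
  4. Documentation: record routing design decisions and changes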

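Next Steps

In the next chapter, we will look at notification channel configuration in detail, including:

  - Advanced email notification configuration
  - Slack integration and customization
  - Webhook development and integration
  - Third-party service integrations


Next chapter: Notification Channel Configuration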