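4.1 Routing System Overview
How Routing Works
Alertmanager's routing system is a tree of routes that decides which receiver each alert is sent to. An alert enters at the root route and descends into the child routes it matches:
flowchart TD
    A[Alert received] --> B[Match against root route]
    B --> C{Child route matches?}
    C -->|Yes| D[Handle in child route]
    C -->|No| E[Use current route's receiver]
    D --> F{continue=true?}
    F -->|Yes| G[Keep matching sibling routes]
    F -->|No| H[Stop matching]
    G --> I[Send to multiple receivers]
    H --> J[Send to a single receiver]
    E --> J
    I --> K[Group alerts]
    J --> K
    K --> L[Apply timing settings]
    L --> M[Send notification]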
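Route Matching Order
# Route matching example: sibling routes are evaluated top to bottom
route:
  receiver: 'default'            # default receiver
  group_by: ['alertname']
  routes:
    # 1. Critical alerts are checked first
    - match:
        severity: critical
      receiver: 'critical-team'
      routes:
        # 1.1 Database alerts among the critical ones
        - match:
            team: database
          receiver: 'dba-critical'
        # 1.2 Infrastructure alerts among the critical ones
        - match:
            team: infrastructure
          receiver: 'infra-critical'
    # 2. Team alerts are checked next
    - match:
        team: web
      receiver: 'web-team'
      continue: true             # keep matching the sibling routes that follow
    # 3. Service alerts are checked last
    - match_re:
        service: '^(api|frontend).*'
      receiver: 'service-team'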
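The matching order can be made concrete with a small simulation. The following is a minimal Python sketch, not Alertmanager's code: the `Route` class and `resolve` function are names invented here, and the tree mirrors the example above. It assumes the documented behavior that an alert is delivered to the deepest matching routes, that a parent's receiver is used only when none of its children match, and that a missing label is treated as an empty string.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Route:
    """Hypothetical stand-in for one node of the routing tree."""
    receiver: str
    match: dict = field(default_factory=dict)     # exact label matches
    match_re: dict = field(default_factory=dict)  # regex matches, fully anchored
    cont: bool = False                            # the `continue` flag
    routes: list = field(default_factory=list)

    def matches(self, labels):
        exact = all(labels.get(k, "") == v for k, v in self.match.items())
        regex = all(re.fullmatch(v, labels.get(k, ""))
                    for k, v in self.match_re.items())
        return exact and regex


def resolve(route, labels):
    """Return the receivers of the deepest matching routes, in order."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        if hit and not child.cont:
            break  # first matching child wins unless it sets continue
    return found or [route.receiver]


tree = Route("default", routes=[
    Route("critical-team", match={"severity": "critical"}, routes=[
        Route("dba-critical", match={"team": "database"}),
        Route("infra-critical", match={"team": "infrastructure"}),
    ]),
    Route("web-team", match={"team": "web"}, cont=True),
    Route("service-team", match_re={"service": "^(api|frontend).*"}),
])

print(resolve(tree, {"severity": "critical", "team": "database"}))
# ['dba-critical']
print(resolve(tree, {"severity": "critical", "team": "web"}))
# ['critical-team']
print(resolve(tree, {"severity": "warning", "team": "web", "service": "api-gateway"}))
# ['web-team', 'service-team']
print(resolve(tree, {"severity": "info"}))
# ['default']
```

The second call shows a common surprise: a critical alert from the web team stops at `critical-team` and never reaches `web-team`, because the first matching sibling does not set `continue`.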
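4.2 Route Configuration in Detail
Basic Route Configuration
route:
  # Grouping labels: alerts with equal values for these labels share a group
  group_by: ['alertname', 'cluster', 'service']
  # Timing
  group_wait: 10s        # wait before the first notification for a new group
  group_interval: 10s    # wait before notifying about new alerts in an already-notified group
  repeat_interval: 1h    # wait before re-sending a notification that was already sent
  # Default receiver
  receiver: 'default'
  # Whether to keep matching sibling routes; only has an effect on child routes
  continue: false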
Matcher Types
1. Exact match (match)
routes:
  - match:
      severity: critical          # severity must equal critical
      alertname: HighCPUUsage     # alertname must equal HighCPUUsage
      team: database              # team must equal database
    receiver: 'db-critical'
2. Regex match (match_re)
routes:
  - match_re:
      instance: '^prod-.*'              # instance starts with prod-
      service: '(web|api|frontend)'     # service is exactly web, api, or frontend (regexes are fully anchored)
      alertname: '.*CPU.*'              # alertname contains CPU
    receiver: 'prod-team'
3. New-style matchers (matchers), recommended; match and match_re are deprecated in their favor
routes:
  - matchers:
      - alertname = "HighCPUUsage"        # equals
      - severity =~ "warning|critical"    # matches regex
      - instance !~ "test.*"              # does not match regex
      - team != "development"             # does not equal
      - environment =~ "prod|staging"     # regex with several alternatives
    receiver: 'production-team'
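Because regex matchers are fully anchored, a pattern has to cover the whole label value. The sketch below approximates that with Python's `re.fullmatch`; Alertmanager itself uses Go's RE2 syntax, so this is an illustration of the anchoring rule only, and `am_regex_match` is a helper name made up for this example.

```python
import re


def am_regex_match(pattern, value):
    """Approximate an anchored `=~` matcher: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


print(am_regex_match("(web|api|frontend)", "api"))          # True
print(am_regex_match("(web|api|frontend)", "api-gateway"))  # False
print(am_regex_match(".*CPU.*", "HighCPUUsage"))            # True
print(am_regex_match("CPU", "HighCPUUsage"))                # False

# A missing label is compared as an empty string, so `instance !~ "test.*"`
# also matches alerts that carry no instance label at all.
print(not am_regex_match("test.*", ""))                     # True
```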
Advanced Route Configuration
Multi-Level Route Structure
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    # Level 1: split by severity
    - match:
        severity: critical
      receiver: 'critical-default'
      group_wait: 5s
      repeat_interval: 30m
      routes:
        # Level 2: split by team
        - match:
            team: database
          receiver: 'dba-critical'
          group_by: ['alertname', 'instance']
          routes:
            # Level 3: split by database type
            - match:
                db_type: mysql
              receiver: 'mysql-dba'
            - match:
                db_type: postgresql
              receiver: 'postgres-dba'
        - match:
            team: infrastructure
          receiver: 'infra-critical'
          group_by: ['alertname', 'datacenter']
          routes:
            - match:
                datacenter: us-west
              receiver: 'infra-west'
            - match:
                datacenter: us-east
              receiver: 'infra-east'
    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning-default'
      repeat_interval: 2h
      routes:
        - match:
            team: application
          receiver: 'app-warnings'
          group_by: ['alertname', 'service']
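Conditional Routing and continue
route:
  receiver: 'default'
  routes:
    # Every production alert goes to the ops team
    - match:
        environment: production
      receiver: 'ops-team'
      continue: true             # keep matching the routes below
    # Critical alerts are additionally sent to management
    - match:
        severity: critical
      receiver: 'management'
      continue: true
    # Database alerts go to the DBA team
    - match:
        team: database
      receiver: 'dba-team'
      # continue: false (the default): matching stops here, so a database
      # alert never reaches the security route below
    # Security alerts go to the security team
    - match:
        category: security
      receiver: 'security-team'
      group_wait: 0s             # send security alerts immediately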
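Time-Based Routing
# Times are evaluated in UTC unless an interval sets its own location
time_intervals:
  - name: 'business-hours'
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '17:00'
        weekdays: ['monday:friday']
  - name: 'after-hours'
    time_intervals:
      # A time range cannot wrap past midnight (end_time must be later than
      # start_time), so the weekday night is split into two ranges
      - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
        weekdays: ['monday:friday']
      # Weekends: leaving out times covers the whole day
      - weekdays: ['saturday', 'sunday']
route:
  receiver: 'default'
  routes:
    # Critical alerts during business hours
    - match:
        severity: critical
      receiver: 'oncall-business'
      active_time_intervals:
        - 'business-hours'
      group_wait: 5s
      repeat_interval: 15m
    # Critical alerts outside business hours
    - match:
        severity: critical
      receiver: 'oncall-afterhours'
      active_time_intervals:
        - 'after-hours'
      group_wait: 2s
      repeat_interval: 10m
    # Mute during the maintenance window
    - match:
        team: infrastructure
      receiver: 'infra-team'
      mute_time_intervals:
        # 'maintenance-window' must also be defined under time_intervals
        # (its definition is not shown in this example)
        - 'maintenance-window'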
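One way to sanity-check a pair of complementary intervals such as business hours and after hours is to confirm that every moment of the week falls into exactly one of them. The sketch below does this in Python under simplifying assumptions: only `times` and `weekdays` are modeled, `start_time` is treated as inclusive and `end_time` as exclusive, everything is in one time zone, and the data structures and the `is_active` helper are invented for this illustration.

```python
from datetime import datetime, timedelta

WEEKDAYS = {0, 1, 2, 3, 4}   # Monday..Friday, as returned by datetime.weekday()
WEEKEND = {5, 6}


def to_minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def is_active(moment, intervals):
    """True if the moment falls inside any of the given intervals."""
    now = moment.hour * 60 + moment.minute
    for interval in intervals:
        day_ok = moment.weekday() in interval["weekdays"]
        time_ok = any(to_minutes(start) <= now < to_minutes(end)
                      for start, end in interval["times"])
        if day_ok and time_ok:
            return True
    return False


business_hours = [{"weekdays": WEEKDAYS, "times": [("09:00", "17:00")]}]
after_hours = [
    # The weekday night is split because a range may not cross midnight
    {"weekdays": WEEKDAYS, "times": [("00:00", "09:00"), ("17:00", "24:00")]},
    {"weekdays": WEEKEND, "times": [("00:00", "24:00")]},
]

# Walk one full week (2024-01-01 is a Monday) in 30-minute steps and
# collect every moment that is covered by both intervals or by neither.
moment = datetime(2024, 1, 1, 0, 0)
problems = []
while moment < datetime(2024, 1, 8, 0, 0):
    if is_active(moment, business_hours) == is_active(moment, after_hours):
        problems.append(moment)
    moment += timedelta(minutes=30)

print(problems)  # []
```

An empty list means the two intervals neither overlap nor leave a gap, so each critical alert has exactly one active route at any time.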
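4.3 Alert Grouping Strategies
How Grouping Works
Alert grouping aggregates related alerts into a single notification to avoid alert storms. Groups are formed from the labels listed in group_by: alerts with identical values for those labels fall into the same group.
flowchart LR
    A[Alert 1: alertname=HighCPU, instance=web1] --> D[Group 1]
    B[Alert 2: alertname=HighCPU, instance=web2] --> D
    C[Alert 3: alertname=HighCPU, instance=web3] --> D
    E[Alert 4: alertname=HighMemory, instance=web1] --> F[Group 2]
    G[Alert 5: alertname=HighMemory, instance=web2] --> F
    D --> H[group_by: alertname]
    F --> H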
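Basic Grouping Configuration
# Group by alert name
route:
  group_by: ['alertname']
  receiver: 'default'
# Group by alert name and cluster
route:
  group_by: ['alertname', 'cluster']
  receiver: 'default'
# Group by alert name, cluster, and service
route:
  group_by: ['alertname', 'cluster', 'service']
  receiver: 'default'
# Put all alerts into a single group (no grouping labels)
route:
  group_by: []
  receiver: 'default'
# Disable aggregation: '...' groups by all labels, so each distinct alert is sent on its own
route:
  group_by: ['...']
  receiver: 'default'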
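The effect of each `group_by` choice can be checked on the five alerts from the diagram above. This is a simplified Python sketch: `group_alerts` is a name invented here, and it ignores that real grouping happens separately for each matched route.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        if group_by == ["..."]:
            # Special value: group by every label the alert carries
            key = tuple(sorted(labels.items()))
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighCPU", "instance": "web1"},
    {"alertname": "HighCPU", "instance": "web2"},
    {"alertname": "HighCPU", "instance": "web3"},
    {"alertname": "HighMemory", "instance": "web1"},
    {"alertname": "HighMemory", "instance": "web2"},
]

print(len(group_alerts(alerts, ["alertname"])))              # 2
print(len(group_alerts(alerts, ["alertname", "instance"])))  # 5
print(len(group_alerts(alerts, [])))                         # 1
print(len(group_alerts(alerts, ["..."])))                    # 5
```

An empty list yields one group for everything, while `'...'` yields one group per distinct alert, which is the opposite of what the two spellings might suggest at first glance.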
Advanced Grouping Strategies
1. Grouping by service layer
route:
  receiver: 'default'
  routes:
    # Infrastructure alerts are grouped per node
    - match:
        layer: infrastructure
      receiver: 'infra-team'
      group_by: ['alertname', 'instance']
      group_wait: 30s            # wait for alerts from more nodes
      group_interval: 5m
    # Application alerts are grouped per service
    - match:
        layer: application
      receiver: 'app-team'
      group_by: ['alertname', 'service']
      group_wait: 10s
      group_interval: 2m
    # Database alerts are grouped per database instance
    - match:
        layer: database
      receiver: 'dba-team'
      group_by: ['alertname', 'database', 'instance']
      group_wait: 15s
      group_interval: 3m
2. Grouping by severity
route:
  receiver: 'default'
  routes:
    # Critical alerts: group quickly
    - match:
        severity: critical
      receiver: 'critical-team'
      group_by: ['alertname']
      group_wait: 5s             # fast response
      group_interval: 1m
      repeat_interval: 15m
    # Warning alerts: delay grouping
    - match:
        severity: warning
      receiver: 'warning-team'
      group_by: ['alertname', 'service']
      group_wait: 2m             # wait for more related alerts
      group_interval: 10m
      repeat_interval: 2h
    # Info alerts: group in large batches
    - match:
        severity: info
      receiver: 'info-team'
      group_by: ['alertname']
      group_wait: 10m            # long wait
      group_interval: 1h
      repeat_interval: 24h
3. Dynamic grouping
route:
  receiver: 'default'
  routes:
    # Group per team
    - matchers:
        - team =~ ".+"           # alerts that carry a team label
      receiver: 'team-router'
      group_by: ['team', 'alertname']
      routes:
        - match:
            team: web
          receiver: 'web-team'
          group_by: ['alertname', 'service', 'environment']
        - match:
            team: mobile
          receiver: 'mobile-team'
          group_by: ['alertname', 'platform', 'version']
        - match:
            team: data
          receiver: 'data-team'
          group_by: ['alertname', 'pipeline', 'stage']
    # Group per location; only reached by alerts without a team label,
    # because the route above does not set continue
    - matchers:
        - datacenter =~ ".+"
      receiver: 'regional-router'
      group_by: ['datacenter', 'alertname']
      routes:
        - match:
            datacenter: us-west
          receiver: 'west-team'
        - match:
            datacenter: us-east
          receiver: 'east-team'
        - match:
            datacenter: eu-central
          receiver: 'eu-team'
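Grouping Timing Configuration
Timing Parameters in Detail
route:
  # group_wait: how long to buffer a new group before its first notification
  # Use: collect alerts that fire together into one message
  group_wait: 10s
  # group_interval: how long to wait before notifying about alerts added to a group
  # that has already been notified
  # Use: limit how often a single group produces updates
  group_interval: 5m
  # repeat_interval: how long to wait before re-sending a notification for alerts
  # that are still firing
  # Use: periodic reminders about unresolved alerts
  repeat_interval: 1h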
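How the three parameters interact is easier to see on a timeline. The sketch below is a deliberately simplified model written for this section, not Alertmanager's scheduler: it assumes a group is first flushed `group_wait` after its first alert, then re-evaluated every `group_interval`, and that a notification goes out at an evaluation only if new alerts joined the group or `repeat_interval` has passed since the last notification. Resolved alerts are ignored, and `notification_times` is a hypothetical helper.

```python
def notification_times(arrivals, group_wait, group_interval,
                       repeat_interval, horizon):
    """Return the seconds at which a single group would notify.

    arrivals: seconds at which new alerts join the group.
    horizon:  stop simulating after this many seconds.
    """
    arrivals = sorted(arrivals)
    sent = []
    notified = 0                      # alerts included in the last notification
    tick = arrivals[0] + group_wait   # first flush
    while tick <= horizon:
        present = sum(1 for t in arrivals if t <= tick)
        changed = present > notified
        repeat_due = bool(sent) and tick - sent[-1] >= repeat_interval
        if changed or repeat_due:
            sent.append(tick)
            notified = present
        tick += group_interval
    return sent


# group_wait=10s, group_interval=5m, repeat_interval=1h; alerts at 0s, 4s, 120s
print(notification_times([0, 4, 120], 10, 300, 3600, 4000))
# [10, 310, 3910]
```

In this model the first two alerts share the initial notification at 10s, the third waits for the next evaluation at 310s, and the reminder arrives at 3910s, one `repeat_interval` after the last notification.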
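Timing for Different Scenarios
route:
  receiver: 'default'
  routes:
    # Critical alerts: fast response
    - match:
        severity: critical
      receiver: 'critical-team'
      group_wait: 5s             # group quickly
      group_interval: 1m         # frequent updates
      repeat_interval: 15m       # frequent reminders
    # Warning alerts: balanced response
    - match:
        severity: warning
      receiver: 'warning-team'
      group_wait: 30s            # moderate wait
      group_interval: 5m         # moderate update rate
      repeat_interval: 1h        # moderate reminders
    # Info alerts: batch processing
    - match:
        severity: info
      receiver: 'info-team'
      group_wait: 5m             # long wait
      group_interval: 30m        # infrequent updates
      repeat_interval: 12h       # infrequent reminders
    # Test environment: low priority. Only reached by alerts whose severity
    # matched none of the routes above; move it first to catch all test alerts
    - match:
        environment: test
      receiver: 'test-team'
      group_wait: 2m
      group_interval: 15m
      repeat_interval: 6h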
4.4 Testing and Debugging Routes
Testing Routes with amtool
# Basic route test
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=HighCPUUsage severity=critical team=web
# Test with several labels
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=DatabaseDown severity=critical team=database instance=db-01
# Test a regex route
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname=ServiceError service=api-gateway environment=production
# Print the routing tree
amtool config routes show --config.file=alertmanager.yml
Route Test Script
#!/bin/bash
# route-test.sh
CONFIG_FILE="alertmanager.yml"
echo "=== Alertmanager route tests ==="
# Test cases: one space-separated label set per entry
declare -a test_cases=(
  "alertname=HighCPUUsage severity=critical team=web environment=production"
  "alertname=DatabaseDown severity=critical team=database instance=db-01"
  "alertname=ServiceError severity=warning service=api-gateway"
  "alertname=NodeDown severity=critical instance=node-01 datacenter=us-west"
  "alertname=DiskFull severity=warning instance=storage-01 team=infrastructure"
  "alertname=SecurityAlert severity=critical category=security source=firewall"
  "alertname=BackupFailed severity=warning service=backup team=data"
  "alertname=HighMemoryUsage severity=warning instance=app-01 environment=staging"
)
# Run the tests
for i in "${!test_cases[@]}"; do
  # printf is used because plain echo does not expand \n in bash
  printf '\n--- Test %d: %s ---\n' "$((i+1))" "${test_cases[i]}"
  # Left unquoted on purpose so each label=value pair becomes its own argument
  if result=$(amtool config routes test --config.file="$CONFIG_FILE" ${test_cases[i]} 2>&1); then
    echo "✅ Resolved to receiver(s):"
  else
    echo "❌ amtool reported an error"
  fi
  echo "$result"
done
printf '\n=== Route tree ===\n'
amtool config routes show --config.file="$CONFIG_FILE"
printf '\n=== Tests complete ===\n'
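Route Debugging Tips
1. Verifying routes with a label
# Add a debug label to the configuration
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-team'
      # Grouping on the debug label makes the route easy to identify
      group_by: ['alertname', 'debug_route']
      routes:
        - match:
            team: database
          receiver: 'dba-critical'
          # Add debug_route=dba-critical to a test alert to verify this path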
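2. Analyzing routing logs
# Most dispatch and notification messages are only logged when Alertmanager
# runs with --log.level=debug
# Search the Alertmanager log for routing information
docker logs alertmanager 2>&1 | grep -i route
# Follow how one specific alert is handled
docker logs alertmanager 2>&1 | grep "HighCPUUsage"
# Watch route handling live
docker logs -f alertmanager 2>&1 | grep -E "(route|receiver|group)"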
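3. Debugging through the API
# These examples use the v2 API; the v1 API has been removed from recent Alertmanager releases
# List the currently active alert groups
curl -s http://localhost:9093/api/v2/alerts/groups | jq .
# List the groups of one receiver (the receiver parameter is a regex)
curl -s "http://localhost:9093/api/v2/alerts/groups?receiver=critical-team" | jq .
# Show the loaded configuration, including the routes
curl -s http://localhost:9093/api/v2/status | jq -r .config.original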
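The JSON returned for alert groups can be summarized to see at a glance where alerts ended up. The Python sketch below works on a small hand-written sample rather than a live server; the sample assumes that each group object has `labels`, a `receiver` object with a `name`, and an `alerts` list, and its contents are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample shaped like an alert-groups response
sample = json.loads("""
[
  {"labels": {"alertname": "HighCPU"},
   "receiver": {"name": "critical-team"},
   "alerts": [{"labels": {"alertname": "HighCPU", "instance": "web1"}},
              {"labels": {"alertname": "HighCPU", "instance": "web2"}}]},
  {"labels": {"alertname": "DiskFull"},
   "receiver": {"name": "default"},
   "alerts": [{"labels": {"alertname": "DiskFull", "instance": "db1"}}]}
]
""")


def alerts_per_receiver(groups):
    """Count how many alerts each receiver is currently responsible for."""
    counts = Counter()
    for group in groups:
        counts[group["receiver"]["name"]] += len(group["alerts"])
    return dict(counts)


print(alerts_per_receiver(sample))
# {'critical-team': 2, 'default': 1}
```

A large count on the default receiver is usually a hint that some alerts are missing the labels the child routes expect.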
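4.5 Route Performance Optimization
Optimizing Route Matching
1. Ordering the routes
# Before: broad route first
route:
  routes:
    # A broad regex in front is evaluated for every alert
    - match_re:
        alertname: '.*'
      receiver: 'catch-all'
      continue: true
    # The specific match comes second
    - match:
        severity: critical
      receiver: 'critical-team'
# After: specific route first
# Note that this also changes behavior: critical alerts now go only to
# 'critical-team' and no longer to 'catch-all' as well
route:
  routes:
    # Specific match in front
    - match:
        severity: critical
      receiver: 'critical-team'
    # Broad match last
    - match_re:
        alertname: '.*'
      receiver: 'catch-all'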
2. Reducing route depth
# Before: deeply nested routes
route:
  routes:
    - match:
        environment: production
      routes:
        - match:
            severity: critical
          routes:
            - match:
                team: database
              routes:
                - match:
                    db_type: mysql
                  receiver: 'mysql-critical'
# After: a flat route with all conditions in one place
route:
  routes:
    - matchers:
        - environment = "production"
        - severity = "critical"
        - team = "database"
        - db_type = "mysql"
      receiver: 'mysql-critical'
3. Using new-style matchers
# Before: separate match and match_re blocks
route:
  routes:
    - match:
        severity: critical
      match_re:
        team: '^(web|api|mobile)$'
        environment: '^(prod|staging)$'
      receiver: 'app-critical'
# After: a single matchers block
route:
  routes:
    - matchers:
        - severity = "critical"
        - team =~ "^(web|api|mobile)$"
        - environment =~ "^(prod|staging)$"
      receiver: 'app-critical'
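Grouping Performance Optimization
1. A sensible grouping strategy
# Avoid over-grouping
route:
  # Over-grouped: nearly every alert ends up in its own group
  # (commented out because a mapping cannot contain group_by twice)
  # group_by: ['alertname', 'instance', 'job', 'severity', 'team']
  # Reasonable: group on the core dimensions only
  group_by: ['alertname', 'cluster']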
2. Tuning the timing
route:
  receiver: 'default'
  routes:
    # High-frequency alerts: short waits (the receiver is inherited from the parent)
    - match:
        frequency: high
      group_wait: 5s
      group_interval: 30s
      repeat_interval: 5m
    # Low-frequency alerts: long waits
    - match:
        frequency: low
      group_wait: 2m
      group_interval: 10m
      repeat_interval: 1h
4.6 Case Studies
Case 1: Routing for a Microservice Architecture
# Routing configuration for a microservice environment
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
    # Infrastructure layer
    - matchers:
        - layer = "infrastructure"
      receiver: 'infra-team'
      group_by: ['alertname', 'datacenter']
      routes:
        # Network alerts
        - matchers:
            - component = "network"
            - severity = "critical"
          receiver: 'network-oncall'
          group_wait: 0s
        # Storage alerts
        - matchers:
            - component = "storage"
          receiver: 'storage-team'
          group_by: ['alertname', 'volume']
        # Compute alerts
        - matchers:
            - component = "compute"
          receiver: 'compute-team'
          group_by: ['alertname', 'node']
    # Platform layer
    - matchers:
        - layer = "platform"
      receiver: 'platform-team'
      group_by: ['alertname', 'service']
      routes:
        # Kubernetes alerts
        - matchers:
            - platform = "kubernetes"
          receiver: 'k8s-team'
          group_by: ['alertname', 'namespace']
        # Database alerts
        - matchers:
            - platform = "database"
          receiver: 'dba-team'
          group_by: ['alertname', 'database']
          routes:
            - matchers:
                - db_type = "mysql"
                - severity = "critical"
              receiver: 'mysql-oncall'
            - matchers:
                - db_type = "postgresql"
                - severity = "critical"
              receiver: 'postgres-oncall'
        # Message queue alerts
        - matchers:
            - platform = "messaging"
          receiver: 'messaging-team'
          group_by: ['alertname', 'queue']
    # Application layer
    - matchers:
        - layer = "application"
      receiver: 'app-team'
      group_by: ['alertname', 'service']
      routes:
        # Frontend applications
        - matchers:
            - app_type = "frontend"
          receiver: 'frontend-team'
          group_by: ['alertname', 'service', 'environment']
        # Backend APIs
        - matchers:
            - app_type = "backend"
          receiver: 'backend-team'
          group_by: ['alertname', 'service', 'version']
          routes:
            # User service
            - matchers:
                - service = "user-service"
              receiver: 'user-team'
            # Order service
            - matchers:
                - service = "order-service"
              receiver: 'order-team'
            # Payment service
            - matchers:
                - service = "payment-service"
              receiver: 'payment-team'
              group_wait: 0s     # send payment alerts immediately
        # Mobile applications
        - matchers:
            - app_type = "mobile"
          receiver: 'mobile-team'
          group_by: ['alertname', 'platform', 'version']
    # Business layer
    - matchers:
        - layer = "business"
      receiver: 'business-team'
      group_by: ['alertname', 'business_unit']
      routes:
        # Business-critical metrics
        - matchers:
            - metric_type = "business_critical"
          receiver: 'business-critical'
          group_wait: 0s
          repeat_interval: 30m
        # User experience metrics
        - matchers:
            - metric_type = "user_experience"
          receiver: 'ux-team'
          group_by: ['alertname', 'user_segment']
    # Security alerts
    - matchers:
        - category = "security"
      receiver: 'security-team'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 15m
      routes:
        # High-risk security incidents
        - matchers:
            - severity = "critical"
            - threat_level = "high"
          receiver: 'security-incident'
          continue: true
        # Compliance alerts
        - matchers:
            - compliance = "required"
          receiver: 'compliance-team'
    # Environment-specific routes. None of the routes above sets continue, so
    # these are only reached by alerts without a matching layer or category label
    - matchers:
        - environment = "production"
      receiver: 'prod-oncall'
      continue: true
      group_wait: 5s
      repeat_interval: 30m
    - matchers:
        - environment = "staging"
      receiver: 'staging-team'
      continue: true
      repeat_interval: 2h
    - matchers:
        - environment = "development"
      receiver: 'dev-team'
      repeat_interval: 6h
Case 2: Routing for a Multi-Tenant Environment
# Routing configuration for a multi-tenant SaaS platform
route:
  group_by: ['tenant', 'alertname']
  group_wait: 15s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'platform-default'
  routes:
    # Platform-level alerts
    - matchers:
        - scope = "platform"
      receiver: 'platform-team'
      group_by: ['alertname', 'component']
      routes:
        # Critical platform alerts
        - matchers:
            - severity = "critical"
          receiver: 'platform-oncall'
          group_wait: 0s
          repeat_interval: 15m
          continue: true
        # Platform capacity alerts
        - matchers:
            - alert_type = "capacity"
          receiver: 'capacity-team'
          group_by: ['alertname', 'resource_type']
    # Tenant-specific alerts
    - matchers:
        - scope = "tenant"
      receiver: 'tenant-router'
      group_by: ['tenant', 'alertname']
      routes:
        # Enterprise customers (high priority)
        - matchers:
            - tenant_tier = "enterprise"
          receiver: 'enterprise-support'
          group_wait: 5s
          repeat_interval: 30m
          routes:
            # Critical alerts of enterprise customers
            - matchers:
                - severity = "critical"
              receiver: 'enterprise-oncall'
              group_wait: 0s
              repeat_interval: 10m
        # Professional customers (medium priority)
        - matchers:
            - tenant_tier = "professional"
          receiver: 'professional-support'
          group_wait: 10s
          repeat_interval: 1h
        # Basic customers (low priority)
        - matchers:
            - tenant_tier = "basic"
          receiver: 'basic-support'
          group_wait: 30s
          repeat_interval: 4h
        # Trial customers (lowest priority)
        - matchers:
            - tenant_tier = "trial"
          receiver: 'trial-support'
          group_wait: 2m
          repeat_interval: 12h
    # Routing by region. The *-business-hours intervals referenced below
    # must be defined under time_intervals (definitions not shown)
    - matchers:
        - region =~ ".+"
      receiver: 'regional-router'
      continue: true
      routes:
        - matchers:
            - region = "us-west"
          receiver: 'us-west-team'
          active_time_intervals:
            - 'us-west-business-hours'
        - matchers:
            - region = "us-east"
          receiver: 'us-east-team'
          active_time_intervals:
            - 'us-east-business-hours'
        - matchers:
            - region = "eu-central"
          receiver: 'eu-team'
          active_time_intervals:
            - 'eu-business-hours'
        - matchers:
            - region = "asia-pacific"
          receiver: 'apac-team'
          active_time_intervals:
            - 'apac-business-hours'
    # SLA-related alerts
    - matchers:
        - alert_type = "sla"
      receiver: 'sla-team'
      group_by: ['tenant', 'sla_metric']
      routes:
        # SLA violations
        - matchers:
            - sla_status = "violated"
          receiver: 'sla-violation'
          group_wait: 0s
          repeat_interval: 15m
          continue: true
        # SLA at risk
        - matchers:
            - sla_status = "at_risk"
          receiver: 'sla-risk'
          group_wait: 5s
          repeat_interval: 30m
Case 3: Routing for the Financial Industry
# Strict routing configuration for a financial institution
route:
  group_by: ['system', 'alertname']
  group_wait: 5s
  group_interval: 2m
  repeat_interval: 30m
  receiver: 'financial-default'
  routes:
    # Trading system alerts (highest priority)
    - matchers:
        - system = "trading"
      receiver: 'trading-team'
      group_wait: 0s
      group_interval: 30s
      repeat_interval: 5m
      routes:
        # Trade execution alerts
        - matchers:
            - component = "execution"
          receiver: 'trading-execution'
          continue: true
        # Risk management alerts
        - matchers:
            - component = "risk"
          receiver: 'risk-management'
          continue: true
        # Market data alerts
        - matchers:
            - component = "market_data"
          receiver: 'market-data-team'
    # Payment system alerts
    - matchers:
        - system = "payment"
      receiver: 'payment-team'
      group_wait: 2s
      repeat_interval: 10m
      routes:
        # Critical payment processing alerts
        - matchers:
            - severity = "critical"
          receiver: 'payment-critical'
          continue: true
        # Fraud detection alerts
        - matchers:
            - component = "fraud_detection"
          receiver: 'fraud-team'
          group_wait: 0s
    # Core banking system alerts
    - matchers:
        - system = "core_banking"
      receiver: 'core-banking-team'
      group_wait: 1s
      repeat_interval: 15m
      routes:
        # Account service alerts
        - matchers:
            - service = "account"
          receiver: 'account-team'
        # Lending service alerts
        - matchers:
            - service = "lending"
          receiver: 'lending-team'
    # Compliance and regulatory alerts
    - matchers:
        - category = "compliance"
      receiver: 'compliance-team'
      group_wait: 0s
      repeat_interval: 5m
      routes:
        # Regulatory reporting alerts
        - matchers:
            - report_type = "regulatory"
          receiver: 'regulatory-team'
          continue: true
        # AML alerts
        - matchers:
            - compliance_type = "aml"
          receiver: 'aml-team'
          continue: true
    # Security alerts (financial grade)
    - matchers:
        - category = "security"
      receiver: 'financial-security'
      group_wait: 0s
      group_interval: 30s
      repeat_interval: 5m
      routes:
        # Network security incidents
        - matchers:
            - security_type = "network"
          receiver: 'network-security'
          continue: true
        # Data breach alerts
        - matchers:
            - security_type = "data_breach"
          receiver: 'data-security'
          continue: true
        # Authentication alerts
        - matchers:
            - security_type = "authentication"
          receiver: 'identity-team'
    # Business continuity alerts
    - matchers:
        - category = "business_continuity"
      receiver: 'bcp-team'
      group_wait: 0s
      repeat_interval: 10m
      routes:
        # Disaster recovery alerts
        - matchers:
            - event_type = "disaster_recovery"
          receiver: 'dr-team'
          continue: true
        # Backup system alerts
        - matchers:
            - component = "backup"
          receiver: 'backup-team'
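Chapter Summary
This chapter covered Alertmanager's routing and grouping system in depth:
Core Concepts
- Routing system: a tree-structured mechanism for distributing alerts
- Matchers: exact match, regex match, and the new-style matchers
- Grouping: label-based aggregation of alerts
- Timing: settings that control when and how often notifications are sent
Configuration Essentials
- Route design: a sensible hierarchy and matching order
- Grouping strategy: balance aggregation against response time
- Timing: tune the timing parameters to the severity
- Performance: keep matchers simple and the route tree shallow
Best Practices
- Test first: verify the routing configuration with amtool
- Improve gradually: start with a simple configuration and refine it
- Monitor and debug: keep watching how routing performs and behaves
- Maintain documentation: record routing design decisions and changes
Next Steps
The next chapter covers notification channels in detail, including:
- Advanced email notification settings
- Slack integration and customization
- Webhook development and integration
- Third-party service integrations
Next chapter: Notification Channel Configuration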