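6.1 Alert Inhibition Overview
How Inhibition Works
Inhibition is a core Alertmanager feature: while a source alert is firing, notifications for related target alerts are muted automatically, which prevents alert storms and cuts noise.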
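flowchart TD
    A[Alert fires] --> B{Check inhibition rules}
    B -->|Matches an inhibition rule| C[Inhibit alert]
    B -->|No match| D[Send normally]
    C --> E[Log the inhibition]
    D --> F[Send notification]
    G[Source alert] --> H[Inhibition rule]
    I[Target alert] --> H
    H --> J[Label matching]
    J --> K[Inhibition takes effect]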
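Inhibition Rule Structure
inhibit_rules:
  - source_match:        # matchers for the source (inhibiting) alert
      severity: 'critical'
    target_match:        # matchers for the target (inhibited) alert
      severity: 'warning'
    equal:               # labels whose values must be equal in both alerts
      - 'cluster'
      - 'service'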
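The matching logic behind a rule like this can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's implementation: the function names `matches` and `is_inhibited` and the dictionary layout are assumptions made for illustration, alerts are plain label dictionaries, and Python's `re` stands in for Go's regular expression engine.

```python
import re

def matches(labels, exact=None, regex=None):
    """True if labels satisfy every exact matcher and every fully anchored regex matcher."""
    exact = exact or {}
    regex = regex or {}
    return (all(labels.get(k, "") == v for k, v in exact.items())
            and all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items()))

def is_inhibited(target, firing_alerts, rule):
    """Check whether `target` is muted by any firing source alert under `rule`."""
    if not matches(target, rule.get("target_match"), rule.get("target_match_re")):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # simplification: an alert never inhibits itself
        if not matches(source, rule.get("source_match"), rule.get("source_match_re")):
            continue
        # A missing label counts as an empty string, so two alerts that both
        # lack a label listed in `equal` are still equal on that label.
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False

rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["cluster", "service"]}
critical = {"alertname": "ServiceDown", "severity": "critical", "cluster": "c1", "service": "api"}
warning = {"alertname": "HighLatency", "severity": "warning", "cluster": "c1", "service": "api"}
other = {"alertname": "HighLatency", "severity": "warning", "cluster": "c2", "service": "api"}

print(is_inhibited(warning, [critical, warning, other], rule))
print(is_inhibited(other, [critical, warning, other], rule))
```

The `equal` check is the part that most often surprises: if neither alert carries `cluster` or `service`, the rule still applies, so it is worth making sure the listed labels exist on every alert the rule is meant to cover.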
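Inhibition Scenarios
| Scenario | Source alert | Target alert | Inhibition logic | Business value |
|---|---|---|---|---|
| Service cascade | Service unavailable | Slow service responses | Inhibit performance alerts once the service is already down | Less noise |
| Infrastructure | Node down | Application alerts on that node | Inhibit application alerts while the node is down | Focus on the root cause |
| Network partition | Network unreachable | Service connection failures | Inhibit connection alerts during a network problem | Fewer false alarms |
| Maintenance window | Maintenance mode | All related alerts | Inhibit business alerts during maintenance | Less disruption |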
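6.2 Basic Inhibition Rule Configuration
Simple Inhibition Rules
# alertmanager.yml
inhibit_rules:
  # Rule 1: critical alerts inhibit warning alerts
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal:
      - 'instance'
      - 'job'
  # Rule 2: a node-down alert inhibits the other node alerts
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      # The pattern also matches NodeDown itself; Alertmanager does not let
      # an alert that matches both sides inhibit itself.
      alertname: 'Node.*'
    equal:
      - 'instance'
  # Rule 3: a service-unavailable alert inhibits performance alerts
  - source_match:
      alertname: 'ServiceUnavailable'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate)'
    equal:
      - 'service'
      - 'environment'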
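Regex matchers in Alertmanager are fully anchored, so a pattern has to match the whole label value. A quick way to check which alert names a `target_match_re` pattern would cover is sketched below; the helper `covered` and the sample alert names are made up for this example, and Python's `re.fullmatch` is used as an approximation of the anchored matching.

```python
import re

def covered(pattern, alertnames):
    """Return the alert names that a fully anchored regex matcher would select."""
    return [name for name in alertnames if re.fullmatch(pattern, name)]

# Hypothetical alert names used only for illustration
names = ["NodeDown", "NodeDiskFull", "KubeNodeNotReady", "HighLatency", "HighLatencyP99"]

print(covered("Node.*", names))
print(covered("(HighLatency|HighErrorRate)", names))
```

Because of the anchoring, `Node.*` does not pick up `KubeNodeNotReady`, and `(HighLatency|HighErrorRate)` does not pick up `HighLatencyP99`; add an explicit `.*` where a prefix or suffix should be tolerated.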
Complex Inhibition Rules
inhibit_rules:
  # Database primary failover
  - source_match:
      alertname: 'DatabaseMasterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(DatabaseSlowQuery|DatabaseConnectionHigh|DatabaseReplicationLag)'
    equal:
      - 'cluster'
      - 'environment'
  # Network partition
  - source_match:
      alertname: 'NetworkPartition'
      severity: 'critical'
    target_match_re:
      alertname: '(ServiceUnavailable|HighLatency|ConnectionFailed)'
    equal:
      - 'datacenter'
      - 'zone'
  # Kubernetes node problems
  - source_match:
      alertname: 'KubernetesNodeNotReady'
    target_match_re:
      alertname: '(KubernetesPodCrashLooping|KubernetesPodNotReady|KubernetesContainerOOMKilled)'
    equal:
      - 'node'
      - 'cluster'
  # Storage system cascade
  - source_match:
      alertname: 'StorageClusterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(DiskSpaceHigh|DiskIOHigh|FileSystemReadOnly)'
    equal:
      - 'storage_cluster'
      - 'environment'
  # Load balancer failure
  - source_match:
      alertname: 'LoadBalancerDown'
    target_match_re:
      alertname: '(BackendUnhealthy|HighResponseTime|ConnectionRefused)'
    equal:
      - 'lb_cluster'
      - 'service'
  # Microservice dependency chain
  - source_match:
      alertname: 'UpstreamServiceDown'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal:
      - 'service_chain'
      - 'environment'
Time-Window Inhibition
# Time-window inhibition driven by labels.
# Each rule applies only while its source alert (for example MaintenanceMode) is firing.
inhibit_rules:
  # Inhibit during a maintenance window
  - source_match:
      alertname: 'MaintenanceMode'
      maintenance: 'true'
    target_match_re:
      alertname: '.*'
    equal:
      - 'cluster'
      - 'environment'
  # Inhibit during a deployment
  - source_match:
      alertname: 'DeploymentInProgress'
      deployment: 'active'
    target_match_re:
      alertname: '(ServiceUnavailable|HighLatency|HighErrorRate)'
    equal:
      - 'service'
      - 'version'
  # Inhibit during a backup
  - source_match:
      alertname: 'BackupInProgress'
    target_match_re:
      alertname: '(DiskIOHigh|DatabaseSlowQuery)'
    equal:
      - 'database'
      - 'instance'
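These rules need something to keep the `MaintenanceMode` source alert firing for the duration of the window. One possible approach, sketched here under assumptions, is to push a synthetic alert to Alertmanager's `POST /api/v2/alerts` endpoint with an explicit `endsAt`. The helper name `maintenance_alert`, the annotation text, and the example label values are hypothetical; only the payload is built here and nothing is sent.

```python
import json
from datetime import datetime, timedelta, timezone

def maintenance_alert(cluster, environment, hours, now=None):
    """Build a request body (a list of alerts) for a synthetic MaintenanceMode alert."""
    now = now or datetime.now(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return [{
        "labels": {
            "alertname": "MaintenanceMode",
            "maintenance": "true",
            "cluster": cluster,          # must equal the target alerts' cluster label
            "environment": environment,  # must equal the target alerts' environment label
        },
        "annotations": {"summary": f"Planned maintenance on {cluster}"},
        "startsAt": now.strftime(fmt),
        # The alert counts as resolved once endsAt has passed
        "endsAt": (now + timedelta(hours=hours)).strftime(fmt),
    }]

payload = maintenance_alert("c1", "production", 2,
                            now=datetime(2024, 1, 20, 2, 0, tzinfo=timezone.utc))
print(json.dumps(payload, indent=2))
```

The synthetic alert is itself routed like any other alert, so it usually makes sense to send it to a receiver that does not notify anyone.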
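6.3 Silence Management
Silence Concepts and Use Cases
A silence mutes matching alerts on demand for a fixed time range. Typical uses:
- Planned maintenance periods
- Temporary handling of known issues
- Muting alerts from test environments
- Quick relief during an emergency
flowchart LR
    A[Create silence] --> B[Configure matchers]
    B --> C[Set time range]
    C --> D[Silence becomes active]
    D --> E[Alerts are muted]
    E --> F[Silence expires]
    F --> G[Back to normal]
    H[Web UI] --> A
    I[API] --> A
    J[amtool] --> A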
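As a rough model of the flow above, a silence mutes an alert when the current time falls inside its time range and every matcher matches the alert's labels. The sketch below assumes the matcher shape used by the API (`name`, `value`, `isRegex`); the function name `silence_applies` and the sample data are illustrative only.

```python
import re
from datetime import datetime, timezone

def silence_applies(silence, labels, now):
    """True if `silence` is in its time range and all of its matchers match `labels`."""
    if not (silence["startsAt"] <= now < silence["endsAt"]):
        return False  # pending or expired
    for m in silence["matchers"]:
        value = labels.get(m["name"], "")
        if m["isRegex"]:
            ok = re.fullmatch(m["value"], value) is not None
        else:
            ok = value == m["value"]
        if not ok:
            return False
    return True

silence = {
    "matchers": [
        {"name": "alertname", "value": "High.*", "isRegex": True},
        {"name": "environment", "value": "production", "isRegex": False},
    ],
    "startsAt": datetime(2024, 1, 20, 2, 0, tzinfo=timezone.utc),
    "endsAt": datetime(2024, 1, 20, 4, 0, tzinfo=timezone.utc),
}
alert = {"alertname": "HighCPUUsage", "environment": "production"}

print(silence_applies(silence, alert, datetime(2024, 1, 20, 3, 0, tzinfo=timezone.utc)))
print(silence_applies(silence, alert, datetime(2024, 1, 20, 5, 0, tzinfo=timezone.utc)))
```

All matchers are combined with AND, which is why adding a matcher always narrows a silence and never widens it.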
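Creating a Silence in the Web UI
Open the Alertmanager Web UI
http://alertmanager.example.com:9093
Steps to create a silence
- Click the "Silences" tab
- Click the "New Silence" button
- Configure the matchers and the time range
- Add a comment and the creator's name
- Submit the silence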
Creating a Silence via the API
#!/bin/bash
# create-silence.sh
# Uses the v2 API; the v1 API was removed in Alertmanager 0.27.
# Requires GNU date for the -d option and the %3N format.
ALERTMANAGER_URL="http://localhost:9093"
# JSON payload for the silence
silence_data='{
  "matchers": [
    {
      "name": "alertname",
      "value": "HighCPUUsage",
      "isRegex": false
    },
    {
      "name": "instance",
      "value": "server1:9100",
      "isRegex": false
    }
  ],
  "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'",
  "endsAt": "'$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%S.%3NZ)'",
  "createdBy": "admin",
  "comment": "Planned maintenance for server1"
}'
echo "Creating silence..."
# --fail makes curl exit non-zero on HTTP errors, so the check below is meaningful
if curl -sS --fail -XPOST "$ALERTMANAGER_URL/api/v2/silences" \
    -H "Content-Type: application/json" \
    -d "$silence_data"; then
    echo "✅ Silence created"
else
    echo "❌ Failed to create silence"
    exit 1
fi
Managing Silences with amtool
#!/bin/bash
# amtool-silence-management.sh
ALERTMANAGER_URL="http://localhost:9093"
echo "=== amtool silence management examples ==="
# 1. List current silences
printf '\n1. Current silences:\n'
amtool --alertmanager.url="$ALERTMANAGER_URL" silence query
# 2. Create a silence for a maintenance window
printf '\n2. Create a maintenance-window silence:\n'
amtool --alertmanager.url="$ALERTMANAGER_URL" silence add \
    alertname="HighCPUUsage" \
    instance="server1:9100" \
    --duration="2h" \
    --author="admin" \
    --comment="Planned maintenance"
# 3. Create a silence with a regex matcher
printf '\n3. Create a regex silence:\n'
amtool --alertmanager.url="$ALERTMANAGER_URL" silence add \
    alertname=~"High.*" \
    environment="production" \
    --duration="1h" \
    --author="oncall" \
    --comment="Performance optimization"
# 4. Query specific silences
printf '\n4. Query specific silences:\n'
amtool --alertmanager.url="$ALERTMANAGER_URL" silence query \
    alertname="HighCPUUsage"
# 5. Expire a silence
printf '\n5. Expire a silence (needs the silence ID):\n'
# SILENCE_ID=$(amtool --alertmanager.url="$ALERTMANAGER_URL" silence query -q | head -1 | awk '{print $1}')
# amtool --alertmanager.url="$ALERTMANAGER_URL" silence expire "$SILENCE_ID"
# 6. Create silences in bulk
printf '\n6. Create silences in bulk:\n'
# Each input line is: matchers|duration|author|comment
while IFS='|' read -r matchers duration author comment; do
    # $matchers is left unquoted on purpose so it splits into one argument per matcher
    amtool --alertmanager.url="$ALERTMANAGER_URL" silence add $matchers \
        --duration="$duration" --author="$author" --comment="$comment" < /dev/null
done << 'EOF'
alertname=DiskSpaceHigh instance=server1:9100|4h|admin|Disk cleanup
alertname=MemoryHigh instance=server2:9100|2h|admin|Memory optimization
alertname=NetworkLatency datacenter=dc1|1h|network-team|Network upgrade
EOF
printf '\n=== Silence management finished ===\n'
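When the same bulk operation is driven from Python, building the argument list explicitly avoids shell quoting problems altogether. The sketch below only assembles and prints the command; the helper name `amtool_silence_add` is an assumption, while the flags are the ones already used in the script above.

```python
import shlex

def amtool_silence_add(url, matchers, duration, author, comment):
    """Build the argv list for `amtool silence add` without going through a shell."""
    args = ["amtool", f"--alertmanager.url={url}", "silence", "add"]
    args += list(matchers)  # e.g. ["alertname=DiskSpaceHigh", "instance=server1:9100"]
    args += [f"--duration={duration}", f"--author={author}", f"--comment={comment}"]
    return args

cmd = amtool_silence_add(
    "http://localhost:9093",
    ["alertname=DiskSpaceHigh", "instance=server1:9100"],
    "4h", "admin", "Disk cleanup",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` keeps a comment such as `Disk cleanup` as a single argument, so no `eval` or manual escaping is needed.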
Advanced Silence Configuration
#!/bin/bash
# advanced-silence.sh
ALERTMANAGER_URL="http://localhost:9093"
# Function: create a silence from a JSON matcher list
create_complex_silence() {
    local name="$1"
    local matchers="$2"
    local duration="$3"
    local comment="$4"
    echo "Creating silence: $name"
    silence_data=$(cat << EOF
{
  "matchers": $matchers,
  "startsAt": "$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)",
  "endsAt": "$(date -u -d "+$duration" +%Y-%m-%dT%H:%M:%S.%3NZ)",
  "createdBy": "automation",
  "comment": "$comment"
}
EOF
)
    curl -sS --fail -XPOST "$ALERTMANAGER_URL/api/v2/silences" \
        -H "Content-Type: application/json" \
        -d "$silence_data"
}
# 1. Datacenter maintenance silence
dc_maintenance_matchers='[
  {
    "name": "datacenter",
    "value": "dc1",
    "isRegex": false
  },
  {
    "name": "severity",
    "value": "warning|info",
    "isRegex": true
  }
]'
create_complex_silence "DC1 Maintenance" \
    "$dc_maintenance_matchers" \
    "4 hours" \
    "Scheduled datacenter maintenance"
# 2. Application deployment silence
app_deployment_matchers='[
  {
    "name": "service",
    "value": "user-service",
    "isRegex": false
  },
  {
    "name": "environment",
    "value": "production",
    "isRegex": false
  },
  {
    "name": "alertname",
    "value": "(HighLatency|HighErrorRate|ServiceUnavailable)",
    "isRegex": true
  }
]'
create_complex_silence "User Service Deployment" \
    "$app_deployment_matchers" \
    "30 minutes" \
    "Production deployment of user-service v2.1.0"
# 3. Test environment silence
test_env_matchers='[
  {
    "name": "environment",
    "value": "test|staging",
    "isRegex": true
  }
]'
create_complex_silence "Test Environment" \
    "$test_env_matchers" \
    "24 hours" \
    "Suppress all test environment alerts"
# 4. Team-specific silence
team_matchers='[
  {
    "name": "team",
    "value": "frontend",
    "isRegex": false
  },
  {
    "name": "severity",
    "value": "info",
    "isRegex": false
  }
]'
create_complex_silence "Frontend Team Info Alerts" \
    "$team_matchers" \
    "8 hours" \
    "Suppress info level alerts for frontend team during sprint"
6.4 Automated Silence Management
Event-Driven Automatic Silences
# auto_silence.py
import requests
from datetime import datetime, timedelta
import logging

class AlertmanagerSilenceManager:
    def __init__(self, alertmanager_url):
        self.base_url = alertmanager_url
        self.logger = logging.getLogger(__name__)

    def create_silence(self, matchers, duration_hours, comment, created_by="automation"):
        """Create a silence and return its ID, or None on failure."""
        start_time = datetime.utcnow()
        end_time = start_time + timedelta(hours=duration_hours)
        silence_data = {
            "matchers": matchers,
            "startsAt": start_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "endsAt": end_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "createdBy": created_by,
            "comment": comment
        }
        try:
            # The v2 API answers with {"silenceID": "..."}
            response = requests.post(
                f"{self.base_url}/api/v2/silences",
                json=silence_data,
                headers={"Content-Type": "application/json"}
            )
            response.raise_for_status()
            silence_id = response.json().get("silenceID")
            self.logger.info(f"Created silence {silence_id}: {comment}")
            return silence_id
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to create silence: {e}")
            return None

    def get_active_silences(self):
        """Return the silences that are currently active."""
        try:
            response = requests.get(f"{self.base_url}/api/v2/silences")
            response.raise_for_status()
            # The v2 API returns a plain JSON list of silences
            silences = response.json()
            active_silences = [
                s for s in silences
                if s.get("status", {}).get("state") == "active"
            ]
            return active_silences
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to get silences: {e}")
            return []

    def expire_silence(self, silence_id):
        """Expire a silence."""
        try:
            response = requests.delete(
                f"{self.base_url}/api/v2/silence/{silence_id}"
            )
            response.raise_for_status()
            self.logger.info(f"Expired silence {silence_id}")
            return True
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to expire silence {silence_id}: {e}")
            return False

    def create_maintenance_silence(self, service, environment, duration_hours):
        """Create a silence for service maintenance."""
        matchers = [
            {"name": "service", "value": service, "isRegex": False},
            {"name": "environment", "value": environment, "isRegex": False}
        ]
        comment = f"Automated maintenance silence for {service} in {environment}"
        return self.create_silence(matchers, duration_hours, comment)

    def create_deployment_silence(self, service, version, duration_minutes=30):
        """Create a silence for a deployment."""
        matchers = [
            {"name": "service", "value": service, "isRegex": False},
            {"name": "alertname", "value": "(HighLatency|HighErrorRate|ServiceUnavailable)", "isRegex": True}
        ]
        comment = f"Automated deployment silence for {service} v{version}"
        return self.create_silence(matchers, duration_minutes / 60, comment)

    def create_infrastructure_silence(self, node, duration_hours):
        """Create a silence for infrastructure maintenance."""
        matchers = [
            {"name": "instance", "value": f"{node}:.*", "isRegex": True}
        ]
        comment = f"Automated infrastructure maintenance silence for {node}"
        return self.create_silence(matchers, duration_hours, comment)

# Usage example
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Initialize the manager
    silence_manager = AlertmanagerSilenceManager("http://localhost:9093")
    # Create a maintenance silence
    silence_manager.create_maintenance_silence(
        service="user-service",
        environment="production",
        duration_hours=2
    )
    # Create a deployment silence
    silence_manager.create_deployment_silence(
        service="api-gateway",
        version="v2.1.0",
        duration_minutes=45
    )
    # Create an infrastructure silence
    silence_manager.create_infrastructure_silence(
        node="worker-node-01",
        duration_hours=4
    )
    # List active silences
    active_silences = silence_manager.get_active_silences()
    print(f"Active silences: {len(active_silences)}")
    for silence in active_silences:
        print(f"  - {silence['id']}: {silence['comment']}")
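Automation like this tends to run repeatedly, and every run creates another silence with the same matchers. A small guard, sketched below, compares the matchers of existing silences before creating a new one. The helper names `matcher_key` and `find_existing` are hypothetical; the input is assumed to be a list of silence objects as returned by the silences endpoint.

```python
def matcher_key(matchers):
    """Order-independent key for a list of matcher dicts."""
    return frozenset(
        (m["name"], m["value"], bool(m.get("isRegex", False))) for m in matchers
    )

def find_existing(silences, matchers):
    """Return the ID of an active or pending silence with identical matchers, else None."""
    wanted = matcher_key(matchers)
    for s in silences:
        state = s.get("status", {}).get("state")
        if state in ("active", "pending") and matcher_key(s["matchers"]) == wanted:
            return s["id"]
    return None

# Sample data shaped like an API response (values are made up)
existing = [
    {"id": "a1", "status": {"state": "expired"},
     "matchers": [{"name": "service", "value": "user-service", "isRegex": False}]},
    {"id": "b2", "status": {"state": "active"},
     "matchers": [{"name": "environment", "value": "production", "isRegex": False},
                  {"name": "service", "value": "user-service", "isRegex": False}]},
]
wanted = [{"name": "service", "value": "user-service", "isRegex": False},
          {"name": "environment", "value": "production", "isRegex": False}]

print(find_existing(existing, wanted))
```

A caller could check `find_existing(...)` first and only call `create_silence` when it returns `None`.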
CI/CD-Integrated Automatic Silences
# .github/workflows/deploy-with-silence.yml
name: Deploy with Alertmanager Silence
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create deployment silence
        run: |
          SILENCE_DATA='{
            "matchers": [
              {
                "name": "service",
                "value": "'${{ github.event.repository.name }}'",
                "isRegex": false
              },
              {
                "name": "environment",
                "value": "production",
                "isRegex": false
              },
              {
                "name": "alertname",
                "value": "(HighLatency|HighErrorRate|ServiceUnavailable)",
                "isRegex": true
              }
            ],
            "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'",
            "endsAt": "'$(date -u -d '+30 minutes' +%Y-%m-%dT%H:%M:%S.%3NZ)'",
            "createdBy": "github-actions",
            "comment": "Automated deployment silence for '${{ github.sha }}'"
          }'
          SILENCE_ID=$(curl -s -XPOST "${{ secrets.ALERTMANAGER_URL }}/api/v2/silences" \
            -H "Content-Type: application/json" \
            -d "$SILENCE_DATA" | jq -r '.silenceID')
          echo "SILENCE_ID=$SILENCE_ID" >> $GITHUB_ENV
          echo "Created silence: $SILENCE_ID"
      - name: Deploy application
        run: |
          # Deployment logic
          echo "Deploying application..."
          # kubectl apply -f k8s/
          # helm upgrade --install app ./chart
      - name: Wait for deployment
        run: |
          echo "Waiting for deployment to stabilize..."
          sleep 300  # wait 5 minutes
      - name: Remove silence if deployment successful
        if: success()
        run: |
          # jq prints "null" when the create call returned no silenceID
          if [ -n "$SILENCE_ID" ] && [ "$SILENCE_ID" != "null" ]; then
            curl -XDELETE "${{ secrets.ALERTMANAGER_URL }}/api/v2/silence/$SILENCE_ID"
            echo "Removed silence: $SILENCE_ID"
          fi
      - name: Extend silence if deployment failed
        if: failure()
        run: |
          echo "Deployment failed, keeping silence active for investigation"
          # Optionally extend the silence or send a notification here
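The same create, deploy, then expire-on-success pattern can be expressed in Python as a context manager. This is a sketch under assumptions: `deployment_silence` is a hypothetical helper, and `client` is any object offering `create_silence(matchers, duration_hours, comment)` and `expire_silence(silence_id)`, which is the interface of the `AlertmanagerSilenceManager` class shown earlier. A fake client is used so the example runs without an Alertmanager.

```python
from contextlib import contextmanager

@contextmanager
def deployment_silence(client, matchers, hours, comment):
    """Silence alerts for the duration of a deployment step."""
    silence_id = client.create_silence(matchers, hours, comment)
    try:
        yield silence_id
    except Exception:
        # Deployment failed: leave the silence in place for the investigation
        raise
    else:
        # Deployment succeeded: end the silence early
        if silence_id:
            client.expire_silence(silence_id)

class FakeClient:
    """In-memory stand-in for the real client, for demonstration only."""
    def __init__(self):
        self.created, self.expired = [], []
    def create_silence(self, matchers, duration_hours, comment):
        self.created.append(comment)
        return f"silence-{len(self.created)}"
    def expire_silence(self, silence_id):
        self.expired.append(silence_id)
        return True

client = FakeClient()
with deployment_silence(client, [], 0.5, "deploy v2.1.0") as sid:
    print(f"deploying under silence {sid}")
print(client.expired)
```

If the body of the `with` block raises, the silence simply runs until its `endsAt`, which mirrors the failure branch of the workflow.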
Kubernetes-Integrated Automatic Silences
# k8s-silence-controller.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: silence-controller-config
  namespace: monitoring
data:
  config.yaml: |
    alertmanager:
      url: http://alertmanager:9093
    rules:
      # Node maintenance silence
      - name: node-maintenance
        trigger:
          annotation: "maintenance.alertmanager.io/silence"
          value: "true"
        matchers:
          - name: "instance"
            value: "{{ .NodeName }}:.*"
            isRegex: true
        duration: "4h"
        comment: "Automated node maintenance silence"
      # Pod deployment silence
      - name: pod-deployment
        trigger:
          label: "app.kubernetes.io/version"
        matchers:
          - name: "service"
            value: "{{ .Labels.app }}"
            isRegex: false
          - name: "namespace"
            value: "{{ .Namespace }}"
            isRegex: false
        duration: "30m"
        comment: "Automated deployment silence for {{ .Labels.app }}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: silence-controller
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: silence-controller
  template:
    metadata:
      labels:
        app: silence-controller
    spec:
      serviceAccountName: silence-controller
      containers:
        - name: controller
          image: silence-controller:latest
          env:
            - name: CONFIG_PATH
              value: "/etc/config/config.yaml"
          volumeMounts:
            - name: config
              mountPath: /etc/config
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
      volumes:
        - name: config
          configMap:
            name: silence-controller-config
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: silence-controller
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: silence-controller
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: silence-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: silence-controller
subjects:
  - kind: ServiceAccount
    name: silence-controller
    namespace: monitoring
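The manifest does not show what the controller does with a rule. One plausible reading of the `node-maintenance` rule, sketched here purely as an assumption about the controller's behavior, is: when a node carries the trigger annotation with the expected value, substitute the node name into the matcher templates and produce a silence request. The function `silence_for_node` and the plain string substitution are illustrative and not part of any real controller.

```python
def silence_for_node(node_name, annotations, rule):
    """Return a silence request for the node, or None if the trigger does not fire."""
    trigger = rule["trigger"]
    if annotations.get(trigger["annotation"]) != trigger["value"]:
        return None
    matchers = [
        {
            "name": m["name"],
            # Naive stand-in for template rendering of {{ .NodeName }}
            "value": m["value"].replace("{{ .NodeName }}", node_name),
            "isRegex": m["isRegex"],
        }
        for m in rule["matchers"]
    ]
    return {"matchers": matchers, "duration": rule["duration"], "comment": rule["comment"]}

# The node-maintenance rule from the ConfigMap, as a Python dict
node_rule = {
    "name": "node-maintenance",
    "trigger": {"annotation": "maintenance.alertmanager.io/silence", "value": "true"},
    "matchers": [{"name": "instance", "value": "{{ .NodeName }}:.*", "isRegex": True}],
    "duration": "4h",
    "comment": "Automated node maintenance silence",
}

request = silence_for_node(
    "worker-node-01", {"maintenance.alertmanager.io/silence": "true"}, node_rule
)
print(request)
```

With this reading, annotating a node (for example with `kubectl annotate`) is what opens the maintenance silence, and removing the annotation would be the natural signal to expire it.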
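6.5 Alert Noise-Reduction Strategy
Multi-Layer Noise-Reduction Architecture
flowchart TD
    A[Raw alerts] --> B[Prometheus rule filtering]
    B --> C[Received by Alertmanager]
    C --> D[Inhibition rules]
    D --> E[Silences]
    E --> F[Routing and grouping]
    F --> G[Rate limiting]
    G --> H[Final notification]
    I[Configuration level] --> B
    J[Rule level] --> D
    K[Operations level] --> E
    L[Policy level] --> G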
Smart Noise-Reduction Configuration
# Smart noise-reduction configuration
route:
  receiver: 'default'
  group_by: ['cluster', 'service', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    # Fast lane for critical alerts
    - match:
        severity: critical
      receiver: 'critical-fast'
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 30m
      continue: true
    # Throttle high-frequency alerts
    - match:
        frequency: high
      receiver: 'high-frequency-reduced'
      group_wait: 2m
      group_interval: 15m
      repeat_interval: 6h
    # Throttle test-environment alerts
    - match:
        environment: test
      receiver: 'test-env-reduced'
      group_wait: 5m
      group_interval: 30m
      repeat_interval: 24h
    # Aggregate info-level alerts
    - match:
        severity: info
      receiver: 'info-aggregated'
      group_wait: 10m
      group_interval: 1h
      repeat_interval: 12h
# Noise-reduction inhibition rules
inhibit_rules:
  # Infrastructure cascade inhibition
  - source_match:
      category: 'infrastructure'
      severity: 'critical'
    target_match:
      category: 'application'
    equal: ['cluster', 'zone']
  # Service dependency inhibition
  - source_match:
      service_type: 'upstream'
      severity: 'critical'
    target_match:
      service_type: 'downstream'
    equal: ['service_chain']
  # Time-window inhibition
  - source_match:
      window: 'maintenance'
    target_match_re:
      severity: '(warning|info)'
    equal: ['environment']
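The effect of `continue: true` on the first route is easiest to see by walking the routing tree by hand. The sketch below models only one level of child routes with exact `match` entries; the function `receivers_for` and the dictionary form of the config are assumptions for illustration, not Alertmanager code.

```python
def receivers_for(labels, route):
    """Return the receivers an alert reaches in a one-level routing tree."""
    matched = []
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            matched.append(child["receiver"])
            if not child.get("continue", False):
                break  # the first match ends the walk unless continue is set
    # If no child matches, the parent route handles the alert
    return matched or [route["receiver"]]

route = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-fast", "continue": True},
        {"match": {"frequency": "high"}, "receiver": "high-frequency-reduced"},
        {"match": {"environment": "test"}, "receiver": "test-env-reduced"},
        {"match": {"severity": "info"}, "receiver": "info-aggregated"},
    ],
}

print(receivers_for({"severity": "critical", "environment": "test"}, route))
print(receivers_for({"severity": "info", "environment": "test"}, route))
print(receivers_for({"severity": "warning"}, route))
```

Two consequences follow from the ordering: a critical alert from the test environment is delivered to both `critical-fast` and `test-env-reduced`, and an info alert from the test environment never reaches `info-aggregated` because the test route matches first.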
Dynamic Noise-Reduction Script
# dynamic_noise_reduction.py
import requests
from datetime import datetime, timedelta
from collections import defaultdict
import logging

class DynamicNoiseReducer:
    def __init__(self, alertmanager_url, prometheus_url):
        self.alertmanager_url = alertmanager_url
        self.prometheus_url = prometheus_url
        self.logger = logging.getLogger(__name__)

    def analyze_alert_patterns(self, hours=24):
        """Analyze alert patterns over the last `hours` hours."""
        # ALERTS is the series Prometheus records for every pending or firing
        # alert. It carries the alertname label, which
        # alertmanager_alerts_received_total does not.
        # The value counts evaluation samples spent firing, not notifications.
        query = f'sum by (alertname) (count_over_time(ALERTS{{alertstate="firing"}}[{hours}h]))'
        try:
            response = requests.get(
                f"{self.prometheus_url}/api/v1/query",
                params={'query': query}
            )
            response.raise_for_status()
            data = response.json().get('data', {}).get('result', [])
            # Collect the frequency per alert name
            alert_frequency = defaultdict(int)
            for item in data:
                alertname = item['metric'].get('alertname', 'unknown')
                count = float(item['value'][1])
                alert_frequency[alertname] = count
            return alert_frequency
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to analyze alert patterns: {e}")
            return {}

    def create_adaptive_silences(self, alert_frequency, threshold=50):
        """Create adaptive silences based on alert frequency."""
        high_frequency_alerts = {
            alert: freq for alert, freq in alert_frequency.items()
            if freq > threshold
        }
        for alertname, frequency in high_frequency_alerts.items():
            # Higher frequency means a longer silence, capped at 24 hours
            silence_hours = min(frequency / 10, 24)
            matchers = [
                {
                    "name": "alertname",
                    "value": alertname,
                    "isRegex": False
                },
                {
                    "name": "severity",
                    "value": "info|warning",
                    "isRegex": True
                }
            ]
            self.create_silence(
                matchers=matchers,
                duration_hours=silence_hours,
                comment=f"Adaptive silence for high-frequency alert (freq: {frequency})"
            )

    def create_silence(self, matchers, duration_hours, comment):
        """Create a silence and return its ID, or None on failure."""
        start_time = datetime.utcnow()
        end_time = start_time + timedelta(hours=duration_hours)
        silence_data = {
            "matchers": matchers,
            "startsAt": start_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "endsAt": end_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "createdBy": "dynamic-reducer",
            "comment": comment
        }
        try:
            response = requests.post(
                f"{self.alertmanager_url}/api/v2/silences",
                json=silence_data
            )
            response.raise_for_status()
            silence_id = response.json().get("silenceID")
            self.logger.info(f"Created adaptive silence {silence_id}: {comment}")
            return silence_id
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to create silence: {e}")
            return None

    def cleanup_expired_silences(self):
        """Report expired silences created by this script."""
        try:
            response = requests.get(f"{self.alertmanager_url}/api/v2/silences")
            response.raise_for_status()
            # The v2 API returns a plain JSON list of silences
            silences = response.json()
            for silence in silences:
                if (silence.get("createdBy") == "dynamic-reducer" and
                        silence.get("status", {}).get("state") == "expired"):
                    silence_id = silence.get("id")
                    # Alertmanager removes expired silences itself once the
                    # retention period has passed, so this only logs them.
                    self.logger.info(f"Expired silence awaiting cleanup: {silence_id}")
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to cleanup silences: {e}")

    def run_noise_reduction_cycle(self):
        """Run one noise-reduction cycle."""
        self.logger.info("Starting noise reduction cycle")
        # 1. Analyze alert patterns
        alert_frequency = self.analyze_alert_patterns()
        self.logger.info(f"Analyzed {len(alert_frequency)} alert types")
        # 2. Create adaptive silences
        self.create_adaptive_silences(alert_frequency)
        # 3. Review expired silences
        self.cleanup_expired_silences()
        self.logger.info("Noise reduction cycle completed")

# Scheduled-task entry point
if __name__ == "__main__":
    import schedule
    import time
    logging.basicConfig(level=logging.INFO)
    reducer = DynamicNoiseReducer(
        alertmanager_url="http://localhost:9093",
        prometheus_url="http://localhost:9090"
    )
    # Run the noise-reduction analysis once per hour
    schedule.every().hour.do(reducer.run_noise_reduction_cycle)
    while True:
        schedule.run_pending()
        time.sleep(60)
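The duration rule inside `create_adaptive_silences` is simple arithmetic, and pulling it into a pure function makes the behavior at the threshold and at the cap easy to check. The function name `adaptive_silence_hours` and its keyword parameters are introduced here for illustration; the numbers are the ones used in the script.

```python
def adaptive_silence_hours(frequency, threshold=50, per_hour=10, cap=24):
    """Silence length in hours for an alert frequency, or None below the threshold."""
    if frequency <= threshold:
        return None  # not noisy enough to silence
    return min(frequency / per_hour, cap)

for freq in (40, 60, 120, 500):
    print(freq, adaptive_silence_hours(freq))
```

Since this mechanism silences alerts without human review, it may be safer to keep the `severity` matcher restricted to `info|warning`, as the script does, so that critical alerts are never muted automatically.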
6.6 Maintenance Window Management
Planned Maintenance Windows
#!/bin/bash
# maintenance-window.sh
ALERTMANAGER_URL="http://localhost:9093"
# Function: create a maintenance window
create_maintenance_window() {
    local service="$1"
    local environment="$2"
    local start_time="$3"
    local duration_hours="$4"
    local description="$5"
    echo "Creating maintenance window: $service ($environment)"
    echo "Start time: $start_time"
    echo "Duration: ${duration_hours} hours"
    echo "Description: $description"
    # Compute both timestamps from epoch seconds: appending "+ N hours" to an
    # absolute time can be misread by GNU date as a timezone offset.
    start_epoch=$(date -u -d "$start_time" +%s)
    end_epoch=$((start_epoch + duration_hours * 3600))
    start_time_iso=$(date -u -d "@$start_epoch" +%Y-%m-%dT%H:%M:%S.%3NZ)
    end_time=$(date -u -d "@$end_epoch" +%Y-%m-%dT%H:%M:%S.%3NZ)
    # Build the silence payload
    silence_data=$(cat << EOF
{
  "matchers": [
    {
      "name": "service",
      "value": "$service",
      "isRegex": false
    },
    {
      "name": "environment",
      "value": "$environment",
      "isRegex": false
    }
  ],
  "startsAt": "$start_time_iso",
  "endsAt": "$end_time",
  "createdBy": "maintenance-scheduler",
  "comment": "Scheduled maintenance: $description"
}
EOF
)
    # Send the request
    response=$(curl -s -XPOST "$ALERTMANAGER_URL/api/v2/silences" \
        -H "Content-Type: application/json" \
        -d "$silence_data")
    silence_id=$(echo "$response" | jq -r '.silenceID')
    if [ "$silence_id" != "null" ] && [ -n "$silence_id" ]; then
        echo "✅ Maintenance window created, silence ID: $silence_id"
        # Append to the log file
        echo "$(date): Created maintenance window $silence_id for $service ($environment)" >> /var/log/maintenance-windows.log
        return 0
    else
        echo "❌ Failed to create maintenance window"
        echo "Response: $response"
        return 1
    fi
}
# Function: create maintenance windows in bulk
batch_create_maintenance() {
    local config_file="$1"
    if [ ! -f "$config_file" ]; then
        echo "Config file not found: $config_file"
        return 1
    fi
    echo "Creating maintenance windows from config file: $config_file"
    while IFS=',' read -r service environment start_time duration description; do
        # Skip comment lines and empty lines
        [[ "$service" =~ ^#.*$ ]] && continue
        [[ -z "$service" ]] && continue
        create_maintenance_window "$service" "$environment" "$start_time" "$duration" "$description"
        sleep 1  # avoid sending requests too quickly
    done < "$config_file"
}
# Usage example
echo "=== Maintenance window management ==="
# 1. A single maintenance window
create_maintenance_window \
    "user-service" \
    "production" \
    "2024-01-20 02:00:00" \
    "4" \
    "Database migration and service upgrade"
# 2. Maintenance windows in bulk
cat > maintenance-schedule.csv << EOF
# service,environment,start_time,duration_hours,description
api-gateway,production,2024-01-21 01:00:00,2,Load balancer configuration update
user-service,production,2024-01-21 03:00:00,3,Database schema migration
payment-service,production,2024-01-21 06:00:00,1,Security patch deployment
notification-service,staging,2024-01-20 20:00:00,8,Performance testing
EOF
batch_create_maintenance "maintenance-schedule.csv"
printf '\n=== Maintenance window management finished ===\n'
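The same schedule file can also be parsed in Python, which sidesteps the date arithmetic quirks of shell tools. The sketch below reads the CSV format used above and computes the `startsAt` and `endsAt` timestamps; the function name `parse_schedule` and the returned dictionary keys are assumptions, and start times are treated as UTC.

```python
import csv
import io
from datetime import datetime, timedelta

def parse_schedule(text):
    """Parse 'service,environment,start_time,duration_hours,description' lines."""
    fmt_out = "%Y-%m-%dT%H:%M:%SZ"
    windows = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        service, environment, start, hours, description = row
        starts = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
        ends = starts + timedelta(hours=float(hours))
        windows.append({
            "service": service,
            "environment": environment,
            "startsAt": starts.strftime(fmt_out),
            "endsAt": ends.strftime(fmt_out),
            "description": description,
        })
    return windows

schedule_csv = """# service,environment,start_time,duration_hours,description
api-gateway,production,2024-01-21 01:00:00,2,Load balancer configuration update

notification-service,staging,2024-01-20 20:00:00,8,Performance testing
"""

for w in parse_schedule(schedule_csv):
    print(w["service"], w["startsAt"], w["endsAt"])
```

Each returned dictionary has what is needed to build a silence payload with `service` and `environment` matchers.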
Maintenance Window Automation
# maintenance_automation.py
import yaml
import requests
import schedule
import time
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List, Optional
import logging

@dataclass
class MaintenanceWindow:
    service: str
    environment: str
    start_time: datetime  # naive datetime, interpreted as UTC
    duration_hours: int
    description: str
    recurrence: Optional[str] = None  # daily, weekly, monthly
    severity_filter: Optional[List[str]] = None

class MaintenanceScheduler:
    def __init__(self, alertmanager_url: str, config_file: str):
        self.alertmanager_url = alertmanager_url
        self.config_file = config_file
        self.logger = logging.getLogger(__name__)
        self.scheduled_windows = []
        self.created_windows = set()  # windows that already have a silence

    def load_maintenance_config(self):
        """Load the maintenance configuration."""
        try:
            with open(self.config_file, 'r') as f:
                config = yaml.safe_load(f)
            self.scheduled_windows = []
            for window_config in config.get('maintenance_windows', []):
                window = MaintenanceWindow(
                    service=window_config['service'],
                    environment=window_config['environment'],
                    start_time=datetime.fromisoformat(window_config['start_time']),
                    duration_hours=window_config['duration_hours'],
                    description=window_config['description'],
                    recurrence=window_config.get('recurrence'),
                    severity_filter=window_config.get('severity_filter', ['warning', 'info'])
                )
                self.scheduled_windows.append(window)
            self.logger.info(f"Loaded {len(self.scheduled_windows)} maintenance windows")
        except Exception as e:
            self.logger.error(f"Failed to load maintenance config: {e}")

    def create_silence_for_window(self, window: MaintenanceWindow):
        """Create a silence for a maintenance window."""
        matchers = [
            {
                "name": "service",
                "value": window.service,
                "isRegex": False
            },
            {
                "name": "environment",
                "value": window.environment,
                "isRegex": False
            }
        ]
        # Add the severity filter
        if window.severity_filter:
            matchers.append({
                "name": "severity",
                "value": "|".join(window.severity_filter),
                "isRegex": True
            })
        end_time = window.start_time + timedelta(hours=window.duration_hours)
        silence_data = {
            "matchers": matchers,
            "startsAt": window.start_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "endsAt": end_time.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "createdBy": "maintenance-scheduler",
            "comment": f"Scheduled maintenance: {window.description}"
        }
        try:
            response = requests.post(
                f"{self.alertmanager_url}/api/v2/silences",
                json=silence_data
            )
            response.raise_for_status()
            silence_id = response.json().get("silenceID")
            self.logger.info(
                f"Created maintenance silence {silence_id} for {window.service} "
                f"({window.environment}) from {window.start_time} for {window.duration_hours}h"
            )
            return silence_id
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Failed to create maintenance silence: {e}")
            return None

    def schedule_upcoming_windows(self, days_ahead=7):
        """Create silences for windows that start within `days_ahead` days."""
        now = datetime.utcnow()
        future_limit = now + timedelta(days=days_ahead)
        for window in self.scheduled_windows:
            key = (window.service, window.environment, window.start_time)
            if key in self.created_windows:
                continue
            # Check whether the window is inside the scheduling horizon
            if now <= window.start_time <= future_limit:
                # A silence whose startsAt lies in the future stays pending
                # until then, so it can be created right away. A daily
                # schedule.at() job would ignore the date and repeat every day.
                if self.create_silence_for_window(window):
                    self.created_windows.add(key)

    def handle_recurring_windows(self):
        """Move recurring windows forward to their next occurrence."""
        # 'monthly' is approximated as 30 days
        steps = {
            'daily': timedelta(days=1),
            'weekly': timedelta(weeks=1),
            'monthly': timedelta(days=30),
        }
        now = datetime.utcnow()
        for window in self.scheduled_windows:
            step = steps.get(window.recurrence)
            if step is None:
                continue
            # Keep a start time that is still in the future; otherwise advance
            next_time = window.start_time
            while next_time < now:
                next_time += step
            window.start_time = next_time

    def refresh(self):
        """Reload the config and create silences for upcoming windows."""
        self.load_maintenance_config()
        self.handle_recurring_windows()
        self.schedule_upcoming_windows()

    def run_scheduler(self):
        """Run the scheduler."""
        self.logger.info("Starting maintenance scheduler")
        # Process the config now, then once per day
        self.refresh()
        schedule.every().day.do(self.refresh)
        while True:
            schedule.run_pending()
            time.sleep(60)  # check once per minute

# Example configuration file
maintenance_config = """
maintenance_windows:
  - service: "user-service"
    environment: "production"
    start_time: "2024-01-20T02:00:00"
    duration_hours: 4
    description: "Monthly database maintenance"
    recurrence: "monthly"
    severity_filter: ["warning", "info"]
  - service: "api-gateway"
    environment: "production"
    start_time: "2024-01-21T01:00:00"
    duration_hours: 2
    description: "Load balancer update"
    severity_filter: ["warning"]
  - service: "payment-service"
    environment: "staging"
    start_time: "2024-01-20T20:00:00"
    duration_hours: 8
    description: "Performance testing"
    recurrence: "weekly"
    severity_filter: ["warning", "info", "critical"]
"""

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Write the example configuration file
    with open('maintenance-config.yaml', 'w') as f:
        f.write(maintenance_config)
    # Start the scheduler
    scheduler = MaintenanceScheduler(
        alertmanager_url="http://localhost:9093",
        config_file="maintenance-config.yaml"
    )
    scheduler.run_scheduler()
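Treating a month as 30 days makes a monthly window drift away from its original day of the month. A calendar-aware alternative is sketched below using only the standard library; `add_months` and `next_monthly` are hypothetical helpers, and clamping to the last day of shorter months is one possible policy among several.

```python
import calendar
from datetime import datetime

def add_months(dt, months=1):
    """Shift dt by whole calendar months, clamping the day to the month's length."""
    index = dt.month - 1 + months
    year = dt.year + index // 12
    month = index % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)

def next_monthly(start, now):
    """First occurrence of a monthly window at or after `now`."""
    n = 0
    occurrence = start
    while occurrence < now:
        n += 1
        # Always offset from the original start so the day of month is kept
        occurrence = add_months(start, n)
    return occurrence

print(add_months(datetime(2024, 1, 31, 2, 0)))
print(next_monthly(datetime(2024, 1, 20, 2, 0), datetime(2024, 3, 1)))
```

Computing each occurrence from the original start, rather than from the previous occurrence, keeps a window defined on the 31st from sliding permanently to the 28th or 29th after February.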
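Chapter Summary
This chapter covered Alertmanager's alert inhibition and silencing features in depth:
Core Concepts
- Alert inhibition: rule-based, automatic muting of alerts
- Silence management: on-demand muting of alerts and its administration
- Noise-reduction strategy: multi-layer control of alert noise
- Maintenance windows: planned alert management
Technical Points
- Inhibition rule design: source alert, target alert, match conditions
- Ways to create silences: Web UI, API, amtool
- Automation integration: CI/CD, Kubernetes, monitoring systems
- Smart noise reduction: dynamic adjustment based on pattern analysis
Best Practices
- Layered inhibition: an inhibition chain from infrastructure to platform to application
- Time management: choose sensible silence durations and repeat intervals
- Automation first: fewer manual steps, higher efficiency
- Monitor inhibition effectiveness: review and tune the inhibition strategy regularly
Operational Value
- Fewer alert storms: cascading alerts no longer drag down operations
- Focus on the root problem: inhibition makes the primary issue stand out
- Better user experience: fewer useless notifications
- Support for maintenance work: alert management for planned maintenance
Next Steps
The next chapter covers high-availability cluster deployment for Alertmanager, including:
- Cluster architecture design
- Data synchronization and consistency
- Load balancer configuration
- Failover mechanisms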