9.1 监控体系概述

9.1.1 Kubernetes监控架构

# 监控架构说明
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-architecture
data:
  layers: |
    1. 基础设施监控:
       - 节点资源使用率
       - 网络和存储性能
       - 硬件健康状态
    
    2. Kubernetes组件监控:
       - API Server性能
       - etcd集群状态
       - kubelet和容器运行时
    
    3. 应用监控:
       - Pod和容器指标
       - 应用自定义指标
       - 业务指标监控
    
    4. 日志监控:
       - 系统日志收集
       - 应用日志聚合
       - 审计日志分析
  components: |
    - Prometheus: 指标收集和存储
    - Grafana: 可视化和告警
    - AlertManager: 告警管理
    - Jaeger: 分布式追踪
    - ELK Stack: 日志管理

9.1.2 监控指标类型

apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-types
data:
  resource-metrics: |
    # 资源指标
    - CPU使用率和请求/限制
    - 内存使用率和请求/限制
    - 磁盘使用率和I/O
    - 网络流量和延迟
  
  kubernetes-metrics: |
    # Kubernetes指标
    - Pod状态和重启次数
    - Service端点健康状态
    - Deployment副本状态
    - 节点就绪状态
  
  application-metrics: |
    # 应用指标
    - HTTP请求率和延迟
    - 数据库连接池状态
    - 队列长度和处理时间
    - 业务KPI指标
  
  custom-metrics: |
    # 自定义指标
    - 应用特定的业务指标
    - 第三方服务集成指标
    - 用户定义的SLI指标
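
要让Prometheus采集上述应用指标和自定义指标,常见做法是在Pod模板上添加prometheus.io/*注解,由后文9.2.1中的kubernetes-pods抓取任务自动发现。下面是一个最小示例(其中sample-app、8080端口、/metrics路径和镜像均为假设,请按实际应用替换):

# 为示例应用添加Prometheus抓取注解
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sample-app                    # 示例名称,按需替换
  annotations:
    prometheus.io/scrape: "true"      # 允许被抓取
    prometheus.io/port: "8080"        # 指标端口(假设)
    prometheus.io/path: "/metrics"    # 指标路径(假设)
spec:
  containers:
  - name: app
    image: nginx:1.23                 # 仅作占位,应替换为实际暴露指标的应用镜像
    ports:
    - containerPort: 8080
EOF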

9.2 Prometheus监控系统

9.2.1 Prometheus部署

# Prometheus ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - "/etc/prometheus/rules/*.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
      
      - job_name: 'kubernetes-cadvisor'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
      
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.37.0
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus/'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--storage.tsdb.retention.time=200h'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
        - name: prometheus-rules-volume
          mountPath: /etc/prometheus/rules/
      volumes:
      - name: prometheus-config-volume
        configMap:
          defaultMode: 420
          name: prometheus-config
      - name: prometheus-storage-volume
        emptyDir: {}
      - name: prometheus-rules-volume
        configMap:
          name: prometheus-rules
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090
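
Prometheus部署完成后,可以按下面的思路快速验证抓取和热加载是否正常(命令为示例,假定当前环境具有kubectl访问权限):

# 验证Prometheus部署
kubectl rollout status deployment/prometheus -n monitoring

# 转发9090端口后检查各抓取目标的健康状态
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"' | sort | uniq -c

# 修改ConfigMap后,借助--web.enable-lifecycle参数热加载配置
curl -X POST http://localhost:9090/-/reload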

9.2.2 Prometheus告警规则

# Prometheus告警规则
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  kubernetes.yml: |
    groups:
    - name: kubernetes
      rules:
      - alert: KubernetesNodeReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Node not ready (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes memory pressure (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes disk pressure (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has DiskPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesOutOfDisk
        expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes out of disk (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has OutOfDisk condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesOutOfCapacity
        expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes out of capacity (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} is out of capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesContainerOomKiller
        expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes container oom killer (instance {{ $labels.instance }})
          description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesPodCrashLooping
        expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesReplicaSetMismatch
        expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
          description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesDeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
          description: "Deployment Replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  
  resource.yml: |
    groups:
    - name: resource
      rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage detected
          description: "CPU usage is above 80% for more than 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage detected
          description: "Memory usage is above 85% for more than 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High disk usage detected
          description: "Disk usage is above 85% for more than 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: PodHighCPU
        expr: sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Pod high CPU usage
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: PodHighMemory
        expr: sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod) / sum(container_spec_memory_limit_bytes{container!="POD",container!=""}) by (namespace, pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Pod high memory usage
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

9.2.3 Node Exporter部署

# Node Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1
        args:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
        ports:
        - containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
      tolerations:
      - operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9100"
spec:
  selector:
    app: node-exporter
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
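
Node Exporter以hostNetwork方式监听每个节点的9100端口,可以按如下方式抽查(<节点IP>需替换为实际地址):

# 验证Node Exporter
kubectl get ds node-exporter -n monitoring
curl -s http://<节点IP>:9100/metrics | grep node_cpu_seconds_total | head -5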

9.3 Grafana可视化

9.3.1 Grafana部署

# Grafana ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [analytics]
    check_for_updates = true
    [grafana_net]
    url = https://grafana.net
    [log]
    mode = console
    [paths]
    data = /var/lib/grafana/
    logs = /var/log/grafana
    plugins = /var/lib/grafana/plugins
    provisioning = /etc/grafana/provisioning
    [server]
    root_url = http://localhost:3000/
    [security]
    admin_user = admin
    admin_password = admin123
    [users]
    allow_sign_up = false
    auto_assign_org = true
    auto_assign_org_role = Viewer
    default_theme = dark
  
  datasources.yml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus:9090
      isDefault: true
      editable: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:9.1.0
        ports:
        - containerPort: 3000
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin123"
        volumeMounts:
        - name: grafana-config
          mountPath: /etc/grafana/grafana.ini
          subPath: grafana.ini
        - name: grafana-datasources
          mountPath: /etc/grafana/provisioning/datasources/datasources.yml
          subPath: datasources.yml
        - name: grafana-storage
          mountPath: /var/lib/grafana
      volumes:
      - name: grafana-config
        configMap:
          name: grafana-config
      - name: grafana-datasources
        configMap:
          name: grafana-config
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30030
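
Grafana启动后,可以通过健康检查和数据源接口确认provisioning的Prometheus数据源已加载(admin/admin123为上文配置中的示例口令):

# 验证Grafana与数据源
curl -s http://localhost:30030/api/health
curl -s -u admin:admin123 http://localhost:30030/api/datasources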

9.3.2 Grafana Dashboard配置

{
  "dashboard": {
    "id": null,
    "title": "Kubernetes Cluster Monitoring",
    "tags": ["kubernetes"],
    "style": "dark",
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Cluster Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "Pod Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(kube_pod_info)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "color": {"mode": "palette-classic"}
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
      },
      {
        "id": 4,
        "title": "Node Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(kube_node_info)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "color": {"mode": "palette-classic"}
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
      },
      {
        "id": 5,
        "title": "CPU Usage by Node",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 6,
        "title": "Memory Usage by Node",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
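
上面的Dashboard JSON已经带有外层dashboard字段,保存为文件后可以直接通过Grafana HTTP API导入(文件名cluster-dashboard.json为假设,也可以改用Dashboard provisioning目录的方式):

# 通过API导入Dashboard
curl -s -u admin:admin123 -H "Content-Type: application/json" \
  -X POST http://localhost:30030/api/dashboards/db \
  -d @cluster-dashboard.json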

9.4 AlertManager告警管理

9.4.1 AlertManager部署

# AlertManager ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@example.com'
      smtp_auth_username: 'alerts@example.com'
      smtp_auth_password: 'password'
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
      - match:
          severity: critical
        receiver: 'critical-alerts'
      - match:
          severity: warning
        receiver: 'warning-alerts'
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://webhook-service:5000/alerts'
        send_resolved: true
    
    - name: 'critical-alerts'
      email_configs:
      - to: 'admin@example.com'
        subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    
    - name: 'warning-alerts'
      email_configs:
      - to: 'team@example.com'
        subject: 'Warning Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}
    
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'dev', 'instance']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        args:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
          - '--web.external-url=http://localhost:9093'
        ports:
        - containerPort: 9093
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
        - name: alertmanager-storage-volume
          mountPath: /alertmanager
      volumes:
      - name: alertmanager-config-volume
        configMap:
          defaultMode: 420
          name: alertmanager-config
      - name: alertmanager-storage-volume
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
    - port: 9093
      targetPort: 9093
      nodePort: 30093
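
AlertManager的配置可以先用amtool校验,部署后再向其v2 API发送一条测试告警,验证路由和接收器是否按预期工作(TestAlert为假设的测试告警名):

# 校验配置并发送测试告警
amtool check-config alertmanager.yml

curl -X POST http://localhost:30093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"alertmanager test"}}]'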

9.4.2 自定义Webhook接收器

# webhook-receiver.py
from flask import Flask, request, jsonify
import json
import logging
from datetime import datetime

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

@app.route('/alerts', methods=['POST'])
def receive_alerts():
    try:
        data = request.get_json()
        
        for alert in data.get('alerts', []):
            alert_name = alert.get('labels', {}).get('alertname', 'Unknown')
            status = alert.get('status', 'Unknown')
            summary = alert.get('annotations', {}).get('summary', 'No summary')
            description = alert.get('annotations', {}).get('description', 'No description')
            
            log_message = f"Alert: {alert_name}, Status: {status}, Summary: {summary}"
            
            if status == 'firing':
                logging.warning(log_message)
                # 这里可以添加自定义的告警处理逻辑
                # 例如:发送到企业微信、钉钉等
                send_to_custom_system(alert)
            else:
                logging.info(f"Resolved: {log_message}")
        
        return jsonify({'status': 'success'}), 200
    
    except Exception as e:
        logging.error(f"Error processing alerts: {str(e)}")
        return jsonify({'status': 'error', 'message': str(e)}), 500

def send_to_custom_system(alert):
    """发送告警到自定义系统"""
    # 实现自定义告警发送逻辑
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

# Webhook接收器部署
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-receiver
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webhook-receiver
  template:
    metadata:
      labels:
        app: webhook-receiver
    spec:
      containers:
      - name: webhook-receiver
        image: python:3.9-slim
        command: ["python", "/app/webhook-receiver.py"]
        ports:
        - containerPort: 5000
        volumeMounts:
        - name: webhook-code
          mountPath: /app
        env:
        - name: FLASK_ENV
          value: "production"
      volumes:
      - name: webhook-code
        configMap:
          name: webhook-receiver-code
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-service
  namespace: monitoring
spec:
  selector:
    app: webhook-receiver
  ports:
  - port: 5000
    targetPort: 5000
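
上面的Deployment从名为webhook-receiver-code的ConfigMap挂载代码,需要提前用前面的Python脚本创建该ConfigMap(假设脚本已保存为webhook-receiver.py):

# 创建Webhook代码ConfigMap
kubectl create configmap webhook-receiver-code -n monitoring \
  --from-file=webhook-receiver.py \
  --dry-run=client -o yaml | kubectl apply -f -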

9.5 日志管理系统

9.5.1 Elasticsearch部署

# Elasticsearch ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-config
  namespace: logging
data:
  elasticsearch.yml: |
    cluster.name: kubernetes-logs
    node.name: ${HOSTNAME}
    network.host: 0.0.0.0
    discovery.type: single-node
    xpack.security.enabled: false
    xpack.monitoring.collection.enabled: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
        env:
        - name: discovery.type
          value: single-node
        - name: ES_JAVA_OPTS
          value: "-Xms512m -Xmx512m"
        ports:
        - containerPort: 9200
        - containerPort: 9300
        volumeMounts:
        - name: elasticsearch-data
          mountPath: /usr/share/elasticsearch/data
        - name: elasticsearch-config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          subPath: elasticsearch.yml
      volumes:
      - name: elasticsearch-config
        configMap:
          name: elasticsearch-config
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  ports:
  - name: http
    port: 9200
    targetPort: 9200
  - name: transport
    port: 9300
    targetPort: 9300
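
Elasticsearch就绪后,可以通过集群健康接口确认状态(单节点部署下green或yellow均属正常):

# 检查Elasticsearch集群健康状态
kubectl port-forward -n logging svc/elasticsearch 9200:9200 &
curl -s http://localhost:9200/_cluster/health?pretty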

9.5.2 Logstash部署

# Logstash ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: logging
data:
  logstash.yml: |
    http.host: "0.0.0.0"
    path.config: /usr/share/logstash/pipeline
    xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]
  
  logstash.conf: |
    input {
      beats {
        port => 5044
      }
      http {
        port => 8080
        codec => json
      }
    }
    
    filter {
      if [kubernetes] {
        mutate {
          add_field => { "cluster_name" => "kubernetes" }
        }
        
        # 解析容器日志
        if [kubernetes][container][name] {
          mutate {
            add_field => { "container_name" => "%{[kubernetes][container][name]}" }
          }
        }
        
        # 解析Pod信息
        if [kubernetes][pod][name] {
          mutate {
            add_field => { "pod_name" => "%{[kubernetes][pod][name]}" }
          }
        }
        
        # 解析命名空间
        if [kubernetes][namespace] {
          mutate {
            add_field => { "namespace" => "%{[kubernetes][namespace]}" }
          }
        }
        
        # 尝试解析JSON格式的日志
        if [message] =~ /^\{.*\}$/ {
          json {
            source => "message"
            target => "parsed_json"
          }
        }
        
        # 添加时间戳
        date {
          match => [ "@timestamp", "ISO8601" ]
        }
      }
      
      # 过滤敏感信息
      mutate {
        gsub => [
          "message", "password=[^\s]+", "password=***",
          "message", "token=[^\s]+", "token=***",
          "message", "secret=[^\s]+", "secret=***"
        ]
      }
    }
    
    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
        index => "kubernetes-logs-%{+YYYY.MM.dd}"
        template_name => "kubernetes"
        template_pattern => "kubernetes-*"
        template => {
          "index_patterns" => ["kubernetes-*"],
          "settings" => {
            "number_of_shards" => 1,
            "number_of_replicas" => 0
          },
          "mappings" => {
            "properties" => {
              "@timestamp" => { "type" => "date" },
              "message" => { "type" => "text" },
              "level" => { "type" => "keyword" },
              "namespace" => { "type" => "keyword" },
              "pod_name" => { "type" => "keyword" },
              "container_name" => { "type" => "keyword" }
            }
          }
        }
      }
      
      # 调试输出
      stdout {
        codec => rubydebug
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      containers:
      - name: logstash
        image: docker.elastic.co/logstash/logstash:7.17.0
        env:
        - name: LS_JAVA_OPTS
          value: "-Xms256m -Xmx256m"
        ports:
        - containerPort: 5044
        - containerPort: 8080
        - containerPort: 9600
        volumeMounts:
        - name: logstash-config
          mountPath: /usr/share/logstash/config/logstash.yml
          subPath: logstash.yml
        - name: logstash-pipeline
          mountPath: /usr/share/logstash/pipeline/logstash.conf
          subPath: logstash.conf
      volumes:
      - name: logstash-config
        configMap:
          name: logstash-config
      - name: logstash-pipeline
        configMap:
          name: logstash-config
---
apiVersion: v1
kind: Service
metadata:
  name: logstash
  namespace: logging
spec:
  selector:
    app: logstash
  ports:
  - name: beats
    port: 5044
    targetPort: 5044
  - name: http
    port: 8080
    targetPort: 8080
  - name: monitoring
    port: 9600
    targetPort: 9600
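
除了接收Filebeat数据,上面的管道还开放了8080端口的http输入,可以用它快速验证Logstash到Elasticsearch的链路(测试消息内容为示例):

# 向Logstash的http输入发送测试日志
kubectl port-forward -n logging svc/logstash 8080:8080 &
curl -s -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"message":"pipeline test","level":"INFO"}'

# 确认索引已创建(需先转发Elasticsearch的9200端口)
curl -s http://localhost:9200/_cat/indices/kubernetes-logs-*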

9.5.3 Filebeat日志收集

# Filebeat ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: logging
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
        - drop_event:
            when:
              or:
                - contains:
                    kubernetes.container.name: "filebeat"
                - contains:
                    kubernetes.container.name: "logstash"
                - contains:
                    kubernetes.container.name: "elasticsearch"
    
    output.logstash:
      hosts: ["logstash:5044"]
    
    logging.level: info
    logging.to_files: true
    logging.files:
      path: /var/log/filebeat
      name: filebeat
      keepfiles: 7
      permissions: 0644
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: logging
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:7.17.0
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          runAsUser: 0
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0640
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      - name: data
        hostPath:
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate
      tolerations:
      - operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
rules:
- apiGroups: [""]
  resources:
  - nodes
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: logging
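
Filebeat以DaemonSet方式运行在每个节点,可按如下方式确认采集是否正常(日志输出的具体内容以实际版本为准):

# 验证Filebeat采集
kubectl rollout status ds/filebeat -n logging
kubectl logs -n logging ds/filebeat --tail=20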

9.5.4 Kibana可视化

# Kibana部署
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:7.17.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
        - name: SERVER_NAME
          value: "kibana"
        - name: SERVER_HOST
          value: "0.0.0.0"
        ports:
        - containerPort: 5601
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  type: NodePort
  ports:
  - port: 5601
    targetPort: 5601
    nodePort: 30601
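
Kibana启动后需要先创建索引模式才能检索日志,既可以在界面上操作,也可以调用saved_objects API(索引模式名称对应Logstash输出中的kubernetes-logs-*):

# 通过API创建索引模式(适用于7.x版本)
curl -s -X POST http://localhost:30601/api/saved_objects/index-pattern \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{"attributes":{"title":"kubernetes-logs-*","timeFieldName":"@timestamp"}}'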

9.6 监控和日志管理脚本

9.6.1 监控部署脚本

#!/bin/bash
# deploy-monitoring.sh

echo "=== 部署Kubernetes监控系统 ==="

# 创建命名空间
echo "1. 创建监控命名空间"
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace logging --dry-run=client -o yaml | kubectl apply -f -

# 创建RBAC
echo "2. 创建RBAC权限"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
EOF

# 部署kube-state-metrics
echo "3. 部署kube-state-metrics"
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/cluster-role-binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/cluster-role.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/service-account.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/service.yaml

# 部署前文定义的监控与日志组件(清单文件名为示例,需替换为实际保存的文件)
echo "4. 部署Prometheus、Grafana、AlertManager及日志组件"
kubectl apply -f prometheus-config.yaml -f prometheus-rules.yaml -f prometheus-deployment.yaml
kubectl apply -f node-exporter.yaml -f grafana.yaml -f alertmanager.yaml
kubectl apply -f elasticsearch.yaml -f logstash.yaml -f filebeat.yaml -f kibana.yaml

# 等待部署完成
echo "5. 等待组件启动"
kubectl wait --for=condition=available --timeout=300s deployment/kube-state-metrics -n kube-system

echo "6. 检查部署状态"
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get svc -n monitoring
kubectl get svc -n logging

echo "\n=== 监控系统部署完成 ==="
echo "Prometheus: http://localhost:30090"
echo "Grafana: http://localhost:30030 (admin/admin123)"
echo "AlertManager: http://localhost:30093"
echo "Kibana: http://localhost:30601"

9.6.2 监控检查脚本

#!/bin/bash
# check-monitoring.sh

echo "=== 监控系统健康检查 ==="

# 检查Prometheus
echo "\n1. 检查Prometheus状态:"
kubectl get pods -n monitoring -l app=prometheus
PROM_STATUS=$(kubectl get pods -n monitoring -l app=prometheus -o jsonpath='{.items[0].status.phase}')
if [ "$PROM_STATUS" = "Running" ]; then
    echo "✓ Prometheus运行正常"
else
    echo "✗ Prometheus状态异常: $PROM_STATUS"
fi

# 检查Grafana
echo "\n2. 检查Grafana状态:"
kubectl get pods -n monitoring -l app=grafana
GRAFANA_STATUS=$(kubectl get pods -n monitoring -l app=grafana -o jsonpath='{.items[0].status.phase}')
if [ "$GRAFANA_STATUS" = "Running" ]; then
    echo "✓ Grafana运行正常"
else
    echo "✗ Grafana状态异常: $GRAFANA_STATUS"
fi

# 检查AlertManager
echo "\n3. 检查AlertManager状态:"
kubectl get pods -n monitoring -l app=alertmanager
ALERT_STATUS=$(kubectl get pods -n monitoring -l app=alertmanager -o jsonpath='{.items[0].status.phase}')
if [ "$ALERT_STATUS" = "Running" ]; then
    echo "✓ AlertManager运行正常"
else
    echo "✗ AlertManager状态异常: $ALERT_STATUS"
fi

# 检查Node Exporter
echo "\n4. 检查Node Exporter状态:"
kubectl get ds -n monitoring node-exporter
NODE_READY=$(kubectl get ds -n monitoring node-exporter -o jsonpath='{.status.numberReady}')
NODE_DESIRED=$(kubectl get ds -n monitoring node-exporter -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$NODE_READY" = "$NODE_DESIRED" ]; then
    echo "✓ Node Exporter运行正常 ($NODE_READY/$NODE_DESIRED)"
else
    echo "✗ Node Exporter状态异常 ($NODE_READY/$NODE_DESIRED)"
fi

# 检查Elasticsearch
echo "\n5. 检查Elasticsearch状态:"
kubectl get pods -n logging -l app=elasticsearch
ES_STATUS=$(kubectl get pods -n logging -l app=elasticsearch -o jsonpath='{.items[0].status.phase}')
if [ "$ES_STATUS" = "Running" ]; then
    echo "✓ Elasticsearch运行正常"
else
    echo "✗ Elasticsearch状态异常: $ES_STATUS"
fi

# 检查Logstash
echo "\n6. 检查Logstash状态:"
kubectl get pods -n logging -l app=logstash
LOGSTASH_STATUS=$(kubectl get pods -n logging -l app=logstash -o jsonpath='{.items[0].status.phase}')
if [ "$LOGSTASH_STATUS" = "Running" ]; then
    echo "✓ Logstash运行正常"
else
    echo "✗ Logstash状态异常: $LOGSTASH_STATUS"
fi

# 检查Filebeat
echo "\n7. 检查Filebeat状态:"
kubectl get ds -n logging filebeat
FILEBEAT_READY=$(kubectl get ds -n logging filebeat -o jsonpath='{.status.numberReady}')
FILEBEAT_DESIRED=$(kubectl get ds -n logging filebeat -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$FILEBEAT_READY" = "$FILEBEAT_DESIRED" ]; then
    echo "✓ Filebeat运行正常 ($FILEBEAT_READY/$FILEBEAT_DESIRED)"
else
    echo "✗ Filebeat状态异常 ($FILEBEAT_READY/$FILEBEAT_DESIRED)"
fi

# 检查Kibana
echo "\n8. 检查Kibana状态:"
kubectl get pods -n logging -l app=kibana
KIBANA_STATUS=$(kubectl get pods -n logging -l app=kibana -o jsonpath='{.items[0].status.phase}')
if [ "$KIBANA_STATUS" = "Running" ]; then
    echo "✓ Kibana运行正常"
else
    echo "✗ Kibana状态异常: $KIBANA_STATUS"
fi

# 检查服务端点
echo "\n9. 检查服务端点:"
echo "Prometheus: $(kubectl get svc -n monitoring prometheus -o jsonpath='{.spec.type}')端口$(kubectl get svc -n monitoring prometheus -o jsonpath='{.spec.ports[0].nodePort}')"
echo "Grafana: $(kubectl get svc -n monitoring grafana -o jsonpath='{.spec.type}')端口$(kubectl get svc -n monitoring grafana -o jsonpath='{.spec.ports[0].nodePort}')"
echo "AlertManager: $(kubectl get svc -n monitoring alertmanager -o jsonpath='{.spec.type}')端口$(kubectl get svc -n monitoring alertmanager -o jsonpath='{.spec.ports[0].nodePort}')"
echo "Kibana: $(kubectl get svc -n logging kibana -o jsonpath='{.spec.type}')端口$(kubectl get svc -n logging kibana -o jsonpath='{.spec.ports[0].nodePort}')"

echo "\n=== 监控系统检查完成 ==="

9.6.3 日志查询脚本

#!/bin/bash
# query-logs.sh

NAMESPACE=${1:-default}
POD_NAME=${2:-""}
CONTAINER=${3:-""}
LINES=${4:-100}

echo "=== Kubernetes日志查询工具 ==="
echo "命名空间: $NAMESPACE"
echo "Pod名称: $POD_NAME"
echo "容器名称: $CONTAINER"
echo "行数: $LINES"
echo ""

if [ -z "$POD_NAME" ]; then
    echo "可用的Pod列表:"
    kubectl get pods -n $NAMESPACE
    echo ""
    echo "用法: $0 <namespace> <pod-name> [container-name] [lines]"
    exit 1
fi

# 检查Pod是否存在
if ! kubectl get pod $POD_NAME -n $NAMESPACE &>/dev/null; then
    echo "错误: Pod $POD_NAME 在命名空间 $NAMESPACE 中不存在"
    exit 1
fi

# 获取Pod中的容器列表
if [ -z "$CONTAINER" ]; then
    CONTAINERS=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[*].name}')
    echo "Pod中的容器列表: $CONTAINERS"
    
    # 如果只有一个容器,自动选择
    CONTAINER_COUNT=$(echo $CONTAINERS | wc -w)
    if [ $CONTAINER_COUNT -eq 1 ]; then
        CONTAINER=$CONTAINERS
        echo "自动选择容器: $CONTAINER"
    else
        echo "请指定容器名称"
        exit 1
    fi
fi

echo "\n=== 实时日志 (按Ctrl+C退出) ==="
kubectl logs -f $POD_NAME -c $CONTAINER -n $NAMESPACE --tail=$LINES

9.6.4 性能监控脚本

#!/bin/bash
# performance-monitor.sh

echo "=== Kubernetes性能监控报告 ==="
echo "生成时间: $(date)"
echo ""

# 集群资源使用情况
echo "1. 集群资源使用情况:"
echo "节点数量: $(kubectl get nodes --no-headers | wc -l)"
echo "Pod总数: $(kubectl get pods --all-namespaces --no-headers | wc -l)"
echo "Service总数: $(kubectl get svc --all-namespaces --no-headers | wc -l)"
echo ""

# 节点资源使用
echo "2. 节点资源使用:"
kubectl top nodes 2>/dev/null || echo "需要安装metrics-server"
echo ""

# Pod资源使用Top 10
echo "3. Pod资源使用Top 10:"
echo "CPU使用率最高的Pod:"
kubectl top pods --all-namespaces --sort-by=cpu 2>/dev/null | head -11 || echo "需要安装metrics-server"
echo ""
echo "内存使用率最高的Pod:"
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -11 || echo "需要安装metrics-server"
echo ""

# 检查问题Pod
echo "4. 问题Pod检查:"
echo "重启次数较多的Pod:"
kubectl get pods --all-namespaces --field-selector=status.phase=Running -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount | awk 'NR>1 && $3>5'
echo ""
echo "非Running状态的Pod:"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
echo ""

# 存储使用情况
echo "5. 存储使用情况:"
kubectl get pv
echo ""
kubectl get pvc --all-namespaces
echo ""

# 网络策略
echo "6. 网络策略:"
kubectl get networkpolicies --all-namespaces
echo ""

# 事件检查
echo "7. 最近的Warning事件:"
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== 性能监控报告完成 ==="

9.6.5 告警测试脚本

#!/bin/bash
# test-alerts.sh

echo "=== 告警系统测试 ==="

# 创建高CPU使用的测试Pod
echo "1. 创建高CPU使用测试Pod"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress-test
  namespace: default
  labels:
    app: stress-test
spec:
  containers:
  - name: cpu-stress
    image: progrium/stress
    command: ["stress"]
    args: ["--cpu", "2", "--timeout", "300s"]
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: "2"        # 放开CPU上限,让stress真正产生高负载以便触发CPU告警
        memory: 256Mi
EOF

echo "等待Pod启动..."
kubectl wait --for=condition=Ready pod/cpu-stress-test --timeout=60s

echo "\n2. 创建高内存使用测试Pod"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-stress-test
  namespace: default
  labels:
    app: stress-test
spec:
  containers:
  - name: memory-stress
    image: progrium/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "200M", "--timeout", "300s"]
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

echo "等待Pod启动..."
kubectl wait --for=condition=Ready pod/memory-stress-test --timeout=60s

echo "\n3. 监控告警状态"
echo "检查Prometheus告警状态..."
echo "请访问 http://localhost:30090/alerts 查看告警"
echo "请访问 http://localhost:30093 查看AlertManager"

echo "\n4. 等待5分钟后清理测试资源"
sleep 300

echo "\n5. 清理测试资源"
kubectl delete pod cpu-stress-test memory-stress-test

echo "\n=== 告警测试完成 ==="

9.7 监控最佳实践

9.7.1 监控策略

# 监控最佳实践配置
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-best-practices
data:
  monitoring-strategy: |
    1. 分层监控:
       - 基础设施层: 节点、网络、存储
       - 平台层: Kubernetes组件
       - 应用层: 业务指标
    
    2. 关键指标:
       - 黄金信号: 延迟、流量、错误、饱和度
       - RED方法: 请求率、错误率、持续时间
       - USE方法: 使用率、饱和度、错误
    
    3. 告警策略:
       - 基于SLI/SLO设置告警
       - 避免告警疲劳
       - 分级告警处理
    
    4. 数据保留:
       - 高精度数据: 7-30天
       - 中精度数据: 3-6个月
       - 低精度数据: 1-2年
  
  sli-slo-examples: |
    # 服务水平指标和目标示例
    
    API可用性:
    - SLI: 成功请求数 / 总请求数
    - SLO: 99.9% (月度)
    
    API延迟:
    - SLI: 95%请求响应时间
    - SLO: < 200ms
    
    错误率:
    - SLI: 错误请求数 / 总请求数
    - SLO: < 0.1%
    
    数据持久性:
    - SLI: 成功备份数 / 计划备份数
    - SLO: 99.99%
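
以上面的"API可用性"SLI为例,在Prometheus中可以用类似下面的查询计算(假设应用暴露了带code标签的http_requests_total计数器,指标名和标签需按实际应用调整):

# 最近30天成功请求占比(示例PromQL,经由Prometheus HTTP API查询)
curl -s -G http://localhost:30090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{code!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))'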

9.7.2 日志管理最佳实践

apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-best-practices
data:
  log-levels: |
    # 日志级别使用指南
    
    ERROR: 系统错误,需要立即关注
    - 应用崩溃
    - 数据库连接失败
    - 外部服务不可用
    
    WARN: 潜在问题,需要监控
    - 重试操作
    - 性能降级
    - 配置问题
    
    INFO: 重要业务事件
    - 用户登录/登出
    - 重要操作完成
    - 系统启动/关闭
    
    DEBUG: 调试信息
    - 详细执行流程
    - 变量值
    - 函数调用
  
  log-format: |
    # 结构化日志格式
    {
      "timestamp": "2023-01-01T12:00:00Z",
      "level": "INFO",
      "service": "user-service",
      "version": "v1.2.3",
      "trace_id": "abc123",
      "span_id": "def456",
      "user_id": "user123",
      "action": "login",
      "message": "User logged in successfully",
      "duration_ms": 150,
      "status_code": 200
    }
  
  log-retention: |
    # 日志保留策略
    
    应用日志:
    - 热数据: 7天 (快速查询)
    - 温数据: 30天 (常规查询)
    - 冷数据: 90天 (归档存储)
    
    审计日志:
    - 热数据: 30天
    - 温数据: 1年
    - 冷数据: 7年 (合规要求)
    
    系统日志:
    - 热数据: 3天
    - 温数据: 14天
    - 冷数据: 30天

9.8 故障排查和性能优化

9.8.1 常见监控问题

#!/bin/bash
# troubleshoot-monitoring.sh

echo "=== 监控系统故障排查 ==="

# 检查Prometheus数据收集
echo "1. 检查Prometheus目标状态:"
echo "访问 http://localhost:30090/targets 检查目标状态"
echo ""

# 检查指标数据
echo "2. 检查关键指标:"
echo "up{job=\"kubernetes-nodes\"} - 节点状态"
echo "up{job=\"kubernetes-apiservers\"} - API Server状态"
echo "up{job=\"kubernetes-cadvisor\"} - cAdvisor状态"
echo ""

# 检查存储空间
echo "3. 检查存储使用:"
kubectl exec -n monitoring deployment/prometheus -- df -h /prometheus
echo ""

# 检查日志
echo "4. 检查Prometheus日志:"
kubectl logs -n monitoring deployment/prometheus --tail=20
echo ""

# 检查配置
echo "5. 检查配置重载:"
echo "POST http://localhost:30090/-/reload 重载配置"
echo ""

# 性能优化建议
echo "6. 性能优化建议:"
echo "- 调整scrape_interval减少数据收集频率"
echo "- 使用recording rules预计算复杂查询"
echo "- 配置适当的retention时间"
echo "- 使用remote storage扩展存储"
echo ""

echo "=== 故障排查完成 ==="

9.8.2 性能优化配置

# Prometheus性能优化配置
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-performance-config
data:
  recording-rules.yml: |
    groups:
    - name: performance.rules
      interval: 30s
      rules:
      # 节点CPU使用率
      - record: node:cpu_utilization:rate5m
        expr: |
          100 - (
            avg by (instance) (
              irate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          )
      
      # 节点内存使用率
      - record: node:memory_utilization:ratio
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes
      
      # Pod CPU使用率
      - record: pod:cpu_usage:rate5m
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
          )
      
      # Pod内存使用率
      - record: pod:memory_usage:ratio
        expr: |
          sum by (namespace, pod) (
            container_memory_working_set_bytes{container!="POD",container!=""}
          ) / sum by (namespace, pod) (
            container_spec_memory_limit_bytes{container!="POD",container!=""}
          )
      
      # 集群资源使用汇总
      - record: cluster:cpu_usage:rate5m
        expr: |
          sum(node:cpu_utilization:rate5m) / count(node:cpu_utilization:rate5m)
      
      - record: cluster:memory_usage:ratio
        expr: |
          sum(node:memory_utilization:ratio) / count(node:memory_utilization:ratio)
  
  optimization-tips: |
    # Prometheus性能优化技巧
    
    1. 存储优化:
       - 使用SSD存储提高I/O性能
       - 配置适当的retention时间
       - 启用压缩减少存储空间
    
    2. 查询优化:
       - 使用recording rules预计算
       - 避免高基数标签
       - 限制查询时间范围
    
    3. 网络优化:
       - 减少scrape间隔
       - 使用服务发现减少配置
       - 启用gzip压缩
    
    4. 资源优化:
       - 合理配置内存限制
       - 使用多个Prometheus实例
       - 配置联邦集群
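
Recording rules生效后,仪表板和告警可以直接引用预计算好的指标名,避免每次执行代价较高的原始表达式(假设recording-rules.yml已加入Prometheus的rule_files并完成重载):

# 查询预计算的节点CPU使用率
curl -s -G http://localhost:30090/api/v1/query \
  --data-urlencode 'query=node:cpu_utilization:rate5m'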

9.9 总结

本章详细介绍了Kubernetes的监控和日志管理系统,主要内容包括:

9.9.1 监控体系

  • 监控架构:了解了Kubernetes监控的分层架构和核心组件
  • Prometheus:学习了Prometheus的部署、配置和告警规则
  • Grafana:掌握了可视化仪表板的创建和配置
  • AlertManager:实现了告警的管理和通知机制

9.9.2 日志管理

  • ELK Stack:部署了完整的日志收集、处理和可视化系统
  • Filebeat:实现了容器日志的自动收集
  • Logstash:配置了日志的解析和转换
  • Kibana:提供了日志的查询和分析界面

9.9.3 最佳实践

  • 监控策略:建立了基于SLI/SLO的监控体系
  • 告警管理:实现了分级告警和通知机制
  • 性能优化:掌握了监控系统的性能调优方法
  • 故障排查:学会了常见问题的诊断和解决

9.9.4 核心要点

  1. 全面监控:覆盖基础设施、平台和应用三个层面
  2. 主动告警:基于业务指标设置合理的告警阈值
  3. 日志聚合:统一收集和管理所有组件的日志
  4. 可视化展示:通过仪表板直观展示系统状态
  5. 持续优化:根据实际使用情况调整监控策略

9.9.5 注意事项

  • 合理配置资源限制,避免监控系统影响业务
  • 定期清理历史数据,控制存储成本
  • 建立监控系统的备份和恢复机制
  • 培训团队成员使用监控工具
  • 建立监控数据的安全访问控制

通过本章的学习,你已经掌握了Kubernetes监控和日志管理的完整解决方案。下一章我们将学习Kubernetes的安全管理,包括RBAC、网络策略、Pod安全策略等内容。