9.1 监控体系概述

9.1.1 Kubernetes监控架构

# 监控架构说明
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-architecture
data:
  layers: |
    1. 基础设施监控:
       - 节点资源使用率
       - 网络和存储性能
       - 硬件健康状态
    
    2. Kubernetes组件监控:
       - API Server性能
       - etcd集群状态
       - kubelet和容器运行时
    
    3. 应用监控:
       - Pod和容器指标
       - 应用自定义指标
       - 业务指标监控
    
    4. 日志监控:
       - 系统日志收集
       - 应用日志聚合
       - 审计日志分析
  components: |
    - Prometheus: 指标收集和存储
    - Grafana: 可视化和告警
    - AlertManager: 告警管理
    - Jaeger: 分布式追踪
    - ELK Stack: 日志管理

9.1.2 监控指标类型

apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-types
data:
  resource-metrics: |
    # 资源指标
    - CPU使用率和请求/限制
    - 内存使用率和请求/限制
    - 磁盘使用率和I/O
    - 网络流量和延迟
  
  kubernetes-metrics: |
    # Kubernetes指标
    - Pod状态和重启次数
    - Service端点健康状态
    - Deployment副本状态
    - 节点就绪状态
  
  application-metrics: |
    # 应用指标
    - HTTP请求率和延迟
    - 数据库连接池状态
    - 队列长度和处理时间
    - 业务KPI指标
  
  custom-metrics: |
    # 自定义指标
    - 应用特定的业务指标
    - 第三方服务集成指标
    - 用户定义的SLI指标
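
要让Prometheus采集上述应用指标和自定义指标,常见做法是在Pod模板上添加prometheus.io/*注解,由后文9.2.1中的kubernetes-pods抓取任务自动发现。下面是一个最小示例(其中sample-app、8080端口、/metrics路径和镜像均为假设,请按实际应用替换):

# 为示例应用添加Prometheus抓取注解
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sample-app                    # 示例名称,按需替换
  annotations:
    prometheus.io/scrape: "true"      # 允许被抓取
    prometheus.io/port: "8080"        # 指标端口(假设)
    prometheus.io/path: "/metrics"    # 指标路径(假设)
spec:
  containers:
  - name: app
    image: nginx:1.23                 # 仅作占位,应替换为实际暴露指标的应用镜像
    ports:
    - containerPort: 8080
EOF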

9.2 Prometheus监控系统

9.2.1 Prometheus部署

# Prometheus ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - "/etc/prometheus/rules/*.yml"
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
      
      - job_name: 'kubernetes-cadvisor'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
      
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.37.0
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus/'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--storage.tsdb.retention.time=200h'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
        - name: prometheus-rules-volume
          mountPath: /etc/prometheus/rules/
      volumes:
      - name: prometheus-config-volume
        configMap:
          defaultMode: 420
          name: prometheus-config
      - name: prometheus-storage-volume
        emptyDir: {}
      - name: prometheus-rules-volume
        configMap:
          name: prometheus-rules
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090
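
Prometheus部署完成后,可以按下面的思路快速验证抓取和热加载是否正常(命令为示例,假定当前环境具有kubectl访问权限):

# 验证Prometheus部署
kubectl rollout status deployment/prometheus -n monitoring

# 转发9090端口后检查各抓取目标的健康状态
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"' | sort | uniq -c

# 修改ConfigMap后,借助--web.enable-lifecycle参数热加载配置
curl -X POST http://localhost:9090/-/reload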

9.2.2 Prometheus告警规则

# Prometheus告警规则
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  kubernetes.yml: |
    groups:
    - name: kubernetes
      rules:
      - alert: KubernetesNodeReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Node not ready (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes memory pressure (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes disk pressure (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has DiskPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesOutOfDisk
        expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes out of disk (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has OutOfDisk condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesOutOfCapacity
        expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes out of capacity (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} is out of capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesContainerOomKiller
        expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes container oom killer (instance {{ $labels.instance }})
          description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesPodCrashLooping
        expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesReplicaSetMismatch
        expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
          description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: KubernetesDeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
          description: "Deployment Replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  
  resource.yml: |
    groups:
    - name: resource
      rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage detected
          description: "CPU usage is above 80% for more than 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage detected
          description: "Memory usage is above 85% for more than 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High disk usage detected
          description: "Disk usage is above 85% for more than 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: PodHighCPU
        expr: sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Pod high CPU usage
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      
      - alert: PodHighMemory
        expr: sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod) / sum(container_spec_memory_limit_bytes{container!="POD",container!=""}) by (namespace, pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Pod high memory usage
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

9.2.3 Node Exporter部署

# Node Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1
        args:
          - '--path.procfs=/host/proc'
          - '--path.rootfs=/rootfs'
          - '--path.sysfs=/host/sys'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
        ports:
        - containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
      tolerations:
      - operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9100"
spec:
  selector:
    app: node-exporter
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
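
Node Exporter以hostNetwork方式监听每个节点的9100端口,可以按如下方式抽查(<节点IP>需替换为实际地址):

# 验证Node Exporter
kubectl get ds node-exporter -n monitoring
curl -s http://<节点IP>:9100/metrics | grep node_cpu_seconds_total | head -5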

9.3 Grafana可视化

9.3.1 Grafana部署

# Grafana ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [analytics]
    check_for_updates = true
    [grafana_net]
    url = https://grafana.net
    [log]
    mode = console
    [paths]
    data = /var/lib/grafana/
    logs = /var/log/grafana
    plugins = /var/lib/grafana/plugins
    provisioning = /etc/grafana/provisioning
    [server]
    root_url = http://localhost:3000/
    [security]
    admin_user = admin
    admin_password = admin123
    [users]
    allow_sign_up = false
    auto_assign_org = true
    auto_assign_org_role = Viewer
    default_theme = dark
  
  datasources.yml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus:9090
      isDefault: true
      editable: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:9.1.0
        ports:
        - containerPort: 3000
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin123"
        volumeMounts:
        - name: grafana-config
          mountPath: /etc/grafana/grafana.ini
          subPath: grafana.ini
        - name: grafana-datasources
          mountPath: /etc/grafana/provisioning/datasources/datasources.yml
          subPath: datasources.yml
        - name: grafana-storage
          mountPath: /var/lib/grafana
      volumes:
      - name: grafana-config
        configMap:
          name: grafana-config
      - name: grafana-datasources
        configMap:
          name: grafana-config
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30030
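
Grafana启动后,可以通过健康检查和数据源接口确认provisioning的Prometheus数据源已加载(admin/admin123为上文配置中的示例口令):

# 验证Grafana与数据源
curl -s http://localhost:30030/api/health
curl -s -u admin:admin123 http://localhost:30030/api/datasources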

9.3.2 Grafana Dashboard配置

{
  "dashboard": {
    "id": null,
    "title": "Kubernetes Cluster Monitoring",
    "tags": ["kubernetes"],
    "style": "dark",
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Cluster Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "Pod Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(kube_pod_info)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "color": {"mode": "palette-classic"}
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
      },
      {
        "id": 4,
        "title": "Node Count",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(kube_node_info)",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "color": {"mode": "palette-classic"}
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
      },
      {
        "id": 5,
        "title": "CPU Usage by Node",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 6,
        "title": "Memory Usage by Node",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
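
上面的Dashboard JSON已经带有外层dashboard字段,保存为文件后可以直接通过Grafana HTTP API导入(文件名cluster-dashboard.json为假设,也可以改用Dashboard provisioning目录的方式):

# 通过API导入Dashboard
curl -s -u admin:admin123 -H "Content-Type: application/json" \
  -X POST http://localhost:30030/api/dashboards/db \
  -d @cluster-dashboard.json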

9.4 AlertManager告警管理

9.4.1 AlertManager部署

# AlertManager ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@example.com'
      smtp_auth_username: 'alerts@example.com'
      smtp_auth_password: 'password'
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
      - match:
          severity: critical
        receiver: 'critical-alerts'
      - match:
          severity: warning
        receiver: 'warning-alerts'
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://webhook-service:5000/alerts'
        send_resolved: true
    
    - name: 'critical-alerts'
      email_configs:
      - to: 'admin@example.com'
        subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    
    - name: 'warning-alerts'
      email_configs:
      - to: 'team@example.com'
        subject: 'Warning Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ .Labels }}
          {{ end }}
    
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'dev', 'instance']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        args:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
          - '--web.external-url=http://localhost:9093'
        ports:
        - containerPort: 9093
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
        - name: alertmanager-storage-volume
          mountPath: /alertmanager
      volumes:
      - name: alertmanager-config-volume
        configMap:
          defaultMode: 420
          name: alertmanager-config
      - name: alertmanager-storage-volume
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
    - port: 9093
      targetPort: 9093
      nodePort: 30093
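
AlertManager的配置可以先用amtool校验,部署后再向其v2 API发送一条测试告警,验证路由和接收器是否按预期工作(TestAlert为假设的测试告警名):

# 校验配置并发送测试告警
amtool check-config alertmanager.yml

curl -X POST http://localhost:30093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"alertmanager test"}}]'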

9.4.2 自定义Webhook接收器

# webhook-receiver.py
from flask import Flask, request, jsonify
import json
import logging
from datetime import datetime

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

@app.route('/alerts', methods=['POST'])
def receive_alerts():
    try:
        data = request.get_json()
        
        for alert in data.get('alerts', []):
            alert_name = alert.get('labels', {}).get('alertname', 'Unknown')
            status = alert.get('status', 'Unknown')
            summary = alert.get('annotations', {}).get('summary', 'No summary')
            description = alert.get('annotations', {}).get('description', 'No description')
            
            log_message = f"Alert: {alert_name}, Status: {status}, Summary: {summary}"
            
            if status == 'firing':
                logging.warning(log_message)
                # 这里可以添加自定义的告警处理逻辑
                # 例如:发送到企业微信、钉钉等
                send_to_custom_system(alert)
            else:
                logging.info(f"Resolved: {log_message}")
        
        return jsonify({'status': 'success'}), 200
    
    except Exception as e:
        logging.error(f"Error processing alerts: {str(e)}")
        return jsonify({'status': 'error', 'message': str(e)}), 500

def send_to_custom_system(alert):
    """发送告警到自定义系统"""
    # 实现自定义告警发送逻辑
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

# Webhook接收器部署
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-receiver
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webhook-receiver
  template:
    metadata:
      labels:
        app: webhook-receiver
    spec:
      containers:
      - name: webhook-receiver
        image: python:3.9-slim
        command: ["python", "/app/webhook-receiver.py"]
        ports:
        - containerPort: 5000
        volumeMounts:
        - name: webhook-code
          mountPath: /app
        env:
        - name: FLASK_ENV
          value: "production"
      volumes:
      - name: webhook-code
        configMap:
          name: webhook-receiver-code
---
apiVersion: v1
kind: Service
metadata:
  name: webhook-service
  namespace: monitoring
spec:
  selector:
    app: webhook-receiver
  ports:
  - port: 5000
    targetPort: 5000
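
上面的Deployment从名为webhook-receiver-code的ConfigMap挂载代码,需要提前用前面的Python脚本创建该ConfigMap(假设脚本已保存为webhook-receiver.py):

# 创建Webhook代码ConfigMap
kubectl create configmap webhook-receiver-code -n monitoring \
  --from-file=webhook-receiver.py \
  --dry-run=client -o yaml | kubectl apply -f -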

9.5 日志管理系统

9.5.1 Elasticsearch部署

# Elasticsearch ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-config
  namespace: logging
data:
  elasticsearch.yml: |
    cluster.name: kubernetes-logs
    node.name: ${HOSTNAME}
    network.host: 0.0.0.0
    discovery.type: single-node
    xpack.security.enabled: false
    xpack.monitoring.collection.enabled: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
        env:
        - name: discovery.type
          value: single-node
        - name: ES_JAVA_OPTS
          value: "-Xms512m -Xmx512m"
        ports:
        - containerPort: 9200
        - containerPort: 9300
        volumeMounts:
        - name: elasticsearch-data
          mountPath: /usr/share/elasticsearch/data
        - name: elasticsearch-config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          subPath: elasticsearch.yml
      volumes:
      - name: elasticsearch-config
        configMap:
          name: elasticsearch-config
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  ports:
  - name: http
    port: 9200
    targetPort: 9200
  - name: transport
    port: 9300
    targetPort: 9300
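
Elasticsearch就绪后,可以通过集群健康接口确认状态(单节点部署下green或yellow均属正常):

# 检查Elasticsearch集群健康状态
kubectl port-forward -n logging svc/elasticsearch 9200:9200 &
curl -s http://localhost:9200/_cluster/health?pretty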

9.5.2 Logstash部署

# Logstash ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: logging
data:
  logstash.yml: |
    http.host: "0.0.0.0"
    path.config: /usr/share/logstash/pipeline
    xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]
  
  logstash.conf: |
    input {
      beats {
        port => 5044
      }
      http {
        port => 8080
        codec => json
      }
    }
    
    filter {
      if [kubernetes] {
        mutate {
          add_field => { "cluster_name" => "kubernetes" }
        }
        
        # 解析容器日志
        if [kubernetes][container][name] {
          mutate {
            add_field => { "container_name" => "%{[kubernetes][container][name]}" }
          }
        }
        
        # 解析Pod信息
        if [kubernetes][pod][name] {
          mutate {
            add_field => { "pod_name" => "%{[kubernetes][pod][name]}" }
          }
        }
        
        # 解析命名空间
        if [kubernetes][namespace] {
          mutate {
            add_field => { "namespace" => "%{[kubernetes][namespace]}" }
          }
        }
        
        # 尝试解析JSON格式的日志
        if [message] =~ /^\{.*\}$/ {
          json {
            source => "message"
            target => "parsed_json"
          }
        }
        
        # 添加时间戳
        date {
          match => [ "@timestamp", "ISO8601" ]
        }
      }
      
      # 过滤敏感信息
      mutate {
        gsub => [
          "message", "password=[^\s]+", "password=***",
          "message", "token=[^\s]+", "token=***",
          "message", "secret=[^\s]+", "secret=***"
        ]
      }
    }
    
    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
        index => "kubernetes-logs-%{+YYYY.MM.dd}"
        template_name => "kubernetes"
        template_pattern => "kubernetes-*"
        template => {
          "index_patterns" => ["kubernetes-*"],
          "settings" => {
            "number_of_shards" => 1,
            "number_of_replicas" => 0
          },
          "mappings" => {
            "properties" => {
              "@timestamp" => { "type" => "date" },
              "message" => { "type" => "text" },
              "level" => { "type" => "keyword" },
              "namespace" => { "type" => "keyword" },
              "pod_name" => { "type" => "keyword" },
              "container_name" => { "type" => "keyword" }
            }
          }
        }
      }
      
      # 调试输出
      stdout {
        codec => rubydebug
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      containers:
      - name: logstash
        image: docker.elastic.co/logstash/logstash:7.17.0
        env:
        - name: LS_JAVA_OPTS
          value: "-Xms256m -Xmx256m"
        ports:
        - containerPort: 5044
        - containerPort: 8080
        - containerPort: 9600
        volumeMounts:
        - name: logstash-config
          mountPath: /usr/share/logstash/config/logstash.yml
          subPath: logstash.yml
        - name: logstash-pipeline
          mountPath: /usr/share/logstash/pipeline/logstash.conf
          subPath: logstash.conf
      volumes:
      - name: logstash-config
        configMap:
          name: logstash-config
      - name: logstash-pipeline
        configMap:
          name: logstash-config
---
apiVersion: v1
kind: Service
metadata:
  name: logstash
  namespace: logging
spec:
  selector:
    app: logstash
  ports:
  - name: beats
    port: 5044
    targetPort: 5044
  - name: http
    port: 8080
    targetPort: 8080
  - name: monitoring
    port: 9600
    targetPort: 9600
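
除了接收Filebeat数据,上面的管道还开放了8080端口的http输入,可以用它快速验证Logstash到Elasticsearch的链路(测试消息内容为示例):

# 向Logstash的http输入发送测试日志
kubectl port-forward -n logging svc/logstash 8080:8080 &
curl -s -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"message":"pipeline test","level":"INFO"}'

# 确认索引已创建(需先转发Elasticsearch的9200端口)
curl -s http://localhost:9200/_cat/indices/kubernetes-logs-*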

9.5.3 Filebeat日志收集

# Filebeat ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: logging
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
        - drop_event:
            when:
              or:
                - contains:
                    kubernetes.container.name: "filebeat"
                - contains:
                    kubernetes.container.name: "logstash"
                - contains:
                    kubernetes.container.name: "elasticsearch"
    
    output.logstash:
      hosts: ["logstash:5044"]
    
    logging.level: info
    logging.to_files: true
    logging.files:
      path: /var/log/filebeat
      name: filebeat
      keepfiles: 7
      permissions: 0644
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: logging
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:7.17.0
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          runAsUser: 0
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0640
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      - name: data
        hostPath:
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate
      tolerations:
      - operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
rules:
- apiGroups: [""]
  resources:
  - nodes
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: logging
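
Filebeat以DaemonSet方式运行在每个节点,可按如下方式确认采集是否正常(日志输出的具体内容以实际版本为准):

# 验证Filebeat采集
kubectl rollout status ds/filebeat -n logging
kubectl logs -n logging ds/filebeat --tail=20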

9.5.4 Kibana可视化

# Kibana部署
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:7.17.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
        - name: SERVER_NAME
          value: "kibana"
        - name: SERVER_HOST
          value: "0.0.0.0"
        ports:
        - containerPort: 5601
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  type: NodePort
  ports:
  - port: 5601
    targetPort: 5601
    nodePort: 30601
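
Kibana启动后需要先创建索引模式才能检索日志,既可以在界面上操作,也可以调用saved_objects API(索引模式名称对应Logstash输出中的kubernetes-logs-*):

# 通过API创建索引模式(适用于7.x版本)
curl -s -X POST http://localhost:30601/api/saved_objects/index-pattern \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{"attributes":{"title":"kubernetes-logs-*","timeFieldName":"@timestamp"}}'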

9.6 监控和日志管理脚本

9.6.1 监控部署脚本

#!/bin/bash
# deploy-monitoring.sh

echo "=== 部署Kubernetes监控系统 ==="

# 创建命名空间
echo "1. 创建监控命名空间"
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace logging --dry-run=client -o yaml | kubectl apply -f -

# 创建RBAC
echo "2. 创建RBAC权限"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
EOF

# 部署kube-state-metrics
echo "3. 部署kube-state-metrics"
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/cluster-role-binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/cluster-role.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/service-account.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/service.yaml

# 部署前文定义的监控与日志组件(清单文件名为示例,需替换为实际保存的文件)
echo "4. 部署Prometheus、Grafana、AlertManager及日志组件"
kubectl apply -f prometheus-config.yaml -f prometheus-rules.yaml -f prometheus-deployment.yaml
kubectl apply -f node-exporter.yaml -f grafana.yaml -f alertmanager.yaml
kubectl apply -f elasticsearch.yaml -f logstash.yaml -f filebeat.yaml -f kibana.yaml

# 等待部署完成
echo "5. 等待组件启动"
kubectl wait --for=condition=available --timeout=300s deployment/kube-state-metrics -n kube-system

echo "6. 检查部署状态"
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get svc -n monitoring
kubectl get svc -n logging

echo "\n=== 监控系统部署完成 ==="
echo "Prometheus: http://localhost:30090"
echo "Grafana: http://localhost:30030 (admin/admin123)"
echo "AlertManager: http://localhost:30093"
echo "Kibana: http://localhost:30601"

9.6.2 监控检查脚本

#!/bin/bash
# check-monitoring.sh

echo "=== 监控系统健康检查 ==="

# 检查Prometheus
echo "\n1. 检查Prometheus状态:"
kubectl get pods -n monitoring -l app=prometheus
PROM_STATUS=$(kubectl get pods -n monitoring -l app=prometheus -o jsonpath='{.items[0].status.phase}')
if [ "$PROM_STATUS" = "Running" ]; then
    echo "✓ Prometheus运行正常"
else
    echo "✗ Prometheus状态异常: $PROM_STATUS"
fi

# 检查Grafana
echo "\n2. 检查Grafana状态:"
kubectl get pods -n monitoring -l app=grafana
GRAFANA_STATUS=$(kubectl get pods -n monitoring -l app=grafana -o jsonpath='{.items[0].status.phase}')
if [ "$GRAFANA_STATUS" = "Running" ]; then
    echo "✓ Grafana运行正常"
else
    echo "✗ Grafana状态异常: $GRAFANA_STATUS"
fi

# 检查AlertManager
echo "\n3. 检查AlertManager状态:"
kubectl get pods -n monitoring -l app=alertmanager
ALERT_STATUS=$(kubectl get pods -n monitoring -l app=alertmanager -o jsonpath='{.items[0].status.phase}')
if [ "$ALERT_STATUS" = "Running" ]; then
    echo "✓ AlertManager运行正常"
else
    echo "✗ AlertManager状态异常: $ALERT_STATUS"
fi

# 检查Node Exporter
echo "\n4. 检查Node Exporter状态:"
kubectl get ds -n monitoring node-exporter
NODE_READY=$(kubectl get ds -n monitoring node-exporter -o jsonpath='{.status.numberReady}')
NODE_DESIRED=$(kubectl get ds -n monitoring node-exporter -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$NODE_READY" = "$NODE_DESIRED" ]; then
    echo "✓ Node Exporter运行正常 ($NODE_READY/$NODE_DESIRED)"
else
    echo "✗ Node Exporter状态异常 ($NODE_READY/$NODE_DESIRED)"
fi

# 检查Elasticsearch
echo "\n5. 检查Elasticsearch状态:"
kubectl get pods -n logging -l app=elasticsearch
ES_STATUS=$(kubectl get pods -n logging -l app=elasticsearch -o jsonpath='{.items[0].status.phase}')
if [ "$ES_STATUS" = "Running" ]; then
    echo "✓ Elasticsearch运行正常"
else
    echo "✗ Elasticsearch状态异常: $ES_STATUS"
fi

# 检查Logstash
echo "\n6. 检查Logstash状态:"
kubectl get pods -n logging -l app=logstash
LOGSTASH_STATUS=$(kubectl get pods -n logging -l app=logstash -o jsonpath='{.items[0].status.phase}')
if [ "$LOGSTASH_STATUS" = "Running" ]; then
    echo "✓ Logstash运行正常"
else
    echo "✗ Logstash状态异常: $LOGSTASH_STATUS"
fi

# 检查Filebeat
echo "\n7. 检查Filebeat状态:"
kubectl get ds -n logging filebeat
FILEBEAT_READY=$(kubectl get ds -n logging filebeat -o jsonpath='{.status.numberReady}')
FILEBEAT_DESIRED=$(kubectl get ds -n logging filebeat -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$FILEBEAT_READY" = "$FILEBEAT_DESIRED" ]; then
    echo "✓ Filebeat运行正常 ($FILEBEAT_READY/$FILEBEAT_DESIRED)"
else
    echo "✗ Filebeat状态异常 ($FILEBEAT_READY/$FILEBEAT_DESIRED)"
fi

# 检查Kibana
echo "\n8. 检查Kibana状态:"
kubectl get pods -n logging -l app=kibana
KIBANA_STATUS=$(kubectl get pods -n logging -l app=kibana -o jsonpath='{.items[0].status.phase}')
if [ "$KIBANA_STATUS" = "Running" ]; then
    echo "✓ Kibana运行正常"
else
    echo "✗ Kibana状态异常: $KIBANA_STATUS"
fi

# 检查服务端点
echo "\n9. 检查服务端点:"
echo "Prometheus: $(kubectl get svc -n monitoring prometheus -o jsonpath='{.spec.type}')端口$(kubectl get svc -n monitoring prometheus -o jsonpath='{.spec.ports[0].nodePort}')"
echo "Grafana: $(kubectl get svc -n monitoring grafana -o jsonpath='{.spec.type}')端口$(kubectl get svc -n monitoring grafana -o jsonpath='{.spec.ports[0].nodePort}')"
echo "AlertManager: $(kubectl get svc -n monitoring alertmanager -o jsonpath='{.spec.type}')端口$(kubectl get svc -n monitoring alertmanager -o jsonpath='{.spec.ports[0].nodePort}')"
echo "Kibana: $(kubectl get svc -n logging kibana -o jsonpath='{.spec.type}')端口$(kubectl get svc -n logging kibana -o jsonpath='{.spec.ports[0].nodePort}')"

echo "\n=== 监控系统检查完成 ==="

9.6.3 日志查询脚本

#!/bin/bash
# query-logs.sh

NAMESPACE=${1:-default}
POD_NAME=${2:-""}
CONTAINER=${3:-""}
LINES=${4:-100}

echo "=== Kubernetes日志查询工具 ==="
echo "命名空间: $NAMESPACE"
echo "Pod名称: $POD_NAME"
echo "容器名称: $CONTAINER"
echo "行数: $LINES"
echo ""

if [ -z "$POD_NAME" ]; then
    echo "可用的Pod列表:"
    kubectl get pods -n $NAMESPACE
    echo ""
    echo "用法: $0 <namespace> <pod-name> [container-name] [lines]"
    exit 1
fi

# 检查Pod是否存在
if ! kubectl get pod $POD_NAME -n $NAMESPACE &>/dev/null; then
    echo "错误: Pod $POD_NAME 在命名空间 $NAMESPACE 中不存在"
    exit 1
fi

# 获取Pod中的容器列表
if [ -z "$CONTAINER" ]; then
    CONTAINERS=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[*].name}')
    echo "Pod中的容器列表: $CONTAINERS"
    
    # 如果只有一个容器,自动选择
    CONTAINER_COUNT=$(echo $CONTAINERS | wc -w)
    if [ $CONTAINER_COUNT -eq 1 ]; then
        CONTAINER=$CONTAINERS
        echo "自动选择容器: $CONTAINER"
    else
        echo "请指定容器名称"
        exit 1
    fi
fi

echo "\n=== 实时日志 (按Ctrl+C退出) ==="
kubectl logs -f $POD_NAME -c $CONTAINER -n $NAMESPACE --tail=$LINES

9.6.4 性能监控脚本

#!/bin/bash
# performance-monitor.sh

echo "=== Kubernetes性能监控报告 ==="
echo "生成时间: $(date)"
echo ""

# 集群资源使用情况
echo "1. 集群资源使用情况:"
echo "节点数量: $(kubectl get nodes --no-headers | wc -l)"
echo "Pod总数: $(kubectl get pods --all-namespaces --no-headers | wc -l)"
echo "Service总数: $(kubectl get svc --all-namespaces --no-headers | wc -l)"
echo ""

# 节点资源使用
echo "2. 节点资源使用:"
kubectl top nodes 2>/dev/null || echo "需要安装metrics-server"
echo ""

# Pod资源使用Top 10
echo "3. Pod资源使用Top 10:"
echo "CPU使用率最高的Pod:"
kubectl top pods --all-namespaces --sort-by=cpu 2>/dev/null | head -11 || echo "需要安装metrics-server"
echo ""
echo "内存使用率最高的Pod:"
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -11 || echo "需要安装metrics-server"
echo ""

# 检查问题Pod
echo "4. 问题Pod检查:"
echo "重启次数较多的Pod:"
kubectl get pods --all-namespaces --field-selector=status.phase=Running -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount | awk 'NR>1 && $3>5'
echo ""
echo "非Running状态的Pod:"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
echo ""

# 存储使用情况
echo "5. 存储使用情况:"
kubectl get pv
echo ""
kubectl get pvc --all-namespaces
echo ""

# 网络策略
echo "6. 网络策略:"
kubectl get networkpolicies --all-namespaces
echo ""

# 事件检查
echo "7. 最近的Warning事件:"
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== 性能监控报告完成 ==="

9.6.5 告警测试脚本

#!/bin/bash
# test-alerts.sh

echo "=== 告警系统测试 ==="

# 创建高CPU使用的测试Pod
echo "1. 创建高CPU使用测试Pod"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress-test
  namespace: default
  labels:
    app: stress-test
spec:
  containers:
  - name: cpu-stress
    image: progrium/stress
    command: ["stress"]
    args: ["--cpu", "2", "--timeout", "300s"]
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: "2"        # 放开CPU上限,让stress真正产生高负载以便触发CPU告警
        memory: 256Mi
EOF

echo "等待Pod启动..."
kubectl wait --for=condition=Ready pod/cpu-stress-test --timeout=60s

echo "\n2. 创建高内存使用测试Pod"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-stress-test
  namespace: default
  labels:
    app: stress-test
spec:
  containers:
  - name: memory-stress
    image: progrium/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "200M", "--timeout", "300s"]
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

echo "等待Pod启动..."
kubectl wait --for=condition=Ready pod/memory-stress-test --timeout=60s

echo "\n3. 监控告警状态"
echo "检查Prometheus告警状态..."
echo "请访问 http://localhost:30090/alerts 查看告警"
echo "请访问 http://localhost:30093 查看AlertManager"

echo "\n4. 等待5分钟后清理测试资源"
sleep 300

echo "\n5. 清理测试资源"
kubectl delete pod cpu-stress-test memory-stress-test

echo "\n=== 告警测试完成 ==="

9.7 监控最佳实践

9.7.1 监控策略

# 监控最佳实践配置
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-best-practices
data:
  monitoring-strategy: |
    1. 分层监控:
       - 基础设施层: 节点、网络、存储
       - 平台层: Kubernetes组件
       - 应用层: 业务指标
    
    2. 关键指标:
       - 黄金信号: 延迟、流量、错误、饱和度
       - RED方法: 请求率、错误率、持续时间
       - USE方法: 使用率、饱和度、错误
    
    3. 告警策略:
       - 基于SLI/SLO设置告警
       - 避免告警疲劳
       - 分级告警处理
    
    4. 数据保留:
       - 高精度数据: 7-30天
       - 中精度数据: 3-6个月
       - 低精度数据: 1-2年
  
  sli-slo-examples: |
    # 服务水平指标和目标示例
    
    API可用性:
    - SLI: 成功请求数 / 总请求数
    - SLO: 99.9% (月度)
    
    API延迟:
    - SLI: 95%请求响应时间
    - SLO: < 200ms
    
    错误率:
    - SLI: 错误请求数 / 总请求数
    - SLO: < 0.1%
    
    数据持久性:
    - SLI: 成功备份数 / 计划备份数
    - SLO: 99.99%
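
以上面的"API可用性"SLI为例,在Prometheus中可以用类似下面的查询计算(假设应用暴露了带code标签的http_requests_total计数器,指标名和标签需按实际应用调整):

# 最近30天成功请求占比(示例PromQL,经由Prometheus HTTP API查询)
curl -s -G http://localhost:30090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{code!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))'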

9.7.2 日志管理最佳实践

apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-best-practices
data:
  log-levels: |
    # 日志级别使用指南
    
    ERROR: 系统错误,需要立即关注
    - 应用崩溃
    - 数据库连接失败
    - 外部服务不可用
    
    WARN: 潜在问题,需要监控
    - 重试操作
    - 性能降级
    - 配置问题
    
    INFO: 重要业务事件
    - 用户登录/登出
    - 重要操作完成
    - 系统启动/关闭
    
    DEBUG: 调试信息
    - 详细执行流程
    - 变量值
    - 函数调用
  
  log-format: |
    # 结构化日志格式
    {
      "timestamp": "2023-01-01T12:00:00Z",
      "level": "INFO",
      "service": "user-service",
      "version": "v1.2.3",
      "trace_id": "abc123",
      "span_id": "def456",
      "user_id": "user123",
      "action": "login",
      "message": "User logged in successfully",
      "duration_ms": 150,
      "status_code": 200
    }
  
  log-retention: |
    # 日志保留策略
    
    应用日志:
    - 热数据: 7天 (快速查询)
    - 温数据: 30天 (常规查询)
    - 冷数据: 90天 (归档存储)
    
    审计日志:
    - 热数据: 30天
    - 温数据: 1年
    - 冷数据: 7年 (合规要求)
    
    系统日志:
    - 热数据: 3天
    - 温数据: 14天
    - 冷数据: 30天

9.8 故障排查和性能优化

9.8.1 常见监控问题

#!/bin/bash
# troubleshoot-monitoring.sh

echo "=== 监控系统故障排查 ==="

# 检查Prometheus数据收集
echo "1. 检查Prometheus目标状态:"
echo "访问 http://localhost:30090/targets 检查目标状态"
echo ""

# 检查指标数据
echo "2. 检查关键指标:"
echo "up{job=\"kubernetes-nodes\"} - 节点状态"
echo "up{job=\"kubernetes-apiservers\"} - API Server状态"
echo "up{job=\"kubernetes-cadvisor\"} - cAdvisor状态"
echo ""

# 检查存储空间
echo "3. 检查存储使用:"
kubectl exec -n monitoring deployment/prometheus -- df -h /prometheus
echo ""

# 检查日志
echo "4. 检查Prometheus日志:"
kubectl logs -n monitoring deployment/prometheus --tail=20
echo ""

# 检查配置
echo "5. 检查配置重载:"
echo "POST http://localhost:30090/-/reload 重载配置"
echo ""

# 性能优化建议
echo "6. 性能优化建议:"
echo "- 调整scrape_interval减少数据收集频率"
echo "- 使用recording rules预计算复杂查询"
echo "- 配置适当的retention时间"
echo "- 使用remote storage扩展存储"
echo ""

echo "=== 故障排查完成 ==="

9.8.2 性能优化配置

# Prometheus性能优化配置
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-performance-config
data:
  recording-rules.yml: |
    groups:
    - name: performance.rules
      interval: 30s
      rules:
      # 节点CPU使用率
      - record: node:cpu_utilization:rate5m
        expr: |
          100 - (
            avg by (instance) (
              irate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          )
      
      # 节点内存使用率
      - record: node:memory_utilization:ratio
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes
      
      # Pod CPU使用率
      - record: pod:cpu_usage:rate5m
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
          )
      
      # Pod内存使用率
      - record: pod:memory_usage:ratio
        expr: |
          sum by (namespace, pod) (
            container_memory_working_set_bytes{container!="POD",container!=""}
          ) / sum by (namespace, pod) (
            container_spec_memory_limit_bytes{container!="POD",container!=""}
          )
      
      # 集群资源使用汇总
      - record: cluster:cpu_usage:rate5m
        expr: |
          sum(node:cpu_utilization:rate5m) / count(node:cpu_utilization:rate5m)
      
      - record: cluster:memory_usage:ratio
        expr: |
          sum(node:memory_utilization:ratio) / count(node:memory_utilization:ratio)
  
  optimization-tips: |
    # Prometheus性能优化技巧
    
    1. 存储优化:
       - 使用SSD存储提高I/O性能
       - 配置适当的retention时间
       - 启用压缩减少存储空间
    
    2. 查询优化:
       - 使用recording rules预计算
       - 避免高基数标签
       - 限制查询时间范围
    
    3. 网络优化:
       - 减少scrape间隔
       - 使用服务发现减少配置
       - 启用gzip压缩
    
    4. 资源优化:
       - 合理配置内存限制
       - 使用多个Prometheus实例
       - 配置联邦集群
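
Recording rules生效后,仪表板和告警可以直接引用预计算好的指标名,避免每次执行代价较高的原始表达式(假设recording-rules.yml已加入Prometheus的rule_files并完成重载):

# 查询预计算的节点CPU使用率
curl -s -G http://localhost:30090/api/v1/query \
  --data-urlencode 'query=node:cpu_utilization:rate5m'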

9.9 总结

本章详细介绍了Kubernetes的监控和日志管理系统,主要内容包括:

9.9.1 监控体系

  • 监控架构:了解了Kubernetes监控的分层架构和核心组件
  • Prometheus:学习了Prometheus的部署、配置和告警规则
  • Grafana:掌握了可视化仪表板的创建和配置
  • AlertManager:实现了告警的管理和通知机制

9.9.2 日志管理

  • ELK Stack:部署了完整的日志收集、处理和可视化系统
  • Filebeat:实现了容器日志的自动收集
  • Logstash:配置了日志的解析和转换
  • Kibana:提供了日志的查询和分析界面

9.9.3 最佳实践

  • 监控策略:建立了基于SLI/SLO的监控体系
  • 告警管理:实现了分级告警和通知机制
  • 性能优化:掌握了监控系统的性能调优方法
  • 故障排查:学会了常见问题的诊断和解决

9.9.4 核心要点

  1. 全面监控:覆盖基础设施、平台和应用三个层面
  2. 主动告警:基于业务指标设置合理的告警阈值
  3. 日志聚合:统一收集和管理所有组件的日志
  4. 可视化展示:通过仪表板直观展示系统状态
  5. 持续优化:根据实际使用情况调整监控策略

9.9.5 注意事项

  • 合理配置资源限制,避免监控系统影响业务
  • 定期清理历史数据,控制存储成本
  • 建立监控系统的备份和恢复机制
  • 培训团队成员使用监控工具
  • 建立监控数据的安全访问控制

通过本章的学习,你已经掌握了Kubernetes监控和日志管理的完整解决方案。下一章我们将学习Kubernetes的安全管理,包括RBAC、网络策略、Pod安全策略等内容。