9.1 Monitoring System Overview
9.1.1 Kubernetes Monitoring Architecture
# Monitoring architecture overview
apiVersion: v1
kind: ConfigMap
metadata:
name: monitoring-architecture
data:
  layers: |
    1. Infrastructure monitoring:
       - Node resource utilization
       - Network and storage performance
       - Hardware health
    2. Kubernetes component monitoring:
       - API Server performance
       - etcd cluster status
       - kubelet and the container runtime
    3. Application monitoring:
       - Pod and container metrics
       - Application custom metrics
       - Business metrics
    4. Log monitoring:
       - System log collection
       - Application log aggregation
       - Audit log analysis
  components: |
    - Prometheus: metrics collection and storage
    - Grafana: visualization and alerting
    - AlertManager: alert management
    - Jaeger: distributed tracing
    - ELK Stack: log management
9.1.2 Monitoring Metric Types
apiVersion: v1
kind: ConfigMap
metadata:
name: metrics-types
data:
  resource-metrics: |
    # Resource metrics
    - CPU usage vs. requests/limits
    - Memory usage vs. requests/limits
    - Disk usage and I/O
    - Network traffic and latency
  kubernetes-metrics: |
    # Kubernetes metrics
    - Pod status and restart counts
    - Service endpoint health
    - Deployment replica status
    - Node readiness
  application-metrics: |
    # Application metrics
    - HTTP request rate and latency
    - Database connection pool status
    - Queue length and processing time
    - Business KPIs
  custom-metrics: |
    # Custom metrics
    - Application-specific business metrics
    - Third-party service integration metrics
    - User-defined SLI metrics
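Before standing up the full Prometheus stack, you can sample each class of metric directly with kubectl. A minimal sketch, assuming metrics-server is already installed in the cluster:
#!/bin/bash
# quick-metrics.sh: sample each metric class from the command line (assumes metrics-server)
# Resource metrics: node and pod CPU/memory via metrics-server
kubectl top nodes
kubectl top pods --all-namespaces | head -10
# The same data through the Resource Metrics API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | head -c 500; echo
# Kubernetes object metrics: restart counts and node readiness from the core API
kubectl get pods --all-namespaces \
  -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,RESTARTS:'.status.containerStatuses[0].restartCount'
kubectl get nodes \
  -o custom-columns=NODE:.metadata.name,READY:'.status.conditions[?(@.type=="Ready")].status'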
9.2 The Prometheus Monitoring System
9.2.1 Deploying Prometheus
# Prometheus ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "/etc/prometheus/rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.37.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus/'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=200h'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-config-volume
mountPath: /etc/prometheus/
- name: prometheus-storage-volume
mountPath: /prometheus/
- name: prometheus-rules-volume
mountPath: /etc/prometheus/rules/
volumes:
- name: prometheus-config-volume
configMap:
defaultMode: 420
name: prometheus-config
- name: prometheus-storage-volume
emptyDir: {}
- name: prometheus-rules-volume
configMap:
name: prometheus-rules
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
selector:
app: prometheus
type: NodePort
ports:
- port: 9090
targetPort: 9090
nodePort: 30090
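After applying these manifests, confirm that Prometheus is healthy and actually discovering targets before moving on. A quick check, assuming the NodePort 30090 from the Service above (NODE_IP is a placeholder for one of your node addresses):
#!/bin/bash
# verify-prometheus.sh: sanity-check the Prometheus deployment
NODE_IP=${1:?usage: $0 <node-ip>}
# Readiness endpoint
curl -s "http://${NODE_IP}:30090/-/ready"; echo
# Count discovered targets per scrape job via the HTTP API
curl -s "http://${NODE_IP}:30090/api/v1/targets" | python3 -c '
import json, sys, collections
targets = json.load(sys.stdin)["data"]["activeTargets"]
print(collections.Counter(t["labels"]["job"] for t in targets))'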
9.2.2 Prometheus Alerting Rules
# Prometheus alerting rules
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
kubernetes.yml: |
groups:
- name: kubernetes
rules:
      - alert: KubernetesNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes node not ready (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has been unready for more than 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes memory pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesDiskPressure
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes disk pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      # Note: the OutOfDisk node condition was removed in Kubernetes 1.12,
      # so this rule only applies to older clusters
      - alert: KubernetesOutOfDisk
        expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes out of disk (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has OutOfDisk condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesOutOfCapacity
expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes out of capacity (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} is out of capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesContainerOomKiller
expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
for: 0m
labels:
severity: warning
annotations:
summary: Kubernetes container oom killer (instance {{ $labels.instance }})
description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesPodCrashLooping
expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: KubernetesReplicaSetMismatch
        expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
          description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesDeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
for: 10m
labels:
severity: warning
annotations:
summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
description: "Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
resource.yml: |
groups:
- name: resource
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: High CPU usage detected
description: "CPU usage is above 80% for more than 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: High memory usage detected
description: "Memory usage is above 85% for more than 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HighDiskUsage
expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: High disk usage detected
description: "Disk usage is above 85% for more than 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PodHighCPU
expr: sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (namespace, pod) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: Pod high CPU usage
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PodHighMemory
expr: sum(container_memory_working_set_bytes{container!="POD",container!=""}) by (namespace, pod) / sum(container_spec_memory_limit_bytes{container!="POD",container!=""}) by (namespace, pod) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: Pod high memory usage
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.2.3 Deploying Node Exporter
# Node Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
spec:
hostPID: true
hostIPC: true
hostNetwork: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.3.1
args:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
ports:
- containerPort: 9100
hostPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: rootfs
mountPath: /rootfs
readOnly: true
securityContext:
runAsNonRoot: true
runAsUser: 65534
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
tolerations:
- operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
selector:
app: node-exporter
ports:
- name: metrics
port: 9100
targetPort: 9100
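Because the DaemonSet runs with hostNetwork and hostPort 9100, each node exposes its metrics on its own address. A quick spot check:
#!/bin/bash
# verify-node-exporter.sh: confirm coverage and sample a few node metrics
kubectl get daemonset node-exporter -n monitoring \
  -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled} pods ready{"\n"}'
NODE_IP=$(kubectl get nodes \
  -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
curl -s "http://${NODE_IP}:9100/metrics" | grep -E '^node_(load1|memory_MemAvailable_bytes) '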
9.3 Grafana Visualization
9.3.1 Deploying Grafana
# Grafana ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-config
namespace: monitoring
data:
grafana.ini: |
[analytics]
check_for_updates = true
[grafana_net]
url = https://grafana.net
[log]
mode = console
[paths]
data = /var/lib/grafana/
logs = /var/log/grafana
plugins = /var/lib/grafana/plugins
provisioning = /etc/grafana/provisioning
[server]
root_url = http://localhost:3000/
[security]
admin_user = admin
admin_password = admin123
[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer
default_theme = dark
datasources.yml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:9.1.0
ports:
- containerPort: 3000
env:
- name: GF_SECURITY_ADMIN_PASSWORD
value: "admin123"
volumeMounts:
- name: grafana-config
mountPath: /etc/grafana/grafana.ini
subPath: grafana.ini
- name: grafana-datasources
mountPath: /etc/grafana/provisioning/datasources/datasources.yml
subPath: datasources.yml
- name: grafana-storage
mountPath: /var/lib/grafana
volumes:
- name: grafana-config
configMap:
name: grafana-config
- name: grafana-datasources
configMap:
name: grafana-config
- name: grafana-storage
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
spec:
selector:
app: grafana
type: NodePort
ports:
- port: 3000
targetPort: 3000
nodePort: 30030
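Once the pod is up, Grafana's HTTP API offers a quick way to verify both the service and the provisioned datasource, using the admin credentials configured above (NODE_IP is a placeholder):
#!/bin/bash
# verify-grafana.sh: check Grafana health and the provisioned datasource
NODE_IP=${1:?usage: $0 <node-ip>}
curl -s "http://${NODE_IP}:30030/api/health"; echo
curl -s -u admin:admin123 "http://${NODE_IP}:30030/api/datasources" | python3 -m json.tool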
9.3.2 Grafana Dashboard Configuration
{
"dashboard": {
"id": null,
"title": "Kubernetes Cluster Monitoring",
"tags": ["kubernetes"],
"style": "dark",
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Cluster CPU Usage",
"type": "stat",
"targets": [
{
"expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Cluster Memory Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
}
}
},
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
},
{
"id": 3,
"title": "Pod Count",
"type": "stat",
"targets": [
{
"expr": "sum(kube_pod_info)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {"mode": "palette-classic"}
}
},
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
},
{
"id": 4,
"title": "Node Count",
"type": "stat",
"targets": [
{
"expr": "sum(kube_node_info)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {"mode": "palette-classic"}
}
},
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
},
{
"id": 5,
"title": "CPU Usage by Node",
"type": "timeseries",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100
}
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
},
{
"id": 6,
"title": "Memory Usage by Node",
"type": "timeseries",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"refId": "A",
"legendFormat": "{{instance}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100
}
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s"
}
}
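This dashboard can be imported through the Grafana UI, or pushed with the HTTP API. A minimal sketch, assuming the JSON above is saved locally as cluster-dashboard.json (a hypothetical filename):
#!/bin/bash
# import-dashboard.sh: push the dashboard JSON to Grafana's API
NODE_IP=${1:?usage: $0 <node-ip>}
curl -s -u admin:admin123 -H "Content-Type: application/json" \
  -X POST "http://${NODE_IP}:30030/api/dashboards/db" \
  -d @cluster-dashboard.json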
9.4 Alert Management with AlertManager
9.4.1 Deploying AlertManager
# AlertManager ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@example.com'
smtp_auth_username: 'alerts@example.com'
smtp_auth_password: 'password'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
severity: warning
receiver: 'warning-alerts'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://webhook-service:5000/alerts'
send_resolved: true
- name: 'critical-alerts'
email_configs:
- to: 'admin@example.com'
subject: 'Critical Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels: {{ .Labels }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: 'Critical Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: 'warning-alerts'
email_configs:
- to: 'team@example.com'
subject: 'Warning Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels: {{ .Labels }}
{{ end }}
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: prom/alertmanager:v0.24.0
args:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
ports:
- containerPort: 9093
volumeMounts:
- name: alertmanager-config-volume
mountPath: /etc/alertmanager
- name: alertmanager-storage-volume
mountPath: /alertmanager
volumes:
- name: alertmanager-config-volume
configMap:
defaultMode: 420
name: alertmanager-config
- name: alertmanager-storage-volume
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: monitoring
spec:
selector:
app: alertmanager
type: NodePort
ports:
- port: 9093
targetPort: 9093
nodePort: 30093
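The routing tree can be exercised without waiting for a real incident: post a synthetic alert to the Alertmanager v2 API and watch it reach the matching receiver.
#!/bin/bash
# test-alertmanager.sh: inject a synthetic alert to exercise routing
NODE_IP=${1:?usage: $0 <node-ip>}
curl -s -X POST "http://${NODE_IP}:30093/api/v2/alerts" \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "SyntheticTest", "severity": "warning"},
        "annotations": {"summary": "Synthetic alert for routing test"}}]'
# The alert should now show up in the active alert list
curl -s "http://${NODE_IP}:30093/api/v2/alerts" | python3 -m json.tool | head -30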
9.4.2 Custom Webhook Receiver
# webhook-receiver.py
from flask import Flask, request, jsonify
import json
import logging
from datetime import datetime
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
@app.route('/alerts', methods=['POST'])
def receive_alerts():
try:
data = request.get_json()
for alert in data.get('alerts', []):
alert_name = alert.get('labels', {}).get('alertname', 'Unknown')
status = alert.get('status', 'Unknown')
summary = alert.get('annotations', {}).get('summary', 'No summary')
description = alert.get('annotations', {}).get('description', 'No description')
log_message = f"Alert: {alert_name}, Status: {status}, Summary: {summary}"
if status == 'firing':
logging.warning(log_message)
                # Custom alert-handling logic can go here,
                # e.g. forwarding to WeCom or DingTalk
send_to_custom_system(alert)
else:
logging.info(f"Resolved: {log_message}")
return jsonify({'status': 'success'}), 200
except Exception as e:
logging.error(f"Error processing alerts: {str(e)}")
return jsonify({'status': 'error', 'message': str(e)}), 500
def send_to_custom_system(alert):
    """Forward the alert to a custom notification system"""
    # Implement custom alert delivery logic here
    pass
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
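Before containerizing the receiver, a local smoke test confirms it parses Alertmanager-style payloads (run python3 webhook-receiver.py in another terminal first):
# Local smoke test for webhook-receiver.py
curl -s -X POST http://localhost:5000/alerts \
  -H "Content-Type: application/json" \
  -d '{"alerts": [{"status": "firing",
                   "labels": {"alertname": "TestAlert"},
                   "annotations": {"summary": "test", "description": "local smoke test"}}]}'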
# Webhook receiver Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: webhook-receiver
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: webhook-receiver
template:
metadata:
labels:
app: webhook-receiver
spec:
containers:
      - name: webhook-receiver
        image: python:3.9-slim
        # python:3.9-slim does not ship Flask; install it at startup
        # (for production, bake a dedicated image instead)
        command: ["sh", "-c", "pip install --no-cache-dir flask && python /app/webhook-receiver.py"]
ports:
- containerPort: 5000
volumeMounts:
- name: webhook-code
mountPath: /app
env:
- name: FLASK_ENV
value: "production"
volumes:
- name: webhook-code
configMap:
name: webhook-receiver-code
---
apiVersion: v1
kind: Service
metadata:
name: webhook-service
namespace: monitoring
spec:
selector:
app: webhook-receiver
ports:
- port: 5000
targetPort: 5000
9.5 Log Management System
9.5.1 Deploying Elasticsearch
# Elasticsearch ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: elasticsearch-config
namespace: logging
data:
elasticsearch.yml: |
cluster.name: kubernetes-logs
node.name: ${HOSTNAME}
network.host: 0.0.0.0
discovery.type: single-node
xpack.security.enabled: false
xpack.monitoring.collection.enabled: true
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
namespace: logging
spec:
serviceName: elasticsearch
replicas: 1
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
containers:
- name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
env:
- name: discovery.type
value: single-node
- name: ES_JAVA_OPTS
value: "-Xms512m -Xmx512m"
ports:
- containerPort: 9200
- containerPort: 9300
volumeMounts:
- name: elasticsearch-data
mountPath: /usr/share/elasticsearch/data
- name: elasticsearch-config
mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
subPath: elasticsearch.yml
volumes:
- name: elasticsearch-config
configMap:
name: elasticsearch-config
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
name: elasticsearch
namespace: logging
spec:
selector:
app: elasticsearch
ports:
- name: http
port: 9200
targetPort: 9200
- name: transport
port: 9300
targetPort: 9300
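With security disabled, the cluster health API answers without authentication; for a single-node cluster, expect green or yellow status. A quick check from inside the cluster:
#!/bin/bash
# verify-elasticsearch.sh: check cluster health from inside the cluster
kubectl -n logging exec statefulset/elasticsearch -- \
  curl -s 'http://localhost:9200/_cluster/health?pretty'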
9.5.2 Deploying Logstash
# Logstash ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: logstash-config
namespace: logging
data:
logstash.yml: |
http.host: "0.0.0.0"
path.config: /usr/share/logstash/pipeline
xpack.monitoring.elasticsearch.hosts: ["http://elasticsearch:9200"]
logstash.conf: |
input {
beats {
port => 5044
}
http {
port => 8080
codec => json
}
}
filter {
if [kubernetes] {
mutate {
add_field => { "cluster_name" => "kubernetes" }
}
        # Extract the container name
if [kubernetes][container][name] {
mutate {
add_field => { "container_name" => "%{[kubernetes][container][name]}" }
}
}
        # Extract the pod name
if [kubernetes][pod][name] {
mutate {
add_field => { "pod_name" => "%{[kubernetes][pod][name]}" }
}
}
        # Extract the namespace
if [kubernetes][namespace] {
mutate {
add_field => { "namespace" => "%{[kubernetes][namespace]}" }
}
}
        # Try to parse JSON-formatted log lines
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
target => "parsed_json"
}
}
        # Normalize the timestamp
date {
match => [ "@timestamp", "ISO8601" ]
}
}
      # Redact sensitive values
mutate {
gsub => [
"message", "password=[^\s]+", "password=***",
"message", "token=[^\s]+", "token=***",
"message", "secret=[^\s]+", "secret=***"
]
}
}
    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
        index => "kubernetes-logs-%{+YYYY.MM.dd}"
        # `template` expects a file path rather than an inline hash; the
        # template JSON is shipped as a separate ConfigMap key (below) and
        # mounted into the pod
        template_name => "kubernetes"
        template => "/usr/share/logstash/templates/kubernetes-template.json"
        template_overwrite => true
      }
      # Debug output
      stdout {
        codec => rubydebug
      }
    }
  kubernetes-template.json: |
    {
      "index_patterns": ["kubernetes-*"],
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "properties": {
          "@timestamp": { "type": "date" },
          "message": { "type": "text" },
          "level": { "type": "keyword" },
          "namespace": { "type": "keyword" },
          "pod_name": { "type": "keyword" },
          "container_name": { "type": "keyword" }
        }
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: logstash
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: logstash
template:
metadata:
labels:
app: logstash
spec:
containers:
- name: logstash
image: docker.elastic.co/logstash/logstash:7.17.0
env:
- name: LS_JAVA_OPTS
value: "-Xms256m -Xmx256m"
ports:
- containerPort: 5044
- containerPort: 8080
- containerPort: 9600
volumeMounts:
- name: logstash-config
mountPath: /usr/share/logstash/config/logstash.yml
subPath: logstash.yml
        - name: logstash-pipeline
          mountPath: /usr/share/logstash/pipeline/logstash.conf
          subPath: logstash.conf
        - name: logstash-pipeline
          mountPath: /usr/share/logstash/templates/kubernetes-template.json
          subPath: kubernetes-template.json
volumes:
- name: logstash-config
configMap:
name: logstash-config
- name: logstash-pipeline
configMap:
name: logstash-config
---
apiVersion: v1
kind: Service
metadata:
name: logstash
namespace: logging
spec:
selector:
app: logstash
ports:
- name: beats
port: 5044
targetPort: 5044
- name: http
port: 8080
targetPort: 8080
- name: monitoring
port: 9600
targetPort: 9600
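The pipeline's http input on port 8080 makes it possible to push a test event through the filters without waiting for Filebeat. A sketch using a throwaway curl pod; the redaction filter should rewrite the password field in the rubydebug output:
#!/bin/bash
# test-logstash.sh: feed a sample event through the http input
kubectl -n logging run curl-test --rm -i --restart=Never --image=curlimages/curl -- \
  -s -X POST http://logstash:8080 -H "Content-Type: application/json" \
  -d '{"message": "login attempt password=supersecret", "kubernetes": {"namespace": "default"}}'
kubectl -n logging logs deployment/logstash --tail=30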
9.5.3 Collecting Logs with Filebeat
# Filebeat ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: filebeat-config
namespace: logging
data:
filebeat.yml: |
filebeat.inputs:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
- drop_event:
when:
or:
- contains:
kubernetes.container.name: "filebeat"
- contains:
kubernetes.container.name: "logstash"
- contains:
kubernetes.container.name: "elasticsearch"
output.logstash:
hosts: ["logstash:5044"]
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat
keepfiles: 7
permissions: 0644
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: filebeat
namespace: logging
spec:
selector:
matchLabels:
app: filebeat
template:
metadata:
labels:
app: filebeat
spec:
serviceAccountName: filebeat
terminationGracePeriodSeconds: 30
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: filebeat
image: docker.elastic.co/beats/filebeat:7.17.0
args: [
"-c", "/etc/filebeat.yml",
"-e",
]
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
securityContext:
runAsUser: 0
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 100Mi
volumeMounts:
- name: config
mountPath: /etc/filebeat.yml
readOnly: true
subPath: filebeat.yml
- name: data
mountPath: /usr/share/filebeat/data
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: varlog
mountPath: /var/log
readOnly: true
volumes:
- name: config
configMap:
defaultMode: 0640
name: filebeat-config
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: varlog
hostPath:
path: /var/log
- name: data
hostPath:
path: /var/lib/filebeat-data
type: DirectoryOrCreate
tolerations:
- operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: filebeat
namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: filebeat
rules:
- apiGroups: [""]
resources:
- nodes
- namespaces
- pods
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: filebeat
roleRef:
kind: ClusterRole
name: filebeat
apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
name: filebeat
namespace: logging
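Once Filebeat is running on every node, container logs should start landing in the daily indices within a minute or two. A quick way to confirm end-to-end flow:
#!/bin/bash
# verify-log-flow.sh: confirm container logs are reaching Elasticsearch
kubectl -n logging exec statefulset/elasticsearch -- \
  curl -s 'http://localhost:9200/_cat/indices/kubernetes-logs-*?v'
# Sample one recent document
kubectl -n logging exec statefulset/elasticsearch -- \
  curl -s 'http://localhost:9200/kubernetes-logs-*/_search?size=1&pretty'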
9.5.4 Kibana Visualization
# Kibana Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: kibana
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: kibana
template:
metadata:
labels:
app: kibana
spec:
containers:
- name: kibana
image: docker.elastic.co/kibana/kibana:7.17.0
env:
- name: ELASTICSEARCH_HOSTS
value: "http://elasticsearch:9200"
- name: SERVER_NAME
value: "kibana"
- name: SERVER_HOST
value: "0.0.0.0"
ports:
- containerPort: 5601
resources:
limits:
memory: 1Gi
requests:
memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
name: kibana
namespace: logging
spec:
selector:
app: kibana
type: NodePort
ports:
- port: 5601
targetPort: 5601
nodePort: 30601
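Kibana takes a minute or two to connect to Elasticsearch on first start; its status API reports when everything is ready (NODE_IP is a placeholder):
#!/bin/bash
# verify-kibana.sh: check Kibana's overall status
NODE_IP=${1:?usage: $0 <node-ip>}
curl -s "http://${NODE_IP}:30601/api/status" | python3 -c '
import json, sys
status = json.load(sys.stdin)
print(status["status"]["overall"]["state"])'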
9.6 Monitoring and Log Management Scripts
9.6.1 Monitoring Deployment Script
#!/bin/bash
# deploy-monitoring.sh
echo "=== 部署Kubernetes监控系统 ==="
# 创建命名空间
echo "1. 创建监控命名空间"
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace logging --dry-run=client -o yaml | kubectl apply -f -
# 创建RBAC
echo "2. 创建RBAC权限"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
EOF
# Deploy kube-state-metrics
echo "3. Deploying kube-state-metrics"
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/cluster-role-binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/cluster-role.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/service-account.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/master/examples/standard/service.yaml
# Wait for rollout
echo "4. Waiting for components to start"
kubectl wait --for=condition=available --timeout=300s deployment/kube-state-metrics -n kube-system
echo "5. 检查部署状态"
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get svc -n monitoring
kubectl get svc -n logging
echo "\n=== 监控系统部署完成 ==="
echo "Prometheus: http://localhost:30090"
echo "Grafana: http://localhost:30030 (admin/admin123)"
echo "AlertManager: http://localhost:30093"
echo "Kibana: http://localhost:30601"
9.6.2 Monitoring Health-Check Script
#!/bin/bash
# check-monitoring.sh
echo "=== Monitoring system health check ==="
# Check Prometheus
echo -e "\n1. Prometheus status:"
kubectl get pods -n monitoring -l app=prometheus
PROM_STATUS=$(kubectl get pods -n monitoring -l app=prometheus -o jsonpath='{.items[0].status.phase}')
if [ "$PROM_STATUS" = "Running" ]; then
    echo "✓ Prometheus is running"
else
    echo "✗ Prometheus is unhealthy: $PROM_STATUS"
fi
# Check Grafana
echo -e "\n2. Grafana status:"
kubectl get pods -n monitoring -l app=grafana
GRAFANA_STATUS=$(kubectl get pods -n monitoring -l app=grafana -o jsonpath='{.items[0].status.phase}')
if [ "$GRAFANA_STATUS" = "Running" ]; then
    echo "✓ Grafana is running"
else
    echo "✗ Grafana is unhealthy: $GRAFANA_STATUS"
fi
# Check AlertManager
echo -e "\n3. AlertManager status:"
kubectl get pods -n monitoring -l app=alertmanager
ALERT_STATUS=$(kubectl get pods -n monitoring -l app=alertmanager -o jsonpath='{.items[0].status.phase}')
if [ "$ALERT_STATUS" = "Running" ]; then
    echo "✓ AlertManager is running"
else
    echo "✗ AlertManager is unhealthy: $ALERT_STATUS"
fi
# Check Node Exporter
echo -e "\n4. Node Exporter status:"
kubectl get ds -n monitoring node-exporter
NODE_READY=$(kubectl get ds -n monitoring node-exporter -o jsonpath='{.status.numberReady}')
NODE_DESIRED=$(kubectl get ds -n monitoring node-exporter -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$NODE_READY" = "$NODE_DESIRED" ]; then
    echo "✓ Node Exporter is running ($NODE_READY/$NODE_DESIRED)"
else
    echo "✗ Node Exporter is unhealthy ($NODE_READY/$NODE_DESIRED)"
fi
# Check Elasticsearch
echo -e "\n5. Elasticsearch status:"
kubectl get pods -n logging -l app=elasticsearch
ES_STATUS=$(kubectl get pods -n logging -l app=elasticsearch -o jsonpath='{.items[0].status.phase}')
if [ "$ES_STATUS" = "Running" ]; then
    echo "✓ Elasticsearch is running"
else
    echo "✗ Elasticsearch is unhealthy: $ES_STATUS"
fi
# Check Logstash
echo -e "\n6. Logstash status:"
kubectl get pods -n logging -l app=logstash
LOGSTASH_STATUS=$(kubectl get pods -n logging -l app=logstash -o jsonpath='{.items[0].status.phase}')
if [ "$LOGSTASH_STATUS" = "Running" ]; then
    echo "✓ Logstash is running"
else
    echo "✗ Logstash is unhealthy: $LOGSTASH_STATUS"
fi
# Check Filebeat
echo -e "\n7. Filebeat status:"
kubectl get ds -n logging filebeat
FILEBEAT_READY=$(kubectl get ds -n logging filebeat -o jsonpath='{.status.numberReady}')
FILEBEAT_DESIRED=$(kubectl get ds -n logging filebeat -o jsonpath='{.status.desiredNumberScheduled}')
if [ "$FILEBEAT_READY" = "$FILEBEAT_DESIRED" ]; then
    echo "✓ Filebeat is running ($FILEBEAT_READY/$FILEBEAT_DESIRED)"
else
    echo "✗ Filebeat is unhealthy ($FILEBEAT_READY/$FILEBEAT_DESIRED)"
fi
# Check Kibana
echo -e "\n8. Kibana status:"
kubectl get pods -n logging -l app=kibana
KIBANA_STATUS=$(kubectl get pods -n logging -l app=kibana -o jsonpath='{.items[0].status.phase}')
if [ "$KIBANA_STATUS" = "Running" ]; then
    echo "✓ Kibana is running"
else
    echo "✗ Kibana is unhealthy: $KIBANA_STATUS"
fi
# Check service endpoints
echo -e "\n9. Service endpoints:"
echo "Prometheus: $(kubectl get svc -n monitoring prometheus -o jsonpath='{.spec.type}') port $(kubectl get svc -n monitoring prometheus -o jsonpath='{.spec.ports[0].nodePort}')"
echo "Grafana: $(kubectl get svc -n monitoring grafana -o jsonpath='{.spec.type}') port $(kubectl get svc -n monitoring grafana -o jsonpath='{.spec.ports[0].nodePort}')"
echo "AlertManager: $(kubectl get svc -n monitoring alertmanager -o jsonpath='{.spec.type}') port $(kubectl get svc -n monitoring alertmanager -o jsonpath='{.spec.ports[0].nodePort}')"
echo "Kibana: $(kubectl get svc -n logging kibana -o jsonpath='{.spec.type}') port $(kubectl get svc -n logging kibana -o jsonpath='{.spec.ports[0].nodePort}')"
echo -e "\n=== Monitoring system check complete ==="
9.6.3 Log Query Script
#!/bin/bash
# query-logs.sh
NAMESPACE=${1:-default}
POD_NAME=${2:-""}
CONTAINER=${3:-""}
LINES=${4:-100}
echo "=== Kubernetes log query tool ==="
echo "Namespace: $NAMESPACE"
echo "Pod: $POD_NAME"
echo "Container: $CONTAINER"
echo "Lines: $LINES"
echo ""
if [ -z "$POD_NAME" ]; then
    echo "Available pods:"
    kubectl get pods -n $NAMESPACE
    echo ""
    echo "Usage: $0 <namespace> <pod-name> [container-name] [lines]"
    exit 1
fi
# Check that the pod exists
if ! kubectl get pod $POD_NAME -n $NAMESPACE &>/dev/null; then
    echo "Error: pod $POD_NAME does not exist in namespace $NAMESPACE"
    exit 1
fi
# List the containers in the pod
if [ -z "$CONTAINER" ]; then
    CONTAINERS=$(kubectl get pod $POD_NAME -n $NAMESPACE -o jsonpath='{.spec.containers[*].name}')
    echo "Containers in the pod: $CONTAINERS"
    # If there is only one container, select it automatically
    CONTAINER_COUNT=$(echo $CONTAINERS | wc -w)
    if [ $CONTAINER_COUNT -eq 1 ]; then
        CONTAINER=$CONTAINERS
        echo "Automatically selected container: $CONTAINER"
    else
        echo "Please specify a container name"
        exit 1
    fi
fi
echo -e "\n=== Streaming logs (press Ctrl+C to exit) ==="
kubectl logs -f $POD_NAME -c $CONTAINER -n $NAMESPACE --tail=$LINES
9.6.4 Performance Monitoring Script
#!/bin/bash
# performance-monitor.sh
echo "=== Kubernetes performance report ==="
echo "Generated at: $(date)"
echo ""
# Cluster-wide resource inventory
echo "1. Cluster resource inventory:"
echo "Nodes: $(kubectl get nodes --no-headers | wc -l)"
echo "Total pods: $(kubectl get pods --all-namespaces --no-headers | wc -l)"
echo "Total services: $(kubectl get svc --all-namespaces --no-headers | wc -l)"
echo ""
# Node resource usage
echo "2. Node resource usage:"
kubectl top nodes 2>/dev/null || echo "metrics-server is required"
echo ""
# Top 10 pods by resource usage
echo "3. Top 10 pods by resource usage:"
echo "Pods with the highest CPU usage:"
kubectl top pods --all-namespaces --sort-by=cpu 2>/dev/null | head -11 || echo "metrics-server is required"
echo ""
echo "Pods with the highest memory usage:"
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -11 || echo "metrics-server is required"
echo ""
# Problem pods
echo "4. Problem pods:"
echo "Pods with frequent restarts:"
kubectl get pods --all-namespaces --field-selector=status.phase=Running -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount | awk 'NR>1 && $3>5'
echo ""
echo "Pods not in the Running phase:"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
echo ""
# Storage usage
echo "5. Storage usage:"
kubectl get pv
echo ""
kubectl get pvc --all-namespaces
echo ""
# Network policies
echo "6. Network policies:"
kubectl get networkpolicies --all-namespaces
echo ""
# Recent events
echo "7. Recent Warning events:"
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp' | tail -10
echo ""
echo "=== Performance report complete ==="
9.6.5 Alert Test Script
#!/bin/bash
# test-alerts.sh
echo "=== 告警系统测试 ==="
# 创建高CPU使用的测试Pod
echo "1. 创建高CPU使用测试Pod"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: cpu-stress-test
namespace: default
labels:
app: stress-test
spec:
containers:
- name: cpu-stress
image: progrium/stress
command: ["stress"]
args: ["--cpu", "2", "--timeout", "300s"]
resources:
requests:
cpu: 100m
memory: 128Mi
        limits:
          # generous enough for the PodHighCPU alert (> 0.8 cores) to fire
          cpu: "2"
          memory: 256Mi
EOF
echo "等待Pod启动..."
kubectl wait --for=condition=Ready pod/cpu-stress-test --timeout=60s
echo "\n2. 创建高内存使用测试Pod"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: memory-stress-test
namespace: default
labels:
app: stress-test
spec:
containers:
- name: memory-stress
image: progrium/stress
command: ["stress"]
args: ["--vm", "1", "--vm-bytes", "200M", "--timeout", "300s"]
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
EOF
echo "等待Pod启动..."
kubectl wait --for=condition=Ready pod/memory-stress-test --timeout=60s
echo "\n3. 监控告警状态"
echo "检查Prometheus告警状态..."
echo "请访问 http://localhost:30090/alerts 查看告警"
echo "请访问 http://localhost:30093 查看AlertManager"
echo "\n4. 等待5分钟后清理测试资源"
sleep 300
echo "\n5. 清理测试资源"
kubectl delete pod cpu-stress-test memory-stress-test
echo "\n=== 告警测试完成 ==="
9.7 Monitoring Best Practices
9.7.1 Monitoring Strategy
# Monitoring best-practices reference
apiVersion: v1
kind: ConfigMap
metadata:
name: monitoring-best-practices
data:
  monitoring-strategy: |
    1. Layered monitoring:
       - Infrastructure layer: nodes, network, storage
       - Platform layer: Kubernetes components
       - Application layer: business metrics
    2. Key metrics:
       - Golden signals: latency, traffic, errors, saturation
       - RED method: rate, errors, duration
       - USE method: utilization, saturation, errors
    3. Alerting strategy:
       - Base alerts on SLIs/SLOs
       - Avoid alert fatigue
       - Tier alerts by severity
    4. Data retention:
       - High-resolution data: 7-30 days
       - Medium-resolution data: 3-6 months
       - Low-resolution data: 1-2 years
  sli-slo-examples: |
    # Example service level indicators and objectives (a PromQL sketch follows this block)
    API availability:
    - SLI: successful requests / total requests
    - SLO: 99.9% (monthly)
    API latency:
    - SLI: 95th-percentile response time
    - SLO: < 200ms
    Error rate:
    - SLI: failed requests / total requests
    - SLO: < 0.1%
    Data durability:
    - SLI: successful backups / scheduled backups
    - SLO: 99.99%
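The availability SLI above translates directly into PromQL. A sketch against the Prometheus API, assuming the application exports a conventional http_requests_total counter with a code label (adjust metric and label names to your own instrumentation); the 7-day window stays inside the 200h retention configured earlier:
#!/bin/bash
# sli-query.sh: compute a 7-day availability SLI via the Prometheus API
NODE_IP=${1:?usage: $0 <node-ip>}
QUERY='sum(rate(http_requests_total{code!~"5.."}[7d])) / sum(rate(http_requests_total[7d]))'
curl -s "http://${NODE_IP}:30090/api/v1/query" --data-urlencode "query=${QUERY}"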
9.7.2 Logging Best Practices
apiVersion: v1
kind: ConfigMap
metadata:
name: logging-best-practices
data:
  log-levels: |
    # Log level usage guide
    ERROR: system errors that need immediate attention
    - Application crashes
    - Database connection failures
    - Unavailable external services
    WARN: potential problems worth watching
    - Retried operations
    - Performance degradation
    - Configuration issues
    INFO: significant business events
    - User login/logout
    - Completion of important operations
    - System startup/shutdown
    DEBUG: debugging information
    - Detailed execution flow
    - Variable values
    - Function calls
  log-format: |
    # Structured log format
    {
      "timestamp": "2023-01-01T12:00:00Z",
      "level": "INFO",
      "service": "user-service",
      "version": "v1.2.3",
      "trace_id": "abc123",
      "span_id": "def456",
      "user_id": "user123",
      "action": "login",
      "message": "User logged in successfully",
      "duration_ms": 150,
      "status_code": 200
    }
  log-retention: |
    # Log retention policy (an ILM policy sketch follows this block)
    Application logs:
    - Hot data: 7 days (fast queries)
    - Warm data: 30 days (routine queries)
    - Cold data: 90 days (archival storage)
    Audit logs:
    - Hot data: 30 days
    - Warm data: 1 year
    - Cold data: 7 years (compliance)
    System logs:
    - Hot data: 3 days
    - Warm data: 14 days
    - Cold data: 30 days
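In Elasticsearch these tiers map naturally onto an index lifecycle management (ILM) policy. A minimal sketch for the application-log tier, with phase timings taken from the table above (tune the actions to your cluster topology):
#!/bin/bash
# apply-ilm-policy.sh: an ILM policy matching the application-log retention tiers
kubectl -n logging exec statefulset/elasticsearch -- curl -s -X PUT \
  'http://localhost:9200/_ilm/policy/kubernetes-logs' \
  -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "hot":    { "min_age": "0d",  "actions": { "set_priority": { "priority": 100 } } },
      "warm":   { "min_age": "7d",  "actions": { "set_priority": { "priority": 50 } } },
      "cold":   { "min_age": "30d", "actions": { "set_priority": { "priority": 0 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}'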
9.8 Troubleshooting and Performance Tuning
9.8.1 Common Monitoring Issues
#!/bin/bash
# troubleshoot-monitoring.sh
echo "=== Monitoring system troubleshooting ==="
# Check Prometheus scrape targets
echo "1. Prometheus target status:"
echo "Visit http://localhost:30090/targets to inspect target health"
echo ""
# Check key metrics
echo "2. Key metrics to query:"
echo "up{job=\"kubernetes-nodes\"} - node status"
echo "up{job=\"kubernetes-apiservers\"} - API server status"
echo "up{job=\"kubernetes-cadvisor\"} - cAdvisor status"
echo ""
# Check storage usage
echo "3. Storage usage:"
kubectl exec -n monitoring deployment/prometheus -- df -h /prometheus
echo ""
# Check logs
echo "4. Prometheus logs:"
kubectl logs -n monitoring deployment/prometheus --tail=20
echo ""
# Check configuration reload
echo "5. Configuration reload:"
echo "POST http://localhost:30090/-/reload to reload the configuration"
echo ""
# Tuning suggestions
echo "6. Performance tuning suggestions:"
echo "- Increase scrape_interval to lower collection frequency"
echo "- Precompute expensive queries with recording rules"
echo "- Set an appropriate retention period"
echo "- Scale storage with remote write/read"
echo ""
echo "=== Troubleshooting complete ==="
9.8.2 Performance Tuning Configuration
# Prometheus performance-tuning configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-performance-config
data:
recording-rules.yml: |
groups:
- name: performance.rules
interval: 30s
rules:
      # Node CPU utilization
- record: node:cpu_utilization:rate5m
expr: |
100 - (
avg by (instance) (
irate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
)
      # Node memory utilization
- record: node:memory_utilization:ratio
expr: |
(
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / node_memory_MemTotal_bytes
      # Pod CPU usage
- record: pod:cpu_usage:rate5m
expr: |
sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
)
      # Pod memory usage ratio
- record: pod:memory_usage:ratio
expr: |
sum by (namespace, pod) (
container_memory_working_set_bytes{container!="POD",container!=""}
) / sum by (namespace, pod) (
container_spec_memory_limit_bytes{container!="POD",container!=""}
)
      # Cluster-wide resource usage rollup
- record: cluster:cpu_usage:rate5m
expr: |
sum(node:cpu_utilization:rate5m) / count(node:cpu_utilization:rate5m)
- record: cluster:memory_usage:ratio
expr: |
sum(node:memory_utilization:ratio) / count(node:memory_utilization:ratio)
  optimization-tips: |
    # Prometheus performance tuning tips
    1. Storage:
       - Use SSDs for better I/O performance
       - Set an appropriate retention period
       - Rely on TSDB compression to reduce disk usage
    2. Queries:
       - Precompute with recording rules
       - Avoid high-cardinality labels
       - Limit query time ranges
    3. Scraping and network:
       - Increase the scrape interval to reduce load
       - Use service discovery instead of static configs
       - Enable gzip compression
    4. Resources:
       - Set sensible memory limits
       - Shard across multiple Prometheus instances
       - Use federation for aggregation
9.9 Summary
This chapter covered Kubernetes monitoring and log management in detail. The main topics were:
9.9.1 Monitoring
- Monitoring architecture: the layered architecture and core components of Kubernetes monitoring
- Prometheus: deployment, configuration, and alerting rules
- Grafana: building and configuring visualization dashboards
- AlertManager: alert management and notification
9.9.2 Log Management
- ELK Stack: a complete log collection, processing, and visualization pipeline
- Filebeat: automatic collection of container logs
- Logstash: log parsing and transformation
- Kibana: log search and analysis
9.9.3 Best Practices
- Monitoring strategy: an SLI/SLO-based monitoring regime
- Alert management: tiered alerting and notification
- Performance tuning: methods for optimizing the monitoring stack
- Troubleshooting: diagnosing and resolving common issues
9.9.4 Key Takeaways
- Monitor comprehensively: cover the infrastructure, platform, and application layers
- Alert proactively: set sensible thresholds based on business metrics
- Aggregate logs: collect and manage logs from all components in one place
- Visualize: surface system state through dashboards
- Optimize continuously: adjust the monitoring strategy as usage evolves
9.9.5 Caveats
- Set resource limits so the monitoring stack cannot starve workloads
- Clean up historical data regularly to control storage costs
- Establish backup and recovery for the monitoring systems themselves
- Train team members on the monitoring tools
- Control access to monitoring data
With this chapter you have a complete solution for Kubernetes monitoring and log management. The next chapter covers Kubernetes security, including RBAC, network policies, and pod security policies.