学习目标
通过本章学习,您将能够:
- 理解 Docker Swarm 监控架构和指标体系
- 掌握集群和服务监控的配置方法
- 学会日志收集、聚合和分析
- 了解告警和通知机制的设置
- 掌握性能优化和故障排除技巧
1. 监控架构概述
1.1 监控体系架构
监控层次结构
# Docker Swarm 监控层次:
# 1. 基础设施层监控
# - 主机资源(CPU、内存、磁盘、网络)
# - 操作系统指标
# - 硬件状态
# - 网络连通性
# 2. Docker 引擎层监控
# - Docker 守护进程状态
# - 容器运行时指标
# - 镜像和存储使用
# - 网络和存储驱动
# 3. Swarm 集群层监控
# - 节点状态和健康度
# - 服务部署状态
# - 任务调度情况
# - 集群网络状态
# 4. 应用层监控
# - 应用性能指标
# - 业务逻辑监控
# - 用户体验指标
# - 自定义业务指标
监控架构图
# 完整监控架构:
┌─────────────────────────────────────────────────────────────┐
│ Monitoring Stack │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Grafana │ │ Alertmanager│ │ Kibana │ │
│ │ (Dashboard) │ │ (Alerting) │ │ (Log Viz) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Prometheus │ │
│ │ (Metrics Storage) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Elasticsearch │ │
│ │ (Log Storage) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Docker Swarm Cluster │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Manager │ │ Worker │ │ Worker │ │
│ │ Node │ │ Node │ │ Node │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ Node │ │ │ │ Node │ │ │ │ Node │ │ │
│ │ │Exporter │ │ │ │Exporter │ │ │ │Exporter │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │cAdvisor │ │ │ │cAdvisor │ │ │ │cAdvisor │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │Filebeat │ │ │ │Filebeat │ │ │ │Filebeat │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
1.2 监控指标体系
核心监控指标
# 1. 集群级别指标
# - 节点数量和状态
# - 服务数量和健康状态
# - 任务分布和调度情况
# - 网络连接状态
# 2. 节点级别指标
# - CPU 使用率和负载
# - 内存使用率和可用性
# - 磁盘使用率和 I/O
# - 网络流量和连接数
# 3. 服务级别指标
# - 副本数量和期望状态
# - 服务响应时间
# - 错误率和成功率
# - 资源使用情况
# 4. 容器级别指标
# - 容器状态和重启次数
# - 资源限制和使用
# - 网络和存储 I/O
# - 进程和线程数量
监控指标收集
# 使用 Docker 内置指标
docker stats --no-stream
docker system df
docker system events
# 查看集群指标
docker node ls
docker service ls
docker stack ls
# 查看服务详细指标
docker service ps <service-name>
docker service inspect <service-name>
# 查看容器指标
docker container stats
docker container inspect <container-id>
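上述命令只能交互式查看即时状态;如果希望 Prometheus 抓取 Docker 引擎自身的指标(对应后文抓取配置中的 9323 端口),需要在每个节点的 daemon.json 中开启指标端点。下面是一个参考做法,监听地址与端口为示例值,旧版本 Docker 可能还需要同时开启 experimental:
```bash
# 在每个节点上开启 Docker 引擎指标端点(示例配置,如已有 daemon.json 请手工合并)
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "metrics-addr": "0.0.0.0:9323"
}
EOF
# 重启 Docker 使配置生效
sudo systemctl restart docker
# 验证指标端点是否可访问
curl -s http://localhost:9323/metrics | head
```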
2. Prometheus 监控部署
2.1 Prometheus 集群部署
部署 Prometheus Stack
# prometheus-stack.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- prometheus-data:/prometheus
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./rules:/etc/prometheus/rules:ro
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=15d'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
deploy:
replicas: 1
placement:
constraints:
- node.role == manager
resources:
limits:
memory: 2G
cpus: '1.0'
reservations:
memory: 1G
cpus: '0.5'
networks:
- monitoring
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
- GF_INSTALL_PLUGINS=grafana-piechart-panel
deploy:
replicas: 1
placement:
constraints:
- node.role == manager
resources:
limits:
memory: 512M
cpus: '0.5'
reservations:
memory: 256M
cpus: '0.2'
networks:
- monitoring
alertmanager:
image: prom/alertmanager:latest
ports:
- "9093:9093"
volumes:
- alertmanager-data:/alertmanager
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=http://localhost:9093'
deploy:
replicas: 1
placement:
constraints:
- node.role == manager
resources:
limits:
memory: 256M
cpus: '0.3'
reservations:
memory: 128M
cpus: '0.1'
networks:
- monitoring
node-exporter:
image: prom/node-exporter:latest
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
deploy:
mode: global
resources:
limits:
memory: 128M
cpus: '0.2'
reservations:
memory: 64M
cpus: '0.1'
networks:
- monitoring
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
privileged: true
devices:
- /dev/kmsg
deploy:
mode: global
resources:
limits:
memory: 256M
cpus: '0.3'
reservations:
memory: 128M
cpus: '0.1'
networks:
- monitoring
volumes:
prometheus-data:
grafana-data:
alertmanager-data:
networks:
monitoring:
driver: overlay
attachable: true
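下面是部署该监控栈的一个参考流程(假设编排文件保存为 prometheus-stack.yml,栈名 monitoring 为示例;overlay 网络由编排文件自动创建):
```bash
# 在 manager 节点上部署监控栈
docker stack deploy -c prometheus-stack.yml monitoring
# 查看各服务副本是否就绪
docker stack services monitoring
# node-exporter 和 cadvisor 为 global 模式,每个节点应各运行一个任务
docker stack ps monitoring --no-trunc
```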
Prometheus 配置文件
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'docker-swarm'
replica: 'prometheus-1'
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus 自身监控
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter 监控
- job_name: 'node-exporter'
dns_sd_configs:
- names:
- 'tasks.node-exporter'
type: 'A'
port: 9100
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: node-exporter:9100
# cAdvisor 监控
- job_name: 'cadvisor'
dns_sd_configs:
- names:
- 'tasks.cadvisor'
type: 'A'
port: 8080
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: cadvisor:8080
# Docker Engine 监控
- job_name: 'docker-engine'
static_configs:
- targets:
- 'manager-node:9323'
- 'worker-node-1:9323'
- 'worker-node-2:9323'
metrics_path: /metrics
# 应用服务监控
- job_name: 'app-services'
dns_sd_configs:
- names:
- 'tasks.web-app'
type: 'A'
port: 8080
metrics_path: /metrics
scrape_interval: 30s
# Swarm 集群监控
- job_name: 'swarm-endpoints'
static_configs:
- targets:
- 'manager-node:2376'
metrics_path: /metrics
scheme: https
tls_config:
ca_file: /etc/docker/ca.pem
cert_file: /etc/docker/cert.pem
key_file: /etc/docker/key.pem
insecure_skip_verify: true
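修改抓取配置后,可以先用 promtool 做语法检查,再通过 HTTP API 确认目标是否在线。下面的命令假设 Prometheus 以栈名 monitoring 部署并发布在本机 9090 端口:
```bash
# 在运行中的 Prometheus 容器内校验配置语法
docker exec "$(docker ps -q -f name=monitoring_prometheus)" \
  promtool check config /etc/prometheus/prometheus.yml
# 已启用 --web.enable-lifecycle,可在线重载配置
curl -X POST http://localhost:9090/-/reload
# 统计各抓取目标的健康状态
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c
```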
2.2 告警规则配置
集群告警规则
# rules/swarm-alerts.yml
groups:
- name: swarm-cluster
rules:
# 节点离线告警
- alert: SwarmNodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Swarm node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute."
# 管理节点数量告警
- alert: SwarmManagerCount
expr: count(swarm_manager_nodes) < 3
for: 5m
labels:
severity: warning
annotations:
summary: "Insufficient Swarm managers"
description: "Only {{ $value }} manager nodes available, recommended minimum is 3."
# 服务副本不足告警
- alert: SwarmServiceReplicasDown
expr: |
(
swarm_service_spec_replicas - swarm_service_running_replicas
) > 0
for: 2m
labels:
severity: warning
annotations:
summary: "Service {{ $labels.service_name }} has insufficient replicas"
description: "Service {{ $labels.service_name }} has {{ $value }} missing replicas."
# 任务失败告警
- alert: SwarmTaskFailed
expr: increase(swarm_task_desired_state{state="failed"}[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Swarm task failed"
description: "Task {{ $labels.task_name }} in service {{ $labels.service_name }} has failed."
- name: node-resources
rules:
# CPU 使用率告警
- alert: HighCPUUsage
expr: |
(
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}."
# 内存使用率告警
- alert: HighMemoryUsage
expr: |
(
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
) > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}% on {{ $labels.instance }}."
# 磁盘空间告警
- alert: HighDiskUsage
expr: |
(
(node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"}) / node_filesystem_size_bytes{fstype!="tmpfs"} * 100
) > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High disk usage on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}% on {{ $labels.instance }} ({{ $labels.mountpoint }})."
# 磁盘 I/O 告警
- alert: HighDiskIO
expr: |
(
rate(node_disk_io_time_seconds_total[5m]) * 100
) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High disk I/O on {{ $labels.instance }}"
description: "Disk I/O utilization is {{ $value }}% on {{ $labels.instance }} ({{ $labels.device }})."
- name: container-resources
rules:
# 容器重启告警
- alert: ContainerRestarting
expr: increase(container_start_time_seconds[1h]) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} is restarting frequently"
description: "Container {{ $labels.name }} has restarted {{ $value }} times in the last hour."
# 容器内存使用告警
- alert: ContainerHighMemoryUsage
expr: |
(
container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100
) > 90
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high memory usage"
description: "Container {{ $labels.name }} memory usage is {{ $value }}%."
# 容器 CPU 使用告警
- alert: ContainerHighCPUUsage
expr: |
(
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100
) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high CPU usage"
description: "Container {{ $labels.name }} CPU usage is {{ $value }}%."
2.3 Alertmanager 配置
告警管理配置
# alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@example.com'
smtp_auth_username: 'alerts@example.com'
smtp_auth_password: 'your-app-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 5s
repeat_interval: 30m
- match:
severity: warning
receiver: 'warning-alerts'
repeat_interval: 2h
- match_re:
service: '^(web|api|database).*'
receiver: 'app-team'
receivers:
- name: 'default-receiver'
email_configs:
- to: 'admin@example.com'
subject: '[ALERT] {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
{{ end }}
- name: 'critical-alerts'
email_configs:
- to: 'oncall@example.com'
subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
body: |
CRITICAL ALERT!
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Time: {{ .StartsAt }}
{{ end }}
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: 'Critical Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
{{ end }}
- name: 'warning-alerts'
email_configs:
- to: 'team@example.com'
subject: '[WARNING] {{ .GroupLabels.alertname }}'
body: |
Warning Alert
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
- name: 'app-team'
email_configs:
- to: 'app-team@example.com'
subject: '[APP] {{ .GroupLabels.alertname }}'
body: |
Application Alert
{{ range .Alerts }}
Service: {{ .Labels.service }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
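Alertmanager 配置同样可以先用镜像自带的 amtool 校验,再手工注入一条测试告警,验证路由是否命中预期的 receiver(假设 Alertmanager 发布在本机 9093 端口):
```bash
# 校验 alertmanager.yml 语法
docker run --rm --entrypoint amtool \
  -v "$PWD/alertmanager.yml:/alertmanager.yml:ro" \
  prom/alertmanager:latest check-config /alertmanager.yml
# 注入一条 severity=warning 的测试告警,观察邮件路由是否生效
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning", "instance": "manual-test"}}]'
```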
3. 日志管理
3.1 ELK Stack 部署
Elasticsearch 集群
# elk-stack.yml
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
environment:
- node.name=elasticsearch
- cluster.name=docker-swarm-logs
- discovery.type=single-node
- bootstrap.memory_lock=true
- "ES_JAVA_OPTS=-Xms1g -Xmx1g"
- xpack.security.enabled=false
- xpack.security.enrollment.enabled=false
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- elasticsearch-data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
deploy:
replicas: 1
placement:
constraints:
- node.role == manager
resources:
limits:
memory: 2G
cpus: '1.0'
reservations:
memory: 1G
cpus: '0.5'
networks:
- logging
logstash:
image: docker.elastic.co/logstash/logstash:8.8.0
volumes:
- ./logstash/config:/usr/share/logstash/config:ro
- ./logstash/pipeline:/usr/share/logstash/pipeline:ro
ports:
- "5044:5044"
- "5000:5000/tcp"
- "5000:5000/udp"
- "9600:9600"
environment:
- "LS_JAVA_OPTS=-Xmx512m -Xms512m"
depends_on:
- elasticsearch
deploy:
replicas: 1
placement:
constraints:
- node.role == manager
resources:
limits:
memory: 1G
cpus: '0.5'
reservations:
memory: 512M
cpus: '0.3'
networks:
- logging
kibana:
image: docker.elastic.co/kibana/kibana:8.8.0
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
- SERVER_NAME=kibana
- SERVER_HOST=0.0.0.0
depends_on:
- elasticsearch
deploy:
replicas: 1
placement:
constraints:
- node.role == manager
resources:
limits:
memory: 1G
cpus: '0.5'
reservations:
memory: 512M
cpus: '0.3'
networks:
- logging
filebeat:
image: docker.elastic.co/beats/filebeat:8.8.0
user: root
volumes:
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /var/log:/var/log:ro
- filebeat-data:/usr/share/filebeat/data
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
- LOGSTASH_HOSTS=logstash:5044
depends_on:
- elasticsearch
- logstash
deploy:
mode: global
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
networks:
- logging
volumes:
elasticsearch-data:
filebeat-data:
networks:
logging:
driver: overlay
attachable: true
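部署日志栈后,建议先确认 Elasticsearch 集群健康状态,再继续配置采集端(假设栈名为 logging,Elasticsearch 发布在本机 9200 端口):
```bash
# 部署日志栈(栈名 logging 为示例)
docker stack deploy -c elk-stack.yml logging
# 检查各服务副本是否启动
docker stack services logging
# 等待 Elasticsearch 就绪后查看集群健康状态(单节点通常为 green 或 yellow)
curl -s http://localhost:9200/_cluster/health?pretty
```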
Filebeat 配置
# filebeat/filebeat.yml
filebeat.inputs:
# Docker 容器日志
- type: container
paths:
- '/var/lib/docker/containers/*/*.log'
processors:
- add_docker_metadata:
host: "unix:///var/run/docker.sock"
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
# 系统日志
- type: log
paths:
- /var/log/syslog
- /var/log/auth.log
fields:
log_type: system
fields_under_root: true
# Docker 守护进程日志
- type: log
paths:
- /var/log/docker.log
fields:
log_type: docker
fields_under_root: true
# 处理器配置
processors:
- add_host_metadata:
when.not.contains.tags: forwarded
- add_cloud_metadata: ~
- add_docker_metadata: ~
- add_kubernetes_metadata: ~
# 输出配置
output.logstash:
hosts: ["logstash:5044"]
# 或直接输出到 Elasticsearch
# output.elasticsearch:
# hosts: ["elasticsearch:9200"]
# index: "swarm-logs-%{+yyyy.MM.dd}"
# 日志级别
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat
keepfiles: 7
permissions: 0644
# 监控
monitoring.enabled: true
monitoring.elasticsearch:
hosts: ["elasticsearch:9200"]
Logstash 管道配置
# logstash/pipeline/docker-logs.conf
input {
beats {
port => 5044
}
}
filter {
# 处理 Docker 容器日志
if [container] {
# 添加容器信息
mutate {
add_field => {
"service_name" => "%{[container][labels][com.docker.swarm.service.name]}"
"task_name" => "%{[container][labels][com.docker.swarm.task.name]}"
"node_id" => "%{[container][labels][com.docker.swarm.node.id]}"
"stack_name" => "%{[container][labels][com.docker.stack.namespace]}"
}
}
# 解析时间戳
date {
match => [ "@timestamp", "ISO8601" ]
}
# 解析 JSON 日志
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# 解析 Nginx 访问日志
if [service_name] == "nginx" {
grok {
match => {
"message" => "%{NGINXACCESS}"
}
}
# 转换响应时间为数字
mutate {
convert => {
"response_time" => "float"
"status" => "integer"
"body_bytes_sent" => "integer"
}
}
}
# 解析应用日志级别
if [message] =~ /(ERROR|WARN|INFO|DEBUG)/ {
grok {
match => {
"message" => "(?<log_timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<log_level>\w+)\] (?<log_message>.*)"
}
}
}
}
# 处理系统日志
if [log_type] == "system" {
grok {
match => {
"message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:log_message}"
}
}
}
# 添加地理位置信息(如果有 IP 地址)
if [client_ip] {
geoip {
source => "client_ip"
target => "geoip"
}
}
# 移除不需要的字段
mutate {
remove_field => [ "agent", "ecs", "host", "input" ]
}
}
output {
  # 根据服务名称写入不同的索引,否则落入按天滚动的通用索引
  # (Logstash 的条件判断必须包在插件块外层,不能写在 elasticsearch 块内部)
  if [service_name] {
    elasticsearch {
      hosts => ["elasticsearch:9200"]
      index => "swarm-%{service_name}-%{+yyyy.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["elasticsearch:9200"]
      index => "swarm-logs-%{+yyyy.MM.dd}"
    }
  }
  # 调试输出
  # stdout {
  #   codec => rubydebug
  # }
}
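管道生效后,可以通过 Elasticsearch 的 _cat 与 _search 接口确认索引是否按天创建、日志是否可检索(假设 Elasticsearch 发布在本机 9200 端口,查询字段名以实际解析结果为准):
```bash
# 查看按天滚动的日志索引及文档数量
curl -s 'http://localhost:9200/_cat/indices/swarm-*?v&s=index'
# 检索最近 5 条包含 error 的日志(示例查询)
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/swarm-logs-*/_search?size=5' \
  -d '{"query": {"match": {"message": "error"}}, "sort": [{"@timestamp": "desc"}]}'
```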
3.2 日志收集策略
日志收集脚本
#!/bin/bash
# log-collector.sh
LOG_DIR="/var/log/swarm"
ARCHIVE_DIR="/var/log/swarm/archive"
RETENTION_DAYS=30
COMPRESS_DAYS=7
# 创建日志目录
mkdir -p $LOG_DIR $ARCHIVE_DIR
# 日志函数
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a $LOG_DIR/collector.log
}
# 收集集群状态日志
collect_cluster_logs() {
local date_suffix=$(date +%Y%m%d_%H%M%S)
log "Collecting cluster status logs..."
# 集群信息
{
echo "=== Cluster Info ==="
docker info --format 'Swarm: {{.Swarm.LocalNodeState}}'
docker info --format 'Nodes: {{.Swarm.Nodes}}'
docker info --format 'Managers: {{.Swarm.Managers}}'
echo
echo "=== Node Status ==="
docker node ls
echo
echo "=== Service Status ==="
docker service ls
echo
echo "=== Stack Status ==="
docker stack ls
echo
echo "=== Network Status ==="
docker network ls
echo
echo "=== Volume Status ==="
docker volume ls
echo
} > $LOG_DIR/cluster-status-$date_suffix.log
log "Cluster status logs saved to cluster-status-$date_suffix.log"
}
# 收集服务日志
collect_service_logs() {
local service_name=$1
local lines=${2:-1000}
local date_suffix=$(date +%Y%m%d_%H%M%S)
if [ -z "$service_name" ]; then
log "Collecting logs for all services..."
for service in $(docker service ls --format '{{.Name}}'); do
log "Collecting logs for service: $service"
docker service logs --tail $lines $service > $LOG_DIR/service-${service}-$date_suffix.log 2>&1
done
else
log "Collecting logs for service: $service_name"
docker service logs --tail $lines $service_name > $LOG_DIR/service-${service_name}-$date_suffix.log 2>&1
fi
}
# 收集容器日志
collect_container_logs() {
local date_suffix=$(date +%Y%m%d_%H%M%S)
log "Collecting container logs..."
for container in $(docker ps --format '{{.Names}}'); do
log "Collecting logs for container: $container"
docker logs --tail 1000 $container > $LOG_DIR/container-${container}-$date_suffix.log 2>&1
done
}
# 收集系统事件
collect_system_events() {
local date_suffix=$(date +%Y%m%d_%H%M%S)
log "Collecting system events..."
    # Docker 事件(最近1小时;指定 --until="now" 时命令输出完历史事件即退出,无需后台运行再终止)
    docker events --since="1h" --until="now" > $LOG_DIR/docker-events-$date_suffix.log 2>&1
# 系统日志
if [ -f "/var/log/syslog" ]; then
tail -1000 /var/log/syslog > $LOG_DIR/syslog-$date_suffix.log
fi
# Docker 守护进程日志
if command -v journalctl > /dev/null; then
journalctl -u docker --since="1 hour ago" > $LOG_DIR/docker-daemon-$date_suffix.log
fi
}
# 压缩旧日志
compress_old_logs() {
log "Compressing logs older than $COMPRESS_DAYS days..."
find $LOG_DIR -name "*.log" -type f -mtime +$COMPRESS_DAYS ! -name "*.gz" -exec gzip {} \;
log "Log compression completed"
}
# 清理过期日志
cleanup_old_logs() {
log "Cleaning up logs older than $RETENTION_DAYS days..."
# 移动到归档目录
find $LOG_DIR -name "*.log.gz" -type f -mtime +$RETENTION_DAYS -exec mv {} $ARCHIVE_DIR/ \;
# 删除非常旧的归档
find $ARCHIVE_DIR -name "*.log.gz" -type f -mtime +$((RETENTION_DAYS * 2)) -delete
log "Log cleanup completed"
}
# 生成日志报告
generate_log_report() {
local report_file="$LOG_DIR/log-report-$(date +%Y%m%d_%H%M%S).html"
log "Generating log report: $report_file"
    # 使用不加引号的 EOF,使 heredoc 中的 $(date) 能在生成报告时展开
    cat > $report_file << EOF
<!DOCTYPE html>
<html>
<head>
<title>Swarm Log Report</title>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
.header { background-color: #f0f0f0; padding: 10px; border-radius: 5px; }
.section { margin: 20px 0; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
.error { color: red; }
.warning { color: orange; }
.info { color: blue; }
</style>
</head>
<body>
<div class="header">
<h1>Docker Swarm Log Report</h1>
<p>Generated: $(date)</p>
</div>
EOF
# 添加日志统计
echo " <div class='section'>" >> $report_file
echo " <h2>Log Statistics</h2>" >> $report_file
echo " <table>" >> $report_file
echo " <tr><th>Log Type</th><th>Count</th><th>Size</th></tr>" >> $report_file
# 统计各类日志
for log_type in cluster service container events; do
count=$(find $LOG_DIR -name "${log_type}-*.log*" | wc -l)
size=$(find $LOG_DIR -name "${log_type}-*.log*" -exec du -ch {} + | tail -1 | cut -f1)
echo " <tr><td>$log_type</td><td>$count</td><td>$size</td></tr>" >> $report_file
done
echo " </table>" >> $report_file
echo " </div>" >> $report_file
# 添加错误统计
echo " <div class='section'>" >> $report_file
echo " <h2>Error Summary</h2>" >> $report_file
echo " <table>" >> $report_file
echo " <tr><th>Service</th><th>Error Count</th><th>Last Error</th></tr>" >> $report_file
for service in $(docker service ls --format '{{.Name}}'); do
error_count=$(docker service logs $service 2>&1 | grep -i error | wc -l)
last_error=$(docker service logs --tail 100 $service 2>&1 | grep -i error | tail -1 | cut -c1-100)
echo " <tr><td>$service</td><td>$error_count</td><td>$last_error</td></tr>" >> $report_file
done
echo " </table>" >> $report_file
echo " </div>" >> $report_file
# 结束 HTML
echo "</body></html>" >> $report_file
log "Log report generated: $report_file"
}
# 主菜单
case "$1" in
"cluster")
collect_cluster_logs
;;
"service")
collect_service_logs $2 $3
;;
"container")
collect_container_logs
;;
"events")
collect_system_events
;;
"compress")
compress_old_logs
;;
"cleanup")
cleanup_old_logs
;;
"report")
generate_log_report
;;
"all")
collect_cluster_logs
collect_service_logs
collect_container_logs
collect_system_events
compress_old_logs
cleanup_old_logs
generate_log_report
;;
*)
echo "Usage: $0 {cluster|service|container|events|compress|cleanup|report|all}"
echo " cluster - Collect cluster status logs"
echo " service [name] [lines] - Collect service logs"
echo " container - Collect container logs"
echo " events - Collect system events"
echo " compress - Compress old logs"
echo " cleanup - Clean up old logs"
echo " report - Generate log report"
echo " all - Run all collection tasks"
;;
esac
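该脚本可以配合 cron 周期性执行,例如每小时收集一次集群与服务日志,每天凌晨做压缩、清理并生成报告。以下 crontab 片段仅作参考,安装路径为假设值:
```bash
# 安装脚本
sudo install -m 755 log-collector.sh /usr/local/bin/log-collector.sh
# 追加 cron 任务(保留已有任务)
( crontab -l 2>/dev/null; cat <<'EOF'
0 * * * * /usr/local/bin/log-collector.sh cluster
30 * * * * /usr/local/bin/log-collector.sh service
0 2 * * * /usr/local/bin/log-collector.sh compress && /usr/local/bin/log-collector.sh cleanup && /usr/local/bin/log-collector.sh report
EOF
) | crontab -
```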
4. 性能监控和优化
4.1 性能指标监控
自定义性能监控脚本
#!/bin/bash
# performance-monitor.sh
MONITOR_INTERVAL=30
REPORT_DIR="/var/reports/performance"
LOG_FILE="/var/log/performance-monitor.log"
THRESHOLD_CPU=80
THRESHOLD_MEMORY=85
THRESHOLD_DISK=90
# 创建目录
mkdir -p $REPORT_DIR
# 日志函数
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a $LOG_FILE
}
# 收集系统性能指标
collect_system_metrics() {
local timestamp=$(date +%s)
local date_str=$(date '+%Y-%m-%d %H:%M:%S')
# CPU 使用率
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
# 内存使用率
local mem_total=$(free | grep Mem | awk '{print $2}')
local mem_used=$(free | grep Mem | awk '{print $3}')
local mem_usage=$(echo "scale=2; $mem_used * 100 / $mem_total" | bc)
# 磁盘使用率
local disk_usage=$(df / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
# 负载平均值
local load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | cut -d',' -f1)
# 网络流量
local rx_bytes=$(cat /proc/net/dev | grep eth0 | awk '{print $2}')
local tx_bytes=$(cat /proc/net/dev | grep eth0 | awk '{print $10}')
# 输出指标
echo "$timestamp,$date_str,$cpu_usage,$mem_usage,$disk_usage,$load_avg,$rx_bytes,$tx_bytes" >> $REPORT_DIR/system-metrics.csv
# 检查阈值
if (( $(echo "$cpu_usage > $THRESHOLD_CPU" | bc -l) )); then
log "WARNING: High CPU usage: ${cpu_usage}%"
fi
if (( $(echo "$mem_usage > $THRESHOLD_MEMORY" | bc -l) )); then
log "WARNING: High memory usage: ${mem_usage}%"
fi
if [ "$disk_usage" -gt "$THRESHOLD_DISK" ]; then
log "WARNING: High disk usage: ${disk_usage}%"
fi
}
# 收集 Docker 性能指标
collect_docker_metrics() {
local timestamp=$(date +%s)
local date_str=$(date '+%Y-%m-%d %H:%M:%S')
# 容器统计
local container_count=$(docker ps -q | wc -l)
local image_count=$(docker images -q | wc -l)
local volume_count=$(docker volume ls -q | wc -l)
local network_count=$(docker network ls -q | wc -l)
    # Docker 存储使用(各类型的可读大小,按分号拼接;直接对带单位的字符串求和没有意义)
    local docker_size=$(docker system df --format '{{.Type}}:{{.Size}}' | paste -sd';' -)
echo "$timestamp,$date_str,$container_count,$image_count,$volume_count,$network_count,$docker_size" >> $REPORT_DIR/docker-metrics.csv
}
# 收集服务性能指标
collect_service_metrics() {
local timestamp=$(date +%s)
local date_str=$(date '+%Y-%m-%d %H:%M:%S')
for service in $(docker service ls --format '{{.Name}}'); do
# 服务副本状态
local replicas=$(docker service ls --filter name=$service --format '{{.Replicas}}')
local running=$(echo $replicas | cut -d'/' -f1)
local desired=$(echo $replicas | cut -d'/' -f2)
# 服务任务状态
local tasks_running=$(docker service ps $service --filter "desired-state=running" --format '{{.CurrentState}}' | grep -c "Running")
local tasks_failed=$(docker service ps $service --filter "desired-state=shutdown" --format '{{.CurrentState}}' | grep -c "Failed")
echo "$timestamp,$date_str,$service,$running,$desired,$tasks_running,$tasks_failed" >> $REPORT_DIR/service-metrics.csv
# 检查服务健康状态
if [ "$running" -lt "$desired" ]; then
log "WARNING: Service $service has insufficient replicas: $running/$desired"
fi
if [ "$tasks_failed" -gt 0 ]; then
log "WARNING: Service $service has $tasks_failed failed tasks"
fi
done
}
# 收集容器资源使用
collect_container_resources() {
local timestamp=$(date +%s)
local date_str=$(date '+%Y-%m-%d %H:%M:%S')
# 获取容器资源使用情况
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}\t{{.BlockIO}}" | tail -n +2 | while read line; do
local container=$(echo $line | awk '{print $1}')
local cpu_perc=$(echo $line | awk '{print $2}' | cut -d'%' -f1)
local mem_usage=$(echo $line | awk '{print $3}')
local mem_perc=$(echo $line | awk '{print $4}' | cut -d'%' -f1)
local net_io=$(echo $line | awk '{print $5}')
local block_io=$(echo $line | awk '{print $6}')
echo "$timestamp,$date_str,$container,$cpu_perc,$mem_usage,$mem_perc,$net_io,$block_io" >> $REPORT_DIR/container-resources.csv
done
}
# 生成性能报告
generate_performance_report() {
local report_file="$REPORT_DIR/performance-report-$(date +%Y%m%d_%H%M%S).html"
log "Generating performance report: $report_file"
    # 使用不加引号的 EOF,使 heredoc 中的 $(date) 能在生成报告时展开
    cat > $report_file << EOF
<!DOCTYPE html>
<html>
<head>
<title>Swarm Performance Report</title>
<style>
body { font-family: Arial, sans-serif; margin: 20px; }
.header { background-color: #f0f0f0; padding: 10px; border-radius: 5px; }
.section { margin: 20px 0; }
.metric { display: inline-block; margin: 10px; padding: 10px; border: 1px solid #ddd; border-radius: 5px; }
.high { background-color: #ffebee; }
.medium { background-color: #fff3e0; }
.low { background-color: #e8f5e8; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
th { background-color: #f2f2f2; }
</style>
</head>
<body>
<div class="header">
<h1>Docker Swarm Performance Report</h1>
<p>Generated: $(date)</p>
</div>
EOF
# 系统性能概览
echo " <div class='section'>" >> $report_file
echo " <h2>System Performance Overview</h2>" >> $report_file
# 获取最新的系统指标
if [ -f "$REPORT_DIR/system-metrics.csv" ]; then
local latest_metrics=$(tail -1 $REPORT_DIR/system-metrics.csv)
local cpu_usage=$(echo $latest_metrics | cut -d',' -f3)
local mem_usage=$(echo $latest_metrics | cut -d',' -f4)
local disk_usage=$(echo $latest_metrics | cut -d',' -f5)
local load_avg=$(echo $latest_metrics | cut -d',' -f6)
# 确定性能等级
local cpu_class="low"
local mem_class="low"
local disk_class="low"
if (( $(echo "$cpu_usage > 80" | bc -l) )); then cpu_class="high"; elif (( $(echo "$cpu_usage > 60" | bc -l) )); then cpu_class="medium"; fi
if (( $(echo "$mem_usage > 85" | bc -l) )); then mem_class="high"; elif (( $(echo "$mem_usage > 70" | bc -l) )); then mem_class="medium"; fi
if [ "$disk_usage" -gt 90 ]; then disk_class="high"; elif [ "$disk_usage" -gt 75 ]; then disk_class="medium"; fi
echo " <div class='metric $cpu_class'>" >> $report_file
echo " <h3>CPU Usage</h3>" >> $report_file
echo " <p>${cpu_usage}%</p>" >> $report_file
echo " </div>" >> $report_file
echo " <div class='metric $mem_class'>" >> $report_file
echo " <h3>Memory Usage</h3>" >> $report_file
echo " <p>${mem_usage}%</p>" >> $report_file
echo " </div>" >> $report_file
echo " <div class='metric $disk_class'>" >> $report_file
echo " <h3>Disk Usage</h3>" >> $report_file
echo " <p>${disk_usage}%</p>" >> $report_file
echo " </div>" >> $report_file
echo " <div class='metric low'>" >> $report_file
echo " <h3>Load Average</h3>" >> $report_file
echo " <p>$load_avg</p>" >> $report_file
echo " </div>" >> $report_file
fi
echo " </div>" >> $report_file
# 服务状态表
echo " <div class='section'>" >> $report_file
echo " <h2>Service Status</h2>" >> $report_file
echo " <table>" >> $report_file
echo " <tr><th>Service</th><th>Replicas</th><th>Running Tasks</th><th>Failed Tasks</th><th>Status</th></tr>" >> $report_file
for service in $(docker service ls --format '{{.Name}}'); do
local replicas=$(docker service ls --filter name=$service --format '{{.Replicas}}')
local running=$(echo $replicas | cut -d'/' -f1)
local desired=$(echo $replicas | cut -d'/' -f2)
        # grep -c 在无匹配时本身就会输出 0,这里不再追加 || echo,避免变量中出现两行
        local tasks_running=$(docker service ps $service --filter "desired-state=running" --format '{{.CurrentState}}' | grep -c "Running")
        local tasks_failed=$(docker service ps $service --filter "desired-state=shutdown" --format '{{.CurrentState}}' | grep -c "Failed")
local status="OK"
if [ "$running" -lt "$desired" ] || [ "$tasks_failed" -gt 0 ]; then
status="WARNING"
fi
echo " <tr><td>$service</td><td>$replicas</td><td>$tasks_running</td><td>$tasks_failed</td><td>$status</td></tr>" >> $report_file
done
echo " </table>" >> $report_file
echo " </div>" >> $report_file
# 结束 HTML
echo "</body></html>" >> $report_file
log "Performance report generated: $report_file"
}
# 性能优化建议
performance_recommendations() {
log "Analyzing performance and generating recommendations..."
local recommendations_file="$REPORT_DIR/recommendations-$(date +%Y%m%d_%H%M%S).txt"
{
echo "Docker Swarm Performance Recommendations"
echo "Generated: $(date)"
echo "========================================"
echo
# 检查系统资源
echo "System Resource Analysis:"
# CPU 分析
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
if (( $(echo "$cpu_usage > 80" | bc -l) )); then
echo " ⚠ High CPU usage detected ($cpu_usage%)"
echo " - Consider scaling out services to more nodes"
echo " - Review CPU-intensive containers"
echo " - Implement CPU limits for containers"
fi
# 内存分析
local mem_usage=$(free | awk 'NR==2{printf "%.2f", $3*100/$2 }')
if (( $(echo "$mem_usage > 85" | bc -l) )); then
echo " ⚠ High memory usage detected ($mem_usage%)"
echo " - Review memory-intensive containers"
echo " - Implement memory limits and reservations"
echo " - Consider adding more memory or nodes"
fi
# 磁盘分析
local disk_usage=$(df / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
if [ "$disk_usage" -gt 85 ]; then
echo " ⚠ High disk usage detected ($disk_usage%)"
echo " - Clean up unused Docker images and volumes"
echo " - Implement log rotation"
echo " - Consider adding more storage"
fi
echo
echo "Service Analysis:"
# 检查服务状态
for service in $(docker service ls --format '{{.Name}}'); do
local replicas=$(docker service ls --filter name=$service --format '{{.Replicas}}')
local running=$(echo $replicas | cut -d'/' -f1)
local desired=$(echo $replicas | cut -d'/' -f2)
if [ "$running" -lt "$desired" ]; then
echo " ⚠ Service $service has insufficient replicas ($running/$desired)"
echo " - Check node availability and resources"
echo " - Review service constraints and placement preferences"
echo " - Verify image availability on all nodes"
fi
# 检查失败的任务
      local failed_tasks=$(docker service ps $service --filter "desired-state=shutdown" --format '{{.CurrentState}}' | grep -c "Failed")
if [ "$failed_tasks" -gt 0 ]; then
echo " ⚠ Service $service has $failed_tasks failed tasks"
echo " - Review service logs for error details"
echo " - Check resource limits and health checks"
echo " - Verify service configuration"
fi
done
echo
echo "Docker System Analysis:"
# 检查 Docker 存储使用
local images_size=$(docker system df --format "{{.Size}}" | head -1)
local containers_size=$(docker system df --format "{{.Size}}" | sed -n '2p')
local volumes_size=$(docker system df --format "{{.Size}}" | sed -n '3p')
echo " Storage Usage:"
echo " - Images: $images_size"
echo " - Containers: $containers_size"
echo " - Volumes: $volumes_size"
# 检查未使用的资源
local unused_images=$(docker images -f "dangling=true" -q | wc -l)
local unused_volumes=$(docker volume ls -f "dangling=true" -q | wc -l)
if [ "$unused_images" -gt 0 ]; then
echo " ⚠ Found $unused_images unused images"
echo " - Run: docker image prune -f"
fi
if [ "$unused_volumes" -gt 0 ]; then
echo " ⚠ Found $unused_volumes unused volumes"
echo " - Run: docker volume prune -f"
fi
echo
echo "Network Analysis:"
# 检查网络连接
local overlay_networks=$(docker network ls --filter driver=overlay --format '{{.Name}}' | wc -l)
echo " Overlay Networks: $overlay_networks"
      # 检查端口使用(统计已发布端口的服务数量;原先的 grep -v "" 会过滤掉所有行,改为排除空行)
      local published_ports=$(docker service ls --format '{{.Ports}}' | grep -vc '^$')
echo " Published Ports: $published_ports"
echo
echo "General Recommendations:"
echo " 1. Implement resource limits for all services"
echo " 2. Use health checks for critical services"
echo " 3. Monitor and alert on key metrics"
echo " 4. Regular cleanup of unused resources"
echo " 5. Implement proper logging and log rotation"
echo " 6. Use placement constraints for optimal resource distribution"
echo " 7. Regular backup of swarm configuration and data"
} > $recommendations_file
log "Performance recommendations saved to: $recommendations_file"
}
# 主循环
monitor_loop() {
log "Starting performance monitoring loop (interval: ${MONITOR_INTERVAL}s)"
# 创建 CSV 头部
if [ ! -f "$REPORT_DIR/system-metrics.csv" ]; then
echo "timestamp,datetime,cpu_usage,mem_usage,disk_usage,load_avg,rx_bytes,tx_bytes" > $REPORT_DIR/system-metrics.csv
fi
if [ ! -f "$REPORT_DIR/docker-metrics.csv" ]; then
echo "timestamp,datetime,container_count,image_count,volume_count,network_count,docker_size" > $REPORT_DIR/docker-metrics.csv
fi
if [ ! -f "$REPORT_DIR/service-metrics.csv" ]; then
echo "timestamp,datetime,service,running,desired,tasks_running,tasks_failed" > $REPORT_DIR/service-metrics.csv
fi
if [ ! -f "$REPORT_DIR/container-resources.csv" ]; then
echo "timestamp,datetime,container,cpu_perc,mem_usage,mem_perc,net_io,block_io" > $REPORT_DIR/container-resources.csv
fi
while true; do
collect_system_metrics
collect_docker_metrics
collect_service_metrics
collect_container_resources
sleep $MONITOR_INTERVAL
done
}
# 主菜单
case "$1" in
"start")
monitor_loop
;;
"system")
collect_system_metrics
;;
"docker")
collect_docker_metrics
;;
"services")
collect_service_metrics
;;
"containers")
collect_container_resources
;;
"report")
generate_performance_report
;;
"recommendations")
performance_recommendations
;;
"all")
collect_system_metrics
collect_docker_metrics
collect_service_metrics
collect_container_resources
generate_performance_report
performance_recommendations
;;
*)
echo "Usage: $0 {start|system|docker|services|containers|report|recommendations|all}"
echo " start - Start continuous monitoring"
echo " system - Collect system metrics"
echo " docker - Collect Docker metrics"
echo " services - Collect service metrics"
echo " containers - Collect container resources"
echo " report - Generate performance report"
echo " recommendations - Generate optimization recommendations"
echo " all - Run all collection tasks"
;;
esac
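`start` 模式会以死循环方式阻塞当前终端,长期运行时可以放到后台并记录进程号,便于随时停止。以下命令仅作参考,安装路径与 PID 文件路径为假设值,脚本需要写入 /var/reports 与 /var/log,故以 root 身份运行:
```bash
# 安装脚本并后台启动持续监控
sudo install -m 755 performance-monitor.sh /usr/local/bin/performance-monitor.sh
sudo sh -c 'nohup /usr/local/bin/performance-monitor.sh start > /dev/null 2>&1 & echo $! > /var/run/performance-monitor.pid'
# 查看采集结果
sudo tail -n 5 /var/reports/performance/system-metrics.csv
# 需要停止时
sudo kill "$(sudo cat /var/run/performance-monitor.pid)"
```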
4.2 性能优化策略
资源优化配置
# optimized-service.yml
version: '3.8'
services:
web-app:
image: nginx:alpine
ports:
- "80:80"
deploy:
replicas: 3
resources:
limits:
cpus: '0.5'
memory: 512M
reservations:
cpus: '0.25'
memory: 256M
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
window: 120s
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
monitor: 60s
max_failure_ratio: 0.3
rollback_config:
parallelism: 1
delay: 5s
failure_action: pause
monitor: 60s
max_failure_ratio: 0.3
placement:
constraints:
- node.role == worker
preferences:
- spread: node.labels.zone
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
networks:
- app-network
database:
image: postgres:13
environment:
POSTGRES_DB: myapp
POSTGRES_USER: user
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
volumes:
- db-data:/var/lib/postgresql/data
deploy:
replicas: 1
resources:
limits:
cpus: '1.0'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
placement:
constraints:
- node.labels.storage == ssd
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
interval: 30s
timeout: 5s
retries: 5
start_period: 30s
secrets:
- db_password
networks:
- app-network
secrets:
db_password:
external: true
volumes:
db-data:
driver: local
driver_opts:
type: none
o: bind
device: /opt/database
networks:
app-network:
driver: overlay
driver_opts:
encrypted: "true"
自动扩缩容脚本
#!/bin/bash
# auto-scaler.sh
SERVICE_NAME="web-app"
MIN_REPLICAS=2
MAX_REPLICAS=10
CPU_THRESHOLD_UP=70
CPU_THRESHOLD_DOWN=30
MEM_THRESHOLD_UP=80
MEM_THRESHOLD_DOWN=40
SCALE_UP_COOLDOWN=300 # 5分钟
SCALE_DOWN_COOLDOWN=600 # 10分钟
LOG_FILE="/var/log/auto-scaler.log"
# 状态文件
STATE_FILE="/tmp/auto-scaler-state"
# 日志函数
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a $LOG_FILE
}
# 获取当前副本数
get_current_replicas() {
docker service ls --filter name=$SERVICE_NAME --format '{{.Replicas}}' | cut -d'/' -f1
}
# 获取期望副本数
get_desired_replicas() {
docker service ls --filter name=$SERVICE_NAME --format '{{.Replicas}}' | cut -d'/' -f2
}
# 获取服务平均 CPU 使用率
get_avg_cpu_usage() {
local total_cpu=0
local container_count=0
for container in $(docker service ps $SERVICE_NAME --filter "desired-state=running" --format '{{.Name}}.{{.ID}}'); do
local cpu_usage=$(docker stats --no-stream --format '{{.CPUPerc}}' $container 2>/dev/null | cut -d'%' -f1)
if [ ! -z "$cpu_usage" ] && [ "$cpu_usage" != "--" ]; then
total_cpu=$(echo "$total_cpu + $cpu_usage" | bc)
container_count=$((container_count + 1))
fi
done
if [ $container_count -gt 0 ]; then
echo "scale=2; $total_cpu / $container_count" | bc
else
echo "0"
fi
}
# 获取服务平均内存使用率
get_avg_memory_usage() {
local total_mem=0
local container_count=0
for container in $(docker service ps $SERVICE_NAME --filter "desired-state=running" --format '{{.Name}}.{{.ID}}'); do
local mem_usage=$(docker stats --no-stream --format '{{.MemPerc}}' $container 2>/dev/null | cut -d'%' -f1)
if [ ! -z "$mem_usage" ] && [ "$mem_usage" != "--" ]; then
total_mem=$(echo "$total_mem + $mem_usage" | bc)
container_count=$((container_count + 1))
fi
done
if [ $container_count -gt 0 ]; then
echo "scale=2; $total_mem / $container_count" | bc
else
echo "0"
fi
}
# 检查冷却时间
check_cooldown() {
local action=$1
local current_time=$(date +%s)
if [ -f "$STATE_FILE" ]; then
local last_action=$(grep "last_action" $STATE_FILE | cut -d'=' -f2)
local last_time=$(grep "last_time" $STATE_FILE | cut -d'=' -f2)
if [ "$last_action" = "scale_up" ]; then
local cooldown_end=$((last_time + SCALE_UP_COOLDOWN))
if [ $current_time -lt $cooldown_end ]; then
return 1
fi
elif [ "$last_action" = "scale_down" ]; then
local cooldown_end=$((last_time + SCALE_DOWN_COOLDOWN))
if [ $current_time -lt $cooldown_end ]; then
return 1
fi
fi
fi
return 0
}
# 更新状态文件
update_state() {
local action=$1
local current_time=$(date +%s)
echo "last_action=$action" > $STATE_FILE
echo "last_time=$current_time" >> $STATE_FILE
}
# 扩容
scale_up() {
local current_replicas=$(get_current_replicas)
local new_replicas=$((current_replicas + 1))
if [ $new_replicas -le $MAX_REPLICAS ]; then
log "Scaling up $SERVICE_NAME from $current_replicas to $new_replicas replicas"
docker service scale $SERVICE_NAME=$new_replicas
update_state "scale_up"
return 0
else
log "Cannot scale up $SERVICE_NAME: already at maximum replicas ($MAX_REPLICAS)"
return 1
fi
}
# 缩容
scale_down() {
local current_replicas=$(get_current_replicas)
local new_replicas=$((current_replicas - 1))
if [ $new_replicas -ge $MIN_REPLICAS ]; then
log "Scaling down $SERVICE_NAME from $current_replicas to $new_replicas replicas"
docker service scale $SERVICE_NAME=$new_replicas
update_state "scale_down"
return 0
else
log "Cannot scale down $SERVICE_NAME: already at minimum replicas ($MIN_REPLICAS)"
return 1
fi
}
# 主要扩缩容逻辑
auto_scale() {
local current_replicas=$(get_current_replicas)
local desired_replicas=$(get_desired_replicas)
# 等待服务稳定
if [ "$current_replicas" != "$desired_replicas" ]; then
log "Service $SERVICE_NAME is not stable ($current_replicas/$desired_replicas), waiting..."
return
fi
local avg_cpu=$(get_avg_cpu_usage)
local avg_memory=$(get_avg_memory_usage)
log "Current metrics - Replicas: $current_replicas, CPU: ${avg_cpu}%, Memory: ${avg_memory}%"
# 检查是否需要扩容
if (( $(echo "$avg_cpu > $CPU_THRESHOLD_UP" | bc -l) )) || (( $(echo "$avg_memory > $MEM_THRESHOLD_UP" | bc -l) )); then
if check_cooldown "scale_up"; then
scale_up
else
log "Scale up is in cooldown period"
fi
# 检查是否需要缩容
elif (( $(echo "$avg_cpu < $CPU_THRESHOLD_DOWN" | bc -l) )) && (( $(echo "$avg_memory < $MEM_THRESHOLD_DOWN" | bc -l) )); then
if check_cooldown "scale_down"; then
scale_down
else
log "Scale down is in cooldown period"
fi
else
log "No scaling action needed"
fi
}
# 主循环
if [ "$1" = "start" ]; then
log "Starting auto-scaler for service: $SERVICE_NAME"
log "Configuration: Min=$MIN_REPLICAS, Max=$MAX_REPLICAS, CPU_UP=$CPU_THRESHOLD_UP%, CPU_DOWN=$CPU_THRESHOLD_DOWN%, MEM_UP=$MEM_THRESHOLD_UP%, MEM_DOWN=$MEM_THRESHOLD_DOWN%"
while true; do
auto_scale
sleep 60 # 每分钟检查一次
done
else
echo "Usage: $0 start"
echo "Auto-scaler for Docker Swarm services"
echo "Configuration:"
echo " Service: $SERVICE_NAME"
echo " Min Replicas: $MIN_REPLICAS"
echo " Max Replicas: $MAX_REPLICAS"
echo " CPU Thresholds: $CPU_THRESHOLD_DOWN% - $CPU_THRESHOLD_UP%"
echo " Memory Thresholds: $MEM_THRESHOLD_DOWN% - $MEM_THRESHOLD_UP%"
fi
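实际使用时可以把脚本放到后台长期运行,并通过其日志文件观察扩缩容决策(脚本内部已通过 tee 写入 /var/log/auto-scaler.log,需要具备写入该目录的权限;以下命令为示例):
```bash
# 后台运行扩缩容脚本(脚本自身会写日志,标准输出丢弃即可)
nohup ./auto-scaler.sh start > /dev/null 2>&1 &
echo $! > /tmp/auto-scaler.pid
# 观察扩缩容决策与副本数变化
tail -n 20 /var/log/auto-scaler.log
docker service ls --filter name=web-app
# 停止自动扩缩容
kill "$(cat /tmp/auto-scaler.pid)"
```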
5. 实践练习
练习1:监控系统部署
目标:部署完整的监控系统
步骤:
1. **部署 Prometheus Stack**
```bash
# 创建监控网络
docker network create --driver overlay monitoring
# 部署监控服务
docker stack deploy -c prometheus-stack.yml monitoring
# 验证部署
docker service ls
docker stack ps monitoring
```
2. **配置告警规则**
```bash
# 创建告警规则目录
mkdir -p rules
# 复制告警规则文件
cp swarm-alerts.yml rules/
# 重新加载 Prometheus 配置
curl -X POST http://localhost:9090/-/reload
```
3. **访问监控界面**
```bash
# Prometheus:   http://localhost:9090
# Grafana:      http://localhost:3000 (admin/admin123)
# Alertmanager: http://localhost:9093
```
练习2:日志管理系统
目标:部署 ELK Stack 进行日志管理
步骤:
1. **部署 ELK Stack**
```bash
# 创建日志网络
docker network create --driver overlay logging
# 部署 ELK 服务
docker stack deploy -c elk-stack.yml logging
# 等待服务启动
docker service logs logging_elasticsearch
```
2. **配置 Filebeat**
```bash
# 创建 Filebeat 配置
mkdir -p filebeat
cp filebeat.yml filebeat/
# 重新部署以应用配置
docker service update --force logging_filebeat
```
3. **配置 Kibana 仪表板**
```bash
# 访问 Kibana: http://localhost:5601
# 创建索引模式: swarm-logs-*
# 导入预定义仪表板
```
练习3:性能优化实践
目标:实施性能监控和自动扩缩容
步骤:
1. **部署性能监控**
```bash
# 启动性能监控
./performance-monitor.sh start &
# 生成性能报告
./performance-monitor.sh report
# 查看优化建议
./performance-monitor.sh recommendations
```
2. **配置自动扩缩容**
```bash
# 部署测试服务(发布 80 端口,便于下面的负载测试访问)
docker service create --name web-app --replicas 2 --publish 80:80 nginx:alpine
# 启动自动扩缩容
./auto-scaler.sh start &
# 模拟负载测试
for i in {1..100}; do
  curl -s http://localhost > /dev/null &
done
```
3. **验证扩缩容效果**
```bash
# 监控副本数变化
watch docker service ls
# 查看扩缩容日志
tail -f /var/log/auto-scaler.log
```
本章总结
通过本章学习,我们掌握了 Docker Swarm 的监控与日志管理:
关键要点
监控架构
- 多层次监控体系
- 指标收集和存储
- 可视化和告警
Prometheus 监控
- 指标收集和存储
- 告警规则配置
- Grafana 可视化
日志管理
- ELK Stack 部署
- 日志收集和聚合
- 日志分析和搜索
性能优化
- 性能指标监控
- 资源优化配置
- 自动扩缩容
最佳实践
监控策略
- 建立完整的监控体系
- 设置合理的告警阈值
- 定期审查和优化监控规则
日志管理
- 统一日志格式和标准
- 实施日志轮转和归档
- 建立日志分析流程
性能优化
- 持续监控关键指标
- 实施资源限制和预留
- 自动化扩缩容策略
运维自动化
- 自动化监控和告警
- 自动化日志收集和分析
- 自动化性能优化
下一章我们将学习 Docker Swarm 的故障排除与调试技巧。