## 10.1 Chapter Overview

This chapter uses real enterprise-grade case studies to show how Helm is applied in large organizations. We will learn how to design, deploy, and manage Helm charts in complex enterprise environments, covering key practices such as multi-environment management, CI/CD integration, large-scale deployment, and operations automation.
### Learning Objectives

- Master enterprise Helm architecture design principles
- Learn multi-environment management and configuration strategies
- Understand Helm integration in CI/CD pipelines
- Master large-scale deployment and operations automation
- Learn enterprise monitoring and governance practices
- Understand best practices through real-world case studies
### Chapter Structure

```mermaid
graph TB
    A[Enterprise Helm Practices] --> B[Architecture Design]
    A --> C[Multi-Environment Management]
    A --> D[CI/CD Integration]
    A --> E[Large-Scale Deployment]
    A --> F[Operations Automation]
    A --> G[Case Studies]

    B --> B1[Chart Organization]
    B --> B2[Dependency Management]
    B --> B3[Version Control]

    C --> C1[Environment Isolation]
    C --> C2[Configuration Management]
    C --> C3[Deployment Strategies]

    D --> D1[GitOps Workflow]
    D --> D2[Automated Testing]
    D --> D3[Release Management]

    E --> E1[Cluster Management]
    E --> E2[Resource Scheduling]
    E --> E3[Performance Optimization]

    F --> F1[Monitoring and Alerting]
    F --> F2[Log Management]
    F --> F3[Failure Recovery]

    G --> G1[E-commerce Platform]
    G --> G2[Financial System]
    G --> G3[IoT Platform]
```
## 10.2 Enterprise Architecture Design

### 10.2.1 Chart Organization
```text
enterprise-charts/
├── platform/                  # Platform infrastructure components
│   ├── monitoring/
│   │   ├── prometheus/
│   │   ├── grafana/
│   │   └── alertmanager/
│   ├── logging/
│   │   ├── elasticsearch/
│   │   ├── logstash/
│   │   └── kibana/
│   ├── security/
│   │   ├── vault/
│   │   ├── cert-manager/
│   │   └── oauth2-proxy/
│   └── networking/
│       ├── ingress-nginx/
│       ├── istio/
│       └── calico/
├── applications/              # Business applications
│   ├── user-service/
│   ├── order-service/
│   ├── payment-service/
│   └── notification-service/
├── shared/                    # Shared components
│   ├── database/
│   │   ├── postgresql/
│   │   ├── redis/
│   │   └── mongodb/
│   ├── messaging/
│   │   ├── kafka/
│   │   ├── rabbitmq/
│   │   └── nats/
│   └── storage/
│       ├── minio/
│       └── ceph/
├── environments/              # Environment configuration
│   ├── dev/
│   ├── staging/
│   ├── production/
│   └── dr/                    # Disaster-recovery environment
└── umbrella/                  # Umbrella charts
    ├── platform-stack/
    ├── application-stack/
    └── full-stack/
```
### 10.2.2 Enterprise Chart Template
```yaml
# charts/enterprise-app/Chart.yaml
apiVersion: v2
name: enterprise-app
description: Enterprise-grade application template
type: application
version: 1.0.0
appVersion: "1.0.0"

# Enterprise dependency management
dependencies:
  # Monitoring
  - name: prometheus
    version: "15.x.x"
    repository: "https://prometheus-community.github.io/helm-charts"
    condition: monitoring.prometheus.enabled
  # Logging
  - name: elasticsearch
    version: "7.x.x"
    repository: "https://helm.elastic.co"
    condition: logging.elasticsearch.enabled
  # Database
  - name: postgresql
    version: "11.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: database.postgresql.enabled
  # Cache
  - name: redis
    version: "16.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: cache.redis.enabled
  # Message queue
  - name: kafka
    version: "18.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: messaging.kafka.enabled

# Enterprise annotations
annotations:
  category: "Enterprise Application"
  licenses: "Apache-2.0"
  images: |
    - name: app
      image: docker.io/company/enterprise-app:1.0.0
    - name: sidecar
      image: docker.io/company/sidecar:1.0.0

# Keywords
keywords:
  - enterprise
  - microservices
  - cloud-native
  - kubernetes
  - helm

# Maintainers
maintainers:
  - name: Platform Team
    email: platform@company.com
    url: https://platform.company.com
  - name: DevOps Team
    email: devops@company.com
    url: https://devops.company.com

# Home page and sources
home: https://company.com/enterprise-app
sources:
  - https://github.com/company/enterprise-app
  - https://github.com/company/enterprise-charts

# Kubernetes version compatibility
kubeVersion: ">=1.20.0-0"
```
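Each `condition` above names a path in the merged values; when that path resolves to false, Helm skips rendering and installing the subchart (if the path is missing, the dependency stays enabled by default). A minimal override to install without Kafka, for example, only needs a values file like this (illustrative file name):

```yaml
# values-no-kafka.yaml (illustrative)
messaging:
  kafka:
    enabled: false
```

Passed with an extra `-f values-no-kafka.yaml`, this prunes the kafka dependency at render time.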
### 10.2.3 Enterprise Values Structure
```yaml
# charts/enterprise-app/values.yaml

# Global configuration
global:
  # Image registry
  imageRegistry: "registry.company.com"
  imagePullSecrets:
    - name: "company-registry-secret"
  # Storage class
  storageClass: "company-ssd"
  # Networking
  networkPolicy:
    enabled: true
    type: "calico"
  # Security
  security:
    podSecurityPolicy: "restricted"
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  # Monitoring
  monitoring:
    enabled: true
    namespace: "monitoring"
    serviceMonitor:
      enabled: true
      interval: "30s"
  # Logging
  logging:
    enabled: true
    level: "info"
    format: "json"
    destination: "elasticsearch"

# Application configuration
app:
  name: "enterprise-app"
  version: "1.0.0"

  # Image
  image:
    repository: "enterprise-app"
    tag: "1.0.0"
    pullPolicy: "IfNotPresent"

  # Replicas
  replicaCount: 3

  # Resources
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "1000m"

  # Health checks
  healthCheck:
    enabled: true
    livenessProbe:
      httpGet:
        path: "/health/live"
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: "/health/ready"
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3

  # Environment variables
  env:
    - name: "ENVIRONMENT"
      value: "production"
    - name: "LOG_LEVEL"
      value: "info"
    - name: "DATABASE_URL"
      valueFrom:
        secretKeyRef:
          name: "database-credentials"
          key: "url"

# Service configuration
service:
  type: "ClusterIP"
  port: 80
  targetPort: 8080
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"

# Ingress configuration
ingress:
  enabled: true
  className: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "company-ca-issuer"
  hosts:
    - host: "app.company.com"
      paths:
        - path: "/"
          pathType: "Prefix"
  tls:
    - secretName: "app-tls-secret"
      hosts:
        - "app.company.com"

# Autoscaling
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: "Percent"
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: "Percent"
          value: 50
          periodSeconds: 60

# Persistent storage
persistence:
  enabled: true
  storageClass: "company-ssd"
  accessMode: "ReadWriteOnce"
  size: "10Gi"
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: "ebs.csi.aws.com"

# Database configuration
database:
  postgresql:
    enabled: true
    auth:
      existingSecret: "postgresql-credentials"
    primary:
      persistence:
        enabled: true
        size: "100Gi"
        storageClass: "company-ssd"
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true

# Cache configuration
cache:
  redis:
    enabled: true
    auth:
      enabled: true
      existingSecret: "redis-credentials"
    master:
      persistence:
        enabled: true
        size: "20Gi"
        storageClass: "company-ssd"
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true

# Message queue configuration
messaging:
  kafka:
    enabled: true
    auth:
      clientProtocol: "sasl"
      existingSecret: "kafka-credentials"
    persistence:
      enabled: true
      size: "100Gi"
      storageClass: "company-ssd"
    metrics:
      kafka:
        enabled: true
      jmx:
        enabled: true

# Monitoring configuration
monitoring:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
      interval: "30s"
      scrapeTimeout: "10s"
      labels:
        app: "enterprise-app"
        team: "platform"
  grafana:
    enabled: true
    dashboards:
      enabled: true
      configMapName: "enterprise-app-dashboards"

# Logging configuration
logging:
  elasticsearch:
    enabled: true
    index: "enterprise-app"
    template: "enterprise-app-template"
  fluentd:
    enabled: true
    configMap: "enterprise-app-fluentd-config"

# Security configuration
security:
  networkPolicy:
    enabled: true
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                name: "ingress-nginx"
        ports:
          - protocol: "TCP"
            port: 8080
      - from:
          - namespaceSelector:
              matchLabels:
                name: "monitoring"
        ports:
          - protocol: "TCP"
            port: 8080
    egress:
      - to:
          - namespaceSelector:
              matchLabels:
                name: "database"
        ports:
          - protocol: "TCP"
            port: 5432
      - to:
          - namespaceSelector:
              matchLabels:
                name: "cache"
        ports:
          - protocol: "TCP"
            port: 6379
  podSecurityPolicy:
    enabled: true
    name: "enterprise-app-psp"
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/enterprise-app-role"

# Backup configuration
backup:
  enabled: true
  schedule: "0 2 * * *"
  retention: "30d"
  storage:
    type: "s3"
    bucket: "company-backups"
    prefix: "enterprise-app"
```
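Chart templates read this tree through `.Values`. As a sketch of how the probe settings above could be consumed, a Deployment template might render them like this (illustrative fragment; the file name and nesting follow the conventional chart layout, not a template defined in this chapter):

```yaml
# templates/deployment.yaml (fragment, illustrative)
{{- if .Values.app.healthCheck.enabled }}
livenessProbe:
  {{- toYaml .Values.app.healthCheck.livenessProbe | nindent 2 }}
readinessProbe:
  {{- toYaml .Values.app.healthCheck.readinessProbe | nindent 2 }}
{{- end }}
```

Keeping whole probe blocks in values and emitting them with `toYaml` lets each environment override any probe field without touching the template.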
## 10.3 Multi-Environment Management Strategies

### 10.3.1 Separating Environment Configuration
```yaml
# environments/dev/values.yaml
# Development environment
global:
  imageRegistry: "dev-registry.company.com"
  environment: "development"

app:
  replicaCount: 1
  image:
    tag: "dev-latest"
  resources:
    requests:
      memory: "256Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"

autoscaling:
  enabled: false

database:
  postgresql:
    primary:
      persistence:
        size: "10Gi"

cache:
  redis:
    master:
      persistence:
        size: "5Gi"

monitoring:
  prometheus:
    enabled: false

logging:
  elasticsearch:
    enabled: false
---
# environments/staging/values.yaml
# Staging environment
global:
  imageRegistry: "staging-registry.company.com"
  environment: "staging"

app:
  replicaCount: 2
  image:
    tag: "staging-v1.0.0"
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "1000m"

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 5

database:
  postgresql:
    primary:
      persistence:
        size: "50Gi"

cache:
  redis:
    master:
      persistence:
        size: "10Gi"

monitoring:
  prometheus:
    enabled: true

logging:
  elasticsearch:
    enabled: true
---
# environments/production/values.yaml
# Production environment
global:
  imageRegistry: "prod-registry.company.com"
  environment: "production"

app:
  replicaCount: 5
  image:
    tag: "v1.0.0"
  resources:
    requests:
      memory: "1Gi"
      cpu: "1000m"
    limits:
      memory: "2Gi"
      cpu: "2000m"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50

database:
  postgresql:
    primary:
      persistence:
        size: "500Gi"
    readReplicas:
      replicaCount: 2
      persistence:
        size: "500Gi"

cache:
  redis:
    master:
      persistence:
        size: "100Gi"
    replica:
      replicaCount: 2
      persistence:
        size: "100Gi"

monitoring:
  prometheus:
    enabled: true
    retention: "30d"

logging:
  elasticsearch:
    enabled: true
    retention: "90d"

backup:
  enabled: true
  schedule: "0 2 * * *"
  retention: "90d"
```
### 10.3.2 Environment Deployment Script
```bash
#!/bin/bash
# scripts/deploy.sh
set -euo pipefail

# Parameters
ENVIRONMENT=${1:-dev}
CHART_NAME=${2:-enterprise-app}
NAMESPACE=${3:-default}
VERSION=${4:-latest}

# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

echo -e "${GREEN}Starting deployment to ${ENVIRONMENT} environment${NC}"

# Validate the environment name
case $ENVIRONMENT in
  dev|staging|production)
    echo -e "${GREEN}Environment: $ENVIRONMENT${NC}"
    ;;
  *)
    echo -e "${RED}Error: Invalid environment. Use dev, staging, or production${NC}"
    exit 1
    ;;
esac

# Select the Kubernetes context
case $ENVIRONMENT in
  dev)
    KUBE_CONTEXT="dev-cluster"
    REGISTRY="dev-registry.company.com"
    ;;
  staging)
    KUBE_CONTEXT="staging-cluster"
    REGISTRY="staging-registry.company.com"
    ;;
  production)
    KUBE_CONTEXT="prod-cluster"
    REGISTRY="prod-registry.company.com"
    ;;
esac

echo -e "${YELLOW}Switching to Kubernetes context: $KUBE_CONTEXT${NC}"
kubectl config use-context "$KUBE_CONTEXT"

# Create the namespace if it does not exist
echo -e "${YELLOW}Ensuring namespace exists: $NAMESPACE${NC}"
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# Label the namespace
kubectl label namespace "$NAMESPACE" environment="$ENVIRONMENT" --overwrite
kubectl label namespace "$NAMESPACE" managed-by=helm --overwrite

# Update Helm repositories
echo -e "${YELLOW}Updating Helm repositories${NC}"
helm repo update

# Validate the chart
echo -e "${YELLOW}Validating Helm chart${NC}"
helm lint "charts/$CHART_NAME"

# Test template rendering (helm template already renders client-side,
# so no --dry-run flag is needed)
echo -e "${YELLOW}Testing template rendering${NC}"
helm template "$CHART_NAME" "charts/$CHART_NAME" \
  -f "environments/$ENVIRONMENT/values.yaml" \
  --namespace "$NAMESPACE" > "/tmp/rendered-$ENVIRONMENT.yaml"

# Security checks
echo -e "${YELLOW}Running security checks${NC}"
kubesec scan "/tmp/rendered-$ENVIRONMENT.yaml"

# Pre-deployment checks
echo -e "${YELLOW}Pre-deployment checks${NC}"

# Check resource quotas
kubectl describe quota -n "$NAMESPACE" || echo "No resource quota found"

# Check the storage class
kubectl get storageclass company-ssd || {
  echo -e "${RED}Error: Required storage class 'company-ssd' not found${NC}"
  exit 1
}

# Check image registry connectivity
echo -e "${YELLOW}Checking image registry connectivity${NC}"
docker pull "$REGISTRY/enterprise-app:$VERSION" || {
  echo -e "${RED}Error: Cannot pull image from registry${NC}"
  exit 1
}

# Confirm production deployments interactively
if [ "$ENVIRONMENT" = "production" ]; then
  echo -e "${RED}WARNING: You are about to deploy to PRODUCTION!${NC}"
  read -p "Are you sure you want to continue? (yes/no): " -r
  if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
    echo -e "${YELLOW}Deployment cancelled${NC}"
    exit 0
  fi
fi

# Deploy
echo -e "${GREEN}Deploying $CHART_NAME to $ENVIRONMENT${NC}"
helm upgrade --install "$CHART_NAME" "charts/$CHART_NAME" \
  --namespace "$NAMESPACE" \
  --create-namespace \
  -f "environments/$ENVIRONMENT/values.yaml" \
  --set app.image.tag="$VERSION" \
  --set global.environment="$ENVIRONMENT" \
  --timeout 10m \
  --wait \
  --atomic

# Post-deployment verification
echo -e "${YELLOW}Post-deployment verification${NC}"

# Check pod status
echo "Checking pod status..."
kubectl get pods -n "$NAMESPACE" -l app.kubernetes.io/name="$CHART_NAME"

# Check service status
echo "Checking service status..."
kubectl get svc -n "$NAMESPACE" -l app.kubernetes.io/name="$CHART_NAME"

# Health check
echo "Performing health check..."
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name="$CHART_NAME" -n "$NAMESPACE" --timeout=300s

# Run environment-specific tests if present
if [ -f "tests/$ENVIRONMENT-tests.yaml" ]; then
  echo -e "${YELLOW}Running environment-specific tests${NC}"
  helm test "$CHART_NAME" -n "$NAMESPACE"
fi

# Success summary
echo -e "${GREEN}Deployment completed successfully!${NC}"
echo -e "${GREEN}Chart: $CHART_NAME${NC}"
echo -e "${GREEN}Environment: $ENVIRONMENT${NC}"
echo -e "${GREEN}Namespace: $NAMESPACE${NC}"
echo -e "${GREEN}Version: $VERSION${NC}"

# Access information
echo -e "${YELLOW}Access Information:${NC}"
kubectl get ingress -n "$NAMESPACE" -l app.kubernetes.io/name="$CHART_NAME"

# Clean up temporary files
rm -f "/tmp/rendered-$ENVIRONMENT.yaml"

echo -e "${GREEN}Deployment script completed${NC}"
```
## 10.4 CI/CD Integration

### 10.4.1 GitLab CI/CD Pipeline
```yaml
# .gitlab-ci.yml
stages:
  - validate
  - test
  - build
  - security
  - deploy-dev
  - deploy-staging
  - deploy-production

variables:
  CHART_NAME: "enterprise-app"
  DOCKER_REGISTRY: "registry.company.com"
  HELM_VERSION: "3.12.0"
  KUBECTL_VERSION: "1.27.0"

# Shared job template
.helm_template: &helm_template
  image: alpine/helm:$HELM_VERSION
  before_script:
    - apk add --no-cache curl
    - curl -LO "https://dl.k8s.io/release/v$KUBECTL_VERSION/bin/linux/amd64/kubectl"
    - chmod +x kubectl && mv kubectl /usr/local/bin/
    - helm repo add bitnami https://charts.bitnami.com/bitnami
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update

# Chart validation
validate-chart:
  <<: *helm_template
  stage: validate
  script:
    - helm lint charts/$CHART_NAME
    - helm template $CHART_NAME charts/$CHART_NAME
  rules:
    - changes:
        - charts/**/*
        - environments/**/*

# Dependency check
check-dependencies:
  <<: *helm_template
  stage: validate
  script:
    - cd charts/$CHART_NAME
    - helm dependency update
    - helm dependency build
  artifacts:
    paths:
      - charts/$CHART_NAME/charts/
    expire_in: 1 hour
  rules:
    - changes:
        - charts/**/Chart.yaml
        - charts/**/requirements.yaml

# Security scan
# Note: the trivy image does not contain helm, so the job scans the
# chart directory directly (trivy's misconfiguration scanner renders
# Helm charts itself) and emits a JUnit report via the bundled template.
security-scan:
  image: aquasec/trivy:latest
  stage: security
  script:
    - trivy config --format template --template "@/contrib/junit.tpl"
      -o trivy-report.xml charts/$CHART_NAME
  artifacts:
    reports:
      junit: trivy-report.xml
  rules:
    - changes:
        - charts/**/*

# Chart tests
test-chart:
  <<: *helm_template
  stage: test
  script:
    - helm install $CHART_NAME-test charts/$CHART_NAME
      --dry-run --debug
      -f environments/dev/values.yaml
    - helm template $CHART_NAME charts/$CHART_NAME
      -f environments/staging/values.yaml
      --validate
    - helm template $CHART_NAME charts/$CHART_NAME
      -f environments/production/values.yaml
      --validate
  rules:
    - changes:
        - charts/**/*
        - environments/**/*

# Package and publish the chart
build-chart:
  <<: *helm_template
  stage: build
  script:
    - helm package charts/$CHART_NAME --version $CI_COMMIT_TAG
    - curl --data-binary "@$CHART_NAME-$CI_COMMIT_TAG.tgz"
      "$CHART_REPOSITORY_URL/api/charts"
  artifacts:
    paths:
      - "*.tgz"
    expire_in: 1 week
  only:
    - tags

# Development deployment
deploy-dev:
  <<: *helm_template
  stage: deploy-dev
  environment:
    name: development
    url: https://app-dev.company.com
  script:
    - kubectl config use-context dev-cluster
    - helm upgrade --install $CHART_NAME charts/$CHART_NAME
      --namespace dev
      --create-namespace
      -f environments/dev/values.yaml
      --set app.image.tag=$CI_COMMIT_SHA
      --wait
      --timeout 10m
    - kubectl rollout status deployment/$CHART_NAME -n dev
  rules:
    - if: '$CI_COMMIT_BRANCH == "develop"'
      changes:
        - charts/**/*
        - environments/dev/**/*

# Staging deployment
deploy-staging:
  <<: *helm_template
  stage: deploy-staging
  environment:
    name: staging
    url: https://app-staging.company.com
  script:
    - kubectl config use-context staging-cluster
    - helm upgrade --install $CHART_NAME charts/$CHART_NAME
      --namespace staging
      --create-namespace
      -f environments/staging/values.yaml
      --set app.image.tag=$CI_COMMIT_TAG
      --wait
      --timeout 15m
    - helm test $CHART_NAME -n staging
  rules:
    - if: '$CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+(-rc\.[0-9]+)?$/'
      when: manual

# Production deployment
deploy-production:
  <<: *helm_template
  stage: deploy-production
  environment:
    name: production
    url: https://app.company.com
  script:
    - kubectl config use-context prod-cluster
    # Preview the change set before deploying (requires the helm-diff plugin)
    - helm diff upgrade $CHART_NAME charts/$CHART_NAME
      --namespace production
      -f environments/production/values.yaml
      --set app.image.tag=$CI_COMMIT_TAG
    # Deploy
    - helm upgrade --install $CHART_NAME charts/$CHART_NAME
      --namespace production
      --create-namespace
      -f environments/production/values.yaml
      --set app.image.tag=$CI_COMMIT_TAG
      --wait
      --timeout 20m
      --atomic
    # Post-deployment verification
    - kubectl rollout status deployment/$CHART_NAME -n production
    - helm test $CHART_NAME -n production
  rules:
    - if: '$CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+$/'
      when: manual
      allow_failure: false

# Rollback job
rollback-production:
  <<: *helm_template
  stage: deploy-production
  environment:
    name: production
    url: https://app.company.com
  script:
    - kubectl config use-context prod-cluster
    - helm rollback $CHART_NAME -n production
    - kubectl rollout status deployment/$CHART_NAME -n production
  when: manual
  only:
    - tags
```
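The staging and production jobs above are gated entirely on tag format, and a mistyped regex silently skips a pipeline, so it pays to check the patterns locally before committing. A quick bash sketch (the two patterns are copied from the `rules` above; the sample tags are arbitrary):

```shell
#!/usr/bin/env bash
# Check the tag patterns used by the deploy-staging / deploy-production rules.
staging_re='^v[0-9]+\.[0-9]+\.[0-9]+(-rc\.[0-9]+)?$'
prod_re='^v[0-9]+\.[0-9]+\.[0-9]+$'

for tag in v1.2.3 v1.2.3-rc.1 v1.2 release-1; do
  [[ $tag =~ $staging_re ]] && s=yes || s=no
  [[ $tag =~ $prod_re ]] && p=yes || p=no
  echo "$tag staging=$s production=$p"
done
```

Note that `v1.2.3` matches both patterns, which is intentional: a final release tag flows through staging first and is then promoted to production manually.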
### 10.4.2 GitHub Actions Workflow
```yaml
# .github/workflows/helm-deploy.yml
name: Helm Deploy

on:
  push:
    branches:
      - main
      - develop
    tags:
      - 'v*'
  pull_request:
    branches:
      - main

env:
  CHART_NAME: enterprise-app
  HELM_VERSION: 3.12.0
  KUBECTL_VERSION: 1.27.0

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}
      - name: Add Helm repositories
        run: |
          helm repo add bitnami https://charts.bitnami.com/bitnami
          helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
          helm repo update
      - name: Lint Helm Chart
        run: helm lint charts/${{ env.CHART_NAME }}
      - name: Render Templates
        # Rendering runs client-side on the runner; --validate would need
        # cluster access, which this job does not have.
        run: |
          helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            -f environments/dev/values.yaml
          helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            -f environments/staging/values.yaml
          helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            -f environments/production/values.yaml

  security-scan:
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v3
      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}
      - name: Render templates
        run: |
          helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            -f environments/production/values.yaml > rendered.yaml
      - name: Run Trivy security scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'config'
          scan-ref: 'rendered.yaml'
          format: 'sarif'
          output: 'trivy-results.sarif'
      - name: Upload Trivy scan results
        uses: github/codeql-action/upload-sarif@v2
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

  deploy-dev:
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: github.ref == 'refs/heads/develop'
    environment: development
    steps:
      - uses: actions/checkout@v3
      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: ${{ env.KUBECTL_VERSION }}
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name dev-cluster
      - name: Deploy to Development
        run: |
          helm upgrade --install ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            --namespace dev \
            --create-namespace \
            -f environments/dev/values.yaml \
            --set app.image.tag=${{ github.sha }} \
            --wait \
            --timeout 10m
      - name: Verify deployment
        run: |
          kubectl rollout status deployment/${{ env.CHART_NAME }} -n dev
          kubectl get pods -n dev -l app.kubernetes.io/name=${{ env.CHART_NAME }}

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: startsWith(github.ref, 'refs/tags/v')
    environment: staging
    steps:
      - uses: actions/checkout@v3
      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: ${{ env.KUBECTL_VERSION }}
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name staging-cluster
      - name: Deploy to Staging
        run: |
          helm upgrade --install ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            --namespace staging \
            --create-namespace \
            -f environments/staging/values.yaml \
            --set app.image.tag=${{ github.ref_name }} \
            --wait \
            --timeout 15m
      - name: Run tests
        run: helm test ${{ env.CHART_NAME }} -n staging

  deploy-production:
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: startsWith(github.ref, 'refs/tags/v') && !contains(github.ref, 'rc')
    environment: production
    steps:
      - uses: actions/checkout@v3
      - name: Setup Helm
        uses: azure/setup-helm@v3
        with:
          version: ${{ env.HELM_VERSION }}
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: ${{ env.KUBECTL_VERSION }}
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name prod-cluster
      - name: Deploy to Production
        run: |
          helm upgrade --install ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
            --namespace production \
            --create-namespace \
            -f environments/production/values.yaml \
            --set app.image.tag=${{ github.ref_name }} \
            --wait \
            --timeout 20m \
            --atomic
      - name: Verify deployment
        run: |
          kubectl rollout status deployment/${{ env.CHART_NAME }} -n production
          helm test ${{ env.CHART_NAME }} -n production
      - name: Notify deployment
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          channel: '#deployments'
          text: 'Production deployment completed: ${{ env.CHART_NAME }} ${{ github.ref_name }}'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
## 10.5 Managing Large-Scale Deployments

### 10.5.1 Cluster Management Strategy
```yaml
# cluster-management/cluster-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-management-config
  namespace: kube-system
data:
  cluster-policy.yaml: |
    # Cluster resource policy
    resourceQuotas:
      default:
        requests.cpu: "100m"
        requests.memory: "128Mi"
        limits.cpu: "500m"
        limits.memory: "512Mi"
        persistentvolumeclaims: "10"
        services: "20"
        secrets: "50"
        configmaps: "50"
      production:
        requests.cpu: "500m"
        requests.memory: "512Mi"
        limits.cpu: "2000m"
        limits.memory: "4Gi"
        persistentvolumeclaims: "50"
        services: "100"
        secrets: "200"
        configmaps: "200"
    networkPolicies:
      defaultDeny: true
      allowedNamespaces:
        - kube-system
        - monitoring
        - logging
        - ingress-nginx
    podSecurityStandards:
      enforce: "restricted"
      audit: "restricted"
      warn: "restricted"
    nodeSelectors:
      production:
        node-type: "production"
        instance-type: "c5.xlarge"
      staging:
        node-type: "staging"
        instance-type: "c5.large"
      development:
        node-type: "development"
        instance-type: "t3.medium"
---
# ResourceQuota template
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: "{{ .Values.namespace }}"
spec:
  hard:
    requests.cpu: "{{ .Values.quota.requests.cpu }}"
    requests.memory: "{{ .Values.quota.requests.memory }}"
    limits.cpu: "{{ .Values.quota.limits.cpu }}"
    limits.memory: "{{ .Values.quota.limits.memory }}"
    persistentvolumeclaims: "{{ .Values.quota.pvc }}"
    services: "{{ .Values.quota.services }}"
    secrets: "{{ .Values.quota.secrets }}"
    configmaps: "{{ .Values.quota.configmaps }}"
---
# LimitRange template
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: "{{ .Values.namespace }}"
spec:
  limits:
    - default:
        cpu: "{{ .Values.limits.default.cpu }}"
        memory: "{{ .Values.limits.default.memory }}"
      defaultRequest:
        cpu: "{{ .Values.limits.defaultRequest.cpu }}"
        memory: "{{ .Values.limits.defaultRequest.memory }}"
      max:
        cpu: "{{ .Values.limits.max.cpu }}"
        memory: "{{ .Values.limits.max.memory }}"
      min:
        cpu: "{{ .Values.limits.min.cpu }}"
        memory: "{{ .Values.limits.min.memory }}"
      type: Container
    - max:
        storage: "{{ .Values.limits.storage.max }}"
      min:
        storage: "{{ .Values.limits.storage.min }}"
      type: PersistentVolumeClaim
```
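The two quota templates above are driven entirely by values. A per-team values file feeding them might look like this (the namespace, sizes, and counts are illustrative):

```yaml
# Per-team values for the ResourceQuota / LimitRange templates (illustrative)
namespace: "team-a"
quota:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
  pvc: "20"
  services: "50"
  secrets: "100"
  configmaps: "100"
limits:
  default:
    cpu: "500m"
    memory: "512Mi"
  defaultRequest:
    cpu: "100m"
    memory: "128Mi"
  max:
    cpu: "2"
    memory: "4Gi"
  min:
    cpu: "50m"
    memory: "64Mi"
  storage:
    max: "100Gi"
    min: "1Gi"
```

Rendering the templates once per team namespace keeps quota policy in version control rather than scattered across `kubectl edit` sessions.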
### 10.5.2 Batch Deployment Script
```bash
#!/bin/bash
# scripts/batch-deploy.sh
set -euo pipefail

# Configuration
CONFIG_FILE=${1:-"deployments.yaml"}
MAX_PARALLEL=${2:-5}
TIMEOUT=${3:-600}

# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

# Logging helpers
log() {
  echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}

error() {
  echo -e "${RED}[ERROR]${NC} $1" >&2
}

success() {
  echo -e "${GREEN}[SUCCESS]${NC} $1"
}

warn() {
  echo -e "${YELLOW}[WARNING]${NC} $1"
}

# Check dependencies
check_dependencies() {
  log "Checking dependencies..."

  command -v helm >/dev/null 2>&1 || {
    error "Helm is required but not installed"
    exit 1
  }

  command -v kubectl >/dev/null 2>&1 || {
    error "kubectl is required but not installed"
    exit 1
  }

  command -v yq >/dev/null 2>&1 || {
    error "yq is required but not installed"
    exit 1
  }

  if [ ! -f "$CONFIG_FILE" ]; then
    error "Configuration file not found: $CONFIG_FILE"
    exit 1
  fi

  success "All dependencies are available"
}

# Parse the configuration file
parse_config() {
  log "Parsing configuration file: $CONFIG_FILE"

  # Validate the file format
  yq eval '.deployments | length' "$CONFIG_FILE" >/dev/null || {
    error "Invalid configuration file format"
    exit 1
  }

  DEPLOYMENT_COUNT=$(yq eval '.deployments | length' "$CONFIG_FILE")
  log "Found $DEPLOYMENT_COUNT deployments to process"
}

# Deploy a single application
deploy_app() {
  local index=$1
  local app_name chart_path namespace values_file enabled
  app_name=$(yq eval ".deployments[$index].name" "$CONFIG_FILE")
  chart_path=$(yq eval ".deployments[$index].chart" "$CONFIG_FILE")
  namespace=$(yq eval ".deployments[$index].namespace" "$CONFIG_FILE")
  values_file=$(yq eval ".deployments[$index].values" "$CONFIG_FILE")
  enabled=$(yq eval ".deployments[$index].enabled // true" "$CONFIG_FILE")

  if [ "$enabled" != "true" ]; then
    warn "Skipping disabled deployment: $app_name"
    return 0
  fi

  log "Deploying $app_name to namespace $namespace"

  # Create the namespace idempotently
  kubectl create namespace "$namespace" --dry-run=client -o yaml | kubectl apply -f -

  # Deploy
  local start_time end_time duration
  start_time=$(date +%s)

  if helm upgrade --install "$app_name" "$chart_path" \
      --namespace "$namespace" \
      -f "$values_file" \
      --timeout "${TIMEOUT}s" \
      --wait \
      --atomic; then
    end_time=$(date +%s)
    duration=$((end_time - start_time))
    success "Deployed $app_name successfully in ${duration}s"

    # Verify the rollout
    kubectl rollout status deployment/"$app_name" -n "$namespace" --timeout="${TIMEOUT}s"
    return 0
  else
    error "Failed to deploy $app_name"
    return 1
  fi
}

# Parallel deployment
parallel_deploy() {
  log "Starting parallel deployment with max $MAX_PARALLEL concurrent jobs"

  local pids=()
  local deployed=0
  local failed=0

  for ((i=0; i<DEPLOYMENT_COUNT; i++)); do
    # Wait for a free slot
    while [ ${#pids[@]} -ge "$MAX_PARALLEL" ]; do
      for j in "${!pids[@]}"; do
        if ! kill -0 "${pids[j]}" 2>/dev/null; then
          # Reap the finished job; wait returns its exit code
          if wait "${pids[j]}"; then
            deployed=$((deployed + 1))
          else
            failed=$((failed + 1))
          fi
          unset 'pids[j]'
        fi
      done
      # Re-index the array after removals
      pids=("${pids[@]}")
      sleep 1
    done

    # Launch the next deployment in the background
    deploy_app "$i" &
    pids+=("$!")
  done

  # Wait for the remaining jobs
  for pid in "${pids[@]}"; do
    if wait "$pid"; then
      deployed=$((deployed + 1))
    else
      failed=$((failed + 1))
    fi
  done

  log "Deployment summary: $deployed successful, $failed failed"

  if [ "$failed" -gt 0 ]; then
    error "Some deployments failed"
    return 1
  else
    success "All deployments completed successfully"
    return 0
  fi
}

# Generate a deployment report
generate_report() {
  log "Generating deployment report"

  local report_file="deployment-report-$(date +%Y%m%d-%H%M%S).json"

  cat > "$report_file" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "config_file": "$CONFIG_FILE",
  "total_deployments": $DEPLOYMENT_COUNT,
  "max_parallel": $MAX_PARALLEL,
  "timeout": $TIMEOUT,
  "deployments": [
EOF

  for ((i=0; i<DEPLOYMENT_COUNT; i++)); do
    local app_name namespace enabled
    app_name=$(yq eval ".deployments[$i].name" "$CONFIG_FILE")
    namespace=$(yq eval ".deployments[$i].namespace" "$CONFIG_FILE")
    enabled=$(yq eval ".deployments[$i].enabled // true" "$CONFIG_FILE")

    cat >> "$report_file" << EOF
    {
      "name": "$app_name",
      "namespace": "$namespace",
      "enabled": $enabled,
      "status": "$(kubectl get deployment "$app_name" -n "$namespace" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null || echo 'Unknown')"
    }$([ $i -lt $((DEPLOYMENT_COUNT-1)) ] && echo ",")
EOF
  done

  cat >> "$report_file" << EOF
  ]
}
EOF

  success "Report generated: $report_file"
}

# Main
main() {
  log "Starting batch deployment process"

  check_dependencies
  parse_config

  if parallel_deploy; then
    generate_report
    success "Batch deployment completed successfully"
    exit 0
  else
    generate_report
    error "Batch deployment failed"
    exit 1
  fi
}

# Signal handling
trap 'error "Deployment interrupted"; exit 130' INT TERM

main "$@"
```
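The slot-management loop in `parallel_deploy` is the heart of the script. Stripped of the Helm specifics, the pattern is: launch background jobs while the number of live jobs stays under a cap, then `wait` on each PID to collect exit codes. A self-contained sketch of the same pattern (plain bash, with `sleep` standing in for one deployment):

```shell
#!/usr/bin/env bash
# Bounded parallelism: run at most MAX_PARALLEL background jobs at once.
set -euo pipefail
MAX_PARALLEL=2
pids=()
ok=0

task() { sleep 0.1; }   # stand-in for one helm deployment

for i in 1 2 3 4 5; do
  # Block until a slot is free (jobs -rp lists running background PIDs)
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
    sleep 0.05
  done
  task "$i" &
  pids+=("$!")
done

# Collect results; wait returns each job's exit code even after it finished
for pid in "${pids[@]}"; do
  if wait "$pid"; then
    ok=$((ok + 1))
  fi
done
echo "completed: $ok of ${#pids[@]}"
```

Bash remembers the status of finished background jobs, so waiting on a PID after the job has already exited still yields its exit code; that is what makes the two-phase launch-then-collect structure safe.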
### 10.5.3 Deployment Configuration File
```yaml
# deployments.yaml
deployments:
  # Infrastructure components
  - name: prometheus
    chart: prometheus-community/prometheus
    namespace: monitoring
    values: environments/production/monitoring/prometheus-values.yaml
    enabled: true
    priority: 1
    dependencies: []

  - name: grafana
    chart: grafana/grafana
    namespace: monitoring
    values: environments/production/monitoring/grafana-values.yaml
    enabled: true
    priority: 1
    dependencies: [prometheus]

  - name: elasticsearch
    chart: elastic/elasticsearch
    namespace: logging
    values: environments/production/logging/elasticsearch-values.yaml
    enabled: true
    priority: 1
    dependencies: []

  - name: kibana
    chart: elastic/kibana
    namespace: logging
    values: environments/production/logging/kibana-values.yaml
    enabled: true
    priority: 2
    dependencies: [elasticsearch]

  # Database components
  - name: postgresql
    chart: bitnami/postgresql
    namespace: database
    values: environments/production/database/postgresql-values.yaml
    enabled: true
    priority: 1
    dependencies: []

  - name: redis
    chart: bitnami/redis
    namespace: cache
    values: environments/production/cache/redis-values.yaml
    enabled: true
    priority: 1
    dependencies: []

  # Message queue
  - name: kafka
    chart: bitnami/kafka
    namespace: messaging
    values: environments/production/messaging/kafka-values.yaml
    enabled: true
    priority: 2
    dependencies: []

  # Business applications
  - name: user-service
    chart: charts/user-service
    namespace: applications
    values: environments/production/applications/user-service-values.yaml
    enabled: true
    priority: 3
    dependencies: [postgresql, redis]

  - name: order-service
    chart: charts/order-service
    namespace: applications
    values: environments/production/applications/order-service-values.yaml
    enabled: true
    priority: 3
    dependencies: [postgresql, redis, kafka]

  - name: payment-service
    chart: charts/payment-service
    namespace: applications
    values: environments/production/applications/payment-service-values.yaml
    enabled: true
    priority: 3
    dependencies: [postgresql, redis, kafka]

  - name: notification-service
    chart: charts/notification-service
    namespace: applications
    values: environments/production/applications/notification-service-values.yaml
    enabled: true
    priority: 4
    dependencies: [kafka, redis]

  # API gateway
  - name: api-gateway
    chart: charts/api-gateway
    namespace: gateway
    values: environments/production/gateway/api-gateway-values.yaml
    enabled: true
    priority: 5
    dependencies: [user-service, order-service, payment-service]

  # Frontend application
  - name: web-frontend
    chart: charts/web-frontend
    namespace: frontend
    values: environments/production/frontend/web-frontend-values.yaml
    enabled: true
    priority: 6
    dependencies: [api-gateway]

# Global settings
global:
  timeout: 600
  maxParallel: 5
  retryCount: 3
  retryDelay: 30

  # Health checks
  healthCheck:
    enabled: true
    timeout: 300
    interval: 30

  # Notifications
  notifications:
    slack:
      enabled: true
      webhook: "https://hooks.slack.com/services/..."
      channel: "#deployments"
    email:
      enabled: false
      recipients: ["devops@company.com"]

  # Rollback
  rollback:
    enabled: true
    onFailure: true
    keepHistory: 10
```

Note that `priority` and `dependencies` are declarative metadata here: the batch script above processes entries in file order, so honoring priorities or dependency order would require extending the script, for example by grouping entries by `priority` and deploying one group at a time.
## 10.6 Case Study: E-commerce Platform

### 10.6.1 Architecture Overview
```mermaid
graph TB
    subgraph "Frontend"
        A[Web App]
        B[Mobile App]
        C[Admin Panel]
    end

    subgraph "API Gateway"
        D[Kong/Istio]
    end

    subgraph "Microservices"
        E[User Service]
        F[Product Service]
        G[Order Service]
        H[Payment Service]
        I[Inventory Service]
        J[Notification Service]
    end

    subgraph "Data Layer"
        K[PostgreSQL]
        L[Redis]
        M[Elasticsearch]
        N[MongoDB]
    end

    subgraph "Message Queue"
        O[Kafka]
        P[RabbitMQ]
    end

    subgraph "Infrastructure"
        Q[Prometheus]
        R[Grafana]
        S[ELK Stack]
        T[Jaeger]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    E --> K
    E --> L
    F --> N
    F --> M
    G --> K
    G --> O
    H --> K
    H --> P
    I --> K
    I --> L
    J --> P
    E -.-> Q
    F -.-> Q
    G -.-> Q
    H -.-> Q
    I -.-> Q
    J -.-> Q
```
10.6.2 The E-commerce Platform Helm Chart
# charts/ecommerce-platform/Chart.yaml
apiVersion: v2
name: ecommerce-platform
description: Complete e-commerce platform
type: application
version: 2.0.0
appVersion: "2.0.0"
dependencies:
# Infrastructure
- name: postgresql
version: "12.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled
- name: redis
version: "17.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: redis.enabled
- name: mongodb
version: "13.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: mongodb.enabled
- name: elasticsearch
version: "8.x.x"
repository: "https://helm.elastic.co"
condition: elasticsearch.enabled
- name: kafka
version: "22.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: kafka.enabled
- name: rabbitmq
version: "11.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: rabbitmq.enabled
# Monitoring
- name: prometheus
version: "23.x.x"
repository: "https://prometheus-community.github.io/helm-charts"
condition: monitoring.prometheus.enabled
- name: grafana
version: "6.x.x"
repository: "https://grafana.github.io/helm-charts"
condition: monitoring.grafana.enabled
# Service mesh
- name: istio-base
version: "1.18.x"
repository: "https://istio-release.storage.googleapis.com/charts"
condition: serviceMesh.istio.enabled
- name: istiod
version: "1.18.x"
repository: "https://istio-release.storage.googleapis.com/charts"
condition: serviceMesh.istio.enabled
keywords:
- ecommerce
- microservices
- kubernetes
- helm
- platform
maintainers:
- name: E-commerce Team
email: ecommerce@company.com
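The `version: "12.x.x"` style constraints above pin the major version while floating on minor and patch releases. As a rough intuition for what such a wildcard accepts, here is a minimal bash matcher; it mimics only the simple `N.x.x` form used in this Chart.yaml, not Helm's full semver constraint grammar:

```shell
# Hypothetical helper: does version $2 satisfy an "N.x.x"-style pattern $1?
matches_range() {
  local pattern="${1//./\\.}"    # escape literal dots for the regex
  pattern="${pattern//x/[0-9]+}" # each x matches one numeric component
  [[ "$2" =~ ^${pattern}$ ]]
}

matches_range "12.x.x" "12.4.1" && echo "12.4.1 satisfies 12.x.x"
matches_range "12.x.x" "13.0.0" || echo "13.0.0 rejected (major version differs)"
```

In practice, `helm dependency update` resolves these constraints against the repository index for you; the sketch only illustrates the pinning behavior.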
10.6.3 E-commerce Platform Configuration
# charts/ecommerce-platform/values.yaml
global:
imageRegistry: "registry.company.com"
environment: "production"
domain: "ecommerce.company.com"
# Security configuration
security:
tls:
enabled: true
issuer: "letsencrypt-prod"
oauth:
enabled: true
provider: "auth0"
# Monitoring configuration
monitoring:
enabled: true
namespace: "monitoring"
# Logging configuration
logging:
enabled: true
level: "info"
# Microservice configuration
microservices:
userService:
enabled: true
replicaCount: 3
image:
repository: "user-service"
tag: "v2.0.0"
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
database:
type: "postgresql"
name: "users"
cache:
enabled: true
type: "redis"
productService:
enabled: true
replicaCount: 5
image:
repository: "product-service"
tag: "v2.0.0"
resources:
requests:
memory: "1Gi"
cpu: "1000m"
limits:
memory: "2Gi"
cpu: "2000m"
database:
type: "mongodb"
name: "products"
search:
enabled: true
type: "elasticsearch"
orderService:
enabled: true
replicaCount: 4
image:
repository: "order-service"
tag: "v2.0.0"
resources:
requests:
memory: "768Mi"
cpu: "750m"
limits:
memory: "1.5Gi"
cpu: "1500m"
database:
type: "postgresql"
name: "orders"
messaging:
enabled: true
type: "kafka"
paymentService:
enabled: true
replicaCount: 3
image:
repository: "payment-service"
tag: "v2.0.0"
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
database:
type: "postgresql"
name: "payments"
security:
pci:
enabled: true
level: "level1"
inventoryService:
enabled: true
replicaCount: 2
image:
repository: "inventory-service"
tag: "v2.0.0"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
database:
type: "postgresql"
name: "inventory"
cache:
enabled: true
type: "redis"
notificationService:
enabled: true
replicaCount: 2
image:
repository: "notification-service"
tag: "v2.0.0"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
messaging:
enabled: true
type: "rabbitmq"
providers:
email:
enabled: true
service: "sendgrid"
sms:
enabled: true
service: "twilio"
push:
enabled: true
service: "firebase"
# API gateway configuration
apiGateway:
enabled: true
type: "istio" # kong, nginx, istio
replicaCount: 3
istio:
gateway:
enabled: true
hosts:
- "api.ecommerce.company.com"
- "admin.ecommerce.company.com"
virtualService:
enabled: true
routes:
- match:
- uri:
prefix: "/api/users"
route:
- destination:
host: "user-service"
port:
number: 80
- match:
- uri:
prefix: "/api/products"
route:
- destination:
host: "product-service"
port:
number: 80
- match:
- uri:
prefix: "/api/orders"
route:
- destination:
host: "order-service"
port:
number: 80
- match:
- uri:
prefix: "/api/payments"
route:
- destination:
host: "payment-service"
port:
number: 80
rateLimiting:
enabled: true
requests: 1000
window: "1m"
authentication:
enabled: true
type: "jwt"
issuer: "https://auth.company.com"
# Frontend application configuration
frontend:
webApp:
enabled: true
replicaCount: 3
image:
repository: "web-frontend"
tag: "v2.0.0"
ingress:
enabled: true
hosts:
- "www.ecommerce.company.com"
- "ecommerce.company.com"
mobileApi:
enabled: true
replicaCount: 2
image:
repository: "mobile-api"
tag: "v2.0.0"
ingress:
enabled: true
hosts:
- "mobile-api.ecommerce.company.com"
adminPanel:
enabled: true
replicaCount: 1
image:
repository: "admin-panel"
tag: "v2.0.0"
ingress:
enabled: true
hosts:
- "admin.ecommerce.company.com"
security:
whitelist:
enabled: true
ips:
- "10.0.0.0/8"
- "192.168.0.0/16"
# Database configuration
postgresql:
enabled: true
auth:
postgresPassword: "secure-password"
database: "ecommerce"
primary:
persistence:
enabled: true
size: "500Gi"
storageClass: "fast-ssd"
readReplicas:
replicaCount: 2
persistence:
enabled: true
size: "500Gi"
storageClass: "fast-ssd"
metrics:
enabled: true
serviceMonitor:
enabled: true
redis:
enabled: true
auth:
enabled: true
password: "redis-password"
master:
persistence:
enabled: true
size: "100Gi"
storageClass: "fast-ssd"
replica:
replicaCount: 2
persistence:
enabled: true
size: "100Gi"
storageClass: "fast-ssd"
metrics:
enabled: true
serviceMonitor:
enabled: true
mongodb:
enabled: true
auth:
enabled: true
rootPassword: "mongo-password"
database: "products"
persistence:
enabled: true
size: "1Ti"
storageClass: "fast-ssd"
replicaSet:
enabled: true
replicas:
secondary: 2
arbiter: 1
metrics:
enabled: true
serviceMonitor:
enabled: true
elasticsearch:
enabled: true
clusterName: "ecommerce-search"
nodeGroup: "master"
masterService: "ecommerce-search-master"
roles:
master: "true"
ingest: "true"
data: "true"
replicas: 3
minimumMasterNodes: 2
persistence:
enabled: true
size: "500Gi"
storageClass: "fast-ssd"
resources:
requests:
cpu: "1000m"
memory: "2Gi"
limits:
cpu: "2000m"
memory: "4Gi"
# Message queue configuration
kafka:
enabled: true
replicaCount: 3
auth:
clientProtocol: "sasl"
interBrokerProtocol: "sasl"
persistence:
enabled: true
size: "200Gi"
storageClass: "fast-ssd"
zookeeper:
enabled: true
replicaCount: 3
persistence:
enabled: true
size: "20Gi"
storageClass: "fast-ssd"
metrics:
kafka:
enabled: true
jmx:
enabled: true
rabbitmq:
enabled: true
auth:
username: "admin"
password: "rabbitmq-password"
persistence:
enabled: true
size: "50Gi"
storageClass: "fast-ssd"
clustering:
enabled: true
replicaCount: 3
metrics:
enabled: true
serviceMonitor:
enabled: true
# Monitoring configuration
monitoring:
prometheus:
enabled: true
server:
persistentVolume:
enabled: true
size: "100Gi"
storageClass: "fast-ssd"
retention: "30d"
alertmanager:
enabled: true
persistentVolume:
enabled: true
size: "10Gi"
storageClass: "fast-ssd"
grafana:
enabled: true
persistence:
enabled: true
size: "10Gi"
storageClass: "fast-ssd"
dashboards:
default:
ecommerce:
gnetId: 12345
revision: 1
datasource: Prometheus
sidecar:
dashboards:
enabled: true
searchNamespace: ALL
datasources:
enabled: true
searchNamespace: ALL
# Service mesh configuration
serviceMesh:
istio:
enabled: true
injection:
enabled: true
namespaces:
- "ecommerce"
- "api-gateway"
security:
mtls:
enabled: true
mode: "STRICT"
observability:
tracing:
enabled: true
jaeger:
enabled: true
metrics:
enabled: true
prometheus:
enabled: true
# Backup configuration
backup:
enabled: true
schedule: "0 2 * * *"
retention: "30d"
databases:
postgresql:
enabled: true
method: "pg_dump"
mongodb:
enabled: true
method: "mongodump"
storage:
type: "s3"
bucket: "ecommerce-backups"
region: "us-west-2"
encryption: true
# Disaster recovery configuration
disasterRecovery:
enabled: true
replication:
enabled: true
regions:
primary: "us-west-2"
secondary: "us-east-1"
rto: "4h" # Recovery Time Objective
rpo: "1h" # Recovery Point Objective
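Every microservice in the values file above pins its image to `tag: "v2.0.0"`; before a release it is worth asserting that no service has drifted to a different tag. A quick sketch with awk over a sample fragment (illustrative, not a real YAML parser — for anything beyond this shape, use a proper tool like yq):

```shell
# Sample fragment in the same shape as the values.yaml above (illustrative)
values_snippet='
  userService:
    image:
      repository: "user-service"
      tag: "v2.0.0"
  orderService:
    image:
      repository: "order-service"
      tag: "v2.0.0"
'

# Collect all distinct image tags; more than one distinct tag means drift
tags=$(printf '%s\n' "$values_snippet" | awk -F'"' '/tag:/ {print $2}' | sort -u)
count=$(printf '%s\n' "$tags" | wc -l)
echo "distinct tags: $tags ($count)"
```

A check like this fits naturally as a CI step before `helm upgrade` is ever invoked.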
10.6.4 E-commerce Platform Deployment Script
#!/bin/bash
# scripts/deploy-ecommerce.sh
set -euo pipefail
# Configuration parameters
ENVIRONMENT=${1:-production}
VERSION=${2:-latest}
NAMESPACE=${3:-ecommerce}
DRY_RUN=${4:-false}
# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
log() {
echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}
error() {
echo -e "${RED}[ERROR]${NC} $1" >&2
}
success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
warn() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
# Pre-deployment checks
pre_deployment_checks() {
log "Running pre-deployment checks..."
# Check Helm
if ! command -v helm &> /dev/null; then
error "Helm is not installed"
exit 1
fi
# Check kubectl
if ! command -v kubectl &> /dev/null; then
error "kubectl is not installed"
exit 1
fi
# Check cluster connectivity
if ! kubectl cluster-info &> /dev/null; then
error "Cannot connect to Kubernetes cluster"
exit 1
fi
# Check the namespace
if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
log "Creating namespace: $NAMESPACE"
kubectl create namespace "$NAMESPACE"
fi
# Check the storage class
if ! kubectl get storageclass fast-ssd &> /dev/null; then
error "Required storage class 'fast-ssd' not found"
exit 1
fi
success "Pre-deployment checks passed"
}
# Deploy infrastructure
deploy_infrastructure() {
log "Deploying infrastructure components..."
# Deploy PostgreSQL
log "Deploying PostgreSQL..."
helm upgrade --install postgresql bitnami/postgresql \
--namespace "$NAMESPACE" \
--set auth.postgresPassword="$(kubectl get secret --namespace "$NAMESPACE" postgresql -o jsonpath="{.data.postgres-password}" 2>/dev/null | base64 --decode || echo 'secure-password')" \
--set primary.persistence.size="500Gi" \
--set primary.persistence.storageClass="fast-ssd" \
--wait --timeout 10m
# Deploy Redis
log "Deploying Redis..."
helm upgrade --install redis bitnami/redis \
--namespace "$NAMESPACE" \
--set auth.password="$(kubectl get secret --namespace "$NAMESPACE" redis -o jsonpath="{.data.redis-password}" 2>/dev/null | base64 --decode || echo 'redis-password')" \
--set master.persistence.size="100Gi" \
--set master.persistence.storageClass="fast-ssd" \
--wait --timeout 10m
# Deploy MongoDB
log "Deploying MongoDB..."
helm upgrade --install mongodb bitnami/mongodb \
--namespace "$NAMESPACE" \
--set auth.rootPassword="$(kubectl get secret --namespace "$NAMESPACE" mongodb -o jsonpath="{.data.mongodb-root-password}" 2>/dev/null | base64 --decode || echo 'mongo-password')" \
--set persistence.size="1Ti" \
--set persistence.storageClass="fast-ssd" \
--wait --timeout 15m
# Deploy Elasticsearch
log "Deploying Elasticsearch..."
helm upgrade --install elasticsearch elastic/elasticsearch \
--namespace "$NAMESPACE" \
--set persistence.enabled=true \
--set persistence.size="500Gi" \
--set persistence.storageClass="fast-ssd" \
--set replicas=3 \
--wait --timeout 15m
# Deploy Kafka
log "Deploying Kafka..."
helm upgrade --install kafka bitnami/kafka \
--namespace "$NAMESPACE" \
--set persistence.size="200Gi" \
--set persistence.storageClass="fast-ssd" \
--set zookeeper.persistence.size="20Gi" \
--set zookeeper.persistence.storageClass="fast-ssd" \
--wait --timeout 15m
success "Infrastructure deployment completed"
}
# Deploy microservices
deploy_microservices() {
log "Deploying microservices..."
local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service")
for service in "${services[@]}"; do
log "Deploying $service..."
helm upgrade --install "$service" "charts/$service" \
--namespace "$NAMESPACE" \
-f "environments/$ENVIRONMENT/$service-values.yaml" \
--set image.tag="$VERSION" \
--wait --timeout 10m
# Verify the rollout
kubectl rollout status deployment/"$service" -n "$NAMESPACE" --timeout=300s
done
success "Microservices deployment completed"
}
# Deploy the API gateway
deploy_api_gateway() {
log "Deploying API Gateway..."
# Deploy the Istio gateway via the api-gateway chart
helm upgrade --install api-gateway charts/api-gateway \
--namespace "$NAMESPACE" \
-f "environments/$ENVIRONMENT/api-gateway-values.yaml" \
--set image.tag="$VERSION" \
--wait --timeout 10m
success "API Gateway deployment completed"
}
# Deploy frontend applications
deploy_frontend() {
log "Deploying frontend applications..."
local frontends=("web-frontend" "mobile-api" "admin-panel")
for frontend in "${frontends[@]}"; do
log "Deploying $frontend..."
helm upgrade --install "$frontend" "charts/$frontend" \
--namespace "$NAMESPACE" \
-f "environments/$ENVIRONMENT/$frontend-values.yaml" \
--set image.tag="$VERSION" \
--wait --timeout 10m
done
success "Frontend deployment completed"
}
# Deploy monitoring
deploy_monitoring() {
log "Deploying monitoring stack..."
# Deploy Prometheus
helm upgrade --install prometheus prometheus-community/prometheus \
--namespace monitoring \
--create-namespace \
-f "environments/$ENVIRONMENT/monitoring/prometheus-values.yaml" \
--wait --timeout 10m
# Deploy Grafana
helm upgrade --install grafana grafana/grafana \
--namespace monitoring \
-f "environments/$ENVIRONMENT/monitoring/grafana-values.yaml" \
--wait --timeout 10m
success "Monitoring deployment completed"
}
# Run tests
run_tests() {
log "Running deployment tests..."
# Health checks
local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service")
for service in "${services[@]}"; do
log "Testing $service health..."
if kubectl get pods -n "$NAMESPACE" -l app="$service" | grep -q Running; then
success "$service is running"
else
error "$service is not running properly"
kubectl describe pods -n "$NAMESPACE" -l app="$service"
return 1
fi
done
# API tests
log "Testing API endpoints..."
local api_gateway_ip=$(kubectl get svc api-gateway -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
if [ -n "$api_gateway_ip" ]; then
if curl -f "http://$api_gateway_ip/health" &> /dev/null; then
success "API Gateway health check passed"
else
warn "API Gateway health check failed"
fi
else
warn "API Gateway IP not available yet"
fi
success "Tests completed"
}
# Generate the deployment report
generate_report() {
log "Generating deployment report..."
local report_file="ecommerce-deployment-report-$(date +%Y%m%d-%H%M%S).json"
cat > "$report_file" << EOF
{
"deployment": {
"timestamp": "$(date -Iseconds)",
"environment": "$ENVIRONMENT",
"version": "$VERSION",
"namespace": "$NAMESPACE"
},
"services": [
EOF
local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service" "api-gateway" "web-frontend" "mobile-api" "admin-panel")
for i in "${!services[@]}"; do
local service="${services[$i]}"
local status=$(kubectl get deployment "$service" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null || echo 'Unknown')
local replicas=$(kubectl get deployment "$service" -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0')
cat >> "$report_file" << EOF
{
"name": "$service",
"status": "$status",
"readyReplicas": $replicas
}$([ $i -lt $((${#services[@]}-1)) ] && echo ",")
EOF
done
cat >> "$report_file" << EOF
],
"infrastructure": {
"postgresql": "$(kubectl get statefulset postgresql -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
"redis": "$(kubectl get statefulset redis-master -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
"mongodb": "$(kubectl get statefulset mongodb -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
"elasticsearch": "$(kubectl get statefulset elasticsearch-master -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
"kafka": "$(kubectl get statefulset kafka -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready"
}
}
EOF
success "Report generated: $report_file"
}
# Main
main() {
log "Starting e-commerce platform deployment"
log "Environment: $ENVIRONMENT"
log "Version: $VERSION"
log "Namespace: $NAMESPACE"
log "Dry Run: $DRY_RUN"
if [ "$DRY_RUN" = "true" ]; then
log "Running in dry-run mode"
return 0
fi
pre_deployment_checks
deploy_infrastructure
deploy_microservices
deploy_api_gateway
deploy_frontend
deploy_monitoring
run_tests
generate_report
success "E-commerce platform deployment completed successfully!"
log "Access URLs:"
log " Web App: https://www.ecommerce.company.com"
log " Admin Panel: https://admin.ecommerce.company.com"
log " API Gateway: https://api.ecommerce.company.com"
log " Grafana: https://grafana.monitoring.company.com"
}
# Signal handling
trap 'error "Deployment interrupted"; exit 130' INT TERM
# Run main
main "$@"
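The script above runs each `helm upgrade --install` exactly once and aborts on the first failure. Transient API-server or webhook errors are common in real clusters, so a retry wrapper with a delay is a useful addition. A minimal sketch (the flaky function is a stub standing in for a helm call; names are illustrative):

```shell
# Hypothetical retry helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and fail only after all attempts are exhausted.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Stub that fails twice, then succeeds (stands in for a helm upgrade call)
tries=0
flaky_deploy() {
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}

retry 5 0 flaky_deploy && echo "deploy succeeded after $tries attempts"
```

In the deployment script, each `helm upgrade --install ... --wait` invocation could then be wrapped as `retry 3 30 helm upgrade --install ...`.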
10.7 Operations Automation
10.7.1 Monitoring and Alerting Configuration
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: ecommerce-alerts
namespace: monitoring
labels:
app: prometheus
spec:
groups:
- name: ecommerce.rules
rules:
# Application-level alerts
- alert: ServiceDown
expr: up{job=~".*-service"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
description: "Service {{ $labels.job }} has been down for more than 1 minute."
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}."
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency on {{ $labels.job }}"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}."
# Infrastructure alerts
- alert: DatabaseConnectionHigh
expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High database connections"
description: "Database connection usage is {{ $value | humanizePercentage }}."
- alert: RedisMemoryHigh
expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Redis memory usage high"
description: "Redis memory usage is {{ $value | humanizePercentage }}."
- alert: KafkaConsumerLag
expr: kafka_consumer_lag_sum > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "Kafka consumer lag high"
description: "Kafka consumer lag is {{ $value }} messages."
# Resource alerts
- alert: PodCPUHigh
expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Pod CPU usage high"
description: "Pod {{ $labels.pod }} CPU usage is {{ $value | humanizePercentage }}."
- alert: PodMemoryHigh
expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Pod memory usage high"
description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}."
- alert: PodRestartHigh
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 0m
labels:
severity: warning
annotations:
summary: "Pod restarting frequently"
description: "Pod {{ $labels.pod }} has restarted {{ $value }} times in the last hour."
---
# monitoring/alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-config
namespace: monitoring
type: Opaque
stringData:
alertmanager.yml: |
global:
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alerts@company.com'
slack_api_url: 'https://hooks.slack.com/services/...'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
severity: warning
receiver: 'warning-alerts'
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'critical-alerts'
slack_configs:
- channel: '#critical-alerts'
title: 'CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'danger'
email_configs:
- to: 'oncall@company.com'
subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
body: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'warning-alerts'
slack_configs:
- channel: '#warnings'
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
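Before loading rule files like the one above, `promtool check rules` is the authoritative validator. As a cheap pre-commit sketch, plain grep can at least confirm that every alert carries a severity label, which the Alertmanager routes above depend on (the snippet below is an illustrative fragment, not the full file):

```shell
# Illustrative PrometheusRule fragment in the same shape as the file above
rules_snippet='
- alert: ServiceDown
  labels:
    severity: critical
- alert: HighErrorRate
  labels:
    severity: warning
'

# Every alert should have a matching severity label, or routing silently
# falls through to the default receiver.
alerts=$(printf '%s\n' "$rules_snippet" | grep -c '^- alert:')
severities=$(printf '%s\n' "$rules_snippet" | grep -c 'severity:')
if [ "$alerts" -eq "$severities" ]; then
  echo "every alert has a severity label"
fi
```

This is deliberately shallow; for real validation, run `promtool check rules` and `amtool check-config` in CI.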
10.7.2 Automated Operations Scripts
#!/bin/bash
# scripts/ops-automation.sh
set -euo pipefail
# Configuration
NAMESPACE="ecommerce"
MONITORING_NAMESPACE="monitoring"
LOG_FILE="/var/log/ops-automation.log"
SLACK_WEBHOOK="https://hooks.slack.com/services/..."
# Logging helper
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Slack notification
notify_slack() {
local message="$1"
local color="${2:-good}"
curl -X POST -H 'Content-type: application/json' \
--data "{
\"attachments\": [{
\"color\": \"$color\",
\"text\": \"$message\"
}]
}" \
"$SLACK_WEBHOOK" || true
}
# Health check
health_check() {
log "Running health checks..."
local failed_services=()
local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service")
for service in "${services[@]}"; do
if ! kubectl get pods -n "$NAMESPACE" -l app="$service" | grep -q Running; then
failed_services+=("$service")
fi
done
if [ ${#failed_services[@]} -gt 0 ]; then
local message="Health check failed for services: ${failed_services[*]}"
log "$message"
notify_slack "$message" "danger"
return 1
else
log "All services are healthy"
return 0
fi
}
# Resource cleanup
cleanup_resources() {
log "Cleaning up resources..."
# Remove completed Jobs
kubectl delete jobs --field-selector status.successful=1 -n "$NAMESPACE" --ignore-not-found=true
# Remove failed Pods
kubectl delete pods --field-selector status.phase=Failed -n "$NAMESPACE" --ignore-not-found=true
# Remove evicted Pods
kubectl get pods -n "$NAMESPACE" | grep Evicted | awk '{print $1}' | xargs -r kubectl delete pod -n "$NAMESPACE"
log "Resource cleanup completed"
}
# Back up databases
backup_databases() {
log "Starting database backup..."
local backup_date=$(date +%Y%m%d-%H%M%S)
# PostgreSQL backup (pg_dumpall writes to stdout, so the dump is captured locally)
kubectl exec -n "$NAMESPACE" postgresql-0 -- pg_dumpall -U postgres > "/tmp/postgresql-backup-$backup_date.sql"
# MongoDB backup: stream an archive to stdout so the dump lands on this host
# rather than inside the pod's filesystem
kubectl exec -n "$NAMESPACE" mongodb-0 -- mongodump --archive > "/tmp/mongodb-backup-$backup_date.archive"
# Upload to S3
aws s3 cp "/tmp/postgresql-backup-$backup_date.sql" "s3://ecommerce-backups/postgresql/"
aws s3 cp "/tmp/mongodb-backup-$backup_date.archive" "s3://ecommerce-backups/mongodb/"
# Remove local files
rm -f "/tmp/postgresql-backup-$backup_date.sql" "/tmp/mongodb-backup-$backup_date.archive"
log "Database backup completed"
}
# Performance optimization
performance_optimization() {
log "Running performance optimization..."
# Check resource usage. kubectl top reports CPU in millicores and memory in Mi;
# awk's numeric coercion drops the unit suffix, so the thresholds below mean
# 800m CPU and 800Mi memory.
local high_cpu_pods=$(kubectl top pods -n "$NAMESPACE" --no-headers | awk '$2+0 > 800 {print $1}')
local high_memory_pods=$(kubectl top pods -n "$NAMESPACE" --no-headers | awk '$3+0 > 800 {print $1}')
if [ -n "$high_cpu_pods" ]; then
log "High CPU usage detected in pods: $high_cpu_pods"
notify_slack "High CPU usage detected in pods: $high_cpu_pods" "warning"
fi
if [ -n "$high_memory_pods" ]; then
log "High memory usage detected in pods: $high_memory_pods"
notify_slack "High memory usage detected in pods: $high_memory_pods" "warning"
fi
# Auto-scaling check: flag HPAs whose current replica count (column 6) is at
# or above 80% of MAXPODS (column 5)
local hpa_status=$(kubectl get hpa -n "$NAMESPACE" --no-headers | awk '$6 >= $5*0.8 {print $1}')
if [ -n "$hpa_status" ]; then
log "HPA scaling triggered for: $hpa_status"
notify_slack "Auto-scaling triggered for: $hpa_status" "good"
fi
log "Performance optimization completed"
}
# Security scan
security_scan() {
log "Running security scan..."
# Check for TLS certificates expiring within the next 7 days (604800 seconds)
local expiring_certs=$(kubectl get certificates -n "$NAMESPACE" -o json | jq -r '.items[] | select(.status.notAfter | fromdateiso8601 < (now + 604800)) | .metadata.name')
if [ -n "$expiring_certs" ]; then
log "TLS certificates expiring soon: $expiring_certs"
notify_slack "TLS certificates expiring soon: $expiring_certs" "warning"
fi
# Check for security policy violations (pods not enforcing runAsNonRoot)
local policy_violations=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r '.items[] | select(.spec.securityContext.runAsNonRoot != true) | .metadata.name')
if [ -n "$policy_violations" ]; then
log "Security policy violations detected: $policy_violations"
notify_slack "Security policy violations detected: $policy_violations" "danger"
fi
log "Security scan completed"
}
# Log rotation
log_rotation() {
log "Performing log rotation..."
# Compress logs older than 7 days
find /var/log -name "*.log" -mtime +7 -exec gzip {} \;
# Delete compressed logs older than 30 days
find /var/log -name "*.log.gz" -mtime +30 -delete
log "Log rotation completed"
}
# Main
main() {
local operation="${1:-all}"
case $operation in
health)
health_check
;;
cleanup)
cleanup_resources
;;
backup)
backup_databases
;;
performance)
performance_optimization
;;
security)
security_scan
;;
logs)
log_rotation
;;
all)
health_check
cleanup_resources
performance_optimization
security_scan
log_rotation
;;
*)
echo "Usage: $0 {health|cleanup|backup|performance|security|logs|all}"
exit 1
;;
esac
}
# Run main
main "$@"
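The script is designed to be invoked by cron with one subcommand per schedule. A crontab along these lines would run the cheap checks frequently and the heavy jobs off-hours (the install path `/opt/scripts/` is an assumption; adjust to where the script actually lives):

```shell
# Hypothetical crontab for ops-automation.sh; paths are illustrative.
# Written to a temp file here rather than installed with `crontab`.
cat > /tmp/ops-crontab <<'EOF'
*/15 * * * * /opt/scripts/ops-automation.sh health
0 * * * *    /opt/scripts/ops-automation.sh cleanup
0 2 * * *    /opt/scripts/ops-automation.sh backup
0 3 * * 0    /opt/scripts/ops-automation.sh security
EOF

entries=$(grep -c 'ops-automation.sh' /tmp/ops-crontab)
echo "$entries scheduled jobs"
```

On a real host you would install it with `crontab /tmp/ops-crontab`, or better, run the same subcommands as Kubernetes CronJobs so the automation lives inside the cluster it manages.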
10.8 Troubleshooting
10.8.1 Diagnosing Common Issues
#!/bin/bash
# scripts/troubleshoot.sh
set -euo pipefail
NAMESPACE="ecommerce"
# Diagnose pod issues
diagnose_pods() {
echo "=== Pod Diagnostics ==="
# Check failed pods
echo "Failed Pods:"
kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Failed
# Check pods with high restart counts (echo -e so \n is interpreted)
echo -e "\nPods with high restart count:"
kubectl get pods -n "$NAMESPACE" --sort-by='.status.containerStatuses[0].restartCount' | tail -10
# Check resource usage
echo -e "\nTop resource consuming pods:"
kubectl top pods -n "$NAMESPACE" --sort-by=cpu
kubectl top pods -n "$NAMESPACE" --sort-by=memory
}
# Diagnose service issues
diagnose_services() {
echo -e "\n=== Service Diagnostics ==="
# Check service endpoints
echo "Service endpoints:"
kubectl get endpoints -n "$NAMESPACE"
# Check service connectivity
echo -e "\nService connectivity test:"
local services=("user-service" "product-service" "order-service")
for service in "${services[@]}"; do
echo "Testing $service..."
kubectl run test-pod --image=curlimages/curl --rm -i --restart=Never -- \
curl -m 5 "http://$service.$NAMESPACE.svc.cluster.local/health" || echo "$service unreachable"
done
}
# Diagnose network issues
diagnose_network() {
echo -e "\n=== Network Diagnostics ==="
# Check network policies
echo "Network policies:"
kubectl get networkpolicies -n "$NAMESPACE"
# Check DNS resolution
echo -e "\nDNS resolution test:"
kubectl run dns-test --image=busybox --rm -i --restart=Never -- \
nslookup kubernetes.default.svc.cluster.local
# Check ingress
echo -e "\nIngress status:"
kubectl get ingress -n "$NAMESPACE"
kubectl describe ingress -n "$NAMESPACE"
}
# Diagnose storage issues
diagnose_storage() {
echo -e "\n=== Storage Diagnostics ==="
# Check PVC status
echo "PVC status:"
kubectl get pvc -n "$NAMESPACE"
# Check storage classes
echo -e "\nStorage classes:"
kubectl get storageclass
# Check volume usage
echo -e "\nVolume usage:"
kubectl exec -n "$NAMESPACE" postgresql-0 -- df -h /bitnami/postgresql || echo "PostgreSQL volume check failed"
kubectl exec -n "$NAMESPACE" redis-master-0 -- df -h /data || echo "Redis volume check failed"
}
# Diagnose database issues
diagnose_databases() {
echo -e "\n=== Database Diagnostics ==="
# PostgreSQL diagnostics
echo "PostgreSQL status:"
kubectl exec -n "$NAMESPACE" postgresql-0 -- psql -U postgres -c "SELECT version();" || echo "PostgreSQL connection failed"
kubectl exec -n "$NAMESPACE" postgresql-0 -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;" || echo "PostgreSQL activity check failed"
# Redis diagnostics
echo -e "\nRedis status:"
kubectl exec -n "$NAMESPACE" redis-master-0 -- redis-cli ping || echo "Redis connection failed"
kubectl exec -n "$NAMESPACE" redis-master-0 -- redis-cli info memory || echo "Redis memory check failed"
# MongoDB diagnostics (recent MongoDB images ship mongosh instead of the legacy mongo shell)
echo -e "\nMongoDB status:"
kubectl exec -n "$NAMESPACE" mongodb-0 -- mongosh --eval "db.adminCommand('ismaster')" || echo "MongoDB connection failed"
}
# Generate a diagnostic report
generate_report() {
local report_file="troubleshoot-report-$(date +%Y%m%d-%H%M%S).txt"
{
echo "=== Troubleshooting Report ==="
echo "Generated at: $(date)"
echo "Namespace: $NAMESPACE"
echo ""
diagnose_pods
diagnose_services
diagnose_network
diagnose_storage
diagnose_databases
echo -e "\n=== Cluster Events ==="
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -20
echo -e "\n=== Cluster Info ==="
kubectl cluster-info
kubectl get nodes
kubectl top nodes
} > "$report_file"
echo -e "\nDiagnostic report generated: $report_file"
}
# Main
main() {
local component="${1:-all}"
case $component in
pods)
diagnose_pods
;;
services)
diagnose_services
;;
network)
diagnose_network
;;
storage)
diagnose_storage
;;
databases)
diagnose_databases
;;
report)
generate_report
;;
all)
diagnose_pods
diagnose_services
diagnose_network
diagnose_storage
diagnose_databases
generate_report
;;
*)
echo "Usage: $0 {pods|services|network|storage|databases|report|all}"
exit 1
;;
esac
}
main "$@"
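The diagnostics above shell out to a live cluster. When triaging offline from captured `kubectl get pods` output (for example, pasted into an incident ticket), the same pod-state classification can be done with awk; the sample output below is illustrative:

```shell
# Captured `kubectl get pods` output (illustrative sample)
pods_output='NAME                READY  STATUS            RESTARTS  AGE
user-service-abc    1/1    Running           0         2d
order-service-def   0/1    CrashLoopBackOff  12        3h
payment-service-x   1/1    Running           1         2d'

# Skip the header row and print every pod whose STATUS column is not Running
crashing=$(printf '%s\n' "$pods_output" | awk 'NR > 1 && $3 != "Running" {print $1}')
echo "unhealthy pods: $crashing"
```

Note this flags any non-Running state (Pending, Evicted, CrashLoopBackOff, ...), which matches how the health checks in the scripts above grep for `Running`.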
10.9 Hands-on Exercises
Exercise 1: Enterprise Chart Design
Goal: design a complete enterprise-grade microservices Chart
Requirements:
1. Include at least five microservices
2. Support multi-environment configuration
3. Include complete monitoring and logging configuration
4. Implement autoscaling
5. Include security configuration and network policies
Steps:
# 1. Create the Chart structure
helm create enterprise-microservices
# 2. Design the dependencies
# Edit Chart.yaml and add database, cache, message queue, and other dependencies
# 3. Configure multi-environment values
# Create an environments/ directory with dev, staging, and production configs
# 4. Implement the templates
# Create Deployment, Service, and Ingress templates for each microservice
# 5. Add monitoring configuration
# Create ServiceMonitor, PrometheusRule, and other monitoring resources
# 6. Test the deployment
helm install enterprise-app ./enterprise-microservices -f environments/dev/values.yaml
Exercise 2: CI/CD Pipeline Integration
Goal: build a complete Helm CI/CD pipeline
Requirements:
1. Automated Chart validation and testing
2. Automated multi-environment deployment
3. A rollback mechanism
4. Notifications and reporting
Exercise 3: Large-scale Deployment Management
Goal: implement batch application deployment and management
Requirements:
1. Parallel deployment
2. Dependency management
3. Deployment status monitoring
4. Failure handling and retries
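For the parallel-deployment requirement of exercise 3, `xargs -P` is a dependency-free starting point; the sketch below stubs the helm call with `echo`, and in a real pipeline you would substitute `helm upgrade --install` (service names are illustrative):

```shell
services="user-service product-service order-service payment-service"

# Run up to 3 "deployments" concurrently; replace echo with the real
# helm upgrade --install invocation in your pipeline. Output order is
# nondeterministic under -P, so we sort before inspecting it.
results=$(printf '%s\n' $services | xargs -P 3 -I {} echo "deployed {}" | sort)
echo "$results"
```

This only parallelizes independent releases; combining it with the priority/dependency ordering from the batch manifest earlier in the chapter gives you parallelism within each priority tier.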
10.10 Chapter Summary
Through real enterprise-grade case studies, this chapter explored how Helm is applied in large organizations. We covered:
Core Content Review
Enterprise architecture design
- Best practices for Chart organization
- Enterprise template design principles
- Managing complex dependency relationships
Multi-environment management strategies
- Separating environment configuration
- Configuration inheritance and overrides
- Environment-specific deployment strategies
CI/CD integration
- GitLab CI/CD pipeline design
- GitHub Actions workflows
- Automated testing and deployment
Large-scale deployment management
- Cluster resource management
- Batch deployment scripts
- Parallel deployment strategies
Operations automation
- Monitoring and alerting configuration
- Automated operations scripts
- Performance optimization and security scanning
Troubleshooting
- Methods for diagnosing common issues
- Automated fault detection
- Diagnostic report generation
Best Practice Summary
Design principles
- Modularity and reusability
- Externalized configuration
- Security first
- Observability
Operational practices
- Infrastructure as code
- Continuous integration and deployment
- Monitoring-driven operations
- Automation first
Team collaboration
- Clear division of responsibilities
- Standardized processes
- Documentation and knowledge sharing
- Continuous improvement
Key points for enterprise applications
- Scalability: design architectures that support large-scale deployment
- Reliability: implement high availability and fault tolerance
- Security: integrate comprehensive security controls
- Maintainability: build a complete operations framework
- Compliance: meet enterprise governance requirements
With this chapter, you have acquired the key skills and best practices for implementing Helm successfully in an enterprise environment. This knowledge will help you design and manage complex Kubernetes application deployments in your real-world work.
Congratulations on completing the Helm tutorial!
From basic concepts to enterprise-grade applications, you have covered every aspect of Helm. You can now: create and manage complex Helm Charts; implement enterprise deployment strategies; integrate CI/CD pipelines; manage applications at scale; and automate operations.
Keep practicing and exploring, apply this knowledge to your own projects, and become an expert in Kubernetes and Helm!