10.1 Chapter Overview

This chapter demonstrates how Helm is applied in large organizations through realistic enterprise-grade scenarios. We will learn how to design, deploy, and manage Helm charts in complex enterprise environments, covering key practices such as multi-environment management, CI/CD integration, large-scale deployment, and operations automation.

Learning Objectives

  • Master enterprise-grade Helm architecture design principles
  • Learn multi-environment management and configuration strategies
  • Understand Helm integration in CI/CD pipelines
  • Master large-scale deployment and operations automation
  • Learn enterprise monitoring and governance practices
  • Understand best practices through realistic case studies

Chapter Structure

graph TB
    A[Enterprise Helm Practices] --> B[Architecture Design]
    A --> C[Multi-Environment Management]
    A --> D[CI/CD Integration]
    A --> E[Large-Scale Deployment]
    A --> F[Operations Automation]
    A --> G[Case Studies]

    B --> B1[Chart Organization]
    B --> B2[Dependency Management]
    B --> B3[Version Control]

    C --> C1[Environment Isolation]
    C --> C2[Configuration Management]
    C --> C3[Deployment Strategy]

    D --> D1[GitOps Workflow]
    D --> D2[Automated Testing]
    D --> D3[Release Management]

    E --> E1[Cluster Management]
    E --> E2[Resource Scheduling]
    E --> E3[Performance Optimization]

    F --> F1[Monitoring and Alerting]
    F --> F2[Log Management]
    F --> F3[Failure Recovery]

    G --> G1[E-commerce Platform]
    G --> G2[Financial System]
    G --> G3[IoT Platform]

10.2 Enterprise Architecture Design

10.2.1 Chart Organization

enterprise-charts/
├── platform/                    # Platform base components
│   ├── monitoring/
│   │   ├── prometheus/
│   │   ├── grafana/
│   │   └── alertmanager/
│   ├── logging/
│   │   ├── elasticsearch/
│   │   ├── logstash/
│   │   └── kibana/
│   ├── security/
│   │   ├── vault/
│   │   ├── cert-manager/
│   │   └── oauth2-proxy/
│   └── networking/
│       ├── ingress-nginx/
│       ├── istio/
│       └── calico/
├── applications/                 # Business applications
│   ├── user-service/
│   ├── order-service/
│   ├── payment-service/
│   └── notification-service/
├── shared/                      # Shared components
│   ├── database/
│   │   ├── postgresql/
│   │   ├── redis/
│   │   └── mongodb/
│   ├── messaging/
│   │   ├── kafka/
│   │   ├── rabbitmq/
│   │   └── nats/
│   └── storage/
│       ├── minio/
│       └── ceph/
├── environments/                # Environment configurations
│   ├── dev/
│   ├── staging/
│   ├── production/
│   └── dr/                     # Disaster recovery environment
└── umbrella/                   # Umbrella charts
    ├── platform-stack/
    ├── application-stack/
    └── full-stack/
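
A hedged sketch of how a chart under umbrella/ can aggregate the platform components above as local subchart dependencies; the file:// paths and component names here are illustrative assumptions, not something this layout prescribes:

```yaml
# umbrella/platform-stack/Chart.yaml (illustrative sketch)
apiVersion: v2
name: platform-stack
description: Aggregates platform components as local subchart dependencies
type: application
version: 0.1.0
dependencies:
  # file:// references resolve relative to this Chart.yaml
  - name: prometheus
    version: "*"
    repository: "file://../../platform/monitoring/prometheus"
    condition: prometheus.enabled
  - name: ingress-nginx
    version: "*"
    repository: "file://../../platform/networking/ingress-nginx"
    condition: ingress-nginx.enabled
```

Running `helm dependency update umbrella/platform-stack` then vendors the subcharts into its charts/ directory before packaging.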

10.2.2 Enterprise Chart Template

# charts/enterprise-app/Chart.yaml
apiVersion: v2
name: enterprise-app
description: Enterprise-grade application template
type: application
version: 1.0.0
appVersion: "1.0.0"

# Enterprise dependency management
dependencies:
  # Monitoring
  - name: prometheus
    version: "15.x.x"
    repository: "https://prometheus-community.github.io/helm-charts"
    condition: monitoring.prometheus.enabled

  # Logging
  - name: elasticsearch
    version: "7.x.x"
    repository: "https://helm.elastic.co"
    condition: logging.elasticsearch.enabled

  # Database
  - name: postgresql
    version: "11.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: database.postgresql.enabled

  # Cache
  - name: redis
    version: "16.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: cache.redis.enabled

  # Message queue
  - name: kafka
    version: "18.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: messaging.kafka.enabled

# Enterprise annotations
annotations:
  category: "Enterprise Application"
  licenses: "Apache-2.0"
  images: |
    - name: app
      image: docker.io/company/enterprise-app:1.0.0
    - name: sidecar
      image: docker.io/company/sidecar:1.0.0

# Keywords
keywords:
  - enterprise
  - microservices
  - cloud-native
  - kubernetes
  - helm

# Maintainers
maintainers:
  - name: Platform Team
    email: platform@company.com
    url: https://platform.company.com
  - name: DevOps Team
    email: devops@company.com
    url: https://devops.company.com

# Home page and sources
home: https://company.com/enterprise-app
sources:
  - https://github.com/company/enterprise-app
  - https://github.com/company/enterprise-charts

# Kubernetes version compatibility
kubeVersion: ">=1.20.0-0"

10.2.3 Enterprise Values Structure

# charts/enterprise-app/values.yaml
# Global configuration
global:
  # Image registry configuration
  imageRegistry: "registry.company.com"
  imagePullSecrets:
    - name: "company-registry-secret"
  
  # Storage class configuration
  storageClass: "company-ssd"
  
  # Networking configuration
  networkPolicy:
    enabled: true
    type: "calico"
  
  # Security configuration
  security:
    podSecurityPolicy: "restricted"
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  
  # Monitoring configuration
  monitoring:
    enabled: true
    namespace: "monitoring"
    serviceMonitor:
      enabled: true
      interval: "30s"
  
  # Logging configuration
  logging:
    enabled: true
    level: "info"
    format: "json"
    destination: "elasticsearch"

# Application configuration
app:
  name: "enterprise-app"
  version: "1.0.0"
  
  # Image configuration
  image:
    repository: "enterprise-app"
    tag: "1.0.0"
    pullPolicy: "IfNotPresent"
  
  # Replicas
  replicaCount: 3
  
  # Resources
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "1000m"
  
  # Health checks
  healthCheck:
    enabled: true
    livenessProbe:
      httpGet:
        path: "/health/live"
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: "/health/ready"
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
  
  # Environment variables
  env:
    - name: "ENVIRONMENT"
      value: "production"
    - name: "LOG_LEVEL"
      value: "info"
    - name: "DATABASE_URL"
      valueFrom:
        secretKeyRef:
          name: "database-credentials"
          key: "url"

# Service configuration
service:
  type: "ClusterIP"
  port: 80
  targetPort: 8080
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"

# Ingress configuration
ingress:
  enabled: true
  className: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "company-ca-issuer"
  hosts:
    - host: "app.company.com"
      paths:
        - path: "/"
          pathType: "Prefix"
  tls:
    - secretName: "app-tls-secret"
      hosts:
        - "app.company.com"

# Autoscaling
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: "Percent"
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: "Percent"
          value: 50
          periodSeconds: 60

# Persistent storage
persistence:
  enabled: true
  storageClass: "company-ssd"
  accessMode: "ReadWriteOnce"
  size: "10Gi"
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: "ebs.csi.aws.com"

# Database configuration
database:
  postgresql:
    enabled: true
    auth:
      existingSecret: "postgresql-credentials"
    primary:
      persistence:
        enabled: true
        size: "100Gi"
        storageClass: "company-ssd"
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true

# Cache configuration
cache:
  redis:
    enabled: true
    auth:
      enabled: true
      existingSecret: "redis-credentials"
    master:
      persistence:
        enabled: true
        size: "20Gi"
        storageClass: "company-ssd"
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true

# Message queue configuration
messaging:
  kafka:
    enabled: true
    auth:
      clientProtocol: "sasl"
      existingSecret: "kafka-credentials"
    persistence:
      enabled: true
      size: "100Gi"
      storageClass: "company-ssd"
    metrics:
      kafka:
        enabled: true
      jmx:
        enabled: true

# Monitoring configuration
monitoring:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
      interval: "30s"
      scrapeTimeout: "10s"
      labels:
        app: "enterprise-app"
        team: "platform"
  
  grafana:
    enabled: true
    dashboards:
      enabled: true
      configMapName: "enterprise-app-dashboards"

# Logging configuration
logging:
  elasticsearch:
    enabled: true
    index: "enterprise-app"
    template: "enterprise-app-template"
  
  fluentd:
    enabled: true
    configMap: "enterprise-app-fluentd-config"

# Security configuration
security:
  networkPolicy:
    enabled: true
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                name: "ingress-nginx"
        ports:
          - protocol: "TCP"
            port: 8080
      - from:
          - namespaceSelector:
              matchLabels:
                name: "monitoring"
        ports:
          - protocol: "TCP"
            port: 8080
    egress:
      - to:
          - namespaceSelector:
              matchLabels:
                name: "database"
        ports:
          - protocol: "TCP"
            port: 5432
      - to:
          - namespaceSelector:
              matchLabels:
                name: "cache"
        ports:
          - protocol: "TCP"
            port: 6379
  
  podSecurityPolicy:
    enabled: true
    name: "enterprise-app-psp"
  
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/enterprise-app-role"

# Backup configuration
backup:
  enabled: true
  schedule: "0 2 * * *"
  retention: "30d"
  storage:
    type: "s3"
    bucket: "company-backups"
    prefix: "enterprise-app"

10.3 Multi-Environment Management Strategy

10.3.1 Separating Environment Configuration

# environments/dev/values.yaml
# Development environment configuration
global:
  imageRegistry: "dev-registry.company.com"
  environment: "development"

app:
  replicaCount: 1
  image:
    tag: "dev-latest"
  resources:
    requests:
      memory: "256Mi"
      cpu: "250m"
    limits:
      memory: "512Mi"
      cpu: "500m"

autoscaling:
  enabled: false

database:
  postgresql:
    primary:
      persistence:
        size: "10Gi"

cache:
  redis:
    master:
      persistence:
        size: "5Gi"

monitoring:
  prometheus:
    enabled: false

logging:
  elasticsearch:
    enabled: false

---
# environments/staging/values.yaml
# Staging environment configuration
global:
  imageRegistry: "staging-registry.company.com"
  environment: "staging"

app:
  replicaCount: 2
  image:
    tag: "staging-v1.0.0"
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "1000m"

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 5

database:
  postgresql:
    primary:
      persistence:
        size: "50Gi"

cache:
  redis:
    master:
      persistence:
        size: "10Gi"

monitoring:
  prometheus:
    enabled: true

logging:
  elasticsearch:
    enabled: true

---
# environments/production/values.yaml
# Production environment configuration
global:
  imageRegistry: "prod-registry.company.com"
  environment: "production"

app:
  replicaCount: 5
  image:
    tag: "v1.0.0"
  resources:
    requests:
      memory: "1Gi"
      cpu: "1000m"
    limits:
      memory: "2Gi"
      cpu: "2000m"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50

database:
  postgresql:
    primary:
      persistence:
        size: "500Gi"
    readReplicas:
      replicaCount: 2
      persistence:
        size: "500Gi"

cache:
  redis:
    master:
      persistence:
        size: "100Gi"
    replica:
      replicaCount: 2
      persistence:
        size: "100Gi"

monitoring:
  prometheus:
    enabled: true
    retention: "30d"

logging:
  elasticsearch:
    enabled: true
    retention: "90d"

backup:
  enabled: true
  schedule: "0 2 * * *"
  retention: "90d"
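
These layered files work because passing multiple values files to Helm (`-f base.yaml -f env.yaml`) deep-merges them left to right: nested maps merge recursively and later files win on conflicts. A minimal Python sketch of that merge rule, as an illustration of the semantics only (not Helm's actual implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into base the way layered -f files combine:
    nested dicts merge recursively; scalars and lists from override win."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

# Base chart values vs. production overrides (abridged from the files above)
base = {"app": {"replicaCount": 3,
                "image": {"tag": "1.0.0", "pullPolicy": "IfNotPresent"}}}
prod = {"app": {"replicaCount": 5, "image": {"tag": "v1.0.0"}}}

merged = deep_merge(base, prod)
print(merged["app"]["replicaCount"])         # → 5 (override wins)
print(merged["app"]["image"]["pullPolicy"])  # → IfNotPresent (kept from base)
```

The same rule explains why the production file only needs to state the keys it changes.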

10.3.2 Environment Deployment Script

#!/bin/bash
# scripts/deploy.sh

set -euo pipefail

# Parameters
ENVIRONMENT=${1:-dev}
CHART_NAME=${2:-enterprise-app}
NAMESPACE=${3:-default}
VERSION=${4:-latest}

# Color output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

echo -e "${GREEN}Starting deployment to ${ENVIRONMENT} environment${NC}"

# Validate the environment
case $ENVIRONMENT in
  dev|staging|production)
    echo -e "${GREEN}Environment: $ENVIRONMENT${NC}"
    ;;
  *)
    echo -e "${RED}Error: Invalid environment. Use dev, staging, or production${NC}"
    exit 1
    ;;
esac

# Select the Kubernetes context
case $ENVIRONMENT in
  dev)
    KUBE_CONTEXT="dev-cluster"
    REGISTRY="dev-registry.company.com"
    ;;
  staging)
    KUBE_CONTEXT="staging-cluster"
    REGISTRY="staging-registry.company.com"
    ;;
  production)
    KUBE_CONTEXT="prod-cluster"
    REGISTRY="prod-registry.company.com"
    ;;
esac

echo -e "${YELLOW}Switching to Kubernetes context: $KUBE_CONTEXT${NC}"
kubectl config use-context $KUBE_CONTEXT

# Create the namespace if it does not exist
echo -e "${YELLOW}Ensuring namespace exists: $NAMESPACE${NC}"
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -

# Add labels
kubectl label namespace $NAMESPACE environment=$ENVIRONMENT --overwrite
kubectl label namespace $NAMESPACE managed-by=helm --overwrite

# Update Helm repositories
echo -e "${YELLOW}Updating Helm repositories${NC}"
helm repo update

# Validate the chart
echo -e "${YELLOW}Validating Helm chart${NC}"
helm lint charts/$CHART_NAME

# Test template rendering (helm template already renders offline; --dry-run is an install/upgrade flag)
echo -e "${YELLOW}Testing template rendering${NC}"
helm template $CHART_NAME charts/$CHART_NAME \
  -f environments/$ENVIRONMENT/values.yaml \
  --namespace $NAMESPACE > /tmp/rendered-$ENVIRONMENT.yaml

# Security checks
echo -e "${YELLOW}Running security checks${NC}"
kubesec scan /tmp/rendered-$ENVIRONMENT.yaml

# Pre-deployment checks
echo -e "${YELLOW}Pre-deployment checks${NC}"

# Check resource quotas
kubectl describe quota -n $NAMESPACE || echo "No resource quota found"

# Check the storage class
kubectl get storageclass company-ssd || {
  echo -e "${RED}Error: Required storage class 'company-ssd' not found${NC}"
  exit 1
}

# Check image registry connectivity
echo -e "${YELLOW}Checking image registry connectivity${NC}"
docker pull $REGISTRY/enterprise-app:$VERSION || {
  echo -e "${RED}Error: Cannot pull image from registry${NC}"
  exit 1
}

# Confirm the deployment
if [ "$ENVIRONMENT" = "production" ]; then
  echo -e "${RED}WARNING: You are about to deploy to PRODUCTION!${NC}"
  read -p "Are you sure you want to continue? (yes/no): " -r
  if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
    echo -e "${YELLOW}Deployment cancelled${NC}"
    exit 0
  fi
fi

# Run the deployment
echo -e "${GREEN}Deploying $CHART_NAME to $ENVIRONMENT${NC}"

helm upgrade --install $CHART_NAME charts/$CHART_NAME \
  --namespace $NAMESPACE \
  --create-namespace \
  -f environments/$ENVIRONMENT/values.yaml \
  --set app.image.tag=$VERSION \
  --set global.environment=$ENVIRONMENT \
  --timeout 10m \
  --wait \
  --atomic

# Post-deployment verification
echo -e "${YELLOW}Post-deployment verification${NC}"

# Check pod status
echo "Checking pod status..."
kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=$CHART_NAME

# Check service status
echo "Checking service status..."
kubectl get svc -n $NAMESPACE -l app.kubernetes.io/name=$CHART_NAME

# Health check
echo "Performing health check..."
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=$CHART_NAME -n $NAMESPACE --timeout=300s

# Run tests
if [ -f "tests/$ENVIRONMENT-tests.yaml" ]; then
  echo -e "${YELLOW}Running environment-specific tests${NC}"
  helm test $CHART_NAME -n $NAMESPACE
fi

# Deployment success notification
echo -e "${GREEN}Deployment completed successfully!${NC}"
echo -e "${GREEN}Chart: $CHART_NAME${NC}"
echo -e "${GREEN}Environment: $ENVIRONMENT${NC}"
echo -e "${GREEN}Namespace: $NAMESPACE${NC}"
echo -e "${GREEN}Version: $VERSION${NC}"

# Print access information
echo -e "${YELLOW}Access Information:${NC}"
kubectl get ingress -n $NAMESPACE -l app.kubernetes.io/name=$CHART_NAME

# Clean up temporary files
rm -f /tmp/rendered-$ENVIRONMENT.yaml

echo -e "${GREEN}Deployment script completed${NC}"

10.4 CI/CD Integration

10.4.1 GitLab CI/CD Pipeline

# .gitlab-ci.yml
stages:
  - validate
  - test
  - build
  - security
  - deploy-dev
  - deploy-staging
  - deploy-production

variables:
  CHART_NAME: "enterprise-app"
  DOCKER_REGISTRY: "registry.company.com"
  HELM_VERSION: "3.12.0"
  KUBECTL_VERSION: "1.27.0"

# Template definition
.helm_template: &helm_template
  image: alpine/helm:$HELM_VERSION
  before_script:
    - apk add --no-cache curl
    - curl -LO "https://dl.k8s.io/release/v$KUBECTL_VERSION/bin/linux/amd64/kubectl"
    - chmod +x kubectl && mv kubectl /usr/local/bin/
    - helm repo add bitnami https://charts.bitnami.com/bitnami
    - helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    - helm repo update
    # the helm-diff plugin is required by the diff step in deploy-production
    - helm plugin install https://github.com/databus23/helm-diff

# Chart validation
validate-chart:
  <<: *helm_template
  stage: validate
  script:
    - helm lint charts/$CHART_NAME
    - helm template $CHART_NAME charts/$CHART_NAME > /dev/null
  rules:
    - changes:
        - charts/**/*
        - environments/**/*

# Dependency check
check-dependencies:
  <<: *helm_template
  stage: validate
  script:
    - cd charts/$CHART_NAME
    - helm dependency update
    - helm dependency build
  artifacts:
    paths:
      - charts/$CHART_NAME/charts/
    expire_in: 1 hour
  rules:
    - changes:
        - charts/**/Chart.yaml
        - charts/**/requirements.yaml

# Security scan
security-scan:
  image: aquasec/trivy:latest
  stage: security
  script:
    # trivy's misconfiguration scanner can render Helm charts itself,
    # so no helm binary is needed in this image
    - trivy config --exit-code 1 charts/$CHART_NAME
  rules:
    - changes:
        - charts/**/*

# Chart tests
test-chart:
  <<: *helm_template
  stage: test
  script:
    - helm install $CHART_NAME-test charts/$CHART_NAME 
        --dry-run --debug 
        -f environments/dev/values.yaml
    - helm template $CHART_NAME charts/$CHART_NAME 
        -f environments/staging/values.yaml 
        --validate
    - helm template $CHART_NAME charts/$CHART_NAME 
        -f environments/production/values.yaml 
        --validate
  rules:
    - changes:
        - charts/**/*
        - environments/**/*

# Build and push the chart
build-chart:
  <<: *helm_template
  stage: build
  script:
    # chart versions must be plain SemVer, so strip the leading "v" from the git tag
    - helm package charts/$CHART_NAME --version ${CI_COMMIT_TAG#v}
    - curl --data-binary "@$CHART_NAME-${CI_COMMIT_TAG#v}.tgz" 
        "$CHART_REPOSITORY_URL/api/charts"
  artifacts:
    paths:
      - "*.tgz"
    expire_in: 1 week
  only:
    - tags

# Development deployment
deploy-dev:
  <<: *helm_template
  stage: deploy-dev
  environment:
    name: development
    url: https://app-dev.company.com
  script:
    - kubectl config use-context dev-cluster
    - helm upgrade --install $CHART_NAME charts/$CHART_NAME
        --namespace dev
        --create-namespace
        -f environments/dev/values.yaml
        --set app.image.tag=$CI_COMMIT_SHA
        --wait
        --timeout 10m
    - kubectl rollout status deployment/$CHART_NAME -n dev
  rules:
    - if: '$CI_COMMIT_BRANCH == "develop"'
      changes:
        - charts/**/*
        - environments/dev/**/*

# Staging deployment
deploy-staging:
  <<: *helm_template
  stage: deploy-staging
  environment:
    name: staging
    url: https://app-staging.company.com
  script:
    - kubectl config use-context staging-cluster
    - helm upgrade --install $CHART_NAME charts/$CHART_NAME
        --namespace staging
        --create-namespace
        -f environments/staging/values.yaml
        --set app.image.tag=$CI_COMMIT_TAG
        --wait
        --timeout 15m
    - helm test $CHART_NAME -n staging
  rules:
    - if: '$CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+(-rc\.[0-9]+)?$/'
      when: manual

# Production deployment
deploy-production:
  <<: *helm_template
  stage: deploy-production
  environment:
    name: production
    url: https://app.company.com
  script:
    - kubectl config use-context prod-cluster
    # Pre-deployment diff check
    - helm diff upgrade $CHART_NAME charts/$CHART_NAME
        --namespace production
        -f environments/production/values.yaml
        --set app.image.tag=$CI_COMMIT_TAG
    # Run the deployment
    - helm upgrade --install $CHART_NAME charts/$CHART_NAME
        --namespace production
        --create-namespace
        -f environments/production/values.yaml
        --set app.image.tag=$CI_COMMIT_TAG
        --wait
        --timeout 20m
        --atomic
    # Post-deployment verification
    - kubectl rollout status deployment/$CHART_NAME -n production
    - helm test $CHART_NAME -n production
  rules:
    - if: '$CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+$/'
      when: manual
  allow_failure: false

# Rollback job
rollback-production:
  <<: *helm_template
  stage: deploy-production
  environment:
    name: production
    url: https://app.company.com
  script:
    - kubectl config use-context prod-cluster
    - helm rollback $CHART_NAME -n production
    - kubectl rollout status deployment/$CHART_NAME -n production
  when: manual
  only:
    - tags

10.4.2 GitHub Actions Workflow

# .github/workflows/helm-deploy.yml
name: Helm Deploy

on:
  push:
    branches:
      - main
      - develop
    tags:
      - 'v*'
  pull_request:
    branches:
      - main

env:
  CHART_NAME: enterprise-app
  HELM_VERSION: 3.12.0
  KUBECTL_VERSION: 1.27.0

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Helm
      uses: azure/setup-helm@v3
      with:
        version: ${{ env.HELM_VERSION }}
    
    - name: Add Helm repositories
      run: |
        helm repo add bitnami https://charts.bitnami.com/bitnami
        helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
        helm repo update
    
    - name: Lint Helm Chart
      run: helm lint charts/${{ env.CHART_NAME }}
    
    - name: Validate Templates
      run: |
        helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          -f environments/dev/values.yaml --validate
        helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          -f environments/staging/values.yaml --validate
        helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          -f environments/production/values.yaml --validate

  security-scan:
    runs-on: ubuntu-latest
    needs: validate
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Helm
      uses: azure/setup-helm@v3
      with:
        version: ${{ env.HELM_VERSION }}
    
    - name: Render templates
      run: |
        helm template ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          -f environments/production/values.yaml > rendered.yaml
    
    - name: Run Trivy security scan
      uses: aquasecurity/trivy-action@master
      with:
        scan-type: 'config'
        scan-ref: 'rendered.yaml'
        format: 'sarif'
        output: 'trivy-results.sarif'
    
    - name: Upload Trivy scan results
      uses: github/codeql-action/upload-sarif@v2
      if: always()
      with:
        sarif_file: 'trivy-results.sarif'

  deploy-dev:
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: github.ref == 'refs/heads/develop'
    environment: development
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Helm
      uses: azure/setup-helm@v3
      with:
        version: ${{ env.HELM_VERSION }}
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
      with:
        version: ${{ env.KUBECTL_VERSION }}
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-west-2
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name dev-cluster
    
    - name: Deploy to Development
      run: |
        helm upgrade --install ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          --namespace dev \
          --create-namespace \
          -f environments/dev/values.yaml \
          --set app.image.tag=${{ github.sha }} \
          --wait \
          --timeout 10m
    
    - name: Verify deployment
      run: |
        kubectl rollout status deployment/${{ env.CHART_NAME }} -n dev
        kubectl get pods -n dev -l app.kubernetes.io/name=${{ env.CHART_NAME }}

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: startsWith(github.ref, 'refs/tags/v')
    environment: staging
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Helm
      uses: azure/setup-helm@v3
      with:
        version: ${{ env.HELM_VERSION }}
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
      with:
        version: ${{ env.KUBECTL_VERSION }}
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-west-2
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name staging-cluster
    
    - name: Deploy to Staging
      run: |
        helm upgrade --install ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          --namespace staging \
          --create-namespace \
          -f environments/staging/values.yaml \
          --set app.image.tag=${{ github.ref_name }} \
          --wait \
          --timeout 15m
    
    - name: Run tests
      run: helm test ${{ env.CHART_NAME }} -n staging

  deploy-production:
    runs-on: ubuntu-latest
    needs: [validate, security-scan]
    if: startsWith(github.ref, 'refs/tags/v') && !contains(github.ref, 'rc')
    environment: production
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Helm
      uses: azure/setup-helm@v3
      with:
        version: ${{ env.HELM_VERSION }}
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
      with:
        version: ${{ env.KUBECTL_VERSION }}
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-west-2
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name prod-cluster
    
    - name: Deploy to Production
      run: |
        helm upgrade --install ${{ env.CHART_NAME }} charts/${{ env.CHART_NAME }} \
          --namespace production \
          --create-namespace \
          -f environments/production/values.yaml \
          --set app.image.tag=${{ github.ref_name }} \
          --wait \
          --timeout 20m \
          --atomic
    
    - name: Verify deployment
      run: |
        kubectl rollout status deployment/${{ env.CHART_NAME }} -n production
        helm test ${{ env.CHART_NAME }} -n production
    
    - name: Notify deployment
      uses: 8398a7/action-slack@v3
      with:
        status: ${{ job.status }}
        channel: '#deployments'
        text: 'Production deployment completed: ${{ env.CHART_NAME }} ${{ github.ref_name }}'
      env:
        SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

10.5 Managing Deployments at Scale

10.5.1 Cluster Management Strategy

# cluster-management/cluster-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-management-config
  namespace: kube-system
data:
  cluster-policy.yaml: |
    # Cluster resource policies
    resourceQuotas:
      default:
        requests.cpu: "100m"
        requests.memory: "128Mi"
        limits.cpu: "500m"
        limits.memory: "512Mi"
        persistentvolumeclaims: "10"
        services: "20"
        secrets: "50"
        configmaps: "50"
      
      production:
        requests.cpu: "500m"
        requests.memory: "512Mi"
        limits.cpu: "2000m"
        limits.memory: "4Gi"
        persistentvolumeclaims: "50"
        services: "100"
        secrets: "200"
        configmaps: "200"
    
    networkPolicies:
      defaultDeny: true
      allowedNamespaces:
        - kube-system
        - monitoring
        - logging
        - ingress-nginx
    
    podSecurityStandards:
      enforce: "restricted"
      audit: "restricted"
      warn: "restricted"
    
    nodeSelectors:
      production:
        node-type: "production"
        instance-type: "c5.xlarge"
      staging:
        node-type: "staging"
        instance-type: "c5.large"
      development:
        node-type: "development"
        instance-type: "t3.medium"

---
# ResourceQuota template
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: "{{ .Values.namespace }}"
spec:
  hard:
    requests.cpu: "{{ .Values.quota.requests.cpu }}"
    requests.memory: "{{ .Values.quota.requests.memory }}"
    limits.cpu: "{{ .Values.quota.limits.cpu }}"
    limits.memory: "{{ .Values.quota.limits.memory }}"
    persistentvolumeclaims: "{{ .Values.quota.pvc }}"
    services: "{{ .Values.quota.services }}"
    secrets: "{{ .Values.quota.secrets }}"
    configmaps: "{{ .Values.quota.configmaps }}"

---
# LimitRange template
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: "{{ .Values.namespace }}"
spec:
  limits:
  - default:
      cpu: "{{ .Values.limits.default.cpu }}"
      memory: "{{ .Values.limits.default.memory }}"
    defaultRequest:
      cpu: "{{ .Values.limits.defaultRequest.cpu }}"
      memory: "{{ .Values.limits.defaultRequest.memory }}"
    max:
      cpu: "{{ .Values.limits.max.cpu }}"
      memory: "{{ .Values.limits.max.memory }}"
    min:
      cpu: "{{ .Values.limits.min.cpu }}"
      memory: "{{ .Values.limits.min.memory }}"
    type: Container
  - max:
      storage: "{{ .Values.limits.storage.max }}"
    min:
      storage: "{{ .Values.limits.storage.min }}"
    type: PersistentVolumeClaim
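
A hedged example of the values these two templates consume; the numbers are illustrative, and the key paths simply mirror the `{{ .Values.* }}` references above:

```yaml
# values for the ResourceQuota / LimitRange templates (illustrative)
namespace: team-a
quota:
  requests:
    cpu: "10"
    memory: "20Gi"
  limits:
    cpu: "20"
    memory: "40Gi"
  pvc: "20"
  services: "50"
  secrets: "100"
  configmaps: "100"
limits:
  default:
    cpu: "500m"
    memory: "512Mi"
  defaultRequest:
    cpu: "100m"
    memory: "128Mi"
  max:
    cpu: "2"
    memory: "4Gi"
  min:
    cpu: "50m"
    memory: "64Mi"
  storage:
    max: "100Gi"
    min: "1Gi"
```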

10.5.2 Batch Deployment Script
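
The script below reads its work list from a deployments.yaml whose shape is implied by its `yq` queries (`.deployments[].name/chart/namespace/values/enabled`). A hedged example of that format, with names that are purely illustrative:

```yaml
# deployments.yaml (field names inferred from the yq queries in the script)
deployments:
  - name: user-service
    chart: charts/user-service
    namespace: services
    values: environments/production/values.yaml
    enabled: true
  - name: order-service
    chart: charts/order-service
    namespace: services
    values: environments/production/values.yaml
    enabled: false   # skipped by the script
```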

#!/bin/bash
# scripts/batch-deploy.sh

set -euo pipefail

# Parameters
CONFIG_FILE=${1:-"deployments.yaml"}
MAX_PARALLEL=${2:-5}
TIMEOUT=${3:-600}

# Color output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

# Logging helpers
log() {
    echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}

error() {
    echo -e "${RED}[ERROR]${NC} $1" >&2
}

success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

warn() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

# Check dependencies
check_dependencies() {
    log "Checking dependencies..."
    
    command -v helm >/dev/null 2>&1 || {
        error "Helm is required but not installed"
        exit 1
    }
    
    command -v kubectl >/dev/null 2>&1 || {
        error "kubectl is required but not installed"
        exit 1
    }
    
    command -v yq >/dev/null 2>&1 || {
        error "yq is required but not installed"
        exit 1
    }
    
    if [ ! -f "$CONFIG_FILE" ]; then
        error "Configuration file not found: $CONFIG_FILE"
        exit 1
    fi
    
    success "All dependencies are available"
}

# Parse the configuration file
parse_config() {
    log "Parsing configuration file: $CONFIG_FILE"
    
    # Validate the configuration file format
    yq eval '.deployments | length' "$CONFIG_FILE" >/dev/null || {
        error "Invalid configuration file format"
        exit 1
    }
    
    DEPLOYMENT_COUNT=$(yq eval '.deployments | length' "$CONFIG_FILE")
    log "Found $DEPLOYMENT_COUNT deployments to process"
}

# Deploy a single application
deploy_app() {
    local index=$1
    local app_name=$(yq eval ".deployments[$index].name" "$CONFIG_FILE")
    local chart_path=$(yq eval ".deployments[$index].chart" "$CONFIG_FILE")
    local namespace=$(yq eval ".deployments[$index].namespace" "$CONFIG_FILE")
    local values_file=$(yq eval ".deployments[$index].values" "$CONFIG_FILE")
    local enabled=$(yq eval ".deployments[$index].enabled // true" "$CONFIG_FILE")
    
    if [ "$enabled" != "true" ]; then
        warn "Skipping disabled deployment: $app_name"
        return 0
    fi
    
    log "Deploying $app_name to namespace $namespace"
    
    # Create the namespace
    kubectl create namespace "$namespace" --dry-run=client -o yaml | kubectl apply -f -
    
    # Run the deployment
    local start_time=$(date +%s)
    
    if helm upgrade --install "$app_name" "$chart_path" \
        --namespace "$namespace" \
        -f "$values_file" \
        --timeout "${TIMEOUT}s" \
        --wait \
        --atomic; then
        
        local end_time=$(date +%s)
        local duration=$((end_time - start_time))
        success "Deployed $app_name successfully in ${duration}s"
        
        # Verify the deployment
        kubectl rollout status deployment/"$app_name" -n "$namespace" --timeout="${TIMEOUT}s"
        
        return 0
    else
        error "Failed to deploy $app_name"
        return 1
    fi
}

# 并行部署
parallel_deploy() {
    log "Starting parallel deployment with max $MAX_PARALLEL concurrent jobs"
    
    local pids=()
    local results=()
    local deployed=0
    local failed=0
    
    for ((i=0; i<DEPLOYMENT_COUNT; i++)); do
        # 等待空闲槽位
        while [ ${#pids[@]} -ge $MAX_PARALLEL ]; do
            for j in "${!pids[@]}"; do
                if ! kill -0 "${pids[j]}" 2>/dev/null; then
                    wait "${pids[j]}"
                    local exit_code=$?
                    
                    if [ $exit_code -eq 0 ]; then
                        # 注意:((deployed++)) 在初值为 0 时返回非零,会触发 set -e 退出,故用显式赋值
                        deployed=$((deployed + 1))
                    else
                        failed=$((failed + 1))
                    fi
                    
                    unset pids[j]
                fi
            done
            
            # 重新索引数组
            pids=("${pids[@]}")
            sleep 1
        done
        
        # 启动新的部署任务
        deploy_app "$i" &
        pids+=("$!")
    done
    
    # 等待所有任务完成
    for pid in "${pids[@]}"; do
        wait "$pid"
        local exit_code=$?
        
        if [ $exit_code -eq 0 ]; then
            deployed=$((deployed + 1))
        else
            failed=$((failed + 1))
        fi
    done
    
    log "Deployment summary: $deployed successful, $failed failed"
    
    if [ $failed -gt 0 ]; then
        error "Some deployments failed"
        return 1
    else
        success "All deployments completed successfully"
        return 0
    fi
}

# 生成部署报告
generate_report() {
    log "Generating deployment report"
    
    local report_file="deployment-report-$(date +%Y%m%d-%H%M%S).json"
    
    cat > "$report_file" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "config_file": "$CONFIG_FILE",
  "total_deployments": $DEPLOYMENT_COUNT,
  "max_parallel": $MAX_PARALLEL,
  "timeout": $TIMEOUT,
  "deployments": [
EOF
    
    for ((i=0; i<DEPLOYMENT_COUNT; i++)); do
        local app_name=$(yq eval ".deployments[$i].name" "$CONFIG_FILE")
        local namespace=$(yq eval ".deployments[$i].namespace" "$CONFIG_FILE")
        local enabled=$(yq eval ".deployments[$i].enabled // true" "$CONFIG_FILE")
        
        cat >> "$report_file" << EOF
    {
      "name": "$app_name",
      "namespace": "$namespace",
      "enabled": $enabled,
      "status": "$(kubectl get deployment "$app_name" -n "$namespace" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null || echo 'Unknown')"
    }$([ $i -lt $((DEPLOYMENT_COUNT-1)) ] && echo ",")
EOF
    done
    
    cat >> "$report_file" << EOF
  ]
}
EOF
    
    success "Report generated: $report_file"
}

# 主函数
main() {
    log "Starting batch deployment process"
    
    check_dependencies
    parse_config
    
    if parallel_deploy; then
        generate_report
        success "Batch deployment completed successfully"
        exit 0
    else
        generate_report
        error "Batch deployment failed"
        exit 1
    fi
}

# 信号处理
trap 'error "Deployment interrupted"; exit 130' INT TERM

# 执行主函数
main "$@"
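上面的 parallel_deploy 通过轮询 pids 数组控制并发上限;在 bash 4.3 及以上版本中,`wait -n` 可以等待"任意一个"后台任务结束,写法更简洁。下面是一个独立的示意片段(用 sleep 模拟部署任务,`fake_deploy` 为假设的占位函数):

```shell
#!/usr/bin/env bash
# 示意:用 wait -n 实现受限并发(需要 bash >= 4.3)
set -euo pipefail

MAX_PARALLEL=3

fake_deploy() {
    # 占位任务,实际场景中这里是 helm upgrade --install
    sleep 0.1
}

for i in $(seq 1 8); do
    # 当正在运行的后台任务数达到上限时,等待任意一个结束
    while (( $(jobs -rp | wc -l) >= MAX_PARALLEL )); do
        wait -n
    done
    fake_deploy &
done

wait    # 等待剩余所有任务结束
echo "all jobs done"
```

这种写法把"等待空闲槽位"交给 shell 内建的作业控制,省去了手动维护和重建 pids 数组的开销;若还需要统计失败数,可在 `wait -n` 后检查其退出码。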

10.5.3 部署配置文件

# deployments.yaml
deployments:
  # 基础设施组件
  - name: prometheus
    chart: prometheus-community/prometheus
    namespace: monitoring
    values: environments/production/monitoring/prometheus-values.yaml
    enabled: true
    priority: 1
    dependencies: []
  
  - name: grafana
    chart: grafana/grafana
    namespace: monitoring
    values: environments/production/monitoring/grafana-values.yaml
    enabled: true
    priority: 1
    dependencies: [prometheus]
  
  - name: elasticsearch
    chart: elastic/elasticsearch
    namespace: logging
    values: environments/production/logging/elasticsearch-values.yaml
    enabled: true
    priority: 1
    dependencies: []
  
  - name: kibana
    chart: elastic/kibana
    namespace: logging
    values: environments/production/logging/kibana-values.yaml
    enabled: true
    priority: 2
    dependencies: [elasticsearch]
  
  # 数据库组件
  - name: postgresql
    chart: bitnami/postgresql
    namespace: database
    values: environments/production/database/postgresql-values.yaml
    enabled: true
    priority: 1
    dependencies: []
  
  - name: redis
    chart: bitnami/redis
    namespace: cache
    values: environments/production/cache/redis-values.yaml
    enabled: true
    priority: 1
    dependencies: []
  
  # 消息队列
  - name: kafka
    chart: bitnami/kafka
    namespace: messaging
    values: environments/production/messaging/kafka-values.yaml
    enabled: true
    priority: 2
    dependencies: []
  
  # 业务应用
  - name: user-service
    chart: charts/user-service
    namespace: applications
    values: environments/production/applications/user-service-values.yaml
    enabled: true
    priority: 3
    dependencies: [postgresql, redis]
  
  - name: order-service
    chart: charts/order-service
    namespace: applications
    values: environments/production/applications/order-service-values.yaml
    enabled: true
    priority: 3
    dependencies: [postgresql, redis, kafka]
  
  - name: payment-service
    chart: charts/payment-service
    namespace: applications
    values: environments/production/applications/payment-service-values.yaml
    enabled: true
    priority: 3
    dependencies: [postgresql, redis, kafka]
  
  - name: notification-service
    chart: charts/notification-service
    namespace: applications
    values: environments/production/applications/notification-service-values.yaml
    enabled: true
    priority: 4
    dependencies: [kafka, redis]
  
  # API 网关
  - name: api-gateway
    chart: charts/api-gateway
    namespace: gateway
    values: environments/production/gateway/api-gateway-values.yaml
    enabled: true
    priority: 5
    dependencies: [user-service, order-service, payment-service]
  
  # 前端应用
  - name: web-frontend
    chart: charts/web-frontend
    namespace: frontend
    values: environments/production/frontend/web-frontend-values.yaml
    enabled: true
    priority: 6
    dependencies: [api-gateway]

# 全局配置
global:
  timeout: 600
  maxParallel: 5
  retryCount: 3
  retryDelay: 30
  
  # 健康检查配置
  healthCheck:
    enabled: true
    timeout: 300
    interval: 30
  
  # 通知配置
  notifications:
    slack:
      enabled: true
      webhook: "https://hooks.slack.com/services/..."
      channel: "#deployments"
    email:
      enabled: false
      recipients: ["devops@company.com"]
  
  # 回滚配置
  rollback:
    enabled: true
    onFailure: true
    keepHistory: 10
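配置中的 priority 字段可用来把部署划分成串行执行的批次:同一优先级内并行,批次之间依序等待。下面用一个纯 bash 的示意骨架演示这种调度逻辑(数据硬编码作演示,实际可用 yq 从 deployments.yaml 读出;`deploy` 为假设的占位函数):

```shell
#!/usr/bin/env bash
# 示意:按 priority 分批调度,批内并行、批间串行
set -euo pipefail

# name:priority 对,与 deployments.yaml 中的字段对应(此处硬编码作演示)
entries=(
  "prometheus:1" "postgresql:1" "redis:1"
  "kibana:2" "kafka:2"
  "user-service:3"
)

deploy() {
    # 占位:实际为 helm upgrade --install "$1" ...
    echo "deploy $1"
}

# 收集出现过的优先级并升序遍历
priorities=$(printf '%s\n' "${entries[@]}" | cut -d: -f2 | sort -nu)

for p in $priorities; do
    for e in "${entries[@]}"; do
        name=${e%%:*}
        prio=${e##*:}
        if [ "$prio" = "$p" ]; then
            deploy "$name" &    # 同批次并行
        fi
    done
    wait    # 本批全部完成后才进入下一优先级
done
```

相比逐条解析 dependencies 做完整拓扑排序,按 priority 分批是一种更易维护的折中:依赖关系由人工体现在批次编号里。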

10.6 案例研究:电商平台

10.6.1 架构概述

graph TB
    subgraph "Frontend"
        A[Web App]
        B[Mobile App]
        C[Admin Panel]
    end
    
    subgraph "API Gateway"
        D[Kong/Istio]
    end
    
    subgraph "Microservices"
        E[User Service]
        F[Product Service]
        G[Order Service]
        H[Payment Service]
        I[Inventory Service]
        J[Notification Service]
    end
    
    subgraph "Data Layer"
        K[PostgreSQL]
        L[Redis]
        M[Elasticsearch]
        N[MongoDB]
    end
    
    subgraph "Message Queue"
        O[Kafka]
        P[RabbitMQ]
    end
    
    subgraph "Infrastructure"
        Q[Prometheus]
        R[Grafana]
        S[ELK Stack]
        T[Jaeger]
    end
    
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    
    E --> K
    E --> L
    F --> N
    F --> M
    G --> K
    G --> O
    H --> K
    H --> P
    I --> K
    I --> L
    J --> P
    
    E -.-> Q
    F -.-> Q
    G -.-> Q
    H -.-> Q
    I -.-> Q
    J -.-> Q

10.6.2 电商平台 Helm Chart

# charts/ecommerce-platform/Chart.yaml
apiVersion: v2
name: ecommerce-platform
description: Complete e-commerce platform
type: application
version: 2.0.0
appVersion: "2.0.0"

dependencies:
  # 基础设施
  - name: postgresql
    version: "12.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
  
  - name: redis
    version: "17.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
  
  - name: mongodb
    version: "13.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: mongodb.enabled
  
  - name: elasticsearch
    version: "8.x.x"
    repository: "https://helm.elastic.co"
    condition: elasticsearch.enabled
  
  - name: kafka
    version: "22.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: kafka.enabled
  
  - name: rabbitmq
    version: "11.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: rabbitmq.enabled
  
  # 监控
  - name: prometheus
    version: "23.x.x"
    repository: "https://prometheus-community.github.io/helm-charts"
    condition: monitoring.prometheus.enabled
  
  - name: grafana
    version: "6.x.x"
    repository: "https://grafana.github.io/helm-charts"
    condition: monitoring.grafana.enabled
  
  # 服务网格
  - name: istio-base
    version: "1.18.x"
    repository: "https://istio-release.storage.googleapis.com/charts"
    condition: serviceMesh.istio.enabled
  
  - name: istiod
    version: "1.18.x"
    repository: "https://istio-release.storage.googleapis.com/charts"
    condition: serviceMesh.istio.enabled

keywords:
  - ecommerce
  - microservices
  - kubernetes
  - helm
  - platform

maintainers:
  - name: E-commerce Team
    email: ecommerce@company.com
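声明上述 dependencies 后,Chart 并不会自动携带子 chart 包,安装或打包前需要先拉取依赖。常用命令示意如下(chart 路径为示例):

```shell
# 拉取/更新 Chart.yaml 中声明的依赖,生成 Chart.lock 和 charts/ 目录
helm dependency update charts/ecommerce-platform

# 根据 Chart.lock 复现已锁定的依赖版本(适合 CI 环境)
helm dependency build charts/ecommerce-platform

# 查看依赖解析状态
helm dependency list charts/ecommerce-platform

# 通过 condition 关闭某个子 chart,例如禁用 rabbitmq
helm install ecommerce charts/ecommerce-platform --set rabbitmq.enabled=false
```

版本范围(如 "12.x.x")在 `dependency update` 时解析为具体版本并写入 Chart.lock,CI 中使用 `dependency build` 可保证构建可复现。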

10.6.3 电商平台配置

# charts/ecommerce-platform/values.yaml
global:
  imageRegistry: "registry.company.com"
  environment: "production"
  domain: "ecommerce.company.com"
  
  # 安全配置
  security:
    tls:
      enabled: true
      issuer: "letsencrypt-prod"
    oauth:
      enabled: true
      provider: "auth0"
  
  # 监控配置
  monitoring:
    enabled: true
    namespace: "monitoring"
  
  # 日志配置
  logging:
    enabled: true
    level: "info"

# 微服务配置
microservices:
  userService:
    enabled: true
    replicaCount: 3
    image:
      repository: "user-service"
      tag: "v2.0.0"
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
    database:
      type: "postgresql"
      name: "users"
    cache:
      enabled: true
      type: "redis"
  
  productService:
    enabled: true
    replicaCount: 5
    image:
      repository: "product-service"
      tag: "v2.0.0"
    resources:
      requests:
        memory: "1Gi"
        cpu: "1000m"
      limits:
        memory: "2Gi"
        cpu: "2000m"
    database:
      type: "mongodb"
      name: "products"
    search:
      enabled: true
      type: "elasticsearch"
  
  orderService:
    enabled: true
    replicaCount: 4
    image:
      repository: "order-service"
      tag: "v2.0.0"
    resources:
      requests:
        memory: "768Mi"
        cpu: "750m"
      limits:
        memory: "1.5Gi"
        cpu: "1500m"
    database:
      type: "postgresql"
      name: "orders"
    messaging:
      enabled: true
      type: "kafka"
  
  paymentService:
    enabled: true
    replicaCount: 3
    image:
      repository: "payment-service"
      tag: "v2.0.0"
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
    database:
      type: "postgresql"
      name: "payments"
    security:
      pci:
        enabled: true
        level: "level1"
  
  inventoryService:
    enabled: true
    replicaCount: 2
    image:
      repository: "inventory-service"
      tag: "v2.0.0"
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
    database:
      type: "postgresql"
      name: "inventory"
    cache:
      enabled: true
      type: "redis"
  
  notificationService:
    enabled: true
    replicaCount: 2
    image:
      repository: "notification-service"
      tag: "v2.0.0"
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
    messaging:
      enabled: true
      type: "rabbitmq"
    providers:
      email:
        enabled: true
        service: "sendgrid"
      sms:
        enabled: true
        service: "twilio"
      push:
        enabled: true
        service: "firebase"

# API 网关配置
apiGateway:
  enabled: true
  type: "istio"  # kong, nginx, istio
  replicaCount: 3
  
  istio:
    gateway:
      enabled: true
      hosts:
        - "api.ecommerce.company.com"
        - "admin.ecommerce.company.com"
    virtualService:
      enabled: true
      routes:
        - match:
            - uri:
                prefix: "/api/users"
          route:
            - destination:
                host: "user-service"
                port:
                  number: 80
        - match:
            - uri:
                prefix: "/api/products"
          route:
            - destination:
                host: "product-service"
                port:
                  number: 80
        - match:
            - uri:
                prefix: "/api/orders"
          route:
            - destination:
                host: "order-service"
                port:
                  number: 80
        - match:
            - uri:
                prefix: "/api/payments"
          route:
            - destination:
                host: "payment-service"
                port:
                  number: 80
  
  rateLimiting:
    enabled: true
    requests: 1000
    window: "1m"
  
  authentication:
    enabled: true
    type: "jwt"
    issuer: "https://auth.company.com"

# 前端应用配置
frontend:
  webApp:
    enabled: true
    replicaCount: 3
    image:
      repository: "web-frontend"
      tag: "v2.0.0"
    ingress:
      enabled: true
      hosts:
        - "www.ecommerce.company.com"
        - "ecommerce.company.com"
  
  mobileApi:
    enabled: true
    replicaCount: 2
    image:
      repository: "mobile-api"
      tag: "v2.0.0"
    ingress:
      enabled: true
      hosts:
        - "mobile-api.ecommerce.company.com"
  
  adminPanel:
    enabled: true
    replicaCount: 1
    image:
      repository: "admin-panel"
      tag: "v2.0.0"
    ingress:
      enabled: true
      hosts:
        - "admin.ecommerce.company.com"
    security:
      whitelist:
        enabled: true
        ips:
          - "10.0.0.0/8"
          - "192.168.0.0/16"

# 数据库配置
postgresql:
  enabled: true
  auth:
    postgresPassword: "secure-password"
    database: "ecommerce"
  primary:
    persistence:
      enabled: true
      size: "500Gi"
      storageClass: "fast-ssd"
  readReplicas:
    replicaCount: 2
    persistence:
      enabled: true
      size: "500Gi"
      storageClass: "fast-ssd"
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true

redis:
  enabled: true
  auth:
    enabled: true
    password: "redis-password"
  master:
    persistence:
      enabled: true
      size: "100Gi"
      storageClass: "fast-ssd"
  replica:
    replicaCount: 2
    persistence:
      enabled: true
      size: "100Gi"
      storageClass: "fast-ssd"
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true

mongodb:
  enabled: true
  auth:
    enabled: true
    rootPassword: "mongo-password"
    database: "products"
  persistence:
    enabled: true
    size: "1Ti"
    storageClass: "fast-ssd"
  replicaSet:
    enabled: true
    replicas:
      secondary: 2
      arbiter: 1
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true

elasticsearch:
  enabled: true
  clusterName: "ecommerce-search"
  nodeGroup: "master"
  masterService: "ecommerce-search-master"
  roles:
    master: "true"
    ingest: "true"
    data: "true"
  replicas: 3
  minimumMasterNodes: 2
  persistence:
    enabled: true
    size: "500Gi"
    storageClass: "fast-ssd"
  resources:
    requests:
      cpu: "1000m"
      memory: "2Gi"
    limits:
      cpu: "2000m"
      memory: "4Gi"

# 消息队列配置
kafka:
  enabled: true
  replicaCount: 3
  auth:
    clientProtocol: "sasl"
    interBrokerProtocol: "sasl"
  persistence:
    enabled: true
    size: "200Gi"
    storageClass: "fast-ssd"
  zookeeper:
    enabled: true
    replicaCount: 3
    persistence:
      enabled: true
      size: "20Gi"
      storageClass: "fast-ssd"
  metrics:
    kafka:
      enabled: true
    jmx:
      enabled: true

rabbitmq:
  enabled: true
  auth:
    username: "admin"
    password: "rabbitmq-password"
  persistence:
    enabled: true
    size: "50Gi"
    storageClass: "fast-ssd"
  clustering:
    enabled: true
    replicaCount: 3
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true

# 监控配置
monitoring:
  prometheus:
    enabled: true
    server:
      persistentVolume:
        enabled: true
        size: "100Gi"
        storageClass: "fast-ssd"
      retention: "30d"
    alertmanager:
      enabled: true
      persistentVolume:
        enabled: true
        size: "10Gi"
        storageClass: "fast-ssd"
  
  grafana:
    enabled: true
    persistence:
      enabled: true
      size: "10Gi"
      storageClass: "fast-ssd"
    dashboards:
      default:
        ecommerce:
          gnetId: 12345
          revision: 1
          datasource: Prometheus
    
    sidecar:
      dashboards:
        enabled: true
        searchNamespace: ALL
      datasources:
        enabled: true
        searchNamespace: ALL

# 服务网格配置
serviceMesh:
  istio:
    enabled: true
    injection:
      enabled: true
      namespaces:
        - "ecommerce"
        - "api-gateway"
    
    security:
      mtls:
        enabled: true
        mode: "STRICT"
    
    observability:
      tracing:
        enabled: true
        jaeger:
          enabled: true
      metrics:
        enabled: true
        prometheus:
          enabled: true

# 备份配置
backup:
  enabled: true
  schedule: "0 2 * * *"
  retention: "30d"
  
  databases:
    postgresql:
      enabled: true
      method: "pg_dump"
    mongodb:
      enabled: true
      method: "mongodump"
  
  storage:
    type: "s3"
    bucket: "ecommerce-backups"
    region: "us-west-2"
    encryption: true

# 灾备配置
disasterRecovery:
  enabled: true
  
  replication:
    enabled: true
    regions:
      primary: "us-west-2"
      secondary: "us-east-1"
  
  rto: "4h"  # Recovery Time Objective
  rpo: "1h"  # Recovery Point Objective
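上面 values.yaml 中的明文密码(postgresPassword、redis-password 等)仅为示例,不应提交到代码仓库。一种常见做法是安装时在命令行注入随机生成或由外部系统(如 Vault)管理的密码,示意如下:

```shell
#!/usr/bin/env bash
# 示意:安装时注入密码,避免明文写入 values.yaml
set -euo pipefail

# 生成一个随机密码(也可改为从 Vault 等外部系统读取)
PG_PASSWORD=$(openssl rand -base64 24)

echo "generated password length: ${#PG_PASSWORD}"

# 实际部署时通过 --set 覆盖 values.yaml 中的占位值(命令为示例):
# helm upgrade --install ecommerce charts/ecommerce-platform \
#   --namespace ecommerce \
#   --set postgresql.auth.postgresPassword="$PG_PASSWORD" \
#   --set redis.auth.password="$PG_PASSWORD"
```

注意 `--set` 传入的值会出现在 release 的 Secret 中,但不会落入 Git 历史;对安全要求更高的场景可结合 helm-secrets 或 External Secrets 等方案。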

10.6.4 电商平台部署脚本

#!/bin/bash
# scripts/deploy-ecommerce.sh

set -euo pipefail

# 配置参数
ENVIRONMENT=${1:-production}
VERSION=${2:-latest}
NAMESPACE=${3:-ecommerce}
DRY_RUN=${4:-false}

# 颜色输出
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

log() {
    echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
}

error() {
    echo -e "${RED}[ERROR]${NC} $1" >&2
}

success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

warn() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

# 部署前检查
pre_deployment_checks() {
    log "Running pre-deployment checks..."
    
    # 检查 Helm
    if ! command -v helm &> /dev/null; then
        error "Helm is not installed"
        exit 1
    fi
    
    # 检查 kubectl
    if ! command -v kubectl &> /dev/null; then
        error "kubectl is not installed"
        exit 1
    fi
    
    # 检查集群连接
    if ! kubectl cluster-info &> /dev/null; then
        error "Cannot connect to Kubernetes cluster"
        exit 1
    fi
    
    # 检查命名空间
    if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
        log "Creating namespace: $NAMESPACE"
        kubectl create namespace "$NAMESPACE"
    fi
    
    # 检查存储类
    if ! kubectl get storageclass fast-ssd &> /dev/null; then
        error "Required storage class 'fast-ssd' not found"
        exit 1
    fi
    
    success "Pre-deployment checks passed"
}

# 部署基础设施
deploy_infrastructure() {
    log "Deploying infrastructure components..."
    
    # 部署 PostgreSQL
    log "Deploying PostgreSQL..."
    helm upgrade --install postgresql bitnami/postgresql \
        --namespace "$NAMESPACE" \
        --set auth.postgresPassword="$(kubectl get secret --namespace "$NAMESPACE" postgresql -o jsonpath="{.data.postgres-password}" 2>/dev/null | base64 --decode || echo 'secure-password')" \
        --set primary.persistence.size="500Gi" \
        --set primary.persistence.storageClass="fast-ssd" \
        --wait --timeout 10m
    
    # 部署 Redis
    log "Deploying Redis..."
    helm upgrade --install redis bitnami/redis \
        --namespace "$NAMESPACE" \
        --set auth.password="$(kubectl get secret --namespace "$NAMESPACE" redis -o jsonpath="{.data.redis-password}" 2>/dev/null | base64 --decode || echo 'redis-password')" \
        --set master.persistence.size="100Gi" \
        --set master.persistence.storageClass="fast-ssd" \
        --wait --timeout 10m
    
    # 部署 MongoDB
    log "Deploying MongoDB..."
    helm upgrade --install mongodb bitnami/mongodb \
        --namespace "$NAMESPACE" \
        --set auth.rootPassword="$(kubectl get secret --namespace "$NAMESPACE" mongodb -o jsonpath="{.data.mongodb-root-password}" 2>/dev/null | base64 --decode || echo 'mongo-password')" \
        --set persistence.size="1Ti" \
        --set persistence.storageClass="fast-ssd" \
        --wait --timeout 15m
    
    # 部署 Elasticsearch
    log "Deploying Elasticsearch..."
    helm upgrade --install elasticsearch elastic/elasticsearch \
        --namespace "$NAMESPACE" \
        --set persistence.enabled=true \
        --set persistence.size="500Gi" \
        --set persistence.storageClass="fast-ssd" \
        --set replicas=3 \
        --wait --timeout 15m
    
    # 部署 Kafka
    log "Deploying Kafka..."
    helm upgrade --install kafka bitnami/kafka \
        --namespace "$NAMESPACE" \
        --set persistence.size="200Gi" \
        --set persistence.storageClass="fast-ssd" \
        --set zookeeper.persistence.size="20Gi" \
        --set zookeeper.persistence.storageClass="fast-ssd" \
        --wait --timeout 15m
    
    success "Infrastructure deployment completed"
}

# 部署微服务
deploy_microservices() {
    log "Deploying microservices..."
    
    local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service")
    
    for service in "${services[@]}"; do
        log "Deploying $service..."
        
        helm upgrade --install "$service" "charts/$service" \
            --namespace "$NAMESPACE" \
            -f "environments/$ENVIRONMENT/$service-values.yaml" \
            --set image.tag="$VERSION" \
            --wait --timeout 10m
        
        # 验证部署
        kubectl rollout status deployment/"$service" -n "$NAMESPACE" --timeout=300s
    done
    
    success "Microservices deployment completed"
}

# 部署 API 网关
deploy_api_gateway() {
    log "Deploying API Gateway..."
    
    # 部署 Istio Gateway
    helm upgrade --install api-gateway charts/api-gateway \
        --namespace "$NAMESPACE" \
        -f "environments/$ENVIRONMENT/api-gateway-values.yaml" \
        --set image.tag="$VERSION" \
        --wait --timeout 10m
    
    success "API Gateway deployment completed"
}

# 部署前端应用
deploy_frontend() {
    log "Deploying frontend applications..."
    
    local frontends=("web-frontend" "mobile-api" "admin-panel")
    
    for frontend in "${frontends[@]}"; do
        log "Deploying $frontend..."
        
        helm upgrade --install "$frontend" "charts/$frontend" \
            --namespace "$NAMESPACE" \
            -f "environments/$ENVIRONMENT/$frontend-values.yaml" \
            --set image.tag="$VERSION" \
            --wait --timeout 10m
    done
    
    success "Frontend deployment completed"
}

# 部署监控
deploy_monitoring() {
    log "Deploying monitoring stack..."
    
    # 部署 Prometheus
    helm upgrade --install prometheus prometheus-community/prometheus \
        --namespace monitoring \
        --create-namespace \
        -f "environments/$ENVIRONMENT/monitoring/prometheus-values.yaml" \
        --wait --timeout 10m
    
    # 部署 Grafana
    helm upgrade --install grafana grafana/grafana \
        --namespace monitoring \
        -f "environments/$ENVIRONMENT/monitoring/grafana-values.yaml" \
        --wait --timeout 10m
    
    success "Monitoring deployment completed"
}

# 运行测试
run_tests() {
    log "Running deployment tests..."
    
    # 健康检查
    local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service")
    
    for service in "${services[@]}"; do
        log "Testing $service health..."
        
        if kubectl get pods -n "$NAMESPACE" -l app="$service" | grep -q Running; then
            success "$service is running"
        else
            error "$service is not running properly"
            kubectl describe pods -n "$NAMESPACE" -l app="$service"
            return 1
        fi
    done
    
    # API 测试
    log "Testing API endpoints..."
    
    local api_gateway_ip=$(kubectl get svc api-gateway -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
    if [ -n "$api_gateway_ip" ]; then
        if curl -f "http://$api_gateway_ip/health" &> /dev/null; then
            success "API Gateway health check passed"
        else
            warn "API Gateway health check failed"
        fi
    else
        warn "API Gateway IP not available yet"
    fi
    
    success "Tests completed"
}

# 生成部署报告
generate_report() {
    log "Generating deployment report..."
    
    local report_file="ecommerce-deployment-report-$(date +%Y%m%d-%H%M%S).json"
    
    cat > "$report_file" << EOF
{
  "deployment": {
    "timestamp": "$(date -Iseconds)",
    "environment": "$ENVIRONMENT",
    "version": "$VERSION",
    "namespace": "$NAMESPACE"
  },
  "services": [
EOF
    
    local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service" "api-gateway" "web-frontend" "mobile-api" "admin-panel")
    
    for i in "${!services[@]}"; do
        local service="${services[$i]}"
        local status=$(kubectl get deployment "$service" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null || echo 'Unknown')
        local replicas=$(kubectl get deployment "$service" -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0')
        
        cat >> "$report_file" << EOF
    {
      "name": "$service",
      "status": "$status",
      "readyReplicas": $replicas
    }$([ $i -lt $((${#services[@]}-1)) ] && echo ",")
EOF
    done
    
    cat >> "$report_file" << EOF
  ],
  "infrastructure": {
    "postgresql": "$(kubectl get statefulset postgresql -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
    "redis": "$(kubectl get statefulset redis-master -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
    "mongodb": "$(kubectl get statefulset mongodb -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
    "elasticsearch": "$(kubectl get statefulset elasticsearch-master -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready",
    "kafka": "$(kubectl get statefulset kafka -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo '0') ready"
  }
}
EOF
    
    success "Report generated: $report_file"
}

# 主函数
main() {
    log "Starting e-commerce platform deployment"
    log "Environment: $ENVIRONMENT"
    log "Version: $VERSION"
    log "Namespace: $NAMESPACE"
    log "Dry Run: $DRY_RUN"
    
    if [ "$DRY_RUN" = "true" ]; then
        log "Running in dry-run mode"
        return 0
    fi
    
    pre_deployment_checks
    deploy_infrastructure
    deploy_microservices
    deploy_api_gateway
    deploy_frontend
    deploy_monitoring
    run_tests
    generate_report
    
    success "E-commerce platform deployment completed successfully!"
    
    log "Access URLs:"
    log "  Web App: https://www.ecommerce.company.com"
    log "  Admin Panel: https://admin.ecommerce.company.com"
    log "  API Gateway: https://api.ecommerce.company.com"
    log "  Grafana: https://grafana.monitoring.company.com"
}

# 信号处理
trap 'error "Deployment interrupted"; exit 130' INT TERM

# 执行主函数
main "$@"
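generate_report 中用 `$([ $i -lt ... ] && echo ",")` 控制 JSON 的末尾逗号,这种写法可行但易出错。一个更稳妥的纯 bash 思路是先把条目收进数组,再统一用逗号连接,示意如下(status 用占位值代替 kubectl 查询结果):

```shell
#!/usr/bin/env bash
# 示意:先收集条目再统一加逗号,避免 heredoc 中的末尾逗号判断
set -euo pipefail

services=("user-service" "order-service" "payment-service")
entries=()

for s in "${services[@]}"; do
    # 实际场景中 status 来自 kubectl 查询,此处用占位值
    entries+=("    {\"name\": \"$s\", \"status\": \"True\"}")
done

# 每个条目后补逗号再换行,最后去掉收尾的多余逗号
joined=$(printf '%s,\n' "${entries[@]}")
joined=${joined%,*}

printf '{\n  "services": [\n%s\n  ]\n}\n' "$joined"
```

若环境中有 jq,也可以用 `jq -n` 直接构造 JSON,由工具保证转义和逗号的正确性。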

10.7 运维自动化

10.7.1 监控和告警配置

# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ecommerce-alerts
  namespace: monitoring
  labels:
    app: prometheus
spec:
  groups:
  - name: ecommerce.rules
    rules:
    # 应用级别告警
    - alert: ServiceDown
      expr: up{job=~".*-service"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.job }} is down"
        description: "Service {{ $labels.job }} has been down for more than 1 minute."
    
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}."
    
    - alert: HighLatency
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency on {{ $labels.job }}"
        description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}."
    
    # 基础设施告警
    - alert: DatabaseConnectionHigh
      expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High database connections"
        description: "Database connection usage is {{ $value | humanizePercentage }}."
    
    - alert: RedisMemoryHigh
      expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Redis memory usage high"
        description: "Redis memory usage is {{ $value | humanizePercentage }}."
    
    - alert: KafkaConsumerLag
      expr: kafka_consumer_lag_sum > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Kafka consumer lag high"
        description: "Kafka consumer lag is {{ $value }} messages."
    
    # 资源告警
    - alert: PodCPUHigh
      # 以 CPU limit 为基准计算使用率(limit 指标由 kube-state-metrics 提供)
      expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod CPU usage high"
        description: "Pod {{ $labels.pod }} CPU usage is {{ $value | humanizePercentage }}."
    
    - alert: PodMemoryHigh
      expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod memory usage high"
        description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}."
    
    - alert: PodRestartHigh
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Pod restarting frequently"
        description: "Pod {{ $labels.pod }} has restarted {{ $value }} times in the last hour."

---
# monitoring/alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.company.com:587'
      smtp_from: 'alerts@company.com'
      slack_api_url: 'https://hooks.slack.com/services/...'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'critical-alerts'
      - match:
          severity: warning
        receiver: 'warning-alerts'
    
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'critical-alerts'
      slack_configs:
      - channel: '#critical-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'danger'
      email_configs:
      - to: 'oncall@company.com'
        subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        body: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'warning-alerts'
      slack_configs:
      - channel: '#warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'

10.7.2 Automated Operations Scripts

#!/bin/bash
# scripts/ops-automation.sh

set -euo pipefail

# Configuration
NAMESPACE="ecommerce"
MONITORING_NAMESPACE="monitoring"
LOG_FILE="/var/log/ops-automation.log"
SLACK_WEBHOOK="https://hooks.slack.com/services/..."

# Append a timestamped message to stdout and the log file
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Send a Slack notification via the incoming webhook
notify_slack() {
    local message="$1"
    local color="${2:-good}"
    
    curl -X POST -H 'Content-type: application/json' \
        --data "{
            \"attachments\": [{
                \"color\": \"$color\",
                \"text\": \"$message\"
            }]
        }" \
        "$SLACK_WEBHOOK" || true
}

# Health check: verify every service has at least one Running pod
health_check() {
    log "Running health checks..."
    
    local failed_services=()
    local services=("user-service" "product-service" "order-service" "payment-service" "inventory-service" "notification-service")
    
    for service in "${services[@]}"; do
        if ! kubectl get pods -n "$NAMESPACE" -l app="$service" | grep -q Running; then
            failed_services+=("$service")
        fi
    done
    
    if [ ${#failed_services[@]} -gt 0 ]; then
        local message="Health check failed for services: ${failed_services[*]}"
        log "$message"
        notify_slack "$message" "danger"
        return 1
    else
        log "All services are healthy"
        return 0
    fi
}

# Clean up completed Jobs, failed Pods, and evicted Pods
cleanup_resources() {
    log "Cleaning up resources..."
    
    # Delete completed Jobs
    kubectl delete jobs --field-selector status.successful=1 -n "$NAMESPACE" --ignore-not-found=true
    
    # Delete failed Pods
    kubectl delete pods --field-selector status.phase=Failed -n "$NAMESPACE" --ignore-not-found=true
    
    # Delete evicted Pods ("Evicted" is a status reason, not a phase, so grep is used)
    kubectl get pods -n "$NAMESPACE" | grep Evicted | awk '{print $1}' | xargs -r kubectl delete pod -n "$NAMESPACE"
    
    log "Resource cleanup completed"
}

# Back up databases and upload the dumps to S3
backup_databases() {
    log "Starting database backup..."
    
    local backup_date=$(date +%Y%m%d-%H%M%S)
    
    # PostgreSQL dump, streamed through kubectl exec to a local file
    kubectl exec -n "$NAMESPACE" postgresql-0 -- pg_dumpall -U postgres > "/tmp/postgresql-backup-$backup_date.sql"
    
    # MongoDB dump as a stdout archive, so the file lands on the local host
    # (a plain "mongodump --out" would write inside the container instead)
    kubectl exec -n "$NAMESPACE" mongodb-0 -- mongodump --archive > "/tmp/mongodb-backup-$backup_date.archive"
    
    # Upload to S3
    aws s3 cp "/tmp/postgresql-backup-$backup_date.sql" "s3://ecommerce-backups/postgresql/"
    aws s3 cp "/tmp/mongodb-backup-$backup_date.archive" "s3://ecommerce-backups/mongodb/"
    
    # Remove the local files
    rm -f "/tmp/postgresql-backup-$backup_date.sql" "/tmp/mongodb-backup-$backup_date.archive"
    
    log "Database backup completed"
}

# Performance checks: flag pods with heavy resource usage and HPAs near their limit
performance_optimization() {
    log "Running performance optimization..."
    
    # kubectl top reports absolute usage (millicores / Mi), not percentages;
    # the thresholds below (800m CPU, 800Mi memory) are examples - tune per workload
    local high_cpu_pods high_memory_pods
    high_cpu_pods=$(kubectl top pods -n "$NAMESPACE" --no-headers | awk '$2+0 > 800 {print $1}')
    high_memory_pods=$(kubectl top pods -n "$NAMESPACE" --no-headers | awk '$3+0 > 800 {print $1}')
    
    if [ -n "$high_cpu_pods" ]; then
        log "High CPU usage detected in pods: $high_cpu_pods"
        notify_slack "High CPU usage detected in pods: $high_cpu_pods" "warning"
    fi
    
    if [ -n "$high_memory_pods" ]; then
        log "High memory usage detected in pods: $high_memory_pods"
        notify_slack "High memory usage detected in pods: $high_memory_pods" "warning"
    fi
    
    # Flag HPAs whose current replica count ($6, REPLICAS) exceeds 80% of $5 (MAXPODS)
    local hpa_status
    hpa_status=$(kubectl get hpa -n "$NAMESPACE" --no-headers | awk '$6 > $5*0.8 {print $1}')
    
    if [ -n "$hpa_status" ]; then
        log "HPA nearing max replicas: $hpa_status"
        notify_slack "HPA nearing max replicas: $hpa_status" "warning"
    fi
    
    log "Performance optimization completed"
}

# Security scan: expiring certificates and pods allowed to run as root
security_scan() {
    log "Running security scan..."
    
    # cert-manager Certificates expiring within 7 days (604800 seconds)
    local expiring_certs
    expiring_certs=$(kubectl get certificates -n "$NAMESPACE" -o json | jq -r '.items[] | select((.status.notAfter | fromdateiso8601) < (now + 604800)) | .metadata.name')
    
    if [ -n "$expiring_certs" ]; then
        log "TLS certificates expiring soon: $expiring_certs"
        notify_slack "TLS certificates expiring soon: $expiring_certs" "warning"
    fi
    
    # Pods that do not enforce runAsNonRoot (note: there is no "runAsRoot" field)
    local policy_violations
    policy_violations=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r '.items[] | select(.spec.securityContext.runAsNonRoot != true) | .metadata.name')
    
    if [ -n "$policy_violations" ]; then
        log "Security policy violations detected: $policy_violations"
        notify_slack "Security policy violations detected: $policy_violations" "danger"
    fi
    
    log "Security scan completed"
}

# Rotate local log files
log_rotation() {
    log "Performing log rotation..."
    
    # Compress logs older than 7 days
    find /var/log -name "*.log" -mtime +7 -exec gzip {} \;
    
    # Delete compressed logs older than 30 days
    find /var/log -name "*.log.gz" -mtime +30 -delete
    
    log "Log rotation completed"
}

# Entry point: dispatch on the requested operation
main() {
    local operation="${1:-all}"
    
    case $operation in
        health)
            health_check
            ;;
        cleanup)
            cleanup_resources
            ;;
        backup)
            backup_databases
            ;;
        performance)
            performance_optimization
            ;;
        security)
            security_scan
            ;;
        logs)
            log_rotation
            ;;
        all)
            health_check
            cleanup_resources
            performance_optimization
            security_scan
            log_rotation
            ;;
        *)
            echo "Usage: $0 {health|cleanup|backup|performance|security|logs|all}"
            exit 1
            ;;
    esac
}

# Run
main "$@"
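The script above is meant to run unattended; `backup` is deliberately kept out of the `all` target so it can be scheduled on its own. One possible schedule via a cron drop-in (the `ops` user, paths, and times are illustrative, not part of the chapter's setup):

```
# /etc/cron.d/ops-automation (illustrative schedule)
*/15 * * * *  ops  /opt/scripts/ops-automation.sh health
0 2 * * *     ops  /opt/scripts/ops-automation.sh backup
0 3 * * *     ops  /opt/scripts/ops-automation.sh cleanup
0 4 * * 0     ops  /opt/scripts/ops-automation.sh logs
```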

10.8 Troubleshooting

10.8.1 Diagnosing Common Issues

#!/bin/bash
# scripts/troubleshoot.sh

set -euo pipefail

NAMESPACE="ecommerce"

# Diagnose Pod issues
diagnose_pods() {
    echo "=== Pod Diagnostics ==="
    
    # Failed Pods
    echo "Failed Pods:"
    kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Failed
    
    # Pods with the highest restart counts
    echo -e "\nPods with high restart count:"
    kubectl get pods -n "$NAMESPACE" --sort-by='.status.containerStatuses[0].restartCount' | tail -10
    
    # Resource usage
    echo -e "\nTop resource consuming pods:"
    kubectl top pods -n "$NAMESPACE" --sort-by=cpu
    kubectl top pods -n "$NAMESPACE" --sort-by=memory
}

# Diagnose Service issues
diagnose_services() {
    echo -e "\n=== Service Diagnostics ==="
    
    # Service endpoints
    echo "Service endpoints:"
    kubectl get endpoints -n "$NAMESPACE"
    
    # In-cluster connectivity, assuming each service exposes a /health endpoint
    echo -e "\nService connectivity test:"
    local services=("user-service" "product-service" "order-service")
    
    for service in "${services[@]}"; do
        echo "Testing $service..."
        kubectl run test-pod --image=curlimages/curl --rm -i --restart=Never -- \
            curl -m 5 "http://$service.$NAMESPACE.svc.cluster.local/health" || echo "$service unreachable"
    done
}

# Diagnose networking issues
diagnose_network() {
    echo -e "\n=== Network Diagnostics ==="
    
    # Network policies
    echo "Network policies:"
    kubectl get networkpolicies -n "$NAMESPACE"
    
    # In-cluster DNS resolution
    echo -e "\nDNS resolution test:"
    kubectl run dns-test --image=busybox --rm -i --restart=Never -- \
        nslookup kubernetes.default.svc.cluster.local
    
    # Ingress
    echo -e "\nIngress status:"
    kubectl get ingress -n "$NAMESPACE"
    kubectl describe ingress -n "$NAMESPACE"
}

# Diagnose storage issues
diagnose_storage() {
    echo -e "\n=== Storage Diagnostics ==="
    
    # PVC status
    echo "PVC status:"
    kubectl get pvc -n "$NAMESPACE"
    
    # Storage classes
    echo -e "\nStorage classes:"
    kubectl get storageclass
    
    # Volume usage inside the database pods
    echo -e "\nVolume usage:"
    kubectl exec -n "$NAMESPACE" postgresql-0 -- df -h /bitnami/postgresql || echo "PostgreSQL volume check failed"
    kubectl exec -n "$NAMESPACE" redis-master-0 -- df -h /data || echo "Redis volume check failed"
}

# Diagnose database issues
diagnose_databases() {
    echo -e "\n=== Database Diagnostics ==="
    
    # PostgreSQL
    echo "PostgreSQL status:"
    kubectl exec -n "$NAMESPACE" postgresql-0 -- psql -U postgres -c "SELECT version();" || echo "PostgreSQL connection failed"
    kubectl exec -n "$NAMESPACE" postgresql-0 -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;" || echo "PostgreSQL activity check failed"
    
    # Redis
    echo -e "\nRedis status:"
    kubectl exec -n "$NAMESPACE" redis-master-0 -- redis-cli ping || echo "Redis connection failed"
    kubectl exec -n "$NAMESPACE" redis-master-0 -- redis-cli info memory || echo "Redis memory check failed"
    
    # MongoDB (on MongoDB 6+ images, use mongosh instead of mongo)
    echo -e "\nMongoDB status:"
    kubectl exec -n "$NAMESPACE" mongodb-0 -- mongo --eval "db.adminCommand('ismaster')" || echo "MongoDB connection failed"
}

# Generate a consolidated diagnostics report
generate_report() {
    local report_file="troubleshoot-report-$(date +%Y%m%d-%H%M%S).txt"
    
    {
        echo "=== Troubleshooting Report ==="
        echo "Generated at: $(date)"
        echo "Namespace: $NAMESPACE"
        echo ""
        
        diagnose_pods
        diagnose_services
        diagnose_network
        diagnose_storage
        diagnose_databases
        
        echo -e "\n=== Recent Events ==="
        kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -20
        
        echo -e "\n=== Cluster Info ==="
        kubectl cluster-info
        kubectl get nodes
        kubectl top nodes
        
    } > "$report_file"
    
    echo -e "\nDiagnostics report written to: $report_file"
}

# Entry point: dispatch on the requested component
main() {
    local component="${1:-all}"
    
    case $component in
        pods)
            diagnose_pods
            ;;
        services)
            diagnose_services
            ;;
        network)
            diagnose_network
            ;;
        storage)
            diagnose_storage
            ;;
        databases)
            diagnose_databases
            ;;
        report)
            generate_report
            ;;
        all)
            diagnose_pods
            diagnose_services
            diagnose_network
            diagnose_storage
            diagnose_databases
            generate_report
            ;;
        *)
            echo "Usage: $0 {pods|services|network|storage|databases|report|all}"
            exit 1
            ;;
    esac
}

main "$@"

10.9 Hands-on Exercises

Exercise 1: Enterprise Chart Design

Goal: design a complete Chart for an enterprise-grade microservices application.

Requirements:

  1. At least 5 microservices
  2. Multi-environment configuration support
  3. Complete monitoring and logging configuration
  4. Autoscaling
  5. Security configuration and network policies

Steps

# 1. Create the Chart skeleton
helm create enterprise-microservices

# 2. Design the dependencies
# Edit Chart.yaml and add database, cache, message-queue, and similar dependencies

# 3. Configure per-environment values
# Create an environments/ directory with dev, staging, and production configs

# 4. Implement the templates
# Create Deployment, Service, and Ingress templates for each microservice

# 5. Add monitoring resources
# Create ServiceMonitor, PrometheusRule, and other monitoring resources

# 6. Test the deployment
helm install enterprise-app ./enterprise-microservices -f environments/dev/values.yaml
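For step 3, a minimal environments/dev/values.yaml might look like the sketch below; the keys and values are illustrative assumptions, not prescribed by the exercise:

```yaml
# environments/dev/values.yaml (illustrative)
global:
  environment: dev
  imageRegistry: registry.company.com

replicaCount: 1          # keep dev small

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false         # enable HPA only in staging/production

ingress:
  enabled: true
  host: dev.ecommerce.company.com
```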

Exercise 2: CI/CD Pipeline Integration

Goal: build a complete Helm CI/CD pipeline.

Requirements:

  1. Automated Chart validation and testing
  2. Automated multi-environment deployment
  3. Rollback mechanism
  4. Notifications and reporting
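For the multi-environment requirement, a common building block is a helper that maps the branch being built to the values file the deploy job should use. This is a hedged sketch; the branch naming scheme and paths are illustrative assumptions, not prescribed by the exercise:

```shell
#!/bin/bash
# Hypothetical branch-to-environment mapping for a Helm deploy job.

env_for_branch() {
    case "$1" in
        main|master)  echo "production" ;;
        release/*)    echo "staging" ;;
        *)            echo "dev" ;;  # feature branches and everything else go to dev
    esac
}

values_file_for_branch() {
    echo "environments/$(env_for_branch "$1")/values.yaml"
}

# In CI this would receive $CI_COMMIT_BRANCH (GitLab) or $GITHUB_REF_NAME (GitHub):
env_for_branch "main"                  # prints: production
values_file_for_branch "release/1.4"   # prints: environments/staging/values.yaml
```

The deploy job could then run something like `helm upgrade --install app ./chart -f "$(values_file_for_branch "$BRANCH")"`, falling back to `helm rollback` when the release fails its checks.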

Exercise 3: Large-scale Deployment Management

Goal: implement batch application deployment and management.

Requirements:

  1. Parallel deployment support
  2. Dependency management
  3. Deployment status monitoring
  4. Failure handling and retries
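For the dependency-management requirement, one lightweight approach is to express chart dependencies as pairs and let `tsort` (GNU coreutils) compute a safe install order. The dependency pairs below are hypothetical examples, not taken from the chapter's charts:

```shell
#!/bin/bash
# Sketch: derive an install order from chart dependency pairs.

# One pair per line: "<dependency> <dependent>" - the dependency must be
# installed before the dependent.
DEPS="postgresql user-service
postgresql order-service
redis user-service
kafka notification-service
user-service order-service"

# tsort topologically sorts the pairs; the output is a sequential install
# order that honors every dependency edge.
install_order() {
    printf '%s\n' "$DEPS" | tsort
}

install_order
```

Charts with no remaining unmet dependencies could then be installed in parallel batches, for example by piping the order into `xargs -P 4 -n 1 ./deploy-chart.sh` with a per-chart retry wrapper.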

10.10 Chapter Summary

Through real enterprise cases, this chapter explored how Helm is applied in large organizations. We covered:

Key Topics Reviewed

  1. Enterprise architecture design

    • Chart organization best practices
    • Enterprise template design principles
    • Managing complex dependencies
  2. Multi-environment management

    • Separating environment configuration
    • Configuration inheritance and overrides
    • Environment-specific deployment strategies
  3. CI/CD integration

    • GitLab CI/CD pipeline design
    • GitHub Actions workflows
    • Automated testing and deployment
  4. Large-scale deployment management

    • Cluster resource management
    • Batch deployment scripts
    • Parallel deployment strategies
  5. Operations automation

    • Monitoring and alerting configuration
    • Automated operations scripts
    • Performance tuning and security scanning
  6. Troubleshooting

    • Diagnosing common issues
    • Automated fault detection
    • Generating diagnostic reports

Best Practices Summary

  1. Design principles

    • Modularity and reusability
    • Externalized configuration
    • Security first
    • Observability
  2. Operations practices

    • Infrastructure as code
    • Continuous integration and deployment
    • Monitoring-driven operations
    • Automation first
  3. Team collaboration

    • Clear division of responsibilities
    • Standardized processes
    • Documentation and knowledge sharing
    • Continuous improvement

Enterprise Application Essentials

  1. Scalability: design architectures that support large-scale deployment
  2. Reliability: implement high availability and fault tolerance
  3. Security: integrate comprehensive security controls
  4. Maintainability: build a mature operations practice
  5. Compliance: meet enterprise governance requirements

With this chapter, you have the key skills and best practices for running Helm successfully in an enterprise environment. This knowledge will help you design and manage complex Kubernetes application deployments in real projects.


Congratulations on completing this Helm tutorial!

From basic concepts to enterprise-grade practice, you now have a comprehensive command of Helm. You can now:

  • Create and manage complex Helm Charts
  • Implement enterprise-grade deployment strategies
  • Integrate CI/CD pipelines
  • Manage applications at scale
  • Automate operations

Keep practicing and exploring, apply this knowledge to your own projects, and become a Kubernetes and Helm expert!