概述

本章将全面介绍 gRPC 应用的部署策略、运维实践和生产环境最佳实践,包括容器化部署、Kubernetes集成、CI/CD流水线、监控告警、故障恢复等。我们将学习如何构建可靠、可扩展的生产级 gRPC 服务。

学习目标

  • 掌握 gRPC 应用的容器化部署
  • 学习 Kubernetes 环境下的服务部署
  • 了解 CI/CD 流水线的构建
  • 掌握生产环境的运维实践
  • 学习故障恢复和灾备策略

容器化部署

from enum import Enum
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Union
from abc import ABC, abstractmethod
import json
import yaml
from datetime import datetime, timedelta

class DeploymentStrategy(Enum):
    """Rollout strategies for releasing a new version of the service."""
    ROLLING_UPDATE = "rolling_update"  # replace pods incrementally
    BLUE_GREEN = "blue_green"  # run old/new side by side, then switch traffic
    CANARY = "canary"  # route a small traffic share to the new version first
    RECREATE = "recreate"  # stop everything, then start the new version

class EnvironmentType(Enum):
    """Deployment environment tiers."""
    DEVELOPMENT = "development"
    TESTING = "testing"
    STAGING = "staging"
    PRODUCTION = "production"

class ServiceType(Enum):
    """Kubernetes Service types (values match the K8s ``spec.type`` field)."""
    CLUSTER_IP = "ClusterIP"
    NODE_PORT = "NodePort"
    LOAD_BALANCER = "LoadBalancer"
    EXTERNAL_NAME = "ExternalName"

class ScalingPolicy(Enum):
    """Signals that can drive horizontal autoscaling decisions."""
    CPU_UTILIZATION = "cpu_utilization"
    MEMORY_UTILIZATION = "memory_utilization"
    REQUEST_RATE = "request_rate"
    CUSTOM_METRIC = "custom_metric"

@dataclass
class ContainerConfig:
    """Container runtime configuration for a single service.

    Attributes:
        image: Image repository name.
        tag: Image tag.
        ports: Container ports to expose.
        env_vars: Environment variables (name -> value).
        resources: Resource settings (free-form string mapping).
        health_check: Health-check definition (free-form mapping).
        volumes: Optional volume mount definitions; ``None`` means no
            extra volumes.
    """
    image: str
    tag: str
    ports: List[int]
    env_vars: Dict[str, str]
    resources: Dict[str, str]
    health_check: Dict[str, Any]
    # Fixed: the field defaults to None, so the annotation must be Optional
    # (the original plain `List[...] = None` annotation was inconsistent).
    volumes: Optional[List[Dict[str, str]]] = None
    
@dataclass
class DeploymentConfig:
    """Top-level deployment description for one service.

    Attributes:
        name: Deployment name.
        namespace: Target namespace.
        replicas: Desired replica count.
        strategy: Rollout strategy to apply.
        container: Container configuration.
        service_config: Service definition (free-form mapping).
        ingress_config: Optional ingress definition; ``None`` means no
            ingress is configured.
    """
    name: str
    namespace: str
    replicas: int
    strategy: DeploymentStrategy
    container: ContainerConfig
    service_config: Dict[str, Any]
    # Fixed: the field defaults to None, so the annotation must be Optional
    # (the original plain `Dict[...] = None` annotation was inconsistent).
    ingress_config: Optional[Dict[str, Any]] = None
    
@dataclass
class MonitoringConfig:
    """Observability settings for a deployment."""
    metrics_enabled: bool  # toggle metrics collection
    logging_enabled: bool  # toggle log collection
    tracing_enabled: bool  # toggle distributed tracing
    alerts: List[Dict[str, Any]]  # alert rule definitions (free-form mappings)
    dashboards: List[str]  # dashboard identifiers
    
class DeploymentManager:
    """Renders deployment artifacts for a Go gRPC service as plain text.

    Each ``create_*`` method returns hard-coded template text (Dockerfile,
    docker-compose, Kubernetes manifests, Helm chart files, CI/CD pipelines).
    Nothing is written to disk and no external tools are invoked.

    NOTE(review): backslash-newline sequences inside the triple-quoted
    templates are Python line continuations, so the rendered output joins
    those lines into one (still valid Dockerfile/shell syntax, but not the
    literal multi-line layout the templates suggest). Similarly, the GitLab
    coverage regex contains ``\\d`` escapes in a non-raw string; they survive
    only because ``\\d`` is an invalid escape. Consider raw strings for
    these templates — kept as-is here to preserve byte-identical output.
    """

    def __init__(self):
        # Placeholder registries; no method in this class populates them yet.
        self.deployments = {}
        self.environments = {}

    def create_dockerfile(self) -> str:
        """Return a multi-stage Dockerfile (Go builder stage + minimal
        Alpine runtime stage, non-root user, health check) as a string."""
        return """
# Dockerfile for gRPC Service
# 多阶段构建,优化镜像大小
FROM golang:1.21-alpine AS builder

# 设置工作目录
WORKDIR /app

# 安装必要的包
RUN apk add --no-cache git ca-certificates tzdata

# 复制go mod文件
COPY go.mod go.sum ./

# 下载依赖
RUN go mod download

# 复制源代码
COPY . .

# 构建应用
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main ./cmd/server

# 运行阶段
FROM alpine:latest

# 安装ca证书和时区数据
RUN apk --no-cache add ca-certificates tzdata

# 创建非root用户
RUN addgroup -g 1001 appgroup && \
    adduser -D -s /bin/sh -u 1001 -G appgroup appuser

# 设置工作目录
WORKDIR /root/

# 从构建阶段复制二进制文件
COPY --from=builder /app/main .

# 复制配置文件
COPY --from=builder /app/configs ./configs

# 复制证书文件(如果有)
COPY --from=builder /app/certs ./certs

# 设置文件权限
RUN chown -R appuser:appgroup /root

# 切换到非root用户
USER appuser

# 暴露端口
EXPOSE 8080 8081 9090

# 健康检查
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD ["./main", "--health-check"]

# 启动应用
CMD ["./main"]
"""

    def create_docker_compose(self) -> str:
        """Return a docker-compose.yml (gRPC server plus postgres, redis,
        jaeger, prometheus, grafana, nginx sidecars) as a string."""
        return """
# docker-compose.yml
version: '3.8'

services:
  grpc-server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"  # gRPC端口
      - "8081:8081"  # HTTP网关端口
      - "9090:9090"  # 指标端口
    environment:
      - ENV=production
      - LOG_LEVEL=info
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_NAME=grpc_db
      - DB_USER=grpc_user
      - DB_PASSWORD=grpc_pass
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - JAEGER_ENDPOINT=http://jaeger:14268/api/traces
    volumes:
      - ./configs:/app/configs:ro
      - ./logs:/app/logs
    depends_on:
      - postgres
      - redis
      - jaeger
    networks:
      - grpc-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "./main", "--health-check"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=grpc_db
      - POSTGRES_USER=grpc_user
      - POSTGRES_PASSWORD=grpc_pass
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    ports:
      - "5432:5432"
    networks:
      - grpc-network
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
    networks:
      - grpc-network
    restart: unless-stopped

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "14268:14268"  # HTTP collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - grpc-network
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    networks:
      - grpc-network
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    networks:
      - grpc-network
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - grpc-server
    networks:
      - grpc-network
    restart: unless-stopped

volumes:
  postgres_data:
  redis_data:
  prometheus_data:
  grafana_data:

networks:
  grpc-network:
    driver: bridge
"""

    def create_kubernetes_manifests(self) -> Dict[str, str]:
        """Return Kubernetes manifests as a mapping of filename -> YAML text.

        Keys produced: ``namespace.yaml``, ``configmap.yaml``,
        ``secret.yaml``, ``deployment.yaml``, ``service.yaml``,
        ``hpa.yaml``, ``ingress.yaml``, ``rbac.yaml``.
        """
        manifests = {}

        # Namespace for all gRPC resources
        manifests['namespace.yaml'] = """
apiVersion: v1
kind: Namespace
metadata:
  name: grpc-system
  labels:
    name: grpc-system
    environment: production
"""

        # ConfigMap carrying the application config file
        manifests['configmap.yaml'] = """
apiVersion: v1
kind: ConfigMap
metadata:
  name: grpc-config
  namespace: grpc-system
data:
  app.yaml: |
    server:
      port: 8080
      gateway_port: 8081
      metrics_port: 9090
      read_timeout: 30s
      write_timeout: 30s
      idle_timeout: 120s
    
    database:
      host: postgres-service
      port: 5432
      name: grpc_db
      max_connections: 100
      max_idle_connections: 10
      connection_max_lifetime: 3600s
    
    redis:
      host: redis-service
      port: 6379
      db: 0
      pool_size: 100
    
    logging:
      level: info
      format: json
      output: stdout
    
    tracing:
      enabled: true
      jaeger_endpoint: http://jaeger-service:14268/api/traces
      sample_rate: 0.1
    
    metrics:
      enabled: true
      path: /metrics
      namespace: grpc_service
"""

        # Secret (placeholder base64 values; replace before real use)
        manifests['secret.yaml'] = """
apiVersion: v1
kind: Secret
metadata:
  name: grpc-secret
  namespace: grpc-system
type: Opaque
data:
  db-password: Z3JwY19wYXNz  # base64 encoded
  jwt-secret: bXlfc2VjcmV0X2tleQ==  # base64 encoded
  tls-cert: LS0tLS1CRUdJTi0uLi4=  # base64 encoded TLS cert
  tls-key: LS0tLS1CRUdJTi0uLi4=   # base64 encoded TLS key
"""

        # Deployment with probes, security context, and anti-affinity
        manifests['deployment.yaml'] = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-server
  namespace: grpc-system
  labels:
    app: grpc-server
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: grpc-server
  template:
    metadata:
      labels:
        app: grpc-server
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: grpc-service-account
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
      - name: grpc-server
        image: grpc-server:v1.0.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: grpc
          containerPort: 8080
          protocol: TCP
        - name: http
          containerPort: 8081
          protocol: TCP
        - name: metrics
          containerPort: 9090
          protocol: TCP
        env:
        - name: ENV
          value: "production"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grpc-secret
              key: db-password
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: grpc-secret
              key: jwt-secret
        volumeMounts:
        - name: config
          mountPath: /app/configs
          readOnly: true
        - name: tls-certs
          mountPath: /app/certs
          readOnly: true
        - name: logs
          mountPath: /app/logs
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          exec:
            command:
            - /app/main
            - --health-check
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          exec:
            command:
            - /app/main
            - --ready-check
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          exec:
            command:
            - /app/main
            - --startup-check
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30
      volumes:
      - name: config
        configMap:
          name: grpc-config
      - name: tls-certs
        secret:
          secretName: grpc-secret
          items:
          - key: tls-cert
            path: tls.crt
          - key: tls-key
            path: tls.key
      - name: logs
        emptyDir: {}
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 300
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 300
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - grpc-server
              topologyKey: kubernetes.io/hostname
"""

        # Services: external LoadBalancer plus internal metrics ClusterIP
        manifests['service.yaml'] = """
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
  namespace: grpc-system
  labels:
    app: grpc-server
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ports:
  - name: grpc
    port: 8080
    targetPort: 8080
    protocol: TCP
  - name: http
    port: 8081
    targetPort: 8081
    protocol: TCP
  selector:
    app: grpc-server
---
apiVersion: v1
kind: Service
metadata:
  name: grpc-metrics
  namespace: grpc-system
  labels:
    app: grpc-server
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9090
    targetPort: 9090
    protocol: TCP
  selector:
    app: grpc-server
"""

        # HorizontalPodAutoscaler on CPU, memory, and a custom pods metric
        manifests['hpa.yaml'] = """
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: grpc-hpa
  namespace: grpc-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grpc-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: grpc_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Max
"""

        # Ingress: one gRPC host and one HTTP-gateway host
        manifests['ingress.yaml'] = """
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-ingress
  namespace: grpc-system
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/grpc-backend: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - grpc.example.com
    secretName: grpc-tls
  rules:
  - host: grpc.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grpc-service
            port:
              number: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-http-ingress
  namespace: grpc-system
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: grpc-http-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grpc-service
            port:
              number: 8081
"""

        # ServiceAccount plus read-only ClusterRole/Binding
        manifests['rbac.yaml'] = """
apiVersion: v1
kind: ServiceAccount
metadata:
  name: grpc-service-account
  namespace: grpc-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: grpc-cluster-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grpc-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grpc-cluster-role
subjects:
- kind: ServiceAccount
  name: grpc-service-account
  namespace: grpc-system
"""

        return manifests

    def create_helm_chart(self) -> Dict[str, str]:
        """Return Helm chart files as a mapping of relative path -> text.

        Keys produced: ``Chart.yaml``, ``values.yaml``,
        ``templates/deployment.yaml``.
        """
        chart_files = {}

        # Chart.yaml
        chart_files['Chart.yaml'] = """
apiVersion: v2
name: grpc-service
description: A Helm chart for gRPC service
type: application
version: 0.1.0
appVersion: "1.0.0"
keywords:
  - grpc
  - microservice
  - api
home: https://github.com/example/grpc-service
sources:
  - https://github.com/example/grpc-service
maintainers:
  - name: DevOps Team
    email: devops@example.com
"""

        # values.yaml
        chart_files['values.yaml'] = """
# Default values for grpc-service
replicaCount: 3

image:
  repository: grpc-service
  pullPolicy: IfNotPresent
  tag: "v1.0.0"

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9090"
  prometheus.io/path: "/metrics"

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1001
  fsGroup: 1001

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true

service:
  type: LoadBalancer
  grpcPort: 8080
  httpPort: 8081
  metricsPort: 9090

ingress:
  enabled: true
  className: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/grpc-backend: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: grpc.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: grpc-tls
      hosts:
        - grpc.example.com

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

nodeSelector: {}

tolerations: []

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - grpc-service
        topologyKey: kubernetes.io/hostname

config:
  server:
    port: 8080
    gateway_port: 8081
    metrics_port: 9090
  database:
    host: postgres-service
    port: 5432
    name: grpc_db
  redis:
    host: redis-service
    port: 6379
  logging:
    level: info
    format: json
  tracing:
    enabled: true
    sample_rate: 0.1

secrets:
  dbPassword: "grpc_pass"
  jwtSecret: "my_secret_key"

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    path: /metrics
"""

        # templates/deployment.yaml
        chart_files['templates/deployment.yaml'] = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "grpc-service.fullname" . }}
  labels:
    {{- include "grpc-service.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "grpc-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "grpc-service.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "grpc-service.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: grpc
              containerPort: {{ .Values.config.server.port }}
              protocol: TCP
            - name: http
              containerPort: {{ .Values.config.server.gateway_port }}
              protocol: TCP
            - name: metrics
              containerPort: {{ .Values.config.server.metrics_port }}
              protocol: TCP
          env:
            - name: ENV
              value: "production"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: config
              mountPath: /app/configs
              readOnly: true
          livenessProbe:
            exec:
              command:
              - /app/main
              - --health-check
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
              - /app/main
              - --ready-check
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      volumes:
        - name: config
          configMap:
            name: {{ include "grpc-service.fullname" . }}-config
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
"""

        return chart_files

    def create_cicd_pipeline(self) -> Dict[str, str]:
        """Return CI/CD pipeline definitions as a mapping of filename -> text.

        Keys produced: ``github-actions.yml``, ``gitlab-ci.yml``.
        """
        pipelines = {}

        # GitHub Actions workflow: test -> build/push -> deploy + security scan
        pipelines['github-actions.yml'] = """
name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  release:
    types: [ published ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: '1.21'
    
    - name: Cache Go modules
      uses: actions/cache@v3
      with:
        path: ~/go/pkg/mod
        key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
        restore-keys: |
          ${{ runner.os }}-go-
    
    - name: Download dependencies
      run: go mod download
    
    - name: Run tests
      run: |
        go test -v -race -coverprofile=coverage.out ./...
        go tool cover -html=coverage.out -o coverage.html
    
    - name: Upload coverage reports
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage.out
    
    - name: Run linting
      uses: golangci/golangci-lint-action@v3
      with:
        version: latest
    
    - name: Run security scan
      uses: securecodewarrior/github-action-add-sarif@v1
      with:
        sarif-file: 'gosec-report.sarif'

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3
    
    - name: Log in to Container Registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    
    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=semver,pattern={{version}}
          type=semver,pattern={{major}}.{{minor}}
          type=sha,prefix={{branch}}-
    
    - name: Build and push Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        platforms: linux/amd64,linux/arm64
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: staging
    steps:
    - uses: actions/checkout@v4
    
    - name: Configure kubectl
      uses: azure/k8s-set-context@v3
      with:
        method: kubeconfig
        kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}
    
    - name: Deploy to staging
      run: |
        helm upgrade --install grpc-service ./helm/grpc-service \
          --namespace grpc-staging \
          --create-namespace \
          --set image.tag=${{ github.sha }} \
          --set environment=staging \
          --values ./helm/values-staging.yaml
    
    - name: Run integration tests
      run: |
        kubectl wait --for=condition=ready pod -l app=grpc-service -n grpc-staging --timeout=300s
        go test -v ./tests/integration/...

  deploy-production:
    needs: build
    runs-on: ubuntu-latest
    if: github.event_name == 'release'
    environment: production
    steps:
    - uses: actions/checkout@v4
    
    - name: Configure kubectl
      uses: azure/k8s-set-context@v3
      with:
        method: kubeconfig
        kubeconfig: ${{ secrets.KUBE_CONFIG_PROD }}
    
    - name: Deploy to production
      run: |
        helm upgrade --install grpc-service ./helm/grpc-service \
          --namespace grpc-production \
          --create-namespace \
          --set image.tag=${{ github.event.release.tag_name }} \
          --set environment=production \
          --values ./helm/values-production.yaml
    
    - name: Verify deployment
      run: |
        kubectl wait --for=condition=ready pod -l app=grpc-service -n grpc-production --timeout=300s
        kubectl get pods -n grpc-production
        kubectl get services -n grpc-production

  security-scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Run Trivy vulnerability scanner
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
        format: 'sarif'
        output: 'trivy-results.sarif'
    
    - name: Upload Trivy scan results
      uses: github/codeql-action/upload-sarif@v2
      with:
        sarif_file: 'trivy-results.sarif'
"""

        # GitLab CI pipeline: test -> build -> staged/manual deploys
        pipelines['gitlab-ci.yml'] = """
stages:
  - test
  - build
  - deploy-staging
  - deploy-production

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"
  REGISTRY: $CI_REGISTRY
  IMAGE_NAME: $CI_PROJECT_PATH
  KUBECONFIG: /tmp/kubeconfig

before_script:
  - docker info

test:
  stage: test
  image: golang:1.21
  services:
    - postgres:15
    - redis:7
  variables:
    POSTGRES_DB: test_db
    POSTGRES_USER: test_user
    POSTGRES_PASSWORD: test_pass
    DATABASE_URL: postgres://test_user:test_pass@postgres:5432/test_db
    REDIS_URL: redis://redis:6379
  before_script:
    - go mod download
  script:
    - go test -v -race -coverprofile=coverage.out ./...
    - go tool cover -func=coverage.out
  coverage: '/total:.*?(\d+\.\d+)%/'
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

lint:
  stage: test
  image: golangci/golangci-lint:latest
  script:
    - golangci-lint run -v

security-scan:
  stage: test
  image: securecodewarrior/docker-action
  script:
    - gosec -fmt sarif -out gosec-report.sarif ./...
  artifacts:
    reports:
      sast: gosec-report.sarif

build:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  before_script:
    - echo $CI_REGISTRY_PASSWORD | docker login -u $CI_REGISTRY_USER --password-stdin $CI_REGISTRY
  script:
    - docker build -t $REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA .
    - docker push $REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA
    - |
      if [ "$CI_COMMIT_REF_NAME" = "main" ]; then
        docker tag $REGISTRY/$IMAGE_NAME:$CI_COMMIT_SHA $REGISTRY/$IMAGE_NAME:latest
        docker push $REGISTRY/$IMAGE_NAME:latest
      fi
  only:
    - main
    - develop
    - tags

deploy-staging:
  stage: deploy-staging
  image: bitnami/kubectl:latest
  before_script:
    - echo $KUBE_CONFIG_STAGING | base64 -d > $KUBECONFIG
    - kubectl config view
  script:
    - |
      helm upgrade --install grpc-service ./helm/grpc-service \
        --namespace grpc-staging \
        --create-namespace \
        --set image.tag=$CI_COMMIT_SHA \
        --set environment=staging \
        --values ./helm/values-staging.yaml
    - kubectl wait --for=condition=ready pod -l app=grpc-service -n grpc-staging --timeout=300s
  environment:
    name: staging
    url: https://grpc-staging.example.com
  only:
    - develop

deploy-production:
  stage: deploy-production
  image: bitnami/kubectl:latest
  before_script:
    - echo $KUBE_CONFIG_PROD | base64 -d > $KUBECONFIG
    - kubectl config view
  script:
    - |
      helm upgrade --install grpc-service ./helm/grpc-service \
        --namespace grpc-production \
        --create-namespace \
        --set image.tag=$CI_COMMIT_TAG \
        --set environment=production \
        --values ./helm/values-production.yaml
    - kubectl wait --for=condition=ready pod -l app=grpc-service -n grpc-production --timeout=300s
  environment:
    name: production
    url: https://grpc.example.com
  when: manual
  only:
    - tags
"""

        return pipelines

# Drive the DeploymentManager end to end: render every artifact group and
# echo a short checklist describing what was generated.
deployment_mgr = DeploymentManager()


def _report(header: str, items) -> None:
    # Print a section header followed by one check-marked line per item.
    print(header)
    for entry in items:
        print("✓ " + entry)


# Containerization assets
dockerfile = deployment_mgr.create_dockerfile()
_report("=== 容器化部署 ===", (
    "多阶段构建Dockerfile",
    "Docker Compose配置",
    "安全最佳实践",
    "健康检查配置",
    "资源限制设置",
))

# Kubernetes manifests
k8s_manifests = deployment_mgr.create_kubernetes_manifests()
_report("\n=== Kubernetes部署 ===", (
    "完整的K8s清单文件",
    "自动扩缩容配置",
    "服务发现与负载均衡",
    "安全策略配置",
    "监控集成",
))

# Helm chart files
helm_chart = deployment_mgr.create_helm_chart()
_report("\n=== Helm Chart ===", (
    "参数化配置",
    "多环境支持",
    "版本管理",
    "依赖管理",
    "升级策略",
))

# CI/CD pipeline definitions
cicd_pipelines = deployment_mgr.create_cicd_pipeline()
_report("\n=== CI/CD流水线 ===", (
    "自动化测试",
    "安全扫描",
    "镜像构建",
    "多环境部署",
    "回滚策略",
))

运维监控

1. 监控配置

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "grpc_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'grpc-service'
    static_configs:
      - targets: ['grpc-service:9090']
    metrics_path: /metrics
    scrape_interval: 30s
    
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

2. 告警规则

# grpc_rules.yml
groups:
  - name: grpc.rules
    rules:
      - alert: GRPCHighErrorRate
        expr: rate(grpc_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High gRPC error rate"
          
      - alert: GRPCHighLatency
        expr: histogram_quantile(0.95, rate(grpc_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High gRPC latency"

3. 日志聚合

# fluentd配置
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*grpc*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      format json
    </source>
    
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch
      port 9200
      index_name grpc-logs
    </match>

故障恢复

1. 备份策略

#!/bin/bash
# backup.sh - 数据备份脚本

BACKUP_DIR="/backups/$(date +%Y%m%d)"
DB_HOST="postgres-service"
DB_NAME="grpc_db"

# 创建备份目录
mkdir -p $BACKUP_DIR

# 数据库备份
pg_dump -h $DB_HOST -U grpc_user $DB_NAME | gzip > $BACKUP_DIR/database.sql.gz

# 配置备份
kubectl get configmap grpc-config -o yaml > $BACKUP_DIR/configmap.yaml
kubectl get secret grpc-secret -o yaml > $BACKUP_DIR/secret.yaml

# 上传到云存储
aws s3 sync $BACKUP_DIR s3://grpc-backups/$(date +%Y%m%d)/

echo "Backup completed: $BACKUP_DIR"

2. 灾备恢复

#!/bin/bash
# restore.sh - 灾备恢复脚本

BACKUP_DATE=$1
BACKUP_DIR="/backups/$BACKUP_DATE"

if [ -z "$BACKUP_DATE" ]; then
    echo "Usage: $0 <backup_date>"
    exit 1
fi

# 从云存储下载备份
aws s3 sync s3://grpc-backups/$BACKUP_DATE/ $BACKUP_DIR/

# 恢复数据库
gunzip -c $BACKUP_DIR/database.sql.gz | psql -h postgres-service -U grpc_user grpc_db

# 恢复配置
kubectl apply -f $BACKUP_DIR/configmap.yaml
kubectl apply -f $BACKUP_DIR/secret.yaml

# 重启服务
kubectl rollout restart deployment/grpc-server -n grpc-system

echo "Restore completed from backup: $BACKUP_DATE"

3. 健康检查

// health.go - 健康检查实现
package main

import (
    "context"
    "database/sql"
    "time"

    "github.com/redis/go-redis/v9"
    "google.golang.org/grpc/health/grpc_health_v1"
)

type HealthChecker struct {
    db    *sql.DB
    redis *redis.Client
}

func (h *HealthChecker) Check(ctx context.Context, req *grpc_health_v1.HealthCheckRequest) (*grpc_health_v1.HealthCheckResponse, error) {
    // 检查数据库连接
    if err := h.checkDatabase(ctx); err != nil {
        return &grpc_health_v1.HealthCheckResponse{
            Status: grpc_health_v1.HealthCheckResponse_NOT_SERVING,
        }, nil
    }
    
    // 检查Redis连接
    if err := h.checkRedis(ctx); err != nil {
        return &grpc_health_v1.HealthCheckResponse{
            Status: grpc_health_v1.HealthCheckResponse_NOT_SERVING,
        }, nil
    }
    
    return &grpc_health_v1.HealthCheckResponse{
        Status: grpc_health_v1.HealthCheckResponse_SERVING,
    }, nil
}

func (h *HealthChecker) checkDatabase(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    
    return h.db.PingContext(ctx)
}

func (h *HealthChecker) checkRedis(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    
    return h.redis.Ping(ctx).Err()
}

性能调优

1. 资源配置

# 生产环境资源配置
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"

# JVM调优(如果使用Java)
env:
  - name: JAVA_OPTS
    value: "-Xms512m -Xmx1g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

2. 网络优化

# 网络策略
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grpc-network-policy
spec:
  podSelector:
    matchLabels:
      app: grpc-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: grpc-client
    ports:
    - protocol: TCP
      port: 8080

3. 缓存策略

// cache.go - 缓存配置
type CacheConfig struct {
    TTL           time.Duration
    MaxSize       int
    EvictionPolicy string
}

func NewCacheConfig() *CacheConfig {
    return &CacheConfig{
        TTL:           30 * time.Minute,
        MaxSize:       10000,
        EvictionPolicy: "LRU",
    }
}

总结

本章全面介绍了 gRPC 应用的部署与运维实践,主要内容包括:

核心要点

  1. 容器化部署

    • 多阶段构建优化
    • 安全最佳实践
    • 健康检查配置
    • 资源限制管理
  2. Kubernetes集成

    • 完整的清单文件
    • 自动扩缩容配置
    • 服务发现机制
    • 安全策略实施
  3. CI/CD流水线

    • 自动化测试集成
    • 安全扫描流程
    • 多环境部署策略
    • 回滚机制设计
  4. 运维监控

    • 指标收集配置
    • 告警规则设置
    • 日志聚合管理
    • 性能分析工具

最佳实践

  1. 部署策略

    • 蓝绿部署减少风险
    • 金丝雀发布验证
    • 滚动更新保证可用性
    • 快速回滚机制
  2. 安全实践

    • 最小权限原则
    • 镜像安全扫描
    • 网络策略隔离
    • 密钥管理规范
  3. 监控运维

    • 全方位监控覆盖
    • 主动告警机制
    • 自动化运维流程
    • 故障快速定位
  4. 性能优化

    • 资源合理配置
    • 网络优化调整
    • 缓存策略应用
    • 负载均衡配置

下一步学习

  • 深入学习云原生技术
  • 掌握服务网格应用
  • 了解GitOps实践
  • 实践SRE方法论
  • 学习混沌工程

通过本章学习,你已经掌握了 gRPC 应用在生产环境中的完整部署和运维技能,能够构建可靠、可扩展、易维护的微服务系统。这标志着 gRPC 教程的完成,从基础概念到生产实践,你已经具备了全面的 gRPC 开发和运维能力。