Overview

This chapter covers the installation and deployment options for TiDB, including standalone, cluster, containerized, and cloud deployments. By the end of the chapter you should be able to deploy a TiDB cluster in each of these environments and apply the relevant best practices.

Learning Objectives

After completing this chapter, you will understand:

- TiDB's deployment architecture and component requirements
- Which deployment method suits which scenario
- How to use the TiUP tool
- Containerized deployment in practice
- Best practices for production deployments
- Post-deployment verification and monitoring setup

Preparing for Deployment

1. System Requirements
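The requirement tables modeled in the code below drive automated checks. As a quick manual alternative, a host's basic capacity can be probed with the standard library alone; a minimal pre-flight sketch (the thresholds here are illustrative placeholders, not official TiDB minimums):

```python
import os
import shutil

# Illustrative thresholds only -- see the per-component requirement
# tables below for the minimums actually used in this chapter.
MIN_CPU_CORES = 4
MIN_STORAGE_GB = 50

def preflight_check(data_path: str = "/") -> dict:
    """Return a dict of basic pre-flight check results for this host."""
    cpu_cores = os.cpu_count() or 0
    free_gb = shutil.disk_usage(data_path).free / (1024 ** 3)
    return {
        "cpu_ok": cpu_cores >= MIN_CPU_CORES,
        "storage_ok": free_gb >= MIN_STORAGE_GB,
        "cpu_cores": cpu_cores,
        "free_storage_gb": round(free_gb, 1),
    }

print(preflight_check())
```

A real pre-flight script would also verify NTP status, firewall rules, and kernel parameters, which require OS-specific commands and are omitted here.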

from enum import Enum
from dataclasses import dataclass
from typing import Dict, List, Any, Optional
from datetime import datetime

class OSType(Enum):
    """Operating system type"""
    CENTOS = "centos"
    UBUNTU = "ubuntu"
    REDHAT = "redhat"
    DEBIAN = "debian"
    MACOS = "macos"
    WINDOWS = "windows"

class DeploymentType(Enum):
    """Deployment type"""
    STANDALONE = "standalone"      # single-node deployment
    CLUSTER = "cluster"            # cluster deployment
    DOCKER = "docker"              # Docker deployment
    KUBERNETES = "kubernetes"      # Kubernetes deployment
    CLOUD = "cloud"                # cloud deployment

class ComponentType(Enum):
    """Component type"""
    TIDB = "tidb"
    TIKV = "tikv"
    PD = "pd"
    TIFLASH = "tiflash"
    PUMP = "pump"
    DRAINER = "drainer"
    GRAFANA = "grafana"
    PROMETHEUS = "prometheus"
    ALERTMANAGER = "alertmanager"

class DeploymentStatus(Enum):
    """Deployment status"""
    PLANNING = "planning"
    PREPARING = "preparing"
    DEPLOYING = "deploying"
    RUNNING = "running"
    FAILED = "failed"
    STOPPED = "stopped"

@dataclass
class SystemRequirement:
    """System requirements"""
    component: ComponentType
    min_cpu_cores: int
    recommended_cpu_cores: int
    min_memory_gb: int
    recommended_memory_gb: int
    min_storage_gb: int
    recommended_storage_gb: int
    network_bandwidth_mbps: int
    os_requirements: List[str]
    additional_requirements: List[str]

@dataclass
class DeploymentNode:
    """Deployment node"""
    node_id: str
    hostname: str
    ip_address: str
    os_type: OSType
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    components: List[ComponentType]
    ssh_port: int
    ssh_user: str
    labels: Dict[str, str]
    status: str

@dataclass
class DeploymentPlan:
    """Deployment plan"""
    plan_id: str
    deployment_type: DeploymentType
    cluster_name: str
    nodes: List[DeploymentNode]
    global_config: Dict[str, Any]
    component_configs: Dict[ComponentType, Dict[str, Any]]
    monitoring_config: Dict[str, Any]
    security_config: Dict[str, Any]
    estimated_duration_minutes: int
    prerequisites: List[str]
    deployment_steps: List[str]

@dataclass
class DeploymentResult:
    """Deployment result"""
    plan_id: str
    status: DeploymentStatus
    start_time: datetime
    end_time: Optional[datetime]
    deployed_components: List[ComponentType]
    failed_components: List[ComponentType]
    cluster_endpoints: Dict[str, str]
    monitoring_urls: Dict[str, str]
    logs: List[str]
    next_steps: List[str]

class TiDBDeploymentManager:
    """TiDB deployment manager"""
    
    def __init__(self):
        self.system_requirements = self._initialize_system_requirements()
        self.deployment_templates = self._initialize_deployment_templates()
        self.deployment_history = []
    
    def _initialize_system_requirements(self) -> Dict[ComponentType, SystemRequirement]:
        """Initialize system requirements"""
        requirements = {}
        
        # TiDB server requirements
        requirements[ComponentType.TIDB] = SystemRequirement(
            component=ComponentType.TIDB,
            min_cpu_cores=4,
            recommended_cpu_cores=8,
            min_memory_gb=8,
            recommended_memory_gb=16,
            min_storage_gb=50,
            recommended_storage_gb=100,
            network_bandwidth_mbps=1000,
            os_requirements=[
                "CentOS 7.3+",
                "Ubuntu 16.04+",
                "RHEL 7.3+"
            ],
            additional_requirements=[
                "NTP time synchronization",
                "Firewall configuration",
                "Raised file descriptor limits",
                "Transparent huge pages disabled"
            ]
        )
        
        # TiKV requirements
        requirements[ComponentType.TIKV] = SystemRequirement(
            component=ComponentType.TIKV,
            min_cpu_cores=8,
            recommended_cpu_cores=16,
            min_memory_gb=16,
            recommended_memory_gb=32,
            min_storage_gb=500,
            recommended_storage_gb=1000,
            network_bandwidth_mbps=1000,
            os_requirements=[
                "CentOS 7.3+",
                "Ubuntu 16.04+",
                "RHEL 7.3+"
            ],
            additional_requirements=[
                "SSD storage recommended",
                "Dedicated data disk",
                "I/O scheduler tuning",
                "Kernel parameter tuning"
            ]
        )
        
        # PD requirements
        requirements[ComponentType.PD] = SystemRequirement(
            component=ComponentType.PD,
            min_cpu_cores=2,
            recommended_cpu_cores=4,
            min_memory_gb=4,
            recommended_memory_gb=8,
            min_storage_gb=20,
            recommended_storage_gb=50,
            network_bandwidth_mbps=1000,
            os_requirements=[
                "CentOS 7.3+",
                "Ubuntu 16.04+",
                "RHEL 7.3+"
            ],
            additional_requirements=[
                "Low-latency storage",
                "Stable network connectivity",
                "Accurate time synchronization",
                "Dedicated deployment recommended"
            ]
        )
        
        # TiFlash requirements
        requirements[ComponentType.TIFLASH] = SystemRequirement(
            component=ComponentType.TIFLASH,
            min_cpu_cores=16,
            recommended_cpu_cores=32,
            min_memory_gb=32,
            recommended_memory_gb=64,
            min_storage_gb=1000,
            recommended_storage_gb=2000,
            network_bandwidth_mbps=1000,
            os_requirements=[
                "CentOS 7.3+",
                "Ubuntu 16.04+",
                "RHEL 7.3+"
            ],
            additional_requirements=[
                "Large memory configuration",
                "High-speed storage",
                "CPU vectorization support",
                "Dedicated storage path"
            ]
        )
        
        return requirements
    
    def _initialize_deployment_templates(self) -> Dict[DeploymentType, Dict[str, Any]]:
        """Initialize deployment templates"""
        templates = {}
        
        # Standalone deployment template
        templates[DeploymentType.STANDALONE] = {
            "description": "Standalone deployment, suitable for development and testing",
            "min_nodes": 1,
            "recommended_nodes": 1,
            "components_per_node": [ComponentType.TIDB, ComponentType.TIKV, ComponentType.PD],
            "resource_requirements": {
                "cpu_cores": 8,
                "memory_gb": 16,
                "storage_gb": 200
            },
            "use_cases": ["development", "functional testing", "proof of concept", "learning and experimentation"]
        }
        
        # Cluster deployment template
        templates[DeploymentType.CLUSTER] = {
            "description": "Production-grade cluster deployment",
            "min_nodes": 6,
            "recommended_nodes": 9,
            "components_distribution": {
                ComponentType.TIDB: 2,
                ComponentType.TIKV: 3,
                ComponentType.PD: 3,
                ComponentType.TIFLASH: 1
            },
            "resource_requirements": {
                "total_cpu_cores": 64,
                "total_memory_gb": 128,
                "total_storage_gb": 3000
            },
            "use_cases": ["production", "high availability", "large-scale applications", "business-critical workloads"]
        }
        
        # Docker deployment template
        templates[DeploymentType.DOCKER] = {
            "description": "Containerized deployment with Docker",
            "min_nodes": 1,
            "recommended_nodes": 3,
            "container_requirements": {
                "docker_version": "19.03+",
                "docker_compose_version": "1.25+",
                "network_mode": "bridge",
                "volume_management": "bind_mount"
            },
            "use_cases": ["rapid deployment", "development environments", "CI/CD", "containerization migration"]
        }
        
        # Kubernetes deployment template
        templates[DeploymentType.KUBERNETES] = {
            "description": "Cloud-native deployment on Kubernetes",
            "min_nodes": 3,
            "recommended_nodes": 6,
            "k8s_requirements": {
                "kubernetes_version": "1.16+",
                "storage_class": "SSD",
                "network_plugin": "CNI",
                "operator_version": "1.3+"
            },
            "use_cases": ["cloud-native environments", "automated operations", "elastic scaling", "multi-tenancy"]
        }
        
        return templates
    
    def check_system_requirements(self, nodes: List[DeploymentNode]) -> Dict[str, Any]:
        """Check system requirements"""
        check_result = {
            "overall_status": "passed",
            "node_checks": [],
            "warnings": [],
            "errors": [],
            "recommendations": []
        }
        
        for node in nodes:
            node_check = {
                "node_id": node.node_id,
                "hostname": node.hostname,
                "component_checks": [],
                "status": "passed"
            }
            
            for component in node.components:
                if component in self.system_requirements:
                    req = self.system_requirements[component]
                    component_check = {
                        "component": component.value,
                        "status": "passed",
                        "cpu_check": "passed" if node.cpu_cores >= req.min_cpu_cores else "failed",
                        "memory_check": "passed" if node.memory_gb >= req.min_memory_gb else "failed",
                        "storage_check": "passed" if node.storage_gb >= req.min_storage_gb else "failed",
                        "recommendations": []
                    }
                    
                    # Generate recommendations
                    if node.cpu_cores < req.recommended_cpu_cores:
                        component_check["recommendations"].append(
                            f"Recommend at least {req.recommended_cpu_cores} CPU cores (currently {node.cpu_cores})"
                        )
                    
                    if node.memory_gb < req.recommended_memory_gb:
                        component_check["recommendations"].append(
                            f"Recommend at least {req.recommended_memory_gb} GB of memory (currently {node.memory_gb} GB)"
                        )
                    
                    if node.storage_gb < req.recommended_storage_gb:
                        component_check["recommendations"].append(
                            f"Recommend at least {req.recommended_storage_gb} GB of storage (currently {node.storage_gb} GB)"
                        )
                    
                    # Check whether any check failed
                    if any(status == "failed" for status in 
                          [component_check["cpu_check"], component_check["memory_check"], 
                           component_check["storage_check"]]):
                        component_check["status"] = "failed"
                        node_check["status"] = "failed"
                        check_result["overall_status"] = "failed"
                    
                    node_check["component_checks"].append(component_check)
            
            check_result["node_checks"].append(node_check)
        
        # Generate global recommendations
        if check_result["overall_status"] == "failed":
            check_result["errors"].append("Some nodes do not meet the minimum system requirements; please upgrade the hardware")
        
        check_result["recommendations"].extend([
            "Use SSD storage for better performance",
            "Keep clocks synchronized across all nodes",
            "Open the required ports in the firewall",
            "Tune operating system kernel parameters",
            "Disable transparent huge pages"
        ])
        
        return check_result
    
    def generate_deployment_plan(self, deployment_type: DeploymentType, 
                               cluster_name: str, nodes: List[DeploymentNode]) -> DeploymentPlan:
        """Generate a deployment plan"""
        template = self.deployment_templates.get(deployment_type)
        if not template:
            raise ValueError(f"Unsupported deployment type: {deployment_type}")
        
        # Global configuration
        global_config = {
            "cluster_name": cluster_name,
            "tidb_version": "v6.5.0",
            "timezone": "Asia/Shanghai",
            "enable_binlog": False,
            "enable_audit_log": True,
            "max_connections": 4000,
            "log_level": "info"
        }
        
        # Per-component configuration
        component_configs = {
            ComponentType.TIDB: {
                "port": 4000,
                "status_port": 10080,
                "max_connections": 1000,
                "token_limit": 1000,
                "mem_quota_query": "32GB",
                "log_file": "/tidb-data/tidb.log",
                "slow_threshold": 300
            },
            ComponentType.TIKV: {
                "port": 20160,
                "status_port": 20180,
                "data_dir": "/tidb-data/tikv",
                "log_file": "/tidb-data/tikv.log",
                "block_cache_size": "8GB",
                "write_buffer_size": "128MB",
                "max_open_files": 40960
            },
            ComponentType.PD: {
                "client_port": 2379,
                "peer_port": 2380,
                "data_dir": "/tidb-data/pd",
                "log_file": "/tidb-data/pd.log",
                "schedule_leader_limit": 4,
                "schedule_region_limit": 2048
            },
            ComponentType.TIFLASH: {
                "tcp_port": 9000,
                "http_port": 8123,
                "flash_service_port": 3930,
                "flash_proxy_port": 20170,
                "data_dir": "/tidb-data/tiflash",
                "log_file": "/tidb-data/tiflash.log"
            }
        }
        
        # Monitoring configuration
        monitoring_config = {
            "enable_monitoring": True,
            "prometheus_port": 9090,
            "grafana_port": 3000,
            "alertmanager_port": 9093,
            "retention_period": "30d",
            "scrape_interval": "15s"
        }
        
        # Security configuration
        security_config = {
            "enable_tls": False,
            "ca_cert_path": "",
            "server_cert_path": "",
            "server_key_path": "",
            "enable_authentication": True,
            "password_policy": "strong"
        }
        
        # Estimate deployment duration
        base_time = 30  # base: 30 minutes
        time_per_node = 10  # plus 10 minutes per node
        estimated_duration = base_time + (len(nodes) * time_per_node)
        
        # Prerequisites
        prerequisites = [
            "Ensure all nodes meet the system requirements",
            "Configure passwordless SSH login",
            "Synchronize clocks on all nodes",
            "Configure firewall rules",
            "Tune system parameters",
            "Prepare data directories",
            "Download the TiUP tool"
        ]
        
        # Deployment steps
        deployment_steps = [
            "1. Check and prepare the environment",
            "2. Download and install TiUP",
            "3. Generate the cluster topology file",
            "4. Validate the cluster configuration",
            "5. Deploy the cluster components",
            "6. Start the cluster services",
            "7. Initialize the cluster",
            "8. Set up the monitoring stack",
            "9. Verify the deployment",
            "10. Configure the backup strategy"
        ]
        
        plan = DeploymentPlan(
            plan_id=f"deploy_{cluster_name}_{int(datetime.now().timestamp())}",
            deployment_type=deployment_type,
            cluster_name=cluster_name,
            nodes=nodes,
            global_config=global_config,
            component_configs=component_configs,
            monitoring_config=monitoring_config,
            security_config=security_config,
            estimated_duration_minutes=estimated_duration,
            prerequisites=prerequisites,
            deployment_steps=deployment_steps
        )
        
        return plan
    
    def generate_tiup_config(self, plan: DeploymentPlan) -> str:
        """Generate the TiUP topology file content"""
        config = {
            "global": {
                "user": "tidb",
                "ssh_port": 22,
                "deploy_dir": "/tidb-deploy",
                "data_dir": "/tidb-data",
                "arch": "amd64"
            },
            "pd_servers": [],
            "tidb_servers": [],
            "tikv_servers": [],
            "tiflash_servers": [],
            "monitoring_servers": [],
            "grafana_servers": [],
            "alertmanager_servers": []
        }
        
        # Update the global configuration
        config["global"].update({
            "cluster_name": plan.cluster_name,
            "tidb_version": plan.global_config.get("tidb_version", "v6.5.0")
        })
        
        # Configure servers for each component
        for node in plan.nodes:
            node_config = {
                "host": node.ip_address,
                "ssh_port": node.ssh_port,
                "port": None,
                "status_port": None,
                "deploy_dir": "/tidb-deploy/{}",
                "data_dir": "/tidb-data/{}",
                "log_dir": "/tidb-data/{}/log",
                "numa_node": "0,1",
                "config": {}
            }
            
            for component in node.components:
                component_node_config = node_config.copy()
                
                if component == ComponentType.TIDB:
                    tidb_config = plan.component_configs[ComponentType.TIDB]
                    component_node_config.update({
                        "port": tidb_config["port"],
                        "status_port": tidb_config["status_port"],
                        "deploy_dir": component_node_config["deploy_dir"].format("tidb"),
                        "data_dir": component_node_config["data_dir"].format("tidb"),
                        "log_dir": component_node_config["log_dir"].format("tidb"),
                        "config": {
                            "log.level": plan.global_config.get("log_level", "info"),
                            "performance.max-procs": node.cpu_cores,
                            "performance.max-memory": f"{int(node.memory_gb * 0.8)}GB"
                        }
                    })
                    config["tidb_servers"].append(component_node_config)
                
                elif component == ComponentType.TIKV:
                    tikv_config = plan.component_configs[ComponentType.TIKV]
                    component_node_config.update({
                        "port": tikv_config["port"],
                        "status_port": tikv_config["status_port"],
                        "deploy_dir": component_node_config["deploy_dir"].format("tikv"),
                        "data_dir": component_node_config["data_dir"].format("tikv"),
                        "log_dir": component_node_config["log_dir"].format("tikv"),
                        "config": {
                            "storage.block-cache.capacity": tikv_config["block_cache_size"],
                            "rocksdb.max-open-files": tikv_config["max_open_files"],
                            "rocksdb.max-background-jobs": min(node.cpu_cores, 8)
                        }
                    })
                    config["tikv_servers"].append(component_node_config)
                
                elif component == ComponentType.PD:
                    pd_config = plan.component_configs[ComponentType.PD]
                    component_node_config.update({
                        "client_port": pd_config["client_port"],
                        "peer_port": pd_config["peer_port"],
                        "deploy_dir": component_node_config["deploy_dir"].format("pd"),
                        "data_dir": component_node_config["data_dir"].format("pd"),
                        "log_dir": component_node_config["log_dir"].format("pd"),
                        "config": {
                            "schedule.leader-schedule-limit": pd_config["schedule_leader_limit"],
                            "schedule.region-schedule-limit": pd_config["schedule_region_limit"]
                        }
                    })
                    # Remove fields not used by PD
                    component_node_config.pop("port", None)
                    component_node_config.pop("status_port", None)
                    config["pd_servers"].append(component_node_config)
                
                elif component == ComponentType.TIFLASH:
                    tiflash_config = plan.component_configs[ComponentType.TIFLASH]
                    component_node_config.update({
                        "tcp_port": tiflash_config["tcp_port"],
                        "http_port": tiflash_config["http_port"],
                        "flash_service_port": tiflash_config["flash_service_port"],
                        "flash_proxy_port": tiflash_config["flash_proxy_port"],
                        "deploy_dir": component_node_config["deploy_dir"].format("tiflash"),
                        "data_dir": component_node_config["data_dir"].format("tiflash"),
                        "log_dir": component_node_config["log_dir"].format("tiflash"),
                        "config": {
                            "profiles.default.max_memory_usage": f"{int(node.memory_gb * 0.8)}GB",
                            "profiles.default.max_threads": node.cpu_cores
                        }
                    })
                    # Remove fields not used by TiFlash
                    component_node_config.pop("port", None)
                    component_node_config.pop("status_port", None)
                    config["tiflash_servers"].append(component_node_config)
        
        # Add monitoring components (deployed on the first node)
        if plan.nodes and plan.monitoring_config.get("enable_monitoring", True):
            monitor_node = plan.nodes[0]
            
            # Prometheus
            config["monitoring_servers"].append({
                "host": monitor_node.ip_address,
                "ssh_port": monitor_node.ssh_port,
                "port": plan.monitoring_config["prometheus_port"],
                "deploy_dir": "/tidb-deploy/prometheus",
                "data_dir": "/tidb-data/prometheus",
                "log_dir": "/tidb-data/prometheus/log"
            })
            
            # Grafana
            config["grafana_servers"].append({
                "host": monitor_node.ip_address,
                "ssh_port": monitor_node.ssh_port,
                "port": plan.monitoring_config["grafana_port"],
                "deploy_dir": "/tidb-deploy/grafana",
                "config": {
                    "log.level": "info"
                }
            })
            
            # Alertmanager
            config["alertmanager_servers"].append({
                "host": monitor_node.ip_address,
                "ssh_port": monitor_node.ssh_port,
                "web_port": plan.monitoring_config["alertmanager_port"],
                "cluster_port": 9094,
                "deploy_dir": "/tidb-deploy/alertmanager",
                "data_dir": "/tidb-data/alertmanager",
                "log_dir": "/tidb-data/alertmanager/log"
            })
        
        # Serialize to a YAML string (requires the third-party PyYAML package)
        import yaml
        return yaml.dump(config, default_flow_style=False, allow_unicode=True)
    
    def generate_docker_compose(self, plan: DeploymentPlan) -> str:
        """Generate a Docker Compose configuration"""
        services = {}
        networks = {
            "tidb-network": {
                "driver": "bridge"
            }
        }
        volumes = {}
        
        # Generate a service definition for each component
        for i, node in enumerate(plan.nodes):
            for component in node.components:
                service_name = f"{component.value}-{i+1}"
                
                if component == ComponentType.PD:
                    services[service_name] = {
                        "image": f"pingcap/pd:{plan.global_config.get('tidb_version', 'v6.5.0')}",
                        "container_name": service_name,
                        "ports": ["2379:2379", "2380:2380"],
                        "volumes": [f"{service_name}-data:/data"],
                        "networks": ["tidb-network"],
                        "environment": {
                            "TZ": plan.global_config.get("timezone", "Asia/Shanghai")
                        },
                        "command": [
                            "--name=pd1",
                            "--data-dir=/data",
                            "--client-urls=http://0.0.0.0:2379",
                            "--peer-urls=http://0.0.0.0:2380",
                            "--advertise-client-urls=http://pd-1:2379",
                            "--advertise-peer-urls=http://pd-1:2380",
                            "--initial-cluster=pd1=http://pd-1:2380"
                        ],
                        "restart": "unless-stopped"
                    }
                    volumes[f"{service_name}-data"] = None
                
                elif component == ComponentType.TIKV:
                    services[service_name] = {
                        "image": f"pingcap/tikv:{plan.global_config.get('tidb_version', 'v6.5.0')}",
                        "container_name": service_name,
                        "ports": ["20160:20160"],
                        "volumes": [f"{service_name}-data:/data"],
                        "networks": ["tidb-network"],
                        "environment": {
                            "TZ": plan.global_config.get("timezone", "Asia/Shanghai")
                        },
                        "command": [
                            "--pd-endpoints=pd-1:2379",
                            "--addr=0.0.0.0:20160",
                            "--data-dir=/data"
                        ],
                        "depends_on": ["pd-1"],
                        "restart": "unless-stopped"
                    }
                    volumes[f"{service_name}-data"] = None
                
                elif component == ComponentType.TIDB:
                    services[service_name] = {
                        "image": f"pingcap/tidb:{plan.global_config.get('tidb_version', 'v6.5.0')}",
                        "container_name": service_name,
                        "ports": ["4000:4000", "10080:10080"],
                        "networks": ["tidb-network"],
                        "environment": {
                            "TZ": plan.global_config.get("timezone", "Asia/Shanghai")
                        },
                        "command": [
                            "--store=tikv",
                            "--path=pd-1:2379"
                        ],
                        "depends_on": ["pd-1", "tikv-1"],
                        "restart": "unless-stopped"
                    }
        
        # Add monitoring services
        if plan.monitoring_config.get("enable_monitoring", True):
            # Prometheus
            services["prometheus"] = {
                "image": "prom/prometheus:latest",
                "container_name": "prometheus",
                "ports": [f"{plan.monitoring_config['prometheus_port']}:9090"],
                "volumes": [
                    "prometheus-data:/prometheus",
                    "./prometheus.yml:/etc/prometheus/prometheus.yml"
                ],
                "networks": ["tidb-network"],
                "restart": "unless-stopped"
            }
            volumes["prometheus-data"] = None
            
            # Grafana
            services["grafana"] = {
                "image": "grafana/grafana:latest",
                "container_name": "grafana",
                "ports": [f"{plan.monitoring_config['grafana_port']}:3000"],
                "volumes": ["grafana-data:/var/lib/grafana"],
                "networks": ["tidb-network"],
                "environment": {
                    "GF_SECURITY_ADMIN_PASSWORD": "admin"
                },
                "restart": "unless-stopped"
            }
            volumes["grafana-data"] = None
        
        compose_config = {
            "version": "3.8",
            "services": services,
            "networks": networks,
            "volumes": volumes
        }
        
        import yaml  # requires the third-party PyYAML package
        return yaml.dump(compose_config, default_flow_style=False, allow_unicode=True)
    
    def simulate_deployment(self, plan: DeploymentPlan) -> DeploymentResult:
        """Simulate the deployment process"""
        start_time = datetime.now()
        
        # Simulated deployment state
        deployed_components = []
        failed_components = []
        logs = []
        
        logs.append(f"Starting deployment of cluster: {plan.cluster_name}")
        logs.append(f"Deployment type: {plan.deployment_type.value}")
        logs.append(f"Node count: {len(plan.nodes)}")
        
        # Simulate deployment of each component
        all_components = set()
        for node in plan.nodes:
            all_components.update(node.components)
        
        import random
        import time
        
        for component in all_components:
            try:
                logs.append(f"Deploying component {component.value}...")
                time.sleep(0.1)  # simulated deployment delay
                
                # 90% success rate per component
                if random.random() > 0.1:
                    deployed_components.append(component)
                    logs.append(f"Component {component.value} deployed successfully")
                else:
                    failed_components.append(component)
                    logs.append(f"Component {component.value} failed to deploy")
            except Exception as e:
                failed_components.append(component)
                logs.append(f"Component {component.value} raised an exception during deployment: {str(e)}")
        
        # Determine the deployment status
        if not failed_components:
            status = DeploymentStatus.RUNNING
            logs.append("Cluster deployed successfully; all components are running")
        elif len(failed_components) < len(all_components) / 2:
            status = DeploymentStatus.RUNNING
            logs.append("Cluster mostly deployed; some components need repair")
        else:
            status = DeploymentStatus.FAILED
            logs.append("Cluster deployment failed; most components could not start")
        
        # Cluster endpoints
        cluster_endpoints = {}
        monitoring_urls = {}
        
        if ComponentType.TIDB in deployed_components:
            tidb_node = next((node for node in plan.nodes if ComponentType.TIDB in node.components), None)
            if tidb_node:
                tidb_port = plan.component_configs[ComponentType.TIDB]["port"]
                cluster_endpoints["tidb"] = f"{tidb_node.ip_address}:{tidb_port}"
        
        if ComponentType.PD in deployed_components:
            pd_node = next((node for node in plan.nodes if ComponentType.PD in node.components), None)
            if pd_node:
                pd_port = plan.component_configs[ComponentType.PD]["client_port"]
                cluster_endpoints["pd"] = f"{pd_node.ip_address}:{pd_port}"
        
        if plan.monitoring_config.get("enable_monitoring", True):
            monitor_node = plan.nodes[0] if plan.nodes else None
            if monitor_node:
                monitoring_urls["prometheus"] = f"http://{monitor_node.ip_address}:{plan.monitoring_config['prometheus_port']}"
                monitoring_urls["grafana"] = f"http://{monitor_node.ip_address}:{plan.monitoring_config['grafana_port']}"
        
        # Suggested next steps
        next_steps = []
        if status == DeploymentStatus.RUNNING:
            next_steps.extend([
                "Configure database users and privileges",
                "Import initial data",
                "Configure the backup strategy",
                "Set up monitoring alerts",
                "Run performance tests"
            ])
        else:
            next_steps.extend([
                "Inspect the logs of the failed components",
                "Fix network and configuration issues",
                "Redeploy the failed components",
                "Verify the cluster status"
            ])
        
        if failed_components:
            next_steps.append(f"Repair the failed components: {', '.join([c.value for c in failed_components])}")
        
        result = DeploymentResult(
            plan_id=plan.plan_id,
            status=status,
            start_time=start_time,
            end_time=datetime.now(),
            deployed_components=deployed_components,
            failed_components=failed_components,
            cluster_endpoints=cluster_endpoints,
            monitoring_urls=monitoring_urls,
            logs=logs,
            next_steps=next_steps
        )
        
        self.deployment_history.append(result)
        return result
    
    def get_deployment_template_info(self, deployment_type: DeploymentType) -> Dict[str, Any]:
        """Get information about a deployment template"""
        template = self.deployment_templates.get(deployment_type)
        if not template:
            return {"error": f"Unsupported deployment type: {deployment_type}"}
        
        return {
            "deployment_type": deployment_type.value,
            "description": template["description"],
            "resource_requirements": template.get("resource_requirements", {}),
            "use_cases": template.get("use_cases", []),
            "min_nodes": template.get("min_nodes", 1),
            "recommended_nodes": template.get("recommended_nodes", 3),
            "special_requirements": template.get("k8s_requirements", template.get("container_requirements", {}))
        }

# TiDB部署演示
print("\n\n=== TiDB安装与部署 ===")

deployment_manager = TiDBDeploymentManager()

print("\n1. 系统要求检查:")
# 模拟节点配置
test_nodes = [
    DeploymentNode(
        node_id="node-1", hostname="tidb-node-1", ip_address="192.168.1.10",
        os_type=OSType.CENTOS, cpu_cores=8, memory_gb=16, storage_gb=500,
        components=[ComponentType.TIDB, ComponentType.PD], ssh_port=22, ssh_user="tidb",
        labels={"zone": "zone-a", "role": "compute"}, status="ready"
    ),
    DeploymentNode(
        node_id="node-2", hostname="tidb-node-2", ip_address="192.168.1.11",
        os_type=OSType.CENTOS, cpu_cores=16, memory_gb=32, storage_gb=1000,
        components=[ComponentType.TIKV], ssh_port=22, ssh_user="tidb",
        labels={"zone": "zone-b", "role": "storage"}, status="ready"
    ),
    DeploymentNode(
        node_id="node-3", hostname="tidb-node-3", ip_address="192.168.1.12",
        os_type=OSType.CENTOS, cpu_cores=16, memory_gb=32, storage_gb=1000,
        components=[ComponentType.TIKV], ssh_port=22, ssh_user="tidb",
        labels={"zone": "zone-c", "role": "storage"}, status="ready"
    )
]

check_result = deployment_manager.check_system_requirements(test_nodes)
print(f"   整体状态: {check_result['overall_status']}")
print(f"   检查节点数: {len(check_result['node_checks'])}")
print(f"   建议数量: {len(check_result['recommendations'])}")

for node_check in check_result['node_checks'][:2]:
    print(f"\n   节点 {node_check['hostname']}:")
    print(f"     状态: {node_check['status']}")
    print(f"     组件检查: {len(node_check['component_checks'])}项")
    for comp_check in node_check['component_checks']:
        print(f"       {comp_check['component']}: CPU({comp_check['cpu_check']}) "
              f"内存({comp_check['memory_check']}) 存储({comp_check['storage_check']})")

print("\n2. 部署模板信息:")
for deploy_type in [DeploymentType.STANDALONE, DeploymentType.CLUSTER, DeploymentType.DOCKER]:
    template_info = deployment_manager.get_deployment_template_info(deploy_type)
    print(f"\n   {deploy_type.value.title()}部署:")
    print(f"     描述: {template_info['description']}")
    print(f"     最少节点: {template_info['min_nodes']}")
    print(f"     推荐节点: {template_info['recommended_nodes']}")
    print(f"     适用场景: {', '.join(template_info['use_cases'][:3])}")

print("\n3. 生成部署计划:")
deployment_plan = deployment_manager.generate_deployment_plan(
    DeploymentType.CLUSTER, "test-cluster", test_nodes
)
print(f"   计划ID: {deployment_plan.plan_id}")
print(f"   集群名称: {deployment_plan.cluster_name}")
print(f"   部署类型: {deployment_plan.deployment_type.value}")
print(f"   节点数量: {len(deployment_plan.nodes)}")
print(f"   预计耗时: {deployment_plan.estimated_duration_minutes}分钟")
print(f"   前置条件: {len(deployment_plan.prerequisites)}项")
print(f"   部署步骤: {len(deployment_plan.deployment_steps)}步")

print("\n   全局配置:")
for key, value in list(deployment_plan.global_config.items())[:5]:
    print(f"     {key}: {value}")

print("\n   组件配置:")
for component, config in list(deployment_plan.component_configs.items())[:2]:
    print(f"     {component.value}: {len(config)}个配置项")
    for key, value in list(config.items())[:3]:
        print(f"       {key}: {value}")

print("\n4. TiUP配置文件生成:")
tiup_config = deployment_manager.generate_tiup_config(deployment_plan)
config_lines = tiup_config.split('\n')
print(f"   配置文件行数: {len(config_lines)}")
print(f"   主要配置段: global, pd_servers, tidb_servers, tikv_servers")
print("\n   配置文件预览:")
for line in config_lines[:10]:
    if line.strip():
        print(f"     {line}")

print("\n5. Docker Compose配置生成:")
docker_compose = deployment_manager.generate_docker_compose(deployment_plan)
compose_lines = docker_compose.split('\n')
print(f"   配置文件行数: {len(compose_lines)}")
print("\n   配置文件预览:")
for line in compose_lines[:15]:
    if line.strip():
        print(f"     {line}")

print("\n6. 模拟部署过程:")
deployment_result = deployment_manager.simulate_deployment(deployment_plan)
print(f"   部署状态: {deployment_result.status.value}")
print(f"   开始时间: {deployment_result.start_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   结束时间: {deployment_result.end_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   成功组件: {len(deployment_result.deployed_components)}个")
print(f"   失败组件: {len(deployment_result.failed_components)}个")

print("\n   集群端点:")
for service, endpoint in deployment_result.cluster_endpoints.items():
    print(f"     {service}: {endpoint}")

print("\n   监控地址:")
for service, url in deployment_result.monitoring_urls.items():
    print(f"     {service}: {url}")

print("\n   部署日志:")
for log in deployment_result.logs[:8]:
    print(f"     {log}")

print("\n   后续步骤:")
for step in deployment_result.next_steps[:5]:
    print(f"     {step}")

TiUP工具使用

1. TiUP安装

TiUP是TiDB官方提供的集群部署和管理工具,支持一键部署、升级、扩容等操作。

安装TiUP:

# 在线安装
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh

# 重新加载环境变量
source ~/.bashrc

# 验证安装
tiup --version

# 安装cluster组件
tiup install cluster

TiUP常用命令:

# 查看可用版本
tiup list tidb

# 查看集群列表
tiup cluster list

# 检查集群配置
tiup cluster check topology.yaml

# 部署集群
tiup cluster deploy test-cluster v6.5.0 topology.yaml

# 启动集群
tiup cluster start test-cluster

# 查看集群状态
tiup cluster display test-cluster

# 停止集群
tiup cluster stop test-cluster

# 销毁集群
tiup cluster destroy test-cluster
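
与本章Python演示的风格一致,这些TiUP生命周期命令可以用一个小的辅助函数统一拼装(示意代码,`build_tiup_command`为假设的函数名,此处只构造参数列表,不实际执行tiup):

```python
import shlex

# 假设性辅助函数: 将TiUP集群生命周期操作拼装为命令行参数列表,
# 便于交给subprocess.run()执行(此处仅演示命令构造)
def build_tiup_command(action: str, cluster_name: str, *extra: str) -> list:
    allowed = {"deploy", "start", "stop", "display", "destroy", "check", "list"}
    if action not in allowed:
        raise ValueError(f"不支持的操作: {action}")
    cmd = ["tiup", "cluster", action]
    if action != "list":
        cmd.append(cluster_name)
    cmd.extend(extra)  # 附加参数, 如版本号、拓扑文件
    return cmd

deploy_cmd = build_tiup_command("deploy", "test-cluster", "v6.5.0", "topology.yaml")
print(shlex.join(deploy_cmd))  # tiup cluster deploy test-cluster v6.5.0 topology.yaml
```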

2. 集群配置文件

基础集群配置示例:

global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"
  arch: "amd64"

pd_servers:
  - host: 10.0.1.1
    client_port: 2379
    peer_port: 2380
  - host: 10.0.1.2
    client_port: 2379
    peer_port: 2380
  - host: 10.0.1.3
    client_port: 2379
    peer_port: 2380

tidb_servers:
  - host: 10.0.1.4
    port: 4000
    status_port: 10080
  - host: 10.0.1.5
    port: 4000
    status_port: 10080

tikv_servers:
  - host: 10.0.1.6
    port: 20160
    status_port: 20180
  - host: 10.0.1.7
    port: 20160
    status_port: 20180
  - host: 10.0.1.8
    port: 20160
    status_port: 20180

monitoring_servers:
  - host: 10.0.1.9
    port: 9090

grafana_servers:
  - host: 10.0.1.9
    port: 3000
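
上述拓扑文件的主机段结构很规整,可以按节点列表自动生成(示意代码,`render_hosts`为假设的辅助函数;生产环境建议使用PyYAML等库而非手工拼接字符串):

```python
# 假设性示意: 根据主机IP列表与端口配置生成topology.yaml中的一个主机段
def render_hosts(section: str, hosts: list, ports: dict) -> str:
    lines = [f"{section}:"]
    for h in hosts:
        lines.append(f"  - host: {h}")
        for key, value in ports.items():
            lines.append(f"    {key}: {value}")
    return "\n".join(lines)

pd_block = render_hosts("pd_servers", ["10.0.1.1", "10.0.1.2", "10.0.1.3"],
                        {"client_port": 2379, "peer_port": 2380})
print(pd_block)
```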

3. 部署流程

完整部署流程:

  1. 环境准备

# 配置SSH免密登录
ssh-keygen -t rsa
ssh-copy-id tidb@10.0.1.1

# 同步时间
sudo ntpdate -s time.nist.gov

# 调整系统参数
echo 'fs.file-max = 1000000' >> /etc/sysctl.conf
sysctl -p

  2. 检查环境

tiup cluster check topology.yaml --user tidb
tiup cluster check topology.yaml --apply --user tidb

  3. 部署集群

tiup cluster deploy test-cluster v6.5.0 topology.yaml --user tidb

  4. 启动集群

tiup cluster start test-cluster

  5. 验证部署

tiup cluster display test-cluster
mysql -h 10.0.1.4 -P 4000 -u root

容器化部署

1. Docker部署

单机Docker部署:

# 拉取镜像
docker pull pingcap/tidb:v6.5.0
docker pull pingcap/tikv:v6.5.0
docker pull pingcap/pd:v6.5.0

# 创建网络
docker network create tidb-network

# 启动PD
docker run -d --name pd1 \
  --network tidb-network \
  -p 2379:2379 -p 2380:2380 \
  -v pd1-data:/data \
  pingcap/pd:v6.5.0 \
  --name=pd1 \
  --data-dir=/data \
  --client-urls=http://0.0.0.0:2379 \
  --peer-urls=http://0.0.0.0:2380 \
  --advertise-client-urls=http://pd1:2379 \
  --advertise-peer-urls=http://pd1:2380 \
  --initial-cluster=pd1=http://pd1:2380

# 启动TiKV(容器内监听0.0.0.0时需要通过--advertise-addr对外宣告可达地址)
docker run -d --name tikv1 \
  --network tidb-network \
  -p 20160:20160 \
  -v tikv1-data:/data \
  pingcap/tikv:v6.5.0 \
  --pd-endpoints=pd1:2379 \
  --addr=0.0.0.0:20160 \
  --advertise-addr=tikv1:20160 \
  --data-dir=/data

# 启动TiDB
docker run -d --name tidb1 \
  --network tidb-network \
  -p 4000:4000 -p 10080:10080 \
  pingcap/tidb:v6.5.0 \
  --store=tikv \
  --path=pd1:2379

Docker Compose部署:

version: '3.8'

services:
  pd:
    image: pingcap/pd:v6.5.0
    container_name: pd
    ports:
      - "2379:2379"
      - "2380:2380"
    volumes:
      - pd-data:/data
    command:
      - --name=pd
      - --data-dir=/data
      - --client-urls=http://0.0.0.0:2379
      - --peer-urls=http://0.0.0.0:2380
      - --advertise-client-urls=http://pd:2379
      - --advertise-peer-urls=http://pd:2380
      - --initial-cluster=pd=http://pd:2380
    restart: unless-stopped

  tikv:
    image: pingcap/tikv:v6.5.0
    container_name: tikv
    ports:
      - "20160:20160"
    volumes:
      - tikv-data:/data
    command:
      - --pd-endpoints=pd:2379
      - --addr=0.0.0.0:20160
      - --advertise-addr=tikv:20160
      - --data-dir=/data
    depends_on:
      - pd
    restart: unless-stopped

  tidb:
    image: pingcap/tidb:v6.5.0
    container_name: tidb
    ports:
      - "4000:4000"
      - "10080:10080"
    command:
      - --store=tikv
      - --path=pd:2379
    depends_on:
      - pd
      - tikv
    restart: unless-stopped

volumes:
  pd-data:
  tikv-data:
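
上面compose文件中的depends_on决定了容器启动顺序(pd先于tikv,tikv先于tidb)。可以用一个小的拓扑排序校验这一顺序并发现循环依赖(示意代码,`startup_order`为假设的函数名):

```python
from collections import deque

# 假设性示意: 对depends_on声明做Kahn拓扑排序, 得到服务启动顺序
def startup_order(depends_on: dict) -> list:
    services = set(depends_on)
    for deps in depends_on.values():
        services.update(deps)
    indegree = {s: 0 for s in services}
    for svc, deps in depends_on.items():
        indegree[svc] += len(deps)
    queue = deque(sorted(s for s in services if indegree[s] == 0))
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for svc, deps in depends_on.items():
            if s in deps:
                indegree[svc] -= 1
                if indegree[svc] == 0:
                    queue.append(svc)
    if len(order) != len(services):
        raise ValueError("depends_on存在循环依赖")
    return order

order = startup_order({"tikv": ["pd"], "tidb": ["pd", "tikv"]})
print(order)
```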

2. Kubernetes部署

使用TiDB Operator:

  1. 安装TiDB Operator

# 添加Helm仓库
helm repo add pingcap https://charts.pingcap.org/
helm repo update

# 安装CRD
kubectl apply -f https://raw.githubusercontent.com/pingcap/tidb-operator/v1.4.0/manifests/crd.yaml

# 安装Operator
helm install tidb-operator pingcap/tidb-operator --namespace tidb-admin --create-namespace


  2. 部署TiDB集群

   apiVersion: pingcap.com/v1alpha1
   kind: TidbCluster
   metadata:
     name: basic
     namespace: tidb-cluster
   spec:
     version: v6.5.0
     timezone: Asia/Shanghai
     
     pd:
       baseImage: pingcap/pd
       replicas: 3
       requests:
         storage: "10Gi"
       config: |
         [log]
         level = "info"
     
     tikv:
       baseImage: pingcap/tikv
       replicas: 3
       requests:
         storage: "100Gi"
       config: |
         [storage]
         reserve-space = "2GB"
     
     tidb:
       baseImage: pingcap/tidb
       replicas: 2
       service:
         type: ClusterIP
       config: |
         [log]
         level = "info"

生产环境部署最佳实践

1. 硬件配置建议

TiDB Server:
- CPU: 16核以上
- 内存: 32GB以上
- 存储: SSD 200GB以上
- 网络: 万兆网卡

TiKV Server:
- CPU: 16核以上
- 内存: 64GB以上
- 存储: NVMe SSD 1TB以上
- 网络: 万兆网卡

PD Server:
- CPU: 8核以上
- 内存: 16GB以上
- 存储: SSD 200GB以上
- 网络: 万兆网卡

2. 网络配置

端口规划:
- TiDB: 4000 (MySQL协议), 10080 (HTTP状态)
- TiKV: 20160 (客户端), 20180 (状态)
- PD: 2379 (客户端), 2380 (节点间通信)
- 监控: 9090 (Prometheus), 3000 (Grafana)
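
基于上述端口规划,可以用一个简单的检查函数发现同一主机上混部多个组件(或多实例)时的默认端口冲突(示意代码,端口表与函数名均为假设):

```python
# 假设性示意: 本章端口规划的默认端口表
DEFAULT_PORTS = {
    "tidb": [4000, 10080],
    "tikv": [20160, 20180],
    "pd": [2379, 2380],
    "prometheus": [9090],
    "grafana": [3000],
}

# 检查同一主机上部署的组件列表是否存在端口冲突
def find_port_conflicts(components: list) -> list:
    seen, conflicts = {}, []
    for comp in components:
        for port in DEFAULT_PORTS[comp]:
            if port in seen:
                conflicts.append((port, seen[port], comp))
            else:
                seen[port] = comp
    return conflicts

print(find_port_conflicts(["tidb", "pd"]))        # 无冲突
print(find_port_conflicts(["tikv", "tikv"]))      # 同机双TiKV实例需改端口
```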

防火墙配置:

# CentOS/RHEL
firewall-cmd --permanent --add-port=4000/tcp
firewall-cmd --permanent --add-port=10080/tcp
firewall-cmd --permanent --add-port=20160/tcp
firewall-cmd --permanent --add-port=20180/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --reload

# Ubuntu/Debian
ufw allow 4000/tcp
ufw allow 10080/tcp
ufw allow 20160/tcp
ufw allow 20180/tcp
ufw allow 2379/tcp
ufw allow 2380/tcp
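
两套防火墙命令只在前缀上有差异,可以按操作系统类型统一生成(示意代码,`firewall_commands`为假设的函数名):

```python
# 假设性示意: 按操作系统家族生成防火墙放行命令列表
def firewall_commands(os_family: str, ports: list) -> list:
    if os_family in ("centos", "redhat"):
        cmds = [f"firewall-cmd --permanent --add-port={p}/tcp" for p in ports]
        cmds.append("firewall-cmd --reload")  # firewalld需重载才能生效
        return cmds
    if os_family in ("ubuntu", "debian"):
        return [f"ufw allow {p}/tcp" for p in ports]
    raise ValueError(f"未知的操作系统: {os_family}")

for cmd in firewall_commands("centos", [4000, 2379]):
    print(cmd)
```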

3. 系统优化

内核参数优化:

# /etc/sysctl.conf
fs.file-max = 1000000
fs.nr_open = 1000000
net.core.somaxconn = 32768
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_syncookies = 0
vm.swappiness = 1
vm.overcommit_memory = 1

# 应用配置
sysctl -p

用户限制配置:

# /etc/security/limits.conf
tidb soft nofile 1000000
tidb hard nofile 1000000
tidb soft stack 32768
tidb hard stack 32768

禁用透明大页:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# 永久禁用
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local
echo 'echo never > /sys/kernel/mm/transparent_hugepage/defrag' >> /etc/rc.local
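
该sysfs文件的内容形如 `always madvise [never]`,方括号标记当前生效的模式。下面是一个解析并校验THP状态的示意函数(`thp_mode`为假设的函数名):

```python
# 假设性示意: 解析/sys/kernel/mm/transparent_hugepage/enabled的文件内容,
# 返回方括号标记的当前THP模式, 部署前应确认其为never
def thp_mode(sys_file_content: str) -> str:
    for token in sys_file_content.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("无法解析THP状态")

print(thp_mode("always madvise [never]"))  # never
```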

4. 存储配置

磁盘分区建议:
- 系统盘: 100GB以上
- 数据盘: 根据业务需求,建议1TB以上
- 日志盘: 200GB以上
- 备份盘: 数据盘的2倍以上

文件系统优化:

# 格式化数据盘
mkfs.ext4 -F /dev/sdb

# 挂载配置
echo '/dev/sdb /tidb-data ext4 defaults,noatime,nodelalloc 0 2' >> /etc/fstab
mount -a

# 设置权限
chown -R tidb:tidb /tidb-data

部署后验证

1. 集群状态检查

# 检查集群状态
tiup cluster display test-cluster

# 检查组件状态
tiup cluster exec test-cluster --command "systemctl status tidb-4000.service"
tiup cluster exec test-cluster --command "systemctl status tikv-20160.service"
tiup cluster exec test-cluster --command "systemctl status pd-2379.service"

2. 数据库连接测试

# 使用MySQL客户端连接
mysql -h 192.168.1.10 -P 4000 -u root

# 基础SQL测试
SHOW DATABASES;
CREATE DATABASE test_db;
USE test_db;
CREATE TABLE test_table (id INT PRIMARY KEY, name VARCHAR(50));
INSERT INTO test_table VALUES (1, 'test');
SELECT * FROM test_table;

3. 性能基准测试

# 安装sysbench
yum install -y sysbench

# 准备测试数据
sysbench oltp_read_write \
  --mysql-host=192.168.1.10 \
  --mysql-port=4000 \
  --mysql-user=root \
  --mysql-db=test \
  --tables=10 \
  --table-size=100000 \
  prepare

# 运行性能测试
sysbench oltp_read_write \
  --mysql-host=192.168.1.10 \
  --mysql-port=4000 \
  --mysql-user=root \
  --mysql-db=test \
  --tables=10 \
  --table-size=100000 \
  --threads=16 \
  --time=300 \
  run
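
prepare与run两个阶段的连接参数大部分相同,可以用Python统一拼装sysbench命令(示意代码,`sysbench_cmd`为假设的函数名,端口与库名沿用上文示例):

```python
# 假设性示意: 拼装sysbench oltp_read_write各阶段的命令参数列表
def sysbench_cmd(phase: str, host: str, threads: int = 16, time_s: int = 300) -> list:
    assert phase in ("prepare", "run", "cleanup")
    cmd = ["sysbench", "oltp_read_write",
           f"--mysql-host={host}", "--mysql-port=4000",
           "--mysql-user=root", "--mysql-db=test",
           "--tables=10", "--table-size=100000"]
    if phase == "run":  # 并发与时长参数只在run阶段需要
        cmd += [f"--threads={threads}", f"--time={time_s}"]
    cmd.append(phase)
    return cmd

print(" ".join(sysbench_cmd("run", "192.168.1.10")))
```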

监控配置

1. Prometheus配置

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "tidb.rules.yml"

scrape_configs:
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.1.10:10080', '192.168.1.11:10080']
  
  - job_name: 'tikv'
    static_configs:
      - targets: ['192.168.1.12:20180', '192.168.1.13:20180']
  
  - job_name: 'pd'
    static_configs:
      - targets: ['192.168.1.14:2379', '192.168.1.15:2379']
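
scrape_configs各段结构一致,只是job名、目标IP和状态端口不同,可以按拓扑自动生成(示意代码,函数与端口表均为假设):

```python
import json

# 假设性示意: 各组件暴露metrics的状态端口(沿用上文配置)
STATUS_PORTS = {"tidb": 10080, "tikv": 20180, "pd": 2379}

# 根据 {组件: IP列表} 的拓扑生成prometheus.yml的scrape_configs段
def scrape_configs(topology: dict) -> list:
    return [
        {"job_name": job,
         "static_configs": [{"targets": [f"{ip}:{STATUS_PORTS[job]}" for ip in ips]}]}
        for job, ips in topology.items()
    ]

configs = scrape_configs({"tidb": ["192.168.1.10", "192.168.1.11"],
                          "tikv": ["192.168.1.12", "192.168.1.13"]})
print(json.dumps(configs, indent=2, ensure_ascii=False))
```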

2. Grafana仪表板

导入TiDB官方仪表板:
1. 访问Grafana Web界面 (http://监控节点IP:3000)
2. 使用admin/admin登录
3. 配置Prometheus数据源
4. 导入TiDB官方仪表板模板

关键监控指标:
- QPS (每秒查询数)
- 响应时间
- 连接数
- CPU和内存使用率
- 磁盘I/O
- 网络流量
- 集群健康状态

常见问题排查

1. 部署失败

问题:SSH连接失败

# 检查SSH配置
ssh -v tidb@192.168.1.10

# 检查防火墙
systemctl status firewalld
firewall-cmd --list-all

问题:端口冲突

# 检查端口占用
netstat -tlnp | grep :4000
lsof -i :4000

# 修改配置文件中的端口
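
netstat/lsof的端口占用检查也可以用Python的socket模块快速完成(示意代码,`port_in_use`为假设的函数名):

```python
import socket

# 假设性示意: 尝试连接本机端口, 连接成功(返回码0)说明端口已被占用
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0

print(port_in_use(4000))  # TiDB未启动时通常为False
```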

问题:磁盘空间不足

# 检查磁盘空间
df -h

# 清理日志文件
find /tidb-data -name "*.log" -mtime +7 -delete
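
上面的find命令也可以写成等价的Python清理脚本,便于加入巡检任务(示意代码,`clean_old_logs`为假设的函数名,演示使用临时目录而非真实的/tidb-data):

```python
import os
import tempfile
import time
from pathlib import Path

# 假设性示意: 等价于 find <dir> -name "*.log" -mtime +7 -delete,
# 删除目录下修改时间超过指定天数的*.log文件, 返回被删文件名
def clean_old_logs(directory: str, days: int = 7) -> list:
    cutoff = time.time() - days * 86400
    removed = []
    for path in Path(directory).rglob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed

# 演示: 构造一旧一新两个日志文件, 只有旧文件会被清理
tmp = tempfile.mkdtemp()
old_log = Path(tmp) / "old.log"
old_log.write_text("stale")
os.utime(old_log, (time.time() - 10 * 86400,) * 2)  # 把mtime改到10天前
Path(tmp, "new.log").write_text("fresh")
print(clean_old_logs(tmp))
```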

2. 性能问题

问题:查询慢

-- 查看慢查询
SELECT * FROM INFORMATION_SCHEMA.SLOW_QUERY 
WHERE TIME > '2023-01-01 00:00:00' 
ORDER BY Query_time DESC LIMIT 10;

-- 分析执行计划
EXPLAIN ANALYZE SELECT * FROM table_name WHERE condition;

问题:连接数过多

-- 查看当前连接
SHOW PROCESSLIST;

-- 调整最大连接数
SET GLOBAL max_connections = 2000;

3. 集群问题

问题:节点下线

# 检查节点状态
tiup cluster display test-cluster

# 重启节点
tiup cluster restart test-cluster -N 192.168.1.10:4000

# 替换故障节点
tiup cluster scale-out test-cluster scale-out.yaml

总结

本章详细介绍了TiDB的安装部署方法,包括:

关键要点

  1. 部署方式选择

    • 单机部署适用于开发测试
    • 集群部署适用于生产环境
    • 容器化部署适用于云原生环境
  2. 工具使用

    • TiUP是官方推荐的部署工具
    • 支持一键部署、升级、扩容
    • 提供完整的集群生命周期管理
  3. 环境要求

    • 硬件配置要满足最低要求
    • 系统参数需要优化调整
    • 网络和存储配置要合理

最佳实践

  1. 部署规划

    • 合理规划集群拓扑
    • 考虑高可用和容灾
    • 预留扩展空间
  2. 环境准备

    • 充分的环境检查
    • 系统参数优化
    • 安全配置加固
  3. 监控运维

    • 完善的监控体系
    • 及时的告警机制
    • 定期的健康检查
  4. 故障处理

    • 建立故障处理流程
    • 准备应急预案
    • 定期演练恢复过程

下一步学习

  1. 基础操作: 学习TiDB的SQL语法和基础操作
  2. 性能优化: 掌握查询优化和性能调优技巧
  3. 运维管理: 了解集群运维和故障处理方法
  4. 高级特性: 探索TiDB的高级功能和特性

通过本章的学习,您已经掌握了TiDB的部署方法。在后续章节中,我们将深入学习TiDB的使用和优化技巧。