2.1 Environment Preparation

2.1.1 System Requirements

Hardware requirements:
- CPU: 8+ cores
- Memory: 16 GB or more (32 GB recommended)
- Storage: 100 GB or more of free space
- Network: Gigabit Ethernet

Software requirements:
- Operating system: CentOS 7+, Ubuntu 18.04+, RHEL 7+
- Java: JDK 8 (Oracle JDK or OpenJDK)
- Python: 2.7 or 3.6+

Dependency version compatibility:

Kylin version   Hadoop   Hive   HBase    Spark
-------------   ------   ----   ------   -----
4.0.x           2.8+     1.2+   n/a *    2.4+
3.1.x           2.7+     1.2+   1.1+     2.3+
3.0.x           2.7+     1.2+   1.1+     2.3+
2.6.x           2.7+     1.2+   1.1+     2.1+

* Kylin 4.0.x stores cube data as Parquet on HDFS and keeps its metadata in an RDBMS, so it does not require HBase; the HBase column applies to the 2.x/3.x storage engine.

2.1.2 Environment Check Script

#!/bin/bash
# kylin_env_check.sh - Kylin environment check script

echo "=== Apache Kylin Environment Check ==="

# Check the operating system
echo "1. Operating system:"
grep -E "NAME|VERSION" /etc/os-release
echo

# Check the Java version
echo "2. Java version:"
if command -v java &> /dev/null; then
    java -version
    echo "JAVA_HOME: $JAVA_HOME"
else
    echo "ERROR: Java not found; please install JDK 8"
fi
echo

# Check memory
echo "3. System memory:"
free -h
echo

# Check disk space
echo "4. Disk space:"
df -h
echo

# Check network connectivity
echo "5. Network connectivity:"
if ping -c 3 google.com > /dev/null 2>&1; then
    echo "Network connectivity OK"
else
    echo "Network unreachable"
fi
echo

# Check Hadoop ecosystem components
echo "6. Hadoop ecosystem components:"
for cmd in hadoop hive hbase; do
    if command -v $cmd &> /dev/null; then
        echo "✓ $cmd is installed"
        # hive uses --version; hadoop and hbase accept a bare "version"
        if [ "$cmd" = "hive" ]; then
            $cmd --version 2>/dev/null | head -1
        else
            $cmd version 2>/dev/null | head -1
        fi
    else
        echo "✗ $cmd is not installed"
    fi
done

echo "=== Environment check complete ==="

2.2 Installing Dependency Components

2.2.1 Java Environment

Install OpenJDK 8:

# CentOS/RHEL
sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

# Configure JAVA_HOME (path shown is for CentOS/RHEL;
# on Ubuntu the path is /usr/lib/jvm/java-8-openjdk-amd64)
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# Verify the installation
java -version
javac -version

2.2.2 Hadoop Cluster Deployment

Download and install Hadoop:

# Download Hadoop 2.10.1
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
tar -xzf hadoop-2.10.1.tar.gz
sudo mv hadoop-2.10.1 /opt/hadoop
sudo chown -R $USER:$USER /opt/hadoop

# Configure environment variables
echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> ~/.bashrc
echo 'export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH' >> ~/.bashrc
source ~/.bashrc

Hadoop configuration files:

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/data/datanode</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
    </property>
</configuration>
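
Before starting Hadoop, three prerequisites are easy to miss in a pseudo-distributed setup. First, the configuration above omits mapred-site.xml, so MapReduce jobs (which Hive and Kylin's MR engine submit) would fall back to the LocalJobRunner instead of running on YARN. Second, start-dfs.sh and start-yarn.sh connect to each node over SSH, even on a single machine. Third, the Hadoop daemons read JAVA_HOME from hadoop-env.sh rather than from the login shell. A minimal sketch covering all three (paths match the configuration above; adjust JAVA_HOME for your distribution):

# declare YARN as the MapReduce runtime
cat > $HADOOP_CONF_DIR/mapred-site.xml <<'EOF'
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
EOF

# passwordless SSH to localhost, required by start-dfs.sh/start-yarn.sh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# make JAVA_HOME explicit for the Hadoop daemons
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' >> $HADOOP_CONF_DIR/hadoop-env.sh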

Start Hadoop:

# Format HDFS (first start only; reformatting wipes HDFS metadata)
hdfs namenode -format

# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# Verify the daemons
jps
hadoop fs -ls /

2.2.3 Hive Installation

Download and install Hive:

# Download Hive 2.3.9
wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -xzf apache-hive-2.3.9-bin.tar.gz
sudo mv apache-hive-2.3.9-bin /opt/hive
sudo chown -R $USER:$USER /opt/hive

# Configure environment variables
echo 'export HIVE_HOME=/opt/hive' >> ~/.bashrc
echo 'export PATH=$HIVE_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Hive configuration:

hive-site.xml

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <!-- Connector/J 8.x class; with Connector/J 5.x use com.mysql.jdbc.Driver -->
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive123</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
</configuration>
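
The hive-site.xml above assumes the MySQL account hive/hive123 already exists. A minimal sketch to create the metastore database and user, assuming a local MySQL server is installed and you have root access (names and password must match hive-site.xml):

# create the metastore database and grant access to the hive user
# (CREATE USER IF NOT EXISTS needs MySQL 5.7+ / MariaDB 10.1+)
mysql -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS hive_metastore;
CREATE USER IF NOT EXISTS 'hive'@'localhost' IDENTIFIED BY 'hive123';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
SQL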

Initialize Hive:

# Download the MySQL JDBC driver
# (the artifact was renamed from mysql-connector-java to mysql-connector-j as of 8.0.31)
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
tar -xzf mysql-connector-j-8.0.33.tar.gz
cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar $HIVE_HOME/lib/

# Initialize the metastore schema
schematool -dbType mysql -initSchema

# Start the Hive Metastore
nohup hive --service metastore > /tmp/hive-metastore.log 2>&1 &

# Test Hive
hive -e "SHOW DATABASES;"

2.2.4 HBase Installation

Download and install HBase:

# Download HBase 2.4.17
wget https://archive.apache.org/dist/hbase/2.4.17/hbase-2.4.17-bin.tar.gz
tar -xzf hbase-2.4.17-bin.tar.gz
sudo mv hbase-2.4.17 /opt/hbase
sudo chown -R $USER:$USER /opt/hbase

# Configure environment variables
echo 'export HBASE_HOME=/opt/hbase' >> ~/.bashrc
echo 'export PATH=$HBASE_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

HBase configuration:

hbase-site.xml

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/opt/hbase/zookeeper</value>
    </property>
</configuration>
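
As with Hadoop, HBase reads its Java location from its own environment file rather than the login shell, and in this single-node layout it is simplest to let start-hbase.sh manage the bundled ZooKeeper. A minimal sketch (adjust JAVA_HOME for your distribution):

# point HBase at the JDK
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' >> $HBASE_HOME/conf/hbase-env.sh
# let start-hbase.sh start and stop its own ZooKeeper instance
echo 'export HBASE_MANAGES_ZK=true' >> $HBASE_HOME/conf/hbase-env.sh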

Start HBase:

# Start HBase
start-hbase.sh

# Verify HBase
hbase shell
# In the HBase shell, run:
# list
# exit

2.3 Installing Kylin

2.3.1 Downloading Kylin

# Download Apache Kylin 4.0.3
wget https://archive.apache.org/dist/kylin/apache-kylin-4.0.3/apache-kylin-4.0.3-bin.tar.gz
tar -xzf apache-kylin-4.0.3-bin.tar.gz
sudo mv apache-kylin-4.0.3-bin /opt/kylin
sudo chown -R $USER:$USER /opt/kylin

# Configure environment variables
echo 'export KYLIN_HOME=/opt/kylin' >> ~/.bashrc
echo 'export PATH=$KYLIN_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
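
As with any Apache artifact, it is worth verifying the download against the checksum file published next to it. The exact extension (.sha256 or .sha512) varies by release; comparing the digests by eye avoids format differences between checksum tools:

# fetch the published checksum and compare it with the local file's digest
wget https://archive.apache.org/dist/kylin/apache-kylin-4.0.3/apache-kylin-4.0.3-bin.tar.gz.sha256
sha256sum apache-kylin-4.0.3-bin.tar.gz
cat apache-kylin-4.0.3-bin.tar.gz.sha256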

2.3.2 Kylin Configuration

Main configuration file kylin.properties. Note that the HBase-backed metadata and storage settings below follow the Kylin 3.x architecture; Kylin 4.x keeps its metadata in an RDBMS and stores cube data as Parquet on HDFS, so adjust these settings accordingly if you deployed a 4.x binary.

# Kylin server configuration
kylin.server.mode=all
kylin.server.cluster-servers=localhost:7070

# Metadata storage (HBase-backed, Kylin 3.x style)
kylin.metadata.url=kylin_metadata@hbase

# Job engine configuration (in Kylin 3.x, 2 = MapReduce, 4 = Spark)
kylin.engine.default=2
kylin.engine.spark-conf.spark.master=local[*]
kylin.engine.spark-conf.spark.submit.deployMode=client
kylin.engine.spark-conf.spark.driver.memory=4G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.cores=4
kylin.engine.spark-conf.spark.executor.instances=1

# Storage engine configuration (2 = HBase in Kylin 3.x)
kylin.storage.default=2
kylin.storage.hbase.cluster-fs=hdfs://localhost:9000

# Web service configuration
kylin.web.timezone=GMT+8
kylin.web.cross-domain-enabled=true

# Security configuration
kylin.security.profile=testing
kylin.security.ldap.connection-server=
kylin.security.ldap.connection-username=
kylin.security.ldap.connection-password=

# Query configuration
kylin.query.cache-enabled=true
kylin.query.large-query-threshold=1000000
kylin.query.timeout-seconds=300

# Build configuration
kylin.job.scheduler.default=0
kylin.job.max-concurrent-jobs=10
kylin.job.retry=3

Environment file setenv.sh

#!/bin/bash

# Java
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

# Hadoop
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# Hive
export HIVE_HOME=/opt/hive
export HIVE_CONF=$HIVE_HOME/conf

# HBase
export HBASE_HOME=/opt/hbase
export HBASE_CONF_DIR=$HBASE_HOME/conf

# Spark (only if using a standalone Spark)
# export SPARK_HOME=/opt/spark
# export SPARK_CONF_DIR=$SPARK_HOME/conf

# Kylin JVM settings
export KYLIN_JVM_SETTINGS="-Xms1G -Xmx8G -XX:+UseG1GC"
export KYLIN_DEBUG_SETTINGS="-Dkylin.hdfs.working.dir=/kylin"

# Extra startup options
# (-Dhdp.version is only needed on HDP distributions; the empty krb5
#  properties are placeholders for non-Kerberos clusters)
export KYLIN_EXTRA_START_OPTS="-Dhdp.version=2.6.5.0-292 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djava.security.auth.login.config="

2.3.3 Initializing Kylin

Check the environment:

# Check the Kylin environment
$KYLIN_HOME/bin/check-env.sh

# On success you should see output along these lines:
# Checking environment...
# KYLIN_HOME is set to /opt/kylin
# java version "1.8.0_XXX"
# HADOOP_HOME is set to /opt/hadoop
# HIVE_HOME is set to /opt/hive
# HBASE_HOME is set to /opt/hbase
# Environment check passed.

Create the HDFS working directory:

# Create the Kylin working directory
hadoop fs -mkdir -p /kylin
hadoop fs -chmod 777 /kylin

# Verify the directory
hadoop fs -ls /

Initialize the metadata:

# Deploy the HBase coprocessor for the Kylin metadata tables
# (HBase-backed Kylin 3.x only; Kylin 4.x needs no coprocessor deployment)
kylin.sh org.apache.kylin.storage.hbase.util.DeployCoprocessorCLI default

# Check the Kylin tables in HBase
echo "list" | hbase shell

2.4 Startup and Verification

2.4.1 Starting the Kylin Service

Startup script:

#!/bin/bash
# start_kylin.sh - Kylin startup script

echo "Starting Apache Kylin..."

# Check dependent services
echo "Checking dependent services..."
if ! jps | grep -q NameNode; then
    echo "ERROR: HDFS NameNode is not running"
    exit 1
fi

if ! jps | grep -q HMaster; then
    echo "ERROR: HBase Master is not running"
    exit 1
fi

# Start Kylin
echo "Starting the Kylin service..."
kylin.sh start

# Wait for the service to come up
echo "Waiting for the service to start..."
sleep 30

# Check service status via the PID file kylin.sh writes
# (the JVM name shown by jps varies across Kylin versions)
if [ -f "$KYLIN_HOME/pid" ] && kill -0 "$(cat "$KYLIN_HOME/pid")" 2>/dev/null; then
    echo "✓ Kylin started successfully"
    echo "Web UI: http://localhost:7070/kylin"
    echo "Default username/password: ADMIN/KYLIN"
else
    echo "✗ Kylin failed to start"
    echo "Check the log: $KYLIN_HOME/logs/kylin.log"
fi

Run the startup:

# Start Kylin
kylin.sh start

# Check the process
jps | grep -i kylin

# Tail the log
tail -f $KYLIN_HOME/logs/kylin.log

2.4.2 Web UI Access

Access details:
- URL: http://localhost:7070/kylin
- Default username: ADMIN
- Default password: KYLIN

First-login checklist:
1. The system information page renders correctly
2. Data source connectivity is normal
3. The job engine is healthy
4. The storage engine is healthy

2.4.3 Command-Line Verification

Basic command tests:

# Check the Kylin version
kylin.sh version

# Collect diagnostic information
kylin.sh org.apache.kylin.tool.DiagnosisInfoCLI

# List projects (QURNSU46S1lMSU4= is base64 for ADMIN:KYLIN)
curl -X GET \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  "http://localhost:7070/kylin/api/projects"

Health check script:

#!/bin/bash
# kylin_health_check.sh - Kylin health check

echo "=== Kylin Health Check ==="

# Check the process (via the PID file kylin.sh writes)
echo "1. Process check:"
if [ -f "$KYLIN_HOME/pid" ] && kill -0 "$(cat "$KYLIN_HOME/pid")" 2>/dev/null; then
    echo "✓ Kylin process is running"
else
    echo "✗ Kylin process is not running"
    exit 1
fi

# Check the port (ss replaces netstat on modern distributions)
echo "2. Port check:"
if ss -tln 2>/dev/null | grep -q :7070 || netstat -tln 2>/dev/null | grep -q :7070; then
    echo "✓ Port 7070 is listening"
else
    echo "✗ Port 7070 is not listening"
fi

# Check the web service (an unauthenticated request may redirect to the login page)
echo "3. Web service check:"
response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:7070/kylin)
if [ "$response" = "200" ] || [ "$response" = "302" ]; then
    echo "✓ Web service responding"
else
    echo "✗ Unexpected web service response: $response"
fi

# Check the API (capture only the status code, not the body,
# which could itself contain the string "200")
echo "4. API check:"
api_response=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  "http://localhost:7070/kylin/api/user/authentication")

if [ "$api_response" = "200" ]; then
    echo "✓ API service healthy"
else
    echo "✗ API service unhealthy (HTTP $api_response)"
fi

echo "=== Health check complete ==="

2.5 Cluster Deployment

2.5.1 Cluster Architecture

Recommended cluster layout:

┌─────────────────────────────────────────────────────────────┐
│                 Kylin Cluster Architecture                  │
├─────────────────────────────────────────────────────────────┤
│  Load Balancer (Nginx/HAProxy)                              │
├─────────────────────────────────────────────────────────────┤
│  Kylin Node 1    │  Kylin Node 2    │  Kylin Node 3         │
│  (Query + Job)   │  (Query + Job)   │  (Query Only)         │
├─────────────────────────────────────────────────────────────┤
│              Hadoop Cluster (HDFS + YARN)                   │
├─────────────────────────────────────────────────────────────┤
│                   HBase Cluster                             │
├─────────────────────────────────────────────────────────────┤
│                   Hive Metastore                            │
└─────────────────────────────────────────────────────────────┘

2.5.2 Node Configuration

Node 1 (kylin-node1):

# kylin.properties for node1
# "all" mode serves queries and runs the job engine
kylin.server.mode=all
kylin.server.cluster-servers=kylin-node1:7070,kylin-node2:7070,kylin-node3:7070
kylin.server.cluster-name=kylin-cluster

# With more than one job engine in the cluster, use the distributed
# scheduler (2) so the engines coordinate through ZooKeeper locks
kylin.job.scheduler.default=2
kylin.job.max-concurrent-jobs=5

Node 2 (kylin-node2):

# kylin.properties for node2
# identical to node1: query + job engine
kylin.server.mode=all
kylin.server.cluster-servers=kylin-node1:7070,kylin-node2:7070,kylin-node3:7070
kylin.server.cluster-name=kylin-cluster

kylin.job.scheduler.default=2
kylin.job.max-concurrent-jobs=5

Node 3 (kylin-node3):

# kylin.properties for node3
# "query" mode serves SQL only and runs no job engine
kylin.server.mode=query
kylin.server.cluster-servers=kylin-node1:7070,kylin-node2:7070,kylin-node3:7070
kylin.server.cluster-name=kylin-cluster
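
Apart from kylin.server.mode, the configuration must stay identical across nodes, so it is worth scripting the distribution instead of editing each node by hand. A minimal sketch using rsync and the hostnames above (assumes passwordless SSH between nodes and the same KYLIN_HOME everywhere):

# push the shared kylin.properties to every node
for host in kylin-node1 kylin-node2 kylin-node3; do
    rsync -av $KYLIN_HOME/conf/kylin.properties $host:$KYLIN_HOME/conf/
done

# then switch node3 back to query-only mode
ssh kylin-node3 \
    "sed -i 's/^kylin.server.mode=.*/kylin.server.mode=query/' $KYLIN_HOME/conf/kylin.properties"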

2.5.3 Load Balancer Configuration

Nginx configuration:

# /etc/nginx/conf.d/kylin.conf
upstream kylin_cluster {
    # ip_hash pins each client to one node, since Kylin does not
    # share web sessions across nodes by default
    ip_hash;
    server kylin-node1:7070 weight=3 max_fails=3 fail_timeout=10s;
    server kylin-node2:7070 weight=3 max_fails=3 fail_timeout=10s;
    server kylin-node3:7070 weight=4 max_fails=3 fail_timeout=10s;

    # Active health checks ("check interval=...") require the third-party
    # nginx_upstream_check_module (Tengine); stock nginx relies on the
    # passive max_fails/fail_timeout checks above.
}

server {
    listen 80;
    server_name kylin.example.com;
    
    location / {
        proxy_pass http://kylin_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Timeouts: long-running queries need generous send/read timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
    
    # Health-check endpoint for the load balancer itself
    location /health {
        access_log off;
        return 200 "healthy\n";
        add_header Content-Type text/plain;
    }
}
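
After writing the config, validate and apply it, then check end-to-end through the balancer:

# validate the configuration before applying it
sudo nginx -t

# reload without dropping in-flight connections
sudo nginx -s reload

# the balancer should answer on port 80 and proxy to a Kylin node
curl -I http://kylin.example.com/kylin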

2.6 Performance Tuning

2.6.1 JVM Tuning

Kylin JVM parameters (JDK 8 flag syntax; on JDK 9+ the GC-print options below were replaced by the unified -Xlog:gc logging flags):

# JVM settings in setenv.sh
export KYLIN_JVM_SETTINGS="
-Xms8G
-Xmx16G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=16m
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:$KYLIN_HOME/logs/kylin-gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=64M
"

2.6.2 System Tuning

Linux kernel parameters:

# /etc/sysctl.conf
# Network tuning
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.netdev_max_backlog = 5000

# File descriptors
fs.file-max = 2097152

# Virtual memory
vm.swappiness = 1
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

# Apply the settings (run as root after editing the file)
sysctl -p

User limits:

# /etc/security/limits.conf (assumes Kylin runs as user "kylin")
kylin soft nofile 65536
kylin hard nofile 65536
kylin soft nproc 32768
kylin hard nproc 32768
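
Limits from limits.conf apply only to new login sessions; after logging in again as the kylin user, verify that they took effect:

# expect 65536 open files and 32768 processes
ulimit -n
ulimit -u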

2.7 Troubleshooting

2.7.1 Common Issues

Issue 1: Kylin fails to start

# Check the log
tail -f $KYLIN_HOME/logs/kylin.log

# Common causes:
# 1. Incompatible Java version
# 2. Hadoop/HBase services not running
# 3. Port already in use
# 4. Permission problems

# Diagnosis:
# Check the Java version
java -version

# Check service status
jps

# Check the port
netstat -tlnp | grep 7070

# Check permissions
ls -la $KYLIN_HOME

Issue 2: Web UI unreachable

# Check the firewall (CentOS/RHEL with firewalld;
# on Ubuntu use "sudo ufw allow 7070/tcp" instead)
sudo firewall-cmd --list-ports
sudo firewall-cmd --add-port=7070/tcp --permanent
sudo firewall-cmd --reload

# Check network connectivity
telnet localhost 7070

# Check the Kylin port configuration
grep -i "server.port" $KYLIN_HOME/conf/kylin.properties

Issue 3: Build jobs fail

# Check YARN resources
yarn application -list
yarn logs -applicationId application_xxx

# Check HDFS space
hadoop fs -df -h

# Check HBase status
echo "status" | hbase shell

2.7.2 Log Analysis

Log file locations:

# Main Kylin log
$KYLIN_HOME/logs/kylin.log

# Security log
$KYLIN_HOME/logs/security.log

# Query log
$KYLIN_HOME/logs/kylin-query.log

# GC log
$KYLIN_HOME/logs/kylin-gc.log

Log analysis script:

#!/bin/bash
# analyze_kylin_logs.sh

LOG_DIR="$KYLIN_HOME/logs"
DATE=$(date +%Y-%m-%d)

echo "=== Kylin Log Analysis ($DATE) ==="

# Error statistics
echo "1. Error statistics:"
echo "   ERROR count: $(grep -ci 'error' $LOG_DIR/kylin.log)"
echo "   WARN count: $(grep -ci 'warn' $LOG_DIR/kylin.log)"

# Recent errors
echo "2. Recent errors:"
grep -i "error" $LOG_DIR/kylin.log | tail -5

# Query statistics
# (the "duration:<ms>" pattern is illustrative; the query-log format
#  differs across Kylin versions, so adjust the regex to your logs)
echo "3. Query statistics:"
if [ -f "$LOG_DIR/kylin-query.log" ]; then
    echo "   Queries today: $(grep -c "$DATE" $LOG_DIR/kylin-query.log)"
    echo "   Avg response time: $(grep "$DATE" $LOG_DIR/kylin-query.log | grep -o 'duration:[0-9]*' | cut -d: -f2 | awk '{sum+=$1; count++} END {if(count>0) print sum/count "ms"}')"
fi

# GC statistics (JDK 8 GC logs print pause times as "<n> secs", with a space)
echo "4. GC statistics:"
if [ -f "$LOG_DIR/kylin-gc.log" ]; then
    echo "   GC events: $(grep -c "GC" $LOG_DIR/kylin-gc.log)"
    echo "   Avg GC pause: $(grep "GC" $LOG_DIR/kylin-gc.log | grep -o '[0-9]*\.[0-9]* secs' | awk '{sum+=$1; count++} END {if(count>0) print sum/count "s"}')"
fi

echo "=== Analysis complete ==="

2.8 Chapter Summary

This chapter covered the installation and deployment of Apache Kylin.

Core content:
1. Environment preparation: system requirements, dependency checks, environment configuration
2. Component installation: installing and configuring Hadoop, Hive, and HBase
3. Kylin deployment: single-node and cluster deployment
4. Performance tuning: JVM tuning and kernel parameter optimization
5. Troubleshooting: resolving common failures and analyzing logs

Key points:
1. Make sure all dependency component versions are compatible
2. Configure environment variables and paths correctly
3. Allocate system resources sensibly
4. Put solid monitoring and logging in place

Coming up: the next chapter introduces Kylin's basic concepts and terminology, including data models, dimensions, and measures.

2.9 Exercises

Hands-on exercises

  1. Build a complete Kylin environment in a virtual machine
  2. Write an automated deployment script
  3. Set up a Kylin cluster and test its load balancing

Discussion questions

  1. How would you choose hardware to match a given workload?
  2. How can Kylin be made highly available in production?
  3. How would you monitor and optimize Kylin's performance?