12.1 Monitoring Overview
Why Jenkins Monitoring Matters
Core goals of Jenkins monitoring:
1. System health monitoring
- Service availability
- Performance metrics
- Resource usage
- Error rates
2. Build quality monitoring
- Build success rate
- Build duration trends
- Test coverage
- Code quality metrics
3. User experience monitoring
- Response times
- Queue wait times
- User activity
- UI performance
4. Security monitoring
- Failed logins
- Permission changes
- Anomalous operations
- Security vulnerabilities
Monitoring architecture:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Data Source │───▶│ Collector │───▶│ Storage │
│ (Jenkins) │ │ (Prometheus) │ │ (InfluxDB) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Alerting │◀───│ Visualization │◀───│ Processing │
│ (AlertManager)│ │ (Grafana) │ │ (Prometheus) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Data flow:
1. Jenkins exposes metrics
2. Prometheus scrapes the metrics
3. Metrics are stored and processed
4. Grafana visualizes the data
5. AlertManager sends alerts
Monitoring Metrics Hierarchy
System-level metrics:
Infrastructure metrics:
1. Hardware resources
- CPU utilization
- Memory utilization
- Disk utilization
- Network I/O
- Disk I/O
2. System performance
- Load average
- Process count
- File descriptor usage
- Network connection count
3. JVM metrics
- Heap memory usage
- Non-heap memory usage
- GC frequency and duration
- Thread count
- Loaded class count
Application-level metrics:
Jenkins application metrics:
1. Build metrics
- Total build count
- Build success rate
- Build failure rate
- Average build duration
- Build queue length
2. Node metrics
- Online node count
- Total executor count
- Busy executor count
- Node response time
3. User metrics
- Active user count
- Login count
- Operation frequency
- Session duration
4. Plugin metrics
- Plugin count
- Plugin updates
- Plugin errors
- Plugin performance
Business-level metrics:
1. Delivery metrics (the DORA metrics)
- Deployment frequency
- Lead time for changes
- Change failure rate
- Time to restore service
2. Quality metrics
- Test pass rate
- Code coverage
- Defect density
- Technical debt
3. Efficiency metrics
- Development velocity
- Feedback time
- Automation rate
- Rework rate
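The four delivery metrics above are commonly derived from deployment records. The following Python sketch shows one way to compute three of them; the record layout (a `failed` flag and a commit-to-deploy `lead_time` per deployment) is a hypothetical illustration, not part of any Jenkins tooling.

```python
from datetime import timedelta

def dora_metrics(deployments, window_days=30):
    """Compute simple DORA-style delivery metrics.

    Each record is a dict with 'failed' (bool) and
    'lead_time' (timedelta from commit to deploy).
    """
    n = len(deployments)
    if n == 0:
        return {"deploys_per_day": 0.0, "change_failure_rate": 0.0,
                "median_lead_time_hours": 0.0}
    failures = sum(1 for d in deployments if d["failed"])
    lead_hours = sorted(d["lead_time"].total_seconds() / 3600 for d in deployments)
    mid = n // 2
    # Median of the sorted lead times
    median = lead_hours[mid] if n % 2 else (lead_hours[mid - 1] + lead_hours[mid]) / 2
    return {
        "deploys_per_day": n / window_days,
        "change_failure_rate": failures / n * 100,
        "median_lead_time_hours": median,
    }
```

In practice these records would come from deployment job results, e.g. collected by the custom-metrics approach shown later in this chapter.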
12.2 Prometheus Monitoring Integration
Prometheus Plugin Configuration
Plugin installation and configuration:
Installation steps:
1. Install the "Prometheus metrics" plugin
2. Restart Jenkins
3. Open http://jenkins-server:8080/prometheus
4. Verify the metrics output
Configuration options:
- Manage Jenkins -> Configure System -> Prometheus
- Enable metrics collection
- Metrics path: /prometheus
- Collection interval: 30 seconds
- Enable additional metrics: JVM, system, builds
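The endpoint responds in the Prometheus text exposition format. As a quick sanity check, a scraped payload can be parsed offline; this minimal Python sketch handles only the simple `name{labels} value` form (not histograms' full grammar), and the sample metric names mirror those used by the plugin.

```python
import re

def parse_prom_metrics(text):
    """Parse Prometheus text-exposition lines into {series: value}.

    Comment lines (# HELP / # TYPE) are skipped; labels stay in the key.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Matches: metric_name{optional="labels"} value [timestamp]
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{[^}]*\})?)\s+(-?[0-9.eE+]+)', line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics
```

Running this against the `/prometheus` output is an easy way to confirm the plugin is exporting the series your dashboards and rules expect.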
Prometheus configuration:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "jenkins_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'jenkins'
    static_configs:
      - targets: ['jenkins:8080']
    metrics_path: '/prometheus'
    scrape_interval: 30s
    scrape_timeout: 10s
  - job_name: 'jenkins-nodes'
    static_configs:
      - targets:
          - 'jenkins-agent-1:8080'
          - 'jenkins-agent-2:8080'
    metrics_path: '/prometheus'
    scrape_interval: 30s
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'jenkins:9100'
          - 'jenkins-agent-1:9100'
          - 'jenkins-agent-2:9100'
Jenkins metric rules:
# jenkins_rules.yml
groups:
  - name: jenkins.rules
    rules:
      # Build success rate
      - record: jenkins:build_success_rate
        expr: |
          (
            sum(rate(jenkins_builds_success_build_count[5m])) /
            sum(rate(jenkins_builds_build_count[5m]))
          ) * 100
      # Average build duration
      - record: jenkins:build_duration_avg
        expr: |
          sum(rate(jenkins_builds_duration_milliseconds_summary_sum[5m])) /
          sum(rate(jenkins_builds_duration_milliseconds_summary_count[5m]))
      # Estimated queue waiting time
      - record: jenkins:queue_waiting_time
        expr: |
          sum(jenkins_queue_size_value) *
          avg(jenkins_builds_duration_milliseconds_summary{quantile="0.5"})
      # Node availability
      - record: jenkins:node_availability
        expr: |
          (
            sum(jenkins_node_online_value) /
            sum(jenkins_node_count_value)
          ) * 100
      # Executor utilization
      - record: jenkins:executor_utilization
        expr: |
          (
            sum(jenkins_executor_in_use_value) /
            sum(jenkins_executor_count_value)
          ) * 100
  - name: jenkins.alerts
    rules:
      # Build failure rate too high
      - alert: HighBuildFailureRate
        expr: jenkins:build_success_rate < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins build failure rate is too high"
          description: "Build success rate is {{ $value }}%, below the 80% threshold"
      # Builds taking too long
      - alert: LongBuildDuration
        expr: jenkins:build_duration_avg > 1800000  # 30 minutes, in milliseconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins builds are taking too long"
          description: "Average build duration is {{ $value }} ms, above the 30-minute threshold"
      # Severe queue backlog
      - alert: HighQueueBacklog
        expr: jenkins_queue_size_value > 20
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Jenkins build queue is severely backlogged"
          description: "{{ $value }} tasks are currently waiting in the queue"
      # Node offline
      - alert: NodeOffline
        expr: jenkins_node_online_value == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins node offline"
          description: "Node {{ $labels.node_name }} is offline"
      # Low disk space
      - alert: LowDiskSpace
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"}
          ) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk usage on {{ $labels.instance }} is above 90%"
      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          ) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} is {{ $value }}%"
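The `jenkins:build_success_rate` recording rule is simply a ratio of counter rates. For intuition, the same computation can be mirrored offline in Python; the inputs here are per-series counter increases over a common window, and the helper name is illustrative.

```python
def build_success_rate(success_increases, total_increases):
    """Mirror of the jenkins:build_success_rate recording rule:
    sum(rate(success)) / sum(rate(total)) * 100.

    Because both rates share the same window, the window length
    cancels out and per-series increases are sufficient.
    """
    total = sum(total_increases)
    if total == 0:
        return None  # PromQL would yield NaN for 0/0
    return sum(success_increases) / total * 100.0
```

Note that the rule aggregates across all jobs before dividing; summing first and dividing once avoids averaging ratios of very differently sized jobs.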
Custom Metric Collection
Collecting metrics in a Pipeline:
// Metric collection implemented in a shared library
def collectBuildMetrics(Closure buildSteps) {
    def startTime = System.currentTimeMillis()
    try {
        // Run the build steps
        buildSteps()
        // Record success metrics
        def duration = System.currentTimeMillis() - startTime
        recordMetric('build_success', 1, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown'
        ])
        recordMetric('build_duration', duration, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown'
        ])
    } catch (Exception e) {
        // Record failure metrics
        def duration = System.currentTimeMillis() - startTime
        recordMetric('build_failure', 1, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown',
            error_type: e.class.simpleName
        ])
        recordMetric('build_duration', duration, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown',
            status: 'failed'
        ])
        throw e
    }
}
// Record a custom metric
def recordMetric(String name, Number value, Map labels = [:]) {
    def labelsStr = labels.collect { k, v -> "${k}=\"${v}\"" }.join(',')
    def sample = "${name}{${labelsStr}} ${value}"
    // Append to a local metrics file (writeFile has no append option,
    // so read the existing content first)
    def existing = fileExists('metrics.txt') ? readFile('metrics.txt') : ''
    writeFile file: 'metrics.txt', text: existing + sample + '\n'
    // Push to the Prometheus Pushgateway; note that the Pushgateway
    // rejects samples that carry explicit timestamps
    if (env.PROMETHEUS_PUSHGATEWAY_URL) {
        sh """
            echo '${sample}' | curl -sf -X POST \
                --data-binary @- \
                ${env.PROMETHEUS_PUSHGATEWAY_URL}/metrics/job/jenkins/instance/${env.NODE_NAME}
        """
    }
}
// Collect test metrics (findFiles requires the Pipeline Utility Steps plugin)
def collectTestMetrics() {
    def totalTests = 0, failures = 0, errors = 0, skipped = 0
    // readFile does not expand globs, so enumerate the reports first
    def reports = findFiles(glob: 'target/surefire-reports/TEST-*.xml')
    reports.each { report ->
        def testsuite = new XmlSlurper().parseText(readFile(report.path))
        totalTests += testsuite.@tests.toInteger()
        failures += testsuite.@failures.toInteger()
        errors += testsuite.@errors.toInteger()
        skipped += testsuite.@skipped.toInteger()
    }
    def passed = totalTests - failures - errors - skipped
    // Record test metrics
    recordMetric('test_total', totalTests, [project: env.JOB_NAME])
    recordMetric('test_passed', passed, [project: env.JOB_NAME])
    recordMetric('test_failed', failures + errors, [project: env.JOB_NAME])
    recordMetric('test_skipped', skipped, [project: env.JOB_NAME])
    // Test pass rate
    def passRate = totalTests > 0 ? (passed / totalTests) * 100 : 0
    recordMetric('test_pass_rate', passRate, [project: env.JOB_NAME])
}
// Collect code quality metrics
def collectQualityMetrics() {
    // Read the SonarQube analysis report
    if (fileExists('target/sonar/report-task.txt')) {
        def reportTask = readFile('target/sonar/report-task.txt')
        def serverUrl = reportTask.find(/serverUrl=(.+)/) { match, url -> url }
        def taskId = reportTask.find(/ceTaskId=(.+)/) { match, id -> id }
        // Wait for the background analysis task to finish
        sleep(time: 30, unit: 'SECONDS')
        // The quality-gate API takes an analysisId, not the ceTaskId,
        // so resolve it through the ce/task endpoint first
        def analysisId = sh(
            script: "curl -s '${serverUrl}/api/ce/task?id=${taskId}' | jq -r '.task.analysisId'",
            returnStdout: true
        ).trim()
        // Quality gate status
        def qualityGateStatus = sh(
            script: """
                curl -s "${serverUrl}/api/qualitygates/project_status?analysisId=${analysisId}" \
                    | jq -r '.projectStatus.status'
            """,
            returnStdout: true
        ).trim()
        recordMetric('quality_gate_status', qualityGateStatus == 'OK' ? 1 : 0, [
            project: env.JOB_NAME
        ])
        // Individual measures
        def metrics = sh(
            script: """
                curl -s "${serverUrl}/api/measures/component?component=${env.JOB_NAME}&metricKeys=coverage,bugs,vulnerabilities,code_smells" \
                    | jq -r '.component.measures[] | "\\(.metric)=\\(.value)"'
            """,
            returnStdout: true
        ).trim().split('\n')
        metrics.each { metric ->
            def (name, value) = metric.split('=')
            recordMetric("sonar_${name}", value.toDouble(), [project: env.JOB_NAME])
        }
    }
}
// Usage example
pipeline {
    agent any
    stages {
        stage('Build with Metrics') {
            steps {
                script {
                    collectBuildMetrics {
                        sh 'mvn clean compile'
                    }
                }
            }
        }
        stage('Test with Metrics') {
            steps {
                script {
                    collectBuildMetrics {
                        sh 'mvn test'
                    }
                    collectTestMetrics()
                }
            }
        }
        stage('Quality Analysis') {
            steps {
                script {
                    sh 'mvn sonar:sonar'
                    collectQualityMetrics()
                }
            }
        }
    }
    post {
        always {
            // Archive the metrics file
            archiveArtifacts artifacts: 'metrics.txt', allowEmptyArchive: true
        }
    }
}
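`recordMetric` above pushes samples in the Pushgateway's text format, where the grouping labels (`job`, `instance`) live in the URL path rather than in the sample body. A small Python sketch of that wire format (the helper names are illustrative, not a Pushgateway client API):

```python
def format_push_body(name, value, labels):
    """Build the text-exposition body that recordMetric() pushes.

    Pushed samples must not carry timestamps; the Pushgateway
    rejects them with HTTP 400.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}\n" if labels else f"{name} {value}\n"

def pushgateway_url(base, job, instance):
    """Grouping labels are path segments on the push URL."""
    return f"{base}/metrics/job/{job}/instance/{instance}"
```

A POST of this body to the grouping URL replaces all previously pushed metrics for that job/instance group, which is why each metric group should be pushed as a complete set.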
12.3 Grafana Dashboards
Dashboard Design
Jenkins overview dashboard:
{
  "dashboard": {
    "id": null,
    "title": "Jenkins Overview",
    "tags": ["jenkins", "ci-cd"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Build Success Rate",
        "type": "stat",
        "targets": [
          {"expr": "jenkins:build_success_rate", "refId": "A"}
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 95}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Average Build Duration",
        "type": "stat",
        "targets": [
          {"expr": "jenkins:build_duration_avg / 1000", "refId": "A"}
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 300},
                {"color": "red", "value": 600}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "Queue Size",
        "type": "stat",
        "targets": [
          {"expr": "jenkins_queue_size_value", "refId": "A"}
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 10},
                {"color": "red", "value": 20}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
      },
      {
        "id": 4,
        "title": "Online Nodes",
        "type": "stat",
        "targets": [
          {"expr": "sum(jenkins_node_online_value)", "refId": "A"}
        ],
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
      },
      {
        "id": 5,
        "title": "Build Rate",
        "type": "graph",
        "targets": [
          {"expr": "rate(jenkins_builds_build_count[5m]) * 60", "legendFormat": "Builds per minute", "refId": "A"}
        ],
        "yAxes": [
          {"label": "Builds/min", "min": 0}
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 6,
        "title": "Executor Utilization",
        "type": "graph",
        "targets": [
          {"expr": "jenkins:executor_utilization", "legendFormat": "Utilization %", "refId": "A"}
        ],
        "yAxes": [
          {"label": "Percent", "min": 0, "max": 100}
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}
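A dashboard JSON like the one above can also be imported programmatically through Grafana's `POST /api/dashboards/db` endpoint, which expects the dashboard wrapped in a small envelope. A minimal Python sketch that builds that request body (the `folder_id` default and `overwrite` flag here are illustrative choices):

```python
import json

def grafana_import_payload(dashboard, overwrite=True, folder_id=0):
    """Wrap a dashboard JSON for POST /api/dashboards/db.

    "id" must be null when creating a new dashboard; an existing
    numeric id would make Grafana try to update that dashboard.
    """
    dash = dict(dashboard)   # shallow copy so the caller's dict is untouched
    dash["id"] = None
    return json.dumps({"dashboard": dash, "folderId": folder_id, "overwrite": overwrite})
```

The payload is then POSTed with any HTTP client, authenticating with a Grafana API token in the `Authorization` header.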
Build details dashboard:
{
  "dashboard": {
    "title": "Jenkins Build Details",
    "panels": [
      {
        "id": 1,
        "title": "Build Duration by Project",
        "type": "graph",
        "targets": [
          {
            "expr": "avg by (job) (jenkins_builds_duration_milliseconds_summary{quantile=\"0.5\"}) / 1000",
            "legendFormat": "{{ job }} (median)",
            "refId": "A"
          },
          {
            "expr": "avg by (job) (jenkins_builds_duration_milliseconds_summary{quantile=\"0.95\"}) / 1000",
            "legendFormat": "{{ job }} (95th percentile)",
            "refId": "B"
          }
        ],
        "yAxes": [
          {"label": "Duration (seconds)", "min": 0}
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Build Status Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (result) (increase(jenkins_builds_build_count[1h]))",
            "legendFormat": "{{ result }}",
            "refId": "A"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 3,
        "title": "Failed Builds by Project",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (job) (increase(jenkins_builds_failed_build_count[24h])))",
            "format": "table",
            "refId": "A"
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {"Time": true},
              "renameByName": {
                "job": "Project",
                "Value": "Failed Builds (24h)"
              }
            }
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ]
  }
}
Node monitoring dashboard:
{
  "dashboard": {
    "title": "Jenkins Nodes Monitoring",
    "panels": [
      {
        "id": 1,
        "title": "Node Status",
        "type": "table",
        "targets": [
          {
            "expr": "jenkins_node_online_value",
            "format": "table",
            "instant": true,
            "refId": "A"
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {"Time": true, "__name__": true},
              "renameByName": {
                "node_name": "Node",
                "Value": "Online"
              }
            }
          },
          {
            "id": "fieldLookup",
            "options": {
              "lookupField": "Online",
              "mappings": {
                "0": "Offline",
                "1": "Online"
              }
            }
          }
        ],
        "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "CPU Usage by Node",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}",
            "refId": "A"
          }
        ],
        "yAxes": [
          {"label": "CPU Usage %", "min": 0, "max": 100}
        ],
        "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0}
      },
      {
        "id": 3,
        "title": "Memory Usage by Node",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}",
            "refId": "A"
          }
        ],
        "yAxes": [
          {"label": "Memory Usage %", "min": 0, "max": 100}
        ],
        "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0}
      },
      {
        "id": 4,
        "title": "Executor Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "jenkins_executor_in_use_value",
            "legendFormat": "{{ node_name }} (in use)",
            "refId": "A"
          },
          {
            "expr": "jenkins_executor_count_value",
            "legendFormat": "{{ node_name }} (total)",
            "refId": "B"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ]
  }
}
Alert Configuration
AlertManager configuration:
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'jenkins-alerts@company.com'
  smtp_auth_username: 'jenkins-alerts@company.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'devops@company.com'
        subject: 'Jenkins Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'devops@company.com,management@company.com'
        subject: '🚨 CRITICAL Jenkins Alert: {{ .GroupLabels.alertname }}'
        body: |
          CRITICAL ALERT!
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Time: {{ .StartsAt }}
          {{ end }}
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '🚨 Critical Jenkins Alert'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
  - name: 'warning-alerts'
    email_configs:
      - to: 'devops@company.com'
        subject: '⚠️ Jenkins Warning: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#jenkins'
        title: '⚠️ Jenkins Warning'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
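The `route` tree above resolves a firing alert to the first child route whose match labels equal the alert's labels, falling back to the default receiver. The selection logic can be mirrored in a few lines of Python — a simplified model, since real Alertmanager also handles nested routes, `continue`, and regex matchers:

```python
def route_receiver(labels):
    """First-match routing, mirroring the alertmanager.yml route tree:
    severity=critical -> critical-alerts, severity=warning -> warning-alerts,
    anything else -> default."""
    routes = [
        ({"severity": "critical"}, "critical-alerts"),
        ({"severity": "warning"}, "warning-alerts"),
    ]
    for match, receiver in routes:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return "default"
```

Walking sample label sets through a model like this is a cheap way to review a routing tree before deploying it (Alertmanager's own `amtool config routes test` does the authoritative version of this check).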
12.4 Log Management
Log Collection Configuration
Jenkins log configuration:
# $JENKINS_HOME/log.properties
# Root log level
.level = INFO
# Handlers
handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler
# Console log format
java.util.logging.ConsoleHandler.level = INFO
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
# File log configuration
java.util.logging.FileHandler.level = ALL
java.util.logging.FileHandler.pattern = /var/log/jenkins/jenkins.%g.log
java.util.logging.FileHandler.limit = 50000000
java.util.logging.FileHandler.count = 10
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
# Per-component log levels
jenkins.level = INFO
hudson.level = INFO
org.springframework.level = WARNING
org.apache.level = WARNING
# Security-related logs
hudson.security.level = INFO
jenkins.security.level = INFO
# Build-related logs
hudson.model.Run.level = INFO
hudson.model.AbstractBuild.level = INFO
# Plugin logs
hudson.PluginManager.level = INFO
jenkins.InitReactorRunner.level = INFO
# SCM logs
hudson.scm.level = INFO
hudson.plugins.git.level = INFO
# Network logs
hudson.remoting.level = WARNING
org.eclipse.jetty.level = WARNING
Logback configuration (recommended):
<!-- $JENKINS_HOME/logback.xml -->
<configuration>
  <!-- Console output -->
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- File output -->
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/jenkins/jenkins.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
      <fileNamePattern>/var/log/jenkins/jenkins.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
      <maxFileSize>100MB</maxFileSize>
      <maxHistory>30</maxHistory>
      <totalSizeCap>10GB</totalSizeCap>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- JSON output (for log shipping) -->
  <appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/jenkins/jenkins.json</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
      <fileNamePattern>/var/log/jenkins/jenkins.%d{yyyy-MM-dd}.%i.json.gz</fileNamePattern>
      <maxFileSize>100MB</maxFileSize>
      <maxHistory>30</maxHistory>
      <totalSizeCap>10GB</totalSizeCap>
    </rollingPolicy>
    <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
      <providers>
        <timestamp/>
        <logLevel/>
        <loggerName/>
        <message/>
        <mdc/>
        <arguments/>
        <stackTrace/>
      </providers>
    </encoder>
  </appender>
  <!-- Security log -->
  <appender name="SECURITY_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/jenkins/security.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
      <fileNamePattern>/var/log/jenkins/security.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
      <maxFileSize>50MB</maxFileSize>
      <maxHistory>90</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Build log -->
  <appender name="BUILD_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/jenkins/builds.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
      <fileNamePattern>/var/log/jenkins/builds.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
      <maxFileSize>200MB</maxFileSize>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Per-logger configuration -->
  <logger name="hudson.security" level="INFO" additivity="false">
    <appender-ref ref="SECURITY_FILE"/>
    <appender-ref ref="JSON_FILE"/>
  </logger>
  <logger name="jenkins.security" level="INFO" additivity="false">
    <appender-ref ref="SECURITY_FILE"/>
    <appender-ref ref="JSON_FILE"/>
  </logger>
  <logger name="hudson.model.Run" level="INFO" additivity="false">
    <appender-ref ref="BUILD_FILE"/>
    <appender-ref ref="JSON_FILE"/>
  </logger>
  <logger name="hudson.model.AbstractBuild" level="INFO" additivity="false">
    <appender-ref ref="BUILD_FILE"/>
    <appender-ref ref="JSON_FILE"/>
  </logger>
  <!-- Root logger -->
  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
    <appender-ref ref="FILE"/>
    <appender-ref ref="JSON_FILE"/>
  </root>
</configuration>
ELK Stack Integration
Filebeat configuration:
# filebeat.yml
filebeat.inputs:
  # Jenkins main log
  - type: log
    enabled: true
    paths:
      - /var/log/jenkins/jenkins.json
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      service: jenkins
      environment: production
      log_type: application
    fields_under_root: true
  # Jenkins security log
  - type: log
    enabled: true
    paths:
      - /var/log/jenkins/security.log
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
    fields:
      service: jenkins
      environment: production
      log_type: security
    fields_under_root: true
  # Jenkins build log
  - type: log
    enabled: true
    paths:
      - /var/log/jenkins/builds.log
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
    fields:
      service: jenkins
      environment: production
      log_type: build
    fields_under_root: true
  # System logs
  - type: log
    enabled: true
    paths:
      - /var/log/syslog
      - /var/log/auth.log
    fields:
      service: system
      environment: production
      log_type: system
    fields_under_root: true

processors:
  # Add host metadata
  - add_host_metadata:
      when.not.contains.tags: forwarded
  # Add Docker metadata (when running in a container)
  - add_docker_metadata: ~
  # Drop unneeded metadata fields
  - drop_fields:
      fields: ["agent", "ecs", "host.architecture"]
      ignore_missing: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "jenkins-logs-%{+yyyy.MM.dd}"
  template.name: "jenkins"
  template.pattern: "jenkins-*"
  template.settings:
    index.number_of_shards: 1
    index.number_of_replicas: 1
    index.refresh_interval: 30s

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
Logstash configuration:
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Jenkins logs
  if [service] == "jenkins" {
    # Parse the timestamp
    date {
      match => [ "timestamp", "yyyy-MM-dd HH:mm:ss.SSS" ]
    }
    # Normalize the log level
    if [level] {
      mutate {
        uppercase => [ "level" ]
      }
    }
    # Security logs
    if [log_type] == "security" {
      grok {
        match => {
          "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:thread}\] %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:log_message}"
        }
      }
      # Tag security events
      if [log_message] =~ /(?i)(login|authentication|authorization|failed|denied|unauthorized)/ {
        mutate {
          add_tag => [ "security_event" ]
        }
      }
      # Tag login failures
      if [log_message] =~ /(?i)(login.*failed|authentication.*failed|bad credentials)/ {
        mutate {
          add_tag => [ "login_failure" ]
        }
      }
    }
    # Build logs
    if [log_type] == "build" {
      grok {
        match => {
          "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:thread}\] %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:log_message}"
        }
      }
      # Tag build lifecycle events
      if [log_message] =~ /Started by/ {
        mutate {
          add_tag => [ "build_started" ]
        }
      }
      if [log_message] =~ /(Finished:|Build step)/ {
        mutate {
          add_tag => [ "build_finished" ]
        }
      }
      if [log_message] =~ /(FAILURE|ERROR|Exception)/ {
        mutate {
          add_tag => [ "build_error" ]
        }
      }
    }
  }
  # System logs
  if [service] == "system" {
    grok {
      match => {
        "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:log_message}"
      }
    }
    date {
      match => [ "timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
  # GeoIP enrichment (when a client IP is present)
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  # Clean up fields
  mutate {
    remove_field => [ "@version", "beat", "input", "prospector" ]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{service}-logs-%{+YYYY.MM.dd}"
  }
  # Console output (for debugging)
  stdout {
    codec => rubydebug
  }
}
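The grok pattern used for Jenkins logs maps directly to a regular expression. This Python sketch mirrors that pattern plus the two security tagging rules, which makes it easy to test the rules against sample lines before deploying the pipeline:

```python
import re

# Mirror of: %{TIMESTAMP_ISO8601} [%{DATA:thread}] %{LOGLEVEL} %{DATA:logger} - %{GREEDYDATA}
LINE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) '
    r'\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+) +(?P<logger>\S+) - (?P<message>.*)$')

# Same regexes as the logstash.conf tagging conditionals
SECURITY = re.compile(r'(?i)(login|authentication|authorization|failed|denied|unauthorized)')
LOGIN_FAIL = re.compile(r'(?i)(login.*failed|authentication.*failed|bad credentials)')

def parse_line(line):
    """Parse one Jenkins log line; return fields plus tags, or None."""
    m = LINE.match(line)
    if not m:
        return None
    event = m.groupdict()
    tags = []
    if SECURITY.search(event["message"]):
        tags.append("security_event")
    if LOGIN_FAIL.search(event["message"]):
        tags.append("login_failure")
    event["tags"] = tags
    return event
```

Because the regexes are copied verbatim from the filter, any false positives found here (e.g. the broad `failed` keyword matching non-security messages) point at rules worth tightening in Logstash itself.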
Elasticsearch index template:
{
  "index_patterns": ["jenkins-logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.refresh_interval": "30s",
    "index.max_result_window": 50000
  },
  "mappings": {
    "properties": {
      "@timestamp": {"type": "date"},
      "timestamp": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS"},
      "level": {"type": "keyword"},
      "logger": {"type": "keyword"},
      "message": {"type": "text", "analyzer": "standard"},
      "log_message": {"type": "text", "analyzer": "standard"},
      "service": {"type": "keyword"},
      "environment": {"type": "keyword"},
      "log_type": {"type": "keyword"},
      "host": {
        "properties": {
          "name": {"type": "keyword"},
          "ip": {"type": "ip"}
        }
      },
      "tags": {"type": "keyword"},
      "thread": {"type": "keyword"},
      "geoip": {
        "properties": {
          "location": {"type": "geo_point"},
          "country_name": {"type": "keyword"},
          "city_name": {"type": "keyword"}
        }
      }
    }
  }
}
Log Analysis and Alerting
Kibana dashboard configuration:
{
  "version": "7.10.0",
  "objects": [
    {
      "id": "jenkins-logs-overview",
      "type": "dashboard",
      "attributes": {
        "title": "Jenkins Logs Overview",
        "panelsJSON": "[\n {\n \"id\": \"log-level-distribution\",\n \"type\": \"visualization\",\n \"gridData\": {\n \"x\": 0,\n \"y\": 0,\n \"w\": 24,\n \"h\": 15\n }\n },\n {\n \"id\": \"security-events-timeline\",\n \"type\": \"visualization\",\n \"gridData\": {\n \"x\": 24,\n \"y\": 0,\n \"w\": 24,\n \"h\": 15\n }\n },\n {\n \"id\": \"build-errors-table\",\n \"type\": \"visualization\",\n \"gridData\": {\n \"x\": 0,\n \"y\": 15,\n \"w\": 48,\n \"h\": 20\n }\n }\n]"
      }
    },
    {
      "id": "log-level-distribution",
      "type": "visualization",
      "attributes": {
        "title": "Log Level Distribution",
        "visState": {
          "type": "pie",
          "params": {
            "addTooltip": true,
            "addLegend": true,
            "legendPosition": "right"
          },
          "aggs": [
            {"id": "1", "type": "count", "schema": "metric", "params": {}},
            {
              "id": "2",
              "type": "terms",
              "schema": "segment",
              "params": {
                "field": "level",
                "size": 10,
                "order": "desc",
                "orderBy": "1"
              }
            }
          ]
        }
      }
    }
  ]
}
ElastAlert alert rules:
# jenkins_security_alerts.yml
name: Jenkins Security Events
type: frequency
index: jenkins-logs-*
num_events: 5
timeframe:
  minutes: 5
filter:
  - terms:
      tags: ["security_event"]
alert:
  - "email"
  - "slack"
email:
  - "security@company.com"
slack:
  slack_webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
  slack_channel_override: "#security-alerts"
  slack_username_override: "ElastAlert"
alert_text: |
  Jenkins security event alert
  {0} security-related events were detected in the last 5 minutes.
  Event details:
  {1}
alert_text_args:
  - num_matches
  - log_message
---
# jenkins_login_failures.yml
name: Jenkins Login Failures
type: frequency
index: jenkins-logs-*
num_events: 3
timeframe:
  minutes: 5
filter:
  - terms:
      tags: ["login_failure"]
alert:
  - "email"
  - "slack"
email:
  - "devops@company.com"
slack:
  slack_webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
  slack_channel_override: "#jenkins-alerts"
alert_text: |
  Jenkins login failure alert
  {0} failed logins were detected in the last 5 minutes.
  Possible brute-force attack!
alert_text_args:
  - num_matches
---
# jenkins_build_errors.yml
name: Jenkins Build Errors
type: frequency
index: jenkins-logs-*
num_events: 10
timeframe:
  minutes: 15
filter:
  - terms:
      tags: ["build_error"]
  - terms:
      level: ["ERROR"]
alert:
  - "email"
email:
  - "devops@company.com"
alert_text: |
  Jenkins build error alert
  {0} build errors were detected in the last 15 minutes.
  Check the build configuration and environment.
alert_text_args:
  - num_matches
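An ElastAlert `frequency` rule fires when at least `num_events` matching documents arrive within the trailing `timeframe`. The check itself is simple and can be modeled offline (an illustrative helper for reasoning about thresholds, not ElastAlert code):

```python
from datetime import datetime, timedelta

def frequency_alert(event_times, now, num_events, timeframe):
    """Mirror of an ElastAlert 'frequency' rule: fire when at least
    num_events events fall inside the window [now - timeframe, now]."""
    window_start = now - timeframe
    recent = [t for t in event_times if window_start <= t <= now]
    return len(recent) >= num_events
```

Replaying historical event timestamps through this check is a quick way to estimate how noisy a proposed `num_events`/`timeframe` pair would be before enabling the rule.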
Chapter Summary
This chapter covered monitoring and log management for Jenkins:
- Monitoring overview: why monitoring matters and how to structure a metrics hierarchy
- Prometheus integration: metric collection and alert configuration
- Grafana dashboards: implementing visual monitoring
- Log management: log collection, analysis, and alerting
Effective monitoring and log management are key to keeping a Jenkins system running reliably, helping the team detect and resolve problems early.
Next Chapter Preview
The next chapter covers Jenkins performance optimization, including system tuning, build optimization, and resource management.
Exercises and Questions
Theory Exercises
Monitoring strategy design:
- Design a monitoring metrics hierarchy suited to your team
- Plan an alerting strategy and notification channels
- Consider storage and retention policies for monitoring data
Log management planning:
- Design a log collection and analysis architecture
- Plan log storage and rotation policies
- Consider log security and compliance requirements
Hands-on Exercises
Monitoring system setup:
- Deploy Prometheus and Grafana
- Configure Jenkins metric collection
- Create monitoring dashboards
Log system implementation:
- Set up an ELK Stack
- Configure log collection and parsing
- Implement log alerting
Questions to Consider
Monitoring optimization:
- How do you balance monitoring coverage against its performance impact?
- How do you design an alerting strategy that avoids alert fatigue?
- How can monitoring data drive capacity planning?
Log analysis:
- How do you extract valuable business insight from logs?
- How do you handle storage and query performance for large log volumes?
- How do you keep log data secure and private?