12.1 Monitoring Overview

Why Jenkins Monitoring Matters

Monitoring goals:

The core goals of Jenkins monitoring are:

1. System health monitoring
   - Service availability
   - Performance metrics
   - Resource usage
   - Error rates

2. Build quality monitoring
   - Build success rate
   - Build duration trends
   - Test coverage
   - Code quality metrics

3. User experience monitoring
   - Response times
   - Queue waiting time
   - User activity
   - UI performance

4. Security monitoring
   - Login failures
   - Permission changes
   - Anomalous operations
   - Security vulnerabilities

Monitoring architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Source   │───▶│  Collect/Store  │───▶│  Visualization  │
│   (Jenkins)     │    │  (Prometheus)   │    │   (Grafana)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    Alerting     │───▶│  Notifications  │
                       │ (AlertManager)  │    │ (Email / Slack) │
                       └─────────────────┘    └─────────────────┘

Data flow:
1. Jenkins exposes metrics
2. Prometheus scrapes the metrics
3. Prometheus stores the data and evaluates recording rules
4. Grafana visualizes the dashboards
5. AlertManager routes and delivers alerts
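
For local experimentation this stack can be brought up with Docker Compose. The sketch below is illustrative only; image tags, ports, and file paths are assumptions that need to be adapted to the target environment.

# docker-compose.yml (illustrative sketch of the monitoring stack)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./jenkins_rules.yml:/etc/prometheus/jenkins_rules.yml
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # change this in any real deployment
    ports:
      - "3000:3000"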

Monitoring Metrics Taxonomy

System-level metrics:

Infrastructure metrics:

1. Hardware resources
   - CPU utilization
   - Memory utilization
   - Disk utilization
   - Network I/O
   - Disk I/O

2. System performance
   - Load average
   - Process count
   - File descriptor usage
   - Network connection count

3. JVM metrics
   - Heap memory usage
   - Non-heap memory usage
   - GC frequency and pause time
   - Thread count
   - Loaded class count

Application-level metrics:

Jenkins application metrics:

1. Build metrics
   - Total builds
   - Build success rate
   - Build failure rate
   - Average build duration
   - Build queue length

2. Node metrics
   - Online node count
   - Total executors
   - Busy executors
   - Node response time

3. User metrics
   - Active users
   - Login count
   - Operation frequency
   - Session duration

4. Plugin metrics
   - Plugin count
   - Plugin updates
   - Plugin errors
   - Plugin performance

Business-level metrics:

Business-related metrics:

1. Delivery metrics (see the PromQL sketch after this list)
   - Deployment frequency
   - Change lead time
   - Change failure rate
   - Time to restore

2. Quality metrics
   - Test pass rate
   - Code coverage
   - Defect density
   - Technical debt

3. Efficiency metrics
   - Development velocity
   - Feedback time
   - Automation rate
   - Rework rate
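
Most of these delivery metrics are not exposed by Jenkins out of the box; they have to be derived from data the pipelines publish themselves. As an illustration only, the recording rule below assumes a hypothetical counter deployments_total{environment=...} that pipelines push to a Pushgateway (the push mechanism is shown in section 12.2):

# delivery_rules.yml (illustrative sketch, assumes a custom deployments_total counter)
groups:
  - name: delivery.rules
    rules:
      # Deployments to production per day
      - record: delivery:deployments_per_day
        expr: sum(increase(deployments_total{environment="production"}[1d]))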

12.2 Prometheus Monitoring Integration

Prometheus Plugin Configuration

Plugin installation and configuration:

Installation steps:
1. Install the "Prometheus metrics" plugin
2. Restart Jenkins
3. Open http://jenkins-server:8080/prometheus
4. Verify the exposed metrics (a quick curl check is shown after the configuration options below)

Configuration options:
- Manage Jenkins -> Configure System -> Prometheus
- Enable metrics collection
- Configure the metrics path: /prometheus
- Set the collection interval: 30 seconds
- Enable additional metrics: JVM, system, builds
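
Once the plugin is enabled it is worth confirming that the endpoint actually returns Prometheus-format samples before configuring any scrape job. A quick check (host, port, and credentials are placeholders):

# Verify the metrics endpoint (adjust host, port, and authentication)
curl -s http://jenkins-server:8080/prometheus/ | head -n 20

# With a user and API token, if anonymous read access is disabled
curl -s -u admin:API_TOKEN http://jenkins-server:8080/prometheus/ | grep jenkins | head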

Prometheus configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "jenkins_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'jenkins'
    static_configs:
      - targets: ['jenkins:8080']
    metrics_path: '/prometheus'
    scrape_interval: 30s
    scrape_timeout: 10s
    
  - job_name: 'jenkins-nodes'
    static_configs:
      - targets: 
        - 'jenkins-agent-1:8080'
        - 'jenkins-agent-2:8080'
    metrics_path: '/prometheus'
    scrape_interval: 30s
    
  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - 'jenkins:9100'
        - 'jenkins-agent-1:9100'
        - 'jenkins-agent-2:9100'

Jenkins metric rules:

# jenkins_rules.yml
groups:
  - name: jenkins.rules
    rules:
      # Build success rate
      - record: jenkins:build_success_rate
        expr: |
          (
            sum(rate(jenkins_builds_success_build_count[5m])) /
            sum(rate(jenkins_builds_build_count[5m]))
          ) * 100
      
      # Average build duration
      - record: jenkins:build_duration_avg
        expr: |
          sum(rate(jenkins_builds_duration_milliseconds_summary_sum[5m])) /
          sum(rate(jenkins_builds_duration_milliseconds_summary_count[5m]))
      
      # Estimated queue waiting time
      - record: jenkins:queue_waiting_time
        expr: |
          sum(jenkins_queue_size_value) * 
          avg(jenkins_builds_duration_milliseconds_summary{quantile="0.5"})
      
      # Node availability
      - record: jenkins:node_availability
        expr: |
          (
            sum(jenkins_node_online_value) /
            sum(jenkins_node_count_value)
          ) * 100
      
      # Executor utilization
      - record: jenkins:executor_utilization
        expr: |
          (
            sum(jenkins_executor_in_use_value) /
            sum(jenkins_executor_count_value)
          ) * 100

  - name: jenkins.alerts
    rules:
      # Build failure rate too high
      - alert: HighBuildFailureRate
        expr: jenkins:build_success_rate < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins构建失败率过高"
          description: "构建成功率为 {{ $value }}%,低于80%阈值"
      
      # Build duration too long
      - alert: LongBuildDuration
        expr: jenkins:build_duration_avg > 1800000  # 30 minutes
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins构建时间过长"
          description: "平均构建时间为 {{ $value | humanizeDuration }},超过30分钟"
      
      # Severe queue backlog
      - alert: HighQueueBacklog
        expr: jenkins_queue_size_value > 20
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Jenkins构建队列积压严重"
          description: "当前队列中有 {{ $value }} 个任务等待执行"
      
      # Node offline
      - alert: NodeOffline
        expr: jenkins_node_online_value == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Jenkins节点离线"
          description: "节点 {{ $labels.node_name }} 已离线"
      
      # Low disk space
      - alert: LowDiskSpace
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"}
          ) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "磁盘空间不足"
          description: "{{ $labels.instance }} 磁盘使用率超过90%"
      
      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          ) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "内存使用率过高"
          description: "{{ $labels.instance }} 内存使用率为 {{ $value }}%"

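Recording and alerting rules are easy to break with a stray indent; they can be validated offline before Prometheus reloads them, since promtool ships with the Prometheus distribution:

# Validate the rule file and the main configuration
promtool check rules jenkins_rules.yml
promtool check config prometheus.yml

# Trigger a configuration reload without restarting
# (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload
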
Custom Metrics Collection

Metrics collection in a Pipeline:

// Metrics collection implemented in a shared library.
// The build steps are passed as a closure so it can be called as
// collectBuildMetrics { sh 'mvn clean compile' }
def collectBuildMetrics(Closure buildSteps) {
    def startTime = System.currentTimeMillis()

    try {
        // Run the wrapped build steps
        buildSteps()

        // Record success metrics
        def duration = System.currentTimeMillis() - startTime
        recordMetric('build_success', 1, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown'
        ])
        recordMetric('build_duration', duration, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown'
        ])

    } catch (Exception e) {
        // Record failure metrics
        def duration = System.currentTimeMillis() - startTime
        recordMetric('build_failure', 1, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown',
            error_type: e.class.simpleName
        ])
        recordMetric('build_duration', duration, [
            project: env.JOB_NAME,
            branch: env.BRANCH_NAME ?: 'unknown',
            status: 'failed'
        ])

        throw e
    }
}

// Record a custom metric in Prometheus text format
def recordMetric(String name, Number value, Map labels = [:]) {
    def labelsStr = labels.collect { k, v -> "${k}=\"${v}\"" }.join(',')
    // No explicit timestamp: the Pushgateway rejects samples that carry one
    def metric = "${name}{${labelsStr}} ${value}"

    // Append to a local metrics file (writeFile has no append option, so read first)
    def existing = fileExists('metrics.txt') ? readFile('metrics.txt') : ''
    writeFile file: 'metrics.txt', text: existing + metric + '\n'

    // Push to the Prometheus Pushgateway if one is configured
    if (env.PROMETHEUS_PUSHGATEWAY_URL) {
        sh """
            echo '${metric}' | curl -X POST \
                --data-binary @- \
                ${env.PROMETHEUS_PUSHGATEWAY_URL}/metrics/job/jenkins/instance/${env.NODE_NAME}
        """
    }
}

// Collect test metrics from the Surefire XML reports
// (findFiles requires the Pipeline Utility Steps plugin)
def collectTestMetrics() {
    def totalTests = 0, failures = 0, errors = 0, skipped = 0

    def reports = findFiles(glob: 'target/surefire-reports/TEST-*.xml')
    reports.each { report ->
        def testsuite = new XmlSlurper().parseText(readFile(report.path))
        totalTests += testsuite.@tests.toInteger()
        failures   += testsuite.@failures.toInteger()
        errors     += testsuite.@errors.toInteger()
        skipped    += testsuite.@skipped.toInteger()
    }
    def passed = totalTests - failures - errors - skipped

    // Record test metrics
    recordMetric('test_total', totalTests, [project: env.JOB_NAME])
    recordMetric('test_passed', passed, [project: env.JOB_NAME])
    recordMetric('test_failed', failures + errors, [project: env.JOB_NAME])
    recordMetric('test_skipped', skipped, [project: env.JOB_NAME])

    // Compute the test pass rate
    def passRate = totalTests > 0 ? (passed / totalTests) * 100 : 0
    recordMetric('test_pass_rate', passRate, [project: env.JOB_NAME])
}

// Collect code quality metrics from SonarQube
def collectQualityMetrics() {
    // Read the analysis metadata written by the SonarQube scanner
    if (fileExists('target/sonar/report-task.txt')) {
        def reportTask = readFile('target/sonar/report-task.txt')
        def serverUrl = reportTask.find(/serverUrl=(.+)/) { match, url -> url }
        def taskId = reportTask.find(/ceTaskId=(.+)/) { match, id -> id }

        // Wait for the background analysis task to finish
        sleep(time: 30, unit: 'SECONDS')

        // Resolve the analysis id from the compute engine task, then read the quality gate
        def qualityGateStatus = sh(
            script: """
                analysisId=\$(curl -s "${serverUrl}/api/ce/task?id=${taskId}" | jq -r '.task.analysisId')
                curl -s "${serverUrl}/api/qualitygates/project_status?analysisId=\${analysisId}" \
                | jq -r '.projectStatus.status'
            """,
            returnStdout: true
        ).trim()

        recordMetric('quality_gate_status', qualityGateStatus == 'OK' ? 1 : 0, [
            project: env.JOB_NAME
        ])

        // Fetch individual measures (assumes the SonarQube project key matches the job name)
        def metrics = sh(
            script: """
                curl -s "${serverUrl}/api/measures/component?component=${env.JOB_NAME}&metricKeys=coverage,bugs,vulnerabilities,code_smells" \
                | jq -r '.component.measures[] | "\\(.metric)=\\(.value)"'
            """,
            returnStdout: true
        ).trim().split('\n')

        metrics.each { metric ->
            def (name, value) = metric.split('=')
            recordMetric("sonar_${name}", value.toDouble(), [project: env.JOB_NAME])
        }
    }
}

// Usage example
pipeline {
    agent any
    
    stages {
        stage('Build with Metrics') {
            steps {
                script {
                    collectBuildMetrics {
                        sh 'mvn clean compile'
                    }
                }
            }
        }
        
        stage('Test with Metrics') {
            steps {
                script {
                    collectBuildMetrics {
                        sh 'mvn test'
                    }
                    collectTestMetrics()
                }
            }
        }
        
        stage('Quality Analysis') {
            steps {
                script {
                    sh 'mvn sonar:sonar'
                    collectQualityMetrics()
                }
            }
        }
    }
    
    post {
        always {
            // Archive the metrics file
            archiveArtifacts artifacts: 'metrics.txt', allowEmptyArchive: true
        }
    }
}

12.3 Grafana Dashboards

Dashboard Design

Jenkins overview dashboard:

{
  "dashboard": {
    "id": null,
    "title": "Jenkins Overview",
    "tags": ["jenkins", "ci-cd"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Build Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "jenkins:build_success_rate",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 95}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Average Build Duration",
        "type": "stat",
        "targets": [
          {
            "expr": "jenkins:build_duration_avg / 1000",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 300},
                {"color": "red", "value": 600}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
      },
      {
        "id": 3,
        "title": "Queue Size",
        "type": "stat",
        "targets": [
          {
            "expr": "jenkins_queue_size_value",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 10},
                {"color": "red", "value": 20}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0}
      },
      {
        "id": 4,
        "title": "Online Nodes",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(jenkins_node_online_value)",
            "refId": "A"
          }
        ],
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0}
      },
      {
        "id": 5,
        "title": "Build Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jenkins_builds_build_count[5m]) * 60",
            "legendFormat": "Builds per minute",
            "refId": "A"
          }
        ],
        "yAxes": [
          {
            "label": "Builds/min",
            "min": 0
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 6,
        "title": "Executor Utilization",
        "type": "graph",
        "targets": [
          {
            "expr": "jenkins:executor_utilization",
            "legendFormat": "Utilization %",
            "refId": "A"
          }
        ],
        "yAxes": [
          {
            "label": "Percent",
            "min": 0,
            "max": 100
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
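
Rather than importing this JSON by hand, the dashboard can be version-controlled and pushed through the Grafana HTTP API. A sketch, assuming an API token with editor rights and the JSON saved as jenkins-overview-dashboard.json (the file already uses the "dashboard" wrapper the endpoint expects):

# Import the dashboard through the Grafana HTTP API
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @jenkins-overview-dashboard.json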

Build details dashboard:

{
  "dashboard": {
    "title": "Jenkins Build Details",
    "panels": [
      {
        "id": 1,
        "title": "Build Duration by Project",
        "type": "graph",
        "targets": [
          {
            "expr": "avg by (job) (jenkins_builds_duration_milliseconds_summary{quantile=\"0.5\"}) / 1000",
            "legendFormat": "{{ job }} (median)",
            "refId": "A"
          },
          {
            "expr": "avg by (job) (jenkins_builds_duration_milliseconds_summary{quantile=\"0.95\"}) / 1000",
            "legendFormat": "{{ job }} (95th percentile)",
            "refId": "B"
          }
        ],
        "yAxes": [
          {
            "label": "Duration (seconds)",
            "min": 0
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Build Status Distribution",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by (result) (increase(jenkins_builds_build_count[1h]))",
            "legendFormat": "{{ result }}",
            "refId": "A"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 3,
        "title": "Failed Builds by Project",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (job) (increase(jenkins_builds_failed_build_count[24h])))",
            "format": "table",
            "refId": "A"
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {"Time": true},
              "renameByName": {
                "job": "Project",
                "Value": "Failed Builds (24h)"
              }
            }
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ]
  }
}

Node monitoring dashboard:

{
  "dashboard": {
    "title": "Jenkins Nodes Monitoring",
    "panels": [
      {
        "id": 1,
        "title": "Node Status",
        "type": "table",
        "targets": [
          {
            "expr": "jenkins_node_online_value",
            "format": "table",
            "instant": true,
            "refId": "A"
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {"Time": true, "__name__": true},
              "renameByName": {
                "node_name": "Node",
                "Value": "Online"
              }
            }
          },
          {
            "id": "fieldLookup",
            "options": {
              "lookupField": "Online",
              "mappings": {
                "0": "Offline",
                "1": "Online"
              }
            }
          }
        ],
        "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "CPU Usage by Node",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}",
            "refId": "A"
          }
        ],
        "yAxes": [
          {
            "label": "CPU Usage %",
            "min": 0,
            "max": 100
          }
        ],
        "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0}
      },
      {
        "id": 3,
        "title": "Memory Usage by Node",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}",
            "refId": "A"
          }
        ],
        "yAxes": [
          {
            "label": "Memory Usage %",
            "min": 0,
            "max": 100
          }
        ],
        "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0}
      },
      {
        "id": 4,
        "title": "Executor Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "jenkins_executor_in_use_value",
            "legendFormat": "{{ node_name }} (in use)",
            "refId": "A"
          },
          {
            "expr": "jenkins_executor_count_value",
            "legendFormat": "{{ node_name }} (total)",
            "refId": "B"
          }
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ]
  }
}
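
If Grafana itself is managed as code, dashboards can also be loaded from disk at startup through file-based provisioning; a minimal sketch (paths and folder name are assumptions):

# /etc/grafana/provisioning/dashboards/jenkins.yml
apiVersion: 1
providers:
  - name: jenkins-dashboards
    folder: Jenkins
    type: file
    options:
      path: /var/lib/grafana/dashboards/jenkins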

Alert Configuration

AlertManager configuration:

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'jenkins-alerts@company.com'
  smtp_auth_username: 'jenkins-alerts@company.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'devops@company.com'
        subject: 'Jenkins Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'devops@company.com,management@company.com'
        subject: '🚨 CRITICAL Jenkins Alert: {{ .GroupLabels.alertname }}'
        body: |
          CRITICAL ALERT!
          
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Time: {{ .StartsAt }}
          {{ end }}
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '🚨 Critical Jenkins Alert'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}

  - name: 'warning-alerts'
    email_configs:
      - to: 'devops@company.com'
        subject: '⚠️ Jenkins Warning: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#jenkins'
        title: '⚠️ Jenkins Warning'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
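
A broken Alertmanager configuration fails silently from the alert sender's point of view, so it is worth validating the file before rolling it out; amtool is distributed with the Alertmanager release:

# Validate the Alertmanager configuration
amtool check-config alertmanager.yml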

12.4 Log Management

Log Collection Configuration

Jenkins logging configuration:

# $JENKINS_HOME/log.properties (java.util.logging configuration)
# Root log level
.level = INFO

# Handlers
handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler

# Console handler
java.util.logging.ConsoleHandler.level = INFO
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

# File handler
java.util.logging.FileHandler.level = ALL
java.util.logging.FileHandler.pattern = /var/log/jenkins/jenkins.%g.log
java.util.logging.FileHandler.limit = 50000000
java.util.logging.FileHandler.count = 10
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

# Component-specific log levels
jenkins.level = INFO
hudson.level = INFO
org.springframework.level = WARNING
org.apache.level = WARNING

# Security-related loggers
hudson.security.level = INFO
jenkins.security.level = INFO

# Build-related loggers
hudson.model.Run.level = INFO
hudson.model.AbstractBuild.level = INFO

# Plugin loggers
hudson.PluginManager.level = INFO
jenkins.InitReactorRunner.level = INFO

# SCM loggers
hudson.scm.level = INFO
hudson.plugins.git.level = INFO

# Networking loggers
hudson.remoting.level = WARNING
org.eclipse.jetty.level = WARNING

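Jenkins does not read this file automatically; the JVM has to be pointed at it with the standard java.util.logging.config.file system property. A sketch for a systemd-managed installation (the drop-in path and the JAVA_OPTS variable name are assumptions that depend on how Jenkins was packaged):

# /etc/systemd/system/jenkins.service.d/logging.conf
[Service]
Environment="JAVA_OPTS=-Djava.util.logging.config.file=/var/lib/jenkins/log.properties"

# Apply the drop-in and restart Jenkins
sudo systemctl daemon-reload
sudo systemctl restart jenkins
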
Logback configuration (applicable when Jenkins is run with Logback on the classpath):

<!-- $JENKINS_HOME/logback.xml -->
<configuration>
    <!-- Console output -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    
    <!-- File output -->
    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/jenkins/jenkins.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/jenkins/jenkins.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    
    <!-- JSON output (for log shipping) -->
    <appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/jenkins/jenkins.json</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/jenkins/jenkins.%d{yyyy-MM-dd}.%i.json.gz</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
            <providers>
                <timestamp/>
                <logLevel/>
                <loggerName/>
                <message/>
                <mdc/>
                <arguments/>
                <stackTrace/>
            </providers>
        </encoder>
    </appender>
    
    <!-- Security log -->
    <appender name="SECURITY_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/jenkins/security.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/jenkins/security.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>50MB</maxFileSize>
            <maxHistory>90</maxHistory>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    
    <!-- Build log -->
    <appender name="BUILD_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/jenkins/builds.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/jenkins/builds.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>200MB</maxFileSize>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    
    <!-- Logger-specific configuration -->
    <logger name="hudson.security" level="INFO" additivity="false">
        <appender-ref ref="SECURITY_FILE"/>
        <appender-ref ref="JSON_FILE"/>
    </logger>
    
    <logger name="jenkins.security" level="INFO" additivity="false">
        <appender-ref ref="SECURITY_FILE"/>
        <appender-ref ref="JSON_FILE"/>
    </logger>
    
    <logger name="hudson.model.Run" level="INFO" additivity="false">
        <appender-ref ref="BUILD_FILE"/>
        <appender-ref ref="JSON_FILE"/>
    </logger>
    
    <logger name="hudson.model.AbstractBuild" level="INFO" additivity="false">
        <appender-ref ref="BUILD_FILE"/>
        <appender-ref ref="JSON_FILE"/>
    </logger>
    
    <!-- Root logger -->
    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="FILE"/>
        <appender-ref ref="JSON_FILE"/>
    </root>
</configuration>

ELK Stack Integration

Filebeat configuration:

# filebeat.yml
filebeat.inputs:
  # Jenkins main log (JSON format)
  - type: log
    enabled: true
    paths:
      - /var/log/jenkins/jenkins.json
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      service: jenkins
      environment: production
      log_type: application
    fields_under_root: true
    
  # Jenkins security log
  - type: log
    enabled: true
    paths:
      - /var/log/jenkins/security.log
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
    fields:
      service: jenkins
      environment: production
      log_type: security
    fields_under_root: true
    
  # Jenkins build log
  - type: log
    enabled: true
    paths:
      - /var/log/jenkins/builds.log
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
    fields:
      service: jenkins
      environment: production
      log_type: build
    fields_under_root: true
    
  # System logs
  - type: log
    enabled: true
    paths:
      - /var/log/syslog
      - /var/log/auth.log
    fields:
      service: system
      environment: production
      log_type: system
    fields_under_root: true

processors:
  # Add host metadata
  - add_host_metadata:
      when.not.contains.tags: forwarded

  # Add Docker metadata (when running in a container)
  - add_docker_metadata: ~

  # Drop fields that are not needed downstream
  - drop_fields:
      fields: ["agent", "ecs", "host.architecture"]
      ignore_missing: true

# Ship directly to Elasticsearch ...
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "jenkins-logs-%{+yyyy.MM.dd}"

# ... or, if the Logstash pipeline below is used instead, replace the block above with:
# output.logstash:
#   hosts: ["logstash:5044"]

# In Filebeat 7.x a custom index name also requires explicit template settings
# and disabling ILM
setup.ilm.enabled: false
setup.template.name: "jenkins"
setup.template.pattern: "jenkins-*"
setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 1
  index.refresh_interval: 30s

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
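
Filebeat has built-in self-checks that catch YAML mistakes and output connectivity problems before any logs are shipped:

# Validate the configuration file
filebeat test config -c /etc/filebeat/filebeat.yml

# Verify connectivity to the configured output
filebeat test output -c /etc/filebeat/filebeat.yml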

Logstash configuration:

# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Process Jenkins logs
  if [service] == "jenkins" {
    # Parse the timestamp
    date {
      match => [ "timestamp", "yyyy-MM-dd HH:mm:ss.SSS" ]
    }
    
    # Normalize the log level
    if [level] {
      mutate {
        uppercase => [ "level" ]
      }
    }
    
    # Process security logs
    if [log_type] == "security" {
      grok {
        match => { 
          "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:thread}\] %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:log_message}"
        }
      }
      
      # Flag security-related events
      if [log_message] =~ /(?i)(login|authentication|authorization|failed|denied|unauthorized)/ {
        mutate {
          add_tag => [ "security_event" ]
        }
      }
      
      # Flag login failures
      if [log_message] =~ /(?i)(login.*failed|authentication.*failed|bad credentials)/ {
        mutate {
          add_tag => [ "login_failure" ]
        }
      }
    }
    
    # Process build logs
    if [log_type] == "build" {
      grok {
        match => { 
          "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:thread}\] %{LOGLEVEL:level} %{DATA:logger} - %{GREEDYDATA:log_message}"
        }
      }
      
      # Extract build lifecycle markers
      if [log_message] =~ /Started by/ {
        mutate {
          add_tag => [ "build_started" ]
        }
      }
      
      if [log_message] =~ /(Finished:|Build step)/ {
        mutate {
          add_tag => [ "build_finished" ]
        }
      }
      
      if [log_message] =~ /(FAILURE|ERROR|Exception)/ {
        mutate {
          add_tag => [ "build_error" ]
        }
      }
    }
  }
  
  # Process system logs
  if [service] == "system" {
    grok {
      match => { 
        "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:log_message}"
      }
    }
    
    date {
      match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
  
  # Add GeoIP information (when a client IP is present)
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  
  # Clean up fields
  mutate {
    remove_field => [ "@version", "beat", "input", "prospector" ]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{service}-logs-%{+YYYY.MM.dd}"
  }
  
  # Also print to stdout (for debugging)
  stdout {
    codec => rubydebug
  }
}
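
The pipeline definition can be syntax-checked without fully starting Logstash (the binary path assumes a package installation):

# Validate the pipeline and exit
/usr/share/logstash/bin/logstash -f logstash.conf --config.test_and_exit

# Or run with automatic reload while iterating on the filters
/usr/share/logstash/bin/logstash -f logstash.conf --config.reload.automatic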

Elasticsearch index template:

{
  "index_patterns": ["jenkins-logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.refresh_interval": "30s",
    "index.max_result_window": 50000
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSS"
      },
      "level": {
        "type": "keyword"
      },
      "logger": {
        "type": "keyword"
      },
      "message": {
        "type": "text",
        "analyzer": "standard"
      },
      "log_message": {
        "type": "text",
        "analyzer": "standard"
      },
      "service": {
        "type": "keyword"
      },
      "environment": {
        "type": "keyword"
      },
      "log_type": {
        "type": "keyword"
      },
      "host": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "ip": {
            "type": "ip"
          }
        }
      },
      "tags": {
        "type": "keyword"
      },
      "thread": {
        "type": "keyword"
      },
      "geoip": {
        "properties": {
          "location": {
            "type": "geo_point"
          },
          "country_name": {
            "type": "keyword"
          },
          "city_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
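
On Elasticsearch 7.x this legacy index template can be installed with a single request (the template and file names below are examples):

# Install the template so new jenkins-logs-* indices pick up these mappings
curl -X PUT "http://elasticsearch:9200/_template/jenkins-logs" \
  -H 'Content-Type: application/json' \
  -d @jenkins-logs-template.json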

Log Analysis and Alerting

Kibana dashboard configuration:

{
  "version": "7.10.0",
  "objects": [
    {
      "id": "jenkins-logs-overview",
      "type": "dashboard",
      "attributes": {
        "title": "Jenkins Logs Overview",
        "panelsJSON": "[\n  {\n    \"id\": \"log-level-distribution\",\n    \"type\": \"visualization\",\n    \"gridData\": {\n      \"x\": 0,\n      \"y\": 0,\n      \"w\": 24,\n      \"h\": 15\n    }\n  },\n  {\n    \"id\": \"security-events-timeline\",\n    \"type\": \"visualization\",\n    \"gridData\": {\n      \"x\": 24,\n      \"y\": 0,\n      \"w\": 24,\n      \"h\": 15\n    }\n  },\n  {\n    \"id\": \"build-errors-table\",\n    \"type\": \"visualization\",\n    \"gridData\": {\n      \"x\": 0,\n      \"y\": 15,\n      \"w\": 48,\n      \"h\": 20\n    }\n  }\n]"
      }
    },
    {
      "id": "log-level-distribution",
      "type": "visualization",
      "attributes": {
        "title": "Log Level Distribution",
        "visState": {
          "type": "pie",
          "params": {
            "addTooltip": true,
            "addLegend": true,
            "legendPosition": "right"
          },
          "aggs": [
            {
              "id": "1",
              "type": "count",
              "schema": "metric",
              "params": {}
            },
            {
              "id": "2",
              "type": "terms",
              "schema": "segment",
              "params": {
                "field": "level",
                "size": 10,
                "order": "desc",
                "orderBy": "1"
              }
            }
          ]
        }
      }
    }
  ]
}
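
Kibana saved objects (dashboards, visualizations, index patterns) can be moved between environments through the saved objects API; a sketch assuming an export file named jenkins-dashboards.ndjson:

# Import previously exported saved objects into Kibana
curl -X POST "http://kibana:5601/api/saved_objects/_import?overwrite=true" \
  -H "kbn-xsrf: true" \
  --form file=@jenkins-dashboards.ndjson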

ElastAlert alerting rules:

# jenkins_security_alerts.yml
name: Jenkins Security Events
type: frequency
index: jenkins-logs-*
num_events: 5
timeframe:
  minutes: 5

filter:
- terms:
    tags: ["security_event"]

alert:
- "email"
- "slack"

email:
- "security@company.com"

slack:
slack_webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
slack_channel_override: "#security-alerts"
slack_username_override: "ElastAlert"

alert_text: |
  Jenkins security event alert

  {0} security-related events were detected in the last 5 minutes

  Event details:
  {1}

alert_text_args:
  - num_matches
  - log_message

---

# jenkins_login_failures.yml
name: Jenkins Login Failures
type: frequency
index: jenkins-logs-*
num_events: 3
timeframe:
  minutes: 5

filter:
- terms:
    tags: ["login_failure"]

alert:
- "email"
- "slack"

email:
- "devops@company.com"

slack:
slack_webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
slack_channel_override: "#jenkins-alerts"

alert_text: |
  Jenkins login failure alert

  {0} failed login attempts were detected in the last 5 minutes

  Possible brute-force attack!

---

# jenkins_build_errors.yml
name: Jenkins Build Errors
type: frequency
index: jenkins-logs-*
num_events: 10
timeframe:
  minutes: 15

filter:
- terms:
    tags: ["build_error"]
- terms:
    level: ["ERROR"]

alert:
- "email"

email:
- "devops@company.com"

alert_text: |
  Jenkins build error alert

  {0} build errors were detected in the last 15 minutes

  Please check the build configuration and environment

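ElastAlert rules can be dry-run against historical data before they are enabled, which helps avoid noisy or broken alerts; the command below assumes a standard ElastAlert installation with its config.yaml alongside the rule files:

# Dry-run a rule without sending any alerts
elastalert-test-rule --config config.yaml jenkins_login_failures.yml
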
Chapter Summary

This chapter covered Jenkins monitoring and log management in detail:

  1. Monitoring overview: why monitoring matters and how to structure a metrics taxonomy
  2. Prometheus integration: metric collection and alerting rule configuration
  3. Grafana dashboards: building visual monitoring for Jenkins
  4. Log management: log collection, analysis, and alerting

Effective monitoring and log management are key to keeping a Jenkins installation stable, and help the team detect and resolve problems early.

Next Chapter Preview

In the next chapter we will look at Jenkins performance optimization, including system tuning, build optimization, and resource management.

Exercises and Review

Conceptual exercises

  1. Monitoring strategy design

    • Design a monitoring metrics taxonomy suited to your team
    • Plan alerting strategies and notification channels
    • Consider storage and retention policies for monitoring data
  2. Log management planning

    • Design a log collection and analysis architecture
    • Plan log storage and rotation policies
    • Consider log security and compliance requirements

Hands-on exercises

  1. Build a monitoring system

    • Deploy Prometheus and Grafana
    • Configure Jenkins metrics collection
    • Create monitoring dashboards
  2. Implement a logging system

    • Set up the ELK Stack
    • Configure log collection and parsing
    • Implement log-based alerting

Discussion questions

  1. Monitoring optimization

    • How do you balance monitoring coverage against its performance impact?
    • How do you design an alerting strategy that avoids alert fatigue?
    • How can monitoring data be used for capacity planning?
  2. Log analysis

    • How can valuable business insight be extracted from logs?
    • How do you handle storage and query performance for large log volumes?
    • How do you ensure the security and privacy of log data?