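Chapter 6: Alerting Rules

This chapter covers how to configure Prometheus alerting rules and how to use Alertmanager for alert grouping, inhibition, silencing, and routing.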

Last updated: 2024-01-01

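6.1 Alertmanager Overview

Alertmanager receives the alerts that Prometheus sends, then deduplicates, groups, inhibits, and silences them before routing notifications to receivers such as email, Slack, or a webhook.

6.1.1 Architecture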

┌─────────────────┐
│   Prometheus    │
│  Alerting Rules │
└────────┬────────┘
         │ HTTP
         ▼
┌─────────────────────────────────────────────────┐
│              Alertmanager Cluster               │
│  ┌─────────────────────────────────────────┐    │
│  │  ┌────────────┐  ┌────────────┐         │    │
│  │  │  Grouping  │  │ Inhibition │         │    │
│  │  └────────────┘  └────────────┘         │    │
│  │  ┌────────────┐  ┌────────────┐         │    │
│  │  │  Silence   │  │  Routing   │         │    │
│  │  └────────────┘  └────────────┘         │    │
│  └─────────────────────────────────────────┘    │
└─────────────────┬───────────────────────────────┘
       ┌──────────┼──────────┐
       ▼          ▼          ▼
   ┌────────┐ ┌────────┐ ┌────────┐
   │ Email  │ │ Slack  │ │ Webhook│
   └────────┘ └────────┘ └────────┘
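
The arrow from Prometheus to Alertmanager in the diagram corresponds to two settings on the Prometheus side: which rule files to load and where to send firing alerts. The fragment below is a minimal sketch; the rules/*.yml glob and the localhost:9093 target are assumptions chosen to match the examples later in this chapter and should be adapted to the actual deployment.

```yaml
# prometheus.yml (fragment, assumed file layout)
global:
  evaluation_interval: 1m      # how often alerting rules are evaluated

rule_files:
  - "rules/*.yml"              # assumed location of the alerting rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093" # assumed Alertmanager address
```

When Alertmanager runs as a cluster, list every instance under targets rather than placing a load balancer in front of them, so that each instance receives every alert.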

6.2 Installing Alertmanager

6.2.1 Binary Installation

# Download and unpack the release tarball
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64

# Run (listens on port 9093 by default)
./alertmanager --config.file=alertmanager.yml

6.2.2 Docker Installation

# docker-compose.yml
version: '3.8'

services:
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - ./data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped

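6.3 Configuring Alerting Rules

6.3.1 A Basic Alerting Rule

# rules/alerts.yml
groups:
  - name: example
    rules:
      # Basic alert: fires when a scrape target has been unreachable for 5 minutes
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes"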

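6.3.2 Alerting Rule Fields

Field        Required  Description
alert        Yes       Name of the alert
expr         Yes       PromQL expression; every series it returns becomes an alert
for          No        How long the condition must hold before the alert fires (default 0s)
labels       No        Extra labels attached to the alert
annotations  No        Descriptive information such as summary and description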

6.3.3 Common Alerting Rule Examples

groups:
  - name: node-alerts
    rules:
      # CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on {{ $labels.instance }} is above 80%, current value: {{ $value | printf \"%.2f\" }}%"

      # Memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} is {{ $value | printf \"%.2f\" }}%"

      # Disk space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 15% free space on {{ $labels.instance }} {{ $labels.mountpoint }}"

      # Disk I/O: fraction of time the device was busy over the last 5 minutes
      - alert: DiskIOHigh
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O utilization"

      # Inbound network traffic above 100 MiB/s
      - alert: NetworkTrafficHigh
        expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Inbound network traffic is unusually high"

6.3.4 Service Alerting Rules

groups:
  - name: service-alerts
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook_url: "https://wiki.example.com/runbook/{{ $labels.job }}"

      # HTTP error ratio; aggregate by service so that $labels.service is available
      - alert: HighHTTPErrorRate
        expr: |
          sum by(service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by(service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate"
          description: "{{ $labels.service }} error ratio is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by(le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High service latency"
          description: "{{ $labels.service }} P99 latency is {{ $value | printf \"%.3f\" }}s"

      # Connection rate: rate() of a counter gives new connections per second,
      # not the number of currently open connections
      - alert: TooManyConnections
        expr: sum by(service) (rate(http_connections_total[5m])) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "New-connection rate is too high"

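6.4 Alertmanager Configuration

6.4.1 Basic Configuration

# alertmanager.yml
global:
  # SMTP settings shared by all email receivers
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

  # DingTalk and WeChat Work webhooks are configured per receiver (see 6.4.2)

route:
  # Default receiver for alerts that match no child route
  receiver: 'default-receiver'

  # Labels used to group alerts into one notification
  group_by: ['alertname', 'cluster', 'service']

  # How long to wait before sending the first notification for a new group
  group_wait: 30s

  # How long to wait before notifying about new alerts added to an existing group
  group_interval: 5m

  # How long to wait before re-sending a notification for alerts that are still firing
  repeat_interval: 4h

  # Child routes, evaluated in order; the first match wins unless continue is true.
  # matchers replaces the deprecated match/match_re fields (Alertmanager 0.22+).
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'critical-receiver'
      # Keep evaluating the sibling routes below after this one matches
      continue: true
    - matchers:
        - severity = "warning"
      receiver: 'warning-receiver'
    - matchers:
        - service = "api"
      receiver: 'api-receiver'

# Receivers
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@example.com'
        headers:
          Subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'

  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-critical'
        send_resolved: true

  - name: 'warning-receiver'
    email_configs:
      - to: 'team@example.com'

  - name: 'api-receiver'
    webhook_configs:
      - url: 'http://webhook-service:8080/alerts'
        send_resolved: true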

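6.4.2 DingTalk Notifications

Alertmanager's webhook_configs always POSTs Alertmanager's own JSON payload, and that payload cannot be reshaped with templates. The DingTalk robot API expects its own message format, so pointing webhook_configs directly at oapi.dingtalk.com does not deliver messages. Send the webhook to a relay instead (for example the open-source prometheus-webhook-dingtalk, or a small in-house service) and let the relay call DingTalk.

receivers:
  - name: 'dingtalk'
    webhook_configs:
      # Hypothetical relay address; the relay forwards to
      # https://oapi.dingtalk.com/robot/send?access_token=xxx
      - url: 'http://dingtalk-relay:8060/alerts'

The relay renders each alert into a DingTalk markdown message of the following shape. The {{ ... }} placeholders are filled in by the relay's own templates, and the markdown body goes in the text field.

{
  "msgtype": "markdown",
  "markdown": {
    "title": "Prometheus Alert",
    "text": "### {{ .GroupLabels.alertname }}\n\n**Instance**: {{ .Labels.instance }}\n\n**Description**: {{ .Annotations.description }}"
  }
}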

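6.5 Advanced Features

6.5.1 Alert Grouping

route:
  # Alerts that share these label values are batched into one notification
  group_by: ['alertname', 'service']

  # Wait 30 seconds after the first alert of a new group so related alerts can join it
  group_wait: 30s

  # When new alerts join a group that has already been notified, wait 5 minutes before sending the update
  group_interval: 5m

  # While the alerts keep firing, repeat the notification every 4 hours
  repeat_interval: 4h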
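
Grouping settings are inherited by child routes and can be overridden per route. The sketch below illustrates that: critical alerts are notified sooner and repeated more often, and a second route turns grouping off with the special value '...' so every alert is sent on its own. The team = "audit" matcher and the receiver names are hypothetical, and each receiver would still need to be defined under receivers.

```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts: shorter initial wait, more frequent repeats
    - matchers:
        - severity = "critical"
      receiver: 'critical-receiver'
      group_wait: 10s
      repeat_interval: 1h
    # Hypothetical consumer that wants one notification per alert:
    # '...' groups by all labels, which effectively disables aggregation
    - matchers:
        - team = "audit"
      receiver: 'audit-webhook'
      group_by: ['...']
```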

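6.5.2 Alert Inhibition

# Prevent cascading alerts
inhibit_rules:
  # While a critical InstanceDown alert is firing, mute every other alert
  # that carries the same instance and cluster labels
  - source_matchers:
      - alertname = "InstanceDown"
      - severity = "critical"
    target_matchers:
      - alertname != "InstanceDown"
    equal: ['instance', 'cluster']

  # Do not use inhibition to mute a whole environment. A rule with only
  # env=test on the source side and equal: ['alertname'] would let any test
  # alert mute production alerts that have the same name. Route test alerts
  # to a receiver with no notification configs, or create a silence.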
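
An inhibit rule whose source side is only env=test, combined with equal: ['alertname'], lets a test alert mute same-named alerts from every environment. If the goal is simply to keep test-environment alerts from notifying anyone, one option is a route that hands them to a receiver with no notification configs. This is a sketch: the blackhole receiver name is hypothetical, and the email receiver assumes the SMTP settings from 6.4.1.

```yaml
route:
  receiver: 'default-receiver'
  routes:
    # Test-environment alerts stop here and produce no notification
    - matchers:
        - env = "test"
      receiver: 'blackhole'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@example.com'
  # A receiver without any *_configs accepts alerts and sends nothing
  - name: 'blackhole'
```

Place this route before any other child route that could match the same alerts, because child routes are evaluated in order.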

6.5.3 Alert Silences

# Create a silence through the v2 API
# (the timestamps are examples; endsAt must lie in the future)
curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [
      {"name": "instance", "value": "server1", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-02T00:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'
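
A silence created through the API covers one fixed window. For a maintenance window that recurs, Alertmanager 0.24 and later can mute a route on a schedule instead. The fragment below is a sketch under assumptions: the interval name, the schedule, and the instance matcher are illustrative, and times are interpreted as UTC unless a location is configured.

```yaml
time_intervals:
  - name: 'weekend-maintenance'      # hypothetical interval name
    time_intervals:
      - weekdays: ['saturday', 'sunday']
        times:
          - start_time: '01:00'
            end_time: '03:00'

route:
  receiver: 'default-receiver'
  routes:
    # Alerts for server1 are not notified during the interval above
    - matchers:
        - instance = "server1"
      receiver: 'default-receiver'
      mute_time_intervals:
        - 'weekend-maintenance'
```

Muting can only be set on child routes, not on the root route.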

6.5.4 Alert Routing

Root Route
     ├── severity=critical ───▶ Critical Handler
     │                             │
     │                             └── notify-critical
     ├── severity=warning ───▶ Warning Handler
     │                             │
     │                             ├── service=api ──▶ API Team
     │                             ├── service=web ──▶ Web Team
     │                             └── default ───▶ On-call
     └── default ─────────────▶ Default Handler
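
The routing tree above could be written roughly as follows. This is one possible reading of the diagram: the receiver names are derived from its labels and are assumptions, and each receiver still needs its own notification configs.

```yaml
route:
  receiver: 'default-handler'          # root: used when no child route matches
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'notify-critical'
    - matchers:
        - severity = "warning"
      receiver: 'on-call'              # warning alerts that match no nested route
      routes:
        - matchers:
            - service = "api"
          receiver: 'api-team'
        - matchers:
            - service = "web"
          receiver: 'web-team'

receivers:
  - name: 'default-handler'
  - name: 'notify-critical'
  - name: 'on-call'
  - name: 'api-team'
  - name: 'web-team'
```

An alert walks the tree from the root, descends into the first child route whose matchers it satisfies, and is delivered by the deepest matching route.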

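6.6 Testing Alerting Rules

6.6.1 Checking Rule Syntax

# Check the syntax of the rule files
./promtool check rules rules/*.yml

# Check the main configuration, including the rule files it references
./promtool check config prometheus.yml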
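
Syntax checks do not show whether a rule fires at the right time. promtool can also run unit tests against synthetic series, executed with promtool test rules followed by the test file. The sketch below assumes a hypothetical rule file instance_down.yml stored next to the test file and containing only the rule shown in the comment; the rule has no annotations, so none are expected.

```yaml
# instance_down_test.yml (hypothetical file name)
# Assumed content of instance_down.yml:
#   groups:
#     - name: example
#       rules:
#         - alert: InstanceDown
#           expr: up == 0
#           for: 5m
#           labels:
#             severity: critical
rule_files:
  - instance_down.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # '0x10' expands to eleven samples of 0, one per minute
      - series: 'up{job="node", instance="server1:9100"}'
        values: '0x10'
    alert_rule_test:
      # After 3 minutes the alert is still pending, so nothing is firing
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts: []
      # After 6 minutes the 5m "for" duration has elapsed and the alert fires
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: server1:9100
```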

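6.6.2 Testing Expressions

# Test interactively in the web UI
# Open http://localhost:9090/graph

# Test through the HTTP API (URL-encode the expression)
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up == 0'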

6.6.3 Simulating an Alert

# Send a test alert to Alertmanager
amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  alertname=test \
  severity=warning \
  instance=localhost

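6.7 Alerting Best Practices

6.7.1 Alert Design Principles

Principle     Description
Actionable    Every alert should call for a concrete action
Low noise     Cut duplicate and meaningless alerts
Tiered        Distinguish critical, warning, and info severities
Rich context  Include enough information to diagnose the problem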

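6.7.2 Alert Template

# In rule annotations, $labels holds the labels of the series returned by expr.
# It does not include alertname, so write the alert name literally
# (HighCPUUsage is used here as an example).
annotations:
  summary: "HighCPUUsage on {{ $labels.instance }}"
  description: |
    HighCPUUsage alert details:

    - Instance: {{ $labels.instance }}
    - Cluster: {{ $labels.cluster }}
    - Service: {{ $labels.service }}
    - Current value: {{ $value | printf "%.2f" }}

    📊 Dashboard: https://grafana.com/d/{{ $labels.dashboard_id }}
    📖 Runbook: https://wiki.example.com/runbook/HighCPUUsage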
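
The alert name is not part of $labels while Prometheus renders rule annotations, but it is available once the alert reaches Alertmanager. A notification template there can therefore build a generic title from the alert name and reuse the annotations written in the rule. The receiver below is a sketch: its name and channel are assumptions, and the api_url is the placeholder used earlier in this chapter.

```yaml
receivers:
  - name: 'slack-templated'            # hypothetical receiver name
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        send_resolved: true
        # .Status is "firing" or "resolved"; .CommonLabels holds the labels
        # shared by every alert in the group
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        # One line per alert, taken from the summary annotation of the rule
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
```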

6.7.3 Common Issues

# Alert storms: batch related alerts with group_by and group_interval
# Duplicate notifications: set a reasonable repeat_interval
route:
  group_by: ['alertname', 'cluster']
  group_interval: 5m
  repeat_interval: 12h  # 4 to 12 hours is a common choice in production

# Resolved notifications
receivers:
  - name: 'default'
    slack_configs:
      - send_resolved: true  # notify when the alert is resolved

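6.8 Chapter Summary

This chapter walked through the configuration of the Prometheus alerting pipeline:

  1. Alertmanager architecture - grouping, inhibition, and silencing
  2. Alerting rules - basic syntax and common rule examples
  3. Alertmanager configuration - routes, receivers, and notifications
  4. Advanced features - inhibition, silences, and route matching
  5. Testing and debugging - rule validation and test alerts
  6. Best practices - alert design principles and templates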
