Chapter 8: Alerting Rules

Learn how to configure Loki alerting rules so that alerts are driven by log data.

Last updated: 2024-01-22

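Loki Alerting Rules

Loki's Ruler component evaluates alerting rules written in LogQL, so it can detect abnormal conditions in log data and forward the resulting alerts to Alertmanager.

Alerting Overview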

┌─────────────────────────────────────────────────────────────────┐
│                      Loki alerting flow                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Loki Ruler                                                    │
│   │                                                             │
│   ├── evaluate alerting rules                                   │
│   │    ↓                                                        │
│   │  ┌───────────────────┐                                      │
│   │  │  LogQL query      │                                      │
│   │  │  rate(errors[5m]) │                                      │
│   │  └─────────┬─────────┘                                      │
│   │            ↓                                                │
│   │  ┌───────────────────┐                                      │
│   │  │  Condition check  │                                      │
│   │  │  > 10             │                                      │
│   │  └─────────┬─────────┘                                      │
│   │            ↓ (fires)                                        │
│   │  ┌───────────────────┐                                      │
│   │  │  Alertmanager     │                                      │
│   │  │  send notification│                                      │
│   │  └───────────────────┘                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

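Rule File Structure

# /etc/loki/rules/fake/app.yaml
# With local rule storage, files live under <directory>/<tenant_id>/;
# the tenant ID is "fake" when auth_enabled is false.
groups:
  - name: application
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (
            rate({job="myapp"} |= "error" [5m])
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | printf \"%.2f\" }} lines per second"

      - alert: ServiceDown
        # A selector that matches no log lines returns no series at all, so
        # "rate(...) == 0" never fires; absent_over_time covers that case.
        expr: |
          absent_over_time({job="myapp"}[5m])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "No logs received from {{ $labels.job }}"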

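Complete Configuration Example

Loki Ruler Configuration

# /etc/loki/config.yaml
ruler:
  enable_api: true
  enable_sharding: true
  evaluation_interval: 10s
  poll_interval: 10s

  # Rule storage. With local storage, rule files must sit in a per-tenant
  # subdirectory: /etc/loki/rules/<tenant_id>/ ("fake" when auth_enabled is false).
  storage:
    type: local
    local:
      directory: /etc/loki/rules

  # Or use object storage
  # storage:
  #   type: s3
  #   s3:
  #     bucketnames: loki-rules
  #     endpoint: s3.amazonaws.com

  # Alertmanager settings
  alertmanager_url: http://alertmanager:9093
  enable_alertmanager_discovery: false
  alertmanager_refresh_interval: 30s

  # Notification queue
  notification_queue_capacity: 10000
  notification_timeout: 10s

  # Write-ahead log (used by recording rules with remote write)
  wal:
    dir: /var/loki/wal
    truncate_frequency: 60s
    min_age: 5m
    max_age: 4h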

Alerting Rules in Detail

Basic Structure

groups:
  - name: <group_name>
    interval: <duration>
    rules:
      - alert: <alert_name>
        expr: <logql_expr>
        for: <duration>
        labels:
          <label>: <value>
        annotations:
          <annotation>: <value>

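Rule Fields

Field        Required  Description
alert        Yes       Name of the alert
expr         Yes       LogQL metric query that defines the alert condition
for          No        How long the condition must hold before the alert fires (default: 0)
labels       No        Extra labels attached to the alert, typically used for routing
annotations  No        Descriptive text such as summary and description; supports templates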
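
As a rough illustration of which fields are required, the following Python sketch checks an already-parsed rule group for the two mandatory fields. The `validate_group` helper and the dict layout are assumptions made for this example; they mirror the YAML structure but are not part of Loki, which performs its own validation when it loads rules.

```python
REQUIRED_FIELDS = ("alert", "expr")


def validate_group(group):
    """Return a list of human-readable problems found in one rule group."""
    problems = []
    if not group.get("name"):
        problems.append("group is missing a name")
    for index, rule in enumerate(group.get("rules", [])):
        # Fall back to the rule's position when it has no alert name.
        label = rule.get("alert") or f"rule #{index}"
        for field in REQUIRED_FIELDS:
            if not rule.get(field):
                problems.append(f"{label}: missing required field '{field}'")
    return problems


# Hypothetical group, written as the dict a YAML parser would produce.
group = {
    "name": "application",
    "rules": [
        {"alert": "HighErrorRate", "expr": 'rate({job="myapp"} |= "error" [5m]) > 0.01', "for": "5m"},
        {"alert": "NoExpression", "labels": {"severity": "warning"}},
    ],
}

for problem in validate_group(group):
    print(problem)
```

The second rule is reported because it has no expr; the optional fields (for, labels, annotations) are never flagged.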

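Alert Conditions

The for Parameter

The for field sets how long the expression must stay true before a pending alert starts firing.

# No delay: fires at the first evaluation where the condition is true
- alert: ImmediateAlert
  expr: rate({job="myapp"}[5m]) > 100
  for: 0m

# Fires once the condition has held for 5 minutes
- alert: DelayedAlert
  expr: rate({job="myapp"}[5m]) > 100
  for: 5m

# Fires once the condition has held for 10 minutes
- alert: LongDelayAlert
  expr: rate({job="myapp"}[5m]) > 100
  for: 10m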
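
To make the effect of for more concrete, here is a simplified Python model of the inactive, pending, and firing states. It assumes evaluations happen at a fixed interval and expresses for as a number of intervals; the `alert_states` function is a hypothetical helper for illustration, not how the ruler is implemented (the real ruler compares timestamps).

```python
def alert_states(condition_results, for_evaluations):
    """Map per-evaluation condition results to alert states.

    for_evaluations is the `for` duration expressed in evaluation
    intervals (for example, for=5m with interval=1m gives 5).
    """
    states = []
    held = None  # intervals elapsed since the condition became true; None = inactive
    for is_true in condition_results:
        if not is_true:
            # Any false evaluation resets the alert completely.
            held = None
            states.append("inactive")
            continue
        held = 0 if held is None else held + 1
        states.append("firing" if held >= for_evaluations else "pending")
    return states


results = [True, True, False, True, True, True, True]
print(alert_states(results, 0))  # for: 0m
print(alert_states(results, 2))  # for: 2 evaluation intervals
```

With for set to 0 every true evaluation is firing immediately. With a non-zero for, the single false evaluation in the middle resets the timer, which is why short dips in the condition delay or prevent firing.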

Labels and Annotations

- alert: HighErrorRate
  expr: |
    sum by (job, namespace) (
      rate({job="myapp"} |= "error" [5m])
    ) > 0.05
  for: 5m
  labels:
    severity: critical
    team: platform
    category: availability
  annotations:
    summary: "High error rate in {{ $labels.job }}"
    description: |
      Error rate is {{ $value | printf "%.2f" }} lines per second
      Job: {{ $labels.job }}
      Namespace: {{ $labels.namespace }}
      Current value: {{ $value }}

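Template Variables

Available Variables

Variable      Description
$labels       Labels of the alert
$value        Value of the expression at evaluation time
$externalURL  External URL configured for the ruler

Examples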

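annotations:
  # Access a label value
  summary: "Error in {{ $labels.service }}"

  # Format a number to two decimal places ($value is the raw query result, not a percentage)
  description: "Error rate: {{ $value | printf \"%.2f\" }}"

  # Conditional text (uses its own key, because a YAML mapping cannot repeat "summary")
  severity_note: |
    {{ if gt $value 1.0 }}
    Critical error rate
    {{ else }}
    Warning error rate
    {{ end }}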

Practical Alerting Rules

Application Alerts

# /etc/loki/rules/fake/application.yaml ("fake" is the tenant ID when auth_enabled is false)
groups:
  - name: application-alerts
    interval: 1m
    rules:
      # High error ratio: error lines divided by all lines, a fraction between 0 and 1
      - alert: ApplicationHighErrorRate
        expr: |
          sum by (service, namespace) (
            rate({namespace="production"} |= "error" [5m])
          ) / sum by (service, namespace) (
            rate({namespace="production"} [5m])
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Application error rate exceeds 5%"
          description: "{{ $labels.service }} in {{ $labels.namespace }} has error rate of {{ $value | humanizePercentage }}"

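      # Service not responding. A stream with no entries in the window returns
      # no series, so comparing a rate to 0 never matches. absent_over_time
      # needs explicit matchers; "myapp" is a placeholder, add one rule per service.
      - alert: ServiceNotResponding
        expr: |
          absent_over_time({namespace="production", service="myapp"}[5m])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service not responding"
          description: "{{ $labels.service }} in {{ $labels.namespace }} is not sending logs"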

      # Error burst (LogQL has no increase(); count_over_time counts lines in the window)
      - alert: ErrorBurst
        expr: |
          sum by (service) (
            count_over_time({namespace="production"} |= "error" [5m])
          ) > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Burst of errors detected"
          description: "{{ $labels.service }} has {{ $value }} errors in the last 5 minutes"
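
The error-ratio rule in this file divides two per-second rates, so its value is a fraction between 0 and 1 rather than a percentage. The short Python calculation below walks through the arithmetic with made-up line counts; `per_second_rate` is a hypothetical helper that only approximates what rate() computes over a 5-minute window.

```python
WINDOW_SECONDS = 5 * 60  # corresponds to the [5m] range in the rule


def per_second_rate(line_count, window_seconds=WINDOW_SECONDS):
    """Approximate rate(): matching log lines per second over the window."""
    return line_count / window_seconds


error_lines = 45   # hypothetical: lines containing "error" in the window
total_lines = 600  # hypothetical: all lines in the window

error_ratio = per_second_rate(error_lines) / per_second_rate(total_lines)
print(f"ratio   = {error_ratio:.3f}")
print(f"percent = {error_ratio * 100:.1f}%")
print("condition (> 0.05) met:", error_ratio > 0.05)
```

Because the window length cancels out in the division, the threshold 0.05 means "5% of lines are errors" regardless of the range used.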

Infrastructure Alerts

# /etc/loki/rules/fake/infrastructure.yaml ("fake" is the tenant ID when auth_enabled is false)
groups:
  - name: infrastructure-alerts
    interval: 1m
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          sum by (instance) (
            rate({job="node"} |= "cpu" | json | usage > 90 [5m])
          ) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 90% for 5 minutes"

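      # Low disk space. Alerting rules need a metric query, so count the
      # matching lines instead of returning them.
      - alert: LowDiskSpace
        expr: |
          sum by (instance) (
            count_over_time({job="node"} |= "disk" | json | free < 10 [5m])
          ) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Free disk space is below 10%"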

      # Possible memory leak. Numeric fields must be extracted with unwrap
      # before they can be aggregated.
      - alert: MemoryLeak
        expr: |
          (
            avg_over_time({job="node"} |= "memory" | json | unwrap usage [1h]) by (instance)
          /
            avg_over_time({job="node"} |= "memory" | json | unwrap total [1h]) by (instance)
          ) > 0.9
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Potential memory leak on {{ $labels.instance }}"

Security Alerts

# /etc/loki/rules/fake/security.yaml ("fake" is the tenant ID when auth_enabled is false)
groups:
  - name: security-alerts
    interval: 1m
    rules:
      # Failed logins (count_over_time, so $value is the number of attempts in 5 minutes)
      - alert: FailedLoginAttempts
        expr: |
          sum by (user, source_ip) (
            count_over_time({job="auth"} |= "login failed" [5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Multiple failed login attempts"
          description: "{{ $value }} failed login attempts from {{ $labels.source_ip }} for user {{ $labels.user }}"

      # SQL injection
      - alert: SQLInjectionAttempt
        expr: |
          sum by (source_ip) (
            rate({job="api"} |= "sql injection" [5m])
          ) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "SQL injection attempt detected"
          description: "SQL injection attempt from {{ $labels.source_ip }}"

      # Unusual access pattern
      - alert: UnusualAccessPattern
        expr: |
          sum by (source_ip) (
            rate({job="api"} |~ "401|403" [5m])
          ) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual access pattern from {{ $labels.source_ip }}"

Alertmanager Integration

Alertmanager Configuration

# /etc/alertmanager/alertmanager.yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'loki-alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'critical-notifications'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-notifications'

receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          subject: '[Loki] {{ .GroupLabels.alertname }}'

  - name: 'critical-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#critical'
        title: 'Critical Alert'
        text: '{{ .CommonAnnotations.summary }}'

  - name: 'warning-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#warnings'
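
To show how the route block above selects a receiver, here is a simplified Python model of the matching logic. It assumes a single level of child routes and equality matching only; `matches` and `receivers_for` are hypothetical helpers for illustration, not Alertmanager code.

```python
def matches(route, labels):
    """A route matches when every label in its `match` block has the same value."""
    return all(labels.get(key) == value for key, value in route.get("match", {}).items())


def receivers_for(root, labels):
    """Return the receivers that would be notified for an alert with these labels."""
    selected = []
    for route in root.get("routes", []):
        if matches(route, labels):
            selected.append(route["receiver"])
            # Without `continue: true`, matching stops at the first hit.
            if not route.get("continue", False):
                break
    # If no child route matches, the alert falls back to the root receiver.
    return selected or [root["receiver"]]


# The routing tree from the configuration above, as a dict.
root = {
    "receiver": "email",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-notifications", "continue": True},
        {"match": {"severity": "warning"}, "receiver": "warning-notifications"},
    ],
}

print(receivers_for(root, {"alertname": "HighErrorRate", "severity": "critical"}))
print(receivers_for(root, {"alertname": "ErrorBurst", "severity": "warning"}))
print(receivers_for(root, {"alertname": "Other", "severity": "info"}))
```

In this model, continue: true on the critical route only lets later sibling routes be checked. Since no later route matches critical alerts, they reach critical-notifications only, and the email receiver is used just for alerts that match neither child route.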

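Rule Management API

Listing Rules

# List all rules
curl http://loki:3100/prometheus/api/v1/rules

# List alerting rules only (type can be alert or record)
curl 'http://loki:3100/prometheus/api/v1/rules?type=alert'

# List the rule groups in a specific namespace
curl http://loki:3100/loki/api/v1/rules/<namespace>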
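
The JSON returned by the rules endpoint can also be processed in a script. The Python sketch below summarizes the state of each alerting rule. It assumes the Prometheus-style response shape (data.groups[].rules[] with type, name, and state fields) and the same http://loki:3100 address as the curl commands; the sample payload is hand-written for illustration, not real output, and a multi-tenant setup would additionally need a tenant header.

```python
import json
from urllib.request import urlopen


def fetch_rules(base_url="http://loki:3100"):
    """Fetch rule groups from the ruler's Prometheus-compatible endpoint."""
    with urlopen(f"{base_url}/prometheus/api/v1/rules", timeout=10) as response:
        return json.load(response)


def summarize(payload):
    """Return (group, rule, state) tuples for every alerting rule in the payload."""
    rows = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                rows.append((group.get("name"), rule.get("name"), rule.get("state")))
    return rows


# Hand-written sample in the assumed response shape (not real output).
sample = {
    "status": "success",
    "data": {"groups": [{
        "name": "application",
        "rules": [
            {"type": "alerting", "name": "HighErrorRate", "state": "firing"},
            {"type": "alerting", "name": "ServiceDown", "state": "inactive"},
        ],
    }]},
}

for row in summarize(sample):
    print(row)

# Against a live instance: summarize(fetch_rules())
```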

Testing Rules

# Run the rule expression with LogCLI
logcli query '<rule_expression>'

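Common Issues

Alert Does Not Fire

  1. Check the rule syntax, and confirm that expr is a metric query rather than a plain log query.
  2. Confirm that logs are actually arriving by running the stream selector on its own.
  3. Check the for parameter: the condition must hold for the whole duration before the alert fires.
  4. Verify that the ruler can reach Alertmanager at the configured alertmanager_url.

Duplicate Alerts

  • Adjust group_interval in the Alertmanager route
  • Choose group_by labels that place related alerts in the same notification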
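
The effect of the grouping labels can be sketched in a few lines of Python. The example below buckets alerts by their group_by label values, which is roughly how Alertmanager decides which alerts share one notification. The `group_alerts` helper and the sample alerts are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with identical values for all group_by labels share a bucket.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts, each represented by its label set.
alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "service": "cart"},
    {"alertname": "HighErrorRate", "severity": "critical", "service": "checkout"},
    {"alertname": "ErrorBurst", "severity": "warning", "service": "cart"},
]

by_name = group_alerts(alerts, ["alertname", "severity"])
by_service = group_alerts(alerts, ["alertname", "severity", "service"])
print(len(by_name), "notification groups")
print(len(by_service), "notification groups")
```

Adding a label such as service to group_by produces more, smaller groups; removing labels merges alerts into fewer notifications.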

Next Steps

Next, we will look at Grafana integration.
