Chapter 8: Alerting Rules
Learn how to configure Loki alerting rules so that alerts are driven by logs.
Last updated: 2024-01-22
Loki Alerting Rules
Loki's Ruler component supports alerting rules written in LogQL, which makes it possible to detect anomalies in logs.
Alerting Overview
┌────────────────────────────────────────────────┐
│               Loki alerting flow               │
├────────────────────────────────────────────────┤
│                                                │
│  Loki Ruler                                    │
│   │                                            │
│   ├── Evaluate alert rules                     │
│   │             ↓                              │
│   │   ┌────────────────────┐                   │
│   │   │ LogQL query        │                   │
│   │   │ rate(errors[5m])   │                   │
│   │   └─────────┬──────────┘                   │
│   │             ↓                              │
│   │   ┌────────────────────┐                   │
│   │   │ Condition check    │                   │
│   │   │ > 10               │                   │
│   │   └─────────┬──────────┘                   │
│   │             ↓ (fires)                      │
│   │   ┌────────────────────┐                   │
│   │   │ Alertmanager       │                   │
│   │   │ Send notifications │                   │
│   │   └────────────────────┘                   │
│                                                │
└────────────────────────────────────────────────┘
Rule File Structure
# /etc/loki/rules/app.yaml
groups:
  - name: application
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (
            rate({job="myapp"} |= "error" [5m])
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | printf \"%.2f\" }} lines/s"
      - alert: ServiceDown
        # rate() returns an empty result (not 0) when no logs arrive, so
        # "== 0" would never fire. absent_over_time returns 1 in that case
        # and keeps the job label from the selector.
        expr: |
          absent_over_time({job="myapp"}[5m])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "No logs received from {{ $labels.job }}"
Complete Configuration Example
Loki Ruler Configuration
# /etc/loki/config.yaml
ruler:
  enable_api: true
  enable_sharding: true
  evaluation_interval: 10s
  poll_interval: 10s
  # Rule storage. With local storage, Loki looks for rule files under
  # <directory>/<tenant>/; the tenant is "fake" when auth_enabled is false.
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  # Or use object storage
  # storage:
  #   type: s3
  #   s3:
  #     bucketnames: loki-rules
  #     endpoint: s3.amazonaws.com
  # Alertmanager settings
  alertmanager_url: http://alertmanager:9093
  enable_alertmanager_discovery: false
  alertmanager_refresh_interval: 30s
  # Notification queue settings
  notification_queue_capacity: 10000
  notification_timeout: 10s
  # Write-ahead log (WAL) settings
  wal:
    dir: /var/loki/wal
    truncate_frequency: 60s
    min_age: 5m
    max_age: 4h
Alerting Rules in Detail
Basic Structure
groups:
  - name: <group_name>
    interval: <duration>
    rules:
      - alert: <alert_name>
        expr: <logql_expr>
        for: <duration>
        labels:
          <label>: <value>
        annotations:
          <annotation>: <value>
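Rule Fields
| Field | Required | Description |
|---|---|---|
| `alert` | Yes | Alert name |
| `expr` | Yes | LogQL expression (must be a metric query) |
| `for` | No | How long the condition must hold before the alert fires |
| `labels` | No | Labels attached to the alert |
| `annotations` | No | Descriptive information, such as a summary and description |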
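The field table above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of Loki: it assumes the rule file has already been parsed into Python dictionaries (for example with a YAML library), and the names `validate_rule` and `validate_groups` are hypothetical helpers introduced here for illustration. It only checks the required and mapping-typed fields listed in the table; it does not validate LogQL syntax.

```python
REQUIRED_FIELDS = ("alert", "expr")
MAPPING_FIELDS = ("labels", "annotations")


def validate_rule(rule):
    """Return a list of problems found in one alerting rule mapping."""
    problems = []
    # alert and expr must be present and non-empty strings.
    for field in REQUIRED_FIELDS:
        value = rule.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing required field: {field}")
    # labels and annotations are optional, but must be mappings when present.
    for field in MAPPING_FIELDS:
        if field in rule and not isinstance(rule[field], dict):
            problems.append(f"{field} must be a mapping")
    return problems


def validate_groups(doc):
    """Validate every rule of a parsed rule file; return only the failures."""
    report = {}
    for group in doc.get("groups", []):
        for index, rule in enumerate(group.get("rules", [])):
            problems = validate_rule(rule)
            if problems:
                name = rule.get("alert", f"rule #{index}")
                report[f"{group.get('name')}/{name}"] = problems
    return report


# Hypothetical parsed rule file: one valid rule and one broken rule.
example = {
    "groups": [{
        "name": "application",
        "interval": "1m",
        "rules": [
            {"alert": "HighErrorRate",
             "expr": 'sum(rate({job="myapp"} |= "error" [5m])) > 0.01',
             "for": "5m"},
            {"alert": "Broken", "labels": "critical"},
        ],
    }]
}
print(validate_groups(example))
```

A check like this only catches structural mistakes early; the Ruler remains the authority on whether an expression is valid.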
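Alert Conditions
The for Parameter
# Fires as soon as the condition is met (no delay)
- alert: ImmediateAlert
  expr: rate({job="myapp"}[5m]) > 100
  for: 0m
# Fires once the condition has held for 5 minutes
- alert: DelayedAlert
  expr: rate({job="myapp"}[5m]) > 100
  for: 5m
# Fires once the condition has held for 10 minutes
- alert: LongDelayAlert
  expr: rate({job="myapp"}[5m]) > 100
  for: 10m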
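To make the effect of `for` concrete, here is a simplified model of how an alert moves between states across evaluations. It assumes Prometheus-style semantics (an alert is pending while the condition holds and becomes firing once it has held for at least the `for` duration), and it ignores details such as grace periods and restarts. The function name `alert_states` is hypothetical.

```python
def alert_states(condition_by_eval, for_seconds, interval_seconds):
    """Return the alert state after each evaluation.

    condition_by_eval: one boolean per evaluation, True when the rule
    expression returned a result at that evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for step, condition in enumerate(condition_by_eval):
        now = step * interval_seconds
        if not condition:
            # The condition cleared, so the "for" timer is reset.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# for: 0m fires on the first evaluation that matches.
print(alert_states([True, True, True], for_seconds=0, interval_seconds=60))
# for: 5m with a 1m interval stays pending until the condition has held 5 minutes.
print(alert_states([True] * 7, for_seconds=300, interval_seconds=60))
# A single evaluation without a match resets the timer.
print(alert_states([True, True, False, True], for_seconds=120, interval_seconds=60))
```

The third call illustrates why a flapping condition may never fire with a long `for` value: each gap restarts the countdown.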
Labels and Annotations
- alert: HighErrorRate
  expr: |
    sum by (job, namespace) (
      rate({job="myapp"} |= "error" [5m])
    ) > 0.05
  for: 5m
  labels:
    severity: critical
    team: platform
    category: availability
  annotations:
    summary: "High error rate in {{ $labels.job }}"
    description: |
      Error rate is {{ $value | printf "%.2f" }} lines/s
      Job: {{ $labels.job }}
      Namespace: {{ $labels.namespace }}
      Current value: {{ $value }}
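Template Variables
Available Variables
| Variable | Description |
|---|---|
| `$labels` | Labels of the alert |
| `$value` | Current value of the alert expression |
| `$externalURL` | External URL of Loki |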
Examples
annotations:
  # Access a label value
  summary: "Error in {{ $labels.service }}"
  # Format a number
  description: "Error rate: {{ $value | printf \"%.2f\" }} lines/s"
  # Conditional formatting (uses its own key, because YAML keys must be unique)
  level: |
    {{ if gt $value 1.0 }}
    Critical error rate
    {{ else }}
    Warning error rate
    {{ end }}
Practical Alerting Rules
Application Alerts
# /etc/loki/rules/application.yaml
groups:
  - name: application-alerts
    interval: 1m
    rules:
      # High error rate (the ratio is multiplied by 100 so $value is a percentage)
      - alert: ApplicationHighErrorRate
        expr: |
          sum by (service, namespace) (
            rate({namespace="production"} |= "error" [5m])
          ) / sum by (service, namespace) (
            rate({namespace="production"} [5m])
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Application error rate exceeds 5%"
          description: "{{ $labels.service }} in {{ $labels.namespace }} has error rate of {{ $value | printf \"%.2f\" }}%"
      # Service not responding
      # Caveat: a service that stops logging entirely drops out of the result
      # instead of reporting 0. To catch that case, use absent_over_time with
      # a selector for the specific service.
      - alert: ServiceNotResponding
        expr: |
          sum by (service, namespace) (
            rate({namespace="production"} [5m])
          ) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service not responding"
          description: "{{ $labels.service }} in {{ $labels.namespace }} is not receiving logs"
      # Error burst (LogQL has no increase(); count_over_time counts the lines)
      - alert: ErrorBurst
        expr: |
          sum by (service) (
            count_over_time({namespace="production"} |= "error" [5m])
          ) > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Burst of errors detected"
          description: "{{ $labels.service }} has {{ $value }} errors in the last 5 minutes"
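The error-rate rule above divides one `rate()` result by another. As a worked example of that arithmetic, the sketch below uses made-up line counts for a single service over a 5-minute window and compares the result against a 5% threshold. The counts and function names are illustrative assumptions, not output from Loki.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range used in the rule


def per_second_rate(line_count, window_seconds=WINDOW_SECONDS):
    """rate() over a log range: matching lines divided by the window length."""
    return line_count / window_seconds


def error_ratio(error_lines, total_lines):
    """Ratio of the two rates; the window length cancels out."""
    return per_second_rate(error_lines) / per_second_rate(total_lines)


# Hypothetical counts for one service over the last 5 minutes.
ratio = error_ratio(error_lines=180, total_lines=3000)
percent = ratio * 100
fires = percent > 5
print(f"ratio={ratio:.2f} percent={percent:.1f}% fires={fires}")
```

Because the quotient is a fraction between 0 and 1, it has to be multiplied by 100 before it can be reported with a percent sign; a ratio of 0.06 corresponds to 6%, not 0.06%.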
Infrastructure Alerts
# /etc/loki/rules/infrastructure.yaml
groups:
  - name: infrastructure-alerts
    interval: 1m
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          sum by (instance) (
            rate({job="node"} |= "cpu" | json | usage > 90 [5m])
          ) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 90% for 5 minutes"
      # Low disk space
      # An alert expression must be a metric query, so count the matching lines.
      - alert: LowDiskSpace
        expr: |
          sum by (instance) (
            count_over_time({job="node"} |= "disk" | json | free < 10 [5m])
          ) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Free disk space is below 10%"
      # Memory leak
      # unwrap turns the extracted JSON field into the sample value.
      - alert: MemoryLeak
        expr: |
          (avg by (instance) (
            avg_over_time({job="node"} |= "memory" | json | unwrap usage [1h])
          ) / avg by (instance) (
            avg_over_time({job="node"} |= "memory" | json | unwrap total [1h])
          )) > 0.9
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Potential memory leak on {{ $labels.instance }}"
Security Alerts
# /etc/loki/rules/security.yaml
# These rules assume that user and source_ip are available as labels; if they
# are only present in the log line, add a parser stage such as json or logfmt.
groups:
  - name: security-alerts
    interval: 1m
    rules:
      # Failed logins
      - alert: FailedLoginAttempts
        expr: |
          sum by (user, source_ip) (
            rate({job="auth"} |= "login failed" [5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Multiple failed login attempts"
          description: "{{ $value | printf \"%.2f\" }} failed login attempts per second from {{ $labels.source_ip }} for user {{ $labels.user }}"
      # SQL injection
      - alert: SQLInjectionAttempt
        expr: |
          sum by (source_ip) (
            rate({job="api"} |= "sql injection" [5m])
          ) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "SQL injection attempt detected"
          description: "SQL injection attempt from {{ $labels.source_ip }}"
      # Unusual access
      - alert: UnusualAccessPattern
        expr: |
          sum by (source_ip) (
            rate({job="api"} |~ "401|403" [5m])
          ) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual access pattern from {{ $labels.source_ip }}"
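One thing to keep in mind with the `|~ "401|403"` filter is that line-filter regular expressions are unanchored, so the pattern matches those digits anywhere in the line. The sketch below reproduces the effect with Python's `re` module on hypothetical access-log lines; the log format and the stricter `status=(401|403)` pattern are assumptions made for illustration and would need adapting to the real log format.

```python
import re

LOOSE = re.compile(r"401|403")              # same pattern as the rule's |~ filter
STRICT = re.compile(r"status=(401|403)\b")  # hypothetical format-specific pattern

# Hypothetical access-log lines.
lines = [
    "GET /admin status=403 bytes=512",
    "GET /login status=401 bytes=128",
    "GET /report status=200 bytes=14013",  # "401" appears inside the byte count
]

loose_hits = [bool(LOOSE.search(line)) for line in lines]
strict_hits = [bool(STRICT.search(line)) for line in lines]
print(loose_hits)
print(strict_hits)
```

The loose pattern also counts the successful request because its byte count happens to contain `401`, which would inflate the rate the alert is based on.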
Alertmanager Integration
Alertmanager Configuration
# /etc/alertmanager/alertmanager.yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'loki-alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'critical-notifications'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-notifications'
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        headers:
          subject: '[Loki] {{ .GroupLabels.alertname }}'
  - name: 'critical-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#critical'
        title: 'Critical Alert'
        text: '{{ .CommonAnnotations.summary }}'
  - name: 'warning-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#warnings'
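To see which receiver an alert ends up at, the routing tree above can be modelled in a few lines. This is a simplified sketch of Alertmanager's matching for a one-level tree only: child routes are tried in order, matching stops at the first match unless `continue` is true, and the parent receiver is used only when no child route matches. The function name `receivers_for` is hypothetical.

```python
# The two child routes and the default receiver from the configuration above.
ROUTES = [
    {"match": {"severity": "critical"},
     "receiver": "critical-notifications", "continue": True},
    {"match": {"severity": "warning"},
     "receiver": "warning-notifications", "continue": False},
]
DEFAULT_RECEIVER = "email"


def receivers_for(labels):
    """Return the receivers a one-level routing tree selects for an alert."""
    selected = []
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            selected.append(route["receiver"])
            if not route["continue"]:
                break  # stop at the first match unless continue is set
    # The parent receiver only handles alerts that matched no child route.
    return selected or [DEFAULT_RECEIVER]


print(receivers_for({"alertname": "HighErrorRate", "severity": "critical"}))
print(receivers_for({"alertname": "ErrorBurst", "severity": "warning"}))
print(receivers_for({"alertname": "Other", "severity": "info"}))
```

Under this model, the `email` receiver only sees alerts whose severity is neither `critical` nor `warning`, and `continue: true` has no visible effect here because a critical alert cannot also match the warning route.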
Rule Management API
Listing Rules
# List all rules
curl http://loki:3100/prometheus/api/v1/rules
# List alerting rules only
curl 'http://loki:3100/prometheus/api/v1/rules?type=alert'
Testing Rules
# Test the expression with LogCLI
logcli query '<rule_expression>'
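The same endpoint can be queried from a script to summarize rule states. The sketch below assumes the response follows the Prometheus rules API shape (`data.groups[].rules[]` with `name`, `type`, and `state` fields); the sample payload is hand-written for illustration, not real Loki output, and `fetch_rules` and `summarize` are hypothetical helper names. A multi-tenant setup would additionally need a tenant header on the request.

```python
import json
import urllib.request


def fetch_rules(base_url="http://loki:3100"):
    """GET the Prometheus-compatible rules endpoint and decode the JSON body."""
    url = f"{base_url}/prometheus/api/v1/rules"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize(payload):
    """Return (group, rule name, state) for every alerting rule in the payload."""
    rows = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                rows.append((group.get("name"), rule.get("name"), rule.get("state")))
    return rows


# Hand-written sample in the assumed response shape (not real output).
sample = {
    "status": "success",
    "data": {
        "groups": [{
            "name": "application",
            "rules": [
                {"name": "HighErrorRate", "type": "alerting", "state": "firing"},
                {"name": "myapp:errors:rate5m", "type": "recording"},
            ],
        }]
    },
}
print(summarize(sample))
```

Against a live instance, `summarize(fetch_rules())` would produce the same kind of listing, with the base URL adjusted to the deployment.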
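Common Issues
Alerts Not Firing
- Check the rule syntax
- Confirm that logs are still being ingested
- Check the `for` parameter
- Verify the connection to Alertmanager
Duplicate Alerts
- Adjust `group_interval`
- Use better grouping labels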
Next Steps
Next, we will look at Grafana integration.