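Chapter 10: Recording Rules
A detailed guide to Prometheus recording rules, which precompute frequently run queries to improve query performance.
Last updated: 2024-01-01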
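10.1 Recording Rules Overview
Recording rules are precomputed PromQL expressions. Prometheus evaluates them periodically and saves the result of a complex query as a new time series, which speeds up frequently run queries and dashboards.
10.1.1 Why Use Recording Rules
| Scenario | Without rules | With rules |
|---|---|---|
| Dashboard loading | Runs the full computation on every load | Reads the precomputed result directly |
| Query response time | Typically seconds for heavy queries | Typically milliseconds |
| Prometheus query load | High | Low |
| Alert evaluation | Slow, re-runs the full expression | Fast, compares a precomputed series against a threshold |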
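10.1.2 Recording Rules vs Alerting Rules
| Feature | Recording rule | Alerting rule |
|---|---|---|
| Purpose | Precompute metrics | Trigger notifications |
| Trigger condition | None, the result is always written | The expression returns at least one series |
| Evaluated periodically | Yes | Yes (with an optional `for` duration) |
| Sends notifications | No | Yes |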
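10.2 Rule Configuration
10.2.1 Basic Syntax
# rules/recording_rules.yml
groups:
  - name: <group_name>
    interval: <duration>   # Evaluation interval; defaults to global.evaluation_interval
    limit: <number>        # Max series a recording rule (or alerts an alerting rule) may produce; 0 means no limit
    rules:
      - record: <string>   # Name of the new metric
        expr: <string>     # PromQL expression
        labels:            # Labels to add or overwrite before the result is stored
          <labelname>: <labelvalue>
        # annotations are valid only on alerting rules (alert: ...), not on recording rules:
        # annotations:
        #   <annotationname>: <template>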
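As a concrete illustration of the template above, here is a minimal sketch of one filled-in group. The file name, the group name `example_rules`, the source metric `http_requests_total`, and the `team: platform` label are all assumptions chosen for illustration.

```yaml
# rules/example.yml (hypothetical file name)
groups:
  - name: example_rules            # hypothetical group name
    interval: 30s                  # evaluate this group every 30 seconds
    rules:
      - record: job:http_requests:rate5m
        # Per-job request rate over the last 5 minutes
        expr: sum by (job) (rate(http_requests_total[5m]))
        labels:
          team: platform           # hypothetical static label added to every output series
```

After Prometheus loads the file, `job:http_requests:rate5m` can be queried like any other metric.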
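10.2.2 Naming Conventions
# Naming format: <level>:<metric_name>:<operations>
#   level      = aggregation level, i.e. the labels kept in the rule output
#   operations = operations applied to the metric, newest first
# Examples
job:http_requests:rate5m        # level:metric:operation (5-minute rate, aggregated per job)
instance:memory_usage:avg       # Memory usage averaged per instance
service:error_rate:ratio        # Error ratio per service
:kube_pod_cpu_sums:rate5m       # Empty level, no level prefix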
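To show how the level prefix tracks the labels that remain in the output, the sketch below records the same metric at two aggregation levels. The metric `http_requests_total` and its `instance`, `path`, and `job` labels are assumptions.

```yaml
groups:
  - name: naming_example           # hypothetical group name
    rules:
      # Level "instance_path": the output keeps the instance and path labels
      - record: instance_path:http_requests:rate5m
        expr: sum by (instance, path) (rate(http_requests_total[5m]))
      # Level "job": the output keeps only the job label
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Reading the name alone tells a dashboard author which labels are available for filtering.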
10.2.3 Complete Example
# rules/recording_rules.yml
groups:
  # Precompute basic node metrics
  - name: node_recording_rules
    interval: 30s
    rules:
      # Per-second CPU time rate (no aggregation, so the cpu and mode labels are kept)
      - record: instance:node_cpu:rate5m
        expr: |
          rate(node_cpu_seconds_total[5m])
      # CPU usage percentage
      - record: instance:node_cpu_usage:percent
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Memory usage percentage
      - record: instance:node_memory_usage:percent
        expr: |
          100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
      # Filesystem usage percentage
      - record: instance:node_disk_usage:percent
        expr: |
          100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))
      # Network I/O
      - record: instance:node_network_receive_bytes:rate5m
        expr: |
          rate(node_network_receive_bytes_total[5m])
      - record: instance:node_network_transmit_bytes:rate5m
        expr: |
          rate(node_network_transmit_bytes_total[5m])
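The disk rule above produces one series per mounted filesystem, including pseudo filesystems. A possible refinement is sketched below; the rule name and the list of excluded `fstype` values are assumptions and should be adapted to the hosts being monitored.

```yaml
groups:
  - name: node_filesystem_rules    # hypothetical group name
    interval: 30s
    rules:
      # Same calculation as the disk rule above, restricted to "real" filesystems
      - record: instance:node_disk_usage_filtered:percent   # hypothetical rule name
        expr: |
          100 * (1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
            /
            node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}
          ))
```

Filtering before recording also reduces the number of series the rule writes.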
10.3 Service-Level Metrics
10.3.1 HTTP Service Metrics
groups:
  - name: http_service_rules
    interval: 15s
    rules:
      # QPS
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, job) (
            rate(http_requests_total[5m])
          )
      # 5xx request rate (requests per second), by status code
      - record: service:http_errors:rate5m
        expr: |
          sum by (service, job, status) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
      # Error rate as a percentage
      - record: service:http_error_rate:percent
        expr: |
          100 * sum by (service, job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (service, job) (
            rate(http_requests_total[5m])
          )
      # P99 latency
      - record: service:http_request_duration_p99:seconds
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      # P95 latency
      - record: service:http_request_duration_p95:seconds
        expr: |
          histogram_quantile(
            0.95,
            sum by (service, job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      # P50 latency
      - record: service:http_request_duration_p50:seconds
        expr: |
          histogram_quantile(
            0.50,
            sum by (service, job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      # Average request size
      - record: service:http_request_size_bytes:avg
        expr: |
          sum by (service, job) (
            rate(http_request_size_bytes_sum[5m])
          ) / sum by (service, job) (
            rate(http_request_size_bytes_count[5m])
          )
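The three latency rules above each recompute the same bucket aggregation. One possible alternative, sketched below, records the aggregated bucket rates once and derives quantiles from that series. The group name and the intermediate rule name are assumptions; the P99 rule here would replace the one above rather than sit next to it.

```yaml
groups:
  - name: http_latency_rules       # hypothetical group name
    interval: 15s
    rules:
      # Record the aggregated bucket rates once, keeping the le label
      - record: service_job_le:http_request_duration_seconds_bucket:rate5m
        expr: |
          sum by (service, job, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
      # Derive the quantile from the recorded series.
      # Rules in a group are evaluated in order, so this rule must come second.
      - record: service:http_request_duration_p99:seconds
        expr: |
          histogram_quantile(
            0.99,
            service_job_le:http_request_duration_seconds_bucket:rate5m
          )
```

The P95 and P50 rules can read the same intermediate series, so the expensive `rate()` over the raw buckets runs once per evaluation.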
10.3.2 Kubernetes Metrics
groups:
  - name: kubernetes_recording_rules
    interval: 30s
    rules:
      # Pod CPU usage (cores)
      - record: namespace:pod_cpu_usage:sum
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m])
          )
      # Pod memory usage (working set bytes)
      - record: namespace:pod_memory_usage:sum
        expr: |
          sum by (namespace, pod) (
            container_memory_working_set_bytes{container!="", container!="POD"}
          )
      # Desired replicas per Deployment
      - record: deployment:replicas:avg
        expr: |
          avg by (namespace, deployment) (
            kube_deployment_spec_replicas
          )
      # Ratio of ready replicas to desired replicas
      - record: deployment:ready_replicas:ratio
        expr: |
          kube_deployment_status_replicas_ready
          /
          kube_deployment_spec_replicas
      # Total node CPU time rate per cluster (all modes, roughly the number of cores)
      - record: cluster:node_cpu:sum
        expr: |
          sum by (cluster) (
            rate(node_cpu_seconds_total[5m])
          )
      # Total node memory per cluster
      - record: cluster:node_memory:bytes
        expr: |
          sum by (cluster) (
            node_memory_MemTotal_bytes
          )
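Usage figures are often easier to act on when compared with what a Pod requested. The sketch below divides Pod CPU usage by CPU requests; it assumes kube-state-metrics v2, which exposes `kube_pod_container_resource_requests` with a `resource` label, and the group and rule names are hypothetical.

```yaml
groups:
  - name: kubernetes_capacity_rules   # hypothetical group name
    interval: 30s
    rules:
      # CPU actually used divided by CPU requested, per Pod
      - record: namespace_pod:cpu_usage_to_requests:ratio   # hypothetical rule name
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m])
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
```

Both sides are aggregated to the same `namespace` and `pod` labels so that the division matches series one to one. Pods without a CPU request produce no result.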
10.4 Business Metrics
10.4.1 Order Service Metrics
groups:
  - name: order_service_rules
    interval: 1m
    rules:
      # Order creation rate
      - record: service:orders_created:rate1m
        expr: |
          sum by (service) (
            rate(orders_created_total[1m])
          )
      # Order amount per second (5-minute rate)
      - record: service:order_amount:sum
        expr: |
          sum by (service) (
            rate(orders_amount_total[5m])
          )
      # Order success rate
      - record: service:order_success_rate:percent
        expr: |
          100 * sum by (service) (
            rate(orders_completed_total{status="success"}[5m])
          ) / sum by (service) (
            rate(orders_completed_total[5m])
          )
      # Order processing time P99
      - record: service:order_processing_duration_p99:seconds
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, le) (
              rate(order_processing_duration_seconds_bucket[5m])
            )
          )
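When a ratio such as the success rate needs to be aggregated further, it is safer to record the numerator and denominator separately and divide afterwards, because averaging ratios gives misleading results. A minimal sketch, with hypothetical group and rule names:

```yaml
groups:
  - name: order_ratio_rules        # hypothetical group name
    interval: 1m
    rules:
      # Numerator: successfully completed orders per second
      - record: service:orders_completed_success:rate5m   # hypothetical rule name
        expr: sum by (service) (rate(orders_completed_total{status="success"}[5m]))
      # Denominator: all completed orders per second
      - record: service:orders_completed:rate5m           # hypothetical rule name
        expr: sum by (service) (rate(orders_completed_total[5m]))
      # Ratio built from the two recorded series (0 to 1)
      - record: service:order_success_rate:ratio          # hypothetical rule name
        expr: |
          service:orders_completed_success:rate5m
          /
          service:orders_completed:rate5m
```

A global success ratio can then be computed as `sum(numerator) / sum(denominator)` without touching the raw counters.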
10.4.2 User Activity Metrics
groups:
  - name: user_activity_rules
    interval: 5m
    rules:
      # DAU (daily active users)
      - record: app:dau:sum
        expr: |
          sum by (app_id) (
            increase(user_active_daily[1d])
          )
      # MAU (monthly active users)
      - record: app:mau:sum
        expr: |
          sum by (app_id) (
            increase(user_active_monthly[30d])
          )
      # Users currently online
      - record: app:online_users:gauge
        expr: |
          sum by (app_id) (
            user_online_current
          )
      # Day-1 user retention rate
      - record: app:retention_rate:d1
        expr: |
          100 * sum by (app_id) (
            increase(user_retention_d1[1d])
          ) / sum by (app_id) (
            increase(user_new[1d])
          )
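If no new users signed up in the window, the retention rule above divides by zero and records `NaN` or `+Inf`. One way to avoid that, sketched here with hypothetical group and rule names, is to filter out zero denominators so that no sample is written for those apps.

```yaml
groups:
  - name: user_retention_rules     # hypothetical group name
    interval: 5m
    rules:
      - record: app:retention_rate_guarded:d1   # hypothetical rule name
        expr: |
          100 * sum by (app_id) (increase(user_retention_d1[1d]))
          /
          (sum by (app_id) (increase(user_new[1d])) > 0)
```

The `> 0` comparison acts as a filter: apps whose denominator is zero are dropped from the right-hand side, so the division yields no series for them. The parentheses are needed because `/` binds tighter than `>`.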
10.5 Alerting Rule Examples
10.5.1 Using Recording Rules to Optimize Alerts
groups:
  - name: recording_rules_for_alerts
    interval: 30s
    rules:
      # Precompute the CPU alert metric
      - record: alert_level:cpu_usage:percent
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Precompute the memory alert metric
      - record: alert_level:memory_usage:percent
        expr: |
          100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
  - name: alert_rules_optimized
    interval: 30s
    rules:
      # Use the precomputed metrics
      - alert: HighCPUUsage
        expr: alert_level:cpu_usage:percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is high"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.2f\" }}%"
      - alert: CriticalCPUUsage
        expr: alert_level:cpu_usage:percent > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage is critically high"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.2f\" }}%"
      - alert: HighMemoryUsage
        expr: alert_level:memory_usage:percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage is high"
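The same pattern applies to the service-level rules from section 10.3. The sketch below alerts on the recorded `service:http_error_rate:percent` series; the group name, the alert name, and the 5% threshold are assumptions.

```yaml
groups:
  - name: http_alert_rules         # hypothetical group name
    rules:
      - alert: HighHTTPErrorRate   # hypothetical alert name
        # Threshold of 5% is an assumption; tune it per service
        expr: service:http_error_rate:percent > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate is high"
          description: "{{ $labels.service }} error rate is {{ $value | printf \"%.2f\" }}%"
```

Because the expression only compares a stored series with a constant, the alert stays cheap to evaluate even with many services.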
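10.6 Prometheus Configuration
10.6.1 Rule File Configuration
# prometheus.yml
rule_files:
  # Local rule files
  - "rules/*.yml"
  - "rules/*.yaml"
  # A wildcard is allowed only in the last path component, so recursive
  # patterns such as "rules/**/*.yml" are rejected; list each subdirectory instead
  # A glob that matches no files is not an error
  - "rules/optional/*.yml"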
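10.6.2 Evaluation Interval Configuration
# prometheus.yml
global:
  # Global evaluation interval
  evaluation_interval: 15s
# Rule file: an individual group can override the global value
groups:
  - name: slow_queries
    interval: 1m  # Overrides the global value
    rules:
      - record: ...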
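10.7 Testing Rules
10.7.1 Rule Syntax Check
# Check rule files (validates both recording and alerting rules)
./promtool check rules rules/*.yml
# Check the Alertmanager configuration (amtool ships with Alertmanager)
./amtool check-config alertmanager.yml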
10.7.2 Testing PromQL Expressions
# Test in the Prometheus web UI,
# or use curl
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=instance:node_cpu:rate5m{instance="localhost:9100"}'
10.7.3 Simulated Evaluation
# Run rule unit tests with promtool
./promtool test rules rules/test.yml
# rules/test.yml
rule_files:
  - recording_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    # Simulated input: an idle-CPU counter that grows by 60 seconds every minute
    input_series:
      - series: 'node_cpu_seconds_total{mode="idle", instance="test"}'
        values: '0+60x10'
    # Test the CPU recording rule
    promql_expr_test:
      - expr: instance:node_cpu:rate5m
        eval_time: 10m
        # Expected output
        exp_samples:
          - labels: 'instance:node_cpu:rate5m{mode="idle", instance="test"}'
            value: 1
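Rules built on gauges can be tested the same way. The sketch below is a second promtool unit-test file for the memory rule from section 10.2.3; the file name is hypothetical, and it assumes `recording_rules.yml` sits in the same directory and contains the `instance:node_memory_usage:percent` rule.

```yaml
# rules/memory_test.yml (hypothetical file name)
rule_files:
  - recording_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 25 of 100 bytes are available at every sample, so usage is 75%
      - series: 'node_memory_MemAvailable_bytes{instance="test"}'
        values: '25+0x5'
      - series: 'node_memory_MemTotal_bytes{instance="test"}'
        values: '100+0x5'
    promql_expr_test:
      - expr: instance:node_memory_usage:percent
        eval_time: 2m
        exp_samples:
          - labels: 'instance:node_memory_usage:percent{instance="test"}'
            value: 75
```

It would be run with `./promtool test rules rules/memory_test.yml`. The `a+bxn` notation expands to a starting value `a` followed by `n` further samples, each increased by `b`.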
10.8 Best Practices
10.8.1 Naming Conventions
| Pattern | Example | Description |
|---|---|---|
| level:metric:window | job:http_requests:rate5m | Includes the time window |
| level:metric:agg | instance:cpu_usage:avg | Aggregation operation |
| :metric:sum | :kube_pod_cpu:sum | No level prefix |
| service:metric:ratio | api:error_rate:percent | Ratio metric |
10.8.2 Label Management
rules:
  # Keep the original labels you need
  - record: instance:cpu_usage:percent
    expr: |
      100 - (avg by (instance, job, cluster) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  # Do not add duplicate labels
  # Add business labels
  - record: service:api:qps
    expr: |
      sum by (service, team, environment) (
        rate(http_requests_total[5m])
      )
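An alternative to listing every label to keep is to name only the labels to drop. The sketch below uses `without` for that; the metric `http_requests_total` and the group name are assumptions, and the rule name assumes that `job` is the main label left after `instance` is removed.

```yaml
groups:
  - name: label_management_example   # hypothetical group name
    rules:
      # Drop only the instance label; every other label on the source series is preserved
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
```

With `without`, labels added to the source metric later flow through to the recorded series automatically, whereas `by` silently discards them.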
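10.8.3 Performance Considerations
| Recommendation | Description |
|---|---|
| Evaluation interval | At least 2-3 times the scrape interval of the source data |
| Number of rules | No more than 100 rules per group |
| Complex expressions | Split them into several simpler rules |
| Label cardinality | Avoid high-cardinality label combinations |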
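For the cardinality recommendation, the group-level `limit` field can serve as a safety net. In the sketch below, the group name, the rule name, the `path` label, and the cap of 1000 are assumptions.

```yaml
groups:
  - name: guarded_rules            # hypothetical group name
    interval: 1m
    limit: 1000                    # hypothetical cap on series produced by each rule
    rules:
      # The path label may have many values, so the output size is capped
      - record: service_path:http_requests:rate5m   # hypothetical rule name
        expr: sum by (service, path) (rate(http_requests_total[5m]))
```

If a rule produces more series than the limit, the evaluation is recorded as failed and its output is discarded, which makes a cardinality explosion visible instead of letting it fill the TSDB.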
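10.9 Chapter Summary
This chapter covered Prometheus recording rules:
- Recording rules overview: why they are needed and what they offer
- Rule configuration: basic syntax and naming conventions
- Service-level metrics: precomputing HTTP and Kubernetes metrics
- Business metrics: order and user activity metrics
- Alert optimization: using recording rules to speed up alerts
- Testing rules: syntax checks and unit tests
- Best practices: naming conventions and performance tuning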