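Chapter 10: Recording Rules

A detailed guide to Prometheus recording rules, which precompute frequently run queries to improve query performance.

Last updated: 2024-01-01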

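10.1 Recording Rules Overview

Recording rules are PromQL expressions that Prometheus evaluates on a schedule, saving the result of each one as a new time series. They are used to speed up frequently run queries and dashboards.

10.1.1 Why Use Recording Rules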

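Scenario | Without rules | With rules
Dashboard loading | Runs the full computation on every refresh | Reads the precomputed result
Response time | Can reach seconds for expensive queries | Typically milliseconds
Prometheus load | The expression is recomputed for every query | The expression is computed once per evaluation interval
Alert evaluation | Each alert evaluates the full expression | Alerts compare a precomputed series against a threshold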
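
As a rough sketch of the difference, assume a dashboard panel that plots the request rate per job. The metric `http_requests_total`, the file name, and the group name below are illustrative assumptions rather than part of any real deployment.

```yaml
# rules/example.yml (hypothetical file name)
groups:
  - name: dashboard_precompute   # hypothetical group name
    rules:
      # Evaluated once per evaluation interval and stored as a new series.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

# Panel query without the rule (recomputed on every dashboard refresh):
#   sum by (job) (rate(http_requests_total[5m]))
# Panel query with the rule (reads the stored series):
#   job:http_requests:rate5m
```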

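10.1.2 Recording Rules vs. Alerting Rules

Feature | Recording rules | Alerting rules
Purpose | Precompute metrics | Trigger notifications
Trigger condition | None (the result is always recorded) | The expression returns at least one series
Continuous evaluation | Yes | Yes (with an optional for duration)
Sends notifications | No | Yes (through Alertmanager)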

10.2 Rule Configuration

10.2.1 Basic Syntax

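# rules/recording_rules.yml
groups:
  - name: <group_name>
    interval: <duration>  # Evaluation interval; defaults to global.evaluation_interval
    limit: <number>       # Max series a recording rule (or alerts an alerting rule) may produce; 0 means no limit
    rules:
      - record: <string>           # Name of the new time series
        expr: <string>              # PromQL expression
        labels:                     # Labels to add or overwrite on the result
          <labelname>: <labelvalue>
        # annotations are valid only on alerting rules (alert:), not on recording rules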

10.2.2 Naming Conventions

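# Naming format: <level>:<metric>:<operations>
#   level      - aggregation level, i.e. the labels kept in the output
#   metric     - name of the underlying metric
#   operations - operations applied to the metric, newest first

# Examples
job:http_requests:rate5m           # level:metric:operation (rate over a 5m window)
instance:memory_usage:avg          # Memory usage averaged per instance
service:error_rate:ratio           # Error ratio per service
:kube_pod_cpu_sums:rate5m           # Empty level prefix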
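
To make the pattern concrete, here is a minimal sketch of how the level part of a name changes as labels are aggregated away. The metric `requests_total`, the `path` label, and the job name `myjob` are assumptions chosen for illustration.

```yaml
groups:
  - name: naming_example   # hypothetical group name
    rules:
      # level = instance_path, metric = requests, operation = rate5m
      - record: instance_path:requests:rate5m
        expr: rate(requests_total{job="myjob"}[5m])

      # Aggregating away `instance` changes the level to `path`.
      - record: path:requests:rate5m
        expr: sum without (instance) (instance_path:requests:rate5m{job="myjob"})
```

Reading a name from left to right then tells you which labels to expect on the series and how the value was derived.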

10.2.3 Complete Example

# rules/recording_rules.yml
groups:
  # Precomputed base metrics
  - name: node_recording_rules
    interval: 30s
    rules:
      # Per-CPU, per-mode CPU time rate (the cpu and mode labels are kept)
      - record: instance:node_cpu:rate5m
        expr: |
          rate(node_cpu_seconds_total[5m])

      # CPU usage percentage
      - record: instance:node_cpu_usage:percent
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory usage percentage
      - record: instance:node_memory_usage:percent
        expr: |
          100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

      # Disk usage percentage (one series per filesystem)
      - record: instance:node_disk_usage:percent
        expr: |
          100 * (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes))

      # Network I/O
      - record: instance:node_network_receive_bytes:rate5m
        expr: |
          rate(node_network_receive_bytes_total[5m])

      - record: instance:node_network_transmit_bytes:rate5m
        expr: |
          rate(node_network_transmit_bytes_total[5m])

10.3 Service-Level Metrics

10.3.1 HTTP Service Metrics

groups:
  - name: http_service_rules
    interval: 15s
    rules:
      # QPS
      - record: service:http_requests:rate5m
        expr: |
          sum by (service, job) (
            rate(http_requests_total[5m])
          )

      # 5xx request rate, per status code
      - record: service:http_errors:rate5m
        expr: |
          sum by (service, job, status) (
            rate(http_requests_total{status=~"5.."}[5m])
          )

      # Error rate as a percentage
      - record: service:http_error_rate:percent
        expr: |
          100 * sum by (service, job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (service, job) (
            rate(http_requests_total[5m])
          )

      # P99 latency
      - record: service:http_request_duration_p99:seconds
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # P95 latency
      - record: service:http_request_duration_p95:seconds
        expr: |
          histogram_quantile(
            0.95,
            sum by (service, job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # P50 latency
      - record: service:http_request_duration_p50:seconds
        expr: |
          histogram_quantile(
            0.50,
            sum by (service, job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # Average request size
      - record: service:http_request_size_bytes:avg
        expr: |
          sum by (service, job) (
            rate(http_request_size_bytes_sum[5m])
          ) / sum by (service, job) (
            rate(http_request_size_bytes_count[5m])
          )
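
Rules within a group are evaluated in order, so a later rule can build on series recorded earlier in the same group. The following sketch shows an alternative layout of the error-rate rules on that basis; the group name and the rule name `service:http_error_ratio:rate5m` are hypothetical.

```yaml
groups:
  - name: http_service_rules_chained   # hypothetical group name
    interval: 15s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service, job) (rate(http_requests_total[5m]))

      - record: service:http_errors:rate5m
        expr: sum by (service, job, status) (rate(http_requests_total{status=~"5.."}[5m]))

      # Hypothetical rule: reuses the two series recorded above instead of
      # recomputing rate() over the raw counters.
      - record: service:http_error_ratio:rate5m
        expr: |
          sum without (status) (service:http_errors:rate5m)
            /
          service:http_requests:rate5m
```

Dropping `status` from the numerator leaves both sides with the same `service` and `job` labels, which is what lets the division match series one to one.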

10.3.2 Kubernetes Metrics

groups:
  - name: kubernetes_recording_rules
    interval: 30s
    rules:
      # Pod CPU usage (cores)
      - record: namespace:pod_cpu_usage:sum
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m])
          )

      # Pod memory usage
      - record: namespace:pod_memory_usage:sum
        expr: |
          sum by (namespace, pod) (
            container_memory_working_set_bytes{container!="", container!="POD"}
          )

      # Deployment replica count
      - record: deployment:replicas:avg
        expr: |
          avg by (namespace, deployment) (
            kube_deployment_spec_replicas
          )

      # Ratio of ready replicas to desired replicas
      - record: deployment:ready_replicas:ratio
        expr: |
          kube_deployment_status_replicas_ready
            /
          kube_deployment_spec_replicas

      # Total node CPU (summing the rate over all modes approximates the core count)
      - record: cluster:node_cpu:sum
        expr: |
          sum by (cluster) (
            rate(node_cpu_seconds_total[5m])
          )

      # Total node memory
      - record: cluster:node_memory:bytes
        expr: |
          sum by (cluster) (
            node_memory_MemTotal_bytes
          )

10.4 Business Metrics

10.4.1 Order Service Metrics

groups:
  - name: order_service_rules
    interval: 1m
    rules:
      # Order creation rate
      - record: service:orders_created:rate1m
        expr: |
          sum by (service) (
            rate(orders_created_total[1m])
          )

      # Order amount per second
      - record: service:order_amount:sum
        expr: |
          sum by (service) (
            rate(orders_amount_total[5m])
          )

      # Order success rate
      - record: service:order_success_rate:percent
        expr: |
          100 * sum by (service) (
            rate(orders_completed_total{status="success"}[5m])
          ) / sum by (service) (
            rate(orders_completed_total[5m])
          )

      # Order processing time, P99
      - record: service:order_processing_duration_p99:seconds
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, le) (
              rate(order_processing_duration_seconds_bucket[5m])
            )
          )

10.4.2 User Activity Metrics

groups:
  - name: user_activity_rules
    interval: 5m
    rules:
      # DAU (daily active users)
      - record: app:dau:sum
        expr: |
          sum by (app_id) (
            increase(user_active_daily[1d])
          )

      # MAU (monthly active users)
      - record: app:mau:sum
        expr: |
          sum by (app_id) (
            increase(user_active_monthly[30d])
          )

      # Users currently online
      - record: app:online_users:gauge
        expr: |
          sum by (app_id) (
            user_online_current
          )

      # Day-1 user retention rate
      - record: app:retention_rate:d1
        expr: |
          100 * sum by (app_id) (
            increase(user_retention_d1[1d])
          ) / sum by (app_id) (
            increase(user_new[1d])
          )

10.5 Alerting Rule Examples

10.5.1 Optimizing Alerts with Recording Rules

groups:
  - name: recording_rules_for_alerts
    interval: 30s
    rules:
      # Precompute the CPU metric used by alerts
      - record: alert_level:cpu_usage:percent
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Precompute the memory metric used by alerts
      - record: alert_level:memory_usage:percent
        expr: |
          100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

  - name: alert_rules_optimized
    interval: 30s
    rules:
      # Use the precomputed metrics
      - alert: HighCPUUsage
        expr: alert_level:cpu_usage:percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: CriticalCPUUsage
        expr: alert_level:cpu_usage:percent > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critically high CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: HighMemoryUsage
        expr: alert_level:memory_usage:percent > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"

10.6 Prometheus Configuration

10.6.1 Rule File Configuration

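# prometheus.yml
rule_files:
  # Local rule files
  - "rules/*.yml"
  - "rules/*.yaml"

  # Subdirectories: Prometheus uses Go glob matching, which has no recursive "**",
  # so match each directory level explicitly
  - "rules/*/*.yml"

  # Optional rules (a glob that matches no files is not an error)
  - "rules/optional/*.yml"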

10.6.2 Evaluation Interval Configuration

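# prometheus.yml
global:
  # Global evaluation interval
  evaluation_interval: 15s

# Rule file: an individual group can override the global value
groups:
  - name: slow_queries
    interval: 1m  # Overrides the global setting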
    rules:
      - record: ...

10.7 Testing Rules

10.7.1 Rule Syntax Check

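# Check rule files (recording and alerting rules use the same command)
./promtool check rules rules/*.yml

# promtool has no "check alerts" subcommand; alerting rules are covered by the command above.
# The Alertmanager configuration is validated separately with amtool.
amtool check-config alertmanager.yml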

10.7.2 Testing PromQL Expressions

# Test in the Prometheus web UI

# Or use curl (the HTTP API parameter is "query")
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=instance:node_cpu:rate5m{instance="localhost:9100"}'

10.7.3 Simulated Evaluation

# Run rule unit tests with promtool
./promtool test rules rules/test.yml

# rules/test.yml
rule_files:
  - recording_rules.yml

evaluation_interval: 1m

tests:
  # Test the CPU recording rule
  - interval: 1m
    # Simulated input: a counter that grows by 30 every minute (0.5 per second)
    input_series:
      - series: 'node_cpu_seconds_total{mode="idle", instance="test"}'
        values: '0+30x10'
    promql_expr_test:
      - expr: instance:node_cpu:rate5m
        eval_time: 10m
        # Expected output
        exp_samples:
          - labels: 'instance:node_cpu:rate5m{mode="idle", instance="test"}'
            value: 0.5
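
As a further sketch, a promtool unit test for the memory rule from 10.2.3 could look like the following. The test file name and the input values are assumptions made for illustration; `25+0x10` is promtool's expanding notation for a constant value of 25 repeated over 11 samples.

```yaml
# rules/memory_test.yml (hypothetical file name)
rule_files:
  - recording_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'node_memory_MemAvailable_bytes{instance="test"}'
        values: '25+0x10'    # constant 25
      - series: 'node_memory_MemTotal_bytes{instance="test"}'
        values: '100+0x10'   # constant 100
    promql_expr_test:
      - expr: instance:node_memory_usage:percent
        eval_time: 5m
        exp_samples:
          # 100 * (1 - 25 / 100) = 75
          - labels: 'instance:node_memory_usage:percent{instance="test"}'
            value: 75
```

It would be run the same way as the CPU test, for example with `./promtool test rules rules/memory_test.yml`.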

10.8 Best Practices

10.8.1 Naming Conventions

Pattern | Example | Description
level:metric:window | job:http_requests:rate5m | Includes the time window
level:metric:agg | instance:cpu_usage:avg | Aggregation operation
:metric:sum | :kube_pod_cpu:sum | Empty level prefix
service:metric:ratio | api:error_rate:percent | Ratio metric

10.8.2 Label Management

rules:
  # Keep the original labels
  - record: instance:cpu_usage:percent
    expr: |
      100 - (avg by (instance, job, cluster) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    # Do not add duplicate labels

  # Keep business labels (they must already exist on the source series)
  - record: service:api:qps
    expr: |
      sum by (service, team, environment) (
        rate(http_requests_total[5m])
      )
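
When a label is not present on the source series, the `labels` field from 10.2.1 can attach a static value to the recorded result instead. A minimal sketch, in which the group name and the value `platform` are hypothetical:

```yaml
groups:
  - name: label_example   # hypothetical group name
    rules:
      - record: service:api:qps
        expr: sum by (service) (rate(http_requests_total[5m]))
        labels:
          # Static label added before the result is stored; it overwrites
          # any `team` label already present on the result.
          team: platform
```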

10.8.3 Performance Considerations

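Recommendation | Description
Evaluation interval | At least 2-3x the scrape interval of the source data
Rule count | No more than 100 rules per group
Complex expressions | Split into several simpler rules
Label cardinality | Avoid high-cardinality label combinations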
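
One way to guard against cardinality surprises is to aggregate onto an explicit label set and cap the group's output with `limit`. This is only a sketch: the group name, the rule name, and the figure of 1000 are assumptions, not recommendations from the text above.

```yaml
groups:
  - name: cardinality_guarded   # hypothetical group name
    interval: 1m
    # If a rule in this group produces more than 1000 series, its output
    # is discarded and the evaluation is recorded as an error.
    limit: 1000
    rules:
      # `by (job, status)` fixes the output label set, so per-instance or
      # per-pod labels on the source series cannot leak into the result.
      - record: job_status:http_requests:rate5m
        expr: sum by (job, status) (rate(http_requests_total[5m]))
```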

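10.9 Chapter Summary

This chapter covered Prometheus recording rules:

  1. Recording rules overview - why they are needed and what they offer
  2. Rule configuration - basic syntax and naming conventions
  3. Service-level metrics - precomputing HTTP and Kubernetes metrics
  4. Business metrics - order and user activity metrics
  5. Alert optimization - using recording rules to optimize alerts
  6. Testing rules - syntax checks and unit tests
  7. Best practices - naming conventions and performance tuning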
