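Chapter 14: Troubleshooting

14.1 Common Diagnostic Tools

14.1.1 Built-in Prometheus Endpoints

Endpoint	Description
`/-/healthy`	Health check
`/-/ready`	Readiness check
`/metrics`	Prometheus's own metrics
`/graph`	Web UI query page
`/status`	Configuration and runtime information
`/tsdb-status`	TSDB status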

# Health check
curl http://localhost:9090/-/healthy

# Readiness check
curl http://localhost:9090/-/ready

# TSDB status
curl http://localhost:9090/api/v1/status/tsdb

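14.1.2 promtool Commands

# Check configuration syntax
./promtool check config prometheus.yml

# Check rule syntax
./promtool check rules rules/*.yml

# Lint the metrics an exporter exposes
curl -s http://localhost:9100/metrics | ./promtool check metrics

# Dump TSDB samples
./promtool tsdb dump /data/prometheus

# Analyze the TSDB (churn and label cardinality)
./promtool tsdb analyze /data/prometheus

# Run rule unit tests
./promtool test rules rules/test.yml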

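14.2 Scrape Problems

14.2.1 Target Unavailable

# List the state of every target
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, endpoint: .scrapeUrl, health: .health, lastError: .lastError}'

# Common errors and fixes

# 1. Connection refused
# Cause: nothing is listening on the port
# Fix: check that the exporter is running
netstat -tlnp | grep 9100
ss -tlnp | grep 9100

# 2. Timeout
# Cause: the scrape took longer than scrape_timeout
# Fix: raise scrape_timeout (it must not exceed scrape_interval)
scrape_configs:
  - job_name: 'slow-target'
    scrape_interval: 60s
    scrape_timeout: 30s

# 3. Certificate error
# Cause: a problem with the TLS certificate
# Fix: check the certificate's validity period and the TLS configuration
openssl s_client -connect target:9100 -showcerts

# 4. DNS resolution failed
# Cause: the target host name does not resolve
# Fix: check the DNS configuration
dig target.example.com
nslookup target.example.com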
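
When many targets are down, it can help to group them by error message before working through the list above. The following is a minimal sketch in Python, assuming the `/api/v1/targets` response has already been fetched and decoded; the function name `summarize_unhealthy` and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_unhealthy(response: dict) -> dict:
    """Group scrape URLs of targets that are not "up" by their lastError text."""
    by_error = defaultdict(list)
    for target in response["data"]["activeTargets"]:
        if target.get("health") != "up":
            error = target.get("lastError") or "(no error recorded)"
            by_error[error].append(target["scrapeUrl"])
    return dict(by_error)


# Hypothetical sample, limited to the fields the jq filter above selects.
SAMPLE_RESPONSE = {
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://node1:9100/metrics", "health": "up",
             "lastError": ""},
            {"scrapeUrl": "http://node2:9100/metrics", "health": "down",
             "lastError": "connection refused"},
            {"scrapeUrl": "http://node3:9100/metrics", "health": "down",
             "lastError": "connection refused"},
            {"scrapeUrl": "http://slow:8080/metrics", "health": "down",
             "lastError": "context deadline exceeded"},
        ]
    }
}

if __name__ == "__main__":
    for error, urls in summarize_unhealthy(SAMPLE_RESPONSE).items():
        print(f"{len(urls)} target(s): {error}")
        for url in urls:
            print(f"  {url}")
```

Targets that share one error usually share one cause, so a large group often points to a network or DNS issue rather than to individual exporters.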

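14.2.2 Target Healthy but No Data

# Check whether any series exist
count({job="problem-job"})

# Check metric names
count by (__name__) ({job="problem-job"})

# Inspect labels
{job="problem-job"}

# Possible causes

# 1. Relabeling filtered everything out
# Review relabel_configs; a keep rule drops whatever does not match
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: 'true'

# 2. Malformed metrics
# Inspect the exporter output
curl -s http://localhost:9100/metrics | head -100

# 3. Stale metrics
# Check lastScrape for each target
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {endpoint: .scrapeUrl, lastScrape: .lastScrape}'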

14.2.3 Label Problems

# Find the metrics with the most series (high cardinality)
topk(10, count by (__name__) ({__name__=~".+"}))

# Find series where a label is empty or missing
{job="problem-job", handler=""}
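
The cardinality check can also be done offline on a list of label sets, for example the `data` array returned by `/api/v1/series`. This is a small sketch in Python under that assumption; `cardinality_report` and the sample series are hypothetical names and values used only for illustration.

```python
from collections import Counter, defaultdict


def cardinality_report(series: list) -> tuple:
    """Return (series count per metric, distinct value count per label)."""
    per_metric = Counter(s["__name__"] for s in series)
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            if name != "__name__":
                values[name].add(value)
    per_label = {name: len(vals) for name, vals in values.items()}
    return per_metric, per_label


# Hypothetical label sets; "path" grows with every distinct URL.
SAMPLE_SERIES = [
    {"__name__": "http_requests_total", "job": "api", "path": "/users/1"},
    {"__name__": "http_requests_total", "job": "api", "path": "/users/2"},
    {"__name__": "http_requests_total", "job": "api", "path": "/users/3"},
    {"__name__": "up", "job": "api"},
]

if __name__ == "__main__":
    metrics, labels = cardinality_report(SAMPLE_SERIES)
    print("series per metric:", metrics.most_common(10))
    print("distinct values per label:", labels)
```

A label whose distinct value count is close to the total series count (here `path`) is the usual suspect for a cardinality problem.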

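14.3 Query Problems

14.3.1 Query Timeouts

# Problem: queries take too long to execute
# Fix: optimize the query or raise the timeout

# Raise the query timeout (a command-line flag; the default is 2m)
./prometheus --query.timeout=5m

# Optimize the query
# ❌ Inefficient: touches every series
sum by (__name__) ({__name__=~".+"})

# ✅ Efficient: selects one metric and aggregates it
sum by (job) (rate(http_requests_total[5m]))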

14.3.2 Empty Query Results

# 1. Check whether data exists at a given instant
curl 'http://localhost:9090/api/v1/query?query=up&time=1704067200'

# 2. Check the time window
curl 'http://localhost:9090/api/v1/query_range?query=up&start=1703980800&end=1704067200&step=3600'

# 3. Check label matching
curl 'http://localhost:9090/api/v1/label/__name__/values'
curl -G 'http://localhost:9090/api/v1/series' --data-urlencode 'match[]={job="problem-job"}'
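
An empty result is sometimes caused by a selector that was not URL-encoded, because braces, brackets and quotes are special characters in both shells and URLs. The sketch below builds the same API URLs in Python with the standard library; the helper names `query_range_url` and `series_url` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(base: str, query: str, start: int, end: int, step: int) -> str:
    """Build a /api/v1/query_range URL with an encoded PromQL expression."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"


def series_url(base: str, matchers: list) -> str:
    """Build a /api/v1/series URL with one match[] parameter per selector."""
    params = urlencode([("match[]", m) for m in matchers])
    return f"{base}/api/v1/series?{params}"


if __name__ == "__main__":
    base = "http://localhost:9090"
    print(query_range_url(base, 'up{job="problem-job"}', 1703980800, 1704067200, 3600))
    url = series_url(base, ['{job="problem-job"}'])
    print(url)
    # Decoding the query string recovers the original selector.
    print(parse_qs(urlsplit(url).query))
```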

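14.3.3 Out of Memory

# Check Prometheus memory usage
process_resident_memory_bytes{job="prometheus"}

# Check the number of series held in memory
prometheus_tsdb_head_series

# Fixes
# 1. Reduce label cardinality
# 2. Give Prometheus more memory (raise the host or container limit)
# 3. Cap what queries may load; Prometheus limits queries by sample
#    count and concurrency rather than by megabytes
./prometheus \
  --query.max-samples=50000000 \
  --query.max-concurrency=20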

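14.4 Storage Problems

14.4.1 Low Disk Space

# Check disk usage
df -h /data/prometheus

# Check TSDB size
du -sh /data/prometheus/data

# List block sizes, largest last
du -sh /data/prometheus/data/*/ | sort -h | tail -20

# Delete old data (use with care; requires --web.enable-admin-api)
# The selector {job="old-job"} is only an example
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="old-job"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Configure data retention (command-line flags)
./prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB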
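
To choose a retention setting, a rough capacity estimate is retention time multiplied by the ingestion rate and the bytes stored per sample. The sketch below does that arithmetic in Python. It assumes about 2 bytes per sample, which is a rule-of-thumb figure that varies with the data, and an ingestion rate read from `rate(prometheus_tsdb_head_samples_appended_total[5m])`; the function name is hypothetical.

```python
def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough on-disk size: retention seconds * ingestion rate * bytes per sample."""
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Example input values, not measurements.
    needed = estimate_disk_bytes(retention_days=30, samples_per_second=100_000)
    print(f"approx. {needed / 1e9:.1f} GB for 30 days at 100k samples/s")
```

If the estimate exceeds the available disk, lowering the retention time or the number of ingested series has a larger effect than deleting individual series afterwards.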

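14.4.2 WAL Problems

# Check WAL size
du -sh /data/prometheus/wal

# Count WAL segments (files with numeric names and no extension)
ls /data/prometheus/wal | grep -cE '^[0-9]+$'

# Do not delete WAL files by hand. Prometheus truncates the WAL
# automatically as long as it is running normally.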

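14.4.3 Data Loss

# 1. Check for a WAL checkpoint
ls -la /data/prometheus/wal/checkpoint.*

# 2. Dump the samples to confirm what is still readable
#    (the WAL itself is replayed automatically at startup)
./promtool tsdb dump /data/prometheus

# 3. Inspect the blocks
./promtool tsdb list /data/prometheus
./promtool tsdb analyze /data/prometheus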

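14.5 Alerting Problems

14.5.1 Alert Not Firing

# 1. Check the alerting rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting")'

# 2. Check alert state
curl http://localhost:9090/api/v1/alerts

# 3. Test the alert condition
# Run the alert expression in the Graph page

# Common causes

# 1. The condition has not held for the full "for" duration
# The alert stays pending until the expression has been true for 5 minutes
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m  # pending for 5 minutes before firing

# 2. Evaluation interval
# With a global evaluation_interval of 15s, for: 5m means the expression
# must stay true across about 20 consecutive evaluations

# 3. Alertmanager is not configured
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']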
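
The relationship between `for` and `evaluation_interval` is simple arithmetic, sketched here in Python. The duration parser handles only the plain unit suffixes and is an illustration, not Prometheus's own parser; both function names are hypothetical.

```python
import math
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text: str) -> float:
    """Parse a duration such as '15s', '5m' or '1h30m' into seconds."""
    parts = re.findall(r"(\d+)(ms|s|m|h|d|w|y)", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"not a valid duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def intervals_before_firing(for_duration: str, evaluation_interval: str) -> int:
    """Evaluation intervals that must pass, after the first true evaluation,
    before a pending alert can fire."""
    return math.ceil(parse_duration(for_duration) / parse_duration(evaluation_interval))


if __name__ == "__main__":
    print(intervals_before_firing("5m", "15s"))
    print(intervals_before_firing("2m", "45s"))
```

If the expression is false at any evaluation in that window, the pending state resets, which is why a flapping condition may never fire with a long `for`.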

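14.5.2 Duplicate Notifications

# 1. Configure grouping
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

# 2. Configure inhibition
# An inhibit rule needs a target matcher as well as a source matcher;
# the target_match value below is only an example
inhibit_rules:
  - source_match:
      alertname: 'InstanceDown'
    target_match:
      severity: 'warning'
    equal: ['instance']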
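
The effect of `group_by` on the number of notifications can be illustrated by grouping alert label sets by the chosen keys. This Python sketch only mimics the grouping idea and is not Alertmanager's implementation; `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Group alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts.
SAMPLE_ALERTS = [
    {"alertname": "HighCPU", "instance": "node1", "severity": "warning"},
    {"alertname": "HighCPU", "instance": "node2", "severity": "warning"},
    {"alertname": "InstanceDown", "instance": "node2", "severity": "critical"},
]

if __name__ == "__main__":
    for keys in (["alertname", "instance"], ["alertname"]):
        groups = group_alerts(SAMPLE_ALERTS, keys)
        print(f"group_by={keys}: {len(groups)} notification group(s)")
```

Each group is notified separately, so adding `instance` to `group_by` produces more, smaller notifications, while grouping by `alertname` alone batches them.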

14.5.3 Alertmanager Problems

# 1. Check Alertmanager status
curl http://localhost:9093/api/v2/status

# 2. List alerts
curl http://localhost:9093/api/v2/alerts

# 3. Send a test alert
amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  alertname=test \
  severity=critical

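14.6 Performance Problems

14.6.1 High Memory Usage

# Diagnose with Prometheus's own metrics
# Memory
process_resident_memory_bytes{job="prometheus"}

# Active series in the head block
prometheus_tsdb_head_series

# Loaded blocks
prometheus_tsdb_blocks_loaded

# Queries currently running, and the configured concurrency limit
prometheus_engine_queries
prometheus_engine_queries_concurrent_max

# Fixes

# 1. Reduce label cardinality
# 2. Shorten retention so old blocks are removed (7d is an example value)
./prometheus \
  --storage.tsdb.retention.time=7d

# 3. Limit query concurrency and duration
./prometheus \
  --query.max-concurrency=20 \
  --query.timeout=2m

# 4. Enable WAL compression (on by default in recent releases;
#    it saves disk space rather than memory)
./prometheus \
  --storage.tsdb.wal-compression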

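14.6.2 High CPU Usage

# Check CPU usage
rate(process_cpu_seconds_total{job="prometheus"}[5m])

# Check rule evaluation time
prometheus_rule_evaluation_duration_seconds

# Check query time
prometheus_engine_query_duration_seconds

# Fixes

# 1. Reduce the number of rules
# 2. Reduce the number of scrape targets
# 3. Increase the evaluation interval
groups:
  - name: slow_rules
    interval: 5m  # evaluate every 5 minutes
    rules:
      - ...

# 4. Precompute expensive expressions with recording rules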

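14.6.3 Slow Queries

# Diagnose slow queries
# Average compaction duration (compactions compete with queries for disk I/O)
rate(prometheus_tsdb_compaction_duration_seconds_sum[1h])
  / rate(prometheus_tsdb_compaction_duration_seconds_count[1h])

# Query latency in the evaluation phase
prometheus_engine_query_duration_seconds{slice="inner_eval"}

# Slowest query phases. This metric has no per-query label; to find
# individual slow queries, set query_log_file in the global config.
topk(10,
  prometheus_engine_query_duration_seconds{quantile="0.99"}
)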
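
Once a query log is enabled, the slowest individual queries can be pulled out of it with a few lines of Python. This sketch assumes the log is JSON lines in which each record carries the expression under `params.query` and the evaluation time in seconds under `stats.timings.evalTotalTime`; verify those field names against a real log before relying on them. The function name and sample lines are hypothetical.

```python
import json


def slowest_queries(lines, limit: int = 10) -> list:
    """Return (seconds, query) pairs for the slowest logged queries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Field names are an assumption about the query log layout.
        seconds = record.get("stats", {}).get("timings", {}).get("evalTotalTime")
        query = record.get("params", {}).get("query")
        if seconds is not None and query is not None:
            entries.append((seconds, query))
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries[:limit]


# Hypothetical log lines for illustration.
SAMPLE_LOG = [
    '{"params":{"query":"up"},"stats":{"timings":{"evalTotalTime":0.0004}}}',
    '{"params":{"query":"sum by (__name__) ({__name__=~\\".+\\"})"},'
    '"stats":{"timings":{"evalTotalTime":12.5}}}',
    '{"params":{"query":"rate(http_requests_total[5m])"},'
    '"stats":{"timings":{"evalTotalTime":0.8}}}',
]

if __name__ == "__main__":
    for seconds, query in slowest_queries(SAMPLE_LOG, limit=2):
        print(f"{seconds:8.3f}s  {query}")
```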

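14.7 Network Problems

14.7.1 Service Discovery Failures

# Kubernetes SD
# 1. Check RBAC permissions
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:prometheus
kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus

# 2. Check the service account
kubectl get sa prometheus -n monitoring -o yaml

# 3. Check API server access from inside the Prometheus pod
curl -sk -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  https://kubernetes.default.svc:443/api/v1/pods

# DNS SD
# Check the DNS SD configuration
scrape_configs:
  - job_name: 'dns-sd'
    dns_sd_configs:
      - names:
          - 'targets.example.com'
        refresh_interval: 30s
        type: A
        port: 9100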

14.7.2 Firewall Problems

# Prometheus → Exporter
nc -zv target 9100

# Prometheus → Alertmanager
nc -zv alertmanager 9093

# Check firewall rules
iptables -L -n | grep 9090
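
Where `nc` is not installed, the same reachability check can be sketched with Python's standard `socket` module. The function name `can_connect` is hypothetical, and reading a timeout as a firewall drop is a heuristic rather than a certainty.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> tuple:
    """Try a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port.
        return False, "refused"
    except socket.timeout:
        # No answer at all; often a firewall silently dropping packets.
        return False, "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and similar errors.
        return False, f"error: {exc}"


if __name__ == "__main__":
    for host, port in [("localhost", 9090), ("localhost", 9093)]:
        ok, detail = can_connect(host, port, timeout=1.0)
        print(f"{host}:{port} -> {detail}")
```

A "refused" result points at the exporter or service itself, while a "timeout" is more typical of a firewall rule or routing problem between the two hosts.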

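14.8 Log Analysis

14.8.1 Log Level

# Set the log level at startup
./prometheus \
  --log.level=debug \
  --log.format=json

# Reload the configuration file at runtime (requires --web.enable-lifecycle).
# The log level is a command-line flag, so changing it needs a restart.
curl -X POST http://localhost:9090/-/reload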

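14.8.2 Common Log Messages

# The lines below illustrate the general shape of log entries; exact
# component names and message text vary between Prometheus versions.

# Scrape succeeded
level=info component=scrape_manager target="..." msg="scrape target scraped"

# Scrape failed
level=error component=scrape_manager target="..." msg="scrape failed" err="connection refused"

# Rule evaluation
level=info component=rule_manager name="..." duration=...

# Compaction
level=info component=compactor msg="compaction complete"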
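
Key=value log lines like these can be filtered by level with a short script. The sketch below uses Python's `shlex` to split the line so that quoted values stay together; it is a simplification that may fail on lines containing unbalanced quotes, and both function names are hypothetical.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict, keeping quoted values whole."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def errors_only(lines) -> list:
    """Return the parsed entries whose level is "error"."""
    parsed = (parse_logfmt(line) for line in lines)
    return [fields for fields in parsed if fields.get("level") == "error"]


# Sample lines taken from the examples above.
SAMPLE_LINES = [
    'level=info component=compactor msg="compaction complete"',
    'level=error component=scrape_manager target="..." '
    'msg="scrape failed" err="connection refused"',
]

if __name__ == "__main__":
    for entry in errors_only(SAMPLE_LINES):
        print(entry["component"], "-", entry["msg"], "-", entry.get("err", ""))
```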

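14.9 Quick Reference for Common Problems

Problem	Cause	Fix
Target shows DOWN	Exporter not running / network unreachable	Check exporter status and connectivity
Query returns no data	Scrape interval / time range	Check the time window
Query timeout	Too many series / complex query	Optimize the query or raise the timeout
Disk full	Retention too long	Shorten retention or delete old data
Alert does not fire	"for" duration not reached / misconfiguration	Check the alert rule
No alerts in Alertmanager	Alertmanager not configured	Configure alertmanagers
Memory keeps growing	Label cardinality too high	Improve label design
Slow startup	WAL replay	Expected behavior; wait for replay to finish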

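14.10 Chapter Summary

This chapter covered how to troubleshoot Prometheus:

Diagnostic tools - built-in endpoints and promtool
Scrape problems - unavailable targets, missing data, label problems
Query problems - timeouts, empty results, out of memory
Storage problems - disk space, WAL, data loss
Alerting problems - alerts not firing, duplicate notifications, Alertmanager problems
Performance problems - memory, CPU, slow queries
Network problems - service discovery, firewalls
Log analysis - log levels and common messages

📖 The Definitive Prometheus Tutorial is complete!

Congratulations on finishing all 14 chapters. From basic concepts to advanced usage, and from installation and deployment to troubleshooting, this tutorial has covered the full body of knowledge around the Prometheus monitoring system. Apply what you have learned in real projects and keep building experience.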