Chapter 8: Monitoring and Alerting
Last updated: 2024-01-01
Author: DevOps Team
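Observability is a core capability of modern operations. This chapter describes how to build a monitoring, alerting, and observability stack that gives full visibility into system state and supports fast incident response.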
8.1 The Three Pillars of Observability
8.1.1 Observability Model
┌─────────────────────────────────────────────────────────────────┐
│                          Observability                          │
│                                                                 │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │   Metrics   │    │    Logs     │    │   Traces    │         │
│   └─────────────┘    └─────────────┘    └─────────────┘         │
│          │                  │                  │                │
│          └──────────────────┼──────────────────┘                │
│                             │                                   │
│                      ┌──────┴──────┐                            │
│                      │  Dashboard  │                            │
│                      │  Alerting   │                            │
│                      │  Analysis   │                            │
│                      └─────────────┘                            │
└─────────────────────────────────────────────────────────────────┘
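8.1.2 Monitoring Layers
| Layer | Metric type | Examples |
|---|---|---|
| Infrastructure | Host metrics | CPU, memory, disk, network |
| Container | Container metrics | Pod count, resource usage |
| Application | Business metrics | Request volume, response time, error rate |
| User experience | Frontend metrics | Page load time, user interactions |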
8.2 Monitoring with Prometheus
8.2.1 Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    env: 'us-east-1'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes API server
  - job_name: 'kubernetes-apiserver'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: default;kubernetes
        action: keep

  # Node Exporter (assumed to listen on port 9100 on every node)
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Node discovery targets the kubelet port by default. Keep each node's
      # own host and swap in the Node Exporter port, instead of pointing every
      # target at one fixed address.
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'
        replacement: '${1}:9100'
        target_label: __address__

  # Pod metrics (opt-in through prometheus.io/* annotations)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Optional custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Optional custom port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
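The last relabel rule of the `kubernetes-pods` job is the least obvious part of this configuration. The Python sketch below imitates it with the standard `re` module, assuming the documented behaviour of the `replace` action: the source label values are joined with `;`, the regex has to match the whole joined string, and `$1`/`$2` refer to capture groups. The helper `relabel_replace` and the sample addresses are hypothetical, and Python's `re` is only a stand-in for the RE2 engine that Prometheus uses.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Simplified model of one Prometheus `replace` relabel step (illustration only)."""
    # Prometheus concatenates the source label values with the separator (";" by default).
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # The regex is anchored at both ends, so the whole joined string has to match.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: the target label is left untouched
    # Expand $1, $2, ... with the corresponding capture groups.
    value = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))) or "", replacement)
    result = dict(labels)
    result[target_label] = value
    return result


PORT_ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_port"
RULE = dict(
    source_labels=["__address__", PORT_ANNOTATION],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    replacement="$1:$2",
    target_label="__address__",
)

# Pod discovered on its container port and annotated with prometheus.io/port: "8080"
annotated = relabel_replace({"__address__": "10.0.0.5:9102", PORT_ANNOTATION: "8080"}, **RULE)
# Pod without the port annotation: the regex does not match, so nothing changes
plain = relabel_replace({"__address__": "10.0.0.5:9102"}, **RULE)

print(annotated["__address__"])  # 10.0.0.5:8080
print(plain["__address__"])      # 10.0.0.5:9102
```

The optional group `(?::\d+)?` is what lets the same rule handle discovered addresses with or without a port, while the trailing `;(\d+)` makes the rule a no-op for pods that do not set the port annotation.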
8.2.2 Alerting Rules
# rules/app-alerts.yml
groups:
  - name: application
    rules:
      # Service availability
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "The service has been down for more than 1 minute"

      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # Response time
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response time is too high"
          description: "P99 latency is above 2 seconds, current: {{ $value }}s"

      # Memory usage
      # Containers without a memory limit report a limit of 0; the "> 0" filter
      # drops them so the division cannot produce +Inf and fire spuriously.
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes{job="kubernetes-pods"}
          /
          (container_spec_memory_limit_bytes{job="kubernetes-pods"} > 0)
          > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90%"
          description: "Container {{ $labels.container }} memory usage is {{ $value | humanizePercentage }}"

      # CPU usage
      # container_spec_cpu_quota is expressed in microseconds per CFS period,
      # so it is divided by container_spec_cpu_period to get the limit in cores.
      - alert: HighCPUUsage
        expr: |
          rate(container_cpu_usage_seconds_total{job="kubernetes-pods"}[5m])
          /
          (
            container_spec_cpu_quota{job="kubernetes-pods"}
            /
            container_spec_cpu_period{job="kubernetes-pods"}
          )
          * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80%"

      # Traffic anomaly
      - alert: TrafficAnomaly
        expr: |
          abs(
            rate(http_requests_total[5m])
            -
            rate(http_requests_total[5m] offset 1h)
          )
          /
          rate(http_requests_total[5m] offset 1h) > 0.5
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Traffic anomaly"
          description: "Traffic has changed by more than 50% compared with 1 hour ago"
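The `HighLatency` rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified Python model of that estimate for classic histograms: locate the bucket that contains the target rank, then interpolate linearly inside it. The bucket boundaries and counts are made-up sample data, and the real function handles more edge cases than this sketch does.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Simplified model: `buckets` must include the +Inf bucket, and 0 < q < 1.
    """
    if not 0 < q < 1:
        raise ValueError("this sketch only supports 0 < q < 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total                # position of the requested quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second bucket rates for http_request_duration_seconds_bucket
sample = [(0.5, 90), (1.0, 95), (2.0, 98), (5.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)
print(p99)      # 3.5
print(p99 > 2)  # True, so the HighLatency condition would hold
```

Because of the interpolation, the result depends on where the bucket boundaries sit: with these sample buckets every request slower than 2 s is assumed to be spread evenly between 2 s and 5 s, so the estimate is only as precise as the bucket layout around the alert threshold.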
8.3 Visualization with Grafana
8.3.1 Dashboard Configuration
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 70},
              {"color": "red", "value": 85}
            ]
          },
          "unit": "percent"
        }
      },
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "10.0.0",
      "targets": [
        {
          "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "CPU"
        }
      ],
      "title": "CPU Usage",
      "type": "gauge"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [{"color": "green", "value": null}]
          },
          "unit": "reqps"
        }
      },
      "gridPos": {"h": 8, "w": 12, "x": 6, "y": 0},
      "id": 2,
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (status)",
          "legendFormat": "{{status}}"
        }
      ],
      "title": "Request Rate",
      "type": "timeseries"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["application"],
  "templating": {
    "list": [
      {
        "current": {"selected": false, "text": "production", "value": "production"},
        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "definition": "label_values(up, env)",
        "hide": 0,
        "includeAll": false,
        "label": "Environment",
        "multi": false,
        "name": "env",
        "options": [],
        "query": {"query": "label_values(up, env)", "refId": "StandardVariableQuery"},
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      }
    ]
  },
  "time": {"from": "now-6h", "to": "now"},
  "timepicker": {},
  "timezone": "",
  "title": "Application Monitoring Dashboard",
  "uid": "app-monitor",
  "version": 1,
  "weekStart": ""
}
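Dashboard JSON of this size is easy to break when edited by hand, so a small check before importing it can help. The following Python sketch is one possible sanity check, not an official Grafana tool: it looks for duplicate panel `id` values, panels that extend past Grafana's 24-column grid, and panels without query targets. The function name `check_dashboard` and the trimmed-down sample data are assumptions made for illustration; a real file could be loaded with `json.load` first.

```python
GRID_COLUMNS = 24  # Grafana lays panels out on a grid that is 24 columns wide


def check_dashboard(dashboard):
    """Return a list of human-readable problems found in a dashboard dict."""
    problems = []
    seen_ids = set()
    for panel in dashboard.get("panels", []):
        title = panel.get("title", "<untitled>")
        panel_id = panel.get("id")
        if panel_id in seen_ids:
            problems.append(f"duplicate panel id {panel_id} ({title})")
        seen_ids.add(panel_id)
        pos = panel.get("gridPos", {})
        if pos.get("x", 0) + pos.get("w", 0) > GRID_COLUMNS:
            problems.append(f"panel {title!r} extends past column {GRID_COLUMNS}")
        if not panel.get("targets"):
            problems.append(f"panel {title!r} has no query targets")
    return problems


# Trimmed-down version of the two panels defined in this section
dashboard = {
    "title": "Application Monitoring Dashboard",
    "panels": [
        {
            "id": 1,
            "title": "CPU Usage",
            "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
            "targets": [{"expr": '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'}],
        },
        {
            "id": 2,
            "title": "Request Rate",
            "gridPos": {"h": 8, "w": 12, "x": 6, "y": 0},
            "targets": [{"expr": "sum(rate(http_requests_total[5m])) by (status)"}],
        },
    ],
}

print(check_dashboard(dashboard))  # []
```

The two panels occupy columns 0 to 5 and 6 to 17 of the same row, so the check passes and six columns remain free for a third panel.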
8.4 Alertmanager Configuration
# alertmanager.yml
# Note: Alertmanager does not expand environment variables in its configuration.
# The ${...} placeholders below must be substituted before the file is loaded,
# for example by a templating step in the deployment pipeline.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: '${SMTP_PASSWORD}'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts: notify immediately
    - match:
        severity: critical
      receiver: 'critical-notifications'
      group_wait: 10s
      repeat_interval: 1h
    # Security-related alerts
    - match:
        severity: security
      receiver: 'security-team'
    # Route by service
    - match:
        service: api
      receiver: 'api-team'
      routes:
        - match:
            severity: warning
          receiver: 'api-team-warning'
        - match:
            severity: critical
          receiver: 'api-team-critical'

receivers:
  - name: 'default'
    email_configs:
      - to: 'devops@example.com'
        headers:
          subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
  - name: 'critical-notifications'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#critical-alerts'
        send_resolved: true
        # Double quotes so that \n becomes a real line break in the title
        title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        # Slack marks bold text with single asterisks
        text: |
          {{ range .Alerts }}
          *{{ .Status }}* {{ .Labels.alertname }}
          {{ .Annotations.description }}
          {{ if .Annotations.runbook_url }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
          {{ end }}
    pagerduty_configs:
      - service_key: '${PAGERDUTY_KEY}'
        severity: critical
    webhook_configs:
      - url: '${WEBHOOK_URL}'
        send_resolved: true
  # The receivers 'security-team', 'api-team', 'api-team-warning', and
  # 'api-team-critical' referenced in the routes above must also be defined
  # here; they are omitted from this example.

inhibit_rules:
  # When a whole service is down, suppress its other related alerts
  - source_match:
      alertname: ServiceDown
      severity: critical
    target_match:
      severity: warning
    equal: ['service']
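The order of the entries under `routes` matters, because Alertmanager walks the routing tree top-down and, unless a route sets `continue: true`, stops at the first sibling that matches. The Python sketch below is a simplified model of that traversal, covering only equality matchers, receiver inheritance, and `continue`. The function `resolve_receivers` is hypothetical, and the tree is a hand-copied version of the `route` section of this chapter's configuration.

```python
def resolve_receivers(route, labels, inherited_receiver=None):
    """Return the receivers an alert with `labels` would be routed to (simplified)."""
    receiver = route.get("receiver", inherited_receiver)
    matched = []
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            matched.extend(resolve_receivers(child, labels, receiver))
            # Without `continue: true`, the first matching sibling ends the search.
            if not child.get("continue", False):
                break
    # If no child route matched, the current node handles the alert itself.
    return matched or [receiver]


ROUTE_TREE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-notifications"},
        {"match": {"severity": "security"}, "receiver": "security-team"},
        {
            "match": {"service": "api"},
            "receiver": "api-team",
            "routes": [
                {"match": {"severity": "warning"}, "receiver": "api-team-warning"},
                {"match": {"severity": "critical"}, "receiver": "api-team-critical"},
            ],
        },
    ],
}

print(resolve_receivers(ROUTE_TREE, {"service": "api", "severity": "warning"}))
# ['api-team-warning']
print(resolve_receivers(ROUTE_TREE, {"service": "api", "severity": "critical"}))
# ['critical-notifications']
print(resolve_receivers(ROUTE_TREE, {"service": "db", "severity": "info"}))
# ['default']
```

Under this model a critical alert for the `api` service never reaches `api-team-critical`, because the top-level `severity: critical` route matches first. If the API team should also be notified, one option is to set `continue: true` on the first route, and another is to move the service route above it.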
8.5 Chapter Summary
| Component | Role | Key configuration |
|---|---|---|
| Prometheus | Metric collection and storage | scrape_interval, alerting rules |
| Grafana | Visualization and analysis | Dashboard, Panel |
| Alertmanager | Alert routing and notification | route, receiver |
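📌 Next Chapter Preview
The next chapter covers log management and analysis, including:
- ELK Stack architecture
- Log collection and aggregation
- Log querying and analysis
💡 Tip: Monitoring should follow the four "golden signals": latency, traffic, errors, and saturation.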
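To make the four golden signals concrete, the sketch below derives them from a batch of request records in plain Python. It is an illustration under stated assumptions rather than a production collector: the `RequestRecord` type, the nearest-rank P99, and the use of in-flight requests divided by capacity as the saturation measure are all choices made for this example, and the sample numbers are invented.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP status code


def golden_signals(records, window_s, in_flight, capacity):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if not records:
        raise ValueError("need at least one request record")
    durations = sorted(r.duration_s for r in records)
    # Nearest-rank P99: smallest value with at least 99% of requests at or below it
    rank = math.ceil(99 * len(durations) / 100)
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "latency_p99_s": durations[rank - 1],     # latency
        "traffic_rps": len(records) / window_s,   # traffic
        "error_ratio": errors / len(records),     # errors
        "saturation": in_flight / capacity,       # saturation (assumed measure)
    }


# 95 fast successful requests and 5 slow server errors observed over 50 seconds
records = [RequestRecord(0.1, 200)] * 95 + [RequestRecord(3.0, 500)] * 5
signals = golden_signals(records, window_s=50, in_flight=8, capacity=10)
print(signals)
# {'latency_p99_s': 3.0, 'traffic_rps': 2.0, 'error_ratio': 0.05, 'saturation': 0.8}
```

In this chapter's stack the same four quantities map roughly onto the `HighLatency` rule, the request rate panel and `TrafficAnomaly` rule, the `HighErrorRate` rule, and the CPU and memory usage rules.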