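Chapter 12: Monitoring and Alerting
Learn how to configure Elasticsearch monitoring, including cluster monitoring, index monitoring, Kibana dashboards, and alert rules.
Last updated: 2024-01-15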
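12.1 Monitoring Overview
12.1.1 Monitoring Metrics
Elasticsearch monitoring metrics
├── Cluster level
│   ├── Node count
│   ├── Shard distribution
│   ├── Cluster health status
│   └── Cluster throughput
├── Node level
│   ├── CPU usage
│   ├── Memory usage
│   ├── Disk usage
│   └── JVM heap
├── Index level
│   ├── Document count
│   ├── Index size
│   ├── Shard count
│   └── Refresh time
└── Search/indexing performance
    ├── Query latency
    ├── Indexing latency
    ├── Search throughput
    └── Indexing throughput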
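12.1.2 Monitoring Tools
| Tool | Description |
|---|---|
| Kibana Monitoring | Official visual monitoring (Stack Monitoring) |
| Metricbeat | Collects system and service metrics |
| Curator | Index lifecycle management |
| Watcher | Alerting |
| Prometheus | Metrics export (through an Elasticsearch exporter) |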
12.2 Kibana Monitoring
12.2.1 Enabling Monitoring
# elasticsearch.yml
# xpack.monitoring.enabled has been deprecated since 7.8 and is rejected by 8.x, so it is omitted here
xpack.monitoring.collection.enabled: true
xpack.monitoring.collection.interval: 10s
# kibana.yml
monitoring.ui.enabled: true
12.2.2 Monitoring Interface
┌─────────────────────────────────────────────────────────┐
│ Kibana Stack Monitoring │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Cluster │ │ Nodes │ │ Indices │ │
│ │ Overview │ │ Listing │ │ Listing │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐│
│ │ Time Series Charts ││
│ │ - CPU Usage ││
│ │ - Memory Usage ││
│ │ - Search/Index Rate ││
│ └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
12.3 Metricbeat
12.3.1 Installing Metricbeat
# Download and unpack
curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.12.0-linux-x86_64.tar.gz
tar xzvf metricbeat-8.12.0-linux-x86_64.tar.gz
# The archive unpacks into a directory that carries the platform suffix
cd metricbeat-8.12.0-linux-x86_64
# Enable the Elasticsearch module
./metricbeat modules enable elasticsearch
# Edit the configuration
vim metricbeat.yml
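12.3.2 Metricbeat Configuration
# metricbeat.yml
metricbeat.modules:
  - module: elasticsearch
    metricsets:
      - node
      - node_stats
      - cluster_stats
      - index
      - index_summary
      - shard
    period: 10s
    hosts: ["http://localhost:9200"]
    username: "elastic"
    password: "password"  # placeholder; use your own credentials

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "metricbeat-%{+yyyy.MM.dd}"

# A custom index name requires a matching template name and pattern
setup.template.name: "metricbeat"
setup.template.pattern: "metricbeat-*"

setup.kibana:
  host: "http://localhost:5601"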
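12.3.3 Starting Metricbeat
# Test the configuration
./metricbeat test config
# Load the index template (run once, before the first start)
./metricbeat setup --index-management
# Start in the foreground, logging to stderr
./metricbeat -e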
12.4 Cluster Health Checks
12.4.1 Health API
# Cluster health
GET /_cluster/health
# Wait up to 30s for the cluster to reach green
GET /_cluster/health?wait_for_status=green&timeout=30s
# Per-index health
GET /_cluster/health?level=indices
# Per-shard health
GET /_cluster/health?level=shards
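12.4.2 Health Check Script
#!/bin/bash
# check_cluster_health.sh
# Exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
CLUSTER_URL="http://localhost:9200"
STATUS=$(curl -s "${CLUSTER_URL}/_cluster/health" | jq -r '.status')
if [ "$STATUS" = "red" ]; then
  echo "CRITICAL: Cluster health is RED"
  exit 2
elif [ "$STATUS" = "yellow" ]; then
  echo "WARNING: Cluster health is YELLOW"
  exit 1
elif [ "$STATUS" = "green" ]; then
  echo "OK: Cluster health is GREEN"
  exit 0
else
  # Empty or unexpected status: the request or the JSON parsing failed
  echo "UNKNOWN: Could not read cluster health from ${CLUSTER_URL}"
  exit 3
fi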
12.5 Node Monitoring
12.5.1 Node Statistics
# Node overview
GET /_cat/nodes?v
# Node statistics
GET /_nodes/stats
# JVM statistics
GET /_nodes/stats/jvm
# Thread pool statistics
GET /_nodes/stats/thread_pool
12.5.2 Key Metrics
| Metric | Warning threshold | Critical threshold |
|---|---|---|
| CPU usage | >70% | >90% |
| JVM heap usage | >70% | >85% |
| Disk usage | >75% | >90% |
| Shard count | >1000 per node | >2000 per node |
| Search latency (P99) | >500ms | >1000ms |
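A two-level threshold table like this maps naturally onto a small helper that a check script can reuse for every metric. The following is a minimal sketch, not part of any Elastic tooling: the function name `classify` and the sample readings are hypothetical, and the comparison is delegated to awk so that decimal readings work.

```shell
#!/bin/bash
# classify VALUE WARN CRIT -> prints OK, WARNING, or CRITICAL.
# (Hypothetical helper.) awk performs the comparison so that decimal
# values such as 78.4 are handled, which the shell's -gt cannot do.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "CRITICAL"
    else if (v + 0 > w + 0) print "WARNING"
    else                    print "OK"
  }'
}

# Made-up readings checked against the CPU, heap, and disk thresholds.
echo "cpu:  $(classify 65 70 90)"
echo "heap: $(classify 78.4 70 85)"
echo "disk: $(classify 93 75 90)"
```

Because the thresholds are strict ("greater than"), a reading exactly equal to the critical threshold is still reported as WARNING.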
12.6 Index Monitoring
12.6.1 Index Statistics
# All indices
GET /_cat/indices?v
# Statistics for all indices
GET /_stats
# Statistics for a specific index
GET /my-index/_stats
# Index recovery progress
GET /_cat/recovery?v
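The `_cat` APIs return whitespace-separated text, so their output is easy to filter in a shell pipeline. As a rough sketch, the hypothetical function `non_green_indices` below reads `_cat/indices?h=health,index` output on standard input and prints only the indices that are not green. The sample rows are invented for illustration.

```shell
#!/bin/bash
# non_green_indices: reads "health index" rows on stdin and prints
# "index health" for every index whose health is not green.
# (Hypothetical helper; the column order is set by the h= parameter.)
non_green_indices() {
  awk '$1 != "green" { print $2, $1 }'
}

# Invented sample rows. Against a live cluster, pipe curl into the function:
#   curl -s "localhost:9200/_cat/indices?h=health,index" | non_green_indices
sample='green  logs-2024.01.14
yellow logs-2024.01.15
red    orders'

printf '%s\n' "$sample" | non_green_indices
```

Requesting explicit columns with `h=` keeps the field positions stable, which a filter that relies on `$1` and `$2` depends on.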
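12.6.2 Shard Monitoring
# Shard list
GET /_cat/shards?v
# Shard state with the unassigned reason (rows in state UNASSIGNED are the unassigned shards)
GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason&v
# Per-shard document count, store size, and segment count
GET /_cat/shards?v&h=index,shard,prirep,docs,store,segments.count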
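A `_cat/shards` request with the columns `index,shard,prirep,state,unassigned.reason` returns every shard, assigned or not, so isolating the unassigned ones takes one more filtering step. The sketch below assumes that column order. The function name `unassigned_shards` and the sample rows are hypothetical.

```shell
#!/bin/bash
# unassigned_shards: reads "index shard prirep state [reason]" rows on stdin
# and prints one line per shard in state UNASSIGNED.
# (Hypothetical helper; a missing reason is printed as "-".)
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1 "[" $2 "]", $3, ($5 == "" ? "-" : $5) }'
}

# Invented sample rows. Against a live cluster:
#   curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
#     | unassigned_shards
sample='orders 0 p STARTED
orders 0 r UNASSIGNED NODE_LEFT
logs   1 r UNASSIGNED INDEX_CREATED'

printf '%s\n' "$sample" | unassigned_shards
```

Piping the result through `wc -l` gives a count that can feed an alert condition.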
12.7 Watcher Alerts
12.7.1 Creating a Watch
# High CPU usage alert
PUT /_watcher/watch/high_cpu_usage
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "http": {
      "request": {
        "url": "http://localhost:9200/_nodes/stats/os",
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "password"
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.nodes.values().stream().anyMatch(node -> node.os.cpu.percent > 90);"
    }
  },
  "actions": {
    "log_alert": {
      "logging": {
        "level": "warn",
        "text": "High CPU usage detected on nodes"
      }
    },
    "webhook_alert": {
      "webhook": {
        "method": "post",
        "url": "http://your-webhook-endpoint/alert",
        "headers": { "Content-Type": "application/json" },
        "body": "{\"message\": \"High CPU usage detected\", \"timestamp\": \"{{ctx.execution_time}}\"}"
      }
    }
  }
}
12.7.2 Shard Alert
# When security is enabled, add an "auth" block to the input as in 12.7.1
PUT /_watcher/watch/unassigned_shards
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "http": {
      "request": {
        "url": "http://localhost:9200/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.unassigned_shards": {
        "gt": 0
      }
    }
  },
  "actions": {
    "slack_alert": {
      "webhook": {
        "method": "post",
        "url": "https://hooks.slack.com/services/xxx",
        "headers": { "Content-Type": "application/json" },
        "body": "{\"text\": \"Unassigned shards detected: {{ctx.payload.unassigned_shards}}\"}"
      }
    }
  }
}
12.8 Kibana Dashboards
12.8.1 Built-in Dashboards
Metricbeat Elasticsearch Overview
├── [Cluster] Overview
├── [Nodes] Overview
├── [Indices] Overview
└── [Transforms] Overview
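12.8.2 Custom Dashboards
- Create visualizations
- Add them to a dashboard
- Set the refresh interval
- Configure the time range
12.8.3 Import and Export
# These are Kibana APIs: send them to Kibana (for example http://localhost:5601), not Elasticsearch, and include a kbn-xsrf header
# Export
POST /api/saved_objects/_export
{
  "objects": [
    { "type": "dashboard", "id": "dashboard-id" }
  ]
}
# Import (multipart form upload of the exported .ndjson file)
POST /api/saved_objects/_import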
12.9 Alert Rules
12.9.1 Common Alert Rules
| Alert | Condition | Severity |
|---|---|---|
| Cluster red | status=red | Critical |
| Node offline | node_count decreases | Critical |
| Low disk space | disk >85% | Warning |
| High JVM heap | heap >85% | Warning |
| High search latency | P99 >1s | Warning |
| No replicas | replicas=0 | Warning |
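Most rules in this table compare a single reading against a fixed value, but the node-offline rule compares the current node count with the previous one, so the check has to remember state between runs. One possible sketch keeps the last count in a file. The function name `node_count_dropped`, the state file, and the sample counts are all hypothetical.

```shell
#!/bin/bash
# node_count_dropped CURRENT STATE_FILE
# Returns 0 (true) when CURRENT is lower than the count saved by the previous
# call, then saves CURRENT for the next call. The first call never alerts
# because there is nothing to compare against. (Hypothetical helper.)
node_count_dropped() {
  local current=$1 state_file=$2 previous
  previous=$(cat "$state_file" 2>/dev/null)
  echo "$current" > "$state_file"
  [ -n "$previous" ] && [ "$current" -lt "$previous" ]
}

# Invented node counts from three successive polls. Against a live cluster
# the count could come from:
#   curl -s "localhost:9200/_cat/nodes?h=name" | wc -l
state=$(mktemp)
for count in 3 3 2; do
  if node_count_dropped "$count" "$state"; then
    echo "CRITICAL: node count dropped to $count"
  else
    echo "OK: $count nodes"
  fi
done
rm -f "$state"
```

Comparing against the previous poll rather than a fixed expected size means the check keeps working after the cluster is deliberately resized, at the cost of alerting only once per drop.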
12.9.2 Alert Configuration
{
  "trigger": {
    "schedule": { "interval": "1m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["metricbeat-*"],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                { "range": { "@timestamp": { "gte": "now-1m" } } },
                { "range": { "system.cpu.user.pct": { "gte": 0.9 } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total.value": { "gt": 0 }
    }
  },
  "actions": {
    "notify": {
      "email": {
        "profile": "standard",
        "to": ["admin@example.com"],
        "subject": "High CPU Alert",
        "body": "CPU usage exceeded 90%"
      }
    }
  }
}
12.10 Operations Scripts
12.10.1 Cluster Health Check Script
#!/bin/bash
# cluster_monitor.sh
ES_HOST="http://localhost:9200"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"

check_cluster() {
  RESPONSE=$(curl -s -u elastic:password "${ES_HOST}/_cluster/health")
  STATUS=$(echo "$RESPONSE" | jq -r '.status')
  NODE_COUNT=$(echo "$RESPONSE" | jq -r '.number_of_nodes')
  # The cluster health field is named unassigned_shards
  UNASSIGNED=$(echo "$RESPONSE" | jq -r '.unassigned_shards // 0')
  if [ -z "$STATUS" ]; then
    # No parsable response: the cluster is unreachable or returned an error
    send_alert "Critical" "Cluster health request failed"
    return
  elif [ "$STATUS" = "red" ]; then
    send_alert "Critical" "Cluster status: RED"
  elif [ "$STATUS" = "yellow" ]; then
    send_alert "Warning" "Cluster status: YELLOW"
  fi
  if [ "${UNASSIGNED:-0}" -gt 0 ]; then
    send_alert "Warning" "Unassigned shards: $UNASSIGNED"
  fi
}

check_nodes() {
  RESPONSE=$(curl -s -u elastic:password "${ES_HOST}/_nodes/stats/jvm,fs")
  # Inspect per-node JVM and filesystem stats here
}

send_alert() {
  SEVERITY=$1
  MESSAGE=$2
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"[$SEVERITY] ES Alert: $MESSAGE\"}"
}

# Poll once a minute
while true; do
  check_cluster
  check_nodes
  sleep 60
done
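A polling loop that runs every 60 seconds sends the same alert on every pass for as long as the condition holds. One way to limit that, sketched here as an assumption rather than as part of the script above, is a per-alert cooldown backed by timestamp files. The function name `should_alert`, its arguments, and the timeline are hypothetical.

```shell
#!/bin/bash
# should_alert KEY COOLDOWN_SECONDS STATE_DIR [NOW]
# Returns 0 when no alert for KEY was recorded within the cooldown window,
# and records NOW as the last alert time. NOW defaults to the current epoch
# time; passing it explicitly makes the function easy to exercise.
# (Hypothetical helper.)
should_alert() {
  local key=$1 cooldown=$2 dir=$3 now=${4:-$(date +%s)}
  local stamp="$dir/$key.last" last=0
  [ -f "$stamp" ] && last=$(cat "$stamp")
  if [ $((now - last)) -ge "$cooldown" ]; then
    echo "$now" > "$stamp"
    return 0
  fi
  return 1
}

# Invented timeline: the same alert evaluated at t=1000, 1060 and 1400
# with a 300-second cooldown.
dir=$(mktemp -d)
for t in 1000 1060 1400; do
  if should_alert cluster_yellow 300 "$dir" "$t"; then
    echo "t=$t send"
  else
    echo "t=$t suppressed"
  fi
done
rm -rf "$dir"
```

In a monitoring loop, the call would wrap each alert, for example `should_alert cluster_yellow 300 /tmp/es-alerts && send_alert ...`, with a separate key per condition so that one alert does not silence another.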
12.10.2 Disk Space Monitoring
#!/bin/bash
# disk_monitor.sh
THRESHOLD=80
# _cat/nodes prints one "name disk.used_percent" row per node, separated by whitespace
curl -s "localhost:9200/_cat/nodes?h=name,disk.used_percent" | while read -r NODE_NAME USAGE; do
  # disk.used_percent is a decimal such as 45.32; compare its integer part
  if [ -n "$USAGE" ] && [ "${USAGE%.*}" -gt "$THRESHOLD" ]; then
    echo "Alert: Node $NODE_NAME disk usage at ${USAGE}%"
  fi
done
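12.11 Summary
This chapter covered Elasticsearch monitoring and alerting configuration, including the use of Kibana Monitoring, Metricbeat, and Watcher. A thorough monitoring setup is the foundation for keeping a cluster running reliably.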