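Chapter 12: Monitoring and Alerting

12.1 Monitoring Overview

12.1.1 Monitoring Metrics

Elasticsearch monitoring metrics
├── Cluster level
│   ├── Node count
│   ├── Shard distribution
│   ├── Cluster health status
│   └── Cluster throughput
├── Node level
│   ├── CPU usage
│   ├── Memory usage
│   ├── Disk usage
│   └── JVM heap
├── Index level
│   ├── Document count
│   ├── Index size
│   ├── Shard count
│   └── Refresh time
└── Search/indexing performance
    ├── Query latency
    ├── Indexing latency
    ├── Search throughput
    └── Indexing throughput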

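12.1.2 Monitoring Tools

Tool	Description
Kibana Monitoring	Official visual monitoring UI
Metricbeat	System and service metrics collection
Curator	Index lifecycle management
Watcher	Alerting
Prometheus	Metrics export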

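12.2 Kibana Monitoring

12.2.1 Enabling Monitoring

# elasticsearch.yml
# Note: xpack.monitoring.enabled was deprecated in 7.8 and removed in 8.0.
# Set it only on older 7.x clusters; an 8.x node rejects unknown settings.
xpack.monitoring.collection.enabled: true
xpack.monitoring.collection.interval: 10s

# Kibana configuration
# kibana.yml
monitoring.ui.enabled: true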

12.2.2 Monitoring UI

┌─────────────────────────────────────────────────────────┐
│  Kibana Stack Monitoring                                 │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Cluster    │  │  Nodes       │  │  Indices     │     │
│  │  Overview   │  │  Listing     │  │  Listing     │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐│
│  │  Time Series Charts                                 ││
│  │  - CPU Usage                                         ││
│  │  - Memory Usage                                      ││
│  │  - Search/Index Rate                                 ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

12.3 Metricbeat

12.3.1 Installing Metricbeat

# Download and unpack
curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.12.0-linux-x86_64.tar.gz
tar xzvf metricbeat-8.12.0-linux-x86_64.tar.gz
# The archive unpacks into a directory named after the full file name
cd metricbeat-8.12.0-linux-x86_64

# Enable the Elasticsearch module
./metricbeat modules enable elasticsearch

# Edit the configuration
vim metricbeat.yml

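12.3.2 Metricbeat Configuration

# metricbeat.yml
metricbeat.modules:
- module: elasticsearch
  metricsets:
    - node
    - node_stats
    - cluster_stats
    - index
    - index_summary
    - shard
  period: 10s
  hosts: ["http://localhost:9200"]
  username: "elastic"
  password: "password"

output.elasticsearch:
  hosts: ["localhost:9200"]
  # The output needs credentials as well when security is enabled
  username: "elastic"
  password: "password"
  index: "metricbeat-%{+yyyy.MM.dd}"

# A custom index name requires a matching template name and pattern
setup.template.name: "metricbeat"
setup.template.pattern: "metricbeat-*"

setup.kibana:
  host: "http://localhost:5601"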

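12.3.3 Starting Metricbeat

# Test the configuration
./metricbeat test config

# Load the index template (run once, before the first start)
./metricbeat setup --index-management

# Start in the foreground and log to stderr
./metricbeat -e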

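12.4 Cluster Health Checks

12.4.1 Health Status API

# Cluster health
GET /_cluster/health

# Wait up to 30 seconds for the cluster to reach green
GET /_cluster/health?wait_for_status=green&timeout=30s

# Per-index health
GET /_cluster/health?level=indices

# Per-shard health
GET /_cluster/health?level=shards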

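12.4.2 Health Check Script

#!/bin/bash
# check_cluster_health.sh
# Exit codes follow the Nagios plugin convention:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

CLUSTER_URL="http://localhost:9200"
# --max-time keeps the check from hanging on an unresponsive node
STATUS=$(curl -s --max-time 10 "${CLUSTER_URL}/_cluster/health" | jq -r '.status')

if [ "$STATUS" = "red" ]; then
  echo "CRITICAL: Cluster health is RED"
  exit 2
elif [ "$STATUS" = "yellow" ]; then
  echo "WARNING: Cluster health is YELLOW"
  exit 1
elif [ "$STATUS" = "green" ]; then
  echo "OK: Cluster health is GREEN"
  exit 0
else
  # Empty or unexpected value: the request failed or did not return JSON
  echo "UNKNOWN: Could not determine cluster health"
  exit 3
fi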

12.5 Node Monitoring

12.5.1 Node Statistics

# Node overview
GET /_cat/nodes?v

# Node statistics
GET /_nodes/stats

# JVM statistics
GET /_nodes/stats/jvm

# Thread pool statistics
GET /_nodes/stats/thread_pool

12.5.2 Key Metrics

Metric	Warning threshold	Critical threshold
CPU usage	>70%	>90%
JVM heap usage	>70%	>85%
Disk usage	>75%	>90%
Shard count	>1000 per node	>2000 per node
Search latency (P99)	>500 ms	>1000 ms
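
The CPU and JVM heap thresholds in the table can be applied mechanically to the output of the cat nodes API. The following is a minimal sketch, assuming input lines of the form `name heap.percent cpu` as returned by `GET /_cat/nodes?h=name,heap.percent,cpu`. The function name `classify_nodes` and the sample node names are hypothetical, and node names are assumed to contain no spaces.

```shell
#!/bin/bash
# classify_nodes: read "name heap_percent cpu_percent" lines on stdin and
# print one severity line per node, using the thresholds from the table
# (CPU: warning >70, critical >90; JVM heap: warning >70, critical >85).
# The function name is hypothetical; it is not part of any Elastic tooling.
classify_nodes() {
  awk '
    NF >= 3 {
      name = $1; heap = $2 + 0; cpu = $3 + 0
      level = "OK"
      if (cpu > 70 || heap > 70) level = "WARNING"
      if (cpu > 90 || heap > 85) level = "CRITICAL"
      printf "%s %s heap=%s%% cpu=%s%%\n", level, name, heap, cpu
    }'
}

# Example with made-up sample data (no cluster required)
printf '%s\n' "node-1 45 12" "node-2 78 30" "node-3 60 95" | classify_nodes
```

Against a live cluster, the same function could be fed with `curl -s 'localhost:9200/_cat/nodes?h=name,heap.percent,cpu' | classify_nodes`.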

12.6 Index Monitoring

12.6.1 Index Statistics

# All indices
GET /_cat/indices?v

# Statistics for all indices
GET /_stats

# Statistics for a specific index
GET /my-index/_stats

# Shard recovery progress
GET /_cat/recovery?v
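
When a cluster holds many indices, it helps to filter the cat indices output down to the ones that need attention. The following is a minimal sketch, assuming input from `GET /_cat/indices?h=health,index` without the header row. The function name `non_green_indices` and the sample index names are hypothetical.

```shell
#!/bin/bash
# non_green_indices: read "health index" lines on stdin and print every
# index whose health is not green. Lines with fewer than two fields are skipped.
non_green_indices() {
  awk 'NF >= 2 && $1 != "green" { print $1, $2 }'
}

# Example with made-up sample rows (no cluster required)
non_green_indices <<'EOF'
green  logs-2024.01.01
yellow metrics-000001
red    orders
EOF
```

Against a live cluster, the input could come from `curl -s 'localhost:9200/_cat/indices?h=health,index' | non_green_indices`.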

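12.6.2 Shard Monitoring

# Shard list
GET /_cat/shards?v

# Unassigned shards and the reason (look for rows with state UNASSIGNED)
GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason&v

# Segment count per shard
GET /_cat/shards?h=index,shard,prirep,segments.count&v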
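
The unassigned-shard listing is easier to act on when grouped by reason. The following is a minimal sketch, assuming input from `GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason` without the header row. The function name `unassigned_summary` and the sample rows are hypothetical.

```shell
#!/bin/bash
# unassigned_summary: read "index shard prirep state [reason]" lines on stdin
# and print the number of UNASSIGNED shards per reason, sorted by reason.
# Rows without a reason column are counted under UNKNOWN.
unassigned_summary() {
  awk '
    $4 == "UNASSIGNED" {
      reason = ($5 == "") ? "UNKNOWN" : $5
      count[reason]++
    }
    END {
      for (r in count) printf "%s %d\n", r, count[r]
    }' | sort
}

# Example with made-up sample rows (no cluster required)
unassigned_summary <<'EOF'
logs-2024 0 p STARTED
logs-2024 0 r UNASSIGNED NODE_LEFT
logs-2024 1 r UNASSIGNED NODE_LEFT
metrics   0 r UNASSIGNED INDEX_CREATED
EOF
```

For a single problematic shard, the cluster allocation explain API (`GET /_cluster/allocation/explain`) gives a more detailed answer than the reason code alone.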

12.7 Watcher Alerting

12.7.1 Creating a Watch

# CPU usage alert
PUT /_watcher/watch/high_cpu_usage
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "http": {
      "request": {
        "url": "http://localhost:9200/_nodes/stats",
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "password"
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.nodes.values().stream().anyMatch(n -> n.os.cpu.percent > 90);"
    }
  },
  "actions": {
    "log_alert": {
      "logging": {
        "level": "warn",
        "text": "High CPU usage detected on nodes"
      }
    },
    "webhook_alert": {
      "webhook": {
        "method": "POST",
        "url": "http://your-webhook-endpoint/alert",
        "headers": {
          "Content-Type": "application/json"
        },
        "body": "{\"message\": \"High CPU usage detected\", \"timestamp\": \"{{ctx.execution_time}}\"}"
      }
    }
  }
}

12.7.2 Shard Alert

PUT /_watcher/watch/unassigned_shards
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "http": {
      "request": {
        "url": "http://localhost:9200/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.unassigned_shards": {
        "gt": 0
      }
    }
  },
  "actions": {
    "slack_alert": {
      "webhook": {
        "method": "POST",
        "url": "https://hooks.slack.com/services/xxx",
        "body": "{\"text\": \"Unassigned shards detected: {{ctx.payload.unassigned_shards}}\"}"
      }
    }
  }
}
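
A stored watch can be run on demand to check its condition and actions without waiting for the schedule. The following sketch wraps the Watcher execute watch API (`POST _watcher/watch/<watch_id>/_execute`) in a hypothetical helper named `execute_watch`. The variables `ES_URL`, `ES_USER`, `ES_PASS`, and `DRY_RUN` are assumptions made for this example, not settings defined by Elasticsearch.

```shell
#!/bin/bash
# execute_watch WATCH_ID: trigger a stored watch immediately through the
# Watcher execute watch API. With DRY_RUN=1 the function only prints the
# request it would send, which is useful for checking the URL first.
# ES_URL, ES_USER, and ES_PASS are assumed environment variables.
ES_URL="${ES_URL:-http://localhost:9200}"

execute_watch() {
  local watch_id="$1"
  local url="${ES_URL}/_watcher/watch/${watch_id}/_execute"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -s -X POST ${url}"
    return 0
  fi
  curl -s -u "${ES_USER:-elastic}:${ES_PASS:?set ES_PASS}" -X POST "${url}"
}

# Print the request for the watch defined above without sending it
DRY_RUN=1 execute_watch unassigned_shards
```

Running `ES_PASS=... execute_watch unassigned_shards` without `DRY_RUN` sends the request. The response contains a `watch_record` whose `result` section shows whether the condition was met and what each action did.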

12.8 Kibana Dashboards

12.8.1 Built-in Dashboards

Metricbeat Elasticsearch Overview
├── [Cluster] Overview
├── [Nodes] Overview
├── [Indices] Overview
└── [Transforms] Overview

12.8.2 Custom Dashboards

1. Create the visualizations
2. Add them to a dashboard
3. Set the refresh interval
4. Configure the time range

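12.8.3 Import and Export

# These are Kibana APIs: send them to Kibana (default port 5601), not to
# Elasticsearch, and include the request header "kbn-xsrf: true".

# Export (the response is an NDJSON file)
POST /api/saved_objects/_export
{
  "objects": [
    { "type": "dashboard", "id": "dashboard-id" }
  ]
}

# Import (multipart/form-data upload, exported NDJSON in the "file" field)
POST /api/saved_objects/_import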
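
The export and import requests can be scripted with curl. The following is a minimal sketch: the helper names `export_body`, `export_dashboards`, and `import_dashboards`, as well as the variables `KIBANA_URL` and `KIBANA_AUTH`, are hypothetical and chosen only for this example.

```shell
#!/bin/bash
# Hypothetical helpers around the Kibana saved objects export and import APIs.
# KIBANA_URL and KIBANA_AUTH are assumed variables, not defined by Kibana.
KIBANA_URL="${KIBANA_URL:-http://localhost:5601}"
KIBANA_AUTH="${KIBANA_AUTH:-elastic:password}"

# export_body ID...: print the JSON request body for the given dashboard IDs.
export_body() {
  local sep="" id
  printf '{"objects":['
  for id in "$@"; do
    printf '%s{"type":"dashboard","id":"%s"}' "$sep" "$id"
    sep=","
  done
  printf ']}\n'
}

# export_dashboards OUTFILE ID...: save the dashboards to OUTFILE as NDJSON.
export_dashboards() {
  local outfile="$1"; shift
  curl -s -u "$KIBANA_AUTH" -X POST "${KIBANA_URL}/api/saved_objects/_export" \
    -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
    -d "$(export_body "$@")" -o "$outfile"
}

# import_dashboards INFILE: upload a previously exported NDJSON file.
import_dashboards() {
  curl -s -u "$KIBANA_AUTH" -X POST "${KIBANA_URL}/api/saved_objects/_import" \
    -H 'kbn-xsrf: true' --form "file=@$1"
}

# Show the request body that would be sent for one dashboard
export_body dashboard-id
```

A typical round trip would be `export_dashboards dashboards.ndjson dashboard-id` on the source instance, followed by `import_dashboards dashboards.ndjson` with `KIBANA_URL` pointing at the target instance.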

12.9 Alert Rules

12.9.1 Common Alert Rules

Alert	Condition	Severity
Cluster red	status=red	Critical
Node offline	node_count decreases	Critical
Low disk space	disk >85%	Warning
High JVM heap	heap >85%	Warning
High search latency	P99 >1s	Warning
No replicas	replicas=0	Warning
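
The "node offline" rule differs from the others in that it compares the current value with an earlier one instead of a fixed threshold. The following is a minimal sketch of that comparison using a state file. The function name `check_node_count` and the state file location are hypothetical; the current count is assumed to come from the `number_of_nodes` field of `GET /_cluster/health`.

```shell
#!/bin/bash
# check_node_count CURRENT STATE_FILE: compare the current node count with
# the count recorded on the previous run, then record the current value.
# Prints a status line and returns 2 when the count dropped, 0 otherwise.
check_node_count() {
  local current="$1" state_file="$2" previous=""
  if [ -s "$state_file" ]; then
    previous=$(cat "$state_file")
  fi
  echo "$current" > "$state_file"
  if [ -n "$previous" ] && [ "$current" -lt "$previous" ]; then
    echo "CRITICAL: node count dropped from $previous to $current"
    return 2
  fi
  echo "OK: node count is $current"
  return 0
}

# Example with made-up counts (no cluster required)
STATE_FILE=$(mktemp)
check_node_count 3 "$STATE_FILE"
check_node_count 2 "$STATE_FILE" || true
rm -f "$STATE_FILE"
```

Because the state file is overwritten on every run, a drop is reported once, on the run that first observes it. Whether that is the desired behavior depends on how the alert is consumed.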

12.9.2 Alert Configuration

{
  "trigger": {
    "schedule": { "interval": "1m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["metricbeat-*"],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                { "range": { "@timestamp": { "gte": "now-1m" } } },
                { "range": { "system.cpu.user.pct": { "gte": 0.9 } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": { "gt": 0 }
    }
  },
  "actions": {
    "notify": {
      "email": {
        "profile": "standard",
        "to": ["admin@example.com"],
        "subject": "High CPU Alert",
        "body": "CPU usage exceeded 90%"
      }
    }
  }
}

12.10 Operations Scripts

12.10.1 Cluster Health Check Script

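#!/bin/bash
# cluster_monitor.sh
# Polls cluster health and node stats every 60 seconds and posts alerts to Slack.

ES_HOST="http://localhost:9200"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"

send_alert() {
  SEVERITY=$1
  MESSAGE=$2
  curl -s -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"[$SEVERITY] ES Alert: $MESSAGE\"}"
}

check_cluster() {
  RESPONSE=$(curl -s -u elastic:password "${ES_HOST}/_cluster/health")
  if [ -z "$RESPONSE" ]; then
    send_alert "Critical" "No response from ${ES_HOST}"
    return
  fi
  STATUS=$(echo "$RESPONSE" | jq -r '.status')
  NODE_COUNT=$(echo "$RESPONSE" | jq -r '.number_of_nodes')
  # The health API reports this field as unassigned_shards
  UNASSIGNED=$(echo "$RESPONSE" | jq -r '.unassigned_shards // 0')

  if [ "$STATUS" = "red" ]; then
    send_alert "Critical" "Cluster status: RED"
  elif [ "$STATUS" = "yellow" ]; then
    send_alert "Warning" "Cluster status: YELLOW"
  fi

  if [ "$UNASSIGNED" -gt 0 ]; then
    send_alert "Warning" "Unassigned shards: $UNASSIGNED"
  fi
}

check_nodes() {
  RESPONSE=$(curl -s -u elastic:password "${ES_HOST}/_nodes/stats/jvm,fs")
  # Report every node whose JVM heap usage exceeds 85%
  echo "$RESPONSE" |
    jq -r '.nodes[]? | select(.jvm.mem.heap_used_percent > 85) | .name' |
    while read -r NODE; do
      send_alert "Warning" "JVM heap above 85% on node $NODE"
    done
}

while true; do
  check_cluster
  check_nodes
  sleep 60
done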

12.10.2 Disk Space Monitoring

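#!/bin/bash
# disk_monitor.sh

THRESHOLD=80

# disk.used_percent is printed as a decimal such as 45.23. The usage column
# is requested first so that node names containing spaces stay intact.
curl -s "localhost:9200/_cat/nodes?h=disk.used_percent,name" |
while read -r USAGE NODE_NAME; do
  # Compare the integer part only, because [ -gt ] does not accept decimals
  if [ -n "$USAGE" ] && [ "${USAGE%.*}" -gt "$THRESHOLD" ]; then
    echo "Alert: Node $NODE_NAME disk usage at ${USAGE}%"
  fi
done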

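12.11 Summary

This chapter covered monitoring and alerting for Elasticsearch, including the use of Kibana Monitoring, Metricbeat, and Watcher. A comprehensive monitoring setup is the foundation for keeping a cluster running reliably.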