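Chapter 9: Cluster Management
An in-depth look at Elasticsearch cluster management, including node discovery, shard allocation, failure recovery, and cluster scaling.
Last updated: 2024-01-15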
9.1 Cluster Architecture
9.1.1 Node Roles
┌─────────────────────────────────────────────────────────┐
│ Elasticsearch Cluster │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Master Node │ │ Master Node │ │ Master Node │ │
│ │ (elected) │ │ (backup) │ │ (backup) │ │
│ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ │ Cluster State │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Coordinating Nodes │ │
│ │ (request routing, result aggregation) │ │
│ └──────────────────┬──────────────────────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Data Node│ │Data Node│ │Data Node│ │
│ │ Hot │ │ Warm │ │ Cold │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ ┌─────────┐ │
│ │Ingest │ │
│ │Node │ │
│ └─────────┘ │
└─────────────────────────────────────────────────────────┘
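The diagram shows three master-eligible nodes. A minimal sketch of the majority arithmetic behind that choice follows; it is a simplification that ignores how Elasticsearch adjusts its voting configuration automatically, and the function names are illustrative only.

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of the given number of voting nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many voting nodes can be lost while a majority still exists."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this arithmetic, two master-eligible nodes tolerate no more failures than one, which is why three is the usual minimum for high availability.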
9.1.2 Node Configuration
# node.roles replaces the legacy node.master / node.data / node.ingest flags
# (deprecated in 7.9, removed in 8.0)
# Master-eligible node
node.roles: [ master ]
# Data node
node.roles: [ data ]
# Ingest node
node.roles: [ ingest ]
# Coordinating-only node (empty role list)
node.roles: [ ]
9.2 Node Discovery
9.2.1 Zen Discovery
# Discovery configuration (elasticsearch.yml)
discovery.seed_hosts:
- 192.168.1.101:9300
- 192.168.1.102:9300
- 192.168.1.103:9300
cluster.initial_master_nodes:
- node-1
- node-2
- node-3
discovery.seed_providers: file
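Entries in `discovery.seed_hosts` may omit the port, in which case the transport port is used (9300 by default). The sketch below shows one way a deployment script might normalize such entries; `parse_seed_host` is a hypothetical helper, and it assumes hostnames or IPv4 addresses only (bracketed IPv6 literals are not handled).

```python
DEFAULT_TRANSPORT_PORT = 9300  # default Elasticsearch transport port


def parse_seed_host(entry, default_port=DEFAULT_TRANSPORT_PORT):
    """Split 'host[:port]' into (host, port); IPv6 literals are not supported."""
    entry = entry.strip()
    host, sep, port = entry.rpartition(":")
    if not sep:
        # No colon present, so the whole entry is the host name.
        return entry, default_port
    return host, int(port)


if __name__ == "__main__":
    for item in ("192.168.1.101:9300", "es-svc.namespace.svc.cluster.local"):
        print(parse_seed_host(item))
```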
9.2.2 EC2 Discovery Plugin
# Install the plugin
./bin/elasticsearch-plugin install discovery-ec2
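# Configuration (elasticsearch.yml); the old cloud.aws.* settings are no longer supported
discovery.seed_providers: ec2
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
discovery.ec2.groups: sg-12345678
discovery.ec2.availability_zones: us-east-1a,us-east-1b
# Store credentials in the keystore rather than in elasticsearch.yml
./bin/elasticsearch-keystore add discovery.ec2.access_key
./bin/elasticsearch-keystore add discovery.ec2.secret_key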
9.2.3 Kubernetes Discovery
# Use the Kubernetes Service DNS name
discovery.seed_hosts:
- es-svc.namespace.svc.cluster.local
# No additional plugin is required for DNS-based discovery
# (there is no official repository-kubernetes plugin)
9.3 Shard Allocation
9.3.1 Shard Allocation Strategy
# View shard allocation per node
GET /_cat/allocation?v
# List shards with their state and, for unassigned shards, the reason
GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason&v
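As a rough illustration of how the `_cat/shards` output above could be filtered for unassigned shards, here is a small parser. The sample text is made up for the example, and the column layout is assumed to match the `h=` list in the request.

```python
# Hypothetical sample of the text returned by the _cat/shards request above.
SAMPLE = """\
index    shard prirep state      unassigned.reason
my-index 0     p      STARTED
my-index 0     r      UNASSIGNED NODE_LEFT
my-index 1     p      STARTED
"""


def unassigned_shards(cat_output):
    """Return one dict per UNASSIGNED row; the header line is skipped."""
    rows = []
    for line in cat_output.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            rows.append({
                "index": parts[0],
                "shard": int(parts[1]),
                "prirep": parts[2],
                # Assigned shards have an empty reason column.
                "reason": parts[4] if len(parts) > 4 else "",
            })
    return rows


if __name__ == "__main__":
    for row in unassigned_shards(SAMPLE):
        print(row)
```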
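9.3.2 Manual Shard Allocation
# Force-allocate an empty primary on a specific node (any existing data for this shard is lost)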
POST /_cluster/reroute
{
"commands": [
{
"allocate_empty_primary": {
"index": "my-index",
"shard": 0,
"node": "node-2",
"accept_data_loss": true
}
}
]
}
# Move a shard
POST /_cluster/reroute
{
"commands": [
{
"move": {
"index": "my-index",
"shard": 0,
"from_node": "node-1",
"to_node": "node-2"
}
}
]
}
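When several shards need to move, the reroute body can be generated rather than written by hand. The helper names below (`move_command`, `reroute_body`) are hypothetical; the JSON shape follows the `move` request shown above.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build one 'move' command for POST /_cluster/reroute."""
    return {
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }


def reroute_body(commands):
    """Wrap a sequence of commands in the reroute request body."""
    return {"commands": list(commands)}


# Move shards 0..2 of my-index from node-1 to node-2 in a single request.
body = reroute_body(move_command("my-index", s, "node-1", "node-2") for s in range(3))

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```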
9.3.3 Shard Allocation Filtering
# Filter by data tier
PUT /my-index/_settings
{
"index.routing.allocation.include._tier_preference": "data_hot"
}
# Filter by a custom node attribute
PUT /my-index/_settings
{
"index.routing.allocation.include.tag": "node-1,node-2",
"index.routing.allocation.exclude.tag": "slow"
}
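To make the include/exclude semantics concrete, the following sketch evaluates whether a node would be eligible under filters like those above. It is a simplified model (no wildcard matching, no `require` rules), and `node_allowed` is an illustrative name, not an Elasticsearch API.

```python
def _values(spec):
    """Split a comma-separated filter value into a set of exact values."""
    return {v.strip() for v in spec.split(",")}


def node_allowed(node_attrs, include=None, exclude=None):
    """Simplified allocation filter check for one node.

    exclude: the node must match none of the listed values.
    include: if given, the node must match at least one listed value.
    """
    include = include or {}
    exclude = exclude or {}
    for attr, spec in exclude.items():
        if node_attrs.get(attr) in _values(spec):
            return False
    if include:
        return any(node_attrs.get(attr) in _values(spec) for attr, spec in include.items())
    return True


if __name__ == "__main__":
    include = {"tag": "node-1,node-2"}
    exclude = {"tag": "slow"}
    for attrs in ({"tag": "node-1"}, {"tag": "slow"}, {"tag": "node-3"}):
        print(attrs, node_allowed(attrs, include, exclude))
```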
9.3.4 Allocation Awareness
# Set the rack attribute on each node (elasticsearch.yml)
node.attr.rack: rack1
# Cluster configuration (awareness is a cluster-level setting, not an index setting)
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "rack"
}
}
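Allocation awareness on the `rack` attribute aims to keep copies of the same shard in different racks. The sketch below checks a shard layout for copies that ended up together in one rack; the data structures and the name `same_rack_copies` are assumptions made for the example.

```python
from collections import defaultdict


def same_rack_copies(copies, node_rack):
    """Return (index, shard) pairs whose copies all sit in a single rack.

    copies: iterable of (index, shard, node_name)
    node_rack: mapping of node_name -> rack attribute value
    """
    racks = defaultdict(list)
    for index, shard, node in copies:
        racks[(index, shard)].append(node_rack[node])
    return sorted(
        key for key, seen in racks.items()
        if len(seen) > 1 and len(set(seen)) == 1
    )


if __name__ == "__main__":
    node_rack = {"node-1": "rack1", "node-2": "rack1", "node-3": "rack2"}
    copies = [
        ("my-index", 0, "node-1"), ("my-index", 0, "node-2"),  # both in rack1
        ("my-index", 1, "node-1"), ("my-index", 1, "node-3"),  # spread across racks
    ]
    print(same_rack_copies(copies, node_rack))
```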
9.4 Failure Recovery
9.4.1 Automatic Recovery
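# Gateway settings: when to start recovery after a full cluster restart
# (the *_data_nodes names replace recover_after_nodes / expected_nodes, removed in 8.0)
gateway.recover_after_data_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_data_nodes: 3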
# Shard recovery settings
# (concurrent_streams and concurrent_small_file_streams were removed in 5.0)
indices.recovery.max_bytes_per_sec: 100mb
indices.recovery.max_concurrent_file_chunks: 3
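The recovery rate limit puts a floor on how long a shard copy can take. A back-of-the-envelope sketch follows, assuming Elasticsearch-style size strings with 1024-based units and that the whole limit is available to a single shard (in practice it is shared by all recoveries on a node); the helper names are illustrative.

```python
# Suffixes are checked in this order so that "gb" is matched before "b".
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "b": 1}


def parse_size(text):
    """Convert strings such as '100mb' or '10gb' to bytes."""
    text = text.strip().lower()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def min_recovery_seconds(shard_size, max_bytes_per_sec):
    """Lower bound on the time needed to copy one shard at the given rate."""
    return parse_size(shard_size) / parse_size(max_bytes_per_sec)


if __name__ == "__main__":
    print(min_recovery_seconds("10gb", "100mb"))  # 10 * 1024 / 100 seconds
    print(min_recovery_seconds("10gb", "40mb"))
```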
9.4.2 Monitoring Recovery
# View recovery status
GET /_cat/recovery?v
# View detailed recovery information
GET /my-index/_recovery?detailed=true
# Example response
{
"my-index": {
"shards": [
{
"id": 0,
"type": "SNAPSHOT",
"stage": "DONE",
"primary": true,
"start_time": "2024-01-15T10:30:00Z",
"total_time": "1m",
"source": {
"repository": "my-backup",
"snapshot": "snapshot_1"
},
"target": {
"id": "node-1",
"host": "192.168.1.101",
"name": "node-1"
},
"index": {
"size": { "total": "10GB", "recovered": "10GB" },
"files": { "total": 1000, "recovered": 1000 }
}
}
]
}
}
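A response shaped like the example above can be reduced to a per-shard progress summary. This is a minimal sketch that relies only on the fields shown in the example (`id`, `stage`, `index.files`); the second shard in the sample data is invented to show a partial recovery.

```python
def recovery_progress(response):
    """Return (index, shard id, stage, percent of files recovered) per shard."""
    summary = []
    for index_name, data in response.items():
        for shard in data["shards"]:
            files = shard["index"]["files"]
            total = files["total"]
            percent = 100.0 * files["recovered"] / total if total else 100.0
            summary.append((index_name, shard["id"], shard["stage"], round(percent, 1)))
    return summary


# Sample data: shard 0 mirrors the example response, shard 1 is hypothetical.
sample = {
    "my-index": {
        "shards": [
            {"id": 0, "stage": "DONE", "index": {"files": {"total": 1000, "recovered": 1000}}},
            {"id": 1, "stage": "INDEX", "index": {"files": {"total": 1000, "recovered": 250}}},
        ]
    }
}

if __name__ == "__main__":
    for row in recovery_progress(sample):
        print(row)
```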
9.4.3 Manual Recovery
# Cancel an in-progress recovery
POST /_cluster/reroute?master_timeout=30s
{
"commands": [
{
"cancel": {
"index": "my-index",
"shard": 0,
"node": "node-2",
"allow_primary": true
}
}
]
}
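9.5 Scaling the Cluster
9.5.1 Online Scale-Out
# Shards rebalance automatically after a new node joins
# 1. Install Elasticsearch on the new node
# 2. Configure the same cluster name (cluster.name)
# 3. Configure discovery.seed_hosts
# 4. Start the node
# Verify that the node has joined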
GET /_cat/nodes?v
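9.5.2 Rolling Restart
# 1. Restrict shard allocation to primaries (stops replicas from being reallocated while a node is down)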
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}
# 2. Perform maintenance on the node
# 3. Re-enable shard allocation
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": null
}
}
# 4. Wait for recovery to complete
GET /_cluster/health?wait_for_status=green&timeout=30m
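The four steps above repeat for every node in the cluster. One way to script that loop is sketched below; `send` and `restart_node` are hypothetical callables supplied by the caller (for example a thin HTTP wrapper and an SSH command), so the sketch itself makes no network calls.

```python
ALLOCATION_KEY = "cluster.routing.allocation.enable"


def rolling_restart(nodes, send, restart_node):
    """Run the restrict / restart / re-enable / wait cycle once per node.

    send(method, path, body) performs a REST call; restart_node(name)
    performs the actual maintenance. Both are injected by the caller.
    """
    for node in nodes:
        send("PUT", "/_cluster/settings", {"persistent": {ALLOCATION_KEY: "primaries"}})
        restart_node(node)
        # None serializes to JSON null, which resets the setting to its default.
        send("PUT", "/_cluster/settings", {"persistent": {ALLOCATION_KEY: None}})
        send("GET", "/_cluster/health?wait_for_status=green&timeout=30m", None)


if __name__ == "__main__":
    def fake_send(method, path, body):
        print(method, path, body)

    def fake_restart(name):
        print("restarting", name)

    rolling_restart(["node-1", "node-2"], fake_send, fake_restart)
```

Waiting for green before moving to the next node keeps at most one node's shards in recovery at any time.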
9.5.3 Scale-In
# 1. Migrate data to other nodes
POST /_cluster/reroute
{
"commands": [
{
"move": {
"index": "my-index",
"shard": 0,
"from_node": "node-to-remove",
"to_node": "node-keep"
}
}
]
}
# 2. Exclude the node from allocation
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.exclude._ip": "192.168.1.100"
}
}
# 3. Wait for shard migration to finish, then stop the node
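Before stopping the node, it helps to confirm that no shards remain on it. The sketch below builds the exclusion body and inspects rows shaped like the output of `GET /_cat/shards?format=json`; the sample rows and helper names are assumptions for illustration.

```python
def exclude_ip_body(ip):
    """Request body for PUT /_cluster/settings that drains a node by IP."""
    return {"persistent": {"cluster.routing.allocation.exclude._ip": ip}}


def remaining_shards(cat_shards_rows, ip):
    """List (index, shard, prirep) for shards still located on the given IP."""
    return [
        (row["index"], row["shard"], row["prirep"])
        for row in cat_shards_rows
        if row.get("ip") == ip
    ]


# Hypothetical rows; the cat APIs return every value as a string.
rows = [
    {"index": "my-index", "shard": "0", "prirep": "p", "state": "RELOCATING",
     "ip": "192.168.1.100", "node": "node-to-remove"},
    {"index": "my-index", "shard": "0", "prirep": "r", "state": "STARTED",
     "ip": "192.168.1.101", "node": "node-keep"},
]

if __name__ == "__main__":
    left = remaining_shards(rows, "192.168.1.100")
    print("safe to stop" if not left else f"still draining: {left}")
```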
9.6 Cluster Settings
9.6.1 Cluster-Level Settings
# Transient setting (lost after a full cluster restart)
PUT /_cluster/settings
{
"transient": {
"indices.recovery.max_bytes_per_sec": "100mb"
}
}
# Persistent setting (survives restarts)
PUT /_cluster/settings
{
"persistent": {
"indices.recovery.max_bytes_per_sec": "100mb"
}
}
# Reset a setting by assigning null
PUT /_cluster/settings
{
"transient": {
"indices.recovery.max_bytes_per_sec": null
}
}
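When the same setting is defined in more than one place, Elasticsearch applies transient values first, then persistent values, then `elasticsearch.yml`, then the built-in default. The following is a small model of that lookup and of how `null` removes an override; the layer dictionaries and the 200mb value are illustrative, not part of any client library.

```python
def apply_update(layer, update):
    """Return a copy of the layer with the update applied; None removes a key."""
    merged = dict(layer)
    for key, value in update.items():
        if value is None:
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


def effective(key, transient, persistent, yml, defaults):
    """Resolve a setting using the precedence transient > persistent > yml > default."""
    for layer in (transient, persistent, yml, defaults):
        if key in layer:
            return layer[key]
    raise KeyError(key)


if __name__ == "__main__":
    key = "indices.recovery.max_bytes_per_sec"
    defaults = {key: "40mb"}
    persistent = apply_update({}, {key: "100mb"})
    transient = apply_update({}, {key: "200mb"})
    print(effective(key, transient, persistent, {}, defaults))
    transient = apply_update(transient, {key: None})  # reset the transient override
    print(effective(key, transient, persistent, {}, defaults))
```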
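9.6.2 Common Cluster Settings
| Setting | Description | Default |
|---|---|---|
| cluster.routing.allocation.enable | Which shard types may be allocated | all |
| cluster.routing.rebalance.enable | Which shard types may be rebalanced | all |
| indices.recovery.max_bytes_per_sec | Maximum recovery rate per node | 40mb |
| cluster.fault_detection.follower_check.timeout | Heartbeat (follower check) timeout | 10s |
| cluster.max_shards_per_node | Maximum shards per data node | 1000 |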
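The per-node shard limit is enforced as a cluster-wide total: the limit multiplied by the number of data nodes. A quick sketch of that arithmetic, using hypothetical helper names and counting both primaries and replicas:

```python
def cluster_shard_limit(data_nodes, max_shards_per_node=1000):
    """Cluster-wide cap on open shards implied by cluster.max_shards_per_node."""
    return data_nodes * max_shards_per_node


def can_create_index(open_shards, primaries, replicas, data_nodes, max_shards_per_node=1000):
    """Check whether a new index would fit under the cluster-wide cap."""
    new_shards = primaries * (1 + replicas)
    return open_shards + new_shards <= cluster_shard_limit(data_nodes, max_shards_per_node)


if __name__ == "__main__":
    print(cluster_shard_limit(3))
    print(can_create_index(open_shards=2995, primaries=1, replicas=1, data_nodes=3))
    print(can_create_index(open_shards=2995, primaries=3, replicas=1, data_nodes=3))
```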
9.7 Cluster Health
9.7.1 Health Checks
# Basic health status
GET /_cluster/health
# Wait for a specific status
GET /_cluster/health?wait_for_status=green&timeout=30s
# Check health at index and shard level
GET /_cluster/health?level=indices
GET /_cluster/health?level=shards
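A `wait_for_status` request returns normally even when the timeout expires, so a script has to inspect the body. The sketch below interprets a health response using its `status` and `timed_out` fields; the function name and sample responses are illustrative.

```python
# Higher rank means healthier.
RANK = {"red": 0, "yellow": 1, "green": 2}


def reached(health, wanted="green"):
    """True if the response reports at least the wanted status without timing out."""
    if health.get("timed_out", False):
        return False
    return RANK[health["status"]] >= RANK[wanted]


if __name__ == "__main__":
    print(reached({"status": "green", "timed_out": False}))
    print(reached({"status": "yellow", "timed_out": True}))
    print(reached({"status": "yellow", "timed_out": False}, wanted="yellow"))
```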
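9.7.2 Handling Health Problems
# 1. Red index: try allocating replicas (red means a primary is unassigned; GET /_cluster/allocation/explain shows why)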
PUT /red-index/_settings
{
"index.number_of_replicas": 1
}
# 2. If replica allocation fails, delete the index and rebuild it
DELETE /corrupted-index
# 3. Restore from a snapshot
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "corrupted-index"
}
9.8 Operations Command Summary
| Operation | Command |
|---|---|
| List nodes | GET /_cat/nodes?v |
| List shards | GET /_cat/shards?v |
| View allocation | GET /_cat/allocation?v |
| View cluster health | GET /_cluster/health?pretty |
| Reroute | POST /_cluster/reroute |
| Cluster settings | PUT /_cluster/settings |
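9.9 Summary
This chapter covered the core of Elasticsearch cluster management: node discovery, shard allocation, failure recovery, and cluster scaling. These skills are essential for maintaining Elasticsearch clusters in production. The next chapter covers performance tuning.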