Chapter 9: Cluster Management

An in-depth look at Elasticsearch cluster management, covering node discovery, shard allocation, failure recovery, and cluster scaling.

Last updated: 2024-01-15

9.1 Cluster Architecture

9.1.1 Node Roles

┌─────────────────────────────────────────────────────────┐
│                     Elasticsearch Cluster                │
├─────────────────────────────────────────────────────────┤
│                                                          │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│   │ Master Node │  │ Master Node │  │ Master Node │    │
│   │  (elected)  │  │  (backup)   │  │  (backup)   │    │
│   └──────┬──────┘  └─────────────┘  └─────────────┘    │
│          │                                               │
│          │ Cluster State                                 │
│          ▼                                               │
│   ┌─────────────────────────────────────────────┐       │
│   │           Coordinating Nodes                 │       │
│   │    (request routing, result aggregation)    │       │
│   └──────────────────┬──────────────────────────┘       │
│                      │                                    │
│   ┌──────────────────┼──────────────────────┐           │
│   │                  │                       │           │
│   ▼                  ▼                       ▼           │
│ ┌─────────┐    ┌─────────┐    ┌─────────┐             │
│ │Data Node│    │Data Node│    │Data Node│             │
│ │ Hot     │    │ Warm    │    │ Cold    │             │
│ └─────────┘    └─────────┘    └─────────┘             │
│                                                          │
│   ┌─────────┐                                           │
│   │Ingest   │                                           │
│   │Node     │                                           │
│   └─────────┘                                           │
└─────────────────────────────────────────────────────────┘

9.1.2 Node Configuration

# Legacy boolean role settings; deprecated since 7.9 in favor of node.roles
# Dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# Data node
node.master: false
node.data: true
node.ingest: false

# Ingest node
node.master: false
node.data: false
node.ingest: true

# Coordinating-only node
node.master: false
node.data: false
node.ingest: false
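
From 7.9 onward the same roles can be written as a single node.roles list. The following Python sketch shows one way to map the boolean flags onto that list. It assumes only the three flags shown above matter; the function name is hypothetical, and real nodes may carry further roles (for example data_hot or ml) that this mapping ignores.

```python
from typing import List


def legacy_flags_to_roles(master: bool, data: bool, ingest: bool) -> List[str]:
    """Translate node.master / node.data / node.ingest flags into a node.roles list."""
    roles = []
    if master:
        roles.append("master")
    if data:
        roles.append("data")
    if ingest:
        roles.append("ingest")
    # An empty list corresponds to a coordinating-only node.
    return roles


if __name__ == "__main__":
    print(legacy_flags_to_roles(True, False, False))   # dedicated master-eligible node
    print(legacy_flags_to_roles(False, False, False))  # coordinating-only node
```

A coordinating-only node therefore ends up with an empty role list, which matches the last configuration block above where all three flags are false.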

9.2 Node Discovery

9.2.1 Zen Discovery

# Discovery configuration (elasticsearch.yml)
discovery.seed_hosts:
  - 192.168.1.101:9300
  - 192.168.1.102:9300
  - 192.168.1.103:9300

cluster.initial_master_nodes:
  - node-1
  - node-2
  - node-3

discovery.seed_providers: file
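
Each seed_hosts entry is a host with an optional transport port. As a rough illustration, the Python sketch below normalises such entries into (host, port) pairs. It assumes port 9300 applies when an entry names no port and it does not handle bracketed IPv6 addresses; the helper name is hypothetical.

```python
from typing import List, Tuple

DEFAULT_TRANSPORT_PORT = 9300  # assumption: used when an entry names no port


def parse_seed_hosts(entries: List[str]) -> List[Tuple[str, int]]:
    """Split 'host:port' entries; fall back to the assumed default port."""
    parsed = []
    for entry in entries:
        entry = entry.strip()
        host, sep, port = entry.rpartition(":")
        if sep and port.isdigit():
            parsed.append((host, int(port)))
        else:
            # No usable port suffix, so keep the whole entry as the host name.
            parsed.append((entry, DEFAULT_TRANSPORT_PORT))
    return parsed


if __name__ == "__main__":
    for host, port in parse_seed_hosts(["192.168.1.101:9300", "192.168.1.102"]):
        print(host, port)
```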

9.2.2 EC2 Discovery Plugin

# Install the plugin
./bin/elasticsearch-plugin install discovery-ec2

# Configuration (elasticsearch.yml)
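# Store AWS credentials in the Elasticsearch keystore rather than in elasticsearch.yml:
#   ./bin/elasticsearch-keystore add discovery.ec2.access_key
#   ./bin/elasticsearch-keystore add discovery.ec2.secret_key
discovery.seed_providers: ec2
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com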

discovery:
  ec2:
    groups: sg-12345678
    availability_zones: us-east-1a,us-east-1b

9.2.3 Kubernetes Discovery

# Use the Kubernetes service DNS name
discovery.seed_hosts:
  - es-svc.namespace.svc.cluster.local

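# Install a Kubernetes plugin. Note: repository-kubernetes is not one of the
# official Elasticsearch plugins, so verify the name first; DNS-based
# discovery through seed_hosts needs no plugin.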
./bin/elasticsearch-plugin install repository-kubernetes

9.3 Shard Allocation

9.3.1 Shard Allocation Strategy

# View shard allocation per node
GET /_cat/allocation?v

# List shards with their state and, if unassigned, the reason
GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason&v
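
The shard listing above returns every shard, assigned or not, so the unassigned ones still have to be filtered out. The Python sketch below does that for the plain-text output. It assumes whitespace-separated columns with a header row named after the requested columns; the sample text is made up for illustration and the function name is hypothetical.

```python
from typing import Dict, List


def unassigned_shards(cat_output: str) -> List[Dict[str, str]]:
    """Return the rows of a _cat/shards text table whose state is UNASSIGNED."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        # Assigned shards leave unassigned.reason blank, so pad short rows.
        cells += [""] * (len(header) - len(cells))
        row = dict(zip(header, cells))
        if row.get("state") == "UNASSIGNED":
            rows.append(row)
    return rows


# Made-up sample output, not taken from a real cluster.
SAMPLE = """index    shard prirep state      unassigned.reason
my-index 0     p      STARTED
my-index 0     r      UNASSIGNED NODE_LEFT
"""

if __name__ == "__main__":
    for row in unassigned_shards(SAMPLE):
        print(row["index"], row["shard"], row["prirep"], row["unassigned.reason"])
```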

9.3.2 Manual Shard Allocation

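# Allocate an empty primary shard to a specific node
# (accept_data_loss: any existing data for this shard is discarded)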
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "node-2",
        "accept_data_loss": true
      }
    }
  ]
}

# Move a shard
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}
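
When several shards need to move at once, the request body can be generated rather than written by hand. A minimal Python sketch, assuming each move is given as an (index, shard, from_node, to_node) tuple; the function name is hypothetical and the sketch only builds the body, it does not send it.

```python
import json
from typing import Any, Dict, List, Tuple

Move = Tuple[str, int, str, str]  # (index, shard, from_node, to_node)


def build_move_commands(moves: List[Move]) -> Dict[str, Any]:
    """Build a _cluster/reroute body containing one move command per tuple."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
            for index, shard, from_node, to_node in moves
        ]
    }


if __name__ == "__main__":
    body = build_move_commands([("my-index", 0, "node-1", "node-2")])
    print(json.dumps(body, indent=2))
```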

9.3.3 Shard Allocation Filtering

# Filter by data tier
PUT /my-index/_settings
{
  "index.routing.allocation.include._tier_preference": "data_hot"
}

# Filter by a custom node attribute
PUT /my-index/_settings
{
  "index.routing.allocation.include.tag": "node-1,node-2",
  "index.routing.allocation.exclude.tag": "slow"
}
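
The include and exclude rules combine as follows: a node is eligible if its attribute matches at least one include value and none of the exclude values. The Python sketch below models that for a single attribute. It is a simplification that assumes exact string matches only (no wildcards) and ignores the require rule; the function name is hypothetical.

```python
def allowed_by_tag(node_tag: str, include: str = "", exclude: str = "") -> bool:
    """Decide whether a node with the given tag passes include/exclude filters."""
    include_values = [value for value in include.split(",") if value]
    exclude_values = [value for value in exclude.split(",") if value]
    # include: when set, the tag must equal at least one listed value.
    if include_values and node_tag not in include_values:
        return False
    # exclude: the tag must equal none of the listed values.
    return node_tag not in exclude_values


if __name__ == "__main__":
    for tag in ("node-1", "node-3", "slow"):
        print(tag, allowed_by_tag(tag, include="node-1,node-2", exclude="slow"))
```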

9.3.4 Allocation Awareness

# Set a rack attribute on each node (elasticsearch.yml)
node.attr.rack: rack1

# Cluster setting (awareness is configured per cluster, not per index)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack"
  }
}
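
Allocation awareness aims to keep copies of the same shard on nodes with different attribute values. One way to check the result is to look for shards whose copies all sit in a single rack. The Python sketch below does so, assuming the shard-to-node and node-to-rack mappings have already been collected; both input shapes and the function name are hypothetical.

```python
from typing import Dict, List


def shards_confined_to_one_rack(
    copies: Dict[str, List[str]], node_rack: Dict[str, str]
) -> List[str]:
    """Return shard ids that have several copies, all located in the same rack."""
    flagged = []
    for shard_id, nodes in copies.items():
        racks = {node_rack[node] for node in nodes}
        # A single copy cannot be spread, so only multi-copy shards are flagged.
        if len(nodes) > 1 and len(racks) == 1:
            flagged.append(shard_id)
    return sorted(flagged)


if __name__ == "__main__":
    racks = {"node-1": "rack1", "node-2": "rack1", "node-3": "rack2"}
    placement = {
        "my-index[0]": ["node-1", "node-2"],
        "my-index[1]": ["node-1", "node-3"],
    }
    print(shards_confined_to_one_rack(placement, racks))
```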

9.4 Failure Recovery

9.4.1 Automatic Recovery

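# Gateway recovery settings (applied after a full cluster restart)
# From 7.7 on, recover_after_data_nodes and expected_data_nodes replace
# recover_after_nodes and expected_nodes
gateway:
  recover_after_nodes: 2
  recover_after_time: 5m
  expected_nodes: 3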

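# Shard recovery settings
indices.recovery:
  max_bytes_per_sec: 100mb
  # concurrent_streams and concurrent_small_file_streams were removed in 5.0;
  # max_concurrent_file_chunks is the closest current setting
  max_concurrent_file_chunks: 3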
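
The throughput limit gives a quick lower bound on how long a recovery takes. A back-of-the-envelope Python sketch, assuming the limit is the only bottleneck (no network or disk contention) and using binary units; the function name is hypothetical.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def estimated_recovery_seconds(shard_bytes: int, max_bytes_per_sec: int) -> float:
    """Lower-bound transfer time for one shard at the configured throttle."""
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    return shard_bytes / max_bytes_per_sec


if __name__ == "__main__":
    seconds = estimated_recovery_seconds(10 * GB, 100 * MB)
    print(f"10 GB at 100 MB/s takes at least {seconds:.1f} s")
```

Because the limit is shared by all recoveries on a node, concurrent recoveries each get a fraction of it and take correspondingly longer.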

9.4.2 Monitoring Recovery

# View recovery status
GET /_cat/recovery?v

# View detailed recovery information
GET /my-index/_recovery?detailed=true

# Example response
{
  "my-index": {
    "shards": [
      {
        "id": 0,
        "type": "SNAPSHOT",
        "stage": "DONE",
        "primary": true,
        "start_time": "2024-01-15T10:30:00Z",
        "total_time": "1m",
        "source": {
          "repository": "my-backup",
          "snapshot": "snapshot_1"
        },
        "target": {
          "id": "node-1",
          "host": "192.168.1.101",
          "name": "node-1"
        },
        "index": {
          "size": { "total": "10GB", "recovered": "10GB" },
          "files": { "total": 1000, "recovered": 1000 }
        }
      }
    ]
  }
}
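
A response of this shape can be condensed into a per-shard progress figure. The Python sketch below derives the percentage from the file counts, since the size fields in the example are human-readable strings. It assumes the response has already been parsed into a dict with the structure shown above; the function name is hypothetical.

```python
from typing import Any, Dict, Tuple


def recovery_progress(response: Dict[str, Any]) -> Dict[Tuple[str, int], Tuple[str, float]]:
    """Map (index, shard id) to (stage, percent of files recovered)."""
    progress = {}
    for index_name, index_info in response.items():
        for shard in index_info.get("shards", []):
            files = shard["index"]["files"]
            total = files["total"]
            # An empty shard has nothing to copy, so treat it as complete.
            percent = 100.0 if total == 0 else 100.0 * files["recovered"] / total
            progress[(index_name, shard["id"])] = (shard["stage"], percent)
    return progress


if __name__ == "__main__":
    sample = {
        "my-index": {
            "shards": [
                {"id": 0, "stage": "INDEX", "index": {"files": {"total": 1000, "recovered": 250}}}
            ]
        }
    }
    for key, (stage, percent) in recovery_progress(sample).items():
        print(key, stage, f"{percent:.0f}%")
```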

9.4.3 Manual Recovery

# Cancel a shard recovery
POST /_cluster/reroute?master_timeout=30s
{
  "commands": [
    {
      "cancel": {
        "index": "my-index",
        "shard": 0,
        "node": "node-2",
        "allow_primary": true
      }
    }
  ]
}

9.5 Scaling the Cluster

9.5.1 Online Scale-Out

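# Shards rebalance automatically once a new node has joined
# 1. Install Elasticsearch on the new node
# 2. Configure the same cluster name
# 3. Configure discovery.seed_hosts
# 4. Start the node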

# Verify that the node has joined
GET /_cat/nodes?v

9.5.2 Rolling Restart

# 1. Restrict shard allocation to primaries (replica allocation is disabled)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# 2. Perform the node maintenance

# 3. Re-enable shard allocation
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

# 4. Wait for recovery to complete
GET /_cluster/health?wait_for_status=green&timeout=30m
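
The health call above blocks until the status is reached or the timeout expires. Where a long-held request is impractical, the same wait can be done by polling. A minimal Python sketch, assuming the caller supplies a get_health function that returns the parsed health response; that callable, the function name, and the retry defaults are all hypothetical.

```python
import time
from typing import Any, Callable, Dict


def wait_for_green(
    get_health: Callable[[], Dict[str, Any]],
    attempts: int = 30,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll cluster health until it reports green or the attempts run out."""
    for attempt in range(attempts):
        if get_health().get("status") == "green":
            return True
        # Sleep between polls, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Stand-in for a real health call: yellow twice, then green.
    responses = iter([{"status": "yellow"}, {"status": "yellow"}, {"status": "green"}])
    print(wait_for_green(lambda: next(responses), attempts=5, delay_seconds=0))
```

Injecting the health call keeps the loop independent of any particular HTTP client and lets it be exercised without a running cluster.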

9.5.3 Scale-In

# 1. Move data to other nodes
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index",
        "shard": 0,
        "from_node": "node-to-remove",
        "to_node": "node-keep"
      }
    }
  ]
}

# 2. Exclude the node from allocation
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._ip": "192.168.1.100"
  }
}

# 3. Stop the node once shard migration has finished
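
Before stopping the node it is worth confirming that no shard still lives on it. The Python sketch below checks rows such as those returned by GET /_cat/shards?format=json. It assumes each row is a dict with index, shard, and node keys, and that a relocating shard reports its source node as the first token of the node field; both the row shape and the function name are assumptions.

```python
from typing import Any, Dict, List, Tuple


def shards_remaining_on(rows: List[Dict[str, Any]], node_name: str) -> List[Tuple[str, str]]:
    """List (index, shard) pairs that are still located on the given node."""
    remaining = []
    for row in rows:
        node_field = row.get("node") or ""  # unassigned shards have no node
        tokens = node_field.split()
        # Assumption: while relocating, the first token names the source node.
        if tokens and tokens[0] == node_name:
            remaining.append((row["index"], row["shard"]))
    return remaining


if __name__ == "__main__":
    rows = [
        {"index": "my-index", "shard": "0", "node": "node-keep"},
        {"index": "my-index", "shard": "1", "node": "node-to-remove"},
    ]
    print(shards_remaining_on(rows, "node-to-remove"))
```

An empty result suggests the node has been drained and can be stopped.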

9.6 Cluster Settings

9.6.1 Cluster-Level Settings

# Transient setting (does not survive a full cluster restart)
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

# Persistent setting (survives a full cluster restart)
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

# Reset a setting by assigning null
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": null
  }
}
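
When the same setting is defined in several places, transient values take precedence over persistent ones, which in turn override elasticsearch.yml and the built-in default. The Python sketch below models that lookup order with plain dicts; the function name and the sample values are hypothetical, and a value of None stands for a setting that has been reset with null.

```python
from typing import Any, Dict, Optional


def effective_setting(
    name: str,
    transient: Dict[str, Any],
    persistent: Dict[str, Any],
    config_file: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Optional[Any]:
    """Return the first defined value in precedence order."""
    for layer in (transient, persistent, config_file, defaults):
        if layer.get(name) is not None:
            return layer[name]
    return None


if __name__ == "__main__":
    key = "indices.recovery.max_bytes_per_sec"
    defaults = {key: "40mb"}
    print(effective_setting(key, {key: "100mb"}, {key: "50mb"}, {}, defaults))
    # After the transient value is reset to null, the persistent value applies.
    print(effective_setting(key, {key: None}, {key: "50mb"}, {}, defaults))
```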

9.6.2 Common Cluster Settings

Setting                               Description                 Default
cluster.routing.allocation.enable     Enables shard allocation    all
cluster.routing.rebalance.enable      Enables shard rebalancing   all
indices.recovery.max_bytes_per_sec    Maximum recovery rate       40mb
cluster.conential_timeouts            Heartbeat timeout           30s
cluster.max_shards_per_node           Maximum shards per node     1000
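
The cluster.max_shards_per_node value is applied cluster-wide: the total number of open shards, primaries and replicas together, may not exceed the per-node value multiplied by the number of data nodes. A small Python sketch of that arithmetic, with hypothetical function names and made-up figures:

```python
def cluster_shard_limit(max_shards_per_node: int, data_nodes: int) -> int:
    """Total open shards the cluster accepts before rejecting new ones."""
    return max_shards_per_node * data_nodes


def can_create_index(
    open_shards: int,
    primaries: int,
    replicas: int,
    max_shards_per_node: int = 1000,
    data_nodes: int = 1,
) -> bool:
    """Check whether an index with the given layout would fit under the limit."""
    # Every primary brings `replicas` additional copies with it.
    new_shards = primaries * (1 + replicas)
    return open_shards + new_shards <= cluster_shard_limit(max_shards_per_node, data_nodes)


if __name__ == "__main__":
    print(cluster_shard_limit(1000, 3))
    print(can_create_index(open_shards=2995, primaries=5, replicas=1, data_nodes=3))
```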

9.7 Cluster Health

9.7.1 Health Checks

# Basic health status
GET /_cluster/health

# Wait for a specific status
GET /_cluster/health?wait_for_status=green&timeout=30s

# Check health per index and per shard
GET /_cluster/health?level=indices
GET /_cluster/health?level=shards
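
With level=indices the response carries a status for each index, which makes it easy to pick out the ones that need attention. The Python sketch below lists indices whose status falls below a chosen minimum. It assumes the response has been parsed into a dict with an indices object keyed by index name; the function name and sample data are hypothetical.

```python
from typing import Any, Dict, List

STATUS_RANK = {"red": 0, "yellow": 1, "green": 2}


def indices_below(health: Dict[str, Any], minimum: str = "green") -> List[str]:
    """Return the names of indices whose status is worse than `minimum`."""
    threshold = STATUS_RANK[minimum]
    return sorted(
        name
        for name, info in health.get("indices", {}).items()
        if STATUS_RANK[info["status"]] < threshold
    )


if __name__ == "__main__":
    sample = {
        "status": "red",
        "indices": {
            "my-index": {"status": "green"},
            "logs": {"status": "yellow"},
            "red-index": {"status": "red"},
        },
    }
    print(indices_below(sample))
    print(indices_below(sample, minimum="yellow"))
```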

9.7.2 Handling Health Problems

# 1. Red index (at least one primary shard is unassigned): try allocating replicas
PUT /red-index/_settings
{
  "index.number_of_replicas": 1
}

# 2. If replica allocation fails, delete the index and recreate it
DELETE /corrupted-index

# 3. Restore from a snapshot
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "corrupted-index"
}

9.8 Operations Command Summary

Operation              Command
List nodes             GET /_cat/nodes?v
List shards            GET /_cat/shards?v
View allocation        GET /_cat/allocation?v
View cluster health    GET /_cluster/health?pretty
Reroute shards         POST /_cluster/reroute
Cluster settings       PUT /_cluster/settings

9.9 Summary

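This chapter covered the core topics of Elasticsearch cluster management: node discovery, shard allocation, failure recovery, and cluster scaling. These skills are essential for operating Elasticsearch clusters in production. The next chapter covers performance tuning.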