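Chapter 5: Document Operations

Learn Elasticsearch document CRUD operations, bulk operations, version control, and management of expired documents.

Last updated: 2024-01-15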

5.1 Basic Document Operations

5.1.1 Creating a Document

# Auto-generated ID
POST /products/_doc
{
  "name": "iPhone 15",
  "price": 7999.00,
  "category": "electronics",
  "tags": ["手机", "苹果", "5G"]
}

# Response
{
  "_index": "products",
  "_id": "abc123",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 }
}

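5.1.2 Creating a Document with a Specified ID

# PUT _doc/{id} creates the document, or replaces it if the ID already exists.
# Use PUT /products/_create/1 instead to fail with a 409 when the ID is already taken.
PUT /products/_doc/1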
{
  "name": "iPhone 15 Pro",
  "price": 9999.00,
  "category": "electronics",
  "stock": 100,
  "created_at": "2024-01-15T10:30:00Z"
}

5.1.3 Reading a Document

# Basic read
GET /products/_doc/1

# Response
{
  "_index": "products",
  "_id": "1",
  "_version": 3,
  "_seq_no": 15,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "name": "iPhone 15 Pro",
    "price": 9999.00,
    ...
  }
}

# Fetch only the _source
GET /products/_source/1

# Check whether the document exists (200 if found, 404 if not)
HEAD /products/_doc/1

5.1.4 Updating a Document

# Full replacement: fields omitted from the body (here stock and created_at) are dropped
PUT /products/_doc/1
{
  "name": "iPhone 15 Pro Max",
  "price": 10999.00,
  "category": "electronics"
}

# Partial update: only the listed fields change
POST /products/_update/1
{
  "doc": {
    "price": 9999.00,
    "stock": 50
  }
}

# Update with a script
POST /products/_update/1
{
  "script": {
    "source": "ctx._source.stock -= params.count",
    "params": {
      "count": 5
    }
  }
}

# Append an element to the tags array (the script fails if the document has no tags field)
POST /products/_update/1
{
  "script": {
    "source": "ctx._source.tags.add(params.tag)",
    "params": {
      "tag": "热卖"
    }
  }
}

# Remove a field
POST /products/_update/1
{
  "script": {
    "source": "ctx._source.remove('deleted')"
  }
}
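
The difference between a full replacement and a partial update is easy to miss, so here is a minimal in-memory sketch of the two behaviours. It does not talk to Elasticsearch; the function names `full_replace` and `partial_update` are illustrative, and the merge rule (objects merged recursively, scalars and arrays overwritten) is a simplified model of how a "doc" update is applied.

```python
import copy


def full_replace(stored, new_doc):
    """Model of PUT /index/_doc/{id}: the stored source is discarded entirely."""
    return copy.deepcopy(new_doc)


def partial_update(stored, partial):
    """Model of POST /index/_update/{id} with a "doc" body.

    Nested objects are merged recursively; scalars and arrays are overwritten.
    """
    merged = copy.deepcopy(stored)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


stored = {"name": "iPhone 15 Pro", "price": 9999.00,
          "category": "electronics", "stock": 100}

# Full replacement drops "stock" because the new body does not contain it.
replaced = full_replace(stored, {"name": "iPhone 15 Pro Max",
                                 "price": 10999.00,
                                 "category": "electronics"})

# Partial update keeps "name" and "category" and changes only the listed fields.
updated = partial_update(stored, {"price": 9999.00, "stock": 50})

print("stock" in replaced)
print(updated)
```

In this model the replaced document loses its stock field, while the partially updated one keeps every field that the request did not mention.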

5.1.5 Deleting a Document

# Delete a single document
DELETE /products/_doc/1

# Delete by query
POST /products/_delete_by_query
{
  "query": {
    "term": {
      "category": "discontinued"
    }
  }
}

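5.2 Bulk Operations

5.2.1 Bulk API Format

# Newline-delimited JSON: each action line is followed by its source line,
# except delete, which has none. The body must end with a newline.
POST /_bulk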
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Product 1", "price": 100 }
{ "create": { "_index": "products", "_id": "2" } }
{ "name": "Product 2", "price": 200 }
{ "update": { "_index": "products", "_id": "1" } }
{ "doc": { "price": 150 } }
{ "delete": { "_index": "products", "_id": "3" } }
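
When the request body is built in application code rather than typed into a console, the newline-delimited layout has to be produced by hand. The following is a minimal sketch using only the Python standard library; `build_bulk_body` and its tuple-based input are assumptions made for illustration, not part of any client library.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) tuples into a Bulk API body.

    Every action contributes one metadata line; all actions except
    "delete" contribute a second line holding the document or update body.
    """
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}, ensure_ascii=False))
        if action != "delete":
            if source is None:
                raise ValueError(f"{action} requires a source line")
            lines.append(json.dumps(source, ensure_ascii=False))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


actions = [
    ("index", {"_index": "products", "_id": "1"}, {"name": "Product 1", "price": 100}),
    ("create", {"_index": "products", "_id": "2"}, {"name": "Product 2", "price": 200}),
    ("update", {"_index": "products", "_id": "1"}, {"doc": {"price": 150}}),
    ("delete", {"_index": "products", "_id": "3"}, None),
]
body = build_bulk_body(actions)
print(body, end="")
```

The resulting string would be sent with the `Content-Type: application/x-ndjson` header.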

5.2.2 Bulk Import Example

POST /products/_bulk
{ "index": {} }
{ "name": "Laptop", "price": 5999, "category": "electronics" }
{ "index": {} }
{ "name": "Headphones", "price": 299, "category": "electronics" }
{ "index": {} }
{ "name": "Keyboard", "price": 199, "category": "electronics" }
{ "index": {} }
{ "name": "Mouse", "price": 99, "category": "electronics" }

5.2.3 Bulk Operations Across Indices

POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Product A" }
{ "index": { "_index": "orders", "_id": "1" } }
{ "product_id": "1", "quantity": 2 }

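5.2.4 Error Handling

# No extra parameter is needed: bulk items are processed independently,
# so a failed item does not stop the remaining ones.
# The second document below is rejected only if the mapping disallows
# unknown fields (for example "dynamic": "strict").
POST /products/_bulk
{ "index": {} }
{ "name": "P1" }
{ "index": {} }
{ "invalid_field": "will cause error" }
{ "index": {} }
{ "name": "P2" }

# Return only the top-level errors flag and the per-item error details
POST /products/_bulk?filter_path=errors,items.*.error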
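
Because a bulk request can partly succeed, the response has to be inspected item by item. The sketch below shows one way to collect the failed items from an already-parsed response. The helper name `failed_items` is illustrative, and the sample response is made-up data shaped like a Bulk API response, not real output.

```python
def failed_items(bulk_response):
    """Return one summary dict per failed item in a parsed Bulk API response."""
    failures = []
    if not bulk_response.get("errors"):
        return failures  # the top-level flag says every item succeeded
    for position, item in enumerate(bulk_response["items"]):
        # Each item has exactly one key: index, create, update, or delete.
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append({
                "position": position,
                "action": action,
                "status": result["status"],
                "type": result["error"]["type"],
            })
    return failures


# Hypothetical response: the second of three index actions was rejected.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_index": "products", "_id": "a1", "status": 201, "result": "created"}},
        {"index": {"_index": "products", "_id": "a2", "status": 400,
                   "error": {"type": "strict_dynamic_mapping_exception",
                             "reason": "example reason text"}}},
        {"index": {"_index": "products", "_id": "a3", "status": 201, "result": "created"}},
    ],
}
print(failed_items(sample_response))
```

The position in the items array matches the position of the action in the request, which is how a failed item can be mapped back to the document that caused it.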

5.3 Multi-Document Operations

5.3.1 MGet - Batch Reads

GET /_mget
{
  "docs": [
    { "_index": "products", "_id": "1" },
    { "_index": "products", "_id": "2" },
    { "_index": "products", "_id": "3", "_source": ["name", "price"] }
  ]
}

# Shorthand when all IDs are in the same index
GET /products/_mget
{
  "ids": ["1", "2", "3"]
}
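
An mget response lists one entry per requested document, including the ones that were not found, so callers usually need to separate hits from misses. A minimal sketch, assuming the response has already been parsed into a dict; `split_mget` is an illustrative name and the sample response is made-up data.

```python
def split_mget(response):
    """Split a parsed _mget response into found sources and missing IDs."""
    found, missing = {}, []
    for doc in response["docs"]:
        if doc.get("found"):
            found[doc["_id"]] = doc["_source"]
        else:
            # Covers both "found": false and entries that carry an error.
            missing.append(doc["_id"])
    return found, missing


# Hypothetical response for IDs 1, 2 and 3 where document 3 does not exist.
sample_mget = {
    "docs": [
        {"_index": "products", "_id": "1", "found": True,
         "_source": {"name": "Product 1", "price": 150}},
        {"_index": "products", "_id": "2", "found": True,
         "_source": {"name": "Product 2", "price": 200}},
        {"_index": "products", "_id": "3", "found": False},
    ]
}
found, missing = split_mget(sample_mget)
print(sorted(found), missing)
```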

5.3.2 MSearch - Batch Searches

POST /_msearch
{"index": "products"}
{"query": {"match": {"name": "iPhone"}}, "from": 0, "size": 10}
{"index": "products"}
{"query": {"term": {"category": "electronics"}}, "from": 0, "size": 5}

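5.3.3 Bulk Tuning Tips

# Control the batch size on the client: the size of a batch is simply
# the number of actions sent in one request
POST /products/_bulk

# Set the refresh policy
# false (default): do not force a refresh
POST /products/_bulk?refresh=false
# true: refresh the affected shards immediately (expensive)
POST /products/_bulk?refresh=true
# wait_for: respond only after a refresh has made the changes visible
POST /products/_bulk?refresh=wait_for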
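
Batch size is usually controlled on the client by deciding how many documents go into each request. The sketch below splits a stream of documents into batches bounded by both count and approximate payload size. The name `chunk_docs` and the default limits are assumptions chosen for illustration; suitable values depend on document size and cluster capacity.

```python
import json


def chunk_docs(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents, each list small enough for one bulk request."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # Approximate the size of the source line, plus its trailing newline.
        size = len(json.dumps(doc, ensure_ascii=False).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


docs = ({"name": f"Product {i}", "price": i} for i in range(2500))
sizes = [len(batch) for batch in chunk_docs(docs, max_docs=1000)]
print(sizes)
```

Each yielded batch would then be serialized into one bulk request body.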

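5.4 Optimistic Concurrency Control

5.4.1 Using Version Numbers (_seq_no and _primary_term)

# Conditional update: applied only if the document's current _seq_no and _primary_term match the given values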
PUT /products/_doc/1?if_seq_no=10&if_primary_term=1
{
  "name": "Updated Name",
  "price": 7999
}

# Response: version conflict (HTTP 409)
{
  "error": {
    "type": "version_conflict_engine_exception",
    ...
  }
}
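
The values for if_seq_no and if_primary_term normally come from a previous read of the same document. A minimal sketch of that step, assuming the GET response has already been parsed into a dict; `conditional_put_path` is an illustrative helper, not a client-library function.

```python
from urllib.parse import quote, urlencode


def conditional_put_path(index, get_response):
    """Build the request path for a write that succeeds only if the
    document is unchanged since get_response was read."""
    params = urlencode({
        "if_seq_no": get_response["_seq_no"],
        "if_primary_term": get_response["_primary_term"],
    })
    doc_id = quote(str(get_response["_id"]), safe="")
    return f"/{index}/_doc/{doc_id}?{params}"


# Fields taken from a GET /products/_doc/1 response.
read = {"_index": "products", "_id": "1", "_seq_no": 15, "_primary_term": 1}
print(conditional_put_path("products", read))
```

If another writer changes the document between the read and the write, its sequence number advances and the conditional write is rejected with a version conflict.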

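5.4.2 Using External Versions

# Use a version number from an external system; the write succeeds only if it is greater than the stored version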
PUT /products/_doc/1?version=12345&version_type=external
{
  "name": "Updated Name"
}
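
The acceptance rule for external versions can be stated in a few lines. The following in-memory sketch models it under the assumption of `version_type=external`, where the incoming version must be strictly greater than the stored one; the names `VersionConflict` and `apply_external_write` are illustrative.

```python
class VersionConflict(Exception):
    """Raised when an incoming external version is not newer than the stored one."""


def apply_external_write(stored_version, incoming_version):
    """Return the new stored version, or raise if the write must be rejected.

    stored_version is None when the document does not exist yet.
    """
    if stored_version is not None and incoming_version <= stored_version:
        raise VersionConflict(
            f"incoming version {incoming_version} is not greater than {stored_version}"
        )
    return incoming_version


version = apply_external_write(None, 12345)      # first write is accepted
version = apply_external_write(version, 12346)   # newer version is accepted
try:
    apply_external_write(version, 12346)         # same version is rejected
except VersionConflict as exc:
    print("rejected:", exc)
```

This makes replays from the external system safe: a stale or repeated write cannot overwrite newer data.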

5.4.3 retry_on_conflict

# Retry automatically on a version conflict (up to 3 retries)
POST /products/_update/1?retry_on_conflict=3
{
  "doc": {
    "price": 7999
  }
}
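
What retry_on_conflict does can be pictured as a read-modify-write loop that starts over whenever the document changed in between. The sketch below simulates this against a tiny in-memory stand-in for a single document; `Store`, `update_with_retry`, and the `before_write` hook are all hypothetical names used only to make the retry visible, and the real retry happens inside Elasticsearch.

```python
class Conflict(Exception):
    pass


class Store:
    """In-memory stand-in for one document guarded by a sequence number."""

    def __init__(self, source):
        self.source, self.seq_no = dict(source), 0

    def get(self):
        return dict(self.source), self.seq_no

    def put_if(self, source, expected_seq_no):
        if expected_seq_no != self.seq_no:
            raise Conflict(f"expected seq_no {expected_seq_no}, found {self.seq_no}")
        self.source, self.seq_no = dict(source), self.seq_no + 1


def update_with_retry(store, changes, retry_on_conflict=3, before_write=None):
    """Apply changes, re-reading and retrying after a conflict.

    Returns the number of retries that were needed.
    """
    for attempt in range(retry_on_conflict + 1):
        source, seq_no = store.get()
        source.update(changes)
        if before_write:
            before_write(attempt)  # hook used below to simulate another writer
        try:
            store.put_if(source, seq_no)
            return attempt
        except Conflict:
            if attempt == retry_on_conflict:
                raise


store = Store({"price": 9999, "stock": 50})


def concurrent_writer(attempt):
    # On the first attempt only, another client changes the stock.
    if attempt == 0:
        source, seq_no = store.get()
        source["stock"] = 45
        store.put_if(source, seq_no)


retries = update_with_retry(store, {"price": 7999}, before_write=concurrent_writer)
print(retries, store.source)
```

The retried attempt re-reads the document, so the concurrent stock change is preserved alongside the new price.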

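5.5 Document Routing

5.5.1 How Routing Works

_routing (defaults to the document ID) ──Hash──► hash % num_primary_shards ──► shard number in [0, num_primary_shards-1] (simplified)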
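
The routing rule can be sketched in a few lines. This is a simplified model: `zlib.crc32` is only a stand-in, since Elasticsearch uses its own Murmur3-based hash, so the shard numbers printed here will not match a real cluster. The function name `shard_for` is illustrative.

```python
import zlib


def shard_for(routing, num_primary_shards):
    """Map a routing value to a shard number in [0, num_primary_shards - 1].

    crc32 is a stand-in hash used only for illustration.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# The same routing value always lands on the same shard.
print(shard_for("user-123", 5) == shard_for("user-123", 5))

# Changing the shard count can move a document, which is why the number
# of primary shards cannot simply be changed on an existing index.
print(shard_for("user-123", 5), shard_for("user-123", 6))
```

Because the shard is derived from the routing value, a read or search that supplies the same routing value only needs to visit that one shard.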

5.5.2 Custom Routing

# Specify routing when creating the document
PUT /products/_doc/1?routing=user-123
{
  "name": "Product",
  "user_id": "user-123"
}

# Specify the same routing when reading the document
GET /products/_doc/1?routing=user-123

# Specify routing when searching
GET /products/_search?routing=user-123
{
  "query": {
    "term": { "user_id": "user-123" }
  }
}

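5.5.3 Routing Strategy

# Index settings fragment (set at index creation). With a partition size of 2,
# documents that share a routing value are spread over 2 shards instead of 1.
# The value must be greater than 1 and less than number_of_shards.
{
  "settings": {
    "index.routing_partition_size": 2
  }
}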

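5.6 Expired Document Management

5.6.1 The TTL Field (Deprecated)

# The _ttl field is no longer supported (removed in Elasticsearch 5.0); attach an ILM policy to the index instead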
PUT /logs
{
  "mappings": {
    "properties": {
      "message": { "type": "text" },
      "@timestamp": { "type": "date" }
    }
  },
  "settings": {
    "index.lifecycle.name": "logs-policy"
  }
}

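5.6.2 Index Lifecycle Management (ILM)

# Create a lifecycle policy
# Note: the freeze action is deprecated since 7.14 and does nothing in 8.x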
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "60d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
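
To see how the min_age thresholds in this policy line up, the sketch below maps an index age to the phase it would be eligible for. It is a simplified model: ILM evaluates policies periodically rather than instantly, and when rollover is used the age is measured from the rollover time. The names `PHASE_MIN_AGE` and `phase_for` are illustrative.

```python
from datetime import timedelta

# min_age thresholds copied from the logs-policy example.
PHASE_MIN_AGE = [
    ("hot", timedelta(0)),
    ("warm", timedelta(days=30)),
    ("cold", timedelta(days=60)),
    ("delete", timedelta(days=90)),
]


def phase_for(age):
    """Return the latest phase whose min_age has been reached."""
    current = PHASE_MIN_AGE[0][0]
    for name, min_age in PHASE_MIN_AGE:
        if age >= min_age:
            current = name
    return current


for days in (10, 30, 75, 90):
    print(days, phase_for(timedelta(days=days)))
```

With this policy an index spends roughly 30 days in each of the hot, warm, and cold phases before it becomes eligible for deletion.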

5.7 Command Reference

Operation          Command
Create document    POST /index/_doc
                   PUT /index/_doc/{id}
Read document      GET /index/_doc/{id}
Update document    POST /index/_update/{id}
Delete document    DELETE /index/_doc/{id}
Bulk write         POST /_bulk
Batch read         GET /_mget
Delete by query    POST /index/_delete_by_query

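5.8 Performance Tips

  1. Batch writes: use the Bulk API to reduce network overhead
  2. Disable refresh: set refresh_interval to -1 during a bulk import and restore it afterwards
  3. Size shards sensibly: avoid many small shards
  4. Use routing: for queries that have a clear routing key
  5. Control replicas: set number_of_replicas to 0 during the import and re-enable replicas once it finishes

# Import tuning example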
PUT /products/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}
# ... run the bulk import ...
PUT /products/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}

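5.9 Summary

This chapter covered CRUD operations on Elasticsearch documents, bulk operations, version control, and expired-document management. Fluency with these operations is the foundation for working with data. The next chapter covers the Query DSL.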