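Chapter 8: Chinese Word Segmentation
Learn how to configure Chinese analyzers in Elasticsearch, including common options such as the IK analyzer and the Jieba analyzer.
Last updated: 2024-01-15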
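8.1 Analyzer Basics
8.1.1 The Analysis Pipeline
An analyzer runs text through zero or more character filters, exactly one tokenizer, and zero or more token filters:
Input text ──► Char Filter ──► Tokenizer ──► Token Filter ──► Output tokens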
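The three stages can be imitated in a few lines of Python. This is only a toy sketch of the pipeline idea, not how Elasticsearch implements it; the function names `strip_html`, `whitespace_tokenizer`, `lowercase`, and `analyze` are made up for illustration.

```python
import re
from typing import Callable, List

def strip_html(text: str) -> str:
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run text through char filters, then one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>iPhone</b> 15 Pro Max", [strip_html], whitespace_tokenizer, [lowercase]))
# → ['iphone', '15', 'pro', 'max']
```

The order matters: character filters see the raw string, while token filters only ever see the tokenizer's output.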
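8.1.2 Built-in Elasticsearch Analyzers
| Analyzer | Description |
|---|---|
| standard | Default analyzer; splits English into words and Chinese into single characters |
| simple | Splits on any non-letter character and lowercases the tokens |
| whitespace | Splits on whitespace |
| keyword | No tokenization; the entire string becomes one token |
| pattern | Splits using a regular expression |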
8.1.3 Testing an Analyzer
# Test how an analyzer tokenizes a piece of text
POST _analyze
{
  "analyzer": "standard",
  "text": "iPhone 15 Pro Max"
}
# Output (abridged: the type and position fields are omitted)
{
  "tokens": [
    { "token": "iphone", "start_offset": 0, "end_offset": 6 },
    { "token": "15", "start_offset": 7, "end_offset": 9 },
    { "token": "pro", "start_offset": 10, "end_offset": 13 },
    { "token": "max", "start_offset": 14, "end_offset": 17 }
  ]
}
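When comparing analyzers it can help to reduce an `_analyze` response to its token strings and to check that each token's offsets point back into the input. The sketch below works on the abridged response shown above as a plain dictionary; the helper names `token_strings` and `offsets_match` are hypothetical, and no Elasticsearch client is involved.

```python
from typing import Dict, List

def token_strings(response: Dict) -> List[str]:
    """Return just the token text from an _analyze response."""
    return [t["token"] for t in response["tokens"]]

def offsets_match(text: str, response: Dict) -> bool:
    """True if every token equals the lowercased slice of text at its offsets."""
    return all(
        text[t["start_offset"]:t["end_offset"]].lower() == t["token"]
        for t in response["tokens"]
    )

text = "iPhone 15 Pro Max"
response = {
    "tokens": [
        {"token": "iphone", "start_offset": 0, "end_offset": 6},
        {"token": "15", "start_offset": 7, "end_offset": 9},
        {"token": "pro", "start_offset": 10, "end_offset": 13},
        {"token": "max", "start_offset": 14, "end_offset": 17},
    ]
}

print(token_strings(response))        # → ['iphone', '15', 'pro', 'max']
print(offsets_match(text, response))  # → True
```

The lowercase comparison only fits analyzers that lowercase their tokens, such as `standard`.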
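8.2 Challenges of Chinese Word Segmentation
8.2.1 Why Chinese Segmentation Is Hard
- Ambiguity: 南京市/长江大桥 (Nanjing City / Yangtze River Bridge) vs 南京/市长/江大桥 (Nanjing / mayor / Jiang Daqiao)
- New words: 内马尔 (Neymar), 元宇宙 (metaverse), AI大模型 (large AI models)
- Letters and digits: 12345, 2024年
- Mixed text: iPhone15系列新品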
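The ambiguity problem can be made concrete by listing every way a dictionary can cover the same string. The sketch below is a toy: the vocabulary `VOCAB` is a hand-picked assumption, and real segmenters use much larger dictionaries plus scoring rules rather than plain enumeration.

```python
from typing import List, Set

def all_segmentations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Return every way to split text into consecutive words from vocab."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results

# Hypothetical mini-dictionary chosen to expose the ambiguity.
VOCAB = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥", "江大桥"}

for segmentation in all_segmentations("南京市长江大桥", VOCAB):
    print("/".join(segmentation))
# → 南京/市长/江大桥
# → 南京市/长江/大桥
# → 南京市/长江大桥
```

All three splits are valid according to the dictionary, so a segmenter needs extra rules or statistics to prefer one of them.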
8.2.2 Default Analyzer Behavior
POST _analyze
{
  "analyzer": "standard",
  "text": "南京市长江大桥"
}
# Result: one token per character
{ "token": "南" }
{ "token": "京" }
{ "token": "市" }
{ "token": "长" }
{ "token": "江" }
{ "token": "大" }
{ "token": "桥" }
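8.3 The IK Analyzer
8.3.1 Installing the IK Analyzer
# Install on Elasticsearch 8.x (the plugin version must match the Elasticsearch version exactly)
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip
# Verify that the plugin is installed, then restart the node to load it
./bin/elasticsearch-plugin list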
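8.3.2 IK Segmentation Modes
| Mode | Description | Typical use |
|---|---|---|
| ik_smart | Smart segmentation (coarsest granularity) | Search time |
| ik_max_word | Finest-grained segmentation | Index time |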
# ik_smart mode
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "南京市长江大桥"
}
# Result
{ "token": "南京市" }
{ "token": "长江大桥" }
# ik_max_word mode
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "南京市长江大桥"
}
# Result (illustrative; the exact tokens vary with the IK version and dictionary contents)
{ "token": "南京市" }
{ "token": "京市" }
{ "token": "长江大桥" }
{ "token": "长江" }
{ "token": "江大" }
{ "token": "大桥" }
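The difference between the two granularities can be sketched with a toy dictionary matcher. This is not IK's actual algorithm: `fine_grained` simply emits every dictionary word found anywhere in the text, `coarse_grained` uses forward maximum matching, and the vocabulary `VOCAB` is an assumption made for the example.

```python
from typing import List, Set

# Hypothetical mini-dictionary; IK ships far larger ones.
VOCAB = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}

def fine_grained(text: str, vocab: Set[str]) -> List[str]:
    """Emit every dictionary word found at any position (overlaps allowed)."""
    max_len = max(len(w) for w in vocab)
    tokens = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            if text[start:end] in vocab:
                tokens.append(text[start:end])
    return tokens

def coarse_grained(text: str, vocab: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    max_len = max(len(w) for w in vocab)
    tokens, i = [], 0
    while i < len(text):
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in vocab or end == i + 1:
                tokens.append(text[i:end])
                i = end
                break
    return tokens

print(fine_grained("南京市长江大桥", VOCAB))
# → ['南京', '南京市', '市长', '长江', '长江大桥', '大桥']
print(coarse_grained("南京市长江大桥", VOCAB))
# → ['南京市', '长江大桥']
```

Overlapping tokens increase recall at index time, while the shorter non-overlapping list keeps queries precise, which is the reasoning behind the "Typical use" column above.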
8.3.3 Configuring an Index with IK
The index below tokenizes with ik_max_word at index time and with ik_smart at search time.
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["lowercase"]
        },
        "my_ik_search_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "my_ik_search_analyzer"
      },
      "content": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "my_ik_search_analyzer"
      }
    }
  }
}
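Because the `title` and `content` mappings are identical, the request body can also be generated rather than written by hand. The following sketch only builds the JSON body shown above; the helper name `ik_index_body` is hypothetical, and sending the body to a cluster is left to whichever HTTP client is in use.

```python
import json
from typing import Dict, Iterable

def ik_index_body(text_fields: Iterable[str]) -> Dict:
    """Build a create-index body that maps every given field to the IK analyzers."""
    def custom(tokenizer: str) -> Dict:
        return {"type": "custom", "tokenizer": tokenizer, "filter": ["lowercase"]}

    field = {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "my_ik_search_analyzer",
    }
    return {
        "settings": {"analysis": {"analyzer": {
            "my_ik_analyzer": custom("ik_max_word"),
            "my_ik_search_analyzer": custom("ik_smart"),
        }}},
        # Copy the field template so later edits to one field do not leak into others.
        "mappings": {"properties": {name: dict(field) for name in text_fields}},
    }

body = ik_index_body(["title", "content"])
print(json.dumps(body, indent=2))
```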
8.4 IK Dictionary Configuration
8.4.1 Dictionary File Locations
elasticsearch/
└── config/
    └── analysis-ik/
        ├── IKAnalyzer.cfg.xml
        ├── extra_main.dic
        ├── extra_single_word.dic
        ├── extra_single_word_full.dic
        ├── extra_stopword.dic
        ├── suffix.dic
        ├── preposition.dic
        └── quantifier.dic
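8.4.2 Configuring Custom Dictionaries
Edit IKAnalyzer.cfg.xml. Dictionary paths are relative to the directory that contains this file, and multiple files are separated with semicolons:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/properties/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- User extension dictionaries -->
    <entry key="ext_dict">custom/mydict.dic;custom/proper_nouns.dic</entry>
    <!-- Extension stopword dictionaries -->
    <entry key="ext_stopwords">custom/stopwords.dic</entry>
    <!-- Remote extension dictionary, fetched over HTTP and hot-reloaded -->
    <entry key="remote_ext_dict">http://your-server/dict/terms.txt</entry>
</properties>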
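A quick way to sanity-check the file before restarting a node is to parse it and list what each key points to. The sketch below uses Python's standard `xml.etree.ElementTree` on an inline copy of the configuration; the helper name `read_ik_entries` is hypothetical, and reading from the real file path is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

CFG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/properties/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic;custom/proper_nouns.dic</entry>
    <entry key="ext_stopwords">custom/stopwords.dic</entry>
    <entry key="remote_ext_dict">http://your-server/dict/terms.txt</entry>
</properties>"""

def read_ik_entries(xml_bytes: bytes) -> Dict[str, List[str]]:
    """Map each <entry key="..."> to its semicolon-separated values."""
    root = ET.fromstring(xml_bytes)
    entries: Dict[str, List[str]] = {}
    for entry in root.findall("entry"):
        values = [part.strip() for part in (entry.text or "").split(";")]
        entries[entry.get("key")] = [v for v in values if v]
    return entries

for key, values in read_ik_entries(CFG).items():
    print(key, values)
```

A malformed file raises `xml.etree.ElementTree.ParseError`, which is easier to diagnose here than in the node's startup log.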
8.4.3 Custom Dictionary Format
mydict.dic (UTF-8 encoded, one entry per line):
iPhone 15
iPhone 15 Pro
iPhone 15 Pro Max
Apple Vision Pro
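Dictionary files that are edited by hand tend to collect blank lines, stray whitespace, and duplicates. The sketch below cleans a list of entries before they are written out as UTF-8; the helper name `normalize_entries` is hypothetical and the cleanup rules are an assumption about what is useful, not something IK requires.

```python
from typing import Iterable, List

def normalize_entries(lines: Iterable[str]) -> List[str]:
    """Strip whitespace and byte-order marks, drop blank lines,
    and remove duplicates while keeping the first occurrence."""
    seen = set()
    result: List[str] = []
    for line in lines:
        entry = line.replace("\ufeff", "").strip()
        if entry and entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result

raw = ["\ufeffiPhone 15", "iPhone 15 Pro ", "", "iPhone 15", "Apple Vision Pro\n"]
clean = normalize_entries(raw)
print(clean)  # → ['iPhone 15', 'iPhone 15 Pro', 'Apple Vision Pro']

# One entry per line, ready to be written with encoding="utf-8".
dictionary_text = "\n".join(clean) + "\n"
```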
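8.5 The Jieba Analyzer
8.5.1 Installing the Jieba Analyzer
# Download and build (check the repository README for the build command that matches your plugin version)
git clone https://github.com/sing1ee/elasticsearch-jieba-plugin.git
cd elasticsearch-jieba-plugin
mvn package
# Install the built zip; run elasticsearch-plugin from the Elasticsearch home directory and adjust the file: path to where the zip was built
./bin/elasticsearch-plugin install file:target/elasticsearch-jieba-plugin-*.zip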
8.5.2 Jieba Segmentation Modes
# Index mode (the default)
POST _analyze
{
  "analyzer": "jieba_index",
  "text": "南京市长江大桥"
}
# Search mode
POST _analyze
{
  "analyzer": "jieba_search",
  "text": "南京市长江大桥"
}
8.6 The Pinyin Analyzer
8.6.1 Installing the Pinyin Analyzer
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v8.12.0/elasticsearch-analysis-pinyin-8.12.0.zip
8.6.2 Configuring the Pinyin Filter
PUT /users
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true
        }
      },
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["pinyin_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "pinyin_analyzer"
      }
    }
  }
}
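To get a feel for what the `keep_full_pinyin`, `keep_original`, `keep_first_letter`, and `limit_first_letter_length` options control, here is a toy imitation. It is an assumption-laden sketch: the lookup table `PINYIN` covers only three characters, the function `pinyin_tokens` is made up, and the real plugin's token order and positions may differ.

```python
from typing import Dict, List

# Hypothetical three-character lookup table; a real converter needs a full pinyin dictionary.
PINYIN: Dict[str, str] = {"刘": "liu", "德": "de", "华": "hua"}

def pinyin_tokens(
    term: str,
    keep_first_letter: bool = True,
    keep_full_pinyin: bool = True,
    keep_original: bool = True,
    limit_first_letter_length: int = 16,
) -> List[str]:
    """Expand one term into pinyin-related tokens, loosely following the filter options."""
    syllables = [PINYIN[ch] for ch in term if ch in PINYIN]
    tokens: List[str] = []
    if keep_full_pinyin:
        tokens.extend(syllables)       # one token per syllable
    if keep_original:
        tokens.append(term)            # the untouched input term
    if keep_first_letter and syllables:
        initials = "".join(s[0] for s in syllables)
        tokens.append(initials[:limit_first_letter_length])
    return tokens

print(pinyin_tokens("刘德华"))
# → ['liu', 'de', 'hua', '刘德华', 'ldh']
```

With tokens like these in the index, a user can find 刘德华 by typing `liu`, `ldh`, or the original characters.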
8.6.3 Combining IK and Pinyin
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": true
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ik_pinyin_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["pinyin_filter"]
        },
        "ik_pinyin_search_analyzer": {
          "tokenizer": "ik_smart",
          "filter": ["pinyin_filter"]
        },
        "autocomplete_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["edge_ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_pinyin_analyzer",
        "search_analyzer": "ik_pinyin_search_analyzer",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    }
  }
}
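The `edge_ngram` filter used by `autocomplete_analyzer` turns each token into its prefixes, which is what makes prefix-style matching possible. A minimal Python sketch of that expansion, using a hypothetical helper `edge_ngrams` with the same `min_gram` and `max_gram` values as the configuration above:

```python
from typing import List

def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Return the prefixes of token whose length is between min_gram and max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("iphone"))
# → ['i', 'ip', 'iph', 'ipho', 'iphon', 'iphone']
print(edge_ngrams("长江大桥"))
# → ['长', '长江', '长江大', '长江大桥']
```

Every prefix becomes an indexed term, so a small `min_gram` improves matching on short input at the cost of a larger index.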
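8.7 Search Suggestions (Completion Suggester)
8.7.1 Configuring Autocomplete
# PUT /products fails if the products index from section 8.6.3 still exists; delete it first or use a different index name
PUT /products
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion",
        "analyzer": "ik_max_word",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50
      },
      "name": {
        "type": "text"
      }
    }
  }
}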
8.7.2 Using Autocomplete
POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "iphon",
      "completion": {
        "field": "suggest",
        "size": 5,
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": "AUTO"
        }
      }
    }
  }
}
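The suggestions come back nested under the suggester's name, with one entry per input and a list of `options` inside each entry. The sketch below pulls the suggested texts out of such a response; the helper `suggestion_texts` is hypothetical, and `sample_response` is a made-up, trimmed example rather than real cluster output.

```python
from typing import Dict, List

def suggestion_texts(response: Dict, name: str) -> List[str]:
    """Collect the text of every option returned for the named suggester."""
    texts: List[str] = []
    for entry in response.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            texts.append(option["text"])
    return texts

# Made-up response, trimmed to the fields used here.
sample_response = {
    "suggest": {
        "product_suggest": [
            {
                "text": "iphon",
                "offset": 0,
                "length": 5,
                "options": [
                    {"text": "iPhone 15"},
                    {"text": "iPhone 15 Pro"},
                ],
            }
        ]
    }
}

print(suggestion_texts(sample_response, "product_suggest"))
# → ['iPhone 15', 'iPhone 15 Pro']
```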
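8.8 Synonym Configuration
8.8.1 Synonym Dictionary
# synonyms.txt (Solr format: comma-separated terms on one line are treated as equivalent)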
手机,移动电话,智能手机
电脑,计算机,PC
iPhone,苹果手机
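To see what such a file expresses, the sketch below parses Solr-style synonym lines into a lookup table. It is a simplified reading of the format that handles comma-separated equivalence groups and explicit `a => b` mappings only; the function name `parse_synonyms` is hypothetical, and this is not the parser Elasticsearch uses.

```python
from typing import Dict, Iterable, Set

def parse_synonyms(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each term to the set of terms it expands to."""
    table: Dict[str, Set[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by those on the right.
            left, right = line.split("=>", 1)
            targets = {t.strip() for t in right.split(",")}
            for source in left.split(","):
                table.setdefault(source.strip(), set()).update(targets)
        else:
            # Equivalence group: every term expands to the whole group.
            group = {t.strip() for t in line.split(",")}
            for term in group:
                table.setdefault(term, set()).update(group)
    return table

rules = [
    "# synonyms.txt",
    "手机,移动电话,智能手机",
    "电脑,计算机,PC",
    "iPhone,苹果手机",
]
table = parse_synonyms(rules)
print(sorted(table["手机"]))
```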
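8.8.2 Configuring a Synonym Analyzer
# A filter marked "updateable": true may only be used in a search-time analyzer, so the title field indexes with ik_max_word and applies synonym_analyzer as its search_analyzer.
# The synonym analyzer tokenizes with ik_smart, in line with the search-time recommendation in section 8.3.2.
# PUT /articles fails if the articles index from section 8.3.3 still exists; delete it first or use a different index name.
PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt",
          "updateable": true
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "ik_smart",
          "filter": ["synonym_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}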
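8.9 Summary
This chapter covered the basic concepts and configuration of Chinese analyzers, focusing on installing and using the IK analyzer and on combining it with the pinyin analyzer. Choosing the right analyzer is key to effective Chinese search. The next chapter covers cluster management.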