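Chapter 8: Chinese Word Segmentation

Learn how to configure Chinese analyzers in Elasticsearch, including commonly used options such as the IK analyzer and the jieba analyzer.

Last updated: 2024-01-15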

8.1 Analyzer Basics

8.1.1 Analysis Pipeline

Input text ──► Char Filter ──► Tokenizer ──► Token Filter ──► Output tokens
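
As a rough illustration of the three stages, the sketch below mimics the pipeline in plain Python. It is not Elasticsearch code: the function names (strip_html, word_tokenizer, lowercase_filter, analyze) are made up for this example, and the regular expressions are simplified stand-ins for a character filter, a tokenizer, and a token filter.

```python
import re


def strip_html(text):
    """Character filter stage: replace HTML tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text):
    """Tokenizer stage: split the filtered text into word-like tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter stage: normalize every token to lowercase."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the stages in pipeline order: char filter, tokenizer, token filter."""
    return lowercase_filter(word_tokenizer(strip_html(text)))


print(analyze("<b>iPhone 15</b> Pro Max"))
```

Each stage only sees the output of the previous one, which is why a character filter can remove markup before the tokenizer ever looks at the text.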

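8.1.2 Built-in Elasticsearch Analyzers

Analyzer    Description
standard    Default analyzer; splits English on word boundaries and Chinese into single characters
simple      Splits on any non-letter character and lowercases the tokens
whitespace  Splits on whitespace only
keyword     No tokenization; emits the whole string as a single token
pattern     Splits using a regular expression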

8.1.3 Testing an Analyzer

# Inspect the tokens an analyzer produces
POST _analyze
{
  "analyzer": "standard",
  "text": "iPhone 15 Pro Max"
}

# Output (abbreviated; the real response also includes "type" and "position")
{
  "tokens": [
    { "token": "iphone", "start_offset": 0, "end_offset": 6 },
    { "token": "15", "start_offset": 7, "end_offset": 9 },
    { "token": "pro", "start_offset": 10, "end_offset": 13 },
    { "token": "max", "start_offset": 14, "end_offset": 17 }
  ]
}

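8.2 Challenges of Chinese Word Segmentation

8.2.1 Why Chinese Segmentation Is Hard

  • Ambiguity resolution: 南京市/长江大桥 (Nanjing City / Yangtze River Bridge) vs 南京/市长/江大桥 (Nanjing / mayor / Jiang Daqiao)
  • New word discovery: 内马尔 (Neymar), 元宇宙 (metaverse), AI大模型 (large AI models)
  • English and number handling: 12345, 2024年
  • Mixed text: iPhone15系列新品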
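
To make the ambiguity problem concrete, here is a toy forward maximum matching segmenter. It is a teaching sketch only, not how IK or jieba work internally; the function name and both dictionaries are invented for this example. The same sentence yields either of the two readings above depending on which words the dictionary contains.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "南京市长江大桥"

# Hypothetical dictionary that knows the place names.
place_names = {"南京市", "长江大桥"}
# Hypothetical dictionary that only knows the shorter words.
short_words = {"南京", "市长", "江大桥"}

print(forward_max_match(sentence, place_names))
print(forward_max_match(sentence, short_words))
```

With an empty dictionary the function degrades to one token per character, which is the behavior shown for the standard analyzer in the next section.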

8.2.2 Default Analyzer Output

POST _analyze
{
  "analyzer": "standard",
  "text": "南京市长江大桥"
}

# Result: one token per character
{ "token": "南" }
{ "token": "京" }
{ "token": "市" }
{ "token": "长" }
{ "token": "江" }
{ "token": "大" }
{ "token": "桥" }

8.3 IK Analyzer

8.3.1 Installing the IK Analyzer

# Install on Elasticsearch 8.x (the plugin version must match the Elasticsearch version exactly)
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip

# Verify the installation (restart the node afterwards so the plugin is loaded)
./bin/elasticsearch-plugin list

8.3.2 IK Analysis Modes

Mode         Description                           Typical use
ik_smart     Smart (coarsest-grained) segmentation Search time
ik_max_word  Finest-grained segmentation           Index time

# ik_smart mode
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "南京市长江大桥"
}

# Result
{ "token": "南京市" }
{ "token": "长江大桥" }

# ik_max_word mode
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "南京市长江大桥"
}

# Result (the exact tokens depend on the IK version and the loaded dictionaries)
{ "token": "南京市" }
{ "token": "京市" }
{ "token": "长江大桥" }
{ "token": "长江" }
{ "token": "江大" }
{ "token": "大桥" }

8.3.3 Configuring an Index with IK

PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["lowercase"]
        },
        "my_ik_search_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "my_ik_search_analyzer"
      },
      "content": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "my_ik_search_analyzer"
      }
    }
  }
}
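
The reason for pairing ik_max_word at index time with ik_smart at search time can be sketched with a tiny inverted index. This is an illustration under simplifying assumptions, not Elasticsearch's implementation: the helper names are invented, positions and scoring are ignored, and the token lists are simply copied from the results shown in 8.3.2.

```python
from collections import defaultdict

# Token lists copied from the ik_max_word and ik_smart results in 8.3.2.
INDEX_TOKENS = ["南京市", "京市", "长江大桥", "长江", "江大", "大桥"]
SEARCH_TOKENS = ["南京市", "长江大桥"]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def match_all(index, query_tokens):
    """Return the ids of documents that contain every query token."""
    if not query_tokens:
        return set()
    postings = [index.get(token, set()) for token in query_tokens]
    return set.intersection(*postings)


index = build_inverted_index({1: INDEX_TOKENS})

print(match_all(index, SEARCH_TOKENS))  # coarse query tokens are all indexed
print(match_all(index, ["长江"]))        # a shorter query still finds the document
```

Indexing the fine-grained tokens keeps recall high for short queries, while the coarse search-time tokens avoid matching on fragments the user did not intend.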

8.4 IK Dictionary Configuration

8.4.1 Dictionary File Locations

elasticsearch/
└── config/
    └── analysis-ik/
        ├── IKAnalyzer.cfg.xml
        ├── extra_main.dic
        ├── extra_single_word.dic
        ├── extra_single_word_full.dic
        ├── extra_stopword.dic
        ├── suffix.dic
        ├── preposition.dic
        └── quantifier.dic

8.4.2 Configuring Custom Dictionaries

Edit IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/properties/dtd/properties.dtd">
<properties>
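    <comment>IK Analyzer extension configuration</comment>
    
    <!-- User extension dictionaries (multiple files separated by ";") -->
    <entry key="ext_dict">custom/mydict.dic;custom/proper_nouns.dic</entry>
    
    <!-- Extension stopword dictionaries -->
    <entry key="ext_stopwords">custom/stopwords.dic</entry>
    
    <!-- Remote extension dictionary, fetched over HTTP and hot-reloaded -->
    <entry key="remote_ext_dict">http://your-server/dict/terms.txt</entry>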
</properties>
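
One detail the configuration leaves implicit: according to the IK plugin's README, the URL given in remote_ext_dict should return Last-Modified and ETag response headers, and the plugin reloads the word list when either value changes. The body is UTF-8 text with one word per line. The sketch below shows one way such an endpoint could be served with Python's standard library. The file name terms.txt, the port, and all function and class names are assumptions for illustration, and this is not a production server.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

DICT_PATH = Path("terms.txt")  # hypothetical file: UTF-8, one word per line


def dict_headers(content: bytes, mtime: float) -> dict:
    """Build the headers a client can compare between polls to detect changes."""
    return {
        "Content-Type": "text/plain; charset=utf-8",
        "Last-Modified": formatdate(mtime, usegmt=True),
        "ETag": '"' + hashlib.md5(content).hexdigest() + '"',
    }


class DictHandler(BaseHTTPRequestHandler):
    def _send_headers(self) -> bytes:
        """Send status and headers for the dictionary; return the file content."""
        content = DICT_PATH.read_bytes()
        self.send_response(200)
        for name, value in dict_headers(content, DICT_PATH.stat().st_mtime).items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        return content

    def do_HEAD(self):
        # Headers only, so a poller can check for changes cheaply.
        self._send_headers()

    def do_GET(self):
        self.wfile.write(self._send_headers())


def serve(port: int = 8000):
    """Start the server (blocks until interrupted). Call this explicitly."""
    HTTPServer(("", port), DictHandler).serve_forever()
```

Deriving the ETag from the file content means that any edit to the word list changes the header, without tracking versions by hand.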

8.4.3 Custom Dictionary Format

# mydict.dic: UTF-8 encoded, one entry per line (this label line is not part of the file)
iPhone 15
iPhone 15 Pro
iPhone 15 Pro Max
Apple Vision Pro

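8.5 Jieba Analyzer

8.5.1 Installing the Jieba Analyzer

# Download and build (check the repository README for the build command that matches your Elasticsearch version)
git clone https://github.com/sing1ee/elasticsearch-jieba-plugin.git
cd elasticsearch-jieba-plugin
mvn package

# Install from the Elasticsearch home directory; replace both placeholder paths and <version> with your own values
cd /path/to/elasticsearch
./bin/elasticsearch-plugin install file:///path/to/elasticsearch-jieba-plugin/target/elasticsearch-jieba-plugin-<version>.zip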

8.5.2 Jieba Analysis Modes

# Index mode (default)
POST _analyze
{
  "analyzer": "jieba_index",
  "text": "南京市长江大桥"
}

# Search mode
POST _analyze
{
  "analyzer": "jieba_search",
  "text": "南京市长江大桥"
}

8.6 Pinyin Analyzer

8.6.1 Installing the Pinyin Analyzer

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v8.12.0/elasticsearch-analysis-pinyin-8.12.0.zip

8.6.2 Pinyin Analyzer Configuration

PUT /users
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true
        }
      },
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["pinyin_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "pinyin_analyzer"
      }
    }
  }
}
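
To give a feel for what the keep_* options control, the sketch below emulates them for a single term. It is a simplified model rather than the plugin's code: the three-character lookup table, the function name, and the order of the returned tokens are assumptions of this example, and options such as limit_first_letter_length are not modeled.

```python
# Toy lookup table; the real plugin ships a full pinyin dictionary.
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def pinyin_tokens(term, keep_first_letter=True, keep_full_pinyin=True,
                  keep_original=True):
    """Return the tokens a pinyin-style filter might emit for one term."""
    syllables = [PINYIN[ch] for ch in term if ch in PINYIN]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)          # one token per syllable
    if keep_original:
        tokens.append(term)               # the unchanged Chinese term
    if keep_first_letter and syllables:
        tokens.append("".join(s[0] for s in syllables))  # initials joined
    return tokens


print(pinyin_tokens("刘德华"))
print(pinyin_tokens("刘德华", keep_full_pinyin=False, keep_original=False))
```

Keeping the original term alongside the pinyin forms lets one field answer queries typed in Chinese, in full pinyin, or as initials.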

8.6.3 Combining IK and Pinyin

PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": true
        },
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ik_pinyin_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["pinyin_filter"]
        },
        "ik_pinyin_search_analyzer": {
          "tokenizer": "ik_smart",
          "filter": ["pinyin_filter"]
        },
        "autocomplete_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["edge_ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_pinyin_analyzer",
        "search_analyzer": "ik_pinyin_search_analyzer",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    }
  }
}
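
The autocomplete sub-field works because the edge_ngram filter indexes every prefix of a token, so a partially typed query matches as an ordinary term. Here is a minimal sketch of that expansion. The function name is invented, and the defaults simply mirror the min_gram and max_gram values used in the configuration.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token whose length is within [min_gram, max_gram]."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


indexed_terms = edge_ngrams("iphone")
print(indexed_terms)

# A partially typed query is matched as a whole term against the indexed prefixes.
print("iph" in indexed_terms)
```

This is also why the sub-field uses a different search_analyzer: applying edge n-grams to the query as well would make a query such as "ipad" match "iphone" through the shared prefix "i".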

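8.7 Search Suggestions (Completion Suggester)

8.7.1 Configuring Autocomplete

# If the products index from 8.6.3 already exists, delete it first or use a different index name
PUT /products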
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion",
        "analyzer": "ik_max_word",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50
      },
      "name": {
        "type": "text"
      }
    }
  }
}

8.7.2 Using Autocomplete

POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "iphon",
      "completion": {
        "field": "suggest",
        "size": 5,
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": "AUTO"
        }
      }
    }
  }
}
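
Conceptually, a completion request is a prefix lookup over the stored suggestion inputs. The sketch below imitates that with a sorted list and binary search. It is only an analogy: Elasticsearch uses a different in-memory structure, and this version ignores weights, analysis, and the fuzzy option. The function name and sample inputs are invented.

```python
import bisect


def suggest(sorted_inputs, prefix, size=5):
    """Return up to `size` entries from a sorted list that start with prefix."""
    start = bisect.bisect_left(sorted_inputs, prefix)
    results = []
    for candidate in sorted_inputs[start:]:
        if not candidate.startswith(prefix):
            break  # sorted order: no later entry can share the prefix
        results.append(candidate)
        if len(results) == size:
            break
    return results


inputs = sorted([
    "ipad pro", "iphone 15", "iphone 15 pro", "iphone 15 pro max", "imac",
])

print(suggest(inputs, "iphon"))
print(suggest(inputs, "iphon", size=2))
```

Because entries sharing a prefix sit next to each other, the lookup cost depends on the number of matches returned rather than on the total number of inputs.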

8.8 Synonym Configuration

8.8.1 Synonym Dictionary

# synonyms.txt
手机,移动电话,智能手机
电脑,计算机,PC
iPhone,苹果手机

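8.8.2 Configuring a Synonym Analyzer

# A filter with "updateable": true can only be used in a search analyzer,
# so synonyms are applied at query time and the field is indexed with ik_max_word.
# If the articles index from 8.3.3 already exists, delete it first or use a different index name.
PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt",
          "updateable": true
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "ik_smart",
          "filter": ["synonym_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}

# After editing synonyms.txt, apply the changes without reindexing
POST /articles/_reload_search_analyzers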

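8.9 Summary

This chapter introduced the basic concepts and configuration of Chinese analyzers, focusing on installing and using the IK analyzer and combining it with the pinyin analyzer. Choosing the right analyzer for indexing and for searching is the key to good Chinese search results. The next chapter covers cluster management.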