ES analyzers

Nick Qiu, posted on 2021/06/19 22:17:57


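Analyzers

Definition

An analyzer breaks text into individual terms (tokens) according to a set of rules.

Built-in analyzers in ES:

standard analyzer: the standard analyzer, used by default when no other analyzer is specified
simple analyzer: splits text at every non-letter character and lowercases the tokens
whitespace analyzer: splits text on whitespace
stop analyzer: like the simple analyzer, but also removes stop words (the, a, an, this, of, at, in, and so on) from the result
language analyzers: analyzers for a specific language, such as english
pattern analyzer: splits text using a regular expression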
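
To make the differences concrete, here is a rough pure-Python approximation of three of these analyzers. It is only a sketch: the function names and the small stop-word set are illustrative assumptions, not Elasticsearch's actual implementation or its full default stop list.

```python
import re

# Illustrative stop words only (taken from the examples above);
# the real stop analyzer ships with its own default English list.
STOP_WORDS = {"the", "a", "an", "this", "of", "at", "in"}

# Approximation of "a run of letters": word characters minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")


def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyze(text):
    """Split at every non-letter character, then lowercase each token."""
    return [m.group().lower() for m in LETTERS.finditer(text)]


def stop_analyze(text):
    """Same as simple_analyze, but with stop words removed."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]


if __name__ == "__main__":
    sample = "The price of 2 apples at this shop"
    print(whitespace_analyze(sample))
    print(simple_analyze(sample))
    print(stop_analyze(sample))
```

On the sample sentence, the whitespace version keeps "The" and "2" as they are, the simple version lowercases and drops the digit, and the stop version is left with only price, apples and shop.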

Analyzer practice

# >>> Request: see how a string is tokenized
curl -X POST "localhost:9200/_analyze?pretty" -H "Content-Type: application/json" -d '
{
  "analyzer": "standard",
  "text": "Hello my word! 中国"
}
'

# <<< Response: the resulting tokens
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "my",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "中",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "国",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}


# Request: use the simple analyzer
curl -X POST "localhost:9200/_analyze?pretty" -H "Content-Type: application/json" -d '
{
  "analyzer": "simple",
  "text": "Hello my word! 中国"
}
'

# Response: the exclamation mark has been dropped, and 中国 is kept as a single token
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "my",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    }
  ]
}
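
The token, offset and position fields in the two responses can be reproduced with a short Python sketch. It assumes two simplified tokenization rules (runs of ASCII letters or digits plus single CJK ideographs for standard; runs of letters for simple). The real analyzers follow fuller Unicode rules and handle many more cases, so treat the patterns below as stand-ins.

```python
import re

# Simplified stand-in for the standard analyzer: a run of ASCII letters/digits,
# or one CJK ideograph at a time (assumption: only the basic CJK block is covered).
STANDARD_LIKE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")

# Simplified stand-in for the simple analyzer: a run of letters of any script.
SIMPLE_LIKE = re.compile(r"[^\W\d_]+")


def tokens_with_offsets(pattern, text):
    """Return token dicts shaped like the entries of the _analyze response."""
    return [
        {
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the stream
        }
        for position, match in enumerate(pattern.finditer(text))
    ]


if __name__ == "__main__":
    text = "Hello my word! 中国"
    for name, pattern in (("standard", STANDARD_LIKE), ("simple", SIMPLE_LIKE)):
        print(name)
        for tok in tokens_with_offsets(pattern, text):
            print(" ", tok)
```

The gap between end_offset 13 and start_offset 15 corresponds to the "!" and the space, which neither analyzer emits as a token.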

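Chinese analyzers

The standard analyzer splits a Chinese string into individual characters.
smartCn: an analyzer for Chinese and mixed Chinese-English text (plugin name: analysis-smartcn)
- Install: from the bin directory, run sh elasticsearch-plugin install analysis-smartcn, then restart ES
- Uninstall: from the bin directory, run sh elasticsearch-plugin remove analysis-smartcn, then restart ES
IK analyzer
- Install:
  - Download: https://github.com/medcl/elasticsearch-analysis-ik/releases (the version must match your ES version)
  - Unzip it into the plugins directory
  - Restart ES after installing
  - Verify that the plugin is installed: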

curl -X POST "localhost:9200/_analyze" -H 'Content-type: application/json' -d '
{
  "analyzer": "ik_max_word",
  "text": "解析中文需要这个插件"
}
'
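
The same check can be scripted. The sketch below builds the _analyze request with Python's standard library. The helper names build_analyze_request and run_analyze and the default base URL are assumptions made for illustration, and run_analyze only works against a running node that has the IK plugin installed.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST /_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return the list of token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


# Example (needs a running Elasticsearch node with the IK plugin):
# print(run_analyze("ik_max_word", "解析中文需要这个插件"))
```

If the analyzer name is unknown to the node, the server answers with an error status, which urlopen surfaces as urllib.error.HTTPError, so a failed call is itself a hint that the plugin did not load.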

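Specifying an analyzer when creating an index

# Create test_index with a custom analyzer (my_analyzer) applied to the name field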
curl -X PUT "localhost:9200/test_index" -H "Content-Type: application/json" -d '
{
  "settings": {
    "analysis": {
      "analyzer":{
        "my_analyzer":{
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties":{
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "age": {
        "type": "long"
      }
    }
  }
}
'
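
A typo in an analyzer name is an easy mistake to make in a body like this. As a sketch, the Python below rebuilds the same request body as a dictionary and checks that every analyzer referenced in the mappings is either defined under settings or is one of the built-in names listed earlier. The helper undefined_analyzers and the BUILT_IN set are hypothetical conveniences, not part of any Elasticsearch client.

```python
import json

# Built-in analyzer names mentioned in this article (not an exhaustive list).
BUILT_IN = {"standard", "simple", "whitespace", "stop", "pattern", "english"}

index_body = {
    "settings": {
        "analysis": {"analyzer": {"my_analyzer": {"type": "whitespace"}}}
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "my_analyzer"},
            "age": {"type": "long"},
        }
    },
}


def undefined_analyzers(body):
    """Return {field: analyzer} for fields whose analyzer is neither custom-defined nor built-in."""
    defined = set(body.get("settings", {}).get("analysis", {}).get("analyzer", {}))
    properties = body.get("mappings", {}).get("properties", {})
    return {
        field: spec["analyzer"]
        for field, spec in properties.items()
        if "analyzer" in spec and spec["analyzer"] not in defined | BUILT_IN
    }


if __name__ == "__main__":
    print(undefined_analyzers(index_body))   # empty dict when every reference resolves
    print(json.dumps(index_body, indent=2))  # body to send with PUT /test_index
```

Running the check before sending the PUT request catches a misspelled name locally instead of through an error from the server.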
