
[Elasticsearch] Text Analysis: Configuring Text Analysis (2)

Posted by 小雨青年 on 2022/03/29 01:02:42


A simple example

The whitespace analyzer is a simple one: it splits text into tokens wherever it finds whitespace.

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

  
 

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox.",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "word",
      "position" : 3
    }
  ]
}
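To make the offsets in the response easier to read, here is a minimal pure-Python sketch that imitates what the whitespace analyzer returns for this input. It is not Elasticsearch code and does not call Elasticsearch; the function name `whitespace_analyze` is hypothetical, and the regular expression is only an approximation of splitting on whitespace.

```python
import re


def whitespace_analyze(text: str) -> list[dict]:
    """Imitate the token list shown above (an illustration, not Elasticsearch itself)."""
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the token's first character
            "end_offset": match.end(),      # index just past the token's last character
            "type": "word",
            "position": position,           # ordinal of the token in the stream
        })
    return tokens


for tok in whitespace_analyze("The quick brown fox."):
    print(tok)
```

Note that the trailing period stays attached to "fox." because only whitespace separates tokens here; end_offset is exclusive, so "fox." spans offsets 16 to 20.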


  
 

Text analysis with a full analysis chain

POST _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text": "Is this déja vu  <b>test</b> ?"
}

  
 

The parameters in this request are:

char_filter: character filters; zero or more may be configured
tokenizer: the tokenizer; exactly one is required
filter: token filters; zero or more may be configured

In the request above, the configured stages do the following:

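html_strip (character filter): strips HTML tags
standard (tokenizer): the default tokenizer, which splits text into words and removes most punctuation
lowercase and asciifolding (token filters): lowercase converts tokens to lowercase; asciifolding converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists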

The result is as follows:

{
  "tokens" : [
    {
      "token" : "is",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "deja",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "vu",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "test",
      "start_offset" : 20,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
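The order of the three stages (character filters, then the tokenizer, then token filters) can be sketched in plain Python. This is a rough imitation under stated assumptions, not Elasticsearch's implementation: the regular expressions are crude stand-ins for html_strip and the standard tokenizer, accent folding via Unicode decomposition covers only part of what asciifolding does, and all function names are hypothetical. It also returns only the token text, without the offsets, types, and positions shown in the response above.

```python
import re
import unicodedata


def strip_html(text: str) -> str:
    # Character filter stage: crude stand-in for html_strip that
    # deletes anything shaped like a tag (no entity decoding).
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> list[str]:
    # Tokenizer stage: rough stand-in for the standard tokenizer.
    # Runs of word characters become tokens; punctuation is dropped.
    return re.findall(r"\w+", text)


def fold_accents(token: str) -> str:
    # Token filter stage: approximate asciifolding by decomposing
    # characters (NFKD) and discarding the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> list[str]:
    stripped = strip_html(text)               # 1. character filters
    tokens = tokenize(stripped)               # 2. exactly one tokenizer
    tokens = [t.lower() for t in tokens]      # 3a. lowercase token filter
    return [fold_accents(t) for t in tokens]  # 3b. asciifolding-like filter


# \u00e9 is the precomposed "é" from the request text.
print(analyze("Is this d\u00e9ja vu  <b>test</b> ?"))
# prints ['is', 'this', 'deja', 'vu', 'test']
```

The token filters run in the order they are listed, so lowercasing happens before folding here, matching the order of the filter array in the request.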


  
 

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html

Source: coderfix.blog.csdn.net. Author: 小雨青年. Copyright belongs to the original author; please contact the author before reposting.

Original link: coderfix.blog.csdn.net/article/details/114366212
