[Elasticsearch] Text analysis: using analyzers in _search queries (3)
Built-in analyzers
fingerprint
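The fingerprint analyzer implements a fingerprinting algorithm that the OpenRefine project uses to assist in clustering.
Internally, the input text goes through these steps:
- Convert to lowercase
- Normalize to remove extended characters
- Sort the tokens
- Remove duplicate tokens
- Remove stop words, if a stop word list is configured
The result is concatenated into a single token.
For example:
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
[ and consistent godel is said sentence this yes ]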
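The steps listed for the fingerprint analyzer can be approximated in a few lines of Python. This is only a sketch to make the pipeline concrete, not Elasticsearch's implementation: the function name `fingerprint`, the regex-based tokenization, and the NFKD-based accent folding are all assumptions chosen for illustration.

```python
import re
import unicodedata


def fingerprint(text, stopwords=frozenset()):
    """Approximate the fingerprint steps; not Elasticsearch's own code."""
    # Lowercase, then decompose accented characters and drop the combining
    # marks (a rough stand-in for removing extended characters: "ö" -> "o").
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: runs of word characters stand in for the real tokenizer.
    tokens = re.findall(r"\w+", folded)
    # Drop stop words (none by default), deduplicate, sort, join into one token.
    unique = sorted(set(tokens) - set(stopwords))
    return " ".join(unique)


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# and consistent godel is said sentence this yes
```

Because the tokens are sorted and deduplicated, inputs that differ only in word order, case, or repetition end up with the same fingerprint, which is what makes the output useful for clustering.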
keyword
The keyword analyzer is a no-op: it returns the entire input string unchanged, as a single token.
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
Language
Language analyzers are a set of analyzers aimed at analyzing text in a specific language. The following types are supported:
arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
pattern
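The pattern analyzer uses a regular expression to split text into terms. The expression matches the separators between tokens, not the tokens themselves; the default is \W+ (any run of non-word characters). Terms are lowercased by default.
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]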
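The splitting behaviour of the pattern analyzer can be imitated with Python's `re.split`. This is a minimal sketch: `pattern_analyze` is a hypothetical helper, and Python's regex flavour is not identical to the Java one Elasticsearch uses, so treat the result as an approximation.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text on a separator regex, roughly like the pattern analyzer."""
    if lowercase:
        text = text.lower()
    # re.split can yield empty strings at the edges (e.g. after the final
    # "."), so those are filtered out.
    return [term for term in re.split(pattern, text) if term]


print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Passing a different separator changes the tokens: with `pattern=r"\W|_"`, the input `"john_smith@foo-bar.com"` splits into `john`, `smith`, `foo`, `bar`, `com`.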
simple
- Splits the text into terms at every non-letter character and discards those characters
- Converts the terms to lowercase
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
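The two steps of the simple analyzer translate almost directly into a regular expression. The sketch below is an approximation: `simple_analyze` is a hypothetical name, and the character class `[^\W\d_]` is used as a stand-in for "letter", which may differ from Elasticsearch's definition for some Unicode characters.

```python
import re


def simple_analyze(text):
    """Keep runs of letters only, lowercased; digits and punctuation split terms."""
    # [^\W\d_] means "a word character that is neither a digit nor an underscore".
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the digit `2` disappears entirely, which is the visible difference from the pattern analyzer's output for the same sentence.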
standard
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
stop
The stop analyzer is essentially the same as simple, but adds support for removing stop words (the stopwords setting). By default it uses the English stop word list.
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
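The relationship between stop and simple can be shown as one extra filtering step. This is a sketch under assumptions: `stop_analyze` is a hypothetical helper, the letter-run regex is an approximation, and the word list below is written from memory of the default English set, so treat it as illustrative rather than authoritative.

```python
import re

# Assumption: illustrative copy of the default English stop word list.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def stop_analyze(text, stopwords=ENGLISH_STOP_WORDS):
    """Simple-style tokenization followed by stop word removal."""
    # Same two steps as the simple analyzer: letter runs, lowercased.
    tokens = [term.lower() for term in re.findall(r"[^\W\d_]+", text)]
    # The extra step: drop every token that appears in the stop word set.
    return [term for term in tokens if term not in stopwords]


print(stop_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```

Supplying a custom set, for example `stop_analyze(text, stopwords={"the", "over"})`, mirrors what a custom stopwords configuration would do.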
whitespace
The whitespace analyzer breaks text into terms whenever it encounters a whitespace character. It does not lowercase the terms or strip punctuation.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
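For comparison with the other sketches, whitespace splitting is close to what Python's `str.split()` does with no arguments. `whitespace_analyze` is a hypothetical name, and the exact set of characters treated as whitespace may differ from Elasticsearch's.

```python
def whitespace_analyze(text):
    """Split on runs of whitespace only; case and punctuation are untouched."""
    return text.split()


print(whitespace_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```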
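Points to note
- An analyzer with its default configuration can be used directly in a search.
- If you need extra configuration, such as stopwords, you have to define a custom analyzer.
- The search uses the analyzer's output as the query terms, which is equivalent to preprocessing the user's input yourself before searching.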
Querying with an analyzer
GET _search
{
  "query": {
    "match": {
      "message": {
        "query": "a Pose",
        "analyzer": "stop"
      }
    }
  }
}
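To see why the choice of search analyzer matters, the query above can be modelled with a toy in-memory index. Everything here is an illustrative assumption rather than Elasticsearch behaviour in detail: the sample documents, the shortened stop word set, the helper names `stop_analyze`, `build_index` and `match`, and the simplification that any matching term counts as a hit.

```python
import re
from collections import defaultdict

# Assumption: a shortened stop word set, just enough for the example.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "of", "to"})


def stop_analyze(text):
    """Letter runs, lowercased, with stop words removed."""
    tokens = [term.lower() for term in re.findall(r"[^\W\d_]+", text)]
    return [term for term in tokens if term not in STOP_WORDS]


def build_index(docs):
    """Map each lowercased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, message in docs.items():
        for term in re.findall(r"\w+", message.lower()):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Analyze the query first, then look up each resulting term."""
    terms = stop_analyze(query)
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return terms, sorted(hits)


docs = {1: "Strike a pose", 2: "A quick brown fox", 3: "Pose for the camera"}
index = build_index(docs)
print(match(index, "a Pose"))
# (['pose'], [1, 3])
```

The query string "a Pose" is reduced to the single term `pose` before any lookup happens. Without the stop word step, the term `a` would also be looked up and document 2 would match as well.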
References
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html
Source: coderfix.blog.csdn.net. Author: 小雨青年. Copyright belongs to the original author; please contact the author before reposting.
Original link: coderfix.blog.csdn.net/article/details/114392540