An ES analyzer splits a field into tokens before it is indexed, and those tokens are used to build the inverted index; at query time, the query keywords are tokenized by the specified analyzer and the resulting terms are then looked up in the index.
An ES analyzer is made up of three parts: character filters (char_filter), a tokenizer, and token filters (filter).
A custom analyzer is therefore configured as follows:
{
  "analysis": {
    "char_filter": {
      "char_filter_a": { ... }
    },
    "tokenizer": {
      "tokenizer_a": { ... }
    },
    "filter": {
      "filter_a": { ... },
      "filter_b": { ... }
    },
    "analyzer": {
      "analyzer_a": {
        "char_filter": [ ... ],
        "tokenizer": "...",
        "filter": [ ... ]
        ... // other settings
      },
      "analyzer_b": {
        "char_filter": [ ... ],
        "tokenizer": "...",
        "filter": [ ... ]
        ... // other settings
      }
    }
  }
}
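As a concrete sketch of the structure above, the following request creates an index with one custom analyzer built from the three kinds of components (the index name my_index and all component names here are illustrative, not from the original):
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": { "type": "html_strip" }
      },
      "tokenizer": {
        "my_tokenizer": { "type": "standard", "max_token_length": 10 }
      },
      "filter": {
        "my_stop_filter": { "type": "stop", "stopwords": [ "the", "is" ] }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "my_char_filter" ],
          "tokenizer": "my_tokenizer",
          "filter": [ "lowercase", "my_stop_filter" ]
        }
      }
    }
  }
}
The analyzer can then be referenced from a field mapping via "analyzer": "my_analyzer".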
Character filters preprocess the raw text of a field before it reaches the tokenizer. There are three built-in types: html_strip, mapping, and pattern_replace.
html_strip: strips HTML elements from the text. It can be used directly in an analyzer:
{
"tokenizer": "keyword",
"char_filter": [ "html_strip" ]
}
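For example, the official documentation tests this combination through the _analyze API:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Tokenization result:
[ \nI'm so happy!\n ]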
You can also define a custom char filter and specify tags that should not be stripped:
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": ["b"]
}
}
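Wiring this custom char filter into an index and testing it looks like the following (index and analyzer names are placeholders; the sample text is the one from the official docs):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "my_char_filter" ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": [ "b" ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Tokenization result:
[ \nI'm so <b>happy</b>!\n ]
The <b> tag is left in place because it is listed in escaped_tags.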
mapping: replaces one character (or string) with another. Properties:
mappings: a list of mappings of the form "key => value"; whitespace characters must be escaped
mappings_path: path to a UTF-8 encoded mapping file whose content has the same format as mappings
"char_filter": {
  "my_char_filter": {
    "type": "mapping",
    "mappings": [
      "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
      "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
    ]
  }
}
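A quick way to try this is a sketch using the _analyze API, which in recent ES versions accepts inline char filter definitions; the mapping below is abbreviated to the digits that appear in the sample text:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "٠ => 0", "١ => 1", "٢ => 2", "٥ => 5" ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
Tokenization result:
[ My license plate is 25015 ]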
pattern_replace: replaces characters that match a regular expression. Properties:
pattern: a Java regular expression
replacement: the replacement string
flags: regex flags
Example from the official documentation:
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
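Combined with the standard tokenizer, this filter turns hyphen-joined digit sequences into underscore-joined ones. A sketch using an inline definition in _analyze (supported by recent ES versions):
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "My credit card is 123-456-789"
}
Tokenization result:
[ My, credit, card, is, 123_456_789 ]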
The tokenizer processes the (filtered) input text and splits it into individual tokens.
The built-in ES tokenizers covered below are standard, letter, lowercase, whitespace, uax_url_email, ngram, edge_ngram, keyword, and pattern.
standard tokenizer: splits text on word boundaries (this is what the default standard analyzer uses). Properties:
max_token_length: maximum token length; longer tokens are split at this interval, default 255
A custom standard tokenizer that limits tokens to 5 characters:
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
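Registered in an index (my_index and my_analyzer are placeholder names), the 5-character limit splits longer words, as in the official example:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": { "type": "standard", "max_token_length": 5 }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokenization result:
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]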
letter tokenizer: splits the text whenever it encounters a character that is not a letter:
POST _analyze
{
"tokenizer": "letter",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokenization result:
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
lowercase tokenizer: like letter, but also lowercases every token:
POST _analyze
{
"tokenizer": "lowercase",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokenization result:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
whitespace tokenizer: splits the text on whitespace only:
POST _analyze
{
"tokenizer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokenization result:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
uax_url_email tokenizer: behaves like standard, but keeps URLs and email addresses as single tokens. Properties:
max_token_length: maximum token length, same as standard
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "Email me at john.smith@global-international.com"
}
Tokenization result:
[ Email, me, at, john.smith@global-international.com ]
With the standard tokenizer the result would instead be:
[ Email, me, at, john.smith, global, international.com ]
ngram tokenizer: breaks the text into n-grams of configurable length. Properties:
min_gram: minimum n-gram length, default 1
max_gram: maximum n-gram length, default 2
token_chars: character classes to keep in tokens; characters outside these classes are treated as delimiters. Possible values:
letter: text characters, e.g. a, b, c or 京
digit: digits
whitespace: whitespace characters
punctuation: punctuation marks
symbol: symbols
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}
Tokenization result:
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
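A custom ngram tokenizer usually sets min_gram, max_gram and token_chars; the following is the official example (my_index and my_tokenizer are placeholder names):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
Tokenization result:
[ Qui, uic, ick, Fox, oxe, xes ]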
edge_ngram tokenizer: like ngram, but only emits n-grams anchored at the start of each token. Properties: same as ngram.
POST _analyze
{
"tokenizer": "edge_ngram",
"text": "Quick Fox"
}
Tokenization result:
[ Q, Qu ]
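With custom min_gram, max_gram and token_chars settings the edge_ngram tokenizer produces word prefixes, as in the official example (placeholder index and tokenizer names):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
Tokenization result:
[ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ]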
keyword tokenizer: outputs the entire input as a single token; it is typically combined with token filters. Properties:
buffer_size: the number of characters read into the term buffer in a single pass, default 256; the buffer grows as needed until all input is consumed, and changing this setting is not recommended
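The keyword tokenizer simply emits the whole input unchanged:
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
Tokenization result:
[ New York ]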
pattern tokenizer: splits the text using a regular expression. Properties:
pattern: a Java regular expression, default \W+
flags: regex flags
group: which capture group to extract as tokens, default -1 (split on the pattern)
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
Tokenization result:
[ The, foo_bar_size, s, default, is, 5 ]
A custom pattern tokenizer that splits the input on commas (see the full example below):
"my_tokenizer": {
  "type": "pattern",
  "pattern": ","
}
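The full version of that comma-splitting tokenizer, registered in an index (placeholder names):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": { "type": "pattern", "pattern": "," }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
Tokenization result:
[ comma, separated, values ]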
Token filters are post-processors: after the tokenizer has produced the token stream, filters can further process it by adding or removing tokens. The Chinese pinyin analyzer and synonym support both work this way. Only the commonly used filters are introduced here.
length filter: removes tokens shorter or longer than the given bounds. Properties:
min: minimum token length, default 0
max: maximum token length, default Integer.MAX_VALUE
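For instance, keeping only tokens of at most 4 characters; this inline form follows the official length filter example:
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "length", "min": 0, "max": 4 }
  ],
  "text": "the quick brown fox jumps over the lazy dog"
}
Tokenization result:
[ the, fox, over, the, lazy, dog ]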
stop filter: removes stop words from the token stream. Properties:
stopwords: the stop word list, default _english_
stopwords_path: path to a stop word file
ignore_case: ignore case when matching stop words, default false
remove_trailing: whether to remove the last token of a stream if it is a stop word, default true
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["and", "is", "the"]
}
}
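The built-in stop filter (with the default _english_ stop word list) can be tested directly:
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}
Tokenization result:
[ quick, fox, jumps, over, lazy, dog ]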
word_delimiter filter: splits tokens on non-alphanumeric characters, case transitions, and letter/number transitions. Properties:
generate_word_parts: emit the word parts, "PowerShot" ⇒ "Power" "Shot", default true
generate_number_parts: emit the number parts, "500-42" ⇒ "500" "42", default true
catenate_words: also emit the concatenation of word parts, "wi-fi" ⇒ "wifi", default false
catenate_numbers: also emit the concatenation of number parts, "500-42" ⇒ "50042", default false
catenate_all: also emit the concatenation of all parts, "wi-fi-4000" ⇒ "wifi4000", default false
split_on_case_change: split tokens on case transitions, default true
preserve_original: also keep the original token, "500-42" ⇒ "500-42" "500" "42", default false
split_on_numerics: split tokens on letter/number transitions, "j2se" ⇒ "j" "2" "se", default true
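A quick demonstration with the default settings, taken from the official word_delimiter example:
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
Tokenization result:
[ Neil, s, Super, Duper, XL, 500, 42, Auto, Coder ]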
synonym filter: injects synonyms into the token stream. Properties:
synonyms_path: path to a synonym file
synonyms: inline synonym rules
"filter": {
  "synonym": {
    "type": "synonym",
    "format": "wordnet",
    "synonyms": [
      "s(100000001,1,'abstain',v,1,0).",
      "s(100000001,2,'refrain',v,1,0).",
      "s(100000001,3,'desist',v,1,0)."
    ]
  }
}
Synonym file format:
UTF-8 encoded file
Each line lists several terms separated by commas, which are treated as bidirectional synonyms
One-way synonyms are written with =>, e.g. i-pod, i pod => ipod
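A sketch of registering a synonym filter that uses the comma / => syntax described above (index, analyzer and filter names are illustrative):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "i-pod, i pod => ipod",
            "universe, cosmos"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonym_filter" ]
        }
      }
    }
  }
}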
For Chinese text, the most commonly used analyzers are ik and pinyin.
# ik analyzer download:
https://github.com/medcl/elasticsearch-analysis-ik
# pinyin analyzer download:
https://github.com/medcl/elasticsearch-analysis-pinyin
text:我是中国人
ik_smart:我,是,中国人
ik_max_word:我,是,中国人,中国,国人
ik_max_word produces the most fine-grained split, while ik_smart produces a coarser one; choose between them according to your business needs.
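Assuming the ik plugin is installed, the two analyzers can be compared directly with _analyze:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
Tokenization result:
[ 我, 是, 中国人, 中国, 国人 ]
Switching "analyzer" to "ik_smart" returns the coarser split [ 我, 是, 中国人 ].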