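Unlike term-level queries, full-text queries are designed to search and match text data by relevance. These queries analyze the input text, split it into terms (individual words), and apply operations such as tokenization, stemming, and normalization.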
Key characteristics of full-text search:
PUT full_index
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age": { "type": "long" },
      "description": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

The test data is as follows:
{name=张三, description=北京故宫圆明园, age=11}
{name=王五, description=南京总统府, age=15}
{name=李四, description=北京市天安门广场, age=18}
{name=富贵, description=南京市中山陵, age=22}
{name=来福, description=山东济南趵突泉, age=8}
{name=憨憨, description=安徽黄山九华山, age=27}
{name=小七, description=上海东方明珠, age=31}
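Match query: match analyzes the search keywords into terms and then looks up documents by those terms.
match supports the following parameters:
DSL: search for documents whose description field contains "南京总统府"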
GET full_index/_search
{
  "query": {
    "match": {
      "description": "南京总统府"
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 1.2667978,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "2", "_score" : 1.2667978,
        "_source" : { "name" : "王五", "age" : 15, "description" : "南京总统府" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "4", "_score" : 1.0751815,
        "_source" : { "name" : "富贵", "age" : 22, "description" : "南京市中山陵" } }
    ]
  }
}
Spring Boot implementation:
private final static Logger LOGGER = LoggerFactory.getLogger(FullTextQuery.class);
private static final String INDEX_NAME = "full_index";

@Resource
private RestHighLevelClient client;

@RequestMapping(value = "/match_query", method = RequestMethod.GET)
@ApiOperation(value = "DSL - match_query")
public void match_query() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // match query on the description field
    searchRequest.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("description", "南京总统府")));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}

// Prints the number of hits and the source of each hit
private void printLog(SearchResponse searchResponse) {
    SearchHits hits = searchResponse.getHits();
    System.out.println("Number of hits returned: " + hits.getHits().length);
    for (SearchHit hit : hits.getHits()) {
        System.out.println(hit.getSourceAsMap().toString());
    }
}

The result is as follows:
Number of hits returned: 2
{name=王五, description=南京总统府, age=15}
{name=富贵, description=南京市中山陵, age=22}
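Analysis: searching for "南京总统府" returned two documents. Why was "南京市中山陵" also found?
The reason is that a full-text query splits the search text into terms. The description field was mapped with the "ik_max_word" analyzer when the index was created, and that analyzer splits "南京总统府" into the following terms, which are then looked up in the inverted index: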
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["南京总统府"]
}

{
  "tokens" : [
    { "token" : "南京", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "总统府", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 1 },
    { "token" : "总统", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "府", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 3 }
  ]
}
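One of these terms is "南京", so a search for "南京总统府" also finds the "南京市中山陵" document. The operator parameter of the match query follows directly from this: with "operator": "and", a document must contain every term produced from the query text.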
For example, insert one more document:
POST /full_index/_bulk
{"index":{"_id":8}}
{"name":"张三","age":11,"description":"南京总统"}

Searching for "南京总统" with "operator": "and" now returns two documents:
GET full_index/_search
{
  "query": {
    "match": {
      "description": {
        "query": "南京总统",
        "operator": "and"
      }
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 2, "relation" : "eq" },
    "max_score" : 2.898355,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "8", "_score" : 2.898355,
        "_source" : { "name" : "张三", "age" : 11, "description" : "南京总统" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "2", "_score" : 2.35562,
        "_source" : { "name" : "王五", "age" : 15, "description" : "南京总统府" } }
    ]
  }
}

But searching for "南京总统府" with the same operator returns only one document, because analysis of the query produces terms such as "府" that do not exist in the other document.
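The difference between the default OR behaviour and "operator": "and" can be illustrated without a cluster. The following is only a sketch in plain Java: the query terms are copied from the _analyze output for "南京总统府", while the index terms for the two documents are assumptions (document 8 is assumed to be indexed as "南京" and "总统"). The class and method names are hypothetical and are not part of any Elasticsearch API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MatchOperatorSketch {

    // OR (the default): at least one query term must appear in the document.
    static boolean matchesOr(List<String> queryTerms, Set<String> docTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // AND: every query term must appear in the document.
    static boolean matchesAnd(List<String> queryTerms, Set<String> docTerms) {
        return docTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Query terms taken from the _analyze output for "南京总统府".
        List<String> query = Arrays.asList("南京", "总统府", "总统", "府");
        // Assumed index terms for document 8 ("南京总统").
        Set<String> doc8 = new HashSet<>(Arrays.asList("南京", "总统"));
        // Assumed index terms for document 2 ("南京总统府").
        Set<String> doc2 = new HashSet<>(query);

        System.out.println("OR,  doc 8: " + matchesOr(query, doc8));
        System.out.println("AND, doc 8: " + matchesAnd(query, doc8));
        System.out.println("AND, doc 2: " + matchesAnd(query, doc2));
    }
}
```

Under these assumptions document 8 passes the OR check but fails the AND check, because "总统府" and "府" are missing from its terms. Scoring is ignored entirely in this sketch.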
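Multi-field query (multi_match): runs one query against several fields. Whether the query text is analyzed depends on the type of each field, and the highest-scoring hits come first.
Note: if a field is analyzed, the query text is analyzed before searching; if a field is not analyzed, the query text is searched as a single whole term.
DSL: search for documents in which "name" or "description" contains terms from "北京王五"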
GET full_index/_search
{
  "query": {
    "multi_match": {
      "query": "北京王五",
      "fields": ["name", "description"]
    }
  }
}

The response is as follows:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 3.583519,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "2", "_score" : 3.583519,
        "_source" : { "name" : "王五", "age" : 15, "description" : "南京总统府" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "1", "_score" : 1.4959542,
        "_source" : { "name" : "张三", "age" : 11, "description" : "北京故宫圆明园" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "3", "_score" : 0.98645234,
        "_source" : { "name" : "李四", "age" : 18, "description" : "北京市天安门广场" } }
    ]
  }
}
Spring Boot implementation:
@RequestMapping(value = "/multi_match", method = RequestMethod.GET)
@ApiOperation(value = "DSL - multi_match")
public void multi_match() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // multi_match query on the name and description fields
    searchRequest.source(new SearchSourceBuilder().query(
            QueryBuilders.multiMatchQuery("北京王五", new String[]{"name", "description"})));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}

The result is as follows:
Number of hits returned: 3
{name=王五, description=南京总统府, age=15}
{name=张三, description=北京故宫圆明园, age=11}
{name=李四, description=北京市天安门广场, age=18}
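As emphasized earlier, an analyzed field is searched with the analyzed query text, while a field that is not analyzed is searched with the query text as a whole.
To test this, run the same query against the non-analyzed "description.keyword" sub-field instead of "description":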
GET full_index/_search
{
  "query": {
    "multi_match": {
      "query": "北京王五",
      "fields": ["name", "description.keyword"]
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 3.583519,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "2", "_score" : 3.583519,
        "_source" : { "name" : "王五", "age" : 15, "description" : "南京总统府" } }
    ]
  }
}
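As you can see, with "description.keyword", that is, without analyzing "description", only one document is returned. It matches only because its "name" field, "王五", matches terms produced by analyzing the query text; no "description.keyword" value equals the whole string "北京王五".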
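The contrast between an analyzed text field and a keyword sub-field can be sketched in plain Java. This is a rough illustration, not how Elasticsearch is implemented: it assumes the "name" field uses the default standard analyzer, which is approximated here as one term per Chinese character, and it treats the keyword field as an exact string comparison. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class AnalyzedVsKeywordSketch {

    // Rough stand-in for the standard analyzer on Chinese text: one term per character.
    static List<String> perCharacterTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(String.valueOf(text.charAt(i)));
        }
        return terms;
    }

    // Analyzed text field: the query is split into terms, and any shared term is a match.
    static boolean matchesTextField(String query, String fieldValue) {
        List<String> fieldTerms = perCharacterTerms(fieldValue);
        for (String term : perCharacterTerms(query)) {
            if (fieldTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // keyword field: the whole query is compared with the whole stored value.
    static boolean matchesKeywordField(String query, String fieldValue) {
        return query.equals(fieldValue);
    }

    public static void main(String[] args) {
        String query = "北京王五";
        // name is analyzed, so "王" and "五" from the query match the name "王五".
        System.out.println("name=王五: " + matchesTextField(query, "王五"));
        // description.keyword is not analyzed, so only an identical value would match.
        System.out.println("description.keyword=北京故宫圆明园: "
                + matchesKeywordField(query, "北京故宫圆明园"));
    }
}
```

This mirrors the result above: the document with name "王五" still matches through the analyzed field, while the two "北京..." documents no longer match once only the keyword sub-field is searched.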
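Phrase search (match_phrase) analyzes the search text, looks up every resulting term in the index, and requires the terms to be adjacent. The slop parameter sets the maximum distance allowed between the terms.
DSL: search for documents whose "description" field contains "北京故宫"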
GET full_index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "北京故宫"
      }
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 3.5884824,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "1", "_score" : 3.5884824,
        "_source" : { "name" : "张三", "age" : 11, "description" : "北京故宫圆明园" } }
    ]
  }
}
Spring Boot implementation:
@RequestMapping(value = "/match_phrase", method = RequestMethod.GET)
@ApiOperation(value = "DSL - match_phrase")
public void match_phrase() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // match_phrase query on the description field
    searchRequest.source(new SearchSourceBuilder().query(
            QueryBuilders.matchPhraseQuery("description", "北京故宫")));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}

The result is as follows:
Number of hits returned: 1
{name=张三, description=北京故宫圆明园, age=11}
Question: searching the "description" field for "北京故宫" returns a document, so why does searching for "北京圆明园" return nothing?
GET full_index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "北京圆明园"
      }
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  }
}
Cause: first look at how "北京故宫圆明园" is analyzed:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["北京故宫圆明园"]
}

{
  "tokens" : [
    { "token" : "北京", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "故宫", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
    { "token" : "圆明园", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 2 }
  ]
}
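"北京" (position 0) and "圆明园" (position 2) are not adjacent terms; there is one term between them. This is where "slop" comes in: the slop parameter tells the match_phrase query how far apart the terms may be while the document is still considered a match.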
GET full_index/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "北京圆明园",
        "slop": 1
      }
    }
  }
}

The response is as follows:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 2.4425511,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "1", "_score" : 2.4425511,
        "_source" : { "name" : "张三", "age" : 11, "description" : "北京故宫圆明园" } }
    ]
  }
}
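The effect of slop on term positions can be approximated in a few lines of Java. This is a simplified sketch, not Lucene's actual algorithm: it assumes each term occurs at most once in the document, requires the terms to appear in query order, and counts how many extra positions lie between consecutive terms (real slop also allows reordering at a higher cost). It also assumes that "北京圆明园" is analyzed into "北京" and "圆明园". The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PhraseSlopSketch {

    // Returns true if the query terms appear in order and the total number of
    // skipped positions between consecutive terms does not exceed slop.
    static boolean matchesPhrase(List<String> queryTerms, Map<String, Integer> docPositions, int slop) {
        int skipped = 0;
        Integer previous = null;
        for (String term : queryTerms) {
            Integer position = docPositions.get(term);
            if (position == null) {
                return false; // term is missing from the document
            }
            if (previous != null) {
                int distance = position - previous;
                if (distance < 1) {
                    return false; // out-of-order terms are not handled in this sketch
                }
                skipped += distance - 1;
            }
            previous = position;
        }
        return skipped <= slop;
    }

    // Positions copied from the _analyze output for "北京故宫圆明园".
    static Map<String, Integer> sampleDocument() {
        Map<String, Integer> positions = new HashMap<>();
        positions.put("北京", 0);
        positions.put("故宫", 1);
        positions.put("圆明园", 2);
        return positions;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = sampleDocument();
        System.out.println(matchesPhrase(Arrays.asList("北京", "故宫"), doc, 0));
        System.out.println(matchesPhrase(Arrays.asList("北京", "圆明园"), doc, 0));
        System.out.println(matchesPhrase(Arrays.asList("北京", "圆明园"), doc, 1));
    }
}
```

With slop 0 only the adjacent pair "北京" and "故宫" matches; "北京" and "圆明园" skip one position, so they match only once slop is at least 1.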
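The query_string query lets you specify AND | OR | NOT conditions within a single query string and, like the multi_match query, supports searching multiple fields. It is similar to match, but match requires a field name, whereas query_string searches all fields when no field is specified, so its scope is broader.
Note: if the queried field is analyzed, the query text is analyzed before searching; if the field is not analyzed, the query text is searched without analysis.
DSL: search all fields of the index for documents matching "安徽张三"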
GET full_index/_search
{
  "query": {
    "query_string": {
      "query": "安徽张三"
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 2.5618675,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "1", "_score" : 2.5618675,
        "_source" : { "name" : "张三", "age" : 11, "description" : "北京故宫圆明园" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "8", "_score" : 2.5618675,
        "_source" : { "name" : "张三", "age" : 11, "description" : "南京总统" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "6", "_score" : 1.7342355,
        "_source" : { "name" : "憨憨", "age" : 27, "description" : "安徽黄山九华山" } }
    ]
  }
}
Spring Boot implementation:
@RequestMapping(value = "/query_string", method = RequestMethod.GET)
@ApiOperation(value = "DSL - query_string")
public void query_string() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // query_string query without explicit fields
    searchRequest.source(new SearchSourceBuilder().query(
            QueryBuilders.queryStringQuery("安徽张三")));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}

The result is as follows:
Number of hits returned: 3
{name=张三, description=北京故宫圆明园, age=11}
{name=张三, description=南京总统, age=11}
{name=憨憨, description=安徽黄山九华山, age=27}
Query a specific field: documents whose "description" field matches "安徽张三"
GET full_index/_search
{
  "query": {
    "query_string": {
      "query": "安徽张三",
      "fields": ["description"]
    }
  }
}

The response is as follows:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 1.7342355,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "6", "_score" : 1.7342355,
        "_source" : { "name" : "憨憨", "age" : 27, "description" : "安徽黄山九华山" } }
    ]
  }
}
Query multiple fields: documents that match both "安徽" and "憨憨"
GET full_index/_search
{
  "query": {
    "query_string": {
      "query": "安徽 AND 憨憨",
      "fields": ["description", "name"]
    }
  }
}

The response is as follows:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 6.6615744,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "6", "_score" : 6.6615744,
        "_source" : { "name" : "憨憨", "age" : 27, "description" : "安徽黄山九华山" } }
    ]
  }
}
GET full_index/_search
{
  "query": {
    "query_string": {
      "query": "(安徽 AND 憨憨)OR 张三",
      "fields": ["description", "name"]
    }
  }
}

The response is as follows:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 6.6615744,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "6", "_score" : 6.6615744,
        "_source" : { "name" : "憨憨", "age" : 27, "description" : "安徽黄山九华山" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "1", "_score" : 2.5618675,
        "_source" : { "name" : "张三", "age" : 11, "description" : "北京故宫圆明园" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "8", "_score" : 2.5618675,
        "_source" : { "name" : "张三", "age" : 11, "description" : "南京总统" } }
    ]
  }
}
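The boolean structure of "(安徽 AND 憨憨)OR 张三" can be mirrored with composed predicates in plain Java. This is only an illustration of the logic, under loud assumptions: a term is considered to match a document if the text of "name" or "description" contains it as a substring, which ignores analyzers and relevance scoring entirely. Only documents whose IDs appear in the responses above are included, and the Doc class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class QueryStringBooleanSketch {

    // Minimal stand-in for an indexed document.
    static final class Doc {
        final int id;
        final String name;
        final String description;

        Doc(int id, String name, String description) {
            this.id = id;
            this.name = name;
            this.description = description;
        }
    }

    // Simplification: a term matches if either field contains it as a substring.
    static Predicate<Doc> term(String text) {
        return doc -> doc.name.contains(text) || doc.description.contains(text);
    }

    // Returns the IDs of all matching documents, in input order (no scoring).
    static List<Integer> search(List<Doc> docs, Predicate<Doc> query) {
        List<Integer> ids = new ArrayList<>();
        for (Doc doc : docs) {
            if (query.test(doc)) {
                ids.add(doc.id);
            }
        }
        return ids;
    }

    // Documents and IDs taken from the search responses shown in this article.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc(1, "张三", "北京故宫圆明园"),
                new Doc(2, "王五", "南京总统府"),
                new Doc(3, "李四", "北京市天安门广场"),
                new Doc(4, "富贵", "南京市中山陵"),
                new Doc(6, "憨憨", "安徽黄山九华山"),
                new Doc(8, "张三", "南京总统"));
    }

    public static void main(String[] args) {
        // (安徽 AND 憨憨) OR 张三
        Predicate<Doc> query = term("安徽").and(term("憨憨")).or(term("张三"));
        System.out.println(search(sampleDocs(), query)); // prints [1, 6, 8]
    }
}
```

The same three documents come back as in the response above, although Elasticsearch orders them by score rather than by ID.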
The query_string query behaves like a match query combined with a multi_match multi-field query.
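The simple_query_string query is similar to query_string, but it ignores invalid syntax and supports only part of the query syntax. AND, OR, and NOT are not supported as operators and are treated as plain text. A subset of boolean logic is supported through symbols: + for AND, | for OR, and - for NOT.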
GET full_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "(安徽 + 憨憨) | 张三",
      "fields": ["description", "name"]
    }
  }
}

The response is as follows:
{
  "took" : 41,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : 6.6615744,
    "hits" : [
      { "_index" : "full_index", "_type" : "_doc", "_id" : "6", "_score" : 6.6615744,
        "_source" : { "name" : "憨憨", "age" : 27, "description" : "安徽黄山九华山" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "1", "_score" : 2.5618675,
        "_source" : { "name" : "张三", "age" : 11, "description" : "北京故宫圆明园" } },
      { "_index" : "full_index", "_type" : "_doc", "_id" : "8", "_score" : 2.5618675,
        "_source" : { "name" : "张三", "age" : 11, "description" : "南京总统" } }
    ]
  }
}