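Deduplication in Elasticsearch is usually done with a cardinality aggregation, which counts the distinct values of a field. For example, if two documents both have the value 111 in the field "one_account.one_account_no", a cardinality aggregation on that field counts them as 1.
DSL query: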
POST user_onoffline_log/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "one_account_no_aggs": {
      "cardinality": {
        "field": "one_account.one_account_no"
      }
    }
  }
}
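Result:
{
  "took": 565,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 21, "max_score": 0, "hits": [] },
  "aggregations": {
    "one_account_no_aggs": { "value": 14 }
  }
}
The cardinality aggregation reports 14 distinct values of one_account.one_account_no. Keep in mind that cardinality is an approximate count (it is based on HyperLogLog++), so it can deviate slightly on fields with a very large number of distinct values.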
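As a rough in-memory analogy for what the cardinality aggregation computes, the sketch below counts distinct values with a set. It is plain Java, not Elasticsearch client code, and the class and method names (DistinctCountSketch, distinctCount) are hypothetical. Unlike Elasticsearch, it is exact rather than approximate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of any Elasticsearch API.
class DistinctCountSketch {

    // Exact distinct count: a set keeps each value at most once.
    static int distinctCount(List<String> values) {
        Set<String> seen = new HashSet<>(values);
        return seen.size();
    }

    public static void main(String[] args) {
        // Two documents share the account number 111, so one distinct value remains.
        System.out.println(distinctCount(Arrays.asList("111", "111")));        // prints 1
        // Adding a document with a different account number raises the count to 2.
        System.out.println(distinctCount(Arrays.asList("111", "222", "111"))); // prints 2
    }
}
```

Holding every value in a set does not scale across shards, which is presumably why Elasticsearch trades exactness for a bounded-memory estimate here.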
The cardinality aggregation above only produces a statistic. To list the distinct one_account_no values themselves, rather than just the number 14, use collapse to deduplicate the hits.
DSL query:
GET customer/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "one_account.one_account_no", "boost": 1 } }
      ]
    }
  },
  "collapse": {
    "field": "one_account.one_account_no"
  },
  "_source": {
    "includes": [ "one_account.one_account_no" ],
    "excludes": []
  },
  "aggregations": {
    "count": {
      "cardinality": { "field": "one_account.one_account_no" }
    }
  }
}
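Explanation: the exists filter keeps only documents that contain the field one_account.one_account_no; collapse returns a single hit for each distinct value of that field; _source restricts which fields are returned; and the cardinality aggregation counts the distinct values of one_account.one_account_no.
Result: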
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 6, "successful" : 6, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 21,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "105100015130",
        "_score" : 0.0,
        "_source" : { "one_account" : { "one_account_no" : "105100015130" } },
        "fields" : { "one_account.one_account_no" : [ "105100015130" ] }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "99522458",
        "_score" : 0.0,
        "_source" : { "one_account" : { "one_account_no" : "99522458" } },
        "fields" : { "one_account.one_account_no" : [ "99522458" ] }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "105500032110",
        "_score" : 0.0,
        "_source" : { "one_account" : { "one_account_no" : "105500032110" } },
        "fields" : { "one_account.one_account_no" : [ "105500032110" ] }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "110600001247",
        "_score" : 0.0,
        "_routing" : "110600001247",
        "_source" : { "one_account" : { "one_account_no" : "110600001248" } },
        "fields" : { "one_account.one_account_no" : [ "110600001248" ] }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "110600000858",
        "_score" : 0.0,
        "_source" : { "one_account" : { "one_account_no" : "110600000858" } },
        "fields" : { "one_account.one_account_no" : [ "110600000858" ] }
      }
    ]
  },
  "aggregations" : { "count" : { "value" : 14 } }
}
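The result shows that 21 documents contain one_account.one_account_no (hits.total counts matching documents before collapsing), and 14 remain after deduplication.
Because the query uses from/size paging, hits contains only 5 entries. To see all 14, set from to 5 and then to 10 to fetch the second and third pages.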
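The interaction between collapse and from/size paging can be sketched in plain Java. This is only an in-memory analogy, not Elasticsearch client code: the class CollapsePagingSketch, its methods, and the generated document IDs and keys are all hypothetical, and "first document seen per key" stands in for the top-ranked hit that Elasticsearch would keep for each group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of collapse followed by from/size paging.
class CollapsePagingSketch {

    // Each doc is {id, key}. Keep one id per key: the first one encountered.
    static List<String> collapse(List<String[]> docs) {
        Map<String, String> firstIdPerKey = new LinkedHashMap<>();
        for (String[] doc : docs) {
            firstIdPerKey.putIfAbsent(doc[1], doc[0]);
        }
        return new ArrayList<>(firstIdPerKey.values());
    }

    // Return the slice [from, from + size), clipped to the end of the list.
    static <T> List<T> page(List<T> items, int from, int size) {
        if (from >= items.size()) {
            return new ArrayList<>();
        }
        int to = Math.min(from + size, items.size());
        return new ArrayList<>(items.subList(from, to));
    }

    public static void main(String[] args) {
        // Made-up data: 21 documents spread over 14 distinct keys.
        List<String[]> docs = new ArrayList<>();
        for (int i = 0; i < 21; i++) {
            docs.add(new String[] {"doc" + i, "acct" + (i % 14)});
        }
        List<String> collapsed = collapse(docs); // 14 entries
        // Pages of size 5 start at from = 0, 5 and 10.
        for (int from = 0; from < collapsed.size(); from += 5) {
            System.out.println("from=" + from + " -> "
                    + page(collapsed, from, 5).size() + " hits");
        }
    }
}
```

With this made-up data the three pages hold 5, 5 and 4 entries, which matches the 14 distinct values reported above.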
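Java code:
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.collapse.CollapseBuilder;

SearchRequest searchRequest = new SearchRequest("customer");
searchRequest.types("customer_info");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Keep only documents in which the field exists.
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
boolQueryBuilder.filter(QueryBuilders.existsQuery("one_account.one_account_no"));
searchSourceBuilder.query(boolQueryBuilder);
// Return one hit per distinct value, and only that field, as in the DSL query.
searchSourceBuilder.collapse(new CollapseBuilder("one_account.one_account_no"));
searchSourceBuilder.fetchSource(new String[] {"one_account.one_account_no"}, new String[0]);
// Count the distinct values alongside the hits.
searchSourceBuilder.aggregation(AggregationBuilders.cardinality("count").field("one_account.one_account_no"));
searchSourceBuilder.from(0); // first page; use 5 and 10 for the next pages
searchSourceBuilder.size(5);
searchRequest.source(searchSourceBuilder);
// Newer high-level clients (6.4 and later) expect RequestOptions.DEFAULT as a second argument.
SearchResponse searchResponse = restHighLevelClient.search(searchRequest);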