The article "elasticsearch聚合原理分析" (Elasticsearch aggregation internals) covered:
(1) why aggregations run on the forward index (doc values);
(2) the difference between doc values and fielddata, global ordinals, and so on.
However, it did not address this problem: when the data volume is large, how do you fetch the top N aggregated results when N itself is very large? Concretely: aggregate and sort over 1 billion documents, take the top 100,000 aggregation results, and support paging through them.
Let us state the conclusions up front:
(1) Elasticsearch does not support pagination over aggregation results;
(2) if you must paginate an aggregation, the only option is to fetch all of the aggregation results and paginate them yourself in the application;
(3) from the business side, consider splitting the data into separate indices, for example by time, to narrow the range of data each aggregation has to scan;
(4) fetching a very large number of aggregation results puts heavy pressure on Elasticsearch; adding machines so that the shards are spread across more nodes can help;
(5) if you need the full set of aggregation results, fetch them in partitions (batches).
Experimental setup: 4,126,936 documents. A cardinality aggregation on the user field reports 207,418 distinct values. The Elasticsearch version is 6.3.
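For reference, that distinct count can be obtained with a cardinality aggregation. A minimal sketch (the aggregation name distinct_users is my own; index-test and user are the index and field used throughout this post). Note that cardinality is an approximate count, so 207,418 is an estimate:

```
GET index-test/_search
{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": {
        "field": "user"
      }
    }
  }
}
```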
The rest of this post covers:
1. Partitioned retrieval with num_partitions, and how it differs from a plain aggregation
2. Directly aggregating the top 100,000 results
3. Using time-based indices to reduce global ordinals rebuilding
4. How the aggregation works
1. Partitioned retrieval with num_partitions
If you need all of the aggregated results, you can fetch them partition by partition. Note that partitioning is not pagination!
Filtering Values with partitions
Sometimes there are too many unique terms to process in a single request/response pair so it can be useful to break the analysis up into multiple requests. This can be achieved by grouping the field’s values into a number of partitions at query-time and processing only one partition in each request.
Note that the size setting for the number of results returned needs to be tuned with the num_partitions. For this particular account-expiration example the process for balancing values for size and num_partitions would be as follows:
1. Use the cardinality aggregation to estimate the total number of unique account_id values
2. Pick a value for num_partitions to break the number from 1) up into more manageable chunks
3. Pick a size value for the number of responses we want from each partition
4. Run a test request
If we have a circuit-breaker error we are trying to do too much in one request and must increase num_partitions. If the request was successful but the last account ID in the date-sorted test response was still an account we might want to expire then we may be missing accounts of interest and have set our numbers too low. We must either:
- increase the size parameter to return more results per partition (could be heavy on memory), or
- increase the num_partitions to consider less accounts per request (could increase overall processing time as we need to make more requests).
Ultimately this is a balancing act between managing the Elasticsearch resources required to process a single request and the volume of requests that the client application must issue to complete a task.
First experiment: partitioned aggregation, fetching 100 buckets per request, five requests in total.
```
GET index-test/_search
{
  "size": 0,
  "aggs": {
    "user_aggregation": {
      "terms": {
        "field": "user",
        "include": {
          "partition": 0,
          "num_partitions": 5
        },
        "size": 100,
        "order": {
          "max_time": "desc"
        },
        "show_term_doc_count_error": true
      },
      "aggs": {
        "max_time": {
          "max": {
            "field": "time"
          }
        }
      }
    }
  }
}
```
将 "partition": 0
的数值,从0依次到4,执行五次,每次获取到100个结果,累计共获取到500个结果。
Second experiment: a single aggregation fetching 500 buckets.
```
GET index-test/_search
{
  "size": 0,
  "aggs": {
    "user_aggregation": {
      "terms": {
        "field": "user",
        "size": 500,
        "order": {
          "max_time": "desc"
        },
        "show_term_doc_count_error": true
      },
      "aggs": {
        "max_time": {
          "max": {
            "field": "time"
          }
        }
      }
    }
  }
}
```
"doc_count_error_upper_bound": -1
This shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size
parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.
These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this.
"sum_other_doc_count": 3956479
when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response。
The doc_count_error_upper_bound field in the response tells us that the returned counts are not exact, for example:
{ "key": "5a609b16e8ac2b49c9afb63f", "doc_count": 4, "doc_count_error_upper_bound": -1, "max_time": { "value": 1591718396431, "value_as_string": "2020-06-09T15:59:56.431Z" }}
As mentioned in the article elasticsearch踩坑记录(一) (Elasticsearch pitfalls, part 1), you can take a single aggregated result and issue a precise follow-up query for it to obtain the exact count.
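A minimal sketch of such a follow-up query, reusing the bucket key shown above: filter on the exact user term and read the exact document count from hits.total in the response:

```
GET index-test/_search
{
  "size": 0,
  "query": {
    "term": {
      "user": "5a609b16e8ac2b49c9afb63f"
    }
  }
}
```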
Comparing the results of the two experiments leads to the following conclusions:
(1) Some results from the first (partitioned) experiment may not appear among the 500 results of the second experiment. [For example, take the last result a of the second experiment and search for it in the five partitioned batches of the first experiment: a is not the last entry of any batch; other results still follow it.]
Likely cause: there are over two hundred thousand distinct user values, while each of my batches holds only 100 results across 5 batches, which obviously does not cover everything.
(2) Within each batch of the first experiment the results are correctly ordered, and no result appears in more than one batch; that is, a given term can never show up in the results of multiple partitions.
2. Directly aggregating the top 100,000 results
Take the second experiment from section 1 and change size to 100000; the result is:
```
#! Deprecation: This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
{
  "took": 11081,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4126936,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "user_aggregation": {
      "doc_count_error_upper_bound": -1,
      "sum_other_doc_count": 1735391,
      "buckets": [
        {
          "key": "user_1",
          "doc_count": 4,
          "doc_count_error_upper_bound": -1,
          "max_time": {
            "value": 1591718396431,
            "value_as_string": "2020-06-09T15:59:56.431Z"
          }
        },
        {
          "key": "user_2",
          "doc_count": 5,
          "doc_count_error_upper_bound": -1,
          "max_time": {
            "value": 1591718389238,
            "value_as_string": "2020-06-09T15:59:49.238Z"
          }
        },
        …
```
It took 11 seconds in total to return the results. The test cluster: one master node and two data nodes, with 9 GB of memory in total.
Changing the aggregation size to 300,000 gives:
```
#! Deprecation: This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
{
  "took": 16692,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4126936,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "user_aggregation": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "user_1",
          "doc_count": 4,
          "doc_count_error_upper_bound": 0,
          "max_time": {
            "value": 1591718396431,
            "value_as_string": "2020-06-09T15:59:56.431Z"
          }
        },
        {
          "key": "user_2",
          "doc_count": 5,
          "doc_count_error_upper_bound": 0,
          "max_time": {
            "value": 1591718389238,
            "value_as_string": "2020-06-09T15:59:49.238Z"
          }
        },
        …
```
Now doc_count_error_upper_bound is 0: the requested size (300,000) exceeds the 207,418 distinct users, so every bucket is returned and nothing is left out. This run took 16.7 seconds.
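The deprecation warning above points at the composite aggregation, which pages through all buckets via after_key. A minimal sketch (the aggregation name user_pages is mine): feed the after_key of each response (or the key of its last bucket) into the after parameter of the next request. The caveat is that composite can only order by its source values (here, user), not by a sub-aggregation such as max_time, so it enumerates all buckets but does not directly give a top-N by metric:

```
GET index-test/_search
{
  "size": 0,
  "aggs": {
    "user_pages": {
      "composite": {
        "size": 1000,
        "sources": [
          { "user": { "terms": { "field": "user" } } }
        ]
      },
      "aggs": {
        "max_time": { "max": { "field": "time" } }
      }
    }
  }
}
```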
The following sections are excerpted from other blog posts (see the references).
3. Use time-based indices
Global ordinals only need to be re-created on a shard if that shard has been modified since the last computation of its global ordinals. If a shard is unmodified since the last computation of its global ordinals, then previously calculated global ordinals will continue to be used. For time-series data, implementing time-based indices is a good way to ensure that the majority of indices/shards remain unmodified, which will reduce the size of the global ordinals that need to be recomputed after a refresh operation.
For example, if two years of data is stored in monthly indices instead of in one large index, then each monthly index is 1/24th the size that one large index would be. Since we are considering time-series data, we know that only the most recent monthly index will have new documents inserted. This means that only one of the 24 indices is actively written into. Since global ordinals are only rebuilt on shards that have been modified, the shards in 23 of the 24 monthly indices will continue to use previously computed global ordinals. This would reduce the work required to build global ordinals by a factor of up to 24 times when compared to storing two years worth of data in one large index.
In other words, split time-based data into monthly indices: 2020-01 stores January's data, and now, in August 2020, the 2020-08 index stores the current month's data. The shards of the historical indices no longer change, so their global ordinals can be reused as-is. Compared with keeping two years of data in one large index, this reduces the work of building global ordinals by a factor of up to 24.
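A minimal sketch of the monthly layout (the index names index-test-2020-08 and index-test-2020-* are hypothetical): write only to the current month's index, and use a wildcard to search or aggregate across all months:

```
# Writes go only to the current monthly index
POST index-test-2020-08/_doc
{
  "user": "user_1",
  "time": "2020-08-09T12:00:00Z"
}

# Searches and aggregations fan out across all monthly indices
GET index-test-2020-*/_search
{
  "size": 0,
  "aggs": {
    "user_aggregation": {
      "terms": { "field": "user", "size": 100 }
    }
  }
}
```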
4. How the aggregation works
The simplest single-level terms aggregation executes roughly as follows:
1. Build global ordinals for the field being aggregated. [See [global-ordinals](https://www.elastic.co/guide/cn/elasticsearch/guide/cn/preload-fielddata.html#global-ordinals) for what they are.] The speed of this step is not simply a function of document count; it mostly depends on how many segment files the index has and on the field's cardinality. The number of segments and the disk IO capacity determine how fast the data can be read into memory, while the number of distinct values determines how many buckets must be created in memory: the more distinct values, the more memory the buckets consume.
2. Using the document IDs matched by the query, read the values of the aggregated field from its doc values.
3. Map those values onto the buckets built from the global ordinals.
4. Count the number of values in each bucket.
5. Return the top `size` buckets according to the aggregation's size setting.
With massive data, what affects terms aggregation performance most is the cardinality of the field. On a cold run, the doc values of every segment must be read; with a large number of segments, building global ordinals can take a very long time. For indices that are no longer updated, force-merging down to a single segment removes the global ordinals construction step entirely and greatly speeds up aggregation. For indices that are still being written, lengthening the refresh interval extends the useful life of the global ordinals cache. And where aggregation performance matters more than write performance, eager_global_ordinals can shift the build cost to index time.
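Hedged sketches of the three knobs just mentioned, against hypothetical index names (in 6.x the mapping update requires the type name, here assumed to be _doc):

```
# Force-merge an index that is no longer written to down to one segment
POST index-test-2020-01/_forcemerge?max_num_segments=1

# Lengthen the refresh interval of an actively written index
PUT index-test/_settings
{
  "index": { "refresh_interval": "30s" }
}

# Build global ordinals eagerly at refresh time instead of at query time
PUT index-test/_mapping/_doc
{
  "properties": {
    "user": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
```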
If the scenario is filtering a small result set (up to around a million documents) out of a large dataset and then aggregating it, adding execution_hint: map to the request computes the buckets directly on the result set with a map, which can be several to tens of times faster than the default global ordinals approach.
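A sketch of the map hint; the range filter standing in for "first narrow down to a small result set" is my own example:

```
GET index-test/_search
{
  "size": 0,
  "query": {
    "range": { "time": { "gte": "now-15m" } }
  },
  "aggs": {
    "user_aggregation": {
      "terms": {
        "field": "user",
        "size": 100,
        "execution_hint": "map"
      }
    }
  }
}
```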
Multi-level aggregations are considerably more complex: buckets can be built depth first or breadth first. It is worth reading the relevant documentation carefully and testing against your own data before choosing an execution mode; a sketch of the breadth-first setting follows.
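The collection mode is set per terms aggregation with collect_mode. This sketch forces breadth-first collection, which defers computing sub-aggregations until the top buckets are known:

```
GET index-test/_search
{
  "size": 0,
  "aggs": {
    "user_aggregation": {
      "terms": {
        "field": "user",
        "size": 100,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "max_time": {
          "max": { "field": "time" }
        }
      }
    }
  }
}
```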
In summary, ES offers several execution modes for the terms aggregation, each making its own trade-off between memory use and speed. The default works fine in most scenarios; only in fairly extreme cases will ES fail to pick the best execution path automatically, and then the user needs enough familiarity with both the data and ES itself to choose one deliberately.
"Does adding shards in Elasticsearch always make aggregations faster?"
During aggregation, memory is allocated so frequently that large numbers of temporary objects are created, putting heavy pressure on young-generation GC.
Adding shards to increase aggregation parallelism does not necessarily speed up aggregation; you have to consider how much memory pressure the business's aggregation queries create. In the example discussed there, if the 40 shards were spread across more nodes, GC would no longer be a problem and the overall aggregation should be much faster. Similarly, when the aggregation produces fewer buckets, increasing parallelism does clearly improve overall aggregation speed.
In short, aggregations must be weighed against node memory pressure, which is hard to quantify up front.
Run load tests before going to production.
Bucket aggregations over high-cardinality fields put considerable pressure on the cluster.
References
1. https://blog.csdn.net/laoyang360/article/details/79112946/
2. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
3. https://elasticsearch.cn/question/1797
4. https://www.elastic.co/cn/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch
5. https://github.com/elastic/elasticsearch/issues/19780
6. https://my.oschina.net/KasuganoShin/blog/4407521