Elasticsearch Aggregation Queries (Group-By Statistics)

1 Basic aggregations

1.1 Aggregating directly

(1) Count the number of documents for each tag. Request:

  GET book_shop/it_book/_search
  {
    "size": 0, // do not return the matching documents (hits)
    "aggs": {
      "group_by_tags": { // name of the aggregation result, chosen by you (remove these comments when copying)
        "terms": {
          "field": "tags"
        }
      }
    }
  }

(2) The request fails:

Note: the mapping of the book_shop index was created automatically by Elasticsearch (dynamic mapping), which mapped tags as a text field. Running a terms aggregation on tags therefore fails with the following error:

  {
    "error": {
      "root_cause": [
        {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [tags] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      ],
      "type": "search_phase_execution_exception",
      "reason": "all shards failed",
      "phase": "query",
      "grouped": true,
      "failed_shards": [......]
    },
    "status": 400
  }

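(3) Error analysis:

Error message: Set fielddata=true on [xxxx] ......
Analysis: by default, Elasticsearch disables fielddata on text fields.
A text field is analyzed (tokenized) at index time, and only the resulting terms go into the inverted index, which maps each term to the documents that contain it. An aggregation needs the opposite lookup: given a document, which values does the field hold?
To aggregate on a text field directly, set fielddata=true in its mapping. Elasticsearch then un-inverts the inverted index at query time and loads the resulting document-to-terms structure into memory. Note that this structure holds the analyzed terms, not the original field values.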
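
The un-inverting step described above can be sketched in plain Python. This is only an illustration of the idea, not how Elasticsearch implements fielddata: the toy `inverted_index`, its terms and document ids, and the `uninvert` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inverted index for an analyzed `tags` field: term -> document ids.
inverted_index = {
    "java": {1, 2, 3},
    "jvm": {2},
    "spring": {3},
}


def uninvert(index):
    """Turn term -> doc ids into doc id -> terms (the lookup an aggregation needs)."""
    doc_terms = defaultdict(list)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            doc_terms[doc_id].append(term)
    # Sort the terms so the result does not depend on iteration order.
    return {doc_id: sorted(terms) for doc_id, terms in doc_terms.items()}


fielddata = uninvert(inverted_index)
for doc_id in sorted(fielddata):
    print(doc_id, fielddata[doc_id])
```

The whole structure has to be built and held in memory before any bucket can be counted, which is why the error message warns about memory use.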

(4) Solution 1: enable fielddata on the text field:

  • Set fielddata to true on the text field to be aggregated (tags):

    PUT book_shop/_mapping/it_book
    {
      "properties": {
        "tags": {
          "type": "text",
          "fielddata": true
        }
      }
    }
  • See the official documentation for details:
    fielddata | Elasticsearch Guide [6.6] | Elastic. A successful update returns:

    {
      "acknowledged": true
    }
  • Run the aggregation again. The result is now:

    {
      "took": 153,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0.0,
        "hits": []
      },
      "aggregations": {
        "group_by_tags": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 6,
          "buckets": [
            {
              "key": "java",
              "doc_count": 3
            },
            {
              "key": "程",
              "doc_count": 2
            },
            ......
          ]
        }
      }
    }

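(5) Solution 2: use the keyword sub-field:

  • Enabling fielddata can consume a large amount of memory.

  • Since Elasticsearch 5.x, dynamic mapping indexes a string field as text together with a keyword sub-field (addressed as <field>.keyword), which can be used for exact-match queries and aggregations:

    GET book_shop/it_book/_search
    {
      "size": 0,
      "aggs": {
        "group_by_tags": {
          "terms": {
            "field": "tags.keyword" // use the keyword sub-field of the text field
          }
        }
      }
    }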

1.2 Search first, then aggregate

(1) Count the documents per tag among the books whose name contains "jvm". Request:

  GET book_shop/it_book/_search
  {
    "query": {
      "match": { "name": "jvm" }
    },
    "aggs": {
      "group_by_tags": { // name of the aggregation result, chosen by you; the keyword sub-field is used below
        "terms": { "field": "tags.keyword" }
      }
    }
  }

(2) Response:

  {
    "took" : 7,
    "timed_out" : false,
    "_shards" : {
      "total" : 5,
      "successful" : 5,
      "skipped" : 0,
      "failed" : 0
    },
    "hits" : {
      "total" : 1,
      "max_score" : 0.64072424,
      "hits" : [
        {
          "_index" : "book_shop",
          "_type" : "it_book",
          "_id" : "2",
          "_score" : 0.64072424,
          "_source" : {
            "name" : "深入理解Java虚拟机:JVM高级特性与最佳实践",
            "author" : "周志明",
            "category" : "编程语言",
            "desc" : "Java图书领域公认的经典著作",
            "price" : 79.0,
            "date" : "2013-10-01",
            "publisher" : "机械工业出版社",
            "tags" : [
              "Java",
              "虚拟机",
              "最佳实践"
            ]
          }
        }
      ]
    },
    "aggregations" : {
      "group_by_tags" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "Java",
            "doc_count" : 1
          },
          {
            "key" : "最佳实践",
            "doc_count" : 1
          },
          {
            "key" : "虚拟机",
            "doc_count" : 1
          }
        ]
      }
    }
  }
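
The search-then-aggregate flow can be mimicked in memory to make the order of operations explicit: the query narrows the document set first, and only the hits are bucketed. This is a rough sketch, not the Elasticsearch implementation. The `books` list is hypothetical sample data, and `match_name` is a simplified stand-in for a match query that only checks for a whole lowercase word.

```python
from collections import Counter

# Hypothetical sample documents (not the contents of the book_shop index).
books = [
    {"name": "Understanding the JVM", "tags": ["Java", "JVM"]},
    {"name": "Effective Java", "tags": ["Java", "Best Practice"]},
    {"name": "JVM Performance Tuning", "tags": ["JVM", "Performance"]},
]


def match_name(book, term):
    """Simplified match: the term equals one lowercase word of the name."""
    return term.lower() in book["name"].lower().split()


# Step 1: the query selects the hits.
hits = [book for book in books if match_name(book, "jvm")]

# Step 2: the aggregation counts tags over the hits only.
buckets = Counter(tag for book in hits for tag in book["tags"])
print(buckets.most_common())
```

"Effective Java" does not match the query, so its "Best Practice" tag never appears in the buckets.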

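1.3 Further notes: aggregating with fielddata vs. keyword

  • With fielddata enabled on a text field, the aggregation produces one bucket for every analyzed token of that field, which in most cases is not the result we want.

  • With the keyword sub-field, values are not tokenized: each original value forms one bucket, so the result is usually more useful.

Recommendation: use the keyword sub-field of a text field for aggregations.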
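
The difference between the two approaches can be illustrated with a small simulation. This is a sketch under stated assumptions: the sample tags are made up, `analyze` is a crude stand-in for a real analyzer (it lowercases and splits on non-word characters), and `terms_agg` only mimics how a terms aggregation counts each document once per bucket.

```python
import re
from collections import Counter

# Hypothetical documents with multi-word tags.
docs = [
    {"tags": ["Java", "Best Practice"]},
    {"tags": ["Java", "Virtual Machine"]},
]


def analyze(text):
    """Crude analyzer stand-in: lowercase, then split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def terms_agg(documents, values_of):
    """Count, for each distinct value, the number of documents containing it."""
    counts = Counter()
    for doc in documents:
        counts.update(set(values_of(doc)))
    return counts


# fielddata on a text field: buckets are the analyzed tokens.
token_buckets = terms_agg(docs, lambda d: [t for tag in d["tags"] for t in analyze(tag)])

# keyword sub-field: buckets are the original, untouched values.
keyword_buckets = terms_agg(docs, lambda d: d["tags"])

print(sorted(token_buckets.items()))
print(sorted(keyword_buckets.items()))
```

In the token-based result, "Best Practice" is split into the separate buckets "best" and "practice", while the keyword-based result keeps it as a single bucket.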

2 Nested aggregations

2.1 Group first, then compute a metric per group

(1) Group by tags, then compute the average book price for each tag. Request:

  GET book_shop/it_book/_search
  {
    "size": 0,
    "aggs": {
      "group_by_tags": {
        "terms": { "field": "tags.keyword" },
        "aggs": {
          "avg_price": {
            "avg": { "field": "price" }
          }
        }
      }
    }
  }

(2) Response:

  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_tags" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java",
          "doc_count" : 3,
          "avg_price" : {
            "value" : 102.33333333333333
          }
        },
        {
          "key" : "编程语言",
          "doc_count" : 2,
          "avg_price" : {
            "value" : 114.0
          }
        },
        ......
      ]
    }
  }
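
As a cross-check of what the nested avg sub-aggregation computes, the same grouping can be done by hand in Python. The `books` list below is hypothetical sample data with prices picked for illustration, and `avg_price_by_tag` is a made-up helper, not an Elasticsearch API.

```python
from collections import defaultdict

# Hypothetical sample documents; prices are illustrative only.
books = [
    {"price": 79.0, "tags": ["Java", "JVM"]},
    {"price": 108.0, "tags": ["Java", "Programming"]},
    {"price": 120.0, "tags": ["Java", "Programming"]},
]


def avg_price_by_tag(documents):
    """Group prices by tag, then average each group."""
    prices = defaultdict(list)
    for doc in documents:
        for tag in doc["tags"]:
            # A book with several tags contributes its price to every tag bucket.
            prices[tag].append(doc["price"])
    return {tag: sum(values) / len(values) for tag, values in prices.items()}


averages = avg_price_by_tag(books)
for tag, value in averages.items():
    print(tag, value)
```

A book with several tags is counted in each of its tag buckets, so the bucket doc_count values can add up to more than the total number of hits.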

2.2 Group, compute a metric, then sort

(1) Compute the average book price for each tag, then sort the buckets by average price in descending order. Request:

  GET book_shop/it_book/_search
  {
    "size": 0,
    "aggs": {
      "all_tags": {
        "terms": {
          "field": "tags.keyword",
          "order": { "avg_price": "desc" } // sort by the result of the sub-aggregation defined below
        },
        "aggs": {
          "avg_price": {
            "avg": { "field": "price" }
          }
        }
      }
    }
  }

(2) Response:

The response has the same shape as in section 2.1, except that the buckets are ordered by average price, highest first.
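
The effect of ordering buckets by a sub-aggregation value can be sketched as a plain sort over per-tag averages. The `books` data is hypothetical and the code only illustrates the ordering, not how Elasticsearch executes it.

```python
from collections import defaultdict

# Hypothetical sample documents; prices are illustrative only.
books = [
    {"price": 79.0, "tags": ["Java", "JVM"]},
    {"price": 108.0, "tags": ["Java", "Programming"]},
    {"price": 120.0, "tags": ["Java", "Programming"]},
]

# Group prices by tag and average them (the avg_price sub-aggregation).
prices = defaultdict(list)
for book in books:
    for tag in book["tags"]:
        prices[tag].append(book["price"])
averages = {tag: sum(values) / len(values) for tag, values in prices.items()}

# Order the buckets by the metric, highest average first ("desc").
ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for tag, value in ordered:
    print(tag, value)
```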

2.3 Group, group again within each group, then aggregate and sort

(1) Group by price range, then group by tags within each range, and compute the average price of each tags group. Request:

  GET book_shop/it_book/_search
  {
    "size": 0,
    "aggs": {
      "group_by_price": {
        "range": {
          "field": "price",
          "ranges": [
            { "from": 0, "to": 100 },
            { "from": 100, "to": 150 }
          ]
        },
        "aggs": {
          "group_by_tags": {
            "terms": { "field": "tags.keyword" },
            "aggs": {
              "avg_price": {
                "avg": { "field": "price" }
              }
            }
          }
        }
      }
    }
  }

(2) Response:

  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_price" : {
      "buckets" : [
        {
          "key" : "0.0-100.0", // the range 0.0-100.0
          "from" : 0.0,
          "to" : 100.0,
          "doc_count" : 1, // 1 document falls in this range
          "group_by_tags" : { // buckets of the tags sub-aggregation
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "Java",
                "doc_count" : 1,
                "avg_price" : {
                  "value" : 79.0
                }
              },
              ......
            ]
          }
        },
        {
          "key" : "100.0-150.0",
          "from" : 100.0,
          "to" : 150.0,
          "doc_count" : 2,
          "group_by_tags" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "Java",
                "doc_count" : 2,
                "avg_price" : {
                  "value" : 114.0
                }
              },
              ......
            ]
          }
        }
      ]
    }
  }
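
The two-level grouping can be reproduced in memory as a rough sketch. It assumes that each range includes its `from` value and excludes its `to` value, which is how the range aggregation treats its bounds. The `books` data and the `range_then_tags` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample documents; prices are illustrative only.
books = [
    {"price": 79.0, "tags": ["Java", "JVM"]},
    {"price": 108.0, "tags": ["Java", "Programming"]},
    {"price": 120.0, "tags": ["Java", "Programming"]},
]
ranges = [(0, 100), (100, 150)]


def range_then_tags(documents, price_ranges):
    """Bucket by price range, then by tag, then average the price per tag."""
    result = {}
    for low, high in price_ranges:
        # `low` is inclusive, `high` is exclusive.
        in_range = [doc for doc in documents if low <= doc["price"] < high]
        prices_by_tag = defaultdict(list)
        for doc in in_range:
            for tag in doc["tags"]:
                prices_by_tag[tag].append(doc["price"])
        result[f"{float(low)}-{float(high)}"] = {
            "doc_count": len(in_range),
            "avg_price": {t: sum(p) / len(p) for t, p in prices_by_tag.items()},
        }
    return result


buckets = range_then_tags(books, ranges)
for key, bucket in buckets.items():
    print(key, bucket)
```

Because the upper bound is exclusive, a book priced at exactly 100 would land in the second range, not the first.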