
Elasticsearch Tutorial: The cardinality (Deduplication) Aggregation, Tuning Memory Overhead, and the HLL Algorithm


For the full Elasticsearch series, see: Elasticsearch Tutorial: Summary

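Note: in practice, the first and second approaches below are the ones normally used. The third, the HLL hash optimization, is rarely used because its effect is not very noticeable.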

1. cardinality syntax

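In Elasticsearch, deduplication is done with the cardinality metric: it deduplicates the specified field within each bucket and returns the number of distinct values, similar to count(distinct) in SQL.
Like count(distinct), cardinality returns a count, but an approximate one: the error rate is around 5%, and a request takes roughly 100 ms.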

  GET /tvs/sales/_search
  {
    "size": 0,
    "aggs": {
      "group_by_sold_date": {
        "date_histogram": {
          "field": "sold_date",
          "interval": "month"
        },
        "aggs": {
          "distinct_brand_cnt": {
            "cardinality": {
              "field": "brand"
            }
          }
        }
      }
    }
  }

Response:

  {
    "took": 70,
    "timed_out": false,
    "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
    },
    "hits": {
      "total": 8,
      "max_score": 0,
      "hits": []
    },
    "aggregations": {
      "group_by_sold_date": {
        "buckets": [
          {
            "key_as_string": "2016-05-01T00:00:00.000Z",
            "key": 1462060800000,
            "doc_count": 1,
            "distinct_brand_cnt": {
              "value": 1
            }
          },
          {
            "key_as_string": "2016-06-01T00:00:00.000Z",
            "key": 1464739200000,
            "doc_count": 0,
            "distinct_brand_cnt": {
              "value": 0
            }
          },
          {
            "key_as_string": "2016-07-01T00:00:00.000Z",
            "key": 1467331200000,
            "doc_count": 1,
            "distinct_brand_cnt": {
              "value": 1
            }
          },
          {
            "key_as_string": "2016-08-01T00:00:00.000Z",
            "key": 1470009600000,
            "doc_count": 1,
            "distinct_brand_cnt": {
              "value": 1
            }
          },
          {
            "key_as_string": "2016-09-01T00:00:00.000Z",
            "key": 1472688000000,
            "doc_count": 0,
            "distinct_brand_cnt": {
              "value": 0
            }
          },
          {
            "key_as_string": "2016-10-01T00:00:00.000Z",
            "key": 1475280000000,
            "doc_count": 1,
            "distinct_brand_cnt": {
              "value": 1
            }
          },
          {
            "key_as_string": "2016-11-01T00:00:00.000Z",
            "key": 1477958400000,
            "doc_count": 2,
            "distinct_brand_cnt": {
              "value": 1
            }
          },
          {
            "key_as_string": "2016-12-01T00:00:00.000Z",
            "key": 1480550400000,
            "doc_count": 0,
            "distinct_brand_cnt": {
              "value": 0
            }
          },
          {
            "key_as_string": "2017-01-01T00:00:00.000Z",
            "key": 1483228800000,
            "doc_count": 1,
            "distinct_brand_cnt": {
              "value": 1
            }
          },
          {
            "key_as_string": "2017-02-01T00:00:00.000Z",
            "key": 1485907200000,
            "doc_count": 1,
            "distinct_brand_cnt": {
              "value": 1
            }
          }
        ]
      }
    }
  }
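
To make the semantics concrete, here is a minimal Python sketch of what the aggregation computes conceptually: group documents by month, then count the distinct brand values in each group. The sample documents and the function name `distinct_brands_per_month` are hypothetical, and this version is exact, whereas cardinality in Elasticsearch is approximate. Unlike the date_histogram response above, it also does not emit empty months.

```python
from collections import defaultdict

# Hypothetical sample documents; only the field names come from the request.
sales = [
    {"sold_date": "2016-05-18", "brand": "Changhong"},
    {"sold_date": "2016-11-05", "brand": "Samsung"},
    {"sold_date": "2016-11-20", "brand": "Samsung"},
    {"sold_date": "2017-01-01", "brand": "Xiaomi"},
    {"sold_date": "2017-02-12", "brand": "TCL"},
    {"sold_date": "2017-02-13", "brand": "Xiaomi"},
]


def distinct_brands_per_month(docs):
    """Exact equivalent of a monthly bucket plus count(distinct brand)."""
    brands_by_month = defaultdict(set)
    for doc in docs:
        month = doc["sold_date"][:7]  # "YYYY-MM" prefix acts as the bucket key
        brands_by_month[month].add(doc["brand"])
    # The set size is the distinct count for that bucket.
    return {month: len(brands) for month, brands in sorted(brands_by_month.items())}


if __name__ == "__main__":
    print(distinct_brands_per_month(sales))
    # {'2016-05': 1, '2016-11': 1, '2017-01': 1, '2017-02': 2}
```

Note how 2016-11 has two documents but a distinct count of 1, the same pattern as the 2016-11 bucket in the response.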

 

2. Tuning accuracy and memory overhead with precision_threshold

  GET /tvs/sales/_search
  {
    "size": 0,
    "aggs": {
      "distinct_brand": {
        "cardinality": {
          "field": "brand",
          "precision_threshold": 100
        }
      }
    }
  }

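This deduplicates on brand, assuming brand has no more than 100 unique values (Xiaomi, Changhong, Samsung, TCL, HTL, ...).

precision_threshold sets the number of unique values below which cardinality is almost guaranteed to be 100% accurate.
The cardinality algorithm uses about precision_threshold * 8 bytes of memory, so a threshold of 100 costs 100 * 8 = 800 bytes.
That is very little memory, and as long as the number of unique values really is within the threshold, the result is almost always exact.
With a threshold of 100, the error rate stays within 5% even when there are millions of unique values.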

The larger precision_threshold is, the more memory is used, and the larger the number of unique values for which the result remains close to 100% accurate.

For example, when deduplicating a field that has about 10,000 unique values:
precision_threshold = 10000
10000 * 8 = 80,000 bytes
80,000 / 1024 ≈ 78 KB (roughly 80 KB)
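
The memory arithmetic can be wrapped in a small helper. This is only a sketch of the precision_threshold * 8 bytes rule of thumb quoted in this section; the function names are hypothetical and the figure is an estimate per aggregation, not a measurement of a real cluster.

```python
def cardinality_memory_bytes(precision_threshold: int) -> int:
    """Estimated memory for one cardinality aggregation, using the
    precision_threshold * 8 bytes rule of thumb from the text."""
    if precision_threshold < 0:
        raise ValueError("precision_threshold must be non-negative")
    return precision_threshold * 8


def bytes_to_kib(num_bytes: int) -> float:
    """Convert bytes to kibibytes (1 KiB = 1024 bytes)."""
    return num_bytes / 1024


if __name__ == "__main__":
    for threshold in (100, 10000):
        size = cardinality_memory_bytes(threshold)
        print(f"precision_threshold={threshold}: {size} bytes, {bytes_to_kib(size):.1f} KiB")
```

Keep in mind that the estimate applies per bucket when cardinality is nested under a bucketing aggregation such as the monthly date_histogram in section 1.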


3. HyperLogLog++ (HLL) performance optimization

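cardinality is built on the HLL algorithm, so its performance is the performance of HLL.
HLL hashes every unique value and uses the hash values to approximate the distinct count, which is why the result carries some error.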
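
The hash-based estimate can be illustrated with a minimal sketch of plain HyperLogLog. This is not the HyperLogLog++ implementation inside Elasticsearch: it omits the HLL++ bias correction and sparse encoding, it uses `hashlib.blake2b` as a stand-in for murmur3, and the class name and the default of 2^14 registers are assumptions chosen for illustration.

```python
import hashlib
import math


def hash64(value: str) -> int:
    """64-bit hash of a string (blake2b used here as a stand-in for murmur3)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HyperLogLog:
    """Plain HyperLogLog with a linear-counting fallback for small counts."""

    def __init__(self, p: int = 14) -> None:
        self.p = p                  # number of hash bits used as register index
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = hash64(value)
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(10000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")  # duplicates do not change the registers
    print(round(hll.count()))
```

Each value only touches one fixed-size register, so memory stays constant no matter how many values are added. That constant footprint is the trade-off against exactness.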

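By default, when a cardinality request is sent, the hash of every field value is computed on the fly at query time. The optimization is to move that hashing step forward to index time.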

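When creating the index, add a hash sub-field to the brand field so that the hash of each value is stored in the index.
Note: "murmur3" here is a hashing algorithm. The murmur3 field type is provided by the mapper-murmur3 plugin, which has to be installed on the cluster.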

  PUT /tvs/
  {
    "mappings": {
      "sales": {
        "properties": {
          "brand": {
            "type": "text",
            "fields": {
              "hash": {
                "type": "murmur3"
              }
            }
          }
        }
      }
    }
  }

Then run the cardinality metric against the hash sub-field:

  GET /tvs/sales/_search
  {
    "size": 0,
    "aggs": {
      "distinct_brand": {
        "cardinality": {
          "field": "brand.hash",
          "precision_threshold": 100
        }
      }
    }
  }
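
The requests in sections 2 and 3 differ only in the field name, so when comparing the two it can help to generate the body from one place. The helper below is a hypothetical sketch that only builds the JSON body; it does not talk to Elasticsearch, and the function name is an assumption.

```python
import json


def build_cardinality_request(field: str, precision_threshold: int,
                              agg_name: str = "distinct_brand") -> dict:
    """Build the search body for a cardinality aggregation on `field`."""
    return {
        "size": 0,  # skip hits, return only the aggregation result
        "aggs": {
            agg_name: {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


if __name__ == "__main__":
    # Query-time hashing on the raw field vs. the pre-hashed sub-field.
    for field in ("brand", "brand.hash"):
        print(json.dumps(build_cardinality_request(field, 100), indent=2))
```

The resulting body can be sent to GET /tvs/sales/_search with whichever HTTP client is at hand.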

 

 
