
[Elasticsearch] Fuzzy-type queries in ES can crash the Elasticsearch service (elasticsearch 7.9: query-induced failure)

Author: 不正经 | 2024-02-24 22:18:53



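This is an original article by the blogger 九师兄 (QQ: 541711153, technical discussion welcome). It may not be reposted without the author's permission.

You can add me to ask questions and I will answer for free. If you have a problem, message me privately first; I am online every day and will help those who need it.

However, for certain reasons I have become disheartened and have resolved not to write for free anymore. I would rather not explain.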


1. Overview

Reposted from: 模糊查询导致Elasticsearch服务宕机 (Fuzzy-type queries crash the Elasticsearch service)

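I previously wrote an article in the community, 《ElasticSearch集群故障案例分析: 警惕通配符查询》 (an Elasticsearch cluster failure case study: beware of wildcard queries), about how wildcard queries can push an ES cluster's load too high. It explained that the non-deterministic automaton built by a wildcard query has to go through a determinize step, and that if this step generates too many states, cluster load can spike and affect the service. However, because Lucene caps the number of states generated during determinization, the cluster returns to normal once the problem query has passed.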

A more recent production incident, however, made me realize that fuzzy-type queries such as Prefix/Regex/Fuzzy can bring the entire cluster down outright.

When the problem occurred, the ES server log contained the following error:

[2017-06-14T21:06:39,330][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [xx.xx.xx.xx] fatal error in thread [elasticsearch[xx.xx.xx.xx][search][T#29]], exiting
java.lang.StackOverflowError
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1053) ~[lucene-core-6.2.1.jar:6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:15:20]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1053) ~[lucene-core-6.2.1.jar:6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:15:20]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1053) ~[lucene-core-6.2.1.jar:6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:15:20]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1053) ~[lucene-core-6.2.1.jar:6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:15:20]

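Investigation showed that queries such as Prefix/Regex/Fuzzy construct a deterministic automaton directly. If the query string is too long, or the pattern itself is too complex, the automaton ends up with too many states, and a subsequent call to Lucene's isFinite method can overflow the stack.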

A regex query that reproduces the problem:

POST /test_index/_search
{
  "query": {
    "regexp": {
      "test": "t{1,9500}"
    }
  }
}


When I ran it myself, I got the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "query_shard_exception",
        "reason" : "failed to create query: input automaton is too large: 1001",
        "index_uuid" : "pFxnWiwdSFSt9V6l35fPzQ",
        "index" : "test_index"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "test_index",
        "node" : "ZiR6PjzPSX6NI99Awisr2g",
        "reason" : {
          "type" : "query_shard_exception",
          "reason" : "failed to create query: input automaton is too large: 1001",
          "index_uuid" : "pFxnWiwdSFSt9V6l35fPzQ",
          "index" : "test_index",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "input automaton is too large: 1001"
          }
        }
      }
    ]
  },
  "status" : 400
}
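
The 1001 in this message is consistent with the automaton check refusing to recurse past a depth of 1000, which would explain why this run ends in a 400 response instead of a crashed node. That is an inference from the message, not something the article confirms. The sketch below shows the idea on a toy automaton stored as adjacency arrays; the class name, the `chain` helper, and the cap value are assumptions made for illustration and are not Lucene's actual code.

```java
import java.util.BitSet;

/** Toy automaton: transitions[s] lists the destination states reachable from state s. */
final class DepthCappedFiniteCheck {
    // Assumed cap, chosen only to match the "1001" in the error message above.
    static final int MAX_RECURSION_LEVEL = 1000;

    /** Returns true if no cycle is reachable from state 0. */
    static boolean isFinite(int[][] transitions) {
        if (transitions.length == 0) {
            return true;
        }
        return isFinite(transitions, 0, new BitSet(), new BitSet(), 0);
    }

    private static boolean isFinite(int[][] transitions, int state,
                                    BitSet path, BitSet visited, int level) {
        // Fail the query cleanly before the JVM stack can overflow.
        if (level > MAX_RECURSION_LEVEL) {
            throw new IllegalArgumentException("input automaton is too large: " + level);
        }
        path.set(state);
        for (int dest : transitions[state]) {
            if (path.get(dest)
                    || (!visited.get(dest) && !isFinite(transitions, dest, path, visited, level + 1))) {
                return false;
            }
        }
        path.clear(state);
        visited.set(state);
        return true;
    }

    /** Builds a chain 0 -> 1 -> ... -> (numStates - 1), a rough stand-in for a long repetition. */
    static int[][] chain(int numStates) {
        int[][] t = new int[numStates][];
        for (int s = 0; s < numStates; s++) {
            t[s] = (s + 1 < numStates) ? new int[] {s + 1} : new int[0];
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(isFinite(chain(500)));
        try {
            isFinite(chain(9501));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a cap like this, an oversized pattern costs one rejected request; without it, the same pattern can take down the search thread and, as in the log earlier, the whole process.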


GitHub issue link: issues/24553.

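In our particular case, the cause was that the prefix query did not limit the length of user input. Looking at the source, PrefixQuery extends Lucene's AutomatonQuery; when it is instantiated, maxDeterminizedStates is passed as Integer.MAX_VALUE, and the length of the prefix is not checked before the automaton is built. Personally, I think the size should probably be limited here to avoid generating too many states: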

public class PrefixQuery extends AutomatonQuery {

  /** Constructs a query for terms starting with <code>prefix</code>. */
  public PrefixQuery(Term prefix) {
    // It's OK to pass unlimited maxDeterminizedStates: the automaton is born small and determinized:
    super(prefix, toAutomaton(prefix.bytes()), Integer.MAX_VALUE, true);
    if (prefix == null) {
      throw new NullPointerException("prefix must not be null");
    }
  }
  // ... remainder of the class omitted in this excerpt
}

The code that ultimately throws the exception is

org.apache.lucene.util.automaton.Operations.isFinite.

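As you can see, this code is recursive, and the recursion depth depends on the number of state transitions. According to its own comment, this is code awaiting improvement: because it recurses, it can overflow the stack: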

  // TODO: not great that this is recursive... in theory a
  // large automata could exceed java's stack
  private static boolean isFinite(Transition scratch, Automaton a, int state, BitSet path, BitSet visited) {
    path.set(state);
    int numTransitions = a.initTransition(state, scratch);
    for(int t=0;t<numTransitions;t++) {
      a.getTransition(state, t, scratch);
      if (path.get(scratch.dest) || (!visited.get(scratch.dest) && !isFinite(scratch, a, scratch.dest, path, visited))) {
        return false;
      }
    }
    path.clear(state);
    visited.set(state);
    return true;
  }
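
For comparison, the same cycle check can be written without recursion by keeping the depth-first search stack on the heap. The sketch below is only an illustration on a toy automaton stored as adjacency arrays (`transitions[s]` lists the destinations of state `s`); the class name and representation are assumptions, and this is not how Lucene itself resolved the TODO above.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

final class IterativeFiniteCheck {
    /** Returns true if no cycle is reachable from state 0, using an explicit stack. */
    static boolean isFinite(int[][] transitions) {
        if (transitions.length == 0) {
            return true;
        }
        BitSet path = new BitSet(transitions.length);     // states on the current DFS path
        BitSet visited = new BitSet(transitions.length);  // states already fully explored
        int[] nextEdge = new int[transitions.length];     // next transition to try per state
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);
        path.set(0);
        while (!stack.isEmpty()) {
            int state = stack.peek();
            if (nextEdge[state] < transitions[state].length) {
                int dest = transitions[state][nextEdge[state]++];
                if (path.get(dest)) {
                    return false; // edge back into the current path: a cycle
                }
                if (!visited.get(dest)) {
                    path.set(dest);
                    stack.push(dest);
                }
            } else {
                // All transitions of this state explored: leave the path.
                stack.pop();
                path.clear(state);
                visited.set(state);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[][] chain = new int[n][];
        for (int s = 0; s < n; s++) {
            chain[s] = (s + 1 < n) ? new int[] {s + 1} : new int[0];
        }
        System.out.println(isFinite(chain)); // a long chain has no cycle
        chain[n - 1] = new int[] {0};        // close the chain into a loop
        System.out.println(isFinite(chain));
    }
}
```

The trade-off is that depth is now bounded by heap memory instead of thread stack size, so a very deep automaton slows the query down without killing the thread.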

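So if your project uses fuzzy-type queries, be absolutely sure to limit the length of user input; otherwise the cluster load may spike, or the whole cluster may go down.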
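
As a minimal sketch of such a limit, the application can validate user input before it ever reaches a prefix, fuzzy, or regexp query. Everything here is hypothetical: the class name, the method names, and both thresholds are illustrative choices, not values taken from Elasticsearch or from the original incident.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FuzzyInputGuard {
    // Hypothetical limits; tune them to your own data and latency budget.
    static final int MAX_PATTERN_LENGTH = 64;
    static final int MAX_REPEAT = 100;

    // Matches bounded repetitions such as {5}, {2,} or {1,9500}.
    // Heuristic only: it does not try to skip escaped braces.
    private static final Pattern REPEAT = Pattern.compile("\\{(\\d+)(?:,(\\d*))?\\}");

    /** Rejects empty or overly long text destined for a prefix or fuzzy query. */
    static String requireSafeText(String userInput) {
        if (userInput == null || userInput.isEmpty()) {
            throw new IllegalArgumentException("search text must not be empty");
        }
        int length = userInput.codePointCount(0, userInput.length());
        if (length > MAX_PATTERN_LENGTH) {
            throw new IllegalArgumentException(
                    "search text too long: " + length + " > " + MAX_PATTERN_LENGTH);
        }
        return userInput;
    }

    /** Additionally rejects short patterns that expand into huge automata. */
    static String requireSafeRegex(String userRegex) {
        requireSafeText(userRegex);
        Matcher m = REPEAT.matcher(userRegex);
        while (m.find()) {
            for (int g = 1; g <= 2; g++) {
                String digits = m.group(g);
                if (digits == null || digits.isEmpty()) {
                    continue;
                }
                // The length check keeps Integer.parseInt from overflowing.
                if (digits.length() > 6 || Integer.parseInt(digits) > MAX_REPEAT) {
                    throw new IllegalArgumentException(
                            "repetition count too large: " + m.group());
                }
            }
        }
        return userRegex;
    }

    public static void main(String[] args) {
        System.out.println(requireSafeText("elastic"));
        try {
            requireSafeRegex("t{1,9500}");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The second check matters because a length limit alone does not catch the reproduction case: `t{1,9500}` is only nine characters long, yet it is the repetition count that drives the size of the automaton.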

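Lucene/Elasticsearch should certainly add limits at the code level so that a problematic query cannot cause a stack overflow. Still, reaching for this kind of query is often a sign that the developer's thinking is stuck in the era of RDBMS development. We should put more effort into the indexing stage, so that the vast majority of queries can be served by term queries, which are the most efficient.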

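Disclaimer: This article was contributed voluntarily by a user and does not represent the position of 【wpsshop博客】. Copyright belongs to the original author, and this site bears no corresponding legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/不正经/article/detail/136326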