
WARN messages in Quartz cluster scheduling_clustermanager: detected 1 failed or restarted ins

clustermanager: detected 1 failed or restarted instances.

1. The warnings logged are as follows:

This scheduler instance xxxx is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
ClusterManager: detected 1 failed or restarted instances.

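Analysis:
1. The log lines are emitted by LocalDataSourceJobStore. Its own source contains no such log statements, so searching up through its parent classes and interfaces leads to JobStoreSupport. The relevant source is: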

protected void clusterRecover(Connection conn, List<SchedulerStateRecord> failedInstances)
        throws JobPersistenceException {

    if (failedInstances.size() > 0) {

        long recoverIds = System.currentTimeMillis();

        logWarnIfNonZero(failedInstances.size(),
                "ClusterManager: detected " + failedInstances.size()
                        + " failed or restarted instances.");
        // The remaining lines are omitted
        // ....
    }
}
protected List<SchedulerStateRecord> findFailedInstances(Connection conn)
        throws JobPersistenceException {
        try {
            List<SchedulerStateRecord> failedInstances = new LinkedList<SchedulerStateRecord>();
            boolean foundThisScheduler = false;
            long timeNow = System.currentTimeMillis();
            
            List<SchedulerStateRecord> states = getDelegate().selectSchedulerStateRecords(conn, null);

            for(SchedulerStateRecord rec: states) {
        
                // find own record...
                if (rec.getSchedulerInstanceId().equals(getInstanceId())) {
                    foundThisScheduler = true;
                    if (firstCheckIn) {
                        failedInstances.add(rec);
                    }
                } else {
                    // find failed instances...
                    if (calcFailedIfAfter(rec) < timeNow) {
                        failedInstances.add(rec);
                    }
                }
            }
            
            // The first time through, also check for orphaned fired triggers.
            if (firstCheckIn) {
                failedInstances.addAll(findOrphanedFailedInstances(conn, states));
            }
            
            // If not the first time but we didn't find our own instance, i.e.
            // this node's own record is missing and this is not the first check-in.
            if ((!foundThisScheduler) && (!firstCheckIn)) {
                // FUTURE_TODO: revisit when handle self-failed-out impl'ed (see FUTURE_TODO in clusterCheckIn() below)
                getLog().warn(
                    "This scheduler instance (" + getInstanceId() + ") is still " + 
                    "active but was recovered by another instance in the cluster.  " +
                    "This may cause inconsistent behavior.");
            }
            
            return failedInstances;
        } catch (Exception e) {
            lastCheckin = System.currentTimeMillis();
            throw new JobPersistenceException("Failure identifying failed instances when checking-in: "
                    + e.getMessage(), e);
        }
    }
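The decision logic in findFailedInstances can be tried out without a database. The following is only a minimal sketch, not Quartz code: StateRow, Result and findFailed are hypothetical names introduced for illustration, the orphaned-trigger check is left out, and the current time is passed in as a parameter so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FindFailedSketch {
    // Same 7.5 s tolerance that appears in JobStoreSupport.calcFailedIfAfter.
    static final long TOLERANCE_MS = 7500L;

    /** Hypothetical stand-in for one scheduler-state row (not Quartz's SchedulerStateRecord). */
    static final class StateRow {
        final String instanceId;
        final long checkinTimestamp;
        final long checkinInterval;

        StateRow(String instanceId, long checkinTimestamp, long checkinInterval) {
            this.instanceId = instanceId;
            this.checkinTimestamp = checkinTimestamp;
            this.checkinInterval = checkinInterval;
        }
    }

    /** Hypothetical result holder: failed ids plus whether the "recovered by another instance" warning would fire. */
    static final class Result {
        final List<String> failedIds;
        final boolean warnsRecoveredByOther;

        Result(List<String> failedIds, boolean warnsRecoveredByOther) {
            this.failedIds = failedIds;
            this.warnsRecoveredByOther = warnsRecoveredByOther;
        }
    }

    // Simplified mirror of the loop shown above; findOrphanedFailedInstances is omitted.
    static Result findFailed(List<StateRow> rows, String ownId, boolean firstCheckIn,
                             long timeNow, long lastCheckin) {
        List<String> failed = new ArrayList<>();
        boolean foundThisScheduler = false;
        for (StateRow row : rows) {
            if (row.instanceId.equals(ownId)) {
                foundThisScheduler = true;
                if (firstCheckIn) {
                    failed.add(row.instanceId);
                }
            } else {
                long failedIfAfter = row.checkinTimestamp
                        + Math.max(row.checkinInterval, timeNow - lastCheckin)
                        + TOLERANCE_MS;
                if (failedIfAfter < timeNow) {
                    failed.add(row.instanceId);
                }
            }
        }
        // Own row missing on a later check-in: another node must have removed it.
        boolean warn = !foundThisScheduler && !firstCheckIn;
        return new Result(failed, warn);
    }

    public static void main(String[] args) {
        // Only nodeB has a row; nodeA (this node) is missing and it is not the first check-in.
        List<StateRow> rows = Collections.singletonList(new StateRow("nodeB", 1_000_000L, 20_000L));
        Result r = findFailed(rows, "nodeA", false, 1_010_000L, 1_000_000L);
        System.out.println("failed=" + r.failedIds + ", warn=" + r.warnsRecoveredByOther);
    }
}
```

In the main method above, this node's own row is absent on a non-first check-in, which is exactly the condition that produces the "still active but was recovered by another instance" warning.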

Note the calcFailedIfAfter method called just below the // find failed instances... comment:

protected long calcFailedIfAfter(SchedulerStateRecord rec) {
   return rec.getCheckinTimestamp() +
        Math.max(rec.getCheckinInterval(), 
                (System.currentTimeMillis() - lastCheckin)) +
        7500L;
}	

Because this node's own instance record was not found in the database and this was not its first check-in, the following log is printed:

This scheduler instance xxxx is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.

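At the same time, another node's check-in appeared to have timed out. Because the server clocks differed by a large amount, beyond the 7.5-second tolerance in calcFailedIfAfter, that node was added to failedInstances. Since a node was considered timed out, the clusterRecover method was called, which prints the following log: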

ClusterManager: detected 1 failed or restarted instances.
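To see how clock skew alone can trip this check, here is a small sketch that reuses the arithmetic of calcFailedIfAfter. It is an illustration under stated assumptions rather than Quartz code: the class and method names are hypothetical, the 20-second check-in interval is an example value, the peer is assumed to be healthy and about to check in again, and the observer is assumed to have last checked in exactly one interval ago.

```java
public class ClockSkewSketch {
    // Tolerance taken from the calcFailedIfAfter source shown earlier.
    static final long TOLERANCE_MS = 7500L;

    // Same formula as calcFailedIfAfter, with "now" and lastCheckin passed in
    // instead of read from System.currentTimeMillis() and an instance field.
    static long calcFailedIfAfter(long checkinTimestamp, long checkinInterval,
                                  long now, long lastCheckin) {
        return checkinTimestamp + Math.max(checkinInterval, now - lastCheckin) + TOLERANCE_MS;
    }

    // True when the observing node would add the peer to failedInstances.
    static boolean looksFailed(long peerCheckinTimestamp, long checkinInterval,
                               long observerNow, long observerLastCheckin) {
        return calcFailedIfAfter(peerCheckinTimestamp, checkinInterval,
                observerNow, observerLastCheckin) < observerNow;
    }

    public static void main(String[] args) {
        long interval = 20_000L;       // assumed check-in interval in ms
        long peerCheckin = 1_000_000L; // timestamp written by the peer using its own clock

        for (long skew : new long[] {0L, 5_000L, 30_000L}) {
            // One interval has really elapsed since the peer checked in,
            // but the observer's clock runs ahead by "skew" ms.
            long observerNow = peerCheckin + interval + skew;
            long observerLastCheckin = observerNow - interval;
            boolean failed = looksFailed(peerCheckin, interval, observerNow, observerLastCheckin);
            System.out.println("skew=" + skew + "ms -> flagged as failed: " + failed);
        }
    }
}
```

Under these assumptions the healthy peer is flagged only once the skew is larger than the 7.5-second tolerance, which is consistent with the analysis above.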

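So this problem is mainly caused by the server clocks in the cluster being out of sync; synchronizing the clocks of the services in the cluster resolves it. I am still studying the source code, so if anything here is wrong, corrections are very welcome. Thank you!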
