After a YARN configuration change, the YARN ResourceManagers had to be restarted. After the restart, both ResourceManagers were in standby state, users could not submit jobs to the YARN cluster, and the YARN service was effectively down.
The ResourceManager exception log is shown further below.
From the YARN HA mechanism we know that an RM taking over (transitioning from standby to active) attempts to recover the applications that were running. The process works as follows:
When the NodeManagers resync with the restarted RM, they do not kill their containers. They continue to manage them and send the container statuses to the RM when they re-register. The RM reconstructs the container instances and the scheduling state of the associated applications from this information. At the same time, the ApplicationMasters need to re-send their outstanding resource requests to the RM, because the RM may have lost them when it went down.
Application writers who use the AMRMClient library to communicate with the RM do not need to handle re-sending resource requests on resync themselves; the library does it automatically.
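To make that last point concrete, here is a minimal AM sketch built on AMRMClientAsync (illustrative only: the class name MiniAM, the host value, and the single container request are placeholders, not taken from the application in this incident). The point is that a ContainerRequest registered with the client is replayed by the library itself when the RM restarts or fails over:

import java.util.List;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MiniAM extends AMRMClientAsync.AbstractCallbackHandler {
  public static void main(String[] args) throws Exception {
    AMRMClientAsync<ContainerRequest> rm =
        AMRMClientAsync.createAMRMClientAsync(1000, new MiniAM());
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("am-host", 0, "");  // placeholder host/URL
    // One 2048 MB / 1 vcore container, the same shape as the request seen
    // in the recovery log below. On RM resync the library re-sends it;
    // the AM code does nothing special.
    rm.addContainerRequest(new ContainerRequest(
        Resource.newInstance(2048, 1), null, null, Priority.newInstance(0)));
  }
  // Empty callbacks; a real AM would launch work in onContainersAllocated.
  @Override public void onContainersAllocated(List<Container> c) { }
  @Override public void onContainersCompleted(List<ContainerStatus> s) { }
  @Override public void onContainersUpdated(List<UpdatedContainer> c) { }
  @Override public void onNodesUpdated(List<NodeReport> n) { }
  @Override public void onShutdownRequest() { }
  @Override public void onError(Throwable e) { }
  @Override public float getProgress() { return 0f; }
}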
Check the ID of the application currently running on YARN: application_1606183701564_9494 (it was the only running application).
Search the standby ResourceManager's log for that application ID:
2020-11-26 20:05:02,369 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1606183701564_9494 with 1 attempts and final state = NONE
2020-11-26 20:05:23,123 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Cannot submit application application_1606183701564_9494 to queue root.default because it has zero amount of resource for a requested resource! Invalid requested AM resources: [MaxResourceValidationResult{resourceRequest={AllocationRequestId: -1, Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1, Location: *, Relax Locality: true, Execution Type Request: {Execution Type: GUARANTEED, Enforce Execution Type: false}, Node Label Expression: }, invalidResources=[name: memory-mb, units: Mi, type: COUNTABLE, value: 2048, minimum allocation: 0, maximum allocation: 9223372036854775807, name: vcores, units: , type: COUNTABLE, value: 1, minimum allocation: 0, maximum allocation: 9223372036854775807]}], maximum queue resources: <memory:0, vCores:0>
2020-11-26 20:05:23,126 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:526)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1257)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1266)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1207)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1078)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2300(RMAppImpl.java:118)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1142)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1083)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:891)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:358)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:552)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1406)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:769)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1159)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1199)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1195)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1195)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:651)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:526)
The log shows that application_1606183701564_9494 was recovered with final state NONE, and that when the AM's resource request was replayed against queue root.default, the queue had no resources to give (note the maximum queue resources: <memory:0, vCores:0> at the end of the FairScheduler message).
Searching for the error keywords with a search engine, we found a CDH fixed-issues entry that matches our problem:
YARN Resource Managers will stay in standby state after failover or startup https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_634_fixed_issues.html
The CDH release note points to the corresponding YARN community JIRA:
Improve error handling when application recovery fails with exception https://issues.apache.org/jira/browse/YARN-7913
The trigger conditions are:
a. YARN HA is enabled and the FairScheduler is used;
b. the standby RM's fair-scheduler.xml has changed such that, at failover time, a recovering application has no permission on its queue or cannot be granted resources from it;
c. the resulting NullPointerException makes the standby RM's transition to active fail.
Since we had not modified the standby RM's fair-scheduler.xml ourselves, the next step was to examine the file's contents to locate the problem.
Check the configuration of queue root.default: cat /run/cloudera-scm-agent/process/2804-yarn-RESOURCEMANAGER/fair-scheduler.xml
We suspected that the maxResources value memory-mb=100.0%, vcores=100.0% could not be parsed.
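For reference, the relevant part of the file presumably looked roughly like this (a reconstruction based on the value quoted above; the surrounding structure is assumed, not copied from the actual file):

<allocations>
  <queue name="root">
    <queue name="default">
      <!-- This percentage form was evidently not recognized by this
           version's FairScheduler, leaving the queue's maximum resources
           at <memory:0, vCores:0>, as seen in the log above. -->
      <maxResources>memory-mb=100.0%, vcores=100.0%</maxResources>
    </queue>
  </queue>
</allocations>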
After deleting the maxResources tag from the root.default queue, YARN RM HA failover succeeded.
Root cause: the maxResources value of queue root.default in fair-scheduler.xml could not be parsed, so the queue's maximum available resources became 0.
When the standby RM started transitioning to active, the resource requests of the AMs of applications still running in the cluster were replayed against it.
The recovering application could not be granted resources in the root.default queue and was rejected.
Reading the source shows that at this point the application is still in the NEW state, because the APP_REJECTED event has not yet been fully processed (after processing, the state would be FAILED). The rejected application is therefore missing from the scheduler's application map, so when its attempt is recovered the lookup returns null and a NullPointerException is thrown.
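The top stack frame matches this picture. Paraphrasing FairScheduler.addApplicationAttempt (FairScheduler.java:526 in this build) from memory, as a simplified excerpt rather than the verbatim Hadoop source, the failing lookup looks like:

// Simplified paraphrase of FairScheduler.addApplicationAttempt;
// not the verbatim source.
protected void addApplicationAttempt(ApplicationAttemptId attemptId,
    boolean transferStateFromPreviousAttempt, boolean isAttemptRecovering) {
  // The rejected application was never (re-)added to the scheduler's
  // application map, so this lookup returns null...
  SchedulerApplication<FSAppAttempt> application =
      applications.get(attemptId.getApplicationId());
  // ...and the first dereference throws the NullPointerException that
  // aborts the whole transition to active.
  String user = application.getUser();
  // ... rest of the method ...
}

The fix tracked by YARN-7913 improves the error handling around this recovery path, so that a single unrecoverable application no longer blocks the RM's transition to active.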
Recovery procedure:
a. Stop all ResourceManager processes.
b. Open a ZooKeeper client: sh /opt/cloudera/parcels/CDH/lib/zookeeper/bin/zkCli.sh
c. List the RM state store: ls /rmstore/ZKRMStateRoot/RMAppRoot
d. If the znode is not empty, clear it with: deleteall /rmstore/ZKRMStateRoot/RMAppRoot
e. Start the ResourceManagers again; the service then returns to normal.
Note that this approach works by wiping the application state that the RM would otherwise recover from ZooKeeper. In other words, the RM recovers no in-flight applications, so any jobs that were running in the cluster are stopped and must be resubmitted.