Common commands
1. yarn rmadmin -getServiceState rm1   # check whether an RM is in the active or standby state
2. Manual failover between active and standby:
yarn rmadmin -transitionToStandby rm2 --forcemanual   # switch rm2 from active to standby
yarn rmadmin -transitionToActive rm1 --forcemanual    # switch rm1 from standby to active
yarn rmadmin -getServiceState rm1
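A minimal check of both RMs after a failover, assuming the RM IDs configured under yarn.resourcemanager.ha.rm-ids are rm1 and rm2:
for id in rm1 rm2; do
  echo -n "$id: "
  yarn rmadmin -getServiceState $id
done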
Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, but not removing app application_1618886060273_3657 from state store as log aggregation have not finished yet
This is a YARN bug: https://issues.apache.org/jira/browse/YARN-4946
The fix is to patch the standby RM first and then the active RM. To replace the jar on an RM node:
(1) Back up the old jar with mv: mv $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-$version.jar <backup path>
(2) Copy in the new jar: download a hadoop-yarn-server-resourcemanager-$version.jar from a release later than 3.2.0
(3) Restart the RM
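A minimal sketch of steps (1)-(3), assuming $HADOOP_HOME and $version are set for your installation; the backup directory and the source path of the patched jar are placeholders:
# (1) back up the old jar
mkdir -p /opt/backup && mv $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-$version.jar /opt/backup/
# (2) copy in the patched jar (built from a release later than 3.2.0)
cp /path/to/hadoop-yarn-server-resourcemanager-$version.jar $HADOOP_HOME/share/hadoop/yarn/
# (3) restart the ResourceManager (Hadoop 3.x daemon syntax)
yarn --daemon stop resourcemanager
yarn --daemon start resourcemanager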
For node labels, first configure yarn-site.xml:
yarn.node-labels.enabled=true
yarn.node-labels.fs-store.root-dir=hdfs://namenode:port/path/node-labels/
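The same two settings in yarn-site.xml form; the hdfs://namenode:port/path/node-labels/ URI is a placeholder to replace with a real HDFS path:
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:port/path/node-labels/</value>
</property>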
Then run on the master node:
## Add labels
yarn rmadmin -addToClusterNodeLabels "label_1(exclusive=true/false),label_2(exclusive=true/false)"
## exclusive defaults to true
## List labels
yarn cluster --list-node-labels
## Remove YARN node labels
yarn rmadmin -removeFromClusterNodeLabels "<label>[,<label>,...]"
## Add/replace node-to-label mappings
yarn rmadmin -replaceLabelsOnNode "node1[:port]=label1 node2=label2" [-failOnUnknownNodes]
## node1's address must match what the Nodes page of the YARN web UI shows
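A hypothetical end-to-end example; the label names test1/test2 follow the queue config below, and the NodeManager address emr-worker-1:45454 is a placeholder to take from the Nodes page of your own cluster:
yarn rmadmin -addToClusterNodeLabels "test1(exclusive=true),test2(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "emr-worker-1:45454=test1"
yarn cluster --list-node-labels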
Queue configuration (capacity-scheduler.xml)
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.25</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,a1,a2,a3</value>
    <description>The queues at the this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.accessible-node-labels</name>
    <value>test1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.accessible-node-labels</name>
    <value>test2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a3.accessible-node-labels</name>
    <value>test2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.accessible-node-labels.test1.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.accessible-node-labels.test2.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a3.accessible-node-labels.test2.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.capacity</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.capacity</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a3.capacity</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>-1</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>A list of mappings that will be used to assign jobs to queues. The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]* Typically this list will be used to map users to queues, for example, u:%user:%user maps all users to queues with the same name as the user.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false.</description>
  </property>
</configuration>
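If applications submitted to a1 should run on the test1 nodes without each user passing a label expression, a queue default can be added on top of the config above (a hedged addition; adjust queue and label names to your setup):
<property>
  <name>yarn.scheduler.capacity.root.a1.default-node-label-expression</name>
  <value>test1</value>
</property>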
Restart the YARN RM, then scale out the cluster as needed.
2020-09-17 13:12:31,660 ERROR org.mortbay.log: /ws/v1/cluster/apps
javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException
A user had a queue root.etl.streaming, but submitting with queue=root.etl.streaming failed with a no-such-queue error, and queue=etl.streaming failed the same way.
Solution: the queue parameter only takes the leaf queue name, so submitting with queue=streaming succeeded.
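For example (a sketch; the jar and class names are placeholders), only the leaf name is passed:
spark-submit --queue streaming --class com.example.App my-app.jar
# or, for a MapReduce job that uses ToolRunner:
hadoop jar my-job.jar com.example.MyJob -Dmapreduce.job.queuename=streaming <args>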
1) The diagnostics of applications stuck in ACCEPTED showed that the AM resource limit had been reached: yarn.scheduler.capacity.maximum-am-resource-percent (0.25 in the config above) can be raised appropriately, as shown in the snippet after this list.
2) Another case is that the queue itself is full. For example:
the queue total was 34000 with 32768 already in use, so the user's newly submitted jobs stayed in ACCEPTED.
Increase the queue capacity.
3) A user had two queues; the default queue was allocated 75% of resources with maximum-capacity set to 100%, yet once usage reached 75% new jobs stayed in ACCEPTED.
The cause is that yarn.scheduler.capacity.root.default.user-limit-factor defaults to 1, which means a single user can use at most the queue's configured capacity (75%). Setting user-limit-factor to 2 lets that user's jobs keep taking the idle memory beyond 75% (see the snippet after this list), or submitting the job as a different user also lets it use the idle memory.
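A hedged capacity-scheduler.xml sketch for cases 1) and 3); the values are illustrative and should be tuned to the cluster:
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>2</value>
</property>
After editing, refresh the queues without a restart: yarn rmadmin -refreshQueues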
Note that permissions (ACLs) must also be set on the root queue.
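A minimal sketch of what that can look like, assuming CapacityScheduler queue ACLs; the user name is hypothetical. Since child-queue ACLs are combined with their ancestors' and root defaults to *, restricting only a child queue has no effect unless root is restricted too:
<property>
  <name>yarn.scheduler.capacity.root.acl_submit_applications</name>
  <value> </value><!-- a single space means no one; list allowed users/groups instead -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
  <value>etl_user</value>
</property>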
To bring a decommissioned node back online:
1. Edit /etc/ecm/hadoop-conf/yarn.exclude and remove the decommissioned node's address
2. Run yarn rmadmin -refreshNodes on the header node
3. Start the NodeManager from the EMR console
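A combined sketch of the same steps, assuming the EMR/ECM paths above; starting the NM from the command line instead of the console uses the Hadoop 3.x daemon syntax:
# 1. remove the node's address from the exclude file
vi /etc/ecm/hadoop-conf/yarn.exclude
# 2. on the header node, make the RM re-read the include/exclude lists
yarn rmadmin -refreshNodes
# 3. on the re-added node, start the NodeManager (if not using the EMR console)
yarn --daemon start nodemanager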
In newer versions hadoop.http.authentication.simple.anonymous.allowed is false, so the YARN web UI has to be opened with ?user.name=xxxxxx appended to the URL, otherwise access is denied.
Alternatively, set hadoop.http.authentication.simple.anonymous.allowed back to true to allow anonymous access, then restart YARN.
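The corresponding core-site.xml setting (a sketch; the RM host below is a placeholder, 8088 is the default RM web port):
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>true</value>
</property>
With it left false, the UI can still be reached by appending the user, e.g. http://rm-host:8088/cluster?user.name=hadoop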
Problem: after the user restarted the RMs, both RMs stayed in standby state.
First, as the hadoop user, yarn rmadmin -transitionToActive rm1 --forcemanual was run to force rm1 to active, but it did not succeed.
The error pointed at ZooKeeper, yet zkCli.sh worked fine; public references suggested increasing the buffer limit with
-Djute.maxbuffer=10000000
and restarting the ZooKeeper nodes one by one, followers first.
The error persisted, so the same option was also added on the YARN side: in /var/lib/ecm-agent/cache/ecm/service/YARN/x.x.x.x.x/package/templates, edit yarn-env.sh and add YARN_OPTS="-Djute.maxbuffer=10000000".
YARN was restarted; the error persisted.
yarn resourcemanager -format-state-store was run to format the RM state store,
and it still failed, so the stuck application's znode had to be deleted from ZooKeeper by hand:
deleteall /rmstore/ZKRMStateRoot/RMAppRoot/application_1595251161356_11443
After deleting it, the RM came up successfully.
Solution: a job triggered the ResourceManager's ZooKeeper exception; after the ZooKeeper problem was fixed, the restarted RM got stuck on one abnormal application, 1595251161356_11443. In this situation the ZK state store can be cleared to recover quickly, but since only that one application was abnormal, its ZooKeeper data was cleaned up manually.
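A hedged recap of the two changes that actually mattered, using the paths and application id from the text above; the ZooKeeper env file name may differ between distributions:
# on each ZooKeeper server (restart followers first, leader last): zookeeper-env.sh / zkEnv.sh
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=10000000"
# on the ResourceManagers: yarn-env.sh
export YARN_OPTS="$YARN_OPTS -Djute.maxbuffer=10000000"
# then remove the stuck application's znode from zkCli.sh (ZooKeeper 3.5+; use rmr on 3.4)
deleteall /rmstore/ZKRMStateRoot/RMAppRoot/application_1595251161356_11443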
The root cause was that ACL permissions had been enabled.