YARN issues: yarn rmadmin -getAllServiceState

1. yarn rmadmin -getServiceState rm1    Check whether rm1 is active or standby

yarn rmadmin -transitionToStandby rm2 --forcemanual    Switch rm2 from active to standby
yarn rmadmin -transitionToActive rm1 --forcemanual    Switch rm1 from standby to active
yarn rmadmin -getServiceState rm1
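
The switch sequence above can be wrapped in a small helper. This is a minimal sketch rather than a definitive procedure: the function names run and rm_failover and the variables YARN_BIN and DRY_RUN are hypothetical, and rm1/rm2 are the RM IDs used in these notes. With DRY_RUN=1 the commands are only printed, which makes it easy to review them before touching a live cluster.

```shell
# Hypothetical helper: demote one RM, promote the other, then verify.
# YARN_BIN and DRY_RUN are illustrative names, not part of Hadoop.
YARN_BIN="${YARN_BIN:-yarn}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# usage: rm_failover <rm-to-demote> <rm-to-promote>
rm_failover() {
  if [ "$#" -ne 2 ]; then
    echo "usage: rm_failover <to-standby> <to-active>" >&2
    return 2
  fi
  run "$YARN_BIN" rmadmin -transitionToStandby "$1" --forcemanual || return 1
  run "$YARN_BIN" rmadmin -transitionToActive "$2" --forcemanual || return 1
  run "$YARN_BIN" rmadmin -getServiceState "$2"
}
```

Usage: set DRY_RUN=1 and call rm_failover rm2 rm1 to preview the three commands; unset DRY_RUN to run them. Note that --forcemanual normally asks for interactive confirmation.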

1. YARN hangs (appears dead) and the log keeps printing the following message: log aggregation have not finished yet

Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, but not removing app application_1618886060273_3657 from state store as log aggregation have not finished yet
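To fix it, patch the standby RM first, then the active RM. The replacement steps on each RM node are:
(1) Back up the old jar with mv: mv $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-$version.jar <backup path>
(2) Copy in the new jar: download a hadoop-yarn-server-resourcemanager-$version.jar from a release newer than 3.2.0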
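
To see which applications are being held in the state store, the application IDs can be pulled out of the repeated log message. A minimal sketch, assuming the message format shown above; the function name stuck_apps is hypothetical, and it relies on grep -o, which GNU and BSD grep provide.

```shell
# Hypothetical helper: list the distinct application IDs that the RM refuses
# to remove because log aggregation has not finished. Reads log text on stdin.
stuck_apps() {
  grep 'log aggregation have not finished yet' \
    | grep -oE 'not removing app application_[0-9]+_[0-9]+' \
    | grep -oE 'application_[0-9]+_[0-9]+' \
    | sort -u
}
```

Usage: stuck_apps < "$RM_LOG", where RM_LOG is a placeholder for the ResourceManager log file on your node.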



## Add YARN node labels
yarn rmadmin -addToClusterNodeLabels "label_1(exclusive=true/false),label_2(exclusive=true/false)"
## exclusive defaults to true
## List YARN node labels
yarn cluster --list-node-labels
## Remove YARN node labels
yarn rmadmin -removeFromClusterNodeLabels "<label>[,<label>,...]"
## Assign labels to nodes
yarn rmadmin -replaceLabelsOnNode "node1[:port]=label1 node2=label2" [-failOnUnknownNodes]
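
The quoted argument to -addToClusterNodeLabels is easy to mistype, so it can help to build it from simple name:exclusivity pairs. A minimal sketch; the function name label_spec and its name:true|false input format are hypothetical conveniences, not part of the YARN CLI.

```shell
# Hypothetical helper: turn "name:true|false" pairs into the argument string
# expected by `yarn rmadmin -addToClusterNodeLabels`.
label_spec() {
  spec=""
  for pair in "$@"; do
    name="${pair%%:*}"
    excl="${pair#*:}"
    # A bare name (no ":...") falls back to the default, exclusive=true.
    [ "$excl" = "$pair" ] && excl="true"
    case "$excl" in
      true|false) ;;
      *) echo "bad exclusivity for $name: $excl" >&2; return 1 ;;
    esac
    spec="${spec:+$spec,}${name}(exclusive=${excl})"
  done
  printf '%s\n' "$spec"
}
```

Usage: yarn rmadmin -addToClusterNodeLabels "$(label_spec label_1:true label_2:false)"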
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <description>Maximum number of applications that can be pending and running.</description>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <description>Maximum percent of resources in the cluster which can be used to run application masters, i.e. controls the number of concurrent running applications.</description>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default, i.e. DefaultResourceCalculator, only uses Memory, while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
    <name>yarn.scheduler.capacity.root.queues</name>
    <description>The queues at this level (root is the root queue).</description>

    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to the number of nodes in the cluster.</description>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <description>A list of mappings that will be used to assign jobs to queues. The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]* Typically this list will be used to map users to queues, for example, u:%user:%user maps all users to queues with the same name as the user.</description>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <description>If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false.</description>
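
The queue-mapping syntax quoted above, [u|g]:[name]:[queue_name][,next mapping]*, can be checked before it goes into the config. A minimal sketch: the function name valid_queue_mappings is hypothetical, and it only checks the shape of the string, not whether the users, groups, or queues exist. Some Hadoop releases may accept additional forms that this pattern rejects.

```shell
# Hypothetical check: does a string follow the
# [u|g]:[name]:[queue_name][,next mapping]* shape?
# Returns 0 when it does, non-zero otherwise.
valid_queue_mappings() {
  printf '%s\n' "$1" \
    | grep -Eq '^[ug]:[^:,]+:[^:,]+(,[ug]:[^:,]+:[^:,]+)*$'
}
```

Usage: valid_queue_mappings 'u:%user:%user' && echo "mapping looks well-formed"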

2020-09-17 13:12:31,660 ERROR org.mortbay.log: /ws/v1/cluster/apps
javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException

  • with linked exception:
    [javax.xml.stream.XMLStreamException: org.mortbay.jetty.EofException]
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:159)
    at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:142)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
    at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
    at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
    at org.codehaus.jackson.impl.WriterBasedGenerator._flushBuffer(WriterBasedGenerator.java:1812)
    at org.codehaus.jackson.impl.WriterBasedGenerator._writeString(WriterBasedGenerator.java:987)
    at org.codehaus.jackson.impl.WriterBasedGenerator._writeFieldName(WriterBasedGenerator.java:328)
    at org.codehaus.jackson.impl.WriterBasedGenerator.writeFieldName(WriterBasedGenerator.java:197)
    at com.sun.jersey.json.impl.writer.JacksonStringMergingGenerator.writeFieldName(JacksonStringMergingGenerator.java:140)
    at com.sun.jersey.json.impl.writer.Stax2JacksonWriter.writeStartElement(Stax2JacksonWriter.java:183)
    … 63 more
    Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
2. On the header node, run yarn rmadmin -refreshNodes

8. curl http://hadoop:8088/cluster reports a permission error




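First, as the hadoop user, run yarn rmadmin -transitionToActive rm1 --forcemanual to force rm1 to active. This did not succeed.
It still failed, so the ZooKeeper client buffer limit was also raised in the YARN environment: find yarn-env.sh under /var/lib/ecm-agent/cache/ecm/service/YARN/x.x.x.x.x/package/templates and set YARN_OPTS="-Djute.maxbuffer=10000000"

Use yarn resourcemanager -format-state-store to format the RM state store. This clears all stored application state and should only be run while the ResourceManager is stopped.

deleteall /rmstore/ZKRMStateRoot/RMAppRoot/application_1595251161356_11443
Resolution: a job triggered a ZooKeeper exception in the ResourceManager. After the ZooKeeper problem was fixed, the RM restart got stuck on one bad application, 1595251161356_11443. In this situation, clearing the ZK state store gives a quick recovery, but that step hit an exception, so the bad application's ZK data was removed by hand.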
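
Because a mistyped application ID makes deleteall point at the wrong znode, the command can be generated and validated first. A minimal sketch: the function name rm_app_delete_cmd and the variable ZK_APP_ROOT are hypothetical, and the root path is the one used in these notes. The output is meant to be pasted into zkCli, not run by the shell.

```shell
# Hypothetical helper: print the zkCli command that removes one application's
# znode. The root path is the one used in these notes; adjust it if your
# yarn.resourcemanager.zk-state-store.parent-path differs.
ZK_APP_ROOT="${ZK_APP_ROOT:-/rmstore/ZKRMStateRoot/RMAppRoot}"

rm_app_delete_cmd() {
  # Accept the ID with or without the "application_" prefix.
  case "$1" in
    application_*) app="$1" ;;
    *) app="application_$1" ;;
  esac
  # Reject anything that is not application_<clusterTimestamp>_<sequence>.
  if ! printf '%s\n' "$app" | grep -Eq '^application_[0-9]+_[0-9]+$'; then
    echo "not a valid application ID: $1" >&2
    return 1
  fi
  echo "deleteall ${ZK_APP_ROOT}/${app}"
}
```

Usage: rm_app_delete_cmd 1595251161356_11443 prints the deleteall line for that application.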

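10. With Spark on YARN, the vcore count shown for a submitted job does not match the value passed at submission

DefaultResourceCalculator only takes memory into account, so requested vcores are not reflected. Change yarn.scheduler.capacity.resource-calculator from org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator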
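
To confirm which calculator a node is actually configured with, the value can be read back from capacity-scheduler.xml. A minimal sketch: the function name hadoop_prop is hypothetical, and it assumes each <name> and <value> sits on its own line with <value> directly after <name>, as in the stock config files. It is a text match, not a real XML parser.

```shell
# Hypothetical helper: print the <value> of one property from a Hadoop XML
# config file. usage: hadoop_prop <file> <property-name>
hadoop_prop() {
  # -F treats the property name literally (dots are not wildcards);
  # -A1 also prints the line after the match, where <value> is assumed to be.
  grep -F -A1 "<name>$2</name>" "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}
```

Usage: hadoop_prop "$HADOOP_CONF_DIR/capacity-scheduler.xml" yarn.scheduler.capacity.resource-calculator, where HADOOP_CONF_DIR is assumed to point at your configuration directory.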

11. Viewing logs in YARN fails with: User [dr.who] is not authorized to view the logs for container_1623238721588_0301_02_000001 in log file


