A Spark job failed; when it was re-run, the Spark UI showed a large number of failed tasks, and after the run finished the YARN UI reported the following error:
- User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure:
- ShuffleMapStage 1 (javaRDD at SumDeliveryIndexFactory.java:628) has failed the maximum allowable
- number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException:
- Failed to connect to xxxx/10.136.22.22:34192
From the error, "Failed to connect to xxxx/10.136.22.22:34192" means the connection to 10.136.22.22 failed. Logging into host 10.136.22.22 and checking its NodeManager log turned up this message: running beyond physical memory limits. Killing container. In other words, the container's physical memory usage exceeded its limit, so YARN forcibly killed it; with the executor in that container gone, its shuffle output could no longer be fetched, which is what surfaced as the FetchFailedException.
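For reference, a minimal sketch of that check on the problem host; the NodeManager log path here is an assumption and varies by Hadoop distribution:
- ssh 10.136.22.22
- grep -i "beyond physical memory" /var/log/hadoop-yarn/*nodemanager*.log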
Fix: pass an extra parameter to spark-submit to increase the executor memory overhead, e.g. spark.yarn.executor.memoryOverhead=4G.
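A minimal sketch of the corresponding submit command, assuming YARN cluster mode; the --executor-memory value, class name, and jar name are placeholders, not taken from the original job:
- spark-submit \
-     --master yarn \
-     --deploy-mode cluster \
-     --executor-memory 36G \
-     --conf spark.yarn.executor.memoryOverhead=4096 \
-     --class com.example.SumDeliveryIndexJob \
-     sum-delivery-index.jar
On older Spark versions spark.yarn.executor.memoryOverhead is read as a plain number of megabytes (hence 4096 rather than 4G); newer releases deprecate it in favor of spark.executor.memoryOverhead, which also accepts size suffixes such as 4g. Keep executor memory plus overhead within yarn.scheduler.maximum-allocation-mb, otherwise YARN will refuse the container request.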
Error log:
- 2017-11-14 11:33:07,273 INFO
- org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory
- usage of ProcessTree 236569 for container-id container_e31_1510205192678_147416_02_000024:
- 40.7 GB of 40 GB physical memory used; 41.9 GB of 84 GB virtual memory used
- 2017-11-14 11:33:07,273 WARN
- org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process
- tree for container: container_e31_1510205192678_147416_02_000024 has processes older than 1
- iteration running over the configured limit. Limit=42949672960, current usage = 43653300224
- 2017-11-14 11:33:07,274 WARN
- org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
- Container [pid=236569,containerID=container_e31_1510205192678_147416_02_000024] is running beyond
- physical memory limits. Current usage: 40.7 GB of 40 GB physical memory used; 41.9 GB of 84 GB
- virtual memory used. Killing container.
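To see why the 40 GB limit in the log lines up with a too-small default overhead, here is a hedged back-of-the-envelope calculation; the 36G executor memory is an assumption for illustration, not a value from the original job (Spark's default overhead is max(384 MB, 10% of executor memory)):
- executor memory (assumed)   : 36 GB
- default memoryOverhead      : max(384 MB, 0.10 × 36 GB) ≈ 3.6 GB
- requested container size    : 36 GB + 3.6 GB ≈ 40 GB (the "40 GB physical memory" limit in the log)
- actual process-tree usage   : 40.7 GB > 40 GB → NodeManager kills the container
Raising spark.yarn.executor.memoryOverhead enlarges the requested container so that the off-heap usage (shuffle network buffers, JVM overhead, etc.) no longer pushes the process over the container's physical memory limit.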