2023-09-13 19:57:28,125 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 61275 (type=CHECKPOINT) @ 1694606247915 for job 9a6f6c003e8eb3edf8cea8b3b0966456.
2023-09-13 19:57:29,391 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 61275 for job 9a6f6c003e8eb3edf8cea8b3b0966456 (27026 bytes in 1018 ms).
2023-09-13 19:59:28,064 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 61276 (type=CHECKPOINT) @ 1694606367915 for job 9a6f6c003e8eb3edf8cea8b3b0966456.
2023-09-13 19:59:48,200 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - PartitionCommitter -> Sink: end (1/1) (cf1974ce54d8116af731fb3838552db6) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000010 @ dn05xxx-xxx16.xxx.com (dataPort=43355).
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_e38_1686722180292_14272_01_000010(dn05xxx-xxx16.xxx.com:8041) timed out.
One machine in the cluster, dn05, had been temporarily decommissioned by the NameNode due to overload. Because dn05 then remained unreachable, the job was eventually CANCELED.
# Restart begins
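For context: the TimeoutException above comes from Flink's heartbeat mechanism, where the JobMaster and ResourceManager probe each TaskManager every heartbeat.interval (10 s by default) and declare it lost after heartbeat.timeout (50 s by default) passes without a response. Below is a minimal sketch of loosening the timeout for a cluster whose nodes flap like this one; the option keys are standard Flink options, but the 180 s value is an assumption, and in a YARN deployment these would normally be set in flink-conf.yaml or via -D at submission rather than in code.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HeartbeatTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Probe TaskManagers every 10 s (the default value).
        conf.setString("heartbeat.interval", "10000");
        // Tolerate up to 3 minutes without a response before the
        // "Heartbeat of TaskManager ... timed out" failure fires
        // (assumed value; the default is 50 s).
        conf.setString("heartbeat.timeout", "180000");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... build the INSERT INTO pipeline and execute as usual
    }
}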
Job default: INSERT INTO ......(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RESTARTING to RUNNING.
2023-09-13 20:00:11,061 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 9a6f6c003e8eb3edf8cea8b3b0966456 from Checkpoint 61275 @ 1694606247915 for 9a6f6c003e8eb3edf8cea8b3b0966456 located at hdfs://nameservicexxx/user/xxx/realdb/flink/checkpoint/9a6f6c003e8eb3edf8cea8b3b0966456/chk-61275
streaming-writer (1/6) (07c19cfa57ae78887325b3fecc4547e4) switched from SCHEDULED to DEPLOYING.
.......
streaming-writer (1/6) (attempt #1) with attempt id 9bce97d536faba6a676755deeaa718c7 to container_e38_1686722180292_14272_01_000009 @ dn05xxx-xxx16.xxx.com (dataPort=43151) with allocation id 95f6802610ca3892f9543af43c04901d
streaming-writer (2/6) (attempt #1) with attempt id 9b73205a06f56376b8902462202cb34d to container_e38_1686722180292_14272_01_000012 @ dn06xxx-xxx17.xxx.com (dataPort=43436) with allocation id a14b93ef41a0d5b2746c2a5a6523cf89
streaming-writer (3/6) (attempt #1) with attempt id 4f1e2772ed77566a31a20fc93c51b2da to container_e38_1686722180292_14272_01_000013 @ dn06xxx-xxx17.xxx.com (dataPort=36781) with allocation id ddbe6ee511eba735866c63ff5599556a
streaming-writer (4/6) (attempt #1) with attempt id 655f537c478fd2524671d686630880bc to container_e38_1686722180292_14272_01_000011 @ dn05xxx-xxx16.xxx.com (dataPort=42400) with allocation id b3b7cecc550fa8597f79e42479b82fd
streaming-writer (5/6) (attempt #1) with attempt id dec418b041d5123110ee03fa338d5694 to container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219) with allocation id f3d715629bdf509204f45a96dd0e6e09
streaming-writer (6/6) (attempt #1) with attempt id 1d074d7cf6c33330fddb3691b7bc365d to container_e38_1686722180292_14272_01_000014 @ dn06xxx-xxx17.xxx.com (dataPort=43549) with allocation id 0c3d867bb2ebafac0bc492ce467162f1
The six TaskManager containers were deployed on only three machines [dn05, dn06, dn08], while the cluster has 9 DataNodes, which shows how heavily loaded the cluster was at that point. (Checking the NodeManager logs afterwards showed that dn05 had been overloaded since 19:59 and was temporarily decommissioned; during that time the DataNode could not serve reads or writes, but from YARN's point of view its compute resources could still be requested. [The decommission is meant to prevent data loss if the DataNode crashes; the machine comes back online once its load drops, but this also caused nodes to flap on and off frequently.])
streaming-writer (1/6) (9bce97d536faba6a676755deeaa718c7) switched from DEPLOYING to RUNNING
......
streaming-writer (5/6) (dec418b041d5123110ee03fa338d5694) switched from DEPLOYING to RUNNING.
# After this, dn05 was taken offline again (and came back shortly afterwards; this happened frequently, probably because cluster resources were so tight at the time that no other machines were available for allocation).
2023-09-13 20:00:18,266 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: dn05xxx-xxx16.xxx.com/172.40.13.17:34930 【Note: no "TM terminated" exception appeared, i.e. this was not a fatal error and dn05 came back online in time】
....
# New resources were requested to recover the two TaskManagers lost on dn05: one was provided directly by the redundant mechanism (priority 1), the other was allocated by YARN; unfortunately the newly requested worker again landed on dn05 (resources were tight and the node kept flapping). A config sketch follows the log excerpts below.
First TM: 【org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemSize=1.340gb (1438814063 bytes)}, current pending count: 1】
Second TM: 【org.apache.flink.yarn.YarnResourceManagerDriver [] - Requesting new TaskExecutor container with resource TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes)}, priority 1】
# The first TM was again placed on dn05.
2023-09-13 20:00:30,697 INFO org.apache.flink.yarn.YarnResourceManagerDriver [] - TaskExecutor container_e40_1686722180292_14272_01_000001(dn05xxx-xxx16.xxx.com:8041) will be started on dn05xxx-xxx16.xxx.com with TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes)}.
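The "redundant" worker mentioned above is the SlotManager keeping spare TaskManagers registered so that a lost worker can be replaced immediately instead of waiting for a fresh YARN allocation. A hedged sketch of the relevant knob, slotmanager.redundant-taskmanager-num (0 by default); a value of 1 matches what the log suggests was in effect here, but that is an inference, not something read from the job's actual configuration.

import org.apache.flink.configuration.Configuration;

public class RedundantTaskManagerSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Keep one spare TaskManager registered at all times; when a worker
        // such as the dn05 containers is lost, its tasks can be rescheduled
        // onto the spare immediately while YARN provisions a replacement.
        conf.setInteger("slotmanager.redundant-taskmanager-num", 1);
        return conf;
    }
}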
......
# dn08 was at its load limit and ran out of memory: Failed to create Hive RecordWriter. This was a fatal, unrecoverable error, and the job prepared to restart. (A memory-tuning sketch follows the log excerpt.)
streaming-writer (5/6) (dec418b041d5123110ee03fa338d5694) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219).
org.apache.flink.connectors.hive.FlinkHiveException: org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive RecordWrite....Caused by: java.lang.OutOfMemoryError: Java heap space
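The OutOfMemoryError above is plain Java-heap exhaustion inside the TaskManager while the Hive RecordWriter was being created, on a node already at its memory limit. The usual lever on the Flink side is to give each TaskManager more task heap (the specs above show taskHeapSize=1.425gb). A sketch with assumed sizes; again, these would normally go into flink-conf.yaml or -D arguments at submission, not application code.

import org.apache.flink.configuration.Configuration;

public class WriterMemorySketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Give user code (including the Hive RecordWriter buffers) more heap
        // than the 1.425 GB seen in the TaskExecutorProcessSpec above.
        // 2560m is an assumed value, not taken from the incident.
        conf.setString("taskmanager.memory.task.heap.size", "2560m");
        // Setting managed memory explicitly alongside task heap keeps the
        // memory model consistent; the container size grows accordingly.
        conf.setString("taskmanager.memory.managed.size", "1g");
        return conf;
    }
}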
# Tasks are canceled
Job default: INSERT INTO ...... (9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RUNNING to RESTARTING. ......
(4/6) (4842bb76778fd3cfe11805b5a3e033cb) switched from RUNNING to CANCELING. ......multiple such lines......
(5/6) (34c88ea306caca716e8a161184128f48) switched from CANCELING to CANCELED. ......multiple such lines......
Job default: INSERT INTO ......(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RESTARTING to RUNNING.
.....chk-61275....
...(3/6) (58ab717d3d93694434ab3c4a5d31b4ab) switched from CREATED to SCHEDULED....
...streaming-writer (3/6) (58ab717d3d93694434ab3c4a5d31b4ab) switched from SCHEDULED to DEPLOYING....
...streaming-writer (3/6) (attempt #2) with attempt id 58ab717d3d93694434ab3c4a5d31b4ab to container_e38_1686722180292_14272_01_000013@ dn06xxx-xxx17.xxx.com (dataPort=36781) with allocation id ddbe6ee511eba735866c63ff5599556a...
...switched from DEPLOYING to RUNNING...
# The deployment on dn08 failed with another fatal error once running, and another restart was prepared
streaming-writer (5/6) (8312252753df1bfe91bb3d8165fff4ec) switched from DEPLOYING to RUNNING
compact-operator (5/6) (83b87a0cbda2991a416d7c423f82be94) switched from DEPLOYING to RUNNING
【streaming-writer (5/6) (8312252753df1bfe91bb3d8165fff4ec) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219).】
org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive RecordWriter
Caused by: java.lang.OutOfMemoryError: Java heap space
Job default: INSERT INTO ...(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RUNNING to RESTARTING.
Tasks canceled;
# The third restart begins; the tasks are still deployed on the same three machines dn05, dn06, dn08.
switched from state RESTARTING to RUNNING;
chk-61275 restore;
switched from CREATED to SCHEDULED;
switched from SCHEDULED to DEPLOYING;
switched from DEPLOYING to RUNNING.
When it reached dn08, another OOM occurred;
# From here on, the job went through essentially the same restart cycle for the remaining configured attempts, failing each time once the TM deployed on dn08 started running;
# At attempt #24, the deployment still used the same three machines dn05, dn06, dn08; cluster resources were still under heavy load (20:38)
# At attempt #25, the deployment still used the same three machines dn05, dn06, dn08; cluster resources were still under heavy load (20:39);
In this attempt, dn06 was taken offline after (3/6) (2ffc15ce8f313ff5d65595ef01b28443) switched from DEPLOYING to RUNNING;
Association with remote system [akka.tcp://flink@dn06xxx-xxx17.xxx.com:46827] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2023-09-13 20:39:57,921 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_e38_1686722180292_14272_01_000013 is terminated. Diagnostics: [2023-09-13 20:39:57.166]Container exited with a non-zero exit code 239
[2023-09-13 20:39:57.166]Container exited with a non-zero exit code 239
# A fatal "TM terminated" exception occurred, and a restart was prepared. A likely reason the dn06 TM was not recovered in place: dn06 never came back online, so the latest state data could not be recovered; it could only be restored from checkpoints, which means recovery had to go through a restart.
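This is the expected recovery path: once a TaskManager is terminated its in-flight state is gone, and the tasks can only be rolled back to the last completed checkpoint (chk-61275 here) on HDFS. A minimal sketch of the checkpointing setup these logs imply; the 2-minute interval matches the trigger timestamps at the top, while the retention setting is an assumption.

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 2 minutes, matching the 19:57 / 19:59 trigger logs.
        env.enableCheckpointing(120_000L);

        // Keep completed checkpoints on cancellation/failure so the job can be
        // restored from chk-NNNNN directories like the one in the restore log.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // state.checkpoints.dir (the hdfs://.../flink/checkpoint/... path above)
        // is assumed to be set in flink-conf.yaml for this job.
    }
}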
# Approximate window of the dn08 anomaly: 19:59 ~ 20:48; when it returned to normal was not investigated;
# Approximate window of the dn06 anomaly: 20:39 (uncertain) ~ 20:48; when it returned to normal was not investigated; the node flapped on and off frequently during this period;
# Approximate window of the dn05 anomaly: 20:00 ~ 20:48; when it returned to normal was not investigated; the node flapped on and off frequently during this period;
Restart 1:
Cause: the checkpoint failed and a fatal exception (TaskManager heartbeat timeout) forced a restart;
Symptoms: dn05 was briefly taken offline and then came back; dn06 normal; dn08 abnormal.
Restart 2:
Cause: OOM on dn08; the fatal exception forced a restart;
Symptoms: dn05 and dn06 normal; dn08 abnormal.
Restarts 3 through 24: 【20:06 ~ 20:38】
Cause: OOM on dn08; the fatal exception forced a restart;
Symptoms: dn05 and dn06 normal; dn08 abnormal.
Restart 25:
Cause: dn08 hit an OOM during restart 24; the fatal exception forced another restart;
Symptoms: dn05 normal; dn08 abnormal; dn06 went offline and stayed down long enough to exceed the TM recovery timeout.
Restart 26:
Cause: dn06 stayed offline for an extended period during restart 25; the fatal exception forced another restart;
Symptoms: dn05 normal; dn08 abnormal; dn06 possibly abnormal: at one DEPLOYING-to-RUNNING transition two new workers were requested (redundant: 1 [container_e40_1686722180292_14272_01_000002]; dn05: 1);
Restart 27:
Cause: dn08 hit an OOM during restart 26; the fatal exception forced another restart;
Symptoms: dn05 and dn06 normal; dn08 abnormal;
Restart 28:
Cause: dn08 hit an OOM during restart 27; the fatal exception forced another restart;
Symptoms: dn05 and dn06 normal; dn08 abnormal;
Restart 29:
Cause: dn08 hit an OOM during restart 28; the fatal exception forced another restart;
Symptoms: dn05 went offline and was lost for an extended period (a fatal exception), so the TM could not be recovered and the next restart began; the state of dn06 and dn08 is unknown;
Final restart: 【the configured limit of 30 attempts was reached; see the restart-strategy sketch at the end】
Cause: dn05 stayed offline for an extended period during restart 29; the fatal exception forced another restart;
Symptoms: dn08 abnormal (OOM); dn05 possibly normal; dn06 abnormal or offline;
At one point two TMs (for dn06) were re-requested: container_e40_1686722180292_14272_01_000003 (dn10xxx-xxx20.xxx.com:8041) -> allocated by YARN.
Outcome:
Job default: INSERT INTO ... (9a6f6c003e8eb3edf8cea8b3b0966456) switched from state FAILING to FAILED
Shutting down...
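The job ultimately failed because the fixed-delay restart budget of 30 attempts was exhausted in roughly 40 minutes while dn08 kept running out of memory and dn05/dn06 kept flapping, with every attempt landing on the same overloaded nodes. A hedged sketch of the restart-strategy knobs involved; the original job was presumably configured through flink-conf.yaml or SQL-client SET statements, and the delays below are assumptions, not the incident's actual settings.

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fixed delay: still cap at 30 attempts, but wait 5 minutes between
        // them so a temporarily decommissioned DataNode has time to come back.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(30, Time.of(5, TimeUnit.MINUTES)));

        // Alternative: a failure-rate strategy, which only fails the job when
        // more than 3 failures occur within a 10-minute window.
        // env.setRestartStrategy(RestartStrategies.failureRateRestart(
        //         3, Time.of(10, TimeUnit.MINUTES), Time.of(1, TimeUnit.MINUTES)));
    }
}

In flink-conf.yaml the same settings map to restart-strategy: fixed-delay, restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay (or the restart-strategy.failure-rate.* keys).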