Some clues about k8s tasks that stay stuck in Terminating: failed to exit within 30 seconds of signal 15 - us

Author: 花生_TL007 | 2024-06-07 07:18:57

failed to exit within 30 seconds of signal 15 - using the force
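The message describes a two-stage stop: signal 15 (SIGTERM) is sent first, and if the process is still alive after the grace period, a forced kill (SIGKILL) follows. The 30 seconds likely corresponds to the default Kubernetes grace period, though the post does not say how it was configured. Below is a minimal Python sketch of that escalation pattern, not docker's actual implementation; `stop_with_escalation`, `demo`, and the child script are hypothetical names used for illustration only.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc: subprocess.Popen, grace_seconds: float) -> bool:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns True if the forced kill was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX systems
        proc.wait()
        return True


# Hypothetical child process that ignores SIGTERM, so the grace period expires.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def demo() -> bool:
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    forced = stop_with_escalation(proc, grace_seconds=1.0)
    proc.stdout.close()
    return forced
```

Note that even SIGKILL only takes effect once a process leaves uninterruptible sleep, so a process stuck in the D state can survive both stages.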

Where the problem comes from:

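On the cluster I work on, k8s Deployments and Jobs frequently end up with tasks stuck in Terminating. Tracing it back, the cause turned out to be a process inside the container, on the node hosting the pod (Ubuntu 16.04.6), that could not be killed. The process is in the D state (uninterruptible sleep), which makes it hard to deal with.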
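To check whether a node currently has such processes, one option is to read the state field of `/proc/<pid>/stat`. The sketch below is a hedged illustration, not the procedure used in this post; `parse_stat` and `find_d_state` are hypothetical helper names, and it assumes a Linux node with a standard `/proc`.

```python
from pathlib import Path


def parse_stat(stat_text: str):
    """Return (pid, comm, state) from the content of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' to find where the remaining fields start.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def find_d_state(proc_root: str = "/proc"):
    """List (pid, comm) for every process in state D (uninterruptible sleep)."""
    results = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            text = (entry / "stat").read_text(errors="replace")
            pid, comm, state = parse_stat(text)
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
        if state == "D":
            results.append((pid, comm))
    return results


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process that shows up here repeatedly over several scans is a candidate for the kind of unkillable process described above; a single sighting can just be ordinary short-lived disk I/O.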

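The logs on the node hosting the pod show the following:

1. A large number of OOM records

2. syslog (dmesg too) frequently reports SLUB errors. (From what I later read online, this message comes from a system bug, but it is not the origin of the problem described in this post.)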

SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)

3. The docker logs show:


Aug 26 21:12:22 n002 dockerd[1632]: time="2020-08-26T21:12:22.358239959+08:00" level=info msg="Container b39ef98d452cd825cd6ab4e07767b5e8091d055e75e9a7b96ba83ba9c4ac2089 failed to exit within 30 seconds of signal 15 - using the force"
level=info msg="Container b39ef98d452c failed to exit within 10 seconds of kill - trying direct SIGKILL"

4. dmesg contains a large number of NFS retry messages:


kernel: [1032289.079654] nfs: server 10.32.0.10 not responding, still trying
kernel: [1032289.079664] nfs: server 10.32.0.10 not responding, still trying
kernel: [1032289.151627] nfs: server 10.32.0.10 not responding, still trying
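The symptoms listed above can be counted across a log file to see which containers and which NFS servers are involved. The following is a rough sketch that assumes the log lines look like the excerpts quoted in this post; `summarize` and `SAMPLE` are hypothetical names, and the patterns may need adjusting for other docker or kernel versions. One plausible link, which the post itself does not confirm, is that a process blocked on I/O against an unresponsive NFS server on a hard mount sits in the D state and therefore cannot be killed.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts quoted above.
DOCKER_FORCE = re.compile(
    r"Container (?P<cid>[0-9a-f]{12,64}) failed to exit within "
    r"(?P<secs>\d+) seconds of (?:signal \d+|kill)"
)
NFS_RETRY = re.compile(r"nfs: server (?P<server>\S+) not responding, still trying")
SLUB_FAIL = re.compile(r"SLUB: Unable to allocate memory on node")


def summarize(lines):
    """Count stuck containers, unresponsive NFS servers, and SLUB failures."""
    stuck = Counter()
    nfs = Counter()
    slub = 0
    for line in lines:
        m = DOCKER_FORCE.search(line)
        if m:
            stuck[m.group("cid")[:12]] += 1  # key on the short container ID
            continue
        m = NFS_RETRY.search(line)
        if m:
            nfs[m.group("server")] += 1
            continue
        if SLUB_FAIL.search(line):
            slub += 1
    return {"stuck_containers": stuck, "nfs_servers": nfs, "slub_failures": slub}


# Sample input copied from the excerpts in this post.
SAMPLE = [
    'SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)',
    'Aug 26 21:12:22 n002 dockerd[1632]: time="2020-08-26T21:12:22.358239959+08:00" level=info msg="Container b39ef98d452cd825cd6ab4e07767b5e8091d055e75e9a7b96ba83ba9c4ac2089 failed to exit within 30 seconds of signal 15 - using the force"',
    'level=info msg="Container b39ef98d452c failed to exit within 10 seconds of kill - trying direct SIGKILL"',
    'kernel: [1032289.079654] nfs: server 10.32.0.10 not responding, still trying',
    'kernel: [1032289.079664] nfs: server 10.32.0.10 not responding, still trying',
    'kernel: [1032289.151627] nfs: server 10.32.0.10 not responding, still trying',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If the containers that fail to stop all mount volumes from the NFS server that appears in the retry messages, that is a reasonable place to start looking.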

Added 2020-12-13 01:00:50. Going to bed for now; I will fill in more later. Feel free to get in touch if you have questions:

https://k8s.imroc.io/avoid/handle-cgroup-oom-in-userspace-with-oom-guard/

https://k8s.imroc.io/troubleshooting/pod/slow-terminating/

https://www.cnblogs.com/jmliao/p/11322804.html
