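Problem background:
On the cluster I work with, Pods belonging to k8s Deployments and Jobs frequently get stuck in the Terminating state. Tracing this back, the cause turned out to be a process inside the container, on the node hosting the Pod (Ubuntu 16.04.6), that could not be killed. The process was in the D state (uninterruptible sleep), which makes it hard to deal with: a process in this state does not act on signals, SIGKILL included, until the kernel operation it is blocked on returns.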
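As a rough way to confirm this on a node, the sketch below lists processes currently in the D state by reading `/proc/<pid>/stat`. The function names `parse_stat` and `list_d_state` are illustrative and not from the original investigation; the script assumes a Linux `/proc` filesystem and only reports the processes, it does not try to kill them.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def list_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in list_d_state():
        print(pid, comm)
```

Run it on the node itself (not inside the Pod) so that processes from all containers are visible. A process that shows up on every run is a likely candidate for what is blocking container teardown.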
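The logs on the node hosting the Pod showed the following:
1. A large number of OOM records
2. Frequent SLUB messages in syslog (and likewise in dmesg). (After some searching online, I found that although this message comes from a system bug, it is not the origin of the problem described in this post.)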
SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
3. The docker logs show:
- Aug 26 21:12:22 n002 dockerd[1632]: time="2020-08-26T21:12:22.358239959+08:00" level=info msg="Container b39ef98d452cd825cd6ab4e07767b5e8091d055e75e9a7b96ba83ba9c4ac2089 failed to exit within 30 seconds of signal 15 - using the force"
- level=info msg="Container b39ef98d452c failed to exit within 10 seconds of kill - trying direct SIGKILL"
4. A large number of NFS retry messages in dmesg:
- kernel: [1032289.079654] nfs: server 10.32.0.10 not responding, still trying
- kernel: [1032289.079664] nfs: server 10.32.0.10 not responding, still trying
- kernel: [1032289.151627] nfs: server 10.32.0.10 not responding, still trying
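To check whether another node shows the same four symptoms, a small script can count the matching lines in its logs. This is a minimal sketch: the SLUB, docker, and NFS patterns are taken from the log lines quoted above, while the OOM pattern is an assumption, since the exact OOM message text is not shown in this post. The names `PATTERNS` and `count_signatures` are hypothetical.

```python
import re
import sys
from collections import Counter

PATTERNS = {
    # Assumed wording; the original post does not quote its OOM lines.
    "oom": re.compile(r"oom-killer|out of memory", re.IGNORECASE),
    "slub": re.compile(r"SLUB: Unable to allocate memory on node"),
    "docker_kill": re.compile(r"failed to exit within \d+ seconds of"),
    "nfs_retry": re.compile(r"nfs: server \S+ not responding, still trying"),
}


def count_signatures(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Reads log text from standard input.
    for name, n in sorted(count_signatures(sys.stdin).items()):
        print(name, n)
```

For example, piping the output of `dmesg` or the contents of the syslog file into the script gives one count per symptom. A high `nfs_retry` count alongside `docker_kill` entries matches the pattern described here.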
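Update (2020-12-13 01:00:50): adding the links below for now. I'm off to bed and will fill in more later; questions and discussion are welcome.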
https://k8s.imroc.io/avoid/handle-cgroup-oom-in-userspace-with-oom-guard/
https://k8s.imroc.io/troubleshooting/pod/slow-terminating/
https://www.cnblogs.com/jmliao/p/11322804.html