Power loss, manual reboot, operator error, machine hang: what happens to the BlueKing processes?
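This article builds on "How to install bk_nodeman, one of BlueKing's saas-o apps" (《如何安装蓝鲸的saas-o之bk_nodeman?》):
https://blog.csdn.net/haoding205/article/details/82784686
In that article, during a quick BlueKing deployment, the first machine ran out of memory and hung while the saas-o app bk_nodeman was being installed, so it had to be rebooted. That calls for a plan: what happens to the processes after the reboot?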
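After the machine reboots
Confirm that the first nameserver in /etc/resolv.conf is 127.0.0.1 and that the options line does not include rotate.
Check the crontab on the rebooted machine for the entry that restarts processes automatically: crontab -l | grep process_watch. Automatic recovery after a reboot relies mainly on crontab.
On the control machine, confirm the status of all processes: ./bkcec status all. Normally every process should have come back up and show RUNNING. If any show EXIT, try starting them manually; the exact steps are in the guide on starting and stopping components.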
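The first two checks above can be scripted. The following is only a minimal sketch, not part of the BlueKing tooling: the function names check_resolv_conf and has_process_watch are made up for illustration, and the only things taken from the steps above are the file /etc/resolv.conf, the address 127.0.0.1, the rotate option, and the process_watch keyword.

```shell
# Hypothetical helper (not shipped with BlueKing): returns 0 if the
# resolv.conf-style file given as $1 lists 127.0.0.1 as its first
# nameserver and has no "rotate" option, 1 otherwise.
check_resolv_conf() {
    file=${1:-/etc/resolv.conf}
    # First "nameserver" entry in the file; commented-out lines are skipped
    # because their first field is not exactly "nameserver".
    first_ns=$(awk '$1 == "nameserver" { print $2; exit }' "$file")
    if [ "$first_ns" != "127.0.0.1" ]; then
        echo "first nameserver is '${first_ns}', expected 127.0.0.1"
        return 1
    fi
    # Look for the word "rotate" on any "options" line.
    if awk '$1 == "options" { for (i = 2; i <= NF; i++) if ($i == "rotate") found = 1 }
            END { exit !found }' "$file"; then
        echo "options line contains rotate"
        return 1
    fi
    echo "resolv.conf looks fine"
    return 0
}

# Hypothetical helper: reads crontab text on stdin and succeeds if it
# mentions process_watch anywhere.
has_process_watch() {
    grep -q process_watch
}
```

On the rebooted machine this could be used as check_resolv_conf /etc/resolv.conf followed by crontab -l | has_process_watch. Both functions only read; they change nothing on the machine.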
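If all machines of a Community Edition deployment reboot at the same time, many processes will very likely fail to start. There is no control over how long the components on different machines take to recover, so a process can come up before the components it depends on are running, fail, and set off a chain reaction. In that case, follow the same startup order as during installation:
Start the databases first
Start the other open-source components and services that are depended on
Start the BlueKing products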
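The ordering above can be sketched as a small wrapper that starts modules one at a time and stops at the first failure. This is an illustration only: start_in_order and the BKCEC variable are made-up names, and it assumes that ./bkcec start exits non-zero when a module fails to start, which is not confirmed here. The module names to pass in are whatever your own installation uses.

```shell
# Hypothetical helper (not part of bkcec): start the modules given as
# arguments, in order, and stop at the first one that fails so that
# later tiers are not started on top of a broken dependency.
# BKCEC defaults to ./bkcec; set BKCEC=echo for a dry run.
start_in_order() {
    for module in "$@"; do
        echo "starting ${module}"
        # Assumption: a failed start is reported through the exit status.
        if ! ${BKCEC:-./bkcec} start "$module"; then
            echo "failed to start ${module}; stopping here" >&2
            return 1
        fi
    done
}
```

From the install directory on the control machine, the call would list the database modules first, then the other open-source components, then the BlueKing products. Running it once with BKCEC=echo prints the commands without executing them, which is a cheap way to double-check the order.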
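If SaaS apps have already been deployed, start them manually.
./bkcec start saas-o # production environment
./bkcec start saas-t # test environment
(Details omitted.)
Wait for the crontab job to bring the application processes back up automatically, then check the result: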
[root@paas-1 install]# ./bkcec status all   # check status
[192.168.1.101] consul: RUNNING
[192.168.1.101] nginx: RUNNING
[192.168.1.101] zk: RUNNING
[192.168.1.101] rabbitmq: RUNNING
[192.168.1.101] paas_agent() paas_agent RUNNING pid 4670, uptime 0:01:34
[192.168.1.101] nginx: RUNNING
[192.168.1.101] paas_agent(T) paas_agent RUNNING pid 4670, uptime 0:01:39
[192.168.1.101] nginx: RUNNING
[192.168.1.101] es: RUNNING
[192.168.1.101] kafka: RUNNING
---------------------------------------------------------------------------------------------------------
[192.168.1.101] dataapi dataapi RUNNING pid 3691, uptime 0:01:55
[192.168.1.101] dataapi dataapi-celery-1 RUNNING pid 3688, uptime 0:01:55
[192.168.1.101] dataapi dataapi-celery-2 RUNNING pid 3689, uptime 0:01:55
[192.168.1.101] dataapi dataapi-celery-3 RUNNING pid 3690, uptime 0:01:55
[192.168.1.101] monitor collect:collect0 RUNNING pid 5689, uptime 0:01:34
[192.168.1.101] monitor common:logging RUNNING pid 5690, uptime 0:01:34
[192.168.1.101] monitor common:scheduler RUNNING pid 5692, uptime 0:01:34
[192.168.1.101] monitor converge:converge0 RUNNING pid 5694, uptime 0:01:34
[192.168.1.101] monitor detect_cron RUNNING pid 5681, uptime 0:01:34
[192.168.1.101] monitor kernel:cron RUNNING pid 5683, uptime 0:01:34
[192.168.1.101] monitor kernel:match_alarm0 RUNNING pid 5685, uptime 0:01:34
[192.168.1.101] monitor kernel:qos RUNNING pid 5686, uptime 0:01:34
[192.168.1.101] monitor run_data_access:run_data_access0 RUNNING pid 5680, uptime 0:01:34
[192.168.1.101] monitor run_detect_new:run_detect_new0 RUNNING pid 5679, uptime 0:01:34
[192.168.1.101] monitor run_poll_alarm:run_poll_alarm0 RUNNING pid 5678, uptime 0:01:34
[192.168.1.101] databus databus_es RUNNING pid 4990, uptime 0:01:41
[192.168.1.101] databus databus_etl RUNNING pid 4992, uptime 0:01:41
[192.168.1.101] databus databus_jdbc RUNNING pid 4989, uptime 0:01:41
[192.168.1.101] databus databus_redis RUNNING pid 4993, uptime 0:01:41
[192.168.1.101] databus databus_tsdb RUNNING pid 4991, uptime 0:01:41
[192.168.1.101] fta common:apiserver RUNNING pid 3685, uptime 0:02:02
[192.168.1.101] fta common:jobserver RUNNING pid 3683, uptime 0:02:02
[192.168.1.101] fta common:logging RUNNING pid 3687, uptime 0:02:02
[192.168.1.101] fta common:polling0 RUNNING pid 3684, uptime 0:02:02
[192.168.1.101] fta common:qos RUNNING pid 3681, uptime 0:02:02
[192.168.1.101] fta common:scheduler0 RUNNING pid 3682, uptime 0:02:02
[192.168.1.101] fta common:webserver RUNNING pid 3686, uptime 0:02:02
[192.168.1.101] fta fta:collect0 RUNNING pid 3678, uptime 0:02:02
[192.168.1.101] fta fta:converge0 RUNNING pid 3673, uptime 0:02:02
[192.168.1.101] fta fta:job RUNNING pid 3675, uptime 0:02:02
[192.168.1.101] fta fta:match_alarm0 RUNNING pid 3679, uptime 0:02:02
[192.168.1.101] fta fta:match_alarm1 RUNNING pid 3680, uptime 0:02:02
[192.168.1.101] fta fta:match_alarm2 RUNNING pid 3677, uptime 0:02:02
[192.168.1.101] fta fta:match_alarm3 RUNNING pid 3676, uptime 0:02:02
[192.168.1.101] fta fta:poll_alarm RUNNING pid 3672, uptime 0:02:02
[192.168.1.101] fta fta:solution RUNNING pid 3674, uptime 0:02:02
[192.168.1.102] consul: RUNNING
[192.168.1.102] mysqld: RUNNING
[192.168.1.102] mongod: RUNNING
[192.168.1.102] zk: RUNNING
[192.168.1.102] paas_agent() paas_agent RUNNING pid 3714, uptime 1 day, 0:32:50
[192.168.1.102] nginx: RUNNING
[192.168.1.102] paas_agent(O) paas_agent RUNNING pid 3714, uptime 1 day, 0:32:52
[192.168.1.102] nginx: RUNNING
[192.168.1.102] es: RUNNING
[192.168.1.102] kafka: RUNNING
[192.168.1.102] beanstalk: RUNNING
[192.168.1.103] consul: RUNNING
[192.168.1.103] license: RUNNING
[192.168.1.103] redis: RUNNING
---------------------------------------------------------------------------------------------------------
[192.168.1.103] open_paas appengine RUNNING pid 13809, uptime 2 days, 5:03:29
[192.168.1.103] open_paas esb RUNNING pid 13808, uptime 2 days, 5:03:29
[192.168.1.103] open_paas login RUNNING pid 13806, uptime 2 days, 5:03:29
[192.168.1.103] open_paas paas RUNNING pid 13805, uptime 2 days, 5:03:29
[192.168.1.103] gse_alarm: RUNNING
[192.168.1.103] gse_ops: RUNNING
[192.168.1.103] gse_opts: RUNNING
[192.168.1.103] gse_api: RUNNING
[192.168.1.103] gse_btsvr: RUNNING
[192.168.1.103] gse_data: RUNNING
[192.168.1.103] gse_dba: RUNNING
[192.168.1.103] gse_task: RUNNING
[192.168.1.103] cmdb-nginx: RUNNING
[192.168.1.103] server cmdb_adminserver RUNNING pid 16427, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_apiserver RUNNING pid 16416, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_auditcontoller RUNNING pid 16412, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_datacollection RUNNING pid 16426, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_eventserver RUNNING pid 16417, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_hostcontroller RUNNING pid 16406, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_hostserver RUNNING pid 16407, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_objectcontroller RUNNING pid 16409, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_proccontroller RUNNING pid 16428, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_procserver RUNNING pid 16411, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_toposerver RUNNING pid 16408, uptime 1 day, 2:01:20
[192.168.1.103] server cmdb_webserver RUNNING pid 16410, uptime 1 day, 2:01:20
[192.168.1.103] zk: RUNNING
[192.168.1.103] job: RUNNING
[192.168.1.103] es: RUNNING
[192.168.1.103] kafka: RUNNING
[192.168.1.103] influxdb: RUNNING
[root@paas-1 install]#
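With this much output, a process that did not come back is easy to miss. A minimal sketch of a filter that prints only the entries not reporting RUNNING is shown below. The function name not_running is made up for illustration, and the sketch assumes the status lines keep the shape seen in this output: an IP address in square brackets at the start of the line, with the word RUNNING somewhere after it.

```shell
# Hypothetical helper: read "./bkcec status all" output on stdin and print
# every host-prefixed line that does not report RUNNING.
# Exit status: 0 if every entry is RUNNING, 1 if at least one is not.
not_running() {
    # Only lines starting with "[<ip>]" are status entries; the shell
    # prompt and the dashed separator lines are ignored.
    awk '/^\[[0-9.]+\]/ && !/ RUNNING( |$)/ { print; bad = 1 }
         END { exit bad + 0 }'
}
```

It would be used as ./bkcec status all | not_running. No output and a zero exit status mean every listed process reports RUNNING; anything printed is a candidate for a manual start.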
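The test succeeded: after the reboot, crontab brought the application processes back up automatically, and the status check shows every process running normally, as expected.
At this point we can go back to "How to install bk_nodeman, one of BlueKing's saas-o apps" (《如何安装蓝鲸的saas-o之bk_nodeman?》):
https://blog.csdn.net/haoding205/article/details/82784686
and carry on installing bk_nodeman until it reaches 100% and completes successfully.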
http://docs.bk.tencent.com/bkce_install_guide/maintain.html#migrate_module
That's it. Now you know what to do when a machine running BlueKing loses power and reboots.
If you have other questions, leave them in the comments.