Fixing Docker startup failure on a GPU server_docker failed to retrieve /usr/bin/nvidia-containe

Author: 思考机器3 | 2024-01-30 15:02:36

docker failed to retrieve /usr/bin/nvidia-container-runtime version: fork/ex

After the kernel on a CentOS 7 server was upgraded to 4.4, docker run failed with the following error:


docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/fed8039bea12d7d6e31da38c3466459b8aab55c7fe191c82774ec11b2ea870a7/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.
ERRO[0163] error waiting for container: context canceled
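
Since the error says nvidia-container-runtime was not found in $PATH, a quick first check is whether the NVIDIA container binaries are visible to the shell at all. The following is a minimal sketch, not part of the original fix; on_path is a hypothetical helper name, and the PATH seen by the Docker daemon may differ from the one in an interactive shell.

```shell
#!/bin/sh
# Sketch: report whether the NVIDIA container binaries are on PATH.
# on_path is a hypothetical helper used only for this illustration.

# Succeeds when the command named in $1 can be resolved by the shell.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

for bin in nvidia-container-runtime nvidia-container-cli; do
    if on_path "$bin"; then
        echo "$bin: $(command -v "$bin")"
    else
        echo "$bin: not found in PATH"
    fi
done
```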

Checking with nvidia-container-cli -k -d /dev/tty info produced the following output:


-- WARNING, the following logs are for debugging purposes only --
 
I0819 02:31:17.719753 137864 nvc.c:282] initializing library context (version=1.2.0, build=d22237acaea94aa5ad5de70aac903534ed598819)
I0819 02:31:17.720023 137864 nvc.c:256] using root /
I0819 02:31:17.720043 137864 nvc.c:257] using ldcache /etc/ld.so.cache
I0819 02:31:17.720061 137864 nvc.c:258] using unprivileged user 65534:65534
I0819 02:31:17.720349 137864 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0819 02:31:17.720549 137864 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I0819 02:31:17.736760 137874 nvc.c:192] loading kernel module nvidia
E0819 02:31:17.757157 137874 nvc.c:194] could not load kernel module nvidia
I0819 02:31:17.757210 137874 nvc.c:204] loading kernel module nvidia_uvm
E0819 02:31:17.775261 137874 nvc.c:206] could not load kernel module nvidia_uvm
I0819 02:31:17.775305 137874 nvc.c:212] loading kernel module nvidia_modeset
E0819 02:31:17.792443 137874 nvc.c:214] could not load kernel module nvidia_modeset
I0819 02:31:17.793337 137887 driver.c:101] starting driver service
I0819 02:31:17.831025 137864 driver.c:196] driver service terminated with signal 15
nvidia-container-cli: initialization error: nvml error: driver not loaded

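This shows that the GPU driver is not loaded. Most likely the kernel upgrade broke the existing driver, because the NVIDIA kernel modules have to be built for the kernel that is actually running.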
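
One way to confirm this diagnosis is to ask modinfo which kernel the nvidia module was built for and compare it with the running kernel. This is a rough sketch, assuming modinfo is installed; vermagic_matches is a hypothetical helper, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Sketch: check whether an nvidia module exists for the running kernel.
# vermagic_matches is a hypothetical helper used only for this illustration.

# Succeeds when the vermagic string ($1) begins with the kernel release ($2).
vermagic_matches() {
    case "$1" in
        "$2"|"$2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

running=$(uname -r)
# modinfo prints nothing here if no nvidia module is available for this kernel.
vermagic=$(modinfo -F vermagic nvidia 2>/dev/null || true)

if vermagic_matches "$vermagic" "$running"; then
    echo "nvidia module is built for $running"
else
    echo "no nvidia module built for $running (vermagic: '${vermagic:-none}')"
fi
```

An empty vermagic after a kernel upgrade would be consistent with the "could not load kernel module" lines in the log above.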

uname -a reports:


4.4.232-1.el7.elrepo.x86_64 #1 SMP Fri Jul 31 11:49:26 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

The fix is to remove the old kernel-headers package and install the headers and devel packages that match the new kernel:


yum remove kernel-headers -y
wget http://ftp.osuosl.org/pub/elrepo/kernel/el7/x86_64/RPMS/kernel-lt-devel-4.4.232-1.el7.elrepo.x86_64.rpm
wget http://ftp.osuosl.org/pub/elrepo/kernel/el7/x86_64/RPMS/kernel-lt-headers-4.4.232-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-lt-headers-4.4.232-1.el7.elrepo.x86_64.rpm  kernel-lt-devel-4.4.232-1.el7.elrepo.x86_64.rpm
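
Before rebuilding the driver, it can help to confirm that the devel package really provides a build tree for the running kernel. The sketch below assumes the usual layout in which /lib/modules/<release>/build points at the kernel build directory; has_build_tree is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: verify that a kernel build tree exists for a given kernel release.
# has_build_tree is a hypothetical helper used only for this illustration.

# $1: kernel release (as printed by uname -r)
# $2: modules root, defaults to /lib/modules
# [ -d ] follows symlinks, so a dangling "build" link counts as missing.
has_build_tree() {
    [ -d "${2:-/lib/modules}/$1/build" ]
}

release=$(uname -r)
if has_build_tree "$release"; then
    echo "build tree found for $release"
else
    echo "no build tree for $release; the devel package may not match the running kernel"
fi
```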

Then reinstall the NVIDIA driver and CUDA:


wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.1.243-1.x86_64.rpm
wget https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-12.noarch.rpm
rpm -ivh cuda-repo-rhel7-10.1.243-1.x86_64.rpm
rpm -ivh epel-release-7-12.noarch.rpm
yum clean all
yum -y install nvidia-driver-latest-dkms cuda cuda-drivers
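
After the installation, the three modules named in the earlier log (nvidia, nvidia_uvm, nvidia_modeset) should be loadable. A minimal sketch for checking which of them are currently loaded, assuming /proc/modules is readable; module_loaded is a hypothetical helper, and a reboot or a manual modprobe may be needed before the modules show up.

```shell
#!/bin/sh
# Sketch: report which NVIDIA kernel modules are currently loaded.
# module_loaded is a hypothetical helper used only for this illustration.

# $1: module name
# $2: module list file, defaults to /proc/modules
# Each line of /proc/modules starts with the module name followed by a space.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

for mod in nvidia nvidia_uvm nvidia_modeset; do
    if module_loaded "$mod"; then
        echo "$mod: loaded"
    else
        echo "$mod: not loaded"
    fi
done
```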

Finally, run:

docker run --gpus '"device=1,2"' nvidia/cuda:10.0-base nvidia-smi

The output is now displayed normally.

