What to do when the GPU cannot be used inside Docker (Failed to initialize NVML: Unknown Error)
The fix follows the article quoted below (Appendix 1):
SOLVED Docker with GPU: “Failed to initialize NVML: Unknown Error”
Prerequisites for this solution:
You need to be in the docker group on the server; full admin rights on the server itself are not required. When I created my container, I asked the administrator to add me to the docker group. If you are able to create Docker containers, you already meet this requirement.
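If you are not sure whether you are in the docker group, a quick check (assuming a typical Linux setup; the grep test is just one way to do it) is:
# List the current user's groups; "docker" should appear in the output
id -nG "$USER"
# Or test directly: prints "ok" only if the user is in the docker group
id -nG "$USER" | grep -qw docker && echo ok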
Problem description:
nvidia-smi works normally on the host, but inside the Docker container it fails with the error in the title.
Solution: apply the method quoted below, with a few differences.
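To illustrate, the symptom looks roughly like this (nvidia/cuda:11.0-base is just an example image; any CUDA container you normally run shows the same behaviour):
# On the host: prints the usual GPU table
nvidia-smi
# In a container: fails with the error in the title
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# Failed to initialize NVML: Unknown Error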
# Inside the Docker container
cd /etc/nvidia-container-runtime/
sudo touch config.toml
sudo vim config.toml
# Paste in the config.toml content below
# Press ESC, then :wq to save and quit
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
After that, it worked.
Appendix 1
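A minimal sanity check, assuming you can restart the container from the host (<container_name> is a placeholder):
# On the host: restart the container so the runtime picks up the new config
sudo docker restart <container_name>
# Inside the container: nvidia-smi should now print the GPU table instead of the NVML error
nvidia-smi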
I’ve bumped to the same issue after recent update of nvidia related packages. Fortunately, I managed to fix it.
Method 1, recommended
Kernel parameter
Make sure the kernel parameter systemd.unified_cgroup_hierarchy=false is present; the easiest way to check is to inspect /proc/cmdline:
cat /proc/cmdline
Setting it permanently is of course done through your boot loader's configuration. Alternatively, you can hijack this file to set the parameter at runtime: https://wiki.archlinux.org/title/Kernel_parameters#Hijacking_cmdline
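For example, on a GRUB-based system (an assumption; adapt this to your boot loader, and note that Debian/Ubuntu users would run update-grub instead of grub-mkconfig) the parameter can be set permanently like this:
# Append the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=false"
sudo vim /etc/default/grub
# Regenerate the GRUB config and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
# After the reboot, confirm the parameter shows up
cat /proc/cmdline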
nvidia-container configuration
In the file
/etc/nvidia-container-runtime/config.toml
set the parameter
no-cgroups = false
After that restart docker and run test container:
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Method 2
Actually, you can try to bypass cgroupsv2 by setting (in file mentioned above)
no-cgroups = true
Then you must manually pass all GPU devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-851039827
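For reference, a non-privileged run with the devices passed explicitly might look like the sketch below (the device list follows that issue comment; /dev/nvidia0 assumes a single GPU, and some systems may additionally need /dev/nvidia-modeset):
sudo docker run --rm --gpus all \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.0-base nvidia-smi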
For debugging purposes, just run:
sudo systemctl restart docker
sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
Good luck
Last edited by szalinski (2021-06-04 23:41:06)