If a deep learning training run is interrupted, the GPU memory it occupied is often not released. These notes are a record for future reference.
The symptom is the following error:
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
1. Check whether the problem has occurred: nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:01:00.0  On |                  N/A |
| 39%   53C    P2    36W / 250W |  11959MiB / 12055MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1017      G   /usr/lib/xorg/Xorg                           298MiB |
|    0      1834      G   /opt/teamviewer/tv_bin/TeamViewer              6MiB |
|    0      2045      G   compiz                                       177MiB |
|    0      4118      G   ...-token=D609226DD6A56AEBB70B08FB7BC10F2E    78MiB |
|    0      4603      G   ...uest-channel-token=11061898972785214487    59MiB |
|    0     16481      C   python3                                      418MiB |
|    0     16537      C   python3                                    10916MiB |
+-----------------------------------------------------------------------------+
2. Process 16537 turns out to be the culprit, so kill it:
kill -9 16537
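When several leftover python3 processes are holding GPU memory at once, the PIDs can be pulled out of nvidia-smi's Processes table instead of being copied by hand. A minimal sketch, parsing a saved two-line sample of the table rather than live nvidia-smi output so it runs anywhere; the awk field positions assume the 384-series table layout shown above:

```shell
# Extract PIDs of compute ("C") processes from nvidia-smi's Processes table.
# A saved sample of two table rows stands in for live output here.
sample='|    0     16481      C   python3                                      418MiB |
|    0     16537      C   python3                                    10916MiB |'
# In this layout, field 4 is the process type and field 3 is the PID.
pids=$(printf '%s\n' "$sample" | awk '$4 == "C" { print $3 }')
echo "$pids"
```

On a live machine the same filter can feed kill directly, e.g. nvidia-smi | awk '$4 == "C" { print $3 }' | xargs -r kill -9 — but double-check the PID list before sending SIGKILL, since it will take down every compute process on the GPU.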
3. Monitor the GPU (the 3 means a 3-second refresh interval):
watch -n 3 nvidia-smi
4. Monitor the CPU and memory:
top -d 1
free -m
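The number worth watching in the free -m output is the "available" column, which estimates how much memory can still be handed out without swapping. A small sketch that extracts it, parsing a saved sample of free -m output (the figures are invented for illustration) so it runs without a live system:

```shell
# Pull the "available" column (7th field of the Mem: row) out of free -m.
# A saved sample of the output is parsed here for illustration.
sample='              total        used        free      shared  buff/cache   available
Mem:          64329       12010        1830         512       50488       51203
Swap:          2047           0        2047'
avail=$(printf '%s\n' "$sample" | awk '/^Mem:/ { print $7 }')
echo "${avail} MiB available"
```

On a live machine, replace the sample with free -m | awk '/^Mem:/ { print $7 }'.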
5. Clear the kernel's cache memory (writing 1 drops the page cache, 2 drops reclaimable dentries and inodes, 3 drops both, so the last command alone is sufficient):
sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
sudo sh -c 'echo 2 > /proc/sys/vm/drop_caches'
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
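It is safest to run sync before dropping caches so that dirty pages are written back to disk first. A small wrapper sketch; the DRY_RUN flag is a hypothetical addition so the command can be previewed without root:

```shell
# Flush caches safely: sync first, then write the requested level to
# /proc/sys/vm/drop_caches. With DRY_RUN=1 it only prints what would run.
DRY_RUN=${DRY_RUN:-1}
drop_caches() {
  level=$1
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: sync; echo $level > /proc/sys/vm/drop_caches"
  else
    sync
    echo "$level" | sudo tee /proc/sys/vm/drop_caches > /dev/null
  fi
}
drop_caches 3
```

Invoke with DRY_RUN=0 to actually flush; note that dropping caches is non-destructive but will temporarily slow the system while caches refill.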