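When running Megatron-LM/PyTorch, the following errors may appear:
No module named 'fused_layer_norm_cuda'
: apex is missing or was installed incorrectly. Note that pip install apex does not install the real NVIDIA apex; it has to be built and installed from source.
ModuleNotFoundError: No module named 'packaging'
: this error occurs when building newer versions of apex; switch to an earlier version of the code.
No module named 'amp_C'
: build with pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ and, once the build finishes, additionally run python setup.py install
ImportError: libc10.so: cannot open shared object file: No such file or directory
: libc10.so is installed together with PyTorch.
NVIDIA APEX repository: https://github.com/NVIDIA/apex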
apt-get install -y ninja-build libssl-dev libffi-dev
If the dependencies above are not enough, try the following:
apt install -y ninja-build build-essential pkg-config zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev libbz2-dev
wget https://www.python.org/ftp/python/3.10.12/Python-3.10.12.tgz
tar zxf Python-3.10.12.tgz && cd Python-3.10.12
./configure
make altinstall
Python is installed under /usr/local/bin by default, so PATH and the symlinks need to be set up:
export PATH=/usr/local/bin:$PATH
ln -s /usr/local/bin/python3.10 /usr/local/bin/python
ln -s /usr/local/bin/pip3.10 /usr/local/bin/pip
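To confirm that the PATH change and symlinks took effect, a small check along these lines may help. It is only a sketch: the helper name describe_interpreter is made up for illustration, and it uses nothing beyond the standard library.

```python
import shutil
import sys


def describe_interpreter(command="python"):
    """Report which executable `command` resolves to on PATH and the running version."""
    return {
        "resolved": shutil.which(command),  # None if the command is not on PATH
        "running": sys.executable,          # interpreter executing this script
        "version": "%d.%d.%d" % sys.version_info[:3],
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If the setup above worked, the resolved path should point into /usr/local/bin and the version should read 3.10.12.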
Install the GPU build of pytorch-1.12.1. This fixes the missing libc10.so, and the apex build also depends on torch:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
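Since libc10.so ships inside the torch package, one way to check that it is actually present is to look for it next to the installed package. The sketch below assumes the usual Linux wheel layout, where the shared libraries sit in a lib directory inside the torch package; the helper name find_libc10 is hypothetical.

```python
import importlib.util
import os


def find_libc10():
    """Return the path of libc10.so inside the installed torch package, or None."""
    try:
        # find_spec locates the package without importing it.
        spec = importlib.util.find_spec("torch")
    except (ImportError, ValueError):
        spec = None
    if spec is None or not spec.origin:
        return None
    # Assumption: shared libraries live in <torch package dir>/lib.
    candidate = os.path.join(os.path.dirname(spec.origin), "lib", "libc10.so")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    print(find_libc10() or "libc10.so not found; is torch installed?")
```

A result of None suggests that torch is not installed in the interpreter being used, or that the layout differs from the one assumed here.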
pip uninstall apex
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 22.04-dev
pip install -r requirements.txt
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
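After the build, it can be useful to check whether the modules named in the error messages are visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name missing_modules and the default list of names are illustrative choices, not part of apex.

```python
import importlib.util

# Module names taken from the error messages discussed in this note.
APEX_MODULES = ("packaging", "torch", "apex", "amp_C", "fused_layer_norm_cuda")


def missing_modules(names=APEX_MODULES):
    """Return the names that the import system cannot locate (nothing is imported)."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing:", missing_modules())
```

This only shows that the files can be located. A module that is found can still fail at import time, for example when libc10.so cannot be resolved.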
torch must be imported before amp_C:
import torch
import amp_C
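The ordering requirement can be wrapped in a small helper that imports the modules in sequence and reports the first one that fails. This is a sketch: import_in_order is a hypothetical name. The ordering presumably matters because importing torch loads its shared libraries, such as libc10.so, before the extension needs them.

```python
import importlib


def import_in_order(names=("torch", "amp_C")):
    """Import each module in sequence.

    Returns (name, error message) for the first failure, or None if all succeed.
    """
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            return name, str(exc)
    return None


if __name__ == "__main__":
    failure = import_in_order()
    print("all imports succeeded" if failure is None else "failed: %s (%s)" % failure)
```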