This article surveys enterprise-level NLP application tasks, uses job descriptions (JDs) to gauge how mainstream NLP techniques are actually applied in industry, and reviews the common tasks, techniques, datasets, and evaluation methods.
1. CUDA and cuDNN choice: install the PyTorch build matching CUDA 11.3:
pip3 install torch==1.10.2+cu113 torchvision==0.11.3+cu113 torchaudio===0.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
2. Python version: the bert-base-chinese model under the transformers library was tested on Python 3.6 and Python 3.8. transformers 4.16.2 does not run on Python 3.6, so the whole environment was switched to Python 3.8, managed entirely with conda virtual environments.
3. Local CUDA toolkit version:
(py38_pt_common) D:\dev\envs\py38_pt_common>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:25:35_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0
4. Compatible CUDA/cuDNN version pairings can be looked up online; the local cuDNN version is cuDNN 8.
# Create the virtual environment
conda create --prefix=D:\dev\envs\py38_pt_common python=3.8
# Activate the virtual environment
conda activate D:\dev\envs\py38_pt_common
# Install torch
pip3 install torch==1.10.2+cu113 torchvision==0.11.3+cu113 torchaudio===0.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
# Install the required libraries
pip install scikit-learn
pip install transformers
pip install datasets
pip install sentencepiece
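Before moving on, a quick sanity check that the environment matches the versions above can save trouble later. A minimal sketch; the expected values in the comments follow from the install command:

```python
import torch

# Versions the install command above should produce
print(torch.__version__)               # 1.10.2+cu113
print(torch.version.cuda)              # 11.3 (the toolkit PyTorch was built against)
print(torch.backends.cudnn.version())  # cuDNN 8.x bundled with the wheel

# The GPU must be visible before any training is attempted
print(torch.cuda.is_available())       # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```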
After training for a while, the run crashed with RuntimeError: CUDA out of memory. The tail of the training log and the traceback are below (note that the traceback passes through modeling_xlnet.py, i.e. the model actually being fine-tuned by my_train_bert.py is XLNet):

The best model has been saved
 86%|████████▋ | 151/175 [06:44<02:43, 6.80s/it] 151 0.18657496571540833
 87%|████████▋ | 152/175 [06:44<01:52, 4.87s/it] 152 0.38517382740974426
 87%|████████▋ | 153/175 [06:44<01:17, 3.53s/it] 153 0.19268524646759033
 88%|████████▊ | 154/175 [06:45<00:54, 2.58s/it] 154 0.11133516579866409
 89%|████████▊ | 155/175 [06:45<00:38, 1.92s/it] 155 0.25511711835861206
 89%|████████▉ | 156/175 [06:46<00:27, 1.46s/it] 156 0.18305906653404236
 90%|████████▉ | 157/175 [06:46<00:20, 1.13s/it] 157 0.2746654152870178
 90%|█████████ | 158/175 [06:46<00:15, 1.11it/s] 158 0.20452025532722473
 91%|█████████ | 159/175 [06:47<00:11, 1.35it/s] 159 0.19692355394363403
 91%|█████████▏| 160/175 [06:47<00:09, 1.58it/s] 160 0.2176191508769989
Validation Accuracy: 0.9296875
 92%|█████████▏| 161/175 [07:08<01:35, 6.84s/it] The best model has been saved
161 0.21764002740383148
 93%|█████████▎| 162/175 [07:09<01:03, 4.90s/it] 162 0.11657670140266418
 93%|█████████▎| 163/175 [07:09<00:42, 3.54s/it] 163 0.2392028421163559
 94%|█████████▎| 164/175 [07:09<00:28, 2.59s/it] 164 0.34872329235076904
 94%|█████████▍| 165/175 [07:10<00:19, 1.92s/it] 165 0.12209181487560272
 95%|█████████▍| 166/175 [07:10<00:13, 1.46s/it] 166 0.10830333083868027
 95%|█████████▌| 167/175 [07:11<00:09, 1.13s/it] 167 0.21035946905612946
 96%|█████████▌| 168/175 [07:11<00:06, 1.10it/s] 168 0.325456827878952
 97%|█████████▋| 169/175 [07:11<00:04, 1.33it/s] 169 0.0809602439403534
 97%|█████████▋| 170/175 [07:12<00:03, 1.55it/s] 170 0.2002611756324768
 98%|█████████▊| 171/175 [07:32<00:25, 6.46s/it] Validation Accuracy: 0.9275568181818182
171 0.44647514820098877
 98%|█████████▊| 172/175 [07:32<00:13, 4.64s/it] 172 0.08488976210355759
 99%|█████████▉| 173/175 [07:32<00:06, 3.36s/it] 173 0.09736869484186172
 99%|█████████▉| 174/175 [07:33<00:02, 2.46s/it] 174 0.13555729389190674
100%|██████████| 175/175 [07:34<00:00, 2.60s/it]
Epoch 1
  0%|          | 0/175 [00:13<?, ?it/s]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\app\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\app\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:/CCFBDCI2020-main/code1/my_train_bert.py", line 235, in <module>
    train_eval(model, criterion, optimizer, train_loader, val_loader, epochs=10)
  File "D:/CCFBDCI2020-main/code1/my_train_bert.py", line 146, in train_eval
    logits = model(batch[0], batch[1], batch[2])
  File "D:\dev\envs\py38_pt_common\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:/CCFBDCI2020-main/code1/my_train_bert.py", line 82, in forward
    outputs = self.bert(input_ids, token_type_ids=token_type_ids, attention_mask=attn_masks)
  File "D:\dev\envs\py38_pt_common\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\dev\envs\py38_pt_common\lib\site-packages\transformers\models\xlnet\modeling_xlnet.py", line 1246, in forward
    outputs = layer_module(
  File "D:\dev\envs\py38_pt_common\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\dev\envs\py38_pt_common\lib\site-packages\transformers\models\xlnet\modeling_xlnet.py", line 515, in forward
    outputs = self.rel_attn(
  File "D:\dev\envs\py38_pt_common\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\dev\envs\py38_pt_common\lib\site-packages\transformers\models\xlnet\modeling_xlnet.py", line 446, in forward
    attn_vec = self.rel_attn_core(
  File "D:\dev\envs\py38_pt_common\lib\site-packages\transformers\models\xlnet\modeling_xlnet.py", line 287, in rel_attn_core
    bd = torch.einsum("ibnd,jbnd->bnij", q_head + self.r_r_bias, k_head_r)
  File "D:\dev\envs\py38_pt_common\lib\site-packages\torch\functional.py", line 327, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: CUDA out of memory. Tried to allocate 108.00 MiB (GPU 0; 10.00 GiB total capacity; 8.17 GiB already allocated; 0 bytes free; 8.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Solution (ref: https://stackoverflow.com/questions/68967257/permissionerror-errno-13-permission-denied-python): lower the batch_size and test again.
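Besides shrinking the batch, the error message itself points at PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of both mitigations (train_dataset here is a dummy placeholder, not the script's real dataset):

```python
import os

# As suggested by the error message: cap the allocator's split size to reduce
# fragmentation. This must be set before the first CUDA allocation happens.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the real tokenized dataset (1000 sequences of length 128)
train_dataset = TensorDataset(torch.randint(0, 21128, (1000, 128)))

# The fix actually applied here: halve batch_size until training fits in 10 GiB
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# When re-running experiments in the same process, release cached blocks first
torch.cuda.empty_cache()
```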
A checklist of techniques for making training fit in memory and run faster (a sketch of gradient accumulation combined with mixed precision follows the list):

- Use DataLoaders
- Number of workers in the DataLoader
- Batch size
- Gradient accumulation
- Retained computation graphs
- Move to a single GPU
- 16-bit mixed-precision training
- Move to multiple GPUs (model replication)
- Move to multiple GPU nodes (8+ GPUs)
- Tips for thinking about model speedups
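Two items above, gradient accumulation and 16-bit mixed precision, are the quickest wins when memory is the bottleneck. A minimal sketch of combining them in PyTorch 1.10 (the model, data, and hyperparameters are dummies standing in for the real fine-tuning script, not code from it):

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy model and data standing in for the real BERT fine-tuning setup
model = nn.Linear(128, 2).to(device)
data = TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=8)  # small per-step batch to save memory

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = GradScaler(enabled=(device == "cuda"))  # loss scaling for fp16
accum_steps = 4  # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    with autocast(enabled=(device == "cuda")):       # forward pass in fp16 where safe
        loss = criterion(model(x), y) / accum_steps  # average loss over accumulated steps
    scaler.scale(loss).backward()                    # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)   # unscale gradients and take the optimizer step
        scaler.update()
        optimizer.zero_grad()
```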
Google's original source code is not easy to read; it targets TensorFlow 1.13. Training can be fully reproduced on CPU with batch size 16, and stepping through it in a debugger is a good way to understand the internals. Setup notes follow:
Environment pitfall: TensorFlow 1.13 cannot be installed via pip from the Tsinghua mirror; install it with conda instead.
# Create the virtual environment
conda create --prefix=D:\root\env\pyenv\py36_tf1 python=3.6
# Activate the virtual environment
conda activate D:\root\env\pyenv\py36_tf1
# Install with conda
conda install --prefix=D:\root\env\pyenv\py36_tf1 tensorflow=1.13
# Install with pip (fails from the Tsinghua mirror, as noted above)
pip install tensorflow==1.13.1
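Once the environment resolves, a quick TF1-style smoke test confirms the install (a minimal sketch; the print should report a 1.13.x version):

```python
import tensorflow as tf

print(tf.__version__)  # expected: 1.13.x

# TF1 graph-mode smoke test: build a trivial graph and run it in a Session
a = tf.constant(2)
b = tf.constant(3)
with tf.Session() as sess:
    print(sess.run(a + b))  # 5
```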
Working from the source code [github bert](https://github.com/google-research/bert), download a pre-trained model into GLUE\BERT_BASE_DIR\uncased_L-12_H-768_A-12. Getting to know the model files:

2018/10/19 06:21         313 bert_config.json                     config file holding the model hyperparameters
2018/10/19 06:34 440,425,712 bert_model.ckpt.data-00000-of-00001  checkpoint weights
2018/10/19 06:34       8,528 bert_model.ckpt.index                checkpoint index
2018/10/19 06:34     904,243 bert_model.ckpt.meta                 checkpoint metadata
2018/10/19 06:21     231,508 vocab.txt                            vocabulary

1. Download the dataset. From the README: "Before running this example you must download the GLUE data by running this script and unpack it to some directory $GLUE_DIR. Next, download the BERT-Base checkpoint and unzip it to some directory $BERT_BASE_DIR."
2. Get to know the dataset. Clone the repo into a local IDE, copy the GLUE data into the project directory, and get ready to write the run script. MRPC (GLUE/glue_data/MRPC/train.tsv) is a good first choice: the task is to judge whether two sentences have the same meaning, and at roughly 3.7k training pairs it is small enough to play with and debug.
3. Run script (reference command from the README):

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/
The arguments actually used locally (Windows paths, batch size reduced to 8):

--task_name=MRPC
--do_train=true
--do_eval=true
--data_dir=D:\root\code\python\BERT-BiLSTM-CRF-NER-master\bert_base\bert\glue_data\MRPC
--vocab_file=D:\bert_uncased_L-12_H-768_A-12\vocab.txt
--bert_config_file=D:\bert_uncased_L-12_H-768_A-12\config.json
--init_checkpoint=D:\bert_uncased_L-12_H-768_A-12\bert_model.ckpt
--max_seq_length=128
--train_batch_size=8
--learning_rate=2e-5
--num_train_epochs=3.0
--output_dir=D:\root\code\python\BERT-BiLSTM-CRF-NER-master\bert_base\bert\output
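To double-check what run_classifier.py will consume, it can help to peek at the MRPC file from step 2. A minimal sketch (the relative path and the standard MRPC column layout, Quality / #1 ID / #2 ID / #1 String / #2 String, are assumptions to verify against the local copy):

```python
import csv

# Print the first few MRPC sentence pairs and their labels
path = r"GLUE/glue_data/MRPC/train.tsv"
with open(path, encoding="utf-8") as f:
    # QUOTE_NONE because MRPC sentences contain unescaped quote characters
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    print(header)  # expected: Quality, #1 ID, #2 ID, #1 String, #2 String
    for i, row in enumerate(reader):
        label, s1, s2 = row[0], row[3], row[4]
        print(label, "|", s1[:60], "|", s2[:60])
        if i == 2:  # three examples are enough to see the format
            break
```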