
Training a GPT-2 Model with Megatron-LM


Reference: 基于Megatron-LM从0到1完成GPT2模型预训练、模型评估及推理 - 知乎 (zhihu.com) (a Zhihu walkthrough of GPT-2 pretraining, evaluation, and inference with Megatron-LM from scratch)

1. Setting up the environment (a painful process)

Conclusions first, because the trial-and-error took forever:

GPU: Tesla P100

CUDA 11.8 (other versions would work; older is safer)

PyTorch 2.1.0 (other versions would work; 2.1 still has a few minor pitfalls)

Megatron-LM (tag v2.5); the latest code is unusable here because transformer_engine has GPU architecture requirements

Pick the PyTorch container image carefully (nothing too new, since this GPU cannot keep up; I wasted a lot of time here). Any image that ships apex is fine.

First cd into the Megatron-LM directory and install the dependencies: pip install -r requirements.txt

TensorFlow is not needed.

The PyTorch and CUDA versions must match each other.

Problems hit while installing apex:

1. CUDA and torch version mismatch

The machine originally had CUDA 11.4 with torch 1.12+cu113 (absurdly, there is no cu114 build of torch).

Either edit apex's setup.py and delete the version-match check,

or reinstall CUDA and torch so that they match.

I did both, but the place where I actually got stuck had nothing to do with this.

2. The C++ files would not compile! Why?! (gave up; never solved this)

 from /root/yjy/Megatron-LM/apex/csrc/flatten_unflatten.cpp:1:
  /usr/include/c++/9/cwchar:44:10: fatal error: wchar.h: No such file or directory

ChatGPT told me to install:

    sudo apt-get update
    sudo apt-get install libc6-dev

But that failed with yet another error, and I did not dare touch linux-headers-5.4.0-165.
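Before touching system packages, a quick diagnostic worth running (my addition, not in the original notes) is to check whether the header actually exists and which package owns it:

    ls -l /usr/include/wchar.h
    dpkg -S /usr/include/wchar.h   # on Ubuntu this header belongs to libc6-dev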

I gave up on this route and switched to a container image instead!

Setting up the environment from a container image (another winding round of version swapping):

 PyTorch Release 21.05 - NVIDIA Docs

Pull the image, choosing the version carefully (nothing too new, or it will not match this hardware; I wasted a lot of time here). Any image that ships apex is fine.
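For example, pulling the tag used in the run command below (any NGC PyTorch tag that ships apex works the same way):

    docker pull nvcr.io/nvidia/pytorch:23.04-py3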

    docker run -dt --name pytorch_yjy --restart=always --gpus all \
        --network=host \
        --shm-size 8G \
        -v /mnt/VMSTORE/yjy/Megatron-LM-GPT:/Megatron-LM-GPT \
        -w /Megatron-LM-GPT \
        nvcr.io/nvidia/pytorch:23.04-py3 \
        /bin/bash
docker exec -it pytorch_yjy bash

Missing amp_C

apex installed fine in this container, but as soon as training actually used it, it errored again: amp_C was missing!

Fix 1 (don't use this one; it errors again later):

This apex commit worked:

NVIDIA/apex at 3303b3e7174383312a3468ef390060c26e640cb1 (github.com)

Megatron-LLaMA/megatron/model/fused_layer_norm.py


But it then complains about a missing _six module and a missing inf; these are small edits (newer PyTorch removed torch._six, so the usual fix is to replace its string_classes with plain str and to import inf from math instead).

Fix 2:

Install apex with python setup.py install rather than with pip.
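A minimal sketch of that source install, including the extension flags the error below asks for (run from inside the apex checkout):

    cd apex
    python setup.py install --cpp_ext --cuda_ext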

Other errors:

    RuntimeError: ColumnParallelLinear was called with gradient_accumulation_fusion set to True but the custom CUDA extension fused_weight_gradient_mlp_cuda module is not found. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext ." Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.

Related CSDN posts on the same errors: 「安装apex报错 (install the apex with cuda support)」 and 「ModuleNotFoundError: No module named 'fused_layer_norm_cuda'」.

Add the CUDA environment variables:

    export CUDA_HOME=/usr/local/cuda-11.3
    export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
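A sanity check I would add here (not in the original notes): confirm that the toolkit CUDA_HOME points at matches the CUDA build compiled into torch.

    $CUDA_HOME/bin/nvcc --version
    python -c "import torch; print(torch.__version__, torch.version.cuda)"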

Error: the 21.05 image does not ship transformer_engine.

Then it started segfaulting, and I never understood why.

Switched again to 22.10: no more segfault, but te.pytorch.DotProductAttention is missing! Installing transformer_engine by hand threw yet more errors. My own fault for not reading the docs: that API only exists in transformer_engine 0.6 and later.
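A quick way to see whether the installed transformer_engine exposes that API at all (my own diagnostic; it assumes the package imports as transformer_engine.pytorch):

    python -c "import transformer_engine.pytorch as te; print(hasattr(te, 'DotProductAttention'))"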

Errors when running in the 23.04 image: the root cause is that the CUDA, GPU, and torch versions do not match each other.

    import torch
    # Is CUDA usable at all?
    torch.cuda.is_available()
    # How many GPUs are visible?
    torch.cuda.device_count()
    # GPU name (device indices start at 0)
    torch.cuda.get_device_name(0)
    # Index of the current device
    torch.cuda.current_device()

Attempt 1: uninstall torch and reinstall it, but that left transformer_engine broken (it could no longer be linked against).

Attempt 2: do without transformer_engine:

vim megatron/core/transformer/custom_layers/transformer_engine.py

No good, it is still needed. Delete the container and start over, right back to the original error.

WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication

Checking compute capability: see 「查看NVIDIA显卡计算能力」 (CSDN).

 

The core problem is that the GPU's compute capability does not match the CUDA build.
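A minimal check from PyTorch (my addition): the Tesla P100 is Pascal, compute capability 6.0, below the Volta-or-newer hardware that transformer_engine and the newer fused kernels reportedly target.

    python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
    # a P100 reports capability (6, 0)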

How do you layer a PyTorch image on top of a plain CUDA image?

In the end I reinstalled CUDA 11.8 inside the 23.04 image,

because an image with only CUDA in it is far too bare; you end up reinstalling everything by hand!

Then I reinstalled torch.

But transformer_engine refused to reinstall, and while digging into why, I finally discovered that the Tesla P100 simply cannot run it at all!

 

In the end I went with Megatron-LM v2.5, which does not use transformer_engine at all!

2. Preparing the dataset

Dataset download link:

MEGA

Install the dependencies:

pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract -i https://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn

Go into the openwebtext folder, put the downloaded URL lists into the urls folder, then deduplicate them:

python3 blacklist_urls.py urls clean_urls.txt

Download the page text from the URLs with this tool:

yet-another-account/openwebtext at dependabot/pip/certifi-2022.12.7 (github.com)

python3 download.py clean_urls.txt --n_procs=15 --timeout=15 --output_dir scraped

Then it errored:

TypeError: 'ExtractResult' object is not iterable

    Traceback (most recent call last):
      File "download.py", line 307, in <module>
        cdata = list(pool.imap(download, chunk, chunksize=1))
      File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 868, in next
        raise value
    TypeError: 'ExtractResult' object is not iterable

Fix: the tldextract version was wrong; switching to a different release made it run.

The problem shows up when tldextract is used to pull the domain out of a URL: the error says the result of tldextract.extract(url) is not iterable, most likely because a newer release changed the result type.
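A quick diagnostic (mine, not from the original post) to see whether your tldextract still returns an iterable result; on newer releases the list() call fails with exactly the TypeError above:

    python -c "import tldextract; r = tldextract.extract('https://example.com'); print(type(r)); print(list(r))"

If it raises TypeError, either downgrade tldextract or change the offending line in download.py to read the named fields (r.subdomain, r.domain, r.suffix) instead of unpacking the result.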

 

Once the download finishes, a scraped folder appears, and the text fetched for each URL is stored under its data subfolder.

Use merge_jsons.py to merge all the txt files in that folder into a single JSON file:

python3 tools/openwebtext/merge_jsons.py --data_path=scraped/data --output_file=data/merged_output.json

Data cleaning:

python3 cleanup_dataset.py tools/openwebtext/merged_output.json data/merged_cleand.json

Error: tokenizer not found.

I kept assuming it could not see Megatron's tokenizer, or that a pip tokenizer package was missing; after a lot of fiddling it turned out the repo is simply missing a tokenizer.py.

Here is tokenizer.py:

    # coding=utf-8
    # Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    import sys
    sys.path.append('..')

    from megatron.tokenizer.gpt2_tokenization import GPT2Tokenizer


    class Tokenizer:

        def __init__(self, cache_dir=None):
            self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2',
                                                           cache_dir=cache_dir)
            self.tokenizer.max_len = int(1e12)
            self.eod_token = self.tokenizer.encoder['<|endoftext|>']
            assert self.eod_token < 65535, 'vocab size will not fit in uint16'
            print('> GPT2 tokenizer with {} vocab size and eod token {} ...'.format(
                len(self.tokenizer.encoder), self.eod_token))

        def tokenize_document(self, document):
            tokens = self.tokenizer.encode(document)
            tokens.append(self.eod_token)
            return tokens

 fix issue #33 missing modules by hyoo · Pull Request #89 · NVIDIA/Megatron-LM (github.com)

 cannot import name 'cached_path' from 'transformers'

cannot import name 'cached_path' from 'transformers' · Issue #1475 · ThilinaRajapakse/simpletransformers (github.com)

Shuffle the cleaned dataset:

shuf data/merged_cleand.json -o data/train_data.json

Data preprocessing:

    python tools/preprocess_data.py \
        --input data/train_data_half.json \
        --output-prefix data/my-gpt2_half \
        --vocab-file model/gpt2-vocab.json \
        --tokenizer-type GPT2BPETokenizer \
        --merge-file model/gpt2-merges.txt \
        --append-eod \
        --workers 20

The output files are named <output-prefix>_text_document.bin and .idx (with the prefix above, my-gpt2_half_text_document.*). When training GPT-2, pass the name without the extension as --data-path.

That completes the data processing! The full training script for the 345M model looks like this:

    #!/bin/bash
    # Runs the "345M" parameter model

    export CUDA_DEVICE_MAX_CONNECTIONS=1

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output

    # Model hyperparameters
    GPT_ARGS="
        --num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --micro-batch-size 1 \
        --global-batch-size 2 \
        --lr 0.00015 \
        --train-iters 5000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --min-lr 1.0e-5 \
        --weight-decay 1e-2 \
        --lr-warmup-fraction .01 \
        --clip-grad 1.0 \
        --fp16
    "

    # Dataset and vocabulary paths
    DATA_ARGS="
        --data-path $DATA_PATH \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --data-impl mmap \
        --split 700,200,100
    "

    # Checkpoint saving, evaluation, and logging
    OUTPUT_ARGS="
        --log-interval 100 \
        --save-interval 10000 \
        --eval-interval 1000 \
        --eval-iters 10
    "

    # Launch the training job
    torchrun pretrain_gpt.py \
        $GPT_ARGS \
        $DATA_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH

3. Model training

Monitor GPU utilization with:

watch -n 5 nvidia-smi
    python -m gpt2 train --train_corpus data/wikitext-103-raw/wiki.train.raw \
        --eval_corpus data/wikitext-103-raw/wiki.test.raw \
        --vocab_path build/vocab.txt \
        --save_checkpoint_path ckpt-gpt2.pth \
        --save_model_path gpt2-pretrained.pth \
        --batch_train 128 \
        --batch_eval 128 \
        --seq_len 64 \
        --total_steps 1000000 \
        --eval_steps 500 \
        --save_steps 5000 \
        --use_amp \
        --use_grad_ckpt \
        --gpus 4

Single node, single GPU:

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output

    GPT_ARGS="--num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --micro-batch-size 4 \
        --global-batch-size 8 \
        --lr 0.00015 \
        --train-iters 500000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --lr-warmup-fraction .01 \
        --fp16"

    OUTPUT_ARGS="--log-interval 10 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --checkpoint-activations"

    python pretrain_gpt.py \
        $GPT_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH

 

 Distributed training:

DP (data parallelism):
    WORLD_SIZE=4
    TENSOR_MP_SIZE=1
    PIPELINE_MP_SIZE=1

    DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
        --nnodes 1 \
        --node_rank 0 \
        --master_addr localhost \
        --master_port 6000"

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output/mp

    GPT_ARGS="--num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --micro-batch-size 2 \
        --global-batch-size 8 \
        --lr 0.00015 \
        --train-iters 500000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --lr-warmup-fraction .01 \
        --fp16"

    OUTPUT_ARGS="--log-interval 10 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --checkpoint-activations"

    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
        $GPT_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --tensor-model-parallel-size $TENSOR_MP_SIZE \
        --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
        --DDP-impl torch

 

PP (pipeline parallelism):
    WORLD_SIZE=4
    TENSOR_MP_SIZE=1
    PIPELINE_MP_SIZE=4

    DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
        --nnodes 1 \
        --node_rank 0 \
        --master_addr localhost \
        --master_port 6000"

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output/mp

    GPT_ARGS="--num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --micro-batch-size 2 \
        --global-batch-size 8 \
        --lr 0.00015 \
        --train-iters 500000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --lr-warmup-fraction .01 \
        --fp16"

    OUTPUT_ARGS="--log-interval 10 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --checkpoint-activations"

    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
        $GPT_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --tensor-model-parallel-size $TENSOR_MP_SIZE \
        --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
        --DDP-impl local

 

TP (tensor parallelism):
    WORLD_SIZE=4
    TENSOR_MP_SIZE=4
    PIPELINE_MP_SIZE=1

    DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
        --nnodes 1 \
        --node_rank 0 \
        --master_addr localhost \
        --master_port 6000"

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output/tp

    GPT_ARGS="--num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --micro-batch-size 2 \
        --global-batch-size 8 \
        --lr 0.00015 \
        --train-iters 500000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --lr-warmup-fraction .01 \
        --fp16"

    OUTPUT_ARGS="--log-interval 10 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --checkpoint-activations"

    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
        $GPT_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --tensor-model-parallel-size $TENSOR_MP_SIZE \
        --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
        --DDP-impl local

 

MP (model parallelism), 380M:
    WORLD_SIZE=4
    TENSOR_MP_SIZE=2
    PIPELINE_MP_SIZE=2

    DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
        --nnodes 1 \
        --node_rank 0 \
        --master_addr localhost \
        --master_port 6000"

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output/mp

    GPT_ARGS="--num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --micro-batch-size 2 \
        --global-batch-size 8 \
        --lr 0.00015 \
        --train-iters 500000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --lr-warmup-fraction .01 \
        --fp16"

    OUTPUT_ARGS="--log-interval 10 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --checkpoint-activations"

    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
        $GPT_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --tensor-model-parallel-size $TENSOR_MP_SIZE \
        --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
        --DDP-impl local

 

 

MP (model parallelism), 1.6B (these hyperparameters are not quite right):
    WORLD_SIZE=4
    TENSOR_MP_SIZE=2
    PIPELINE_MP_SIZE=2

    DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
        --nnodes 1 \
        --node_rank 0 \
        --master_addr localhost \
        --master_port 6000"

    CHECKPOINT_PATH=model/model_optim_rng.pt
    VOCAB_FILE=model/gpt2-vocab.json
    MERGE_FILE=model/gpt2-merges.txt
    DATA_PATH=data/my-gpt2_text_document
    MODEL_PATH=model/output/mp

    GPT_ARGS="--num-layers 48 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1600 \
        --micro-batch-size 16 \
        --global-batch-size 64 \
        --lr 0.00015 \
        --train-iters 5000 \
        --lr-decay-iters 320000 \
        --lr-decay-style cosine \
        --vocab-file $VOCAB_FILE \
        --merge-file $MERGE_FILE \
        --lr-warmup-fraction .01 \
        --fp16"

    OUTPUT_ARGS="--log-interval 10 \
        --save-interval 500 \
        --eval-interval 100 \
        --eval-iters 10 \
        --checkpoint-activations"

    python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt.py \
        $GPT_ARGS \
        $OUTPUT_ARGS \
        --save $MODEL_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --tensor-model-parallel-size $TENSOR_MP_SIZE \
        --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
        --DDP-impl local
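For reference (standard Megatron batch bookkeeping, not spelled out in the original notes): the data-parallel size is WORLD_SIZE / (TP × PP), and gradient accumulation makes up the difference between the global and micro batch sizes. Plugging in the 1.6B settings above:

    WORLD_SIZE=4; TP=2; PP=2; MICRO_BATCH=16; GLOBAL_BATCH=64
    DP=$(( WORLD_SIZE / (TP * PP) ))                    # data-parallel size = 1
    ACC_STEPS=$(( GLOBAL_BATCH / (MICRO_BATCH * DP) ))  # gradient-accumulation micro-steps = 4
    echo "DP=$DP, accumulation steps=$ACC_STEPS"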

Training LLaMA-7B:

The wiki dump's XML has already been converted to JSON, but how do I actually get the text out of the URLs??

Reference posts (CSDN):

Wikidata 数据包下载+格式转换+入库MySQL

快速使用wikiextractor提取维基百科语料的简单用法

wiki中文文本语料下载并处理 ubuntu + python2.7
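For getting plain text out of a Wikipedia XML dump, the usual tool is wikiextractor (the subject of the second link above); a typical invocation looks like this, with an illustrative dump filename:

    pip install wikiextractor
    python -m wikiextractor.WikiExtractor zhwiki-latest-pages-articles.xml.bz2 --json -o extracted_wiki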

In the end I used the openwebtext dataset anyway.

    python Megatron-LLaMA/tools/preprocess_data.py \
        --input data/train_halfdata.json \
        --output-prefix data/openwebtexthalf \
        --dataset-impl mmap \
        --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path llama7B_hf \
        --append-eod \
        --workers 20 \
        --chunk-size 25

Convert the Hugging Face checkpoint into one Megatron can load. But these cards cannot fit the full 7B model, and once I cut the model parameters down, the converted checkpoint no longer matched.

    python Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py \
        --load_path "llama7B_hf" \
        --save_path "llama7B_hf_ab" \
        --target_tensor_model_parallel_size 2 \
        --target_pipeline_model_parallel_size 4 \
        --target_data_parallel_size 1 \
        --target_params_dtype "fp16" \
        --make_vocab_size_divisible_by 1 \
        --print-checkpoint-structure \
        --megatron-path "Megatron-LLaMA"

This one converts the weights and tokenizer.model downloaded from the official LLaMA release into Hugging Face format (the conversion script that ships with the transformers library):

    python convert_llama_weights_to_hf.py \
        --input_dir llama7B \
        --model_size 7B \
        --output_dir llama7B_hf

After the conversion succeeds:

    #!/bin/bash
    DATASET="data/openwebtexthalf"
    TP_SIZE=2
    PP_SIZE=4
    WORLD_SIZE=8
    MICRO_BATCH_SIZE=1
    # The int is the number of micro steps of gradient accumulation
    GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
    # GLOBAL_BATCH_SIZE=128
    JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"

    LOAD_CHECKPOINT_PATH="llama7B_hf_ab/"
    SAVE_CHECKPOINT_PATH="model/llama-7/"
    TOKENIZER_PATH="llama7B_hf_ab/"
    TENSORBOARD_DIR="model/tensorboard/"

    TRAIN_ITERS=1000
    EVAL_ITERS=10
    EVAL_INTERVAL=1000
    SAVE_INTERVAL=100
    LOG_INTERVAL=1

    # Setting --tensorboard-queue-size to 1 significantly slows down the training
    options=" \
        --finetune \
        --sequence-parallel \
        --tensor-model-parallel-size ${TP_SIZE} \
        --pipeline-model-parallel-size ${PP_SIZE} \
        --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 4096 \
        --max-position-embeddings 4096 \
        --no-position-embedding \
        --use-rotary-position-embeddings \
        --swiglu \
        --ffn-hidden-size 11008 \
        --disable-bias-linear \
        --RMSNorm \
        --layernorm-epsilon 1e-6 \
        --causal-lm \
        --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path $TOKENIZER_PATH \
        --make-vocab-size-divisible-by 1 \
        --init-method-std 0.01 \
        --micro-batch-size ${MICRO_BATCH_SIZE} \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
        --train-iters ${TRAIN_ITERS} \
        --lr 6.0e-5 \
        --lr-decay-iters 10 \
        --lr-warmup-iters 5 \
        --min-lr 6.0e-6 \
        --override-opt_param-scheduler \
        --lr-decay-style cosine \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --clip-grad 1.0 \
        --weight-decay 0.1 \
        --overlapped-distributed-optimizer \
        --reduce-bucket-size=2e8 \
        --no-gradient-accumulation-fusion \
        --dataloader-type cyclic \
        --data-impl mmap \
        --data-path ${DATASET} \
        --split 98,2,0 \
        --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
        --save-interval ${SAVE_INTERVAL} \
        --save ${SAVE_CHECKPOINT_PATH} \
        --load ${LOAD_CHECKPOINT_PATH} \
        --no-load-optim \
        --log-interval ${LOG_INTERVAL} \
        --tensorboard-dir ${TENSORBOARD_DIR} \
        --tensorboard-queue-size 1000 \
        --log-timers-to-tensorboard \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
        --job-name ${JOB_NAME} \
        --bf16 \
        --recompute-activations \
        --recompute-granularity selective \
        "

    torchrun --nproc_per_node=8 --master_port=6000 Megatron-LLaMA/pretrain_llama.py ${options}

 

It finally ran! The earlier failure was that something used inside Megatron-LLaMA's fused_kernels would not compile, and I never figured out why.

Rough parameter-count estimate: hidden_size × ((4 × hidden_size + 3 × ffn_hidden_size) × num_layers + vocab_size + context_length)
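As a sanity check, plugging in the LLaMA-7B shapes from the script above (hidden 4096, FFN 11008, 32 layers, LLaMA's 32000-token vocabulary, 4096 context) gives roughly 6.6B parameters:

    echo $(( 4096 * ((4*4096 + 3*11008) * 32 + 32000 + 4096) ))   # 6623854592 ≈ 6.6B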

 
