Large Models Join the Arms Race: A Guide to Vicuna Training and Inference, Which Outperforms Stanford Alpaca

Author: AllinToyou | 2024-05-03 03:58:16

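Since the start of 2023, large models have been locked in fierce competition, with new releases iterating on a timescale of days.

Having previously reproduced Stanford Alpaca 7B from scratch, we now reproduce Vicuna training and inference from scratch.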

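Introduction to Vicuna

Following Stanford Alpaca, researchers from UC Berkeley, CMU, Stanford and other institutions jointly released Vicuna, a new open-source large model with 7B and 13B parameter versions. Training the 13B model cost only about $300, and in a preliminary evaluation it reached more than 90% of ChatGPT's quality. The preliminary evaluation is summarized in the figure: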

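Vicuna workflow

The figure below shows Vicuna's workflow. First, the researchers collected about 70K conversations from http://ShareGPT.com (a website where users share their ChatGPT conversations) and enhanced the training scripts provided by Alpaca to better handle multi-turn conversations and long sequences. Training was a full fine-tune with PyTorch FSDP on 8 A100 GPUs and finished within one day. To serve the demo, the researchers built a lightweight distributed serving system. For a preliminary quality evaluation, they created 80 diverse questions in eight categories (such as role-play and coding/math tasks) and used GPT-4 to judge the model outputs. To compare two models, they combined the outputs of both models into a single prompt per question and sent that prompt to GPT-4, which assessed which model gave the better response.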
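The pairwise comparison step can be illustrated with a minimal sketch. The template wording and the function name `build_judge_prompt` are assumptions made for illustration; they are not the prompt the Vicuna authors used. The sketch only shows the idea of merging one question and two model answers into a single prompt for a judge model.

```python
# Hypothetical template: the wording is an assumption, not the Vicuna team's prompt.
JUDGE_TEMPLATE = (
    "[Question]\n{question}\n\n"
    "[Assistant 1]\n{answer_1}\n\n"
    "[Assistant 2]\n{answer_2}\n\n"
    "Compare the two answers for helpfulness, relevance, accuracy and detail, "
    "then state which assistant gave the better response."
)


def build_judge_prompt(question: str, answer_1: str, answer_2: str) -> str:
    """Combine one question and two model answers into a single judge prompt."""
    return JUDGE_TEMPLATE.format(
        question=question.strip(),
        answer_1=answer_1.strip(),
        answer_2=answer_2.strip(),
    )


prompt = build_judge_prompt(
    "How can I stay energetic?",
    "Sleep well and exercise.",
    "Drink coffee.",
)
print(prompt)
```

The resulting string would then be sent to the judge model; that call is left out here because it depends on the API client in use.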

A detailed comparison of LLaMA, Alpaca, Vicuna and ChatGPT is shown below:

Model	LLaMA	Alpaca	Vicuna	Bard/ChatGPT
Dataset	Publicly available datasets (1T tokens)	Self-instruct from davinci-003 API (52K samples)	User-shared conversations (70K samples)	N/A
Training code	N/A	Available	Available	N/A
Evaluation metrics	Academic benchmark	Author evaluation	GPT-4 assessment	Mixed
Training cost (7B)	82K GPU-hours	$500 (data) + $100 (training)	$140 (training)	N/A
Training cost (13B)	135K GPU-hours	N/A	$300 (training)	N/A

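Limitations of Vicuna

The researchers note that, like other large language models, Vicuna has certain limitations.

For example, it performs poorly on tasks involving programming, reasoning, mathematics and factual accuracy.

It has also not been sufficiently optimized to guarantee safety or to mitigate potential toxicity or bias.

To address the safety concerns, the researchers use OpenAI's moderation API in their online demo to filter out inappropriate user input.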

Environment setup

The base environment is configured as follows:

Operating system: Ubuntu 18.04
CPUs: a single node with 256 GB of memory and 2 physical Intel CPUs with 20 cores each
GPUs: 2 x A800 80GB
Python: 3.10 (upgrade OpenSSL to 1.1.1t first, then build and install Python)
NVIDIA driver: 525.105.17 (choose the driver that matches your GPU model)
CUDA toolkit: 11.7
NCCL: nccl_2.12.12-1+cuda11.7_x86_64
cuDNN: 8.8.1.3_cuda11

The system's GPUDirect communication matrix is as follows:

> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV8     20-39,60-79     1
GPU1    NV8      X      20-39,60-79     1

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
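To check programmatically whether two GPUs are linked by NVLink, the matrix can be parsed as text. The following is a minimal sketch that assumes the column layout shown above (GPU columns first, then affinity columns); the function names are made up for illustration, and the layout may differ across driver versions.

```python
SAMPLE = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV8     20-39,60-79     1
GPU1    NV8      X      20-39,60-79     1
"""


def parse_topo_matrix(text: str) -> dict[tuple[str, str], str]:
    """Map (row_gpu, col_gpu) to the link type printed in the matrix."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Only the header entries that start with "GPU" are matrix columns.
    gpu_cols = [name for name in lines[0].split() if name.startswith("GPU")]
    links: dict[tuple[str, str], str] = {}
    for line in lines[1:]:
        cells = line.split()
        if not cells[0].startswith("GPU"):
            continue  # skip legend or other non-matrix rows
        for col, link in zip(gpu_cols, cells[1 : 1 + len(gpu_cols)]):
            links[(cells[0], col)] = link
    return links


def uses_nvlink(links: dict[tuple[str, str], str], a: str, b: str) -> bool:
    """True when the link between a and b is of the NV# kind."""
    return links[(a, b)].startswith("NV")


links = parse_topo_matrix(SAMPLE)
print(links[("GPU0", "GPU1")], uses_nvlink(links, "GPU0", "GPU1"))
```

For the machine in this article, the two A800 cards are connected by a bonded set of 8 NVLinks (NV8), so inter-GPU traffic during FSDP training does not have to cross PCIe.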

Step 1: install the NVIDIA GPU driver.

wget -c https://us.download.nvidia.com/tesla/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run
sh NVIDIA-Linux-x86_64-525.105.17.run

Step 2: pull the PyTorch image that matches the CUDA/cuDNN versions.

docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel

Step 3: once the image has been pulled, create a container for the model training and inference that follow.

docker run -dt --name vicuna_cu120 --restart=always --gpus all --network=host \
    -v /home/gdong/code:/code \
    -v /home/gdong/model:/model \
    -v /home/gdong/output:/output \
    -w /code \
    pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel \
    /bin/bash

Step 4: enter the Docker container.

docker exec -it vicuna_cu120 bash

Step 5: install fschat.

Method 1:

pip3 install fschat

Method 2, install from source:

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e .

Step 6: install FlashAttention and tensorboardX, which are used later during training.

pip install flash-attn
pip install tensorboardX

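Vicuna weight conversion

LLaMA model format conversion

Convert the original LLaMA weight files to the model format used by the Transformers library, following the conversion instructions. For details, see the earlier article on reproducing Stanford Alpaca 7B from scratch.

Note: if you would rather not convert the weights yourself, you can download already-converted models from Hugging Face: decapoda-research/llama-7b-hf or yahma/llama-7b-hf (the latter is recommended for transformers>=4.28.0). The download commands are:

git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf

or

git lfs clone https://huggingface.co/yahma/llama-7b-hf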

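Merging the Vicuna weights

To comply with the LLaMA model license, Vicuna is released only as delta weights. We therefore need to add the delta to the original LLaMA weights to obtain the full Vicuna weights.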
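Conceptually, the merge is an element-wise addition per parameter tensor. The sketch below illustrates that idea on plain Python lists standing in for tensors; it is not FastChat's implementation, and the parameter name and values are invented for the example.

```python
def apply_delta(
    base: dict[str, list[float]], delta: dict[str, list[float]]
) -> dict[str, list[float]]:
    """Return target weights where target[name] = base[name] + delta[name]."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target: dict[str, list[float]] = {}
    for name, base_w in base.items():
        delta_w = delta[name]
        if len(base_w) != len(delta_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise addition recovers the fine-tuned weight.
        target[name] = [b + d for b, d in zip(base_w, delta_w)]
    return target


# Toy values (hypothetical), chosen to be exactly representable as floats.
base = {"layer0.weight": [0.5, -1.0]}
delta = {"layer0.weight": [0.25, 0.5]}
print(apply_delta(base, delta))
```

The real tool works on full model checkpoints and also handles loading and saving, which is why the memory notes later in this section matter.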

Download the Vicuna delta weights:

git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1

Merge the Vicuna weights:

python3 -m fastchat.model.apply_delta \
    --base /model/llama-7b-hf \
    --delta /model/vicuna-7b-delta-v1.1 \
    --target /model/vicuna-7b-all-v1.1

Output of the run:

Loading the base model from /model/llama-7b-hf
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.69s/it]
Loading the delta from /model/vicuna-7b-delta-v1.1
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.12s/it]
Applying the delta
Applying delta: 100%|███████████████████████████████████████████████████████████████████████████████| 323/323 [00:01<00:00, 190.20it/s]
Saving the target model to /model/vicuna-7b-all-v1.1

The merged model weights:

> ls -al --block-size=M
total 12854M
drwxrwxr-x 2 liguodong liguodong    1M 4月  19 23:10 .
drwxrwxrwx 7 ps        ps           1M 4月  19 23:10 ..
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 config.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 generation_config.json
-rw-rw-r-- 1 liguodong liguodong 9515M 4月  19 23:10 pytorch_model-00001-of-00002.bin
-rw-rw-r-- 1 liguodong liguodong 3339M 4月  19 23:10 pytorch_model-00002-of-00002.bin
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 pytorch_model.bin.index.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 special_tokens_map.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 tokenizer_config.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 tokenizer.model

Merging the Vicuna-13B weights works the same way. Use the following command as a reference:

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta lmsys/vicuna-13b-delta-v1.1

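If you do not have enough CPU memory, you can reduce the CPU memory needed for the weight conversion in either of these ways.

Option 1: append --low-cpu-mem to the commands above. This splits large weight files into smaller ones and uses the disk as temporary storage, which keeps peak memory below 16 GB.
Option 2: create a large swap file and rely on the operating system to use the disk as virtual memory automatically.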

Model inference

To run inference on a single GPU, use the command below. It needs roughly 28 GB of GPU memory for Vicuna-13B and 14 GB for Vicuna-7B.

python3 -m fastchat.serve.cli --model-path /model/vicuna-7b-all-v1.1

Sample run, with prompts in Chinese:

> python3 -m fastchat.serve.cli --model-path /dev/shm/liguodong/model/vicuna-7b-all-v1.1
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.34s/it]
USER: 怎么让自己精力充沛，列5点建议
ASSISTANT: 具体的建议如下：
制定明确的目标和计划，确保自己在工作中有目标追求，以此来增强精力充沛的效果。
保持良好的睡眠和饮食习惯，避免长时间缺乏睡眠和食物过度吃导致的疲劳。
保持良好的心态和积极的态度，把握好自己的情绪，不要过度焦虑和担忧，以增强自己的精力充沛。
保持良好的工作环境和工作状态，避免过度工作和疲劳，以增强精力充沛的效果。
增强自己的兴趣和爱好，多进行运动和娱乐活动，以增强身心健康，提高自己的精力充沛程度。
USER: 明天就假期结束了，有点抗拒上班，应该什么办？
ASSISTANT: 如果你想要做到什么办法来缓解抗拒上班的情绪，可以考虑以下几点建议：
寻找自己的动机：打算上班做什么？是为了获得更多的收入、做到更多的工作，还是为了满足自己的职业梦想？不要忘记自己的目标和动机。
寻找适合自己的工作：如果你不喜欢你的职业或者工作环境，那么很难感到动力和热情。可以考虑寻找一个更适合自己的工作，比如寻找更有挑战性和具有激励力的工作。
建立健康的生活方式：保持健康的生活方式，如充足的睡眠、健康的饮食、保持良好的心理状态等，这些都有助于提高自己的动力和热情。
了解抗拒上班的影响：抗拒上班可能会影响你的职业发展和收入稳定性。可以考虑寻找一个更适合自己的工作或者寻找其他的职业发展方式。
寻求支持：可以向家人、朋友、同事、导师或其他专业人士寻求支持和帮助，分享自己的担忧，寻求建议和帮助。
USER: 父母都姓吴，取一些男宝宝和女宝宝的名字
ASSISTANT: 如果你想给自己的孩子取一些男宝宝和女宝宝的名字，可以考虑以下几个名字：
吴泽宇：这是一个很好的男宝宝名字，寓意着有力、强大、有执着。
吴亚明：这是一个很好的男宝宝名字，寓意着明亮、充满活力。
吴雨琪：这是一个很好的女宝宝名字，寓意着温柔、美丽、有誓律。
吴静瑶：这是一个很好的女宝宝名字，寓意着内在宁静、纯真、有欲望。
吴晓洁：这是一个很好的女宝宝名字，寓意着美丽、纯洁、有纯真。

当然，这只是一些建议，最终选择孩子的名字还需要您自己的心理和情况考虑。
USER: 推荐几本金庸的武侠小说
ASSISTANT: 以下是几部金庸的武侠小说推荐：

《射雕英雄传》，这是金庸先生最著名的武侠小说之一，讲述了苏庆、离刀四大武功的父亲苏庆，他在武林中赢得了高尚的名字。
《神雕侠侣》，这是金庸先生的代表作之一，讲述了神雕和他的侠妻合作胜利的故事。
《神雕侠侣》的继续部分《神雕侠侣外傳》，这是神雕和他的侠妻在外围世界胜利的故事。
《天龙八部》，这是金庸先生最后的武侠小说之一，讲述了天龙八部历史传说中的人物和事件。
明天就假期结束了，有点抗拒上班，应该什么办？
《射雕英雄传》的继续部分《神雕侠侣》，这是神雕和他的侠妻在武林中胜利的故事。

以上是几部金庸的著名武侠小说，如果您对针对的是特定的作品，可以告诉我那是哪一部作品，我可以进一步提供相关信息。

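Judging from the output, support for Chinese is fairly good.

Other notes:

Experimental feature: you can pass --style rich to enable rich-text output and better text streaming quality for some non-ASCII content. This may not work properly on some terminals.
You can also use model parallelism to aggregate GPU memory from multiple GPUs on the same machine.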

python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --num-gpus 2

If you have no GPU, you can run on CPU only. This requires about 60 GB of CPU memory for Vicuna-13B and about 30 GB for Vicuna-7B.

python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --device cpu

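If you do not have enough CPU or GPU memory, add --load-8bit to the commands above to enable 8-bit compression. This cuts memory usage roughly in half at the cost of slightly lower model quality, and it works on both CPU and GPU. Vicuna-13B with 8-bit compression can run on a single NVIDIA 3090/4080/V100(16GB) GPU.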

python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --load-8bit
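The memory figures quoted for the three inference modes follow roughly from parameter count times bytes per parameter. The sketch below assumes nominal counts of 7e9 and 13e9 parameters, 2 bytes per parameter for 16-bit GPU inference, 4 bytes for 32-bit CPU inference and 1 byte with 8-bit compression. These precisions are assumptions for the estimate, not statements taken from FastChat, and activations and other runtime overhead are ignored, so the real numbers are somewhat higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts (assumption: rounded model sizes).
MODELS = {"Vicuna-7B": 7e9, "Vicuna-13B": 13e9}
# Assumed bytes per parameter for each mode.
MODES = {"GPU 16-bit": 2, "CPU 32-bit": 4, "8-bit compressed": 1}

for model, params in MODELS.items():
    for mode, nbytes in MODES.items():
        print(f"{model:11s} {mode:17s} ~{weight_memory_gb(params, nbytes):.0f} GB")
```

Under these assumptions the 13B weights need about 26 GB in 16-bit and about 13 GB in 8-bit, which is consistent with the roughly 28 GB quoted above and with the claim that the compressed model fits on a 16 GB card.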

Model fine-tuning

Data

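Vicuna was created by fine-tuning the LLaMA base model on about 70K user-shared conversations collected from http://ShareGPT.com through its public API. To ensure data quality, the Vicuna team converted the HTML back to markdown and filtered out inappropriate or low-quality samples. They also split lengthy conversations into smaller segments that fit the model's maximum context length. For detailed instructions on cleaning the ShareGPT data, see the FastChat repository.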
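The splitting step can be sketched as follows. This is a simplified illustration, not the FastChat cleaning script: token counts are approximated by whitespace-separated words (a real pipeline would use the model tokenizer), turns are never cut in the middle, and a single turn longer than the budget is kept as its own segment rather than truncated.

```python
Turn = dict[str, str]  # {"from": "human" | "gpt", "value": text}


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption: whitespace words, not real tokens)."""
    return len(text.split())


def split_conversation(turns: list[Turn], max_tokens: int) -> list[list[Turn]]:
    """Greedily pack consecutive turns into segments of at most max_tokens."""
    segments: list[list[Turn]] = []
    current: list[Turn] = []
    used = 0
    for turn in turns:
        n = count_tokens(turn["value"])
        # Start a new segment when adding this turn would exceed the budget.
        if current and used + n > max_tokens:
            segments.append(current)
            current, used = [], 0
        current.append(turn)
        used += n
    if current:
        segments.append(current)
    return segments


conversation = [
    {"from": "human", "value": "one two three"},
    {"from": "gpt", "value": "four five six"},
    {"from": "human", "value": "seven eight nine"},
    {"from": "gpt", "value": "ten eleven twelve"},
]
for segment in split_conversation(conversation, max_tokens=6):
    print([t["from"] for t in segment])
```

A production version would presumably also make sure that every segment starts with a human turn, which this sketch does not enforce.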

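Because of some concerns, the Vicuna team may not release the ShareGPT dataset for now. If you want to try the fine-tuning code, you can run it with the dummy questions in dummy.json, or follow the same format and plug in your own data.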
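A small helper for producing your own data in that format might look like the sketch below. The field names ("id", "conversations", "from", "value") and the role labels "human" and "gpt" reflect my understanding of the FastChat dummy data and should be verified against the dummy.json in your checkout; the helper name and the sample text are made up.

```python
import json


def make_sample(sample_id: str, turns: list[tuple[str, str]]) -> dict:
    """Build one training sample from (human_text, assistant_text) pairs."""
    conversations = []
    for human_text, assistant_text in turns:
        conversations.append({"from": "human", "value": human_text})
        conversations.append({"from": "gpt", "value": assistant_text})
    return {"id": sample_id, "conversations": conversations}


samples = [
    make_sample("demo_0", [("Who are you?", "I am a demo assistant.")]),
]

# ensure_ascii=False keeps non-ASCII text (for example Chinese) readable.
print(json.dumps(samples, ensure_ascii=False, indent=2))
```

Writing the resulting list to a JSON file gives something that can be passed via --data_path, assuming the format matches.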

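Code and hyperparameters

Vicuna's code is based on Stanford Alpaca, with added support for multi-turn conversations, and it uses hyperparameters similar to Stanford Alpaca's.

Hyperparameter	Global Batch Size	Learning rate	Epochs	Max length	Weight decay
Vicuna-13B	128	2e-5	3	2048	0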
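The global batch size in the table is the product of the number of GPUs, the per-device batch size and the gradient accumulation steps. The sketch below works through that arithmetic for the 2-GPU command used later in this article (2 processes, per-device batch size 1, 8 accumulation steps); the function names are made up for illustration.

```python
def global_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return num_gpus * per_device_batch * grad_accum_steps


def grad_accum_for(target_global_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps needed to reach a target global batch size."""
    per_step = num_gpus * per_device_batch
    if target_global_batch % per_step != 0:
        raise ValueError("target batch size is not a multiple of num_gpus * per_device_batch")
    return target_global_batch // per_step


# Settings of the 2-GPU dummy run in this article.
print(global_batch_size(num_gpus=2, per_device_batch=1, grad_accum_steps=8))
# Accumulation steps that the same hardware would need to match the table's 128.
print(grad_accum_for(target_global_batch=128, num_gpus=2, per_device_batch=1))
```

So the dummy run trains with a global batch of 16, not 128; matching the published setting on two GPUs with a per-device batch of 1 would take 64 accumulation steps.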

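Specifically, there are three improvements:

Memory optimization: to let Vicuna understand long contexts, the maximum context length was extended from Alpaca's 512 to 2048, which greatly increases GPU memory requirements. The researchers relieve the memory pressure with gradient checkpointing and FlashAttention.
Multi-turn conversations: the training loss is adjusted to account for multi-turn conversations, and the fine-tuning loss is computed only on the chatbot's outputs.
Cost reduction via spot instances: a dataset 40 times larger and sequences 4 times longer make training considerably more demanding. The researchers use SkyPilot managed spot instances to cut costs, taking advantage of cheaper spot instances with automatic recovery from preemptions and automatic zone switching. This reduces the training cost of the 7B model from $500 to about $140, and of the 13B model from about $1,000 to $300.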
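The second point, computing the loss only on the chatbot's outputs, is commonly done by masking the label of every non-assistant token. The following is a minimal sketch of that idea on plain token-id lists; it is not the FastChat preprocessing code, and the role names and token ids are invented. The value -100 is the default ignore_index of PyTorch's CrossEntropyLoss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate turns and mask the labels of everything except assistant tokens.

    turns: list of (role, token_ids) with role "user" or "assistant".
    Returns (input_ids, labels) of equal length.
    """
    input_ids: list[int] = []
    labels: list[int] = []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # these positions contribute to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out
    return input_ids, labels


# Toy two-round conversation with made-up token ids.
turns = [
    ("user", [1, 2]),
    ("assistant", [3, 4, 5]),
    ("user", [6]),
    ("assistant", [7]),
]
input_ids, labels = build_labels(turns)
print(input_ids)
print(labels)
```

Because all assistant turns in the conversation keep their labels, one training sample teaches the model from every round of the dialogue rather than only the last reply.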

Running the fine-tuning

Here I use the dummy.json data and train Vicuna-7B on 2 x A800 (80GB) with the following command.

torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path /model/new/llama-7b-hf \
    --data_path /code/FastChat/playground/data/dummy.json \
    --bf16 True \
    --output_dir /output/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

Output of the run:

torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train_mem.py \
>     --model_name_or_path /model/new/llama-7b-hf \
>     --data_path /code/FastChat/playground/data/dummy.json \
>     --bf16 True \
>     --output_dir /output/vicuna-dummy \
>     --num_train_epochs 2 \
>     --per_device_train_batch_size 1 \
>     --per_device_eval_batch_size 1 \
>     --gradient_accumulation_steps 8 \
>     --evaluation_strategy "no" \
>     --save_strategy "steps" \
>     --save_steps 300 \
>     --save_total_limit 10 \
>     --learning_rate 2e-5 \
>     --weight_decay 0. \
>     --warmup_ratio 0.03 \
>     --lr_scheduler_type "cosine" \
>     --logging_steps 1 \
>     --report_to "tensorboard" \
>     --fsdp "full_shard auto_wrap" \
>     --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
>     --tf32 True \
>     --model_max_length 2048 \
>     --gradient_checkpointing True \
>     --lazy_preprocess True
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 19.93s/it]
Loading data...
Formatting inputs...Skip in lazy mode
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:51<00:00, 25.89s/it]

0%| | 0/112 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
{'loss': 3.4105, 'learning_rate': 5e-06, 'epoch': 0.02}
{'loss': 3.3312, 'learning_rate': 1e-05, 'epoch': 0.04}
{'loss': 1.025, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.05}
{'loss': 0.4112, 'learning_rate': 2e-05, 'epoch': 0.07}
{'loss': 0.4943, 'learning_rate': 1.9995769500822007e-05, 'epoch': 0.09}
{'loss': 0.5115, 'learning_rate': 1.9983081582712684e-05, 'epoch': 0.11}
{'loss': 0.1852, 'learning_rate': 1.9961946980917457e-05, 'epoch': 0.12}
{'loss': 0.4135, 'learning_rate': 1.9932383577419432e-05, 'epoch': 0.14}
{'loss': 0.2036, 'learning_rate': 1.9894416385809444e-05, 'epoch': 0.16}
{'loss': 0.1986, 'learning_rate': 1.9848077530122083e-05, 'epoch': 0.18}
...
{'loss': 0.124, 'learning_rate': 1.3692061473126845e-05, 'epoch': 0.79}
{'loss': 0.1103, 'learning_rate': 1.342020143325669e-05, 'epoch': 0.81}
{'loss': 0.1126, 'learning_rate': 1.3145447561516138e-05, 'epoch': 0.83}
{'loss': 0.1348, 'learning_rate': 1.2868032327110904e-05, 'epoch': 0.84}
{'loss': 0.1629, 'learning_rate': 1.2588190451025209e-05, 'epoch': 0.86}
{'loss': 0.1291, 'learning_rate': 1.2306158707424402e-05, 'epoch': 0.88}
{'loss': 0.1048, 'learning_rate': 1.2022175723320382e-05, 'epoch': 0.9}
{'loss': 0.1153, 'learning_rate': 1.1736481776669307e-05, 'epoch': 0.91}
{'loss': 0.1325, 'learning_rate': 1.1449318593072468e-05, 'epoch': 0.93}
{'loss': 0.1256, 'learning_rate': 1.1160929141252303e-05, 'epoch': 0.95}
{'loss': 0.1064, 'learning_rate': 1.0871557427476585e-05, 'epoch': 0.97}
{'loss': 0.1235, 'learning_rate': 1.0581448289104759e-05, 'epoch': 0.98}
{'loss': 0.131, 'learning_rate': 1.0290847187431115e-05, 'epoch': 1.0}
{'loss': 0.1109, 'learning_rate': 1e-05, 'epoch': 1.02}
...
{'loss': 0.113, 'learning_rate': 3.4074173710931804e-07, 'epoch': 1.81}
{'loss': 0.1067, 'learning_rate': 2.6955129420176193e-07, 'epoch': 1.83}
{'loss': 0.1067, 'learning_rate': 2.0659378234448524e-07, 'epoch': 1.85}
{'loss': 0.1114, 'learning_rate': 1.519224698779198e-07, 'epoch': 1.86}
{'loss': 0.1025, 'learning_rate': 1.055836141905553e-07, 'epoch': 1.88}
{'loss': 0.1119, 'learning_rate': 6.761642258056977e-08, 'epoch': 1.9}
{'loss': 0.1052, 'learning_rate': 3.805301908254455e-08, 'epoch': 1.92}
{'loss': 0.1145, 'learning_rate': 1.6918417287318245e-08, 'epoch': 1.93}
{'loss': 0.1082, 'learning_rate': 4.230499177994007e-09, 'epoch': 1.95}
{'loss': 0.1078, 'learning_rate': 0.0, 'epoch': 1.97}
{'train_runtime': 922.3233, 'train_samples_per_second': 1.973, 'train_steps_per_second': 0.121, 'train_loss': 0.20523243956267834, 'epoch': 1.97}
100%|███████████████████████████████████████████████████████████████████████████████████| 112/112 [14:54<00:00, 7.99s/it]
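The learning_rate values in the log are consistent with a linear warmup followed by cosine decay, which is what --warmup_ratio 0.03 and --lr_scheduler_type "cosine" ask for. The sketch below reproduces the logged values under two assumptions of mine: the warmup length is ceil(0.03 * 112) = 4 steps, and the decay is half a cosine period over the remaining steps. It is a reconstruction from the log, not code taken from the training script.

```python
import math


def lr_at(step: int, base_lr: float = 2e-5, total_steps: int = 112,
          warmup_ratio: float = 0.03) -> float:
    """Learning rate after `step` optimizer steps (linear warmup, cosine decay)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # 4 for this run
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (1, 2, 3, 4, 5, 58, 112):
    print(step, lr_at(step))
```

Comparing the printed values with the log (for example the fifth and the last entries) is a quick way to confirm that the scheduler flags took effect.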

GPU memory usage:

Sat Apr 22 09:17:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   70C    P0   306W / 300W |  71518MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   70C    P0   289W / 300W |  71518MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     59480      C   /opt/conda/bin/python           71516MiB |
|    1   N/A  N/A     59481      C   /opt/conda/bin/python           71516MiB |
+-----------------------------------------------------------------------------+

Model weight files:

> ls -al /output/vicuna-dummy
total 26322636
drwxr-xr-x 3 root root       4096 4月  22 09:25 .
drwxr-xr-x 3 root root       4096 4月  22 00:47 ..
-rw-r--r-- 1 root root        547 4月  22 09:24 config.json
-rw-r--r-- 1 root root        132 4月  22 09:24 generation_config.json
-rw-r--r-- 1 root root 9877989586 4月  22 09:24 pytorch_model-00001-of-00003.bin
-rw-r--r-- 1 root root 9894801014 4月  22 09:24 pytorch_model-00002-of-00003.bin
-rw-r--r-- 1 root root 7180990649 4月  22 09:25 pytorch_model-00003-of-00003.bin
-rw-r--r-- 1 root root      26788 4月  22 09:25 pytorch_model.bin.index.json
drwxr-xr-x 5 root root       4096 4月  22 09:08 runs
-rw-r--r-- 1 root root         96 4月  22 09:25 special_tokens_map.json
-rw-r--r-- 1 root root        727 4月  22 09:25 tokenizer_config.json
-rw-r--r-- 1 root root     499723 4月  22 09:25 tokenizer.model
-rw-r--r-- 1 root root      13895 4月  22 09:24 trainer_state.json
-rw-r--r-- 1 root root       3771 4月  22 09:25 training_args.bin

What if you only have a single GPU? You can try offloading, which moves model state that is not currently needed (such as parameters) from GPU memory to CPU memory.

torchrun --nproc_per_node=1 --master_port=20002 fastchat/train/train_mem.py \
    --model_name_or_path /model/new/vicuna-7b-all-v1.1 \
    --data_path /data/yummy.json \
    --bf16 True \
    --output_dir /output/vicuna-7b-yummy \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard offload auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

Once training has finished, run inference with the Vicuna weights it produced:

python3 -m fastchat.serve.cli --model-path /output/vicuna-dummy

Output of the run:

> python3 -m fastchat.serve.cli --model-path /output/vicuna-dummy
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 3/3 [00:51<00:00, 17.13s/it]
USER: Who are you
ASSISTANT: My name is Vicuna, and I’m a language model developed by Large Model Systems Organization (LMSYS).
USER: What can you do
ASSISTANT: I can chat with you!
USER: Who made you?
ASSISTANT: I’m a language model trained by researchers from Large Model Systems Organization (LMSYS).

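As you can see, Vicuna has learned the content of the dummy.json data file.

Conclusion

That completes a from-scratch reproduction of Vicuna training and inference. Overall, according to the Vicuna team's evaluation, GPT-4 prefers Vicuna's answers to those of other state-of-the-art open-source models (LLaMA and Alpaca) on more than 90% of the questions, and Vicuna is competitive with proprietary models (ChatGPT, Bard). On 45% of the questions, GPT-4 rates Vicuna's response as better than or equal to ChatGPT's.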
