AllinToyou

这个屌丝很懒，什么也没留下！

热门标签

基于XTuner微调书生·浦语大模型

作者：AllinToyou | 2024-03-01 06:41:24

踩

基于XTuner微调书生·浦语大模型

1 概述

XTuner 是一个傻瓜式、轻量级的大语言模型微调工具箱，由MMRazor和MMDeploy联合开发。其以配置文件的形式封装了大部分微调场景，0基础的非专业人员也能一键开始微调；对于 7B 参数量的LLM，微调所需的最小显存仅为 8GB。

常见的两种微调策略：增量预训练、指令跟随

增量预训练：使用场景，让基座模型学到一些新知识，如某个垂直类领域的常识，训练数据：文章、书籍、代码等
指令跟随微调：使用场景，让模型学会对话模板，根据人类指令进行对话，训练数据：高质量的对话、问答数据

在基座模型进行增量预训练的模型对话效果往往不行，需要再进行指令跟随微调

增量预训练

指令跟随微调

XTuner 支持的开源大模型：InternLM、Llama/Llama2、ChatGLM2/ChatGLM3、Qwen、Baichuan/Baichuan2、Zephyr

XTuner微调有全参、LoRA、QLoRA三种方式，QLoRA是LoRA微调的改进。

在这里插入图片描述

2 快速上手

2.1 平台

Ubuntu + Anaconda + CUDA/CUDNN + 8GB nvidia显卡

2.2 安装

# 如果你是在 InternStudio 平台，则从本地 clone 一个已有 pytorch 2.0.1 的环境：
/root/share/install_conda_env_internlm_base.sh xtuner0.1.9
# 如果你是在其他平台：
conda create --name xtuner0.1.9 python=3.10 -y

# 激活环境
conda activate xtuner0.1.9
# 进入家目录 （~的意思是 “当前用户的home路径”）
cd ~
# 创建版本文件夹并进入，以跟随本教程
mkdir xtuner019 && cd xtuner019


# 拉取 0.1.9 的版本源码
git clone -b v0.1.9  https://github.com/InternLM/xtuner
# 无法访问github的用户请从 gitee 拉取:
# git clone -b v0.1.9 https://gitee.com/Internlm/xtuner

# 进入源码目录
cd xtuner

# 从源码安装 XTuner
pip install -e '.[all]'
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

安装完后，就开始搞搞准备工作了。（准备在 oasst1 数据集上微调 internlm-7b-chat）

# 创建一个微调 oasst1 数据集的工作路径，进入
mkdir ~/ft-oasst1 && cd ~/ft-oasst1
1
2

2.3 微调

2.3.1 准备配置文件

XTuner 提供多个开箱即用的配置文件，用户可以通过下列命令查看：

# 列出所有内置配置
xtuner list-cfg
1
2

假如显示bash: xtuner: command not found的话可以考虑在终端输入 export PATH=$PATH:‘/root/.local/bin’

在这里插入图片描述

拷贝一个配置文件到当前目录：
# xtuner copy-cfg ${CONFIG_NAME} ${SAVE_PATH}

在本案例中即：（注意最后有个英文句号，代表复制到当前路径）

cd ~/ft-oasst1
xtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .
1
2

配置文件名的解释：

xtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .

模型名	internlm_chat_7b
使用算法	qlora
数据集	oasst1
把数据集跑几次	跑3次：e3 (epoch 3 )

*无 chat比如 internlm-7b 代表是基座(base)模型

2.3.2 模型下载

由于下载模型很慢，用教学平台的同学可以直接复制模型。

ln -s /share/temp/model_repos/internlm-chat-7b ~/ft-oasst1/
1

以上是通过软链的方式，将模型文件挂载到家目录下，优势是：

节省拷贝时间，无需等待
节省用户开发机存储空间

当然，也可以用 cp -r /share/temp/model_repos/internlm-chat-7b ~/ft-oasst1/ 进行数据拷贝。

以下是自己下载模型的步骤。

不用 xtuner 默认的从 huggingface 拉取模型，而是提前从 ~~OpenXLab~~ ModelScope 下载模型到本地

# 创建一个目录，放模型文件，防止散落一地
mkdir ~/ft-oasst1/internlm-chat-7b

# 装一下拉取模型文件要用的库
pip install modelscope

# 从 modelscope 下载下载模型文件
cd ~/ft-oasst1
apt install git git-lfs -y
git lfs install
git lfs clone https://modelscope.cn/Shanghai_AI_Laboratory/internlm-chat-7b.git -b v1.0.3
1
2
3
4
5
6
7
8
9
10
11

2.3.3 数据集下载

https://huggingface.co/datasets/timdettmers/openassistant-guanaco/tree/main

由于 huggingface 网络问题，咱们已经给大家提前下载好了，复制到正确位置即可：

cd ~/ft-oasst1
# ...-guanaco 后面有个空格和英文句号啊
cp -r /root/share/temp/datasets/openassistant-guanaco .
1
2
3

此时，当前路径的文件应该长这样：

|-- internlm-chat-7b
|   |-- README.md
|   |-- config.json
|   |-- configuration.json
|   |-- configuration_internlm.py
|   |-- generation_config.json
|   |-- modeling_internlm.py
|   |-- pytorch_model-00001-of-00008.bin
|   |-- pytorch_model-00002-of-00008.bin
|   |-- pytorch_model-00003-of-00008.bin
|   |-- pytorch_model-00004-of-00008.bin
|   |-- pytorch_model-00005-of-00008.bin
|   |-- pytorch_model-00006-of-00008.bin
|   |-- pytorch_model-00007-of-00008.bin
|   |-- pytorch_model-00008-of-00008.bin
|   |-- pytorch_model.bin.index.json
|   |-- special_tokens_map.json
|   |-- tokenization_internlm.py
|   |-- tokenizer.model
|   `-- tokenizer_config.json
|-- internlm_chat_7b_qlora_oasst1_e3_copy.py
`-- openassistant-guanaco
    |-- openassistant_best_replies_eval.jsonl
    `-- openassistant_best_replies_train.jsonl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

2.3.4 修改配置文件

修改其中的模型和数据集为本地路径

cd ~/ft-oasst1
vim internlm_chat_7b_qlora_oasst1_e3_copy.py
1
2

在vim界面完成修改后，请输入:wq退出。假如认为改错了可以用:q!退出且不保存。当然我们也可以考虑打开python文件直接修改，但注意修改完后需要按下Ctrl+S进行保存。

减号代表要删除的行，加号代表要增加的行。

# 修改模型为本地路径
- pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
+ pretrained_model_name_or_path = './internlm-chat-7b'

# 修改训练数据集为本地路径
- data_path = 'timdettmers/openassistant-guanaco'
+ data_path = './openassistant-guanaco'
1
2
3
4
5
6
7

常用超参

参数名	解释
data_path	数据路径或 HuggingFace 仓库名
max_length	单条数据最大 Token 数，超过则截断
pack_to_max_length	是否将多条短数据拼接到 max_length，提高 GPU 利用率
accumulative_counts	梯度累积，每多少次 backward 更新一次参数
evaluation_inputs	训练过程中，会根据给定的问题进行推理，便于观测训练状态
evaluation_freq	Evaluation 的评测间隔 iter 数
…	…

如果想把显卡的现存吃满，充分利用显卡资源，可以将 max_length 和 batch_size 这两个参数调大。

2.3.5 开始微调

训练：

xtuner train ${CONFIG_NAME_OR_PATH}

也可以增加 deepspeed 进行训练加速：

xtuner train ${CONFIG_NAME_OR_PATH} --deepspeed deepspeed_zero2

例如，我们可以利用 QLoRA 算法在 oasst1 数据集上微调 InternLM-7B：

# 单卡
## 用刚才改好的config文件训练
xtuner train ./internlm_chat_7b_qlora_oasst1_e3_copy.py

# 多卡
NPROC_PER_NODE=${GPU_NUM} xtuner train ./internlm_chat_7b_qlora_oasst1_e3_copy.py

# 若要开启 deepspeed 加速，增加 --deepspeed deepspeed_zero2 即可
1
2
3
4
5
6
7
8

微调得到的 PTH 模型文件和其他杂七杂八的文件都默认在当前的 ./work_dirs 中。

跑完训练后，当前路径应该长这样：

|-- internlm-chat-7b
|-- internlm_chat_7b_qlora_oasst1_e3_copy.py
|-- openassistant-guanaco
|   |-- openassistant_best_replies_eval.jsonl
|   `-- openassistant_best_replies_train.jsonl
`-- work_dirs
    `-- internlm_chat_7b_qlora_oasst1_e3_copy
        |-- 20231101_152923
        |   |-- 20231101_152923.log
        |   `-- vis_data
        |       |-- 20231101_152923.json
        |       |-- config.py
        |       `-- scalars.json
        |-- epoch_1.pth
        |-- epoch_2.pth
        |-- epoch_3.pth
        |-- internlm_chat_7b_qlora_oasst1_e3_copy.py
        `-- last_checkpoint
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

2.3.6 将得到的 PTH 模型转换为 HuggingFace 模型，即：生成 Adapter 文件夹

xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH_file_dir} ${SAVE_PATH}

在本示例中，为：

mkdir hf
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm_chat_7b_qlora_oasst1_e3_copy.py ./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_1.pth ./hf
1
2
3
4

此时，路径中应该长这样：

|-- internlm-chat-7b
|-- internlm_chat_7b_qlora_oasst1_e3_copy.py
|-- openassistant-guanaco
|   |-- openassistant_best_replies_eval.jsonl
|   `-- openassistant_best_replies_train.jsonl
|-- hf
|   |-- README.md
|   |-- adapter_config.json
|   |-- adapter_model.bin
|   `-- xtuner_config.py
`-- work_dirs
    `-- internlm_chat_7b_qlora_oasst1_e3_copy
        |-- 20231101_152923
        |   |-- 20231101_152923.log
        |   `-- vis_data
        |       |-- 20231101_152923.json
        |       |-- config.py
        |       `-- scalars.json
        |-- epoch_1.pth
        |-- epoch_2.pth
        |-- epoch_3.pth
        |-- internlm_chat_7b_qlora_oasst1_e3_copy.py
        `-- last_checkpoint
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

此时，hf 文件夹即为我们平时所理解的所谓 “LoRA 模型文件”

可以简单理解：LoRA 模型文件 = Adapter

2.4 部署与测试

2.4.1 将 HuggingFace adapter 合并到大语言模型：

xtuner convert merge ./internlm-chat-7b ./hf ./merged --max-shard-size 2GB
# xtuner convert merge \
#     ${NAME_OR_PATH_TO_LLM} \
#     ${NAME_OR_PATH_TO_ADAPTER} \
#     ${SAVE_PATH} \
#     --max-shard-size 2GB
1
2
3
4
5
6

2.4.2 与合并后的模型对话：

# 加载 Adapter 模型对话（Float 16）
xtuner chat ./merged --prompt-template internlm_chat

# 4 bit 量化加载
# xtuner chat ./merged --bits 4 --prompt-template internlm_chat
1
2
3
4
5

2.4.3 Demo

修改 cli_demo.py 中的模型路径

- model_name_or_path = "/root/model/Shanghai_AI_Laboratory/internlm-chat-7b"
+ model_name_or_path = "merged"
1
2

运行 cli_demo.py 以目测微调效果

python ./cli_demo.py
1

效果：

微调前	微调后

xtuner chat 的启动参数

启动参数	干哈滴
–prompt-template	指定对话模板
–system	指定SYSTEM文本
–system-template	指定SYSTEM模板
--bits	LLM位数
–bot-name	bot名称
–with-plugins	指定要使用的插件
–no-streamer	是否启用流式传输
–lagent	是否使用lagent
–command-stop-word	命令停止词
–answer-stop-word	回答停止词
–offload-folder	存放模型权重的文件夹（或者已经卸载模型权重的文件夹）
–max-new-tokens	生成文本中允许的最大 `token` 数量
–temperature	温度值
–top-k	保留用于顶k筛选的最高概率词汇标记数
–top-p	如果设置为小于1的浮点数，仅保留概率相加高于 `top_p` 的最小一组最有可能的标记
–seed	用于可重现文本生成的随机种子

3 自定义微调

以 Medication QA 数据集为例

3.1 概述

3.1.1 场景需求

基于 InternLM-chat-7B 模型，用 MedQA 数据集进行微调，将其往医学问答领域对齐。

3.1.2 真实数据预览

问题	答案
What are ketorolac eye drops?（什么是酮咯酸滴眼液？）	Ophthalmic ketorolac is used to treat itchy eyes caused by allergies. It also is used to treat swelling and redness (inflammation) that can occur after cataract surgery. Ketorolac is in a class of medications called nonsteroidal anti-inflammatory drugs (NSAIDs). It works by stopping the release of substances that cause allergy symptoms and inflammation.
What medicines raise blood sugar? （什么药物会升高血糖？）	Some medicines for conditions other than diabetes can raise your blood sugar level. This is a concern when you have diabetes. Make sure every doctor you see knows about all of the medicines, vitamins, or herbal supplements you take. This means anything you take with or without a prescription. Examples include: Barbiturates. Thiazide diuretics. Corticosteroids. Birth control pills (oral contraceptives) and progesterone. Catecholamines. Decongestants that contain beta-adrenergic agents, such as pseudoephedrine. The B vitamin niacin. The risk of high blood sugar from niacin lowers after you have taken it for a few months. The antipsychotic medicine olanzapine (Zyprexa).

问题

答案

What are ketorolac eye drops?（什么是酮咯酸滴眼液？）

Ophthalmic ketorolac is used to treat itchy eyes caused by allergies. It also is used to treat swelling and redness (inflammation) that can occur after cataract surgery. Ketorolac is in a class of medications called nonsteroidal anti-inflammatory drugs (NSAIDs). It works by stopping the release of substances that cause allergy symptoms and inflammation.

What medicines raise blood sugar? （什么药物会升高血糖？）

Some medicines for conditions other than diabetes can raise your blood sugar level. This is a concern when you have diabetes. Make sure every doctor you see knows about all of the medicines, vitamins, or herbal supplements you take. This means anything you take with or without a prescription. Examples include: Barbiturates. Thiazide diuretics. Corticosteroids. Birth control pills (oral contraceptives) and progesterone. Catecholamines. Decongestants that contain beta-adrenergic agents, such as pseudoephedrine. The B vitamin niacin. The risk of high blood sugar from niacin lowers after you have taken it for a few months. The antipsychotic medicine olanzapine (Zyprexa).

3.2 数据准备

以 Medication QA 数据集为例

原格式：(.xlsx)

在这里插入图片描述

问题	药物类型	问题类型	回答	主题	URL
aaa	bbb	ccc	ddd	eee	fff

3.2.1 将数据转为 XTuner 的数据格式

目标格式：(.jsonL)

[{
    "conversation":[
        {
            "system": "xxx",
            "input": "xxx",
            "output": "xxx"
        }
    ]
},
{
    "conversation":[
        {
            "system": "xxx",
            "input": "xxx",
            "output": "xxx"
        }
    ]
}]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/AllinToyou/article/detail/171720