
Hands-On: Fine-Tuning the InternLM2 (书生·浦语2) Model with XTuner

I. Installing XTuner

1. Get the code

  mkdir project
  cd project
  git clone https://github.com/InternLM/xtuner.git

2. Set up the environment

  cd xtuner
  pip install -r requirements.txt
  # install from source
  pip install -e '.[all]'

3. List the available config files

XTuner ships with a number of ready-to-use config files, which you can list with the following commands:

  # list all built-in config files
  xtuner list-cfg
  # list the configs related to the internlm2 models
  xtuner list-cfg | grep internlm2

II. Fine-Tuning Steps

1. Download the base model

  cd /root/share/model_repos/
  git clone https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm2-chat-7b.git

2. Prepare the fine-tuning data

2.1 Get the raw data

We use the Medication_QA dataset as an example. The raw data is an Excel spreadsheet; from it we take the "Question" and "Answer" columns to build the dataset.

2.2 Convert the data to the XTuner format

XTuner's data format (saved here with a .jsonl extension, although the file content is a single JSON array):

  [{
      "conversation": [
          {
              "system": "xxx",
              "input": "xxx",
              "output": "xxx"
          }
      ]
  },
  {
      "conversation": [
          {
              "system": "xxx",
              "input": "xxx",
              "output": "xxx"
          }
      ]
  }]

Each raw record maps to one "conversation" entry, with the fields filled in as follows:

input corresponds to Question

output corresponds to Answer

system is always set to "You are a professional, highly experienced doctor professor. You always provide accurate, comprehensive, and detailed answers based on the patients' questions."

The formatted data:
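For illustration only, a single formatted record might look like the following (the question and answer below are invented placeholders, not actual rows from Medication_QA):

  [{
      "conversation": [
          {
              "system": "You are a professional, highly experienced doctor professor. You always provide accurate, comprehensive, and detailed answers based on the patients' questions.",
              "input": "How often can I take ibuprofen?",
              "output": "Over-the-counter ibuprofen is usually taken every 6 to 8 hours; follow the label and do not exceed the stated daily maximum."
          }
      ]
  }]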

The conversion script (xlsx2jsonl.py) is as follows:

  import json

  import openpyxl


  def process_excel_to_json(input_file, output_file):
      # Load the workbook
      wb = openpyxl.load_workbook(input_file)
      # Select the "DrugQA" sheet
      sheet = wb["DrugQA"]
      # Initialize the output data structure
      output_data = []
      # The system prompt is the same for every record
      system_value = ("You are a professional, highly experienced doctor professor. "
                      "You always provide accurate, comprehensive, and detailed answers "
                      "based on the patients' questions.")
      # Iterate through the data rows; column A holds the question, column D the answer
      for row in sheet.iter_rows(min_row=2, max_col=4, values_only=True):
          # Create the conversation dictionary
          conversation = {
              "system": system_value,
              "input": row[0],
              "output": row[3]
          }
          # Append the conversation to the output data
          output_data.append({"conversation": [conversation]})
      # Write the output data to a JSON file
      with open(output_file, 'w', encoding='utf-8') as json_file:
          json.dump(output_data, json_file, indent=4)
      print(f"Conversion complete. Output written to {output_file}")


  # Replace the input and output file names with your own
  process_excel_to_json('MedInfo2019-QA-Medications.xlsx', 'output.jsonl')
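As an optional sanity check, you can load the converted file and inspect the first record (a minimal sketch; it only assumes output.jsonl was produced by the script above):

  import json

  # Load the converted file and spot-check the record count and first question
  with open('output.jsonl', encoding='utf-8') as f:
      data = json.load(f)
  print(len(data), "records")
  print(data[0]["conversation"][0]["input"])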
2.3 Split into training and test sets (7:3)
  import json
  import random


  def split_conversations(input_file, train_output_file, test_output_file):
      # Read the converted JSON file
      with open(input_file, 'r', encoding='utf-8') as jsonl_file:
          data = json.load(jsonl_file)
      # Count the number of conversation elements
      num_conversations = len(data)
      # Shuffle the data randomly
      random.shuffle(data)
      # Calculate the split point between train and test (70% / 30%)
      split_point = int(num_conversations * 0.7)
      # Split the data into train and test
      train_data = data[:split_point]
      test_data = data[split_point:]
      # Write the train data to a new file
      with open(train_output_file, 'w', encoding='utf-8') as train_file:
          json.dump(train_data, train_file, indent=4)
      # Write the test data to a new file
      with open(test_output_file, 'w', encoding='utf-8') as test_file:
          json.dump(test_data, test_file, indent=4)
      print(f"Split complete. Train data written to {train_output_file}, "
            f"test data written to {test_output_file}")


  # Replace the file names with your own
  split_conversations('output.jsonl', 'MedQA2019-structured-train.jsonl',
                      'MedQA2019-structured-test.jsonl')
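You can likewise confirm the 7:3 ratio afterwards (a minimal sketch over the two files written above):

  import json

  # Compare the sizes of the train and test splits
  with open('MedQA2019-structured-train.jsonl', encoding='utf-8') as f:
      train = json.load(f)
  with open('MedQA2019-structured-test.jsonl', encoding='utf-8') as f:
      test = json.load(f)
  print(f"train: {len(train)}, test: {len(test)}")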

3. Prepare the config file

Copy the config file that matches your chosen base model (copy-cfg appends a _copy suffix to the new file):

  xtuner copy-cfg internlm2_chat_7b_qlora_oasst1_e3 .

Then edit internlm2_chat_7b_qlora_oasst1_e3_copy.py as follows:
  # Change the imports
  - from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
  + from xtuner.dataset.map_fns import template_map_fn_factory
  # Point the model at the local path
  - pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
  + pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
  # Point the training data at MedQA2019-structured-train.jsonl
  - data_path = 'timdettmers/openassistant-guanaco'
  + data_path = 'MedQA2019-structured-train.jsonl'
  # Update the train_dataset object
  train_dataset = dict(
      type=process_hf_dataset,
  -   dataset=dict(type=load_dataset, path=data_path),
  +   dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
      tokenizer=tokenizer,
      max_length=max_length,
  -   dataset_map_fn=oasst1_map_fn,
  +   dataset_map_fn=None,
      template_map_fn=dict(
          type=template_map_fn_factory, template=prompt_template),
      remove_unused_columns=True,
      shuffle_before_pack=True,
      pack_to_max_length=pack_to_max_length)
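Before launching training, it is worth verifying that the load_dataset call in the modified config can actually read the file. A minimal sketch, assuming only the HuggingFace datasets package already pulled in by XTuner's requirements:

  # Mirrors dataset=dict(type=load_dataset, path='json', data_files=...) above
  from datasets import load_dataset

  ds = load_dataset('json', data_files=dict(train='MedQA2019-structured-train.jsonl'))
  print(ds['train'].num_rows)
  print(ds['train'][0]['conversation'][0]['input'])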

4. Launch fine-tuning

  xtuner train internlm2_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2

5. Convert the resulting PTH checkpoint to a HuggingFace model, i.e., generate the Adapter folder

  mkdir hf
  export MKL_SERVICE_FORCE_INTEL=1
  export MKL_THREADING_LAYER=GNU
  xtuner convert pth_to_hf internlm2_chat_7b_qlora_oasst1_e3_copy.py ./work_dirs/internlm2_chat_7b_qlora_oasst1_e3_copy/iter_96.pth ./hf

The hf folder is what we usually mean by the "LoRA model files", i.e., the adapter.
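If you want to try the adapter before merging, it can be attached to the base model with peft. A minimal sketch, assuming transformers and peft are installed and ./hf is the folder produced above:

  # Load the base model, then wrap it with the exported QLoRA adapter
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  base = AutoModelForCausalLM.from_pretrained(
      '/root/share/model_repos/internlm2-chat-7b', trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained(
      '/root/share/model_repos/internlm2-chat-7b', trust_remote_code=True)
  model = PeftModel.from_pretrained(base, './hf')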

6. Merge the HuggingFace adapter into the base LLM:

  xtuner convert merge /root/share/model_repos/internlm2-chat-7b ./hf ./merged --max-shard-size 2GB
  # xtuner convert merge \
  #     ${NAME_OR_PATH_TO_LLM} \
  #     ${NAME_OR_PATH_TO_ADAPTER} \
  #     ${SAVE_PATH} \
  #     --max-shard-size 2GB

The ./merged folder now contains the fine-tuned model, which is used in exactly the same way as the original internlm2-chat-7b.
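A quick smoke test of the merged model (a minimal sketch, assuming transformers is installed and a GPU is available; trust_remote_code is required because InternLM2 ships custom modeling code, which also provides the chat() helper used here):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained('./merged', trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
      './merged', torch_dtype=torch.float16, trust_remote_code=True).cuda().eval()

  # internlm2-chat models expose a chat() method via their remote code
  response, history = model.chat(tokenizer, "How often can I take ibuprofen?", history=[])
  print(response)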
