This tutorial is for anyone who wants to customize powerful language models such as Llama 2 and Mistral for their own projects. Using QLoRA, we will walk through fine-tuning these large language models (LLMs), even if you don't have a supercomputer.
A good model needs good data. We will cover training on existing data as well as how to create your own dataset. You will learn how to format the data for training, specifically the ChatML format. The code is kept simple, avoiding additional black boxes or training tools, using only basic PyTorch and Hugging Face packages.
Before you start, you need the latest tooling from Hugging Face. Run the following command in a terminal to install or update these packages:
pip install -U accelerate bitsandbytes datasets peft transformers tokenizers
For reference, these are the specific versions used while writing this tutorial:
accelerate 0.24.1
bitsandbytes 0.41.1
datasets 2.14.6
peft 0.6.0
transformers 4.35.0
tokenizers 0.14.1
torch 2.1.0
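If you want to compare your environment against the versions listed above, a small snippet (my own addition, not part of the original setup) can print what is installed:

import importlib.metadata

# Print the installed version of each package used in this tutorial
for pkg in ["accelerate", "bitsandbytes", "datasets", "peft", "transformers", "tokenizers", "torch"]:
    print(pkg, importlib.metadata.version(pkg))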
This section covers the key steps of loading or crafting a dataset and then formatting it according to the ChatML structure. After that, we will dive into tokenization and batching.
Keep in mind that the quality of your dataset is essential: it will greatly affect the model's performance. Your dataset must fit your task.
Datasets can be mixed from different sources. Take Mistral's Open Hermes 2 fine-tune as an example: it was trained on roughly 900,000 samples drawn from a number of datasets.
These datasets usually consist of question-answer pairs, either as isolated pairs (a single sample equals a single question and answer) or chained into a conversational sequence (formatted as Q/A, Q/A, Q/A).
This section aims to guide you in converting such datasets into a unified format compatible with the training scheme. To prepare for training, we have to pick one format. I chose OpenAI's ChatML here because it has been adopted frequently in recent model releases and may well become the new standard.
Here is an example of a ChatML-formatted conversation (from the Open Orca dataset):
<|im_start|>system
You are an AI assistant. User will you give you a task. Your goal is to
complete the task as faithfully as you can. While performing the task
think step-by-step and justify your steps.<|im_end|>
<|im_start|>user
Premise: A man is inline skating in front of a wooden bench. Hypothesis:
A man is having fun skating in front of a bench. .Choose the correct
answer: Given the premise, can we conclude the hypothesis?
Select from: a). yes b). it is not possible to tell c). no<|im_end|>
<|im_start|>assistant
b). it is not possible to tell Justification: Although the man is inline
skating in front of the wooden bench, we cannot conclude whether he is
having fun or not, as his emotions are not explicitly mentioned.<|im_end|>
The example above can be tokenized, batched, and fed into the training algorithm. Before moving on, though, let's look at a few well-known datasets and how to prepare and format them.
Let's start with the Open Assistant dataset.
from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst_top1_2023-08-25")
Once loaded, the dataset comes pre-split into a training part (~13k entries) and a test part (~700 entries):
>>> dataset
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 12947
})
test: Dataset({
features: ['text'],
num_rows: 690
})
})
Let’s look at the first entry:
>>> print(dataset["train"][0]["text"])
<|im_start|>user
Consigliami 5 nomi per il mio cucciolo di dobberman<|im_end|>
<|im_start|>assistant
Ecco 5 nomi per il tuo cucciolo di dobermann:
- Zeus
- Apollo
- Thor
- Athena
- Odin<|im_end|>
This is already ChatML, so there is nothing left to do here, apart from telling the tokenizer and the model that the strings <|im_start|> and <|im_end|> are tokens that should not be split, and that <|im_end|> is a special token (eos, "end-of-sequence") marking the end of the model's answer; otherwise the model would generate forever and never stop. How to combine these tokens with base models such as llama2 and mistral is covered in detail in Section 3.
Next up is Open Orca. This dataset contains 4.2 million entries and needs a train/test split after loading, which can be done with train_test_split:
from datasets import load_dataset
dataset = load_dataset("Open-Orca/OpenOrca")
dataset = dataset["train"].train_test_split(test_size=0.1)
Let's look at the structure of the dataset. Here is the first entry:
{
'id': 'flan.2020759',
'system_prompt': 'You are an AI assistant. You will be given a task.
You must generate a detailed and long answer.',
'question': 'Ülke, bildirgeyi uygulamaya başlayan son ülkeler
arasında olmasına rağmen 46 ülke arasında 24. sırayı
aldı.
Could you please translate this to English?',
'response': 'Despite being one of the last countries to
implement the declaration, it ranked 24th out of 46 countries.'
}
This is a question + answer pair plus a system message that describes the context in which the question should be answered.
Unlike the Open Assistant dataset, we have to format the Open Orca data into ChatML ourselves.
import os

def format_conversation(row):
    template = "<|im_start|>system\n{sys}<|im_end|>\n<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>"

    conversation = template.format(
        sys=row["system_prompt"],
        q=row["question"],
        a=row["response"],
    )

    return {"text": conversation}

dataset = dataset.map(
    format_conversation,
    remove_columns=dataset["train"].column_names,  # remove all columns; only "text" will be left
    num_proc=os.cpu_count()  # multithreaded
)
The dataset is now ready to be tokenized and fed into training.
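As a quick sanity check, you can print one formatted entry; it should look like the ChatML example shown earlier:

print(dataset["train"][0]["text"])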
Next, transcripts of the Lex Fridman podcast, which I used to train llama1. The task here is to turn a podcast known for its in-depth discussions into a training set that lets an AI mimic Lex's way of talking. For details on how the dataset was created, see "From Transcripts to AI Chat: An Experiment with the Lex Fridman Podcast".
from datasets import load_dataset
dataset = load_dataset("g-ronimo/lfpodcast")
dataset = dataset["train"].train_test_split(test_size=0.1)
Inspecting the first entry of the training set, you will see a JSON object like this:
>>> print(json.dumps(dataset["train"][0], indent=2))
{
  "title": "Lex_Fridman_Podcast_-_114__Russ_Tedrake_Underactuated_Robotics_Control_Dynamics_and_Touch",
  "episode": 114,
  "speaker_ratio_lex-vs-guest": 0.44402311303719755,
  "conversation": [
    {
      "from": "Guest",
      "text": "I think the most beautiful motion of a robot has to be the passive dynamic walkers. I think there's just something fundamentally beautiful. (..) but what Steve and Andy did was they took it to this beautiful conclusion. where they built something that had knees, arms, a torso, the arms swung naturally, give it a little push, and that looked like a stroll through the park."
    },
    {
      "from": "Lex",
      "text": "How do you design something like that? Is that art or science?"
    },
    (...)
This structure captures the essence of each podcast episode, but to prepare for model training the conversation needs to be converted into ChatML format. We step through each message turn, apply the ChatML template, and then concatenate the turns so that the whole episode's transcript ends up in a single text field. The roles of Guest and Lex are reassigned to user and assistant respectively, to prime the language model to adopt Lex's curious and knowledgeable persona.
import os

def format_conversation(row):
    # Template for conversation turns in ChatML format
    template = "<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>"

    turns = row["conversation"]

    # If Lex is the first speaker, skip his turn to start with Guest's question
    if turns[0]["from"] == "Lex":
        turns = turns[1:]

    conversation = []
    for i in range(0, len(turns) - 1, 2):  # -1 guards against a trailing unanswered Guest turn
        # Assuming the conversation always alternates between Guest and Lex
        question = turns[i]    # Guest
        answer = turns[i + 1]  # Lex

        conversation.append(
            template.format(
                q=question["text"],
                a=answer["text"],
            ))

    return {"text": "\n".join(conversation)}

dataset = dataset.map(
    format_conversation,
    remove_columns=dataset["train"].column_names,
    num_proc=os.cpu_count()
)
With these changes applied, the resulting dataset is ready to be tokenized and fed into the training pipeline, teaching the language model to converse the way Lex Fridman does on his podcast. If you are curious, give llama-fridman a try.
To dig a bit deeper into the nuances of dataset creation, let's consider a case where we want to train an AI to mirror the voice and personality of a well-known figure. I chose to turn the autobiography of Anthony Bourdain, the famous American chef, into a dataset. He wrote Kitchen Confidential, a vivid account of all the madness that goes on in kitchens and in chefs' heads.
The process involves turning the narrative of Bourdain's book into an engaging dialogue, like a back-and-forth interview that captures his spirit.
The main steps:
mv anthony-bourdain-kitchen-confidential.pdf kc.pdf
pdftotext -nopgbrk kc.pdf
# fix line breaks within sentence
sed -r ':a /[a-zA-Z,\ ]$/N;s/(.)\n/\1 /;ta' kc.txt > kc_reformat.txt
Now use each paragraph n together with paragraph n-1 to prompt any capable open-source LLM or GPT-3.5/4. I used Open Hermes 2 to create an interview question for each paragraph.
# Gather paragraphs to target
with open("kc_reformat.txt") as f:
    file_content = f.read()

chapters = file_content.split("\n\n")

# Define minimum and maximum lengths to ensure a good interview flow
passage_minlen = 300   # if paragraph < 300 chars -> merge with next
passage_maxlen = 2000  # if paragraph > 2k chars -> split

# Process the chapters into suitable interview passages
passages = []
for chap in chapters:
    passage = ""
    for par in chap.split("\n"):
        if (len(passage) < passage_minlen) or not passage[-1] == "." and len(passage) < passage_maxlen:
            passage += "\n" + par
        else:
            passages.append(passage.strip().replace("\n", " "))
            passage = par

# Ask Open Hermes
prompt_template = """<|im_start|>system
You are an expert interviewer who interviews an autobiography of a famous chef. You formulate questions based on quotes from the autobiography. Below is one such quote. Formulate a question that the quote would be the perfect answer to. The question should be short and directed at the author of the autobiography like in an interview. The question is short. Remember, make the question as short as possible. Do not give away the answer in your question.

Also: If possible, ask for motivations, feelings, and perceptions rather than events or facts.

Here is some context that might help you formulate the question regarding the quote:
{ctx}
<|im_end|>
<|im_start|>user
Quote:
{par}<|im_end|>
<|im_start|>assistant
Question:"""

prompts = []
for i, p in enumerate(passages):
    prompt = prompt_template.format(par=passages[i], ctx=passages[i-1])
    prompts.append(prompt)

# Prompt smart LLM, parse results, store Q/A in .json
...
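The last step ("prompt smart LLM, parse results, store Q/A in .json") is only hinted at above. Purely as an illustration, here is one possible sketch using the transformers text-generation pipeline; the model id teknium/OpenHermes-2-Mistral-7B, greedy decoding, and the lack of any answer cleanup are my assumptions, not the original approach:

import json
from transformers import pipeline

# Assumption: Open Hermes 2 pulled from the Hugging Face Hub; any capable chat model works here
generator = pipeline(
    "text-generation",
    model="teknium/OpenHermes-2-Mistral-7B",
    device_map="auto",
    torch_dtype="auto",
)

qa_pairs = []
for prompt, passage in zip(prompts, passages):
    # Generate only the question; the passage itself becomes the answer
    out = generator(prompt, max_new_tokens=100, do_sample=False, return_full_text=False)
    question = out[0]["generated_text"].strip()
    qa_pairs.append({"question": question, "answer": passage})

# Store under the "interview" field, which is what load_dataset expects below
with open("kc_reformat_interview.json", "w") as f:
    json.dump({"interview": qa_pairs}, f, indent=2)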
The resulting .json file looks like this:
{ "question": "Why you choose to share your experiences and insights from your career in the restaurant industry despite the angry or wanting to horrify the dining public?", "answer": "I'm not spilling my guts about everything I've seen, learned and done in my long and checkered career as dishwasher, prep drone, fry cook, grillardin, saucier, sous-chef and chef because I'm angry at the business, or because I want to horrify the dining public. I'd still like to be a chef, too, when this thing comes out, as this life is the only life I really know. If I need a favor at four o'clock in the morning, whether it's a quick loan, a shoulder to cry on, a sleeping pill, bail money, or just someone to pick me up in a car in a bad neighborhood in the driving rain, I'm definitely not calling up a fellow writer. I'm calling my sous-chef, or a former sous-chef, or my saucier, someone I work with or have worked with over the last twenty-plus years." }, { "question": "Why do you feel more comfortable sharing the \"dark recesses\" of the restaurant underbelly instead of writing about your personal experiences outside of the culinary world?", "answer": "No, I want to tell you about the dark recesses of the restaurant underbelly-a subculture whose centuries-old militaristic hierarchy and ethos of 'rum, buggery and the lash' make for a mix of unwavering order and nerve-shattering chaos-because I find it all quite comfortable, like a nice warm bath. I can move around easily in this life. I speak the language. In the small, incestuous community of chefs and cooks in New York City, I know the people, and in my kitchen, I know how to behave (as opposed to in real life, where I'm on shakier ground). I want the professionals who read this to enjoy it for what it is: a straight look at a life many of us have lived and breathed for most of our days and nights to the exclusion of 'normal' social interaction. Never having had a Friday or Saturday night off, always working holidays, being busiest when the rest of the world is just getting out of work, makes for a sometimes peculiar world-view, which I hope my fellow chefs and cooks will recognize. The restaurant lifers who read this may or may not like what I'm doing. But they'll know I'm not lying." }
This is then converted into ChatML format:
interview_fn = "kc_reformat_interview.json"
dataset = load_dataset('json', data_files=interview_fn, field='interview')
dataset = dataset["train"].train_test_split(test_size=0.1)

# chatML template, from https://huggingface.co/docs/transformers/main/chat_templating
# NOTE: the tokenizer must already be loaded (see the model and tokenizer section below)
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

def format_interview(conv):
    messages = [
        {"role": "user", "content": conv["question"]},
        {"role": "assistant", "content": conv["answer"]}
    ]
    chat = tokenizer.apply_chat_template(messages, tokenize=False).strip()
    return {"text": chat}

dataset = dataset.map(
    format_interview,
    remove_columns=dataset["train"].column_names
)
By converting Bourdain's autobiography we aim to generate an AI that echoes his narrative style, his views on the culinary industry, and his life philosophy. The approach shown here is very basic and would benefit from further refinement, for example removing low-content answers and stripping non-essential text elements such as footnotes and page numbers. This would improve the quality of the model.
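As one example of such cleanup, very short answers could be dropped before the ChatML conversion above; a minimal sketch (the 200-character threshold is an arbitrary assumption):

# Drop Q/A pairs whose answer is very short; threshold chosen arbitrarily
dataset = dataset.filter(lambda row: len(row["answer"]) > 200)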
If you are curious, chat with Mistral Bourdain. While the current output is only a rudimentary imitation of Bourdain's voice, it serves as a proof of concept; better dataset curation would undoubtedly produce a more convincing simulation.
I think you get the idea by now; GPT-4 also came up with a number of other creative dataset ideas along these lines.
Before we start working with the data we just prepared, we need to load the model and the tokenizer and make sure they handle the ChatML tokens <|im_start|> and <|im_end|> correctly, with <|im_end|> as the (new) eos token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

modelpath = "models/Mistral-7B-v0.1"

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
)

# Load (slow) Tokenizer, fast tokenizer sometimes ignores added tokens
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)

# Add tokens <|im_start|> and <|im_end|>, latter is special eos token
tokenizer.pad_token = "</s>"
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))

model.resize_token_embeddings(len(tokenizer))
model.config.eos_token_id = tokenizer.eos_token_id
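A quick check (my own addition) that the ChatML markers are now single tokens and that <|im_end|> really became the eos token:

print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # a single token id, the string is not split
print(tokenizer.eos_token, tokenizer.eos_token_id)       # should print <|im_end|> and its id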
Since we are not training all parameters but only a subset, we have to add LoRA adapters to the model with Hugging Face peft. Make sure to use peft >= 0.6, otherwise (1) get_peft_model will be very slow and (2) training with Mistral will fail.
# Add LoRA adapters to model
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=64,
lora_alpha=16,
target_modules = ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
lora_dropout=0.1,
bias="none",
modules_to_save = ["lm_head", "embed_tokens"], # needed because we added new tokens to tokenizer/model
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.config.use_cache = False
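To confirm that only the adapters and the explicitly saved modules are trainable, peft can print a summary of trainable versus total parameters (exact numbers depend on the LoRA config above):

model.print_trainable_parameters()  # prints trainable params, all params, and the trainable percentage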
We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total, and that LoRA on all linear transformer block layers is required to match full fine-tuning performance.
Proper tokenization and batching are essential to make sure the data is processed correctly.
Tokenize the text field of the dataset without adding special tokens or padding, since we will do that manually:
def tokenize(element):
return tokenizer(
element["text"],
truncation=True,
max_length=2048,
add_special_tokens=False,
)
dataset_tokenized = dataset.map(
tokenize,
batched=True,
num_proc=os.cpu_count(), # multithreaded
remove_columns=["text"] # don't need the strings anymore, we have tokens from here on
)
max_length: specifies the maximum length of a sample (in number of tokens). Anything longer than 2048 tokens will be truncated and therefore not trained on. If your dataset contains only short question/answer pairs in single samples (like Open Orca), this will be more than enough; if your samples are longer (like the podcast transcripts), you should ideally either increase max_length (which consumes more VRAM) or split the samples into several smaller ones. The maximum for llama2 is 4096. Mistral was "trained with 8k context length and fixed cache size, with a theoretical attention span of 128k tokens", but I never went above 4096.
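If you take the splitting route instead of raising max_length, one simple approach (my own sketch, not from the original) is to chunk each tokenized sample at the token level; it assumes you tokenized without truncation and it ignores turn boundaries, which you may want to respect in practice:

def chunk(element, chunk_size=2048):
    # Split every tokenized sample into pieces of at most chunk_size tokens
    chunks = []
    for ids in element["input_ids"]:
        chunks += [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    return {"input_ids": chunks}

dataset_chunked = dataset_tokenized.map(
    chunk,
    batched=True,
    remove_columns=dataset_tokenized["train"].column_names,  # keep only the new input_ids
)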
The Hugging Face Trainer expects a collator function to transform a list of samples into a dictionary holding a padded batch with input_ids, labels, and attention_mask:
# collate function - to transform a list of dictionaries [ {input_ids: [123, ..]}, {..} ]
# into a single batch dictionary { input_ids: [..], labels: [..], attention_mask: [..] }
def collate(elements):
    tokenlist = [e["input_ids"] for e in elements]
    tokens_maxlen = max([len(t) for t in tokenlist])  # length of longest input

    input_ids, labels, attention_masks = [], [], []
    for tokens in tokenlist:
        # how many pad tokens to add for this sample
        pad_len = tokens_maxlen - len(tokens)

        # pad input_ids with pad_token, labels with ignore_index (-100) and set attention_mask 1 where content, otherwise 0
        input_ids.append(tokens + [tokenizer.pad_token_id] * pad_len)
        labels.append(tokens + [-100] * pad_len)
        attention_masks.append([1] * len(tokens) + [0] * pad_len)

    batch = {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_masks)
    }
    return batch
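To see what the collator produces, you can call it on a few tokenized samples; every returned tensor has the shape (number of samples, length of the longest sample in the batch):

batch = collate([dataset_tokenized["train"][i] for i in range(4)])
print({k: v.shape for k, v in batch.items()})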
The choice of hyperparameters can significantly affect model performance. Here are the hyperparameters we chose for training:
bs = 8         # batch size
ga_steps = 1   # gradient acc. steps
epochs = 5
steps_per_epoch = len(dataset_tokenized["train"]) // (bs * ga_steps)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,   # eval and save once per epoch
    save_steps=steps_per_epoch,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    learning_rate=0.0002,
    group_by_length=True,
    fp16=True,
    ddp_find_unused_parameters=False,  # needed for training with accelerate
)
Let's start training:
trainer = Trainer(
model=model,
tokenizer=tokenizer,
data_collator=collate,
train_dataset=dataset_tokenized["train"],
eval_dataset=dataset_tokenized["test"],
args=args,
)
trainer.train()
The hyperparameters most worth tuning here are the learning rate, the number of epochs, the LoRA rank r, and lora_alpha.