A Complete Guide to Fine-Tuning Code Llama



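This article walks through fine-tuning Code Llama so that it becomes a capable tool for SQL development. For programming tasks, a properly fine-tuned Code Llama usually performs much better than plain Llama, especially when it is optimized for one specific task. The approach used here:

  • Train on b-mc2/sql-create-context, a collection of natural-language questions paired with their corresponding SQL queries

  • Use LoRA: quantize the base model's weights to int8, freeze them, and train only the adapter

  • Follow the alpaca-lora project for most of the code, with some improvements and optimizations

Together these steps specialize Code Llama for SQL development and give better results on that task. Following this guide step by step should give you a working understanding of fine-tuning.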

2. Fine-Tuning Code Llama


I ran the code in this article on an A100 GPU server with Python 3.10 and CUDA 11.8. The run took about an hour. (To check portability, I also ran the code on Colab, where it worked well.)

    !pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
    !pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
    !pip install datasets==2.10.1
    import locale  # colab workaround
    locale.getpreferredencoding = lambda: "UTF-8"  # colab workaround
    !pip install wandb


    from datetime import datetime
    import os
    import sys

    import torch
    from peft import (
        LoraConfig,
        get_peft_model,
        get_peft_model_state_dict,
        prepare_model_for_int8_training,
        set_peft_model_state_dict,
    )
    from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

(If you hit an import error, try restarting the Jupyter kernel.)


The following pulls the dataset from the Hugging Face Hub and holds out 10% of it as an evaluation set, to check how the model is doing during training:

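    from datasets import load_dataset

    dataset = load_dataset("b-mc2/sql-create-context", split="train")
    # Split once so the train and eval sets come from the same shuffle and stay disjoint.
    # The seed is an arbitrary choice that makes the split reproducible.
    split = dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split["train"]
    eval_dataset = split["test"]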
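A note on the split: a shuffled split keeps the two sets disjoint only if both halves come from the same shuffle. The sketch below uses plain Python lists rather than the `datasets` API (the helper `split_once` and the seeds are illustrative assumptions) to show that taking the training half from one random split and the test half from another lets rows leak between them.

```python
import random


def split_once(items, test_fraction, seed):
    """Shuffle a copy of items and cut it into (train, test) in one pass."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1000))  # stand-in for dataset rows

# One split: train and test are disjoint by construction.
train, test = split_once(rows, 0.1, seed=42)
overlap_once = set(train) & set(test)

# Two independent splits: "train" from one shuffle, "test" from another.
train_a, _ = split_once(rows, 0.1, seed=1)
_, test_b = split_once(rows, 0.1, seed=2)
overlap_twice = set(train_a) & set(test_b)

print(len(overlap_once), len(overlap_twice))
```

With leaked rows in the evaluation set, the evaluation loss looks better than it really is, so it is worth keeping a single split object and reading both halves from it.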


To use your own dataset instead, load it from local JSON Lines files:

    train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
    eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
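The local files loaded above are JSON Lines: one JSON object per line. Here is a minimal sketch of that layout, assuming the same `question`, `context`, and `answer` fields that the prompt template in this guide reads. The sample row and the helper names are made up for illustration, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Hypothetical sample row; the keys mirror the fields used by the prompt template.
rows = [
    {
        "question": "How many singers are there?",
        "context": "CREATE TABLE singer (id INTEGER)",
        "answer": "SELECT COUNT(*) FROM singer",
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train_set.jsonl")
write_jsonl(train_path, rows)
loaded = read_jsonl(train_path)
print(loaded[0]["answer"])
```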




I load Code Llama from Hugging Face in int8 (the standard setup for LoRA):

    base_model = "codellama/CodeLlama-7b-hf"
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

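torch_dtype=torch.float16 means computations are carried out in float16, even though the values themselves are stored as 8-bit integers.

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 and accelerate is >=0.20.3.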
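To make the idea of "8-bit values, float computation" concrete, here is a simplified sketch of absmax quantization in plain Python. It is only an illustration of the concept: the int8 scheme actually used by bitsandbytes is more elaborate, and the function names and sample weights below are assumptions.

```python
def quantize_absmax(values):
    """Map floats to the int8 range using one scale taken from the largest magnitude."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats for use in higher-precision arithmetic."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.0, -0.07]  # made-up weights
q_weights, scale = quantize_absmax(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each stored value needs one byte instead of two or four, at the cost of a rounding error of at most half the scale per weight.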



    eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
    You must output the SQL query that answers the question.
    ### Input:
    Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?
    ### Context:
    CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)
    ### Response:
    """
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    model.eval()
    with torch.no_grad():
        print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


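The base model outputs the following, which is wrong: it selects every column and compares class, rather than frequency_mhz, against 91.5:

    SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'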
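One cheap sanity check for a generated query is to run it against an empty database built from the CREATE TABLE context. The sketch below does this with Python's built-in `sqlite3` module. Using SQLite as the dialect and the helper name `runs_without_error` are assumptions for illustration. Note what the check does and does not prove: the query above passes, because it is syntactically valid and references real columns, even though it does not answer the question.

```python
import sqlite3

context = "CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)"
generated_sql = "SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'"


def runs_without_error(schema_sql, query_sql):
    """Return True if the query executes against an empty in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema_sql)
        conn.execute(query_sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


is_valid = runs_without_error(context, generated_sql)
bad_column = runs_without_error(context, "SELECT missing_column FROM table_name_12")
print(is_valid, bad_column)
```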




    tokenizer.add_eos_token = True
    tokenizer.pad_token_id = 0
    tokenizer.padding_side = "left"

Set up the tokenize function so that labels and input_ids are identical. This is essentially self-supervised fine-tuning:

    def tokenize(prompt):
        result = tokenizer(
            prompt,
            truncation=True,
            max_length=512,
            padding=False,
            return_tensors=None,
        )
        # "self-supervised learning" means the labels are also the inputs:
        result["labels"] = result["input_ids"].copy()
        return result

Then write a function that converts each data_point into a prompt, using a format I found online that works well:
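Copying input_ids into labels works because a causal language model is trained to predict the next token, and Hugging Face causal LM models shift the labels by one position internally when computing the loss. The sketch below shows that pairing in plain Python. The token ids are made up and `next_token_pairs` is a hypothetical helper, not part of any library.

```python
def next_token_pairs(input_ids):
    """Pair each position with the token the model is asked to predict next."""
    labels = list(input_ids)   # the same copy the tokenize function makes
    contexts = input_ids[:-1]  # every position except the last yields a prediction
    targets = labels[1:]       # labels shifted left by one
    return list(zip(contexts, targets))


token_ids = [1, 5023, 338, 263, 2]  # made-up ids
pairs = next_token_pairs(token_ids)
print(pairs)
```

Because the labels cover the whole sequence, the model is also trained to reproduce the instruction and the table context, not only the SQL answer.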

    def generate_and_tokenize_prompt(data_point):
        full_prompt = f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
    You must output the SQL query that answers the question.
    ### Input:
    {data_point["question"]}
    ### Context:
    {data_point["context"]}
    ### Response:
    {data_point["answer"]}
    """
        return tokenize(full_prompt)
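The training prompt and the evaluation prompt in this guide share the same layout; the only difference is whether the answer follows the response marker. The sketch below builds both from a single helper so the two cannot drift apart. The helper `build_prompt` and the sample row are hypothetical, introduced only for illustration.

```python
PROMPT_HEADER = (
    "You are a powerful text-to-SQL model. Your job is to answer questions about a database. "
    "You are given a question and context regarding one or more tables.\n"
    "You must output the SQL query that answers the question.\n"
)


def build_prompt(question, context, answer=None):
    """Build a training prompt (with answer) or an inference prompt (without)."""
    prompt = (
        f"{PROMPT_HEADER}"
        f"### Input:\n{question}\n"
        f"### Context:\n{context}\n"
        f"### Response:\n"
    )
    if answer is not None:
        prompt += f"{answer}\n"
    return prompt


# Made-up sample row with the same three fields as the dataset.
sample = {
    "question": "How many singers are there?",
    "context": "CREATE TABLE singer (id INTEGER)",
    "answer": "SELECT COUNT(*) FROM singer",
}

training_text = build_prompt(sample["question"], sample["context"], sample["answer"])
inference_text = build_prompt(sample["question"], sample["context"])
print(training_text)
```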


    tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
    tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

2.7 Setting Up LoRA

Set up a standard LoRA configuration and attach it to the base model:

    model.train()  # put model back into training mode
    model = prepare_model_for_int8_training(model)
    config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
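To get a feel for how small the trained adapter is, the arithmetic below estimates the number of LoRA parameters for this configuration. It assumes a hidden size of 4096 and 32 transformer layers for the 7B model, with square attention projections; those figures are assumptions stated here, not values read from the loaded model. On the real model, `model.print_trainable_parameters()` reports the actual count.

```python
def lora_param_count(in_features, out_features, rank):
    """A rank-r adapter adds A (rank x in) and B (out x rank) beside a frozen weight."""
    return rank * in_features + out_features * rank


hidden_size = 4096  # assumed for the 7B model
num_layers = 32     # assumed for the 7B model
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
rank = 16
lora_alpha = 16

per_module = lora_param_count(hidden_size, hidden_size, rank)
trainable = per_module * len(target_modules) * num_layers
ratio_per_module = per_module / (hidden_size * hidden_size)
scaling = lora_alpha / rank  # factor applied to the adapter's output

print(per_module, trainable, ratio_per_module, scaling)
```

Under these assumptions the adapter has roughly 16.8 million trainable parameters, well under 1% of each projection matrix it sits beside, and with r equal to lora_alpha the adapter's update is applied with a scale of 1.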

To resume from a checkpoint, set resume_from_checkpoint to the path of the adapter_model.bin you want to resume from:

    resume_from_checkpoint = ""  # set this to the adapter_model.bin file you want to resume from
    if resume_from_checkpoint:
        if os.path.exists(resume_from_checkpoint):
            print(f"Restarting from {resume_from_checkpoint}")
            adapters_weights = torch.load(resume_from_checkpoint)
            set_peft_model_state_dict(model, adapters_weights)
        else:
            print(f"Checkpoint {resume_from_checkpoint} not found")


    wandb_project = "sql-try2-coder"
    if len(wandb_project) > 0:
        os.environ["WANDB_PROJECT"] = wandb_project

    if torch.cuda.device_count() > 1:
        # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
        model.is_parallelizable = True
        model.model_parallel = True


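If you run out of GPU memory, lower per_device_train_batch_size. gradient_accumulation_steps is derived from it, so the effective batch size, and therefore the training dynamics, stays the same. All the other values are standard settings that do not need to be changed: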

    batch_size = 128
    per_device_train_batch_size = 32
    gradient_accumulation_steps = batch_size // per_device_train_batch_size
    output_dir = "sql-code-llama"

    training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps",  # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        load_best_model_at_end=False,
        group_by_length=True,  # group sequences of roughly the same length together to speed up training
        report_to="wandb",  # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",  # if use_wandb else None,
    )

    trainer = Trainer(
        model=model,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,
        args=training_args,
        data_collator=DataCollatorForSeq2Seq(
            tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
        ),
    )
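The relationship between the per-device batch size and the accumulation steps is simple arithmetic, sketched below for a few candidate sizes. The helper `accumulation_steps` is hypothetical; the target of 128 and the 400 steps are the values used in the configuration above.

```python
def accumulation_steps(target_batch_size, per_device_batch_size):
    """Number of micro-batches to accumulate before one optimizer step."""
    if target_batch_size % per_device_batch_size != 0:
        raise ValueError("target batch size must be a multiple of the per-device batch size")
    return target_batch_size // per_device_batch_size


target_batch_size = 128
plan = {size: accumulation_steps(target_batch_size, size) for size in (32, 16, 8)}
for size, steps in plan.items():
    print(f"per_device={size:>2}  accumulation_steps={steps:>2}  effective={size * steps}")

max_steps = 400
examples_seen = target_batch_size * max_steps  # examples processed over the whole run
print(examples_seen)
```

Halving the per-device batch size doubles the accumulation steps, so each optimizer update still averages gradients over 128 examples; only memory use and wall-clock time per update change.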

Next come a few PyTorch-related optimizations. They only make training faster and do not affect accuracy:

    model.config.use_cache = False

    old_state_dict = model.state_dict
    model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
        model, type(model)
    )

    if torch.__version__ >= "2" and sys.platform != "win32":
        print("compiling the model")
        model = torch.compile(model)
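The state_dict line above is dense. It replaces the method on this one model instance with a wrapper that filters the full state dict down to the adapter entries, which appears intended to keep saved checkpoints small. The sketch below reproduces the same binding trick on a toy class. `TinyModel` and `adapter_only` are stand-ins invented for illustration, with `adapter_only` playing the role of `get_peft_model_state_dict`.

```python
class TinyModel:
    """Toy stand-in for a model with base and adapter weights."""

    def __init__(self):
        self.weights = {"base.weight": 1.0, "lora_A.weight": 0.1, "lora_B.weight": 0.2}

    def state_dict(self):
        return dict(self.weights)


def adapter_only(full_state):
    """Stand-in filter: keep only entries whose name marks them as adapter weights."""
    return {k: v for k, v in full_state.items() if "lora_" in k}


tiny = TinyModel()
old_state_dict = tiny.state_dict  # bound method captured before patching

# function.__get__(instance, cls) returns a method bound to that instance,
# so the replacement receives `self` like a normal method would.
tiny.state_dict = (lambda self, *_, **__: adapter_only(old_state_dict())).__get__(tiny, type(tiny))

patched = tiny.state_dict()
untouched = TinyModel().state_dict()
print(patched, untouched)
```

Only the patched instance changes behavior; other instances of the class still return the full state dict.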

Then start training:

    trainer.train()

This takes about 1 hour on an A100.


    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

    base_model = "codellama/CodeLlama-7b-hf"
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

To load a fine-tuned LoRA/QLoRA adapter, use PeftModel.from_pretrained. output_dir should be a directory containing adapter_config.json and adapter_model.bin:

    from peft import PeftModel

    model = PeftModel.from_pretrained(model, output_dir)


    eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
    You must output the SQL query that answers the question.
    ### Input:
    Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?
    ### Context:
    CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)
    ### Response:
    """
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    model.eval()
    with torch.no_grad():
        print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"
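For a decoder-only model, the sequence returned by generate starts with the prompt tokens, so the decoded text contains the prompt followed by the completion. Below is a small sketch for pulling out just the SQL, assuming the response marker from the prompt template; `extract_response` is a hypothetical helper.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text):
    """Return the text after the last response marker, or the whole text if it is absent."""
    _, marker, tail = decoded_text.rpartition(RESPONSE_MARKER)
    return tail.strip() if marker else decoded_text.strip()


# Shortened stand-in for a decoded generation: prompt sections followed by the completion.
decoded = (
    "### Input:\n"
    "Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?\n"
    "### Context:\n"
    "CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)\n"
    "### Response:\n"
    'SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"\n'
)

sql = extract_response(decoded)
print(sql)
```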

The output shows that the fine-tuning worked! You can also convert this adapter to a Llama.cpp model to run it locally.

