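Compared with its predecessor, Llama 2 was trained on 40% more tokens (2T in total), doubles the context length to 4096 tokens, and applies grouped-query attention (GQA) to speed up inference on the larger 70B model. It builds on the standard transformer architecture with RMSNorm normalization, SwiGLU activation and rotary positional embeddings, and was trained with the AdamW optimizer using a cosine learning-rate schedule, a weight decay of 0.1 and gradient clipping.
Most importantly, the Llama 2 paper [1] reports that Llama 2-Chat is competitive with OpenAI's ChatGPT in human evaluations of helpfulness, so it can serve as a locally run alternative.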
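To make one of the architectural components mentioned above more concrete, here is a minimal sketch of RMSNorm in plain Python. It is only an illustration of the idea (divide a vector by its root mean square, then apply a per-element gain), not the Llama 2 implementation; the input vector, the gain values and the `eps` default are assumptions chosen for the example.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]

hidden = [1.0, -2.0, 3.0, -4.0]   # hypothetical activation vector
gain = [1.0] * len(hidden)        # learnable in a real model; fixed to 1 here
normalized = rms_norm(hidden, gain)
print(normalized)
```

Unlike LayerNorm, this does not subtract the mean, which is part of why it is cheaper to compute.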
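from datasets import load_dataset

# Dataset used in this article (see reference [3]); the "train" split is an assumption
dataset_name = "iamtarun/python_code_instructions_18k_alpaca"
dataset_split = "train"

# Load dataset from the hub
dataset = load_dataset(dataset_name, split=dataset_split)
# Show dataset size
print(f"dataset size: {len(dataset)}")
# Show an example
print(dataset[0])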
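def format_instruction(sample):
    # The field names "instruction", "input" and "output" are assumed from the dataset's Alpaca-style columns
    return f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:

### Task:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""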
### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task:
### Task:
Develop a Python program that prints "Hello, World!" whenever it is run.
### Input:
### Response:
#Python program to print "Hello World!"
print("Hello, World!")
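For ease of demonstration we use a Google Colab environment. A T4 instance is enough for a first test run, but training on the entire dataset requires an A100.
You can also log in to the Hugging Face Hub so that you can upload and share the model; this step is optional.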
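from huggingface_hub import login
from dotenv import load_dotenv
import os

# Load the environment variables
load_dotenv()
# Login to the Hugging Face Hub
# (the variable name HF_HUB_TOKEN is an assumption; use whatever name your .env file defines)
login(token=os.getenv("HF_HUB_TOKEN"))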
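The authors of LoRA [4] propose that the weight-update matrix ∆W can be decomposed into two low-rank matrices A and B. Instead of training the parameters of ∆W directly, LoRA trains the parameters of A and B, so the number of trainable parameters is much smaller. Suppose A has dimensions 100 × 1 and B has dimensions 1 × 100: ∆W then has 100 × 100 = 10,000 parameters, whereas A and B together contain only 100 + 100 = 200 trainable parameters.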
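The arithmetic in this example can be checked with a short sketch. It uses the dimensions from the text (a 100 × 100 update with rank 1); the numeric values placed in A and B are arbitrary placeholders, since a real adapter learns them during training.

```python
# Dimensions taken from the example in the text
d = 100   # rows and columns of the weight update ∆W
r = 1     # rank of the decomposition

# Arbitrary illustrative values; a real adapter learns these during training
A = [[0.01 * (i + 1)] for i in range(d)]    # shape d x r
B = [[0.1 * (j + 1) for j in range(d)]]     # shape r x d

# ∆W = A · B is a full d x d matrix even though it is built from few numbers
delta_W = [[sum(A[i][k] * B[k][j] for k in range(r)) for j in range(d)]
           for i in range(d)]

full_params = len(delta_W) * len(delta_W[0])
lora_params = sum(len(row) for row in A) + sum(len(row) for row in B)
print(full_params, lora_params)  # prints: 10000 200
```

With a higher rank r the adapter grows linearly (2 · d · r parameters here), while the full update stays at d · d.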
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# model_id and device_map are assumed to be defined earlier in the notebook.
# The quantization parameters (use_4bit, bnb_4bit_*, use_double_nested_quant)
# are listed below and must be set before this code runs.

# Get the type
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_use_double_quant=use_double_nested_quant,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map=device_map,
)
model.config.pretraining_tp = 1

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_double_nested_quant = False
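To give a rough feel for why 4-bit loading matters, the following sketch estimates the memory needed just to store the weights at 16 bits versus 4 bits per parameter. The helper `weight_memory_gib` is hypothetical, the parameter counts are the nominal 7B, 13B and 70B sizes in which Llama 2 is released, and the estimate deliberately ignores quantization constants, activations, gradients and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Nominal Llama 2 model sizes
for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gib(n, 16)
    nf4 = weight_memory_gib(n, 4)
    print(f"{name}: fp16 ~ {fp16:.1f} GiB, 4-bit ~ {nf4:.1f} GiB")
```

By this estimate the weights shrink by a factor of four, which is what makes a first test run on a smaller GPU plausible.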
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
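The training code further down passes a `peft_config` object to the trainer. A minimal sketch of how these three values could be turned into such an object with the `peft` library is shown here; the `bias` and `task_type` settings are assumptions that are not stated in the parameter list above.

```python
from peft import LoraConfig

# Values from the parameter list above
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1

peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",            # assumption: bias terms are not trained
    task_type="CAUSAL_LM",  # assumption: causal language modelling, as for Llama 2
)
```

With these values the adapter output is scaled by lora_alpha / lora_r = 16 / 64 = 0.25 before it is added to the base layer's output.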
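The next steps should be familiar to every Hugging Face user: set the training arguments and create a Trainer. For instruction fine-tuning we use SFTTrainer, which wraps the PEFT model definition and other steps.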
from transformers import TrainingArguments
from trl import SFTTrainer

# output_dir, max_seq_length, packing and peft_config are assumed to be defined elsewhere in the notebook

# Define the training arguments
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,  # 6 if use_flash_attention else 4,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    logging_steps=logging_steps,
    save_strategy="epoch",
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    disable_tqdm=disable_tqdm,
    report_to="tensorboard",
    seed=42,
)

# Create the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=packing,
    formatting_func=format_instruction,
    args=args,
)

# train the model
trainer.train()  # there will not be a progress bar since tqdm is disabled

# save model in local
trainer.save_model()
# Number of training epochs
num_train_epochs = 1
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = True
# Batch size per GPU for training
per_device_train_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule
lr_scheduler_type = "cosine"  # "constant"
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = False
# Save checkpoint every X updates steps
save_steps = 0
# Log every X updates steps
logging_steps = 25
# Disable tqdm
disable_tqdm = True
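To illustrate what the combination of `learning_rate = 2e-4`, `warmup_ratio = 0.03` and a cosine schedule means in practice, here is a minimal sketch of the resulting learning-rate curve. The function `lr_at_step` and the total of 1000 steps are hypothetical; the sketch approximates the shape of a linear warmup followed by a half-cosine decay to zero and is not the scheduler code used by the Trainer.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup from 0 to base_lr, then half-cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 15, 30, 500, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The learning rate starts at zero, reaches its peak once the warmup ends, and then falls smoothly back to zero by the last step.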
import torch
from peft import AutoPeftModelForCausalLM

# hf_model_repo is assumed to be defined elsewhere in the notebook
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# push merged model to the hub
merged_model.push_to_hub(hf_model_repo)
tokenizer.push_to_hub(hf_model_repo)
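Conceptually, merging folds the low-rank update into the base weights so that no separate adapter is needed at inference time. The toy sketch below shows this on a 3 × 3 matrix, using this article's notation ∆W = A · B and the LoRA paper's scaling factor alpha / r. All matrices, the input vector and the helpers `matmul` and `matvec` are made up for illustration; this is not how `merge_and_unload` is implemented internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # toy base weights
A = [[1.0], [0.0], [2.0]]                                # 3 x 1 adapter matrix
B = [[0.5, 0.0, -1.0]]                                   # 1 x 3 adapter matrix
lora_alpha, lora_r = 2.0, 1                              # toy values, rank 1
scaling = lora_alpha / lora_r

# Fold the scaled low-rank update into the base weights
delta_W = matmul(A, B)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta_W)]

x = [1.0, 2.0, 3.0]
# Base output plus scaled adapter output, computed separately
adapter_out = [b + scaling * a
               for b, a in zip(matvec(W, x), matvec(delta_W, x))]
# Output of the single merged matrix
merged_out = matvec(W_merged, x)
print(adapter_out, merged_out)
```

Both paths give the same result, which is why the merged model can be saved and served as an ordinary checkpoint.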
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# hf_model_repo and device_map are assumed to be defined elsewhere in the notebook

# Get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

# Create an instruction
instruction = "Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2."
input = ""

prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input}

### Response:
"""

# Tokenize the input
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# Run the model to infer an output
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.5)

# Print the result
print(f"Prompt:\n{prompt}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
Prompt:
### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2.

### Input:
arr = []
for i in range(10):
    if i % 2 == 0:
        arr.append(i)

### Response:

Generated instruction:
arr = [i for i in range(10) if i % 2 == 0]

Ground truth:
arr = [i for i in range(11) if i % 2 == 0]
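The generated answer and the ground truth differ only in the bound passed to `range`. A quick check makes the difference visible; the variable names here are chosen for illustration.

```python
generated = [i for i in range(10) if i % 2 == 0]
ground_truth = [i for i in range(11) if i % 2 == 0]

print(generated)      # [0, 2, 4, 6, 8]
print(ground_truth)   # [0, 2, 4, 6, 8, 10]

# Values present in the ground truth but missing from the generated answer
print(sorted(set(ground_truth) - set(generated)))  # [10]
```

The model's list comprehension reproduces the behavior of the loop given in the Input, which also uses `range(10)`, whereas the ground truth follows the task description and includes 10.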
[1] Llama 2 paper. https://arxiv.org/pdf/2307.09288.pdf
[2] Python code dataset. https://huggingface.co/datasets/sahil2801/code_instructions_120k
[3] Dataset used in this article. https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca
[4] LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
[5] QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314
Author: Eduardo Muñoz