赞
踩
一.量化模型调用方式
下面是一个调用FlagAlpha/Llama2-Chinese-13b-Chat[1]的4bit压缩版本FlagAlpha/Llama2-Chinese-13b-Chat-4bit[2]
的例子:
from transformers import AutoTokenizer from auto_gptq import AutoGPTQForCausalLM model = AutoGPTQForCausalLM.from_quantized('FlagAlpha/Llama2-Chinese-13b-Chat-4bit', device="cuda:0") tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Llama2-Chinese-13b-Chat-4bit',use_fast=False) input_ids = tokenizer(['<s>Human: 怎么登上火星\n</s><s>Assistant: '], return_tensors="pt",add_special_tokens=False).input_ids.to('cuda') generate_input = { "input_ids":input_ids, "max_new_tokens":512, "do_sample":True, "top_k":50, "top_p":0.95, "temperature":0.3, "repetition_penalty":1.3, "eos_token_id":tokenizer.eos_token_id, "bos_token_id":tokenizer.bos_token_id, "pad_token_id":tokenizer.pad_token_id } generate_ids = model.generate(**generate_input) text = tokenizer.decode(generate_ids[0]) print(text)
这里面有个问题就是由Llama2-Chinese-13b-Chat
如何得到Llama2-Chinese-13b-Chat-4bit
?这涉及另外一个AutoGPTQ库(一个基于GPTQ算法,简单易用且拥有用户友好型接口的大语言模型量化工具包)[3]。先梳理下思路,由于meta-llama/Llama-2-13b-chat-hf
对中文支持较差,所以采用中文指令集在此基础上进行LoRA微调得到了FlagAlpha/Llama2-Chinese-13b-Chat-LoRA
,而FlagAlpha/Llama2-Chinese-13b-Chat=FlagAlpha/Llama2-Chinese-13b-Chat-LoRA+meta-llama/Llama-2-13b-chat-hf
,即将两者参数合并后的版本。FlagAlpha/Llama2-Chinese-13b-Chat-4bit
就是对FlagAlpha/Llama2-Chinese-13b-Chat
进行4bit量化后的版本。总结起来就是如何合并,如何量化这2个问题。官方提供的一些合并参数后的模型[4],如下所示:
二.如何合并LoRA Model和Base Model
网上合并LoRA参数和原始模型的脚本很多,参考文献[6]亲测可用。合并后的模型格式包括pth
和huggingface
两种。如下所示:
1.LoRA Model文件列表
对于LLama2-7B-hf进行LoRA微调生成文件如下所示:
adapter_config.json
adapter_model.bin
optimizer.pt
README.md
rng_state.pth
scheduler.pt
special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json
trainer_state.json
training_args.bin
2.Base Model文件列表
LLama2-7B-hf文件列表,如下所示:
config.json generation_config.json gitattributes.txt LICENSE.txt model-00001-of-00002.safetensors model-00002-of-00002.safetensors model.safetensors.index.json pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin pytorch_model.bin.index.json README.md Responsible-Use-Guide.pdf special_tokens_map.json tokenizer.json tokenizer.model tokenizer_config.json USE_POLICY.md
3.合并后huggingface
文件列表
合并LoRA Model和Base Model后,生成huggingface格式文件列表,如下所示:
config.json
generation_config.json
pytorch_model-00001-of-00007.bin
pytorch_model-00002-of-00007.bin
pytorch_model-00003-of-00007.bin
pytorch_model-00004-of-00007.bin
pytorch_model-00005-of-00007.bin
pytorch_model-00006-of-00007.bin
pytorch_model-00007-of-00007.bin
pytorch_model.bin.index.json
special_tokens_map.json
tokenizer.model
tokenizer_config.json
4.合并后pth
文件列表
合并LoRA Model和Base Model后,生成pth格式文件列表,如下所示:
consolidated.00.pth
params.json
special_tokens_map.json
tokenizer.model
tokenizer_config.json
5.合并脚本[6]思路
以合并后生成huggingface模型格式为例,介绍合并脚本的思路,如下所示:
# 步骤1:加载base model base_model = LlamaForCausalLM.from_pretrained( base_model_path, # 基础模型路径 load_in_8bit=False, # 加载8位 torch_dtype=torch.float16, # float16 device_map={"": "cpu"}, # cpu ) # 步骤2:遍历LoRA模型 for lora_index, lora_model_path in enumerate(lora_model_paths): # 步骤3:根据base model和lora model来初始化PEFT模型 lora_model = PeftModel.from_pretrained( base_model, # 基础模型 lora_model_path, # LoRA模型路径 device_map={"": "cpu"}, # cpu torch_dtype=torch.float16, # float16 ) # 步骤4:将lora model和base model合并为一个独立的model base_model = lora_model.merge_and_unload() ...... # 步骤5:保存tokenizer tokenizer.save_pretrained(output_dir) # 步骤6:保存合并后的独立model LlamaForCausalLM.save_pretrained(base_model, output_dir, save_function=torch.save, max_shard_size="2GB")
合并LoRA Model和Base Model过程中输出日志可参考huggingface[7]和pth[8]。
三.如何量化4bit模型
如果得到了一个训练好的模型,比如LLama2-7B,如何得到LLama2-7B-4bit呢?因为模型参数越来越多,多参数模型的量化还是会比少参数模型的非量化效果要好。量化的方案非常的多[9][12],比如AutoGPTQ、GPTQ-for-LLaMa、exllama、llama.cpp等。下面重点介绍下AutoGPTQ的基础实践过程[10],AutoGPTQ进阶教程参考文献[11]。
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig # 量化配置 from transformers import AutoTokenizer # 第1部分:量化一个预训练模型 pretrained_model_name = r"L:/20230713_HuggingFaceModel/20230903_Llama2/Llama-2-7b-hf" # 预训练模型路径 quantize_config = BaseQuantizeConfig(bits=4, group_size=128) # 量化配置,bits表示量化后的位数,group_size表示分组大小 model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config) # 加载预训练模型 tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name) # 加载tokenizer examples = [ # 量化样本 tokenizer( "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm." ) ] # 翻译:准备examples(一个只有两个键'input_ids'和'attention_mask'的字典列表)来指导量化。这里只使用一个文本来简化代码,但是应该注意,使用的examples越多,量化后的模型就越好(很可能)。 model.quantize(examples) # 执行量化操作,examples提供量化过程所需的示例数据 quantized_model_dir = "./llama2_quantize_AutoGPTQ" # 保存量化后的模型 model.save_quantized(quantized_model_dir) # 保存量化后的模型 # 第2部分:加载量化模型和推理 from transformers import TextGenerationPipeline # 生成文本 device = "cuda:0" model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device) # 加载量化模型 pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device) # 得到pipeline管道 print(pipeline("auto-gptq is")[0]["generated_text"]) # 生成文本
参考文献:
[1]https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat
[2]https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat-4bit
[3]https://github.com/PanQiWei/AutoGPTQ/blob/main/README_zh.md
[4]https://github.com/FlagAlpha/Llama2-Chinese#基于Llama2的中文微调模型
[5]CPU中合并权重(合并思路仅供参考):https://github.com/yangjianxin1/Firefly/blob/master/script/merge_lora.py
[6]https://github.com/ai408/nlp-engineering/blob/main/20230916_Llama2-Chinese/tools/merge_llama_with_lora.py
[7]https://github.com/ai408/nlp-engineering/blob/main/20230916_Llama2-Chinese/tools/merge_llama_with_lora_log/merge_llama_with_lora_hf_log
[8]https://github.com/ai408/nlp-engineering/blob/main/20230916_Llama2-Chinese/tools/merge_llama_with_lora_log/merge_llama_with_lora_pt_log
[9]LLaMa量化部署:https://zhuanlan.zhihu.com/p/641641929
[10]AutoGPTQ基础教程:https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md
[11]AutoGPTQ进阶教程:https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/02-Advanced-Model-Loading-and-Best-Practice.md
[12]Inference Experiments with LLaMA v2 7b:https://github.com/djliden/inference-experiments/blob/main/llama2/README.md
[13]llama2_quantize_AutoGPTQ:https://github.com/ai408/nlp-engineering/blob/main/20230916_Llama2-Chinese/tools/llama2_quantize_AutoGPTQ.py
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。