Loading a 7B-class LLM onto the GPU for inference, my 24 GB of VRAM couldn't even get through a single forward pass: Out of Memory.
Here the Quanto library is used to quantize the model.
quanto==0.1.0 requires torch > 2.2.0, so it is recommended to upgrade torch first:
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
Then install:
- pip install quanto
- pip install accelerate
transformers version == 4.40.0
quanto == 0.1.0
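
Before running anything, it helps to confirm the environment actually matches the versions above. A minimal sketch using importlib.metadata (the package names are assumed to be the ones listed):

# Sketch: print installed versions to confirm they match the ones listed above
from importlib.metadata import version

for pkg in ("torch", "transformers", "quanto", "accelerate"):
    print(pkg, version(pkg))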
-----------------------------------------------------------------------------
quanto requires gcc > 9.0.0 during quantization (you can upgrade gcc yourself if needed)
----------------------------------------------------------------------------
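Since the gcc requirement only surfaces when quantization actually runs, it can be worth checking the local compiler version up front. A small sketch, assuming gcc is on PATH:

# Sketch: print the local gcc version (should be greater than 9.0.0)
import subprocess

print(subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout.splitlines()[0])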
After quantizing the model, run text generation again.
The code below completes inference with less than 13 GB of VRAM.
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoConfig
import torch
import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"


def generate_text(model, input_text):
    # Tokenize the prompt and move it to the same device as the model
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    # Decode the generated ids (prompt + continuation) back into text
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]


tokenizer = AutoTokenizer.from_pretrained(
    "cognitivecomputations/dolphin-2.9-llama3-8b", padding_side="left"
)

# Loading the model unquantized ran out of memory on the 24 GB card,
# so quantize the weights to int8 while loading instead.
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.9-llama3-8b",
    device_map="cuda:1",
    quantization_config=quantization_config,
)

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

result = generate_text(quantized_model, "How many steps can put elephants into a refrigerator?")
print(result)
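
As a quick sanity check on the "less than 13 GB" figure, you can print the peak GPU memory after generation. A minimal sketch, assuming the quantized model sits entirely on cuda:1 as above (get_memory_footprint is a transformers helper on the loaded model):

# Sketch: report peak allocated GPU memory and the quantized model's parameter footprint
peak_gib = torch.cuda.max_memory_allocated(torch.device("cuda:1")) / 1024**3
print(f"Peak memory allocated on cuda:1: {peak_gib:.2f} GiB")
print(f"Quantized model footprint: {quantized_model.get_memory_footprint() / 1024**3:.2f} GiB")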

------------------------------------------------------------------------------------------
Another option: when loading the model, Hugging Face can load it directly in bfloat16 and automatically spread it across the available GPUs (device_map="auto") for distributed execution, which is much more convenient than quantizing.
model = AutoModelForCausalLM.from_pretrained("cognitivecomputations/dolphin-2.9-llama3-8b",torch_dtype=torch.bfloat16,device_map="auto")
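
For completeness, here is a minimal end-to-end sketch of this bfloat16 route. The prompt and max_new_tokens are arbitrary illustration choices, and moving the inputs to model.device assumes accelerate has placed the embedding layer on a GPU:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "cognitivecomputations/dolphin-2.9-llama3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load weights in bfloat16 and let accelerate shard the layers across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("How many steps can put elephants into a refrigerator?",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])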