
Model Quantization in HuggingFace (int8)

Installing the libraries

Loading a 7B LLM on the GPU for inference, my 24 GB of VRAM could not even get through a single inference pass: Out of Memory.

Here we use the Quanto library to quantize the model.

The quanto==0.1.0 release requires torch > 2.2.0, so upgrade torch first:

pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
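After installing, a quick sanity check confirms that the CUDA build of torch is active (a minimal sketch; the reported values depend on your machine):

import torch

print(torch.__version__)           # expect 2.2.2
print(torch.cuda.is_available())   # True if the cu118 build can see a GPU
print(torch.cuda.device_count())   # number of visible GPUs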

Then install:

pip install quanto
pip install accelerate

Versions used here: transformers==4.40.0 and quanto==0.1.0.
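To confirm your environment matches these versions, you can query the installed packages (a minimal check using the standard-library importlib.metadata; the package names assumed here are the ones installed above):

from importlib.metadata import version

for pkg in ("torch", "transformers", "quanto", "accelerate"):
    print(pkg, version(pkg))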

-----------------------------------------------------------------------------

Quanto quantization requires a gcc version greater than 9.0.0 (you can upgrade it yourself; a quick check is shown below).
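If you prefer to check the compiler from the same Python session, here is a small sketch (it assumes gcc is on PATH):

import subprocess

# Print the first line of `gcc --version`, which includes the version number
out = subprocess.check_output(["gcc", "--version"], text=True)
print(out.splitlines()[0])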

----------------------------------------------------------------------------

Quantization

After the model is quantized, run text generation.

The code is below; inference now completes with less than 13 GB of VRAM.

from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoConfig
import torch
import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"

def generate_text(model, input_text):
    # Tokenize the prompt, move it to the target device, and generate up to 50 new tokens
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return decoded_output

tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.9-llama3-8b", padding_side="left")

# Quantize the weights to int8 while loading; the model is placed on cuda:1
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.9-llama3-8b",
    device_map="cuda:1",
    quantization_config=quantization_config,
)

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
result = generate_text(quantized_model, "How many steps can put elephants into a refrigerator?")
print(result)
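To verify the memory footprint on your own hardware, you can read the peak allocation recorded by torch after generation finishes (a small addition, not part of the original script; device index 1 matches the device_map above):

# Peak VRAM allocated by torch on cuda:1, in GiB
peak_gib = torch.cuda.max_memory_allocated(torch.device("cuda:1")) / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")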

------------------------------------------------------------------------------------------

HuggingFace distributed loading

An alternative: when loading the model, HuggingFace can load it directly in bfloat16 and automatically shard it across the available GPUs for distributed execution, which is much more convenient than quantizing:

model = AutoModelForCausalLM.from_pretrained("cognitivecomputations/dolphin-2.9-llama3-8b", torch_dtype=torch.bfloat16, device_map="auto")
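Once loaded this way, you can inspect how the layers were distributed across devices; the hf_device_map attribute is populated whenever device_map is used (a minimal sketch):

# Mapping from submodule name to the device it was placed on, e.g. {"model.layers.0": 0, ...}
print(model.hf_device_map)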
