(The test device is 50% of an A100.)
Test performance with lmdeploy:
python benchmark_qwen.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:37<00:00, 39.39s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 53.064 words/s
lmdeploy lite auto_awq /root/models/Qwen2-7B-Instruct --work-dir /root/models/Qwen2-7B-Instruct-4bit
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.45s/it]
Traceback (most recent call last):
File "/root/.conda/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
args.run(args)
File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
auto_awq(**kwargs)
File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 54, in auto_awq
model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 155, in calibrate
raise RuntimeError(
RuntimeError: Currently, quantification and calibration of Qwen2ForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, BaichuanForCausalLM, LlamaForCausalLM.
The cause is that the lmdeploy version is too old; upgrade to 0.4 or later:
pip install lmdeploy[all]==0.4.2
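After upgrading, it may help to confirm the installed version before retrying the quantization (a sanity check added here, not part of the original run):
python -c "import lmdeploy; print(lmdeploy.__version__)"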
After 4-bit quantization the speed is about 6x higher:
python benchmark_qwen_4bit.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 312.444 words/s
GPU memory usage is almost unchanged, because the memory freed by quantization is allocated to a larger KV cache.
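This is expected: lmdeploy's TurboMind backend reserves a fixed fraction of the free GPU memory for the KV cache after loading the weights, so the 4-bit model simply leaves more room for cache entries. A minimal sketch of tuning that fraction, assuming lmdeploy 0.4.x where TurbomindEngineConfig exposes cache_max_entry_count (default 0.8):

from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is the fraction of free GPU memory reserved for
# the KV cache; lowering it shrinks the overall memory footprint at the
# cost of fewer concurrent sequences.
pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit',
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.4))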
lmdeploy test code:
import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i + 1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    # len() counts characters of the decoded text, so "words/s" is really
    # characters per second; fine as a relative throughput measure
    total_words += len(response[0].text)
end_time = datetime.datetime.now()
delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
Running the transformers benchmark reports an error:
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8.68it/s]
Warm up...[1/5]
Traceback (most recent call last):
File "/root/benchmark_transformer_qwen2.py", line 15, in <module>
response, history = model.chat(tokenizer, inp, history=[])
File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'
The cause is that Qwen2ForCausalLM has no chat method, so one needs to be added; the code is as follows:
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/models/Qwen2-7B-Instruct",
                                          trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is
# loaded as float32 and causes an OOM error.
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct",
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True).cuda()
model = model.eval()

def chat(tokenizer, ques, history=[], **kw):
    # Build the prompt with the chat template and tokenize it
    iids = tokenizer.apply_chat_template(
        history + [{'role': 'user', 'content': ques}],
        add_generation_prompt=True,
    )
    oids = model.generate(
        inputs=torch.tensor([iids]).to(model.device),
        max_new_tokens=512,
    )
    # Strip the prompt and the trailing EOS token before decoding
    oids = oids[0][len(iids):].tolist()
    if oids[-1] == tokenizer.eos_token_id:
        oids = oids[:-1]
    ans = tokenizer.decode(oids)
    return ans

model.chat = chat

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i + 1))
    response = model.chat(tokenizer, inp)

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = model.chat(tokenizer, inp)
    total_words += len(response)
end_time = datetime.datetime.now()
delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
Before quantization:
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.01it/s]
cuda:0
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 75.586 words/s
After quantization, inference through transformers is actually much slower:
python benchmark_transformer_qwen2.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8.40it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 4.495 words/s
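Part of the slowdown is likely that the plain transformers path runs the AWQ kernels layer by layer without any fusion. A hedged alternative, assuming the autoawq package is installed and the work-dir is a standard AWQ checkpoint (both assumptions, not verified here), is to load it through AutoAWQ with fused modules:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = '/root/models/Qwen2-7B-Instruct-4bit'  # assumed standard AWQ checkpoint
# fuse_layers=True swaps in AutoAWQ's fused attention/MLP modules, which are
# typically much faster than the generic transformers AWQ path; whether Qwen2
# is covered depends on the installed autoawq version.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)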
The weight format also changed from safetensors to bin.
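If safetensors output is preferred, the quantized checkpoint can in principle be re-saved; a minimal sketch, assuming the work-dir loads cleanly through transformers (the '-st' suffix path is made up for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

src = '/root/models/Qwen2-7B-Instruct-4bit'
model = AutoModelForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)
# safe_serialization=True writes *.safetensors instead of pytorch_model*.bin
model.save_pretrained(src + '-st', safe_serialization=True)
tokenizer.save_pretrained(src + '-st')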
Download Qwen2-7B-Instruct-GPTQ-Int4:
huggingface-cli download --resume-download Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --local-dir /root/models/Qwen2-7B-Instruct-GPTQ-Int4/
This is slightly faster, but still slower than the unquantized Qwen2-7B-Instruct:
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 11.385 words/s
Qwen2 efficiency evaluation reference:
https://qwen.readthedocs.io/zh-cn/latest/benchmark/speed_benchmark.html