Quantizing Qwen2-7B and accelerating inference with LMDeploy


Before quantization: 53 words/s


python benchmark_qwen.py 
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:37<00:00, 39.39s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 53.064 words/s
lmdeploy lite auto_awq /root/models/Qwen2-7B-Instruct --work-dir /root/models/Qwen2-7B-Instruct-4bit
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.45s/it]
Traceback (most recent call last):
  File "/root/.conda/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 54, in auto_awq
    model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 155, in calibrate
    raise RuntimeError(
RuntimeError: Currently, quantification and calibration of Qwen2ForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, BaichuanForCausalLM, LlamaForCausalLM.
The fix used here is to install LMDeploy 0.4.2 and rerun the quantization command:
pip install lmdeploy[all]==0.4.2
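Before rerunning the quantization, it can help to confirm which LMDeploy version is actually installed in the active environment. The following is a minimal sketch, assuming that the 0.4.2 release pinned above is the first one that worked for this setup; the helper names `version_tuple`, `is_at_least`, and `installed_version` are hypothetical and not part of LMDeploy.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.4.2' (or '0.4.2.post1') into (0, 4, 2), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically, not lexically."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: 0.4.2 is the minimum, because that is the version pinned in this post.
    found = installed_version("lmdeploy")
    if found is None:
        print("lmdeploy is not installed in this environment")
    elif is_at_least(found, "0.4.2"):
        print("lmdeploy", found, "is at least 0.4.2")
    else:
        print("lmdeploy", found, "is older than 0.4.2; the auto_awq error above is likely")
```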

After quantization: 312 words/s


python benchmark_qwen_4bit.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                                       
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 312.444 words/s
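GPU memory usage is almost the same as before, because the memory freed by the smaller weights is allocated to a larger KV cache.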
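The arithmetic behind this can be sketched as follows. It assumes LMDeploy's TurboMind engine reserves a fixed fraction of the memory that is free after the weights are loaded (the `cache_max_entry_count` option, documented as defaulting to 0.8 in recent releases). The 24 GB GPU size and the rough weight sizes are hypothetical figures for illustration, not measurements from this post.

```python
def memory_plan_gb(total_gb, weights_gb, cache_ratio=0.8):
    """Estimate GPU memory use when the KV cache takes a fraction of free memory."""
    free_gb = total_gb - weights_gb
    kv_cache_gb = free_gb * cache_ratio
    return {
        "weights": weights_gb,
        "kv_cache": kv_cache_gb,
        "used": weights_gb + kv_cache_gb,
    }


PARAMS = 7.6e9    # approximate parameter count of Qwen2-7B
TOTAL_GB = 24.0   # hypothetical GPU size, not stated in the post

# Rough weight sizes: 2 bytes per weight for FP16, 0.5 bytes for 4-bit
# (ignores quantization scales and layers that stay unquantized).
fp16 = memory_plan_gb(TOTAL_GB, PARAMS * 2 / 1e9)
int4 = memory_plan_gb(TOTAL_GB, PARAMS * 0.5 / 1e9)

for name, plan in (("fp16", fp16), ("4-bit", int4)):
    print("{}: weights {:.2f} GB, kv cache {:.2f} GB, used {:.2f} GB".format(
        name, plan["weights"], plan["kv_cache"], plan["used"]))
```

Under these assumptions both configurations end up using most of the GPU, while the 4-bit model gets more than twice the KV cache. If lower memory use matters more than cache capacity, the ratio can be reduced, for example by passing `backend_config=TurbomindEngineConfig(cache_max_entry_count=0.4)` to `pipeline`.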



import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/models/Qwen2-7B-Instruct-4bit')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"  # "Please introduce yourself."
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    # len() counts characters, so "words/s" below is really characters per second
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

# total_seconds() already includes the fractional part
delta_time = (end_time - start_time).total_seconds()
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
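The same warm-up-then-measure pattern is repeated in every benchmark script in this post, so it can be factored into one helper. This is only a sketch: `measure_chars_per_second` and `fake_generate` are hypothetical names, and the helper uses `time.perf_counter()` instead of `datetime.now()` because it is a monotonic clock intended for timing.

```python
import time


def measure_chars_per_second(generate, prompt, runs=10, warmup=5):
    """Call generate(prompt) repeatedly and return generated characters per second.

    `generate` is any callable that takes a prompt string and returns the
    generated text as a string.
    """
    # Warm-up runs are excluded from the measurement.
    for _ in range(warmup):
        generate(prompt)

    total_chars = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_chars += len(generate(prompt))
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_chars / elapsed


def fake_generate(prompt):
    """Stand-in for a real model so the sketch runs without a GPU."""
    time.sleep(0.01)
    return "x" * 20


if __name__ == "__main__":
    speed = measure_chars_per_second(fake_generate, "hello", runs=3, warmup=1)
    print("Speed: {:.3f} chars/s".format(speed))
```

With the LMDeploy pipeline above, the callable would be something like `lambda p: pipe([p])[0].text`.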
python benchmark_transformer_qwen2.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards:   0%|                                                                                               | 0/3 [00:00<?, ?it/s]/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.68it/s]
Warm up...[1/5]
Traceback (most recent call last):
  File "/root/benchmark_transformer_qwen2.py", line 15, in <module>
    response, history = model.chat(tokenizer, inp, history=[])
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat'
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/models/Qwen2-7B-Instruct", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/models/Qwen2-7B-Instruct", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

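# Qwen2ForCausalLM has no built-in chat() method, so build one on top of
# apply_chat_template() and generate().
def chat(tokenizer, ques, history=None, **kw):
    # Tokenize the conversation and append the assistant generation prompt.
    iids = tokenizer.apply_chat_template(
        (history or []) + [{'role': 'user', 'content': ques}],
        add_generation_prompt=True,
    )
    oids = model.generate(
        inputs=torch.tensor([iids]).to(model.device),
        **kw,
    )
    # Keep only the newly generated tokens and drop a trailing EOS token.
    oids = oids[0][len(iids):].tolist()
    if oids and oids[-1] == tokenizer.eos_token_id:
        oids = oids[:-1]
    ans = tokenizer.decode(oids)
    return ans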

model.chat = chat

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = model.chat(tokenizer, inp)

# test speed
inp = "请介绍一下你自己。"  # "Please introduce yourself."
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = model.chat(tokenizer, inp)
    # len() counts characters, so "words/s" below is really characters per second
    total_words += len(response)
end_time = datetime.datetime.now()

# total_seconds() already includes the fractional part
delta_time = (end_time - start_time).total_seconds()
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))
python benchmark_transformer_qwen2.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.01it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 75.586 words/s
python benchmark_transformer_qwen2.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.40it/s]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 4.495 words/s
huggingface-cli download --resume-download Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --local-dir /root/models/Qwen2-7B-Instruct-GPTQ-Int4/


Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 11.385 words/s
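To compare the runs, the speeds reported above can be put side by side. The labels below are one reading of the logs, not statements from the post itself: the 4.495 words/s run printed the AWQ loading warning, and the 11.385 words/s run is assumed to use the GPTQ-Int4 checkpoint downloaded just before it.

```python
# Speeds copied from the "Speed: ..." lines in the logs above.
# The labels are assumptions inferred from the surrounding commands and warnings.
results = {
    "benchmark_qwen.py, before quantization": 53.064,
    "LMDeploy pipeline, AWQ 4-bit": 312.444,
    "transformers, FP16": 75.586,
    "transformers, AWQ 4-bit": 4.495,
    "transformers, GPTQ-Int4": 11.385,
}

baseline = results["benchmark_qwen.py, before quantization"]
speedups = {name: speed / baseline for name, speed in results.items()}

# Print the runs from fastest to slowest with their ratio to the baseline.
for name, speed in sorted(results.items(), key=lambda item: -item[1]):
    print("{:<42} {:>8.3f} words/s  x{:.2f}".format(name, speed, speedups[name]))
```

On these numbers, the LMDeploy 4-bit pipeline is the only quantized configuration that runs faster than the unquantized baseline; loading the 4-bit checkpoints through transformers was slower than FP16 in this setup.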
