The chat generation interface of Qwen1.5 has changed: `model.chat()` is no longer supported. Overall, though, the answer quality of the 1.5 release is noticeably better.
This article is still being edited.
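For context, the first-generation Qwen chat checkpoints shipped a custom chat() helper loaded via trust_remote_code, while Qwen1.5 goes through the standard transformers chat template plus generate(). The snippet below is only a rough comparison sketch of the old usage pattern, not code from this article; the repo id and prompt are placeholders.

# First-generation Qwen (e.g. Qwen-7B-Chat): chat() came from the model's remote code.
# Rough sketch for comparison only; repo id and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
old_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# This call is what Qwen1.5 no longer provides:
response, history = old_model.chat(old_tokenizer, "Hello", history=None)

# Qwen1.5 instead uses tokenizer.apply_chat_template() + model.generate(),
# as shown in the full example below.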
The Qwen1.5-7B-Chat large language model is loaded with Hugging Face transformers as follows:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # transformers>=4.37.2

"""================ Qwen-7B weights are about 15GB; the inference footprint is about 17GB.
In theory fine-tuning needs GPU memory of at least 4x that (17GB * 4);
in practice plan for at least 5x the inference footprint. ================"""

device = "cuda"
model_id = "../model/Qwen1.5-7B-Chat"
# Set torch_dtype=torch.bfloat16 here, otherwise the model is loaded in full precision
# and the inference GPU memory roughly doubles from ~17GB to ~34GB.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=False)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)


prompt = """Help me turn on the air conditioner.
Set the air conditioner temperature to 24°C.
Switch the air conditioner to cooling mode and set the fan speed to low.
From the information above, extract the air conditioner's on/off state, temperature setting, fan speed and mode."""

print("=== * ===" * 50)

def qwen_chat(prompt):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Build the chat-formatted prompt string from the message list
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    print("=== tokenizer is finished ===")
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Note: set pad_token_id=tokenizer.eos_token_id here, otherwise a warning is printed;
    # the example script on the Hugging Face model card omits it.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id
    )

    # Strip the prompt tokens so only the newly generated tokens are decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    print(f'response: {response}')
    return response

if __name__ == '__main__':
    output = qwen_chat(prompt=prompt)
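The generate() call above only returns once the full answer has been produced. If you want tokens printed as they are generated, transformers also provides TextStreamer, which can be passed to generate(). Below is a minimal sketch reusing model, tokenizer and device from the example above; the function name qwen_chat_stream is an illustrative addition, not part of the original script.

from transformers import TextStreamer

def qwen_chat_stream(prompt):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # TextStreamer prints decoded tokens to stdout as they are generated;
    # skip_prompt=True keeps the input prompt out of the printed stream.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )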
Compared with the Hugging Face route, vLLM is stricter about its deployment environment: it requires CUDA 11.8 or above (ideally 12+) and torch 2.1 or above. For the exact dependencies of each release, check the requirements.txt of that version in the vLLM GitHub repository.
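Before picking a vLLM release it helps to confirm which torch and CUDA versions the current environment actually has. A small check like the following works; the exact minimum versions to compare against come from the requirements.txt of the release you intend to install.

import torch

# vLLM wheels are tied to specific torch / CUDA versions; print what this environment has
# before choosing a vLLM release.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version torch was built with:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))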
A code example follows:
from vllm import LLM
from vllm.sampling_params import SamplingParams


class QwenVllm(object):
    def __init__(self, gpu_num=2, max_tokens=512):
        self.gpu_num = gpu_num
        self.max_tokens = max_tokens
        self.model_path = "Qwen1.5-7B-Chat"
        self.model, self.tokenizer = self.model_load_with_vllm()

    def model_load_with_vllm(self):
        """Preload the model with vLLM."""

        model = LLM(
            tokenizer=self.model_path,
            model=self.model_path,
            dtype="bfloat16",
            tokenizer_mode='slow',
            trust_remote_code=False,
            tensor_parallel_size=self.gpu_num,
            gpu_memory_utilization=0.8,  # fraction of GPU memory reserved at init; here each card has 48GB
            max_context_len_to_capture=8192,
            max_model_len=8192,
        )

        tokenizer = model.get_tokenizer()

        return model, tokenizer

    def qwen_chat_vllm(self, prompt):
        """For vLLM batch inference, mind the relation between batch size and GPU memory
        (see the batch sketch after this block)."""

        messages = [
            {"role": "system", "content": "you are a great assistant."},
            {"role": "user", "content": prompt}
        ]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # max_tokens controls the maximum output length
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.8,
            repetition_penalty=1.05,
            max_tokens=self.max_tokens)

        outputs = self.model.generate([text], sampling_params)

        response = []
        for output in outputs:
            # prompt = output.prompt
            generated_text = output.outputs[0].text
            response.append(generated_text)

        return response


if __name__ == '__main__':
    run = QwenVllm(gpu_num=2, max_tokens=1024)
    # single-turn chat generation with the LLM
    prompt = """Please give me an introduction to Shanghai."""
    response = run.qwen_chat_vllm(prompt=prompt)
    print(response)
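As a rough illustration of the batch-size note in qwen_chat_vllm, vLLM's LLM.generate() also accepts a list of prompt strings, so several chat-formatted prompts can be scheduled in one call. A minimal sketch reusing the run instance from the __main__ block above; the questions are placeholders.

# Batch inference sketch: build one chat-formatted string per question and
# hand the whole list to vLLM in a single generate() call.
questions = [
    "Please give me an introduction to Shanghai.",
    "Please give me an introduction to Beijing.",
]

texts = [
    run.tokenizer.apply_chat_template(
        [{"role": "system", "content": "you are a great assistant."},
         {"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=run.max_tokens)
outputs = run.model.generate(texts, sampling_params)  # one output object per input prompt
for output in outputs:
    print(output.outputs[0].text)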
There are several ways to load and deploy a model with vLLM; the approach above is only one of them. See the official vLLM documentation for the details.
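One common alternative, for example, is vLLM's OpenAI-compatible API server, which serves the model over HTTP instead of embedding it in the Python process. The sketch below assumes the server has been launched separately and listens on localhost:8000; the port and served model name are assumptions, not values from this article.

# Launch the server first (shell), e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model Qwen1.5-7B-Chat --tensor-parallel-size 2 --dtype bfloat16
# It exposes an OpenAI-compatible /v1/chat/completions endpoint, assumed here on localhost:8000.
import requests

payload = {
    "model": "Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "you are a great assistant."},
        {"role": "user", "content": "Please give me an introduction to Shanghai."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])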