vLLM is a high-speed inference framework for large language models, open-sourced by the LMSYS organization at UC Berkeley. It is built around a new attention algorithm, PagedAttention, and aims to make LLM serving easy, fast, and cheap.
vLLM is a Python library that also ships precompiled C++ and CUDA (12.1) binaries.
1. Installation requirements:
2. Install with pip:
# Create a Python virtual environment with conda (optional)
conda create -n vllm python=3.11 -y
conda activate vllm

# Install vLLM with CUDA 12.1.
pip install vllm
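To verify that the installation picked up a CUDA-enabled build, a quick sanity check (the exact version string will vary with your environment):

python -c "import vllm; print(vllm.__version__)"
python -c "import torch; print(torch.cuda.is_available())"  # should print True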
vLLM also supports basic model inference and serving on x86 CPU platforms; the supported data types include FP32 and BF16.
1. Installation requirements:
2. Install build dependencies:
yum install -y gcc gcc-c++
3. Download the source code:
git clone https://github.com/vllm-project/vllm.git
4. Install Python dependencies:
pip install wheel packaging ninja "setuptools>=49.4.0" numpy psutil
# Run from inside the cloned source directory
cd vllm
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
5. Build and install:
VLLM_TARGET_DEVICE=cpu python setup.py install
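Once the build completes, a short smoke test confirms the CPU backend works end to end. This is a minimal sketch; facebook/opt-125m is only a small placeholder model, and any supported model you have locally works as well:

from vllm import LLM, SamplingParams

# A small model keeps the CPU test fast.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)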
1. vLLM downloads models from Hugging Face by default. To download models from ModelScope instead, set the following environment variable:
export VLLM_USE_MODELSCOPE=True
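The same switch can be set from Python; a minimal sketch (set the variable before vLLM loads the model so the flag takes effect):

import os

# Must be set before vLLM resolves the model source.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

An offline inference example with Qwen1.5-4B-Chat: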
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-4B-Chat")

# Pass the default decoding hyperparameters of Qwen1.5-4B-Chat.
# max_tokens caps the maximum generation length.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Input the model name or path. Can be a GPTQ or AWQ model.
llm = LLM(model="Qwen/Qwen1.5-4B-Chat", trust_remote_code=True)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
With vLLM, building an OpenAI-compatible API service is simple: it can be deployed as a server that implements the OpenAI API protocol. By default it starts at http://localhost:8000; you can customize the address with the --host and --port arguments. Run the command as shown below:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-4B-Chat
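For example, to listen on all interfaces on a different port (adjust the host and port to match your deployment):

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-4B-Chat \
    --host 0.0.0.0 \
    --port 8080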
Query Qwen with curl:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen1.5-4B-Chat",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
]
}'
Query Qwen with the OpenAI Python client:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-4B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:", chat_response)