



1. Introduction to vLLM

2. Installing vLLM

2.1 Installing on GPU

2.2 Installing on CPU

2.3 Configuration

3. Using vLLM

3.1 Offline inference

3.2 OpenAI-compatible API service

1. Introduction to vLLM


2. Installing vLLM

2.1 Installing on GPU

        vLLM is a Python library that also ships pre-compiled C++ and CUDA (version 12.1) binaries.

       1. Requirements:

  • OS: Linux
  • Python: 3.8 – 3.11
  • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
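
The requirements above can be checked before installing. The following is a minimal sketch, not part of vLLM: the helper names `check_vllm_gpu_requirements` and `detect_compute_capability` are hypothetical, and the GPU query assumes PyTorch is already installed (it reports "no GPU" otherwise).

```python
import platform
import sys


def check_vllm_gpu_requirements(os_name, python_version, compute_capability):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if os_name != "Linux":
        problems.append(f"OS is {os_name}, but Linux is required")
    major_minor = tuple(python_version[:2])
    if not ((3, 8) <= major_minor <= (3, 11)):
        problems.append(f"Python {major_minor[0]}.{major_minor[1]} is outside 3.8 - 3.11")
    if compute_capability is None:
        problems.append("no CUDA GPU detected")
    elif tuple(compute_capability) < (7, 0):
        problems.append(f"compute capability {compute_capability} is below 7.0")
    return problems


def detect_compute_capability():
    """Query the first CUDA device through PyTorch; return None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    issues = check_vllm_gpu_requirements(
        platform.system(), sys.version_info, detect_compute_capability()
    )
    print("OK" if not issues else "\n".join(issues))
```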

        2. Install with pip:

# (Optional) Create a Python virtual environment with conda
conda create -n vllm python=3.11 -y
conda activate vllm

# Install vLLM with CUDA 12.1.
pip install vllm

2.2 Installing on CPU

        vLLM also supports basic model inference and serving on x86 CPU platforms, with FP32 and BF16 data types, subject to the following requirements:


  • OS: Linux
  • Compiler: gcc/g++>=12.3.0 (recommended)
  • Instruction set architecture (ISA) requirement: AVX512 is required.
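
The AVX512 requirement can be verified on Linux by reading the CPU flags. This is a small sketch under the assumption that `/proc/cpuinfo` is available and that the `avx512f` (AVX-512 Foundation) flag is a sufficient indicator; `has_avx512` is a hypothetical helper name.

```python
from pathlib import Path


def has_avx512(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx512f" in value.split():
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("AVX512 supported:", has_avx512(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```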


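        Install the compiler:

yum install -y gcc gcc-c++

        Clone the vLLM repository and enter the source directory:

git clone https://github.com/vllm-project/vllm.git
cd vllm

        Install the Python packages needed to build the CPU backend (the quotes keep the shell from treating ">=" as a redirection):

pip install wheel packaging ninja "setuptools>=49.4.0" numpy psutil
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

        Build and install the vLLM CPU backend:

VLLM_TARGET_DEVICE=cpu python setup.py install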

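2.3 Configuration

       1. By default vLLM downloads models from HuggingFace. To download them from ModelScope instead, set the environment variable VLLM_USE_MODELSCOPE=True before starting vLLM.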
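
When vLLM is driven from a Python script rather than a shell, the same variable can be set in-process. The sketch below assumes the `VLLM_USE_MODELSCOPE` variable; the `use_modelscope` helper is hypothetical. Setting it before `vllm` is imported is the safe order.

```python
import os


def use_modelscope(enabled=True):
    """Toggle the VLLM_USE_MODELSCOPE environment variable for this process."""
    if enabled:
        os.environ["VLLM_USE_MODELSCOPE"] = "True"
    else:
        os.environ.pop("VLLM_USE_MODELSCOPE", None)


use_modelscope()
# Import vllm only after this point so the setting is in place when models are resolved.
```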


3. Using vLLM

3.1 Offline inference

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Model name on the hub, or a local path such as "/data/weisx/model/Qwen1.5-4B-Chat".
# The tokenizer and the engine must point at the same model.
model_path = "Qwen/Qwen1.5-4B-Chat"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Pass the default decoding hyperparameters of Qwen1.5-4B-Chat
# max_tokens is the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model=model_path, trust_remote_code=True)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
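
To make clear what `apply_chat_template` hands to `llm.generate`, here is a rough sketch of the ChatML-style layout that Qwen chat models use. It is an approximation for illustration only: `render_chatml` is a hypothetical helper, and the tokenizer's own template remains authoritative.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML text layout: one <|im_start|>...<|im_end|> block per message."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # An open assistant turn tells the model to start generating its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."},
])
print(example)
```

Because the result is a plain string, several prompts can be rendered this way and passed to `llm.generate` as one list for batched inference.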

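3.2 OpenAI-compatible API service

        With vLLM it is easy to build a service that is compatible with the OpenAI API; it can be deployed as a server that implements the OpenAI API protocol. By default the server starts at http://localhost:8000. You can change the address with the --host and --port arguments. Run the following command: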

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-4B-Chat
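
If the server is launched from a script, the command line with custom `--host` and `--port` values can be assembled programmatically. This is a sketch only; `build_server_command` is a hypothetical helper, and the host and port values are arbitrary examples.

```python
import shlex
import sys


def build_server_command(model, host="localhost", port=8000):
    """Build the argument list for the vLLM OpenAI-compatible API server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", host,
        "--port", str(port),
    ]


cmd = build_server_command("Qwen/Qwen1.5-4B-Chat", host="0.0.0.0", port=8001)
print(shlex.join(cmd))
# To actually start the server, pass cmd to subprocess.Popen(cmd).
```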


        Once the server is running, you can send a chat request with curl:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-4B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."}
    ]
}'


        Alternatively, use the Python client from the openai package:

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-4B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
)
print("Chat response:", chat_response)
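
Printing the whole response object is verbose; usually only the assistant's text is needed. The sketch below pulls it out of the JSON body returned by the `/v1/chat/completions` endpoint. The `extract_reply` helper and the sample payload are hypothetical and hand-written for illustration, not real server output.

```python
import json


def extract_reply(response):
    """Return the assistant's text from a chat-completions response (JSON string or dict)."""
    data = json.loads(response) if isinstance(response, str) else response
    return data["choices"][0]["message"]["content"]


# Hand-written sample in the chat-completions response shape (illustrative only).
sample = json.dumps({
    "object": "chat.completion",
    "model": "Qwen/Qwen1.5-4B-Chat",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
})
print(extract_reply(sample))
```

With the openai client object from the example above, the equivalent access is `chat_response.choices[0].message.content`.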

