
Deploying vLLM Locally


1. Introduction to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

1-1. vLLM is fast

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graphs
  • Quantization support: GPTQ, AWQ, SqueezeLLM
  • Optimized CUDA kernels

1-2. vLLM is flexible and easy to use

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming output support
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs and AMD GPUs

1-3. vLLM seamlessly supports many Hugging Face models

Supported architectures include:

  • Baichuan & Baichuan2
  • ChatGLM
  • LLaMA & LLaMA-2
  • Mistral
  • Mixtral
  • Phi
  • Qwen
  • Yi
  • and more

2. Installing vLLM

As of this writing, vLLM binaries are compiled against CUDA 12.1 by default. However, you can install vLLM built with CUDA 11.8 by running the following commands. First, create and activate a conda environment:

conda create -n myvllm python=3.10 -y
conda activate myvllm
# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.2.4
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl

# Re-install PyTorch with CUDA 11.8.
pip uninstall torch -y
pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118

# Re-install xFormers with CUDA 11.8.
pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118
At this vLLM version, Mixtral-style MoE models additionally require megablocks, so install it as well:

pip install megablocks
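
After installing, it is worth sanity-checking that the CUDA 11.8 build is actually in use. The snippet below is a minimal check, assuming the cu118 wheels installed above; the torch version string should carry a +cu118 suffix.

import torch

# Should print a version ending in "+cu118" after the reinstall above.
print(torch.__version__)
# Should print True if a GPU is visible to PyTorch.
print(torch.cuda.is_available())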

3. Testing the native functionality

Create a test.py:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="lmsys/vicuna-7b-v1.5-16k")

# Generate texts from the prompts. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Run test.py:

python test.py

The output looks like this:

Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.94it/s]
Prompt: 'Hello, my name is', Generated text: ' Dustin Nelson and I am a senior at the University of Wisconsin-Mad'
Prompt: 'The president of the United States is', Generated text: ' a member of Congress. everybody knows that.״'
Prompt: 'The capital of France is', Generated text: ' one of the most visited cities in the world, known for its rich history,'
Prompt: 'The future of AI is', Generated text: ' likely to bring about many changes in our society, from healthcare to transportation'
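
As a variation on the script above, vLLM can also return several candidate completions per prompt: set n in SamplingParams, and each RequestOutput then carries n entries in its outputs list. A minimal sketch, reusing the same model:

from vllm import LLM, SamplingParams

# Ask for two sampled candidates per prompt.
sampling_params = SamplingParams(n=2, temperature=0.8, top_p=0.95)
llm = LLM(model="lmsys/vicuna-7b-v1.5-16k")

for output in llm.generate(["The capital of France is"], sampling_params):
    for i, candidate in enumerate(output.outputs):
        print(f"candidate {i}: {candidate.text!r}")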

4. Testing the LangChain integration

Install the langchain package:

pip install langchain

Create test_langchain.py:

from langchain.llms import VLLM

llm = VLLM(model="lmsys/vicuna-7b-v1.5-16k",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
           top_k=10,
           top_p=0.95,
           temperature=0.8,
           # tensor_parallel_size=... # for distributed inference
)

print(llm("What is the capital of France ?"))

Run test_langchain.py:

python test_langchain.py

The output looks like this:

Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.63s/it]

 nobody knows
 nobody knows
112. What is the capital of France?
The capital of France is Paris.
13. What is the capital of France?
The capital of France is Paris.
14. What is the capital of France?
The capital of France is Paris.
15. What is the capital of France?
The capital of France is Paris.
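
Because VLLM behaves like any other LangChain LLM, it can also be dropped into a chain. A minimal sketch using the PromptTemplate/LLMChain API of this LangChain generation (the template text is just an illustration):

from langchain.llms import VLLM
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = VLLM(model="lmsys/vicuna-7b-v1.5-16k",
           trust_remote_code=True,
           max_new_tokens=128)

# A simple question-answering prompt.
prompt = PromptTemplate(template="Question: {question}\n\nAnswer:",
                        input_variables=["question"])

chain = LLMChain(prompt=prompt, llm=llm)
print(chain.run("What is the capital of France?"))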

5. Starting an OpenAI-compatible API server with vLLM

Run the following command:

python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --trust-remote-code

Test it:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

The output looks like this:

{
  "id": "cmpl-e4d0ef3b607c4fbead686b86b8a2441e",
  "object": "chat.completion",
  "created": 160,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "This is how we get a shitty post about a kid being a racist...\nKinda like how we get a shitty comment about a kid being a pedophile."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 51,
    "completion_tokens": 34
  }
}
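
The same endpoint can also be called from Python with the official openai package instead of curl. A minimal sketch, assuming the pre-1.0 client (openai<1) and the server from above running on localhost:8000:

import openai

# The server does not check API keys by default, but the client requires a value.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

completion = openai.ChatCompletion.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(completion.choices[0].message.content)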

6. Reference & Thanks

Done!
