vLLM is a fast and easy-to-use library for LLM inference and serving.
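If the default CUDA 12.1 build suits your environment, installation is a single pip command:

pip install vllm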
As of this writing, vLLM's binaries are compiled against CUDA 12.1 by default. However, you can install vLLM built with CUDA 11.8 by running the following commands:
# Create and activate a fresh conda environment.
conda create -n myvllm python=3.10 -y
conda activate myvllm
# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.2.4
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl
# Re-install PyTorch with CUDA 11.8.
pip uninstall torch -y
pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118
# Re-install xFormers with CUDA 11.8.
pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118
# Install MegaBlocks (needed for MoE models such as Mixtral).
pip install megablocks
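To verify that the CUDA 11.8 wheels were actually picked up, a quick sanity check is to print the CUDA version PyTorch was built against; it should report 11.8:

python -c "import torch; print(torch.version.cuda)"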
Create a file named test.py:
from vllm import LLM, SamplingParams

# Sample prompts to batch together.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling configuration shared by all prompts.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model and run batched generation.
llm = LLM(model="lmsys/vicuna-7b-v1.5-16k")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Run test.py:
python test.py
The output looks like this:
Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.94it/s]
Prompt: 'Hello, my name is', Generated text: ' Dustin Nelson and I am a senior at the University of Wisconsin-Mad'
Prompt: 'The president of the United States is', Generated text: ' a member of Congress. everybody knows that.״'
Prompt: 'The capital of France is', Generated text: ' one of the most visited cities in the world, known for its rich history,'
Prompt: 'The future of AI is', Generated text: ' likely to bring about many changes in our society, from healthcare to transportation'
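All of the decoding knobs live on SamplingParams. For deterministic, reproducible output, the script above can be switched to greedy decoding; in vLLM, temperature=0 disables sampling (a minimal variation; the max_tokens value is illustrative):

# Greedy decoding: the argmax token is picked at every step.
sampling_params = SamplingParams(temperature=0, max_tokens=64)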
Install the langchain package:
pip install langchain
Create test_langchain.py:
from langchain.llms import VLLM

llm = VLLM(model="lmsys/vicuna-7b-v1.5-16k",
           trust_remote_code=True,  # mandatory for hf models
           max_new_tokens=128,
           top_k=10,
           top_p=0.95,
           temperature=0.8,
           # tensor_parallel_size=... # for distributed inference
)

print(llm("What is the capital of France ?"))
Run test_langchain.py:
python test_langchain.py
The output looks like this:
Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.63s/it]
nobody knows
nobody knows
112. What is the capital of France?
The capital of France is Paris.
13. What is the capital of France?
The capital of France is Paris.
14. What is the capital of France?
The capital of France is Paris.
15. What is the capital of France?
The capital of France is Paris.
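Since VLLM behaves like any other LangChain LLM, it can also be composed into chains. A minimal sketch using PromptTemplate and LLMChain from the same langchain 0.0.x era (names follow that API; treat it as illustrative):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Reuse the llm object created in test_langchain.py.
template = "What is the capital of {country}?"
prompt = PromptTemplate(input_variables=["country"], template=template)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(country="France"))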
Run the following command to start an OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --trust-remote-code
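Once the server is up, you can confirm it is serving the model by listing the available models; the OpenAI-compatible server exposes the standard /v1/models endpoint:

curl http://localhost:8000/v1/models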
Test it with a chat completion request:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
The output looks like this:
{
    "id": "cmpl-e4d0ef3b607c4fbead686b86b8a2441e",
    "object": "chat.completion",
    "created": 160,
    "model": "facebook/opt-125m",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "This is how we get a shitty post about a kid being a racist...\nKinda like how we get a shitty comment about a kid being a pedophile."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 17,
        "total_tokens": 51,
        "completion_tokens": 34
    }
}
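Note that facebook/opt-125m is a tiny base language model with no chat tuning, which is why the assistant reply above is incoherent. For base models, the plain /v1/completions endpoint is usually the better fit (a sketch following the OpenAI-compatible completions API; the prompt and parameters are illustrative):

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'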
Done!