Deploying ChatGLM2 with vLLM and Serving an OpenAI-Compatible API Server with Asynchronous Access

Author: 笔触狂放9 | 2024-04-06 16:50:00

Environment setup

Install vLLM by following the instructions in its GitHub repository (vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs). The installation steps are not repeated here.

Running the server

OpenAI-compatible API server

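Start the server with the command below. The default host is 0.0.0.0 and the default port is 8000; both can be changed with --host and --port. When serving models such as ChatGLM, pass the --trust-remote-code flag.

python -m vllm.entrypoints.openai.api_server --model your_model_path --trust-remote-code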
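To make the optional flags concrete, here is a minimal sketch that assembles the launch command as an argument list. The helper name `build_server_command` is hypothetical (it is not part of vLLM); only the `--model`, `--host`, `--port`, and `--trust-remote-code` flags come from the text above.

```python
import shlex


def build_server_command(model_path, host=None, port=None, trust_remote_code=True):
    """Return the argv list for launching the vLLM OpenAI-compatible server.

    Hypothetical helper: host and port are only added when given, so the
    server falls back to its own defaults otherwise.
    """
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model_path]
    if host is not None:
        cmd += ["--host", host]
    if port is not None:
        cmd += ["--port", str(port)]
    if trust_remote_code:
        # Needed for models such as ChatGLM that ship custom modeling code.
        cmd.append("--trust-remote-code")
    return cmd


command = build_server_command("your_model_path", host="127.0.0.1", port=8080)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.Popen(command)` if you want to start the server from a script rather than from a shell.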


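You can test the server with the request below (replace the address with your own host and port). Note that the model parameter must always be passed, and its value has to be a quoted JSON string.

curl http://10.102.33.181:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "your_model_path",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'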
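Because the model field is mandatory, it can help to ask the server which model names it is serving before sending completions. The sketch below assumes the server exposes the OpenAI-style `/v1/models` endpoint and that its response has the usual `{"data": [{"id": ...}]}` shape; the function names are hypothetical, and the sample response is hand-written for illustration rather than captured from a real server.

```python
import json
from urllib.request import urlopen


def served_model_ids(models_response):
    """Extract the model ids from a /v1/models style response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000"):
    """Query a running server for its model ids (needs a reachable server)."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return served_model_ids(json.load(resp))


# Offline illustration with a hand-written response of the assumed shape.
sample = {"object": "list", "data": [{"id": "your_model_path", "object": "model"}]}
print(served_model_ids(sample))
```

With a server running, `fetch_model_ids()` should return the value to put in the model field of each request.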


On the server side, the fields of each request are passed to SamplingParams as in the excerpt below. For the meaning of each parameter, see the docstring of the SamplingParams class in vllm/sampling_params.py: https://github.com/vllm-project/vllm/blob/9b294976a2373f6fda22c1b2e704c395c8bd0787/vllm/sampling_params.py

sampling_params = SamplingParams(
    n=request.n,
    presence_penalty=request.presence_penalty,
    frequency_penalty=request.frequency_penalty,
    temperature=request.temperature,
    top_p=request.top_p,
    stop=request.stop,
    stop_token_ids=request.stop_token_ids,
    max_tokens=request.max_tokens,
    best_of=request.best_of,
    top_k=request.top_k,
    ignore_eos=request.ignore_eos,
    use_beam_search=request.use_beam_search,
    skip_special_tokens=request.skip_special_tokens,
    # local variable, presumably read from the request earlier in the server code
    spaces_between_special_tokens=spaces_between_special_tokens,
)
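Since the server reads these sampling fields from the request, a client can set them directly in the JSON body. The following is a small sketch of a payload builder that accepts only the field names listed in the excerpt; `build_completion_payload` and `SAMPLING_FIELDS` are hypothetical names, and whether a given field is accepted depends on the vLLM version you run.

```python
# Field names taken from the SamplingParams excerpt above.
SAMPLING_FIELDS = {
    "n", "presence_penalty", "frequency_penalty", "temperature", "top_p",
    "stop", "stop_token_ids", "max_tokens", "best_of", "top_k",
    "ignore_eos", "use_beam_search", "skip_special_tokens",
    "spaces_between_special_tokens",
}


def build_completion_payload(model, prompt, **sampling):
    """Build a /v1/completions request body, rejecting unknown sampling fields."""
    unknown = set(sampling) - SAMPLING_FIELDS
    if unknown:
        raise ValueError(f"unsupported sampling fields: {sorted(unknown)}")
    payload = {"model": model, "prompt": prompt}
    payload.update(sampling)
    return payload


payload = build_completion_payload(
    "your_model_path", "San Francisco is a", max_tokens=7, temperature=0, top_k=5
)
print(payload)
```

Rejecting unknown keys on the client is a design choice for catching typos early; the server performs its own validation of the request.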

Calling the API from a program


import json

import requests

# Placeholder: use the same value that was passed to --model when starting the server.
MODEL_PATH = "your_model_path"


def get_response(prompt):
    """Send one completion request; return the generated text, or None on failure."""
    raw_json_data = {
        "model": MODEL_PATH,
        "prompt": prompt,
        "logprobs": 1,
        "max_tokens": 100,
        "temperature": 0
    }
    json_data = json.dumps(raw_json_data)
    headers = {"Content-Type": "application/json"}
    response = requests.post('http://localhost:8000/v1/completions',
                             data=json_data,
                             headers=headers)
    if response.status_code != 200:
        # Report the failure and signal it to the caller.
        print(response.status_code, response.text)
        return None
    return json.loads(response.text)["choices"][0]['text']
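The request above sets logprobs to 1 but only reads the generated text. As a sketch of how the log probabilities could be used, the helper below sums the per-token values of the first choice. It assumes the response follows the OpenAI legacy completions layout, with a `logprobs.token_logprobs` list; the exact layout may vary with the vLLM version, `summarize_choice` is a hypothetical name, and the sample response is hand-written for illustration.

```python
import math


def summarize_choice(response):
    """Return the text, token count and total log probability of the first choice."""
    choice = response["choices"][0]
    logprobs = choice.get("logprobs") or {}
    # Entries can be None (for example for a token with no preceding context).
    token_logprobs = [lp for lp in logprobs.get("token_logprobs", []) if lp is not None]
    total = sum(token_logprobs)
    return {
        "text": choice["text"],
        "num_tokens": len(token_logprobs),
        "total_logprob": total,
        "probability": math.exp(total),
    }


# Hand-written response of the assumed shape.
sample_response = {
    "choices": [
        {
            "text": " city",
            "logprobs": {"tokens": [" c", "ity"], "token_logprobs": [-0.5, -0.25]},
        }
    ]
}
summary = summarize_choice(sample_response)
print(summary["text"], summary["num_tokens"], summary["total_logprob"])
```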

Asynchronous batch requests with grequests


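# grequests monkey-patches the standard library through gevent, so import it first.
import grequests

import json
import time

# Placeholders: set these to the served model name and the prompt to complete.
MODEL_PATH = "your_model_path"
prompt = "San Francisco is a"

headers = {'Content-Type': 'application/json'}
data = {
    "model": MODEL_PATH,
    "prompt": prompt,
    "logprobs": 1,
    "max_tokens": 100,
    "temperature": 0
}
start = time.time()
req_list = [   # 10 identical requests, sent concurrently by grequests.map
    grequests.post('http://localhost:8000/v1/completions', data=json.dumps(data), headers=headers) for _ in range(10)
]
res_list = grequests.map(req_list)  # failed requests come back as None
print(round(time.time() - start, 1))  # total elapsed time in seconds
if res_list[0] is not None:
    print(json.loads(res_list[0].text)["choices"][0]['text'])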
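In a batch, some requests may fail, and `grequests.map` returns None in their place, so reading only the first result can hide errors. The sketch below collects one text per response and keeps None for failures. `collect_texts` is a hypothetical helper, and the `SimpleNamespace` objects are offline stand-ins for real responses so that the example runs without a server.

```python
import json
from types import SimpleNamespace


def collect_texts(responses):
    """Return one generated text per response, or None where the request failed."""
    texts = []
    for res in responses:
        if res is None or res.status_code != 200:
            texts.append(None)
            continue
        texts.append(json.loads(res.text)["choices"][0]["text"])
    return texts


# Offline stand-ins: one success, one dropped request, one server error.
ok = SimpleNamespace(status_code=200, text=json.dumps({"choices": [{"text": " city"}]}))
err = SimpleNamespace(status_code=500, text="Internal Server Error")
print(collect_texts([ok, None, err]))
```

With the real library this would be called as `collect_texts(grequests.map(req_list, size=4))`, where the `size` argument of `grequests.map` caps how many requests are in flight at once.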
