
Deploying ChatGLM2 with vLLM and Serving an OpenAI-Compatible API Server for Asynchronous Access


Environment Setup

Follow the installation instructions in the vLLM GitHub repository (GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs); they are not repeated here.

Running the API

OpenAI-Compatible API Server

Run the command below. The default host is 0.0.0.0 and the default port is 8000; both can be overridden with --host and --port. When serving ChatGLM and similar models, you must pass --trust-remote-code.

python -m vllm.entrypoints.openai.api_server --model your_model_path --trust-remote-code
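Once the server is up, you can verify it is reachable and check the exact model name it registered, which is the same string you must pass as the model field in every request. A minimal sketch using requests against the OpenAI-style /v1/models endpoint that vLLM's server exposes (URL and port assume the defaults above):

import requests

# Query the server's model list; each entry's "id" is the exact string
# to pass as "model" in completion requests.
resp = requests.get('http://localhost:8000/v1/models')
resp.raise_for_status()
for m in resp.json()['data']:
    print(m['id'])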

You can test a call with the curl command below; note that the model parameter is required and must match the path passed to --model:

curl http://10.102.33.181:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "your_model_path",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

For additional parameters, see sampling_params in https://github.com/vllm-project/vllm/blob/9b294976a2373f6fda22c1b2e704c395c8bd0787/vllm/entrypoints/openai/api_server.py#L252:

sampling_params = SamplingParams(
    n=request.n,
    presence_penalty=request.presence_penalty,
    frequency_penalty=request.frequency_penalty,
    temperature=request.temperature,
    top_p=request.top_p,
    stop=request.stop,
    stop_token_ids=request.stop_token_ids,
    max_tokens=request.max_tokens,
    best_of=request.best_of,
    top_k=request.top_k,
    ignore_eos=request.ignore_eos,
    use_beam_search=request.use_beam_search,
    skip_special_tokens=request.skip_special_tokens,
    spaces_between_special_tokens=spaces_between_special_tokens,
)

For the meaning of each parameter, see the SamplingParams class in https://github.com/vllm-project/vllm/blob/9b294976a2373f6fda22c1b2e704c395c8bd0787/vllm/sampling_params.py.
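To illustrate, these fields go straight into the JSON body of the request. A hedged sketch, with arbitrary example values and the placeholder model name used throughout this post:

import json
import requests

# Example payload exercising a few of the sampling parameters listed
# above; the values are illustrative, not tuned recommendations.
payload = {
    "model": "your_model_path",      # placeholder: the value passed to --model
    "prompt": "San Francisco is a",
    "max_tokens": 64,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 40,
    "n": 2,                          # ask for two completions
    "stop": ["\n\n"]                 # stop at the first blank line
}
resp = requests.post('http://localhost:8000/v1/completions',
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
for choice in resp.json()["choices"]:
    print(choice["text"])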

Calling the API from a Program

import requests
import json

def get_response(prompt):
    raw_json_data = {
        "model": "your_model_path",  # placeholder: must match the --model value
        "prompt": prompt,
        "logprobs": 1,
        "max_tokens": 100,
        "temperature": 0
    }
    json_data = json.dumps(raw_json_data)
    headers = {
        "Content-Type": "application/json",
        "User-Agent": "PostmanRuntime/7.29.2",
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive"
    }
    response = requests.post('http://localhost:8000/v1/completions',
                             data=json_data,
                             headers=headers)
    if response.status_code == 200:
        response_data = json.loads(response.text)["choices"][0]['text']
        return response_data
    else:
        print(response.status_code, response.text)
        return None
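A quick smoke test of the function above, reusing the example prompt from earlier:

print(get_response("San Francisco is a"))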

Asynchronous Batch Requests with grequests

import json
import time
import grequests

headers = {'Content-Type': 'application/json'}
data = {
    "model": "your_model_path",  # placeholder: must match the --model value
    "prompt": "San Francisco is a",
    "logprobs": 1,
    "max_tokens": 100,
    "temperature": 0
}
start = time.time()
# Build a batch of 10 identical requests
req_list = [
    grequests.post('http://localhost:8000/v1/completions',
                   data=json.dumps(data), headers=headers)
    for i in range(10)
]
# Send them concurrently and collect the responses
res_list = grequests.map(req_list)
print(round(time.time() - start, 1))
print(json.loads(res_list[0].text)["choices"][0]['text'])
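grequests depends on gevent monkey-patching, which can conflict with other libraries. If you prefer the standard asyncio stack, the same batch can be sent with aiohttp; a minimal sketch assuming the same server, endpoint, and placeholder model name:

import asyncio
import json
import aiohttp

async def fetch(session, payload):
    # POST one completion request and return the generated text
    async with session.post('http://localhost:8000/v1/completions',
                            json=payload) as resp:
        result = await resp.json()
        return result["choices"][0]["text"]

async def main():
    payload = {
        "model": "your_model_path",  # placeholder: the value passed to --model
        "prompt": "San Francisco is a",
        "max_tokens": 100,
        "temperature": 0
    }
    async with aiohttp.ClientSession() as session:
        # Fire 10 requests concurrently, mirroring the grequests example
        tasks = [fetch(session, payload) for _ in range(10)]
        results = await asyncio.gather(*tasks)
    print(results[0])

asyncio.run(main())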