https://www.bilibili.com/video/BV1ei4y1v7LY/
[chatglm] (9): Deploy the chatglm3-6b model with FastChat and vLLM, and run a simple speed comparison. vLLM is indeed noticeably faster.
CUDA 11.8 or newer is required.
Then simply run:
pip3 install vllm
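If you are unsure which CUDA version the machine actually provides, you can check before installing. A minimal sketch; it assumes the NVIDIA driver is installed, and nvcc is only available if the CUDA toolkit is present:
nvidia-smi          # driver version and the highest CUDA runtime it supports
nvcc --version      # CUDA toolkit version, if the toolkit is installed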
Download the model:
apt update && apt install git-lfs
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git
python -m vllm.entrypoints.api_server --trust-remote-code --model /root/autodl-tmp/chatglm3-6b
INFO 12-11 08:09:34 llm_engine.py:73] Initializing an LLM engine with config: model='/root/autodl-tmp/chatglm3-6b', tokenizer='/root/autodl-tmp/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
WARNING 12-11 08:09:34 tokenizer.py:79] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 12-11 08:09:46 llm_engine.py:222] # GPU blocks: 20255, # CPU blocks: 9362
INFO: Started server process [2175]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
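Note that vllm.entrypoints.api_server only exposes a simple /generate endpoint; the OpenAI-style /v1/chat/completions route used by the test script below is served by FastChat's openai_api_server (or by vLLM's separate vllm.entrypoints.openai.api_server). A minimal smoke test against this server, with a placeholder prompt and sampling parameters:
curl -s http://127.0.0.1:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "你好，请介绍一下你自己", "max_tokens": 64, "temperature": 0.7}'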
Test the API:
test_throughput.py
https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py
# coding=utf-8
"""
Throughput test tool.
Usage:
python3 test_throughput.py --api-address http://localhost:8000 --model-name chatglm3-6b --n-thread 10
"""
import argparse
import json
import requests
import threading
import time


def main():
    headers = {"User-Agent": "openai client", "Content-Type": "application/json"}
    # Chinese prompt: "write a 50-character story"
    ploads = {
        "model": args.model_name,
        "messages": [{"role": "user", "content": "生成一个50字的故事"}],
        "temperature": 0.7,
    }
    thread_api_addr = args.api_address

    def send_request(results, i):
        # Each thread posts one chat completion request and records
        # how many completion tokens the server generated.
        print(f"thread {i} goes to {thread_api_addr}")
        response = requests.post(
            thread_api_addr + "/v1/chat/completions",
            headers=headers,
            json=ploads,
            stream=False,
        )
        print(response.text)
        response_new_words = json.loads(response.text)["usage"]["completion_tokens"]
        # error_code = json.loads(response.text)["error_code"]
        print(f"=== Thread {i} ===, words: {response_new_words} ")
        results[i] = response_new_words

    # use N threads to prompt the backend concurrently
    tik = time.time()
    threads = []
    results = [None] * args.n_thread
    for i in range(args.n_thread):
        t = threading.Thread(target=send_request, args=(results, i))
        t.start()
        # time.sleep(0.5)
        threads.append(t)

    for t in threads:
        t.join()

    print(f"Time (POST): {time.time() - tik} s")
    # aggregate throughput: total completion tokens across all threads / wall time
    n_words = sum(results)
    time_seconds = time.time() - tik
    print(
        f"Time (Completion): {time_seconds}, n threads: {args.n_thread}, "
        f"throughput: {n_words / time_seconds} words/s."
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-address", type=str, default="http://127.0.0.1:8000")
    parser.add_argument("--model-name", type=str, default="chatglm3-6b")
    parser.add_argument("--n-thread", type=int, default=10)
    args = parser.parse_args()
    main()
A simple comparison between FastChat and vLLM:
No quantization and no other tuning was applied.
FastChat reached roughly 20 tokens/s, while vLLM reached 200+ tokens/s, so the speed advantage is indeed significant.
However, the content returned by vLLM was noticeably worse than what FastChat returned.
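For reference, the FastChat side of this comparison would typically be served with its standard three-process stack; a minimal sketch, assuming default ports, the same model path, and the OpenAI-compatible server listening on 8000 for test_throughput.py:
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path /root/autodl-tmp/chatglm3-6b
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000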