【ChatGLM3】（5）：使用 fastchat 部署ChatGLM3服务，启动8bit的worker，可以运行openai_api服务和web界面方便进行测试。还支持embeddings 接口！_fastchat openai_api server

作者：知新_RL | 2024-02-10 08:12:31

踩

fastchat openai_api server

0，视频演示

https://www.bilibili.com/video/BV1qH4y1q7gC/?vd_source=4b290247452adda4e56d84b659b0c8a2

终于弄明白FastChat服务了，本地部署ChatGLM3，BEG模型，可部署聊天接口，web展示和Embedding服务！

1，关于fastchat项目

FastChat是一个用于训练、部署和评估基于大型语言模型的聊天机器人的开放平台。其核心功能包括：

•最先进模型的权重、训练代码和评估代码（例如Vicuna、FastChat-T5）。•基于分布式多模型的服务系统，具有Web界面和与OpenAI兼容的RESTful API。

项目地址：
https://github.com/lm-sys/FastChat

2，使用docker启动项目

需要python3的环境，因为启动模型，需要nvidia的镜像。
整个docker 配置：

准备ChatGLM3的博客参考：

# 解决增加参数：https://github.com/lm-sys/FastChat/issues/657
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB (GPU 0; 10.91 GiB total capacity; 10.67 GiB already allocated; 31.75 MiB free; 10.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

# 通过增加参数  --load-8bit  --load-4bit 解决
# 类似 python 代码中的：.quantize(8).cuda()
1
2
3
4
5

查看各个服务的命令：
python3 -m fastchat.serve.controller --help
python3 -m fastchat.serve.model_worker --help
python3 -m fastchat.serve.openai_api_server --help
python3 -m fastchat.serve.gradio_web_server --help

# 把 chatglm3 的模型放到 /data/models 目录下：
docker run -itd --name fastchat -v /data/models:/data/models --gpus=all \
-p 8000:8000 pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

docker exec -it fastchat bash 

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple 
pip3 config set install.trusted-host mirrors.aliyun.com

# 安装依赖环境
pip3 install "fschat[model_worker,webui]"
pip3 install transformers accelerate sentencepiece

#首先启动 controller ：
python3 -m fastchat.serve.controller --host 172.17.0.2 --port 21001 
# 然后启动模型： 说明，必须是本地ip 
python3 -m fastchat.serve.model_worker --load-8bit --model-names chatglm3-6b --model-path /data/models/chatglm3-6b-models --controller-address http://172.17.0.2:21001 --worker-address http://172.17.0.2:8080 --host 0.0.0.0 --port 8080 
# 最后启动 openapi的 兼容服务 地址 8000
python3 -m fastchat.serve.openai_api_server --controller-address http://172.17.0.2:21001 --host 0.0.0.0 --port 8000

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

支持的其他模型：

https://github.com/lm-sys/FastChat/blob/main/docs/model_support.md
在这里插入图片描述
虽然没有写支持 chatglm3模型，但是确实可以运行。

3，执行 POST 参数

curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
     "model": "chatglm3-6b",
     "messages": [{"role": "user", "content": "北京景点"}],
     "temperature": 0.7
   }'

1
2
3
4
5
6

4，启动web 界面


 pip3 install fschat[model_worker,webui] pydantic==1.10.13

python3 -m fastchat.serve.gradio_web_server --controller-url http://172.17.0.2:21001 --host 0.0.0.0 --port 8000
1
2
3
4

在这里插入图片描述
可以选择大模型

5，可以成功注册 bge-large-zh 模型，可以使用接口进行embedding 了

可以支持的embedding 模型：
https://www.modelscope.cn/models/AI-ModelScope/bge-large-zh/summary

2023-11-18 14:37:48 | ERROR | stderr | ImportError:
2023-11-18 14:37:48 | ERROR | stderr | XLMRobertaConverter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
2023-11-18 14:37:48 | ERROR | stderr | installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones

# 安装解决 
apt update && apt install libprotobuf-dev protobuf-compiler
pip3 install google protocol
1
2
3
4
5
6
7

使用命令行 debug ：

python3 -m fastchat.serve.cli --debug  --model /data/models/bge-large-zh
可以解析 embedding 模型：
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 1024, padding_idx=0)
    (position_embeddings): Embedding(512, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
1
2
3
4
5
6
7
8
9
10


# 模型比较小不需要 8bit 量化了：
python3 -m fastchat.serve.model_worker --model-names bge-zh --model-path /data/models/bge-large-zh --controller-address http://172.17.0.2:21001 --worker-address http://172.17.0.2:8080 --host 0.0.0.0 --port 8080 
1
2
3

测试接口：

curl http://localhost:8000/v1/embeddings \
 -H "Content-Type: application/json" \
 -d '{
  "input": "Your text string goes here",
  "model": "bge-zh"
}'
1
2
3
4
5
6

可以支持 embeddings 接口，返回向量数组了：

在这里插入图片描述
返回向量维度 1024 ，效果比 m3e 要好些：

5，FastChat的框架非常不错，可以支持模型，进行chat，和embedding

经过实验，可以发现使用fastchat可以成功部署ChatGLM3 进行对话。
可以成功部署 bge-zh 模型进行 embedding 的向量化，还是非常方便的。

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/知新_RL/article/detail/73737