
NetEase Youdao QAnything Installation and Deployment Walkthrough


1. What is QAnything?

1.1 QAnything

QAnything (Question and Answer based on Anything) is a local knowledge-base question-answering system dedicated to supporting files of any format as well as databases. It can be installed and used without a network connection.

Throw local files of any format into it and you get an accurate, fast, and reliable question-answering experience.

Currently supported formats: PDF (pdf), Word (docx), PPT (pptx), Excel (xlsx), Markdown (md), email (eml), TXT (txt), images (jpg, jpeg, png), CSV (csv), and web links (html). More formats are on the way...

1.2 Features

  • Data security: the system can be installed and used with the network cable unplugged for the entire process.
  • Cross-language QA: switch freely between Chinese and English questions, regardless of the language of your documents.
  • Massive-data QA: two-stage retrieval with reranking solves the accuracy-degradation problem of large-scale retrieval; the more data, the better the results.
  • High-performance, production-grade system that can be deployed directly in enterprise applications.
  • Ease of use: no tedious configuration, one-click installation and deployment, ready to use out of the box.
  • Supports QA over multiple selected knowledge bases.

1.3 Architecture

(Figure: qanything_system, the QAnything system architecture)
1.3.1 Why two-stage retrieval?

In scenarios with a large amount of knowledge-base data, the advantage of the two-stage approach is very clear. With single-stage embedding retrieval alone, accuracy degrades as the data volume grows, as shown by the green line in the figure below. After the second-stage rerank, accuracy instead grows steadily: the more data, the better the results.

(Figure: two-stage retrieval)

BCEmbedding, the retrieval component used by QAnything, has very strong bilingual and cross-lingual capabilities. It eliminates the Chinese-English gap in semantic retrieval, which yields the results below:

1.3.2 First-stage retrieval (embedding)
| Model | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bge-base-en-v1.5 | 37.14 | 55.06 | 75.45 | 59.73 | 43.05 | 37.74 | 47.20 |
| bge-base-zh-v1.5 | 47.60 | 63.72 | 77.40 | 63.38 | 54.85 | 32.56 | 53.60 |
| bge-large-en-v1.5 | 37.15 | 54.09 | 75.00 | 59.24 | 42.68 | 37.32 | 46.82 |
| bge-large-zh-v1.5 | 47.54 | 64.73 | 79.14 | 64.19 | 55.88 | 33.26 | 54.21 |
| jina-embeddings-v2-base-en | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
| m3e-base | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
| m3e-large | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
| bce-embedding-base_v1 | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |
1.3.3 Second-stage retrieval (rerank)
| Model | Reranking | Avg. |
| --- | --- | --- |
| bge-reranker-base | 57.78 | 57.78 |
| bge-reranker-large | 59.69 | 59.69 |
| bce-reranker-base_v1 | 60.06 | 60.06 |

2. Getting Started

Try QAnything online.

2.1 Prerequisites

2.1.1 For Linux
| System | Required item | Minimum Requirement | Note |
| --- | --- | --- | --- |
| Linux amd64 | NVIDIA GPU Memory | >= 4GB (use OpenAI API) | Minimum: GTX 1050Ti (use OpenAI API); Recommended: RTX 3090 |
| Linux amd64 | NVIDIA Driver Version | >= 525.105.17 | |
| Linux amd64 | Docker version | >= 20.10.5 | Docker install |
| Linux amd64 | docker compose version | >= 2.23.3 | docker compose install |
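
Before moving on, it may help to verify the host against the table above. A minimal sanity check using the standard NVIDIA and Docker CLI commands (the version thresholds are the ones listed in the table):

```bash
# Check that the GPU is visible and the NVIDIA driver version is >= 525.105.17
nvidia-smi

# Check Docker (>= 20.10.5) and docker compose (>= 2.23.3) versions
docker --version
docker compose version
```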
2.1.2 For Windows with WSL Ubuntu subsystem
| System | Required item | Minimum Requirement | Note |
| --- | --- | --- | --- |
| Windows with WSL Ubuntu subsystem | NVIDIA GPU Memory | >= 4GB (use OpenAI API) | Minimum: GTX 1050Ti (use OpenAI API); Recommended: RTX 3090 |
| Windows with WSL Ubuntu subsystem | GEFORCE EXPERIENCE | >= 546.33 | GEFORCE EXPERIENCE download |
| Windows with WSL Ubuntu subsystem | Docker Desktop | >= 4.26.1 (131620) | Docker Desktop for Windows |
| Windows with WSL Ubuntu subsystem | git-lfs | | git-lfs install |
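
On Windows, the GPU additionally has to be visible from inside the WSL Ubuntu distribution, and git-lfs is needed to pull model weights. A hedged sketch of those checks, assuming an Ubuntu distribution is already set up under WSL 2:

```bash
# Run inside the WSL Ubuntu shell: with a recent Windows NVIDIA driver,
# the GPU is passed through to WSL 2 and nvidia-smi should list it here too
nvidia-smi

# Install and initialize git-lfs (needed for cloning model repositories)
sudo apt-get update && sudo apt-get install -y git-lfs
git lfs install
```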

2.2 Download and Installation

2.2.1 Download the project

Download the source code:

git clone https://github.com/netease-youdao/QAnything.git

Get the Embedding models:

git clone https://www.modelscope.cn/netease-youdao/QAnything.git
  • Download the required Embedding models from Youdao's model repository.
  • Unzip the downloaded model files to get a folder named "models", which contains the required embedding models.
  • Place the unzipped "models" folder in the root directory of QAnything (see the sketch below).
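
Putting those three steps together, a minimal sketch is shown below, assuming git-lfs is available for pulling the large weight files. The exact layout of the ModelScope repository is not guaranteed here: "models.zip" and the temporary directory name are assumptions, so adjust the unzip/move step to whatever the cloned repository actually contains.

```bash
cd /path/to/QAnything
git lfs install

# Clone the model repository from ModelScope into a temporary directory (name is arbitrary)
git clone https://www.modelscope.cn/netease-youdao/QAnything.git qanything_models_tmp

# Assumption: the repository ships the embedding models either as an archive or as a ready-made
# "models" folder; the goal is simply that /path/to/QAnything/models exists afterwards
unzip qanything_models_tmp/models.zip -d .    # if the repository contains models.zip
# mv qanything_models_tmp/models .            # or, if it already contains a models/ folder
```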

Download a large language model:

  • The "Tongyi Qianwen" (Qwen) series of large language models is recommended.
  • Download the required large language model and store it in QAnything's "assets/custom_models/" folder.

MiniChat-2-3B

git clone https://www.modelscope.cn/netease-youdao/MiniChat-2-3B.git

Qwen-7B-QAnything

git clone https://www.modelscope.cn/netease-youdao/Qwen-7B-QAnything.git
2.2.2 Usage of the QAnything service startup command

Usage:

  1. bash run.sh [-c <llm_api>] [-i <device_id>] [-b <runtime_backend>] [-m <model_name>] [-t <conv_template>] [-p <tensor_parallel>] [-r <gpu_memory_utilization>] [-h]
  2. -c <llm_api>: LLM API mode, one of {local, cloud}; defaults to 'local'. If you set '-c cloud', first set the environment variables {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} in the .env file (see the sketch after this list).
  3. -i <device_id>: GPU device ID(s) to use.
  4. -b <runtime_backend>: LLM inference runtime backend, one of {default, hf, vllm}.
  5. -m <model_name>: name of the public LLM model to load and serve through the FastChat serve API, e.g. {Qwen-7B-Chat, deepseek-llm-7b-chat, ...}.
  6. -t <conv_template>: conversation template to use with the public LLM model, e.g. {qwen-7b-chat, deepseek-chat, ...}.
  7. -p <tensor_parallel>: tensor-parallel size for the vllm backend, one of {1, 2}; defaults to 1.
  8. -r <gpu_memory_utilization>: gpu_memory_utilization parameter for the vllm backend, in (0, 1]; defaults to 0.81.
  9. -h: show this help message.
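
For '-c cloud' (option 2 above), the four variables go into the .env file before the service is started. A minimal sketch; the values below are placeholders, not defaults, and must be replaced with your own key, endpoint, model name, and context length:

```bash
# .env (placeholder values for illustration only)
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_MODEL_NAME=gpt-3.5-turbo
OPENAI_API_CONTEXT_LENGTH=4096
```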
Example service startup commands:

| Service startup command | GPUs | LLM Runtime Backend | LLM model |
| --- | --- | --- | --- |
| bash ./run.sh -c cloud -i 0 -b default | 1 | OpenAI API | OpenAI API |
| bash ./run.sh -c local -i 0 -b default | 1 | FasterTransformer | Qwen-7B-QAnything |
| bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat | 1 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81 | 1 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0,1 -b default | 2 | FasterTransformer | Qwen-7B-QAnything |
| bash ./run.sh -c local -i 0,1 -b hf -m MiniChat-2-3B -t minichat | 2 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81 | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.81 | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
  1. Note: choose the startup command that best suits your hardware.
  2. (1) When "-i 0,1" is set, local embedding/rerank runs on device gpu_id_1; otherwise gpu_id_0 is used by default.
  3. (2) When "-c cloud" is set, local Embedding/Rerank plus the OpenAI LLM API are used, which needs only about 4GB of VRAM (suitable for GPU devices with VRAM <= 8GB).
  4. (3) When using the OpenAI LLM API, you will be prompted to enter {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} right away.
  5. (4) "-b hf" is the most recommended way to run public LLM inference, but its performance is lower.
  6. (5) When choosing a public Chat LLM for the QAnything system, consider a more suitable PROMPT_TEMPLATE setting to account for the differences between LLM models.
  7. (6) The list of public LLMs supported through the FastChat API with the Huggingface Transformers/vllm backends is in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py".

Public LLMs supported with the FastChat API and the Huggingface Transformers/vllm runtime backends

| model_name | conv_template | Supported Public LLM List |
| --- | --- | --- |
| Qwen-7B-QAnything | qwen-7b-qanything | Qwen-7B-QAnything |
| Qwen-1_8B-Chat/Qwen-7B-Chat/Qwen-14B-Chat | qwen-7b-chat | Qwen |
| Baichuan2-7B-Chat/Baichuan2-13B-Chat | baichuan2-chat | Baichuan2 |
| MiniChat-2-3B | minichat | MiniChat |
| deepseek-llm-7b-chat | deepseek-chat | Deepseek |
| Yi-6B-Chat | Yi-34b-chat | Yi |
| chatglm3-6b | chatglm3 | ChatGLM3 |
| ... | ... | check or add conv_template for more LLMs in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py" |
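
To see which conversation templates are registered locally, or to find the exact name to pass with -t for a new model, you can search the bundled FastChat registry referenced in the table above. A small sketch; the "minichat" string is only an example template name:

```bash
# List all registered conversation template names in the bundled FastChat copy
grep -n 'name="' /path/to/QAnything/third_party/FastChat/fastchat/conversation.py

# Or look for one specific template, e.g. the one used for MiniChat-2-3B
grep -n 'name="minichat"' /path/to/QAnything/third_party/FastChat/fastchat/conversation.py
```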
2.2.3 Service startup command examples:
1. Run QAnything with the FastChat API using the Huggingface transformers runtime on a single GPU (recommended for GPU devices with VRAM <= 16GB).

1.1 Run Qwen-7B-QAnything

## Step 1. Download the public LLM model (e.g., Qwen-7B-QAnything) and save to "/path/to/QAnything/assets/custom_models"
## (Optional) Download Qwen-7B-QAnything from ModelScope: https://www.modelscope.cn/models/netease-youdao/Qwen-7B-QAnything
## (Optional) Download Qwen-7B-QAnything from Huggingface: https://huggingface.co/netease-youdao/Qwen-7B-QAnything
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything

## Step 2. Execute the service startup command. Here we use "-b hf" to specify the Huggingface transformers backend,
## which loads the model in 8 bits but does bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m Qwen-7B-QAnything -t qwen-7b-qanything

1.2 Run a public LLM model (e.g., MiniChat-2-3B)

## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command. Here we use "-b hf" to specify the Huggingface transformers backend,
## which loads the model in 8 bits but does bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat
2. Run QAnything with the FastChat API using the vllm runtime backend on a single GPU.

2.1 Run Qwen-7B-QAnything

## Step 1. Download the public LLM model (e.g., Qwen-7B-QAnything) and save to "/path/to/QAnything/assets/custom_models"
## (Optional) Download Qwen-7B-QAnything from ModelScope: https://www.modelscope.cn/models/netease-youdao/Qwen-7B-QAnything
## (Optional) Download Qwen-7B-QAnything from Huggingface: https://huggingface.co/netease-youdao/Qwen-7B-QAnything
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything

## Step 2. Execute the service startup command. Here we use "-b vllm" to specify the vllm backend, which does bf16 inference by default.
## Note: adjust gpu_memory_utilization yourself according to the model size to avoid running out of memory
## (gpu_memory_utilization=0.81 is the default for 7B models; here it is set to 0.85 with "-r 0.85").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m Qwen-7B-QAnything -t qwen-7b-qanything -p 1 -r 0.85

2.2 Run a public LLM model (e.g., MiniChat-2-3B)

## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command. Here we use "-b vllm" to specify the vllm backend, which does bf16 inference by default.
## Note: adjust gpu_memory_utilization yourself according to the model size to avoid running out of memory
## (gpu_memory_utilization=0.81 is the default for 7B models; here it is set to 0.5 with "-r 0.5").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.5

## (Optional) Step 2. Specify the vllm backend with "-i 0,1 -p 2" to run faster inference in tensor-parallel mode on 2 GPUs.
## bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.5
3. Start QAnything with the FastChat API and the vllm backend on multiple GPUs, with the tensor-parallel size set to 2:

cd /path/to/QAnything
bash ./run.sh -c local -i 0,1 -b vllm -m Qwen-7B-QAnything -t qwen-7b-qanything -p 2 -r 0.85

2.3 Start Using QAnything

Front-end page

Once the services start successfully, enter the following address in your browser to try it out.

  • Front-end address: http://your_host:5052/qanything/
API

To access the API, refer to the addresses below:

  • API address: http://your_host:8777/api/
  • For detailed API documentation, please refer to the QAnything API documentation; a quick curl example follows below.
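
As a quick command-line illustration of calling the service: the endpoint path and JSON fields below are assumptions for illustration only; take the authoritative paths and payloads from the API documentation linked above.

```bash
# Illustrative only: the endpoint path and payload fields are assumptions,
# check the QAnything API documentation for the authoritative interface
curl -X POST "http://your_host:8777/api/local_doc_qa/local_doc_chat" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "your_user_id", "kb_ids": ["your_kb_id"], "question": "What is QAnything?"}'
```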
DEBUG

To view the relevant logs, check the log files under the QAnything/logs/debug_logs directory (see the example after this list):

  • debug.log: user request processing log
  • sanic_api.log: backend service runtime log
  • llm_embed_rerank_tritonserver.log (single-GPU deployment): startup log of the tritonserver serving the LLM, embedding and rerank models
  • llm_tritonserver.log (multi-GPU deployment): LLM tritonserver startup log
  • embed_rerank_tritonserver.log (multi-GPU deployment, or when using the OpenAI API): embedding and rerank tritonserver startup log
  • rerank_server.log: rerank service runtime log
  • ocr_server.log: OCR service runtime log
  • npm_server.log: front-end service runtime log
  • llm_server_entrypoint.log: LLM relay service runtime log
  • fastchat_logs/*.log: FastChat service runtime logs
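
To follow one of these logs while reproducing an issue, for example:

```bash
# Stream the user-request log and the backend service log while reproducing a problem
tail -f QAnything/logs/debug_logs/debug.log QAnything/logs/debug_logs/sanic_api.log
```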
Shut down the services:

bash close.sh

2.4 Offline deployment on Linux

# First, pull the docker images on a machine with internet access
docker pull quay.io/coreos/etcd:v3.5.5
docker pull minio/minio:RELEASE.2023-03-20T20-16-18Z
docker pull milvusdb/milvus:v2.3.4
docker pull mysql:latest
docker pull freeren/qanything:v1.2.1

# Save the images into a tarball
docker save quay.io/coreos/etcd:v3.5.5 minio/minio:RELEASE.2023-03-20T20-16-18Z milvusdb/milvus:v2.3.4 mysql:latest freeren/qanything:v1.2.1 -o qanything_offline.tar

# Download the QAnything source code
wget https://github.com/netease-youdao/QAnything/archive/refs/heads/master.zip

# Copy the image tarball qanything_offline.tar and the code QAnything-master.zip to the offline machine
cp QAnything-master.zip qanything_offline.tar /path/to/your/offline/machine

# Load the images on the offline machine
docker load -i qanything_offline.tar

# Unzip the code and run it
unzip QAnything-master.zip
cd QAnything-master
bash run.sh
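
After loading the tarball on the offline machine, it may be worth confirming that all five images are present before running run.sh:

```bash
# Sanity check: the five images saved above should now all be listed locally
docker images | grep -E "etcd|minio|milvus|mysql|qanything"
```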
References

  • QAnything GitHub: https://github.com/netease-youdao/QAnything
  • QAnything Gitee: QAnything (Question and Answer based on Anything), a local knowledge-base question-answering system that supports files of any format and can be installed and used offline
  • Qwen GitHub (QwenLM/Qwen): the official repo of Qwen (Tongyi Qianwen), the chat and pretrained large language model proposed by Alibaba Cloud
