
[Dify Knowledge Base] (10): Modifying Dify 0.4.9 to support MySQL, successfully connecting Qwen-7B-Chat-Int4 for chat, launched locally with FastChat using about 6 GB of VRAM; both the LLM and the embedding model can be configured successfully


0. Video link

https://www.bilibili.com/video/BV1ia4y1y7VM/?vd_source=4b290247452adda4e56d84b659b0c8a2

[Dify Knowledge Base] (10): Dify 0.4.9 with MySQL support, successfully connecting Qwen-7B-Chat-Int4 for chat, launched locally with FastChat using about 6 GB of VRAM

Project repository:
https://gitee.com/fly-llm/dify-mysql-llm

1. About the Qwen-7B model

Model page: https://www.modelscope.cn/models/qwen/Qwen-7B-Chat/summary
Introduction
**Tongyi Qianwen-7B (Qwen-7B)** is the 7-billion-parameter model in the Tongyi Qianwen series developed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a very large and diverse corpus, including large amounts of web text, books, and code. On top of Qwen-7B, an alignment process was used to build Qwen-7B-Chat, an AI assistant based on the base model. Compared with the originally open-sourced Qwen-7B, both the pretrained and chat models have since been updated to better-performing versions; the repository linked above hosts Qwen-7B-Chat.

After the download completes:
18G	Qwen-7B-Chat
Git then unpacks (checks out) the files, and the directory grows to:
29G     Qwen-7B-Chat
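
For reference, a minimal sketch of pulling the weights from ModelScope with git-lfs (the exact download method is not shown in this post, and the repository path below is assumed from the model page above):

# assumes git-lfs is installed
git lfs install
git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git
# for the quantized build used later in this post:
# git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int4.git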

2. Downloading the model: it would not run in 11 GB of VRAM, so the Int4 version was chosen

2024-01-20 02:03:36 | INFO | stdout | Loading GPTQ quantized model...
2024-01-20 02:03:36 | INFO | stdout | Error: Failed to load GPTQ-for-LLaMa. No module named 'llama'
2024-01-20 02:03:36 | INFO | stdout | See https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md

2024-01-20 02:01:09 | ERROR | stderr | [--worker-address WORKER_ADDRESS]
2024-01-20 02:01:09 | ERROR | stderr | [--controller-address CONTROLLER_ADDRESS]
2024-01-20 02:01:09 | ERROR | stderr | [--model-path MODEL_PATH] [--revision REVISION]
2024-01-20 02:01:09 | ERROR | stderr | [--device {cpu,cuda,mps,xpu,npu}] [--gpus GPUS]
2024-01-20 02:01:09 | ERROR | stderr | [--num-gpus NUM_GPUS] [--max-gpu-memory MAX_GPU_MEMORY]
2024-01-20 02:01:09 | ERROR | stderr | [--dtype {float32,float16,bfloat16}] [--load-8bit]
2024-01-20 02:01:09 | ERROR | stderr | [--cpu-offloading] [--gptq-ckpt GPTQ_CKPT]
2024-01-20 02:01:09 | ERROR | stderr | [--gptq-wbits {2,3,4,8,16}]
2024-01-20 02:01:09 | ERROR | stderr | [--gptq-groupsize GPTQ_GROUPSIZE] [--gptq-act-order]
2024-01-20 02:01:09 | ERROR | stderr | [--awq-ckpt AWQ_CKPT] [--awq-wbits {4,16}]
2024-01-20 02:01:09 | ERROR | stderr | [--awq-groupsize AWQ_GROUPSIZE] [--enable-exllama]
2024-01-20 02:01:09 | ERROR | stderr | [--exllama-max-seq-len EXLLAMA_MAX_SEQ_LEN]
2024-01-20 02:01:09 | ERROR | stderr | [--exllama-gpu-split EXLLAMA_GPU_SPLIT]
2024-01-20 02:01:09 | ERROR | stderr | [--exllama-cache-8bit] [--enable-xft]
2024-01-20 02:01:09 | ERROR | stderr | [--xft-max-seq-len XFT_MAX_SEQ_LEN]
2024-01-20 02:01:09 | ERROR | stderr | [--xft-dtype {fp16,bf16,int8,bf16_fp16,bf16_int8}]
2024-01-20 02:01:09 | ERROR | stderr | [--model-names MODEL_NAMES]
2024-01-20 02:01:09 | ERROR | stderr | [--conv-template CONV_TEMPLATE] [--embed-in-truncate]
2024-01-20 02:01:09 | ERROR | stderr | [--limit-worker-concurrency LIMIT_WORKER_CONCURRENCY]
2024-01-20 02:01:09 | ERROR | stderr | [--stream-interval STREAM_INTERVAL] [--no-register]
2024-01-20 02:01:09 | ERROR | stderr | [--seed SEED] [--debug DEBUG] [--ssl]

2024-01-20 02:05:38 | INFO | stdout | Loading AWQ quantized model...
2024-01-20 02:05:38 | INFO | stdout | Error: Failed to import tinychat. No module named 'tinychat'
2024-01-20 02:05:38 | INFO | stdout | Please double check if you have successfully installed AWQ
2024-01-20 02:05:38 | INFO | stdout | See https://github.com/lm-sys/FastChat/blob/main/docs/awq.md

You also need to install einops, transformers_stream_generator, auto-gptq, and optimum before the worker will start successfully.
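
A minimal install sketch (versions unpinned; pick ones matching your CUDA/PyTorch setup):

pip install einops transformers_stream_generator auto-gptq optimum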

3. Successful startup

As described in https://github.com/QwenLM/Qwen/issues/385, the quantization config needs the following flag:

"disable_exllama": true,

Run a shell command to patch config.json:

sed -i 's/"gptq"/"gptq","disable_exllama": true/g' Qwen-7B-Chat-Int4/config.json
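
You can confirm the flag landed in the file, for example:

# should print the quantization_config line that now contains "gptq","disable_exllama": true
grep -n 'disable_exllama' Qwen-7B-Chat-Int4/config.json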

After that, the startup finally succeeds:

2024-01-20 03:15:08 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=8001, worker_address='http://fastchat-worker-llm:8001', controller_address='http://fastchat-controller:21001', model_path='/data/models/Qwen-7B-Chat-Int4', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, dtype=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, enable_exllama=False, exllama_max_seq_len=4096, exllama_gpu_split=None, exllama_cache_8bit=False, enable_xft=False, xft_max_seq_len=4096, xft_dtype=None, model_names=['Qwen-7B-Chat', 'gpt-3.5-turbo-0613', 'gpt-3.5-turbo', 'gpt-3.5-turbo-instruct', 'gpt-35-turbo'], conv_template=None, embed_in_truncate=False, limit_worker_concurrency=5, stream_interval=2, no_register=False, seed=None, debug=False, ssl=False)
2024-01-20 03:15:08 | INFO | model_worker | Loading the model ['Qwen-7B-Chat', 'gpt-3.5-turbo-0613', 'gpt-3.5-turbo', 'gpt-3.5-turbo-instruct', 'gpt-35-turbo'] on worker 7fc2703b ...
2024-01-20 03:15:08 | INFO | datasets | PyTorch version 2.1.0 available.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.
2024-01-20 03:15:09 | WARNING | transformers_modules.Qwen-7B-Chat-Int4.modeling_qwen | Your device does NOT support faster inference with fp16, please switch to fp32 which is likely to be faster
2024-01-20 03:15:09 | WARNING | transformers_modules.Qwen-7B-Chat-Int4.modeling_qwen | Try importing flash-attention for faster inference...
2024-01-20 03:15:09 | WARNING | transformers_modules.Qwen-7B-Chat-Int4.modeling_qwen | Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
2024-01-20 03:15:09 | WARNING | transformers_modules.Qwen-7B-Chat-Int4.modeling_qwen | Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
2024-01-20 03:15:09 | WARNING | transformers_modules.Qwen-7B-Chat-Int4.modeling_qwen | Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:00<00:00,  9.35it/s]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:00<00:00,  8.40it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  8.89it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  8.84it/s]
2024-01-20 03:15:10 | ERROR | stderr | 
2024-01-20 03:15:29 | INFO | model_worker | Register to controller
2024-01-20 03:15:29 | ERROR | stderr | INFO:     Started server process [1]
2024-01-20 03:15:29 | ERROR | stderr | INFO:     Waiting for application startup.
2024-01-20 03:15:29 | ERROR | stderr | INFO:     Application startup complete.
2024-01-20 03:15:29 | ERROR | stderr | INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
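
For reference, a worker launch consistent with the arguments in the log above would look roughly like this (the fastchat-controller and fastchat-worker-llm hostnames are taken from the log and assumed to be docker-compose service names in this setup):

python -m fastchat.serve.model_worker \
    --model-path /data/models/Qwen-7B-Chat-Int4 \
    --model-names "Qwen-7B-Chat,gpt-3.5-turbo-0613,gpt-3.5-turbo,gpt-3.5-turbo-instruct,gpt-35-turbo" \
    --host 0.0.0.0 --port 8001 \
    --worker-address http://fastchat-worker-llm:8001 \
    --controller-address http://fastchat-controller:21001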


Then test the API:

curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json" -d '{
     "model": "gpt-35-turbo",
     "messages": [{"role": "user", "content": "你是谁"}],
     "temperature": 0.7
}'
# JSON response:
{"id":"chatcmpl-AFGZMvdyKCGLwRNmtgWWeu","object":"chat.completion","created":1705761576,"model":"gpt-35-turbo","choices":[{"index":0,"message":{"role":"assistant","content":"我是通义千问,由阿里云开发的AI预训练模型。我被设计成可以回答各种问题、提供信息和与用户进行对话的模型。我可以帮助您查找各种问题的答案,提供定义、解释和建议,将文本从一种语言翻译成另一种语言,总结文本,生成文本,写故事,分析情绪,提供建议,开发算法,编写代码等。如果您有任何问题,请随时告诉我,我会尽力帮助您。"},"finish_reason":"stop"}],"usage":{"prompt_tokens":20,"total_tokens":116,"completion_tokens":96}}


4. Start Dify 0.4.9 and add an OpenAI-API-compatible provider

You can then create an application and configure the model.


Note in particular that this is a locally hosted model, so the server address is your own machine.
Do not append the /v1/chat/completions path to the URL; it is added automatically.

The embedding model can also be added successfully.
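
Before configuring it in Dify, the /v1/embeddings endpoint of the OpenAI-compatible server can be sanity-checked the same way; a sketch, assuming the same model alias is used for embeddings in this setup:

curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
     "model": "Qwen-7B-Chat",
     "input": "hello world"
}'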

After that you can chat with the model and configure the vector store.

5. Chatting and using the vector store


The vector index can then be configured and used successfully.


6. Summary

With this upgrade to the Dify project, and by configuring an OpenAI-API-compatible provider, both the chat model and the embedding API are supported, which is enough to build your own knowledge-base system.
In practice the 7B model still feels a bit small for chat; a follow-up could look into the 14B model.
