GGUF Quantization with llama.cpp and Deployment with llama-cpp-python

Author: 小蓝xlanll | 2024-05-04 04:16:37


Preface: I ran into a few pitfalls while doing GGUF quantization and the subsequent deployment, so I am recording them here.

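1. Quantization

Project: llama.cpp (https://github.com/ggerganov/llama.cpp)

1.1 Environment setup

I had previously built a Docker image for LLM-related work, and this time I built on top of it again. Here is the Dockerfile: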

FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# requirements
ADD source.list /etc/apt/sources.list
RUN apt-get update && apt-get install -y python3.10 python3-pip python3.10-dev vim git
# torch
COPY torch-2.2.0+cu121-cp310-cp310-linux_x86_64.whl torch-2.2.0+cu121-cp310-cp310-linux_x86_64.whl
RUN pip3 install torch-2.2.0+cu121-cp310-cp310-linux_x86_64.whl
# llama factory requirements
RUN pip3 install transformers==4.38.2 datasets==2.16.1 accelerate==0.27.2 peft==0.10.0 trl==0.7.11 gradio==3.50.2 \
    deepspeed==0.13.1 modelscope ipython scipy einops sentencepiece protobuf jieba rouge-chinese nltk sse-starlette  \
    matplotlib pandas numpy tqdm tensor_parallel scikit-learn \
    --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple
# FlashAttention
RUN pip install ninja -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install packaging -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install flash-attn --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple
# gptq
RUN pip install auto-gptq --no-build-isolation
# awq
RUN pip install autoawq
# llama.cpp
RUN apt-get install -y cmake
RUN git clone https://github.com/ggerganov/llama.cpp
RUN pip install gguf -i https://pypi.tuna.tsinghua.edu.cn/simple
WORKDIR /llama.cpp
RUN mkdir build
WORKDIR /llama.cpp/build
RUN cmake .. -DLLAMA_CUDA=ON
RUN cmake --build . --config Release
# python build
RUN CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

The image compiles llama.cpp at build time, so a container started from it is ready to use.

# Build the image
sudo docker build -t llm:v1.0 .

The following script starts a container from the image.

docker run \
  -it \
  --rm \
  --name quantization \
  --network=host \
  --shm-size 32G \
  --gpus "device=0" \
  -v /home/[yourname]/.cache/huggingface/hub/:/root/.cache/huggingface/hub/ \
  -v /home/[yourname]/.cache/huggingface/datasets/:/root/.cache/huggingface/datasets/ \
  -w /llama.cpp/ \
  llm:v1.0

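Running the script drops you straight into the environment.

1.2 Quantization

Quantization takes two steps:

Step 1: Convert the original model to a GGUF model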

python3 convert-hf-to-gguf.py [model_path] --outfile [gguf_file].gguf
# Example: Qwen1.5-7B-Chat
# Note: the model path is the default transformers cache directory, mounted into the docker container
python3 convert-hf-to-gguf.py /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-7B-Chat/snapshots/294483ad23713036574b30587b186713373f4271/ --outfile Qwen1.5-7B-Chat.gguf
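The snapshot path in the example follows the Hugging Face hub cache layout, `models--{org}--{name}/snapshots/{revision}/`. Below is a minimal sketch of locating that directory programmatically and assembling the conversion command. The helper names `find_snapshot_dir` and `build_convert_command` are hypothetical and not part of llama.cpp, and the command is assumed to be run from the llama.cpp directory.

```python
from pathlib import Path


def find_snapshot_dir(hub_cache: str, repo_id: str) -> Path:
    """Return the most recently modified snapshot directory for repo_id."""
    # The hub cache stores "Qwen/Qwen1.5-7B-Chat" as "models--Qwen--Qwen1.5-7B-Chat".
    repo_dir = Path(hub_cache) / ("models--" + repo_id.replace("/", "--"))
    snap_root = repo_dir / "snapshots"
    if not snap_root.is_dir():
        raise FileNotFoundError(f"no snapshots directory under {repo_dir}")
    snapshots = [p for p in snap_root.iterdir() if p.is_dir()]
    if not snapshots:
        raise FileNotFoundError(f"no snapshots found under {snap_root}")
    # If several revisions are cached, pick the most recently modified one.
    return max(snapshots, key=lambda p: p.stat().st_mtime)


def build_convert_command(snapshot_dir: Path, outfile: str) -> list[str]:
    """Assemble the convert-hf-to-gguf.py call as an argument list."""
    return ["python3", "convert-hf-to-gguf.py", str(snapshot_dir), "--outfile", outfile]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids copying the revision hash by hand.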

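Note: the conversion script also accepts AWQ-quantized models. When quantizing with autoawq, however, pass export_compatible=True so that the weights are not packed: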

...
# Quantize
# NOTE: We avoid packing weights, so you cannot use this model in AutoAWQ
# after quantizing. The saved model is FP16 but has the AWQ scales applied.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    export_compatible=True
)
...

Step 2: Quantize the GGUF model

./build/bin/quantize [gguf_file].gguf [quantized_gguf_file].gguf [quantize_method]
# example Qwen1.5-7b-chat.gguf q4_0
./build/bin/quantize Qwen1.5-7B-Chat.gguf Qwen1.5-7B-Chat-q4_0.gguf q4_0
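To make the `q4_0` method a bit more concrete: it stores weights in blocks of 32, each block holding one fp16 scale plus 32 four-bit codes, which comes to 18 bytes per 32 weights, or 4.5 bits per weight. The sketch below is a simplified pure-Python illustration of that blockwise scheme, not llama.cpp's actual kernel: the scale is kept as a Python float instead of fp16, the codes are not packed two per byte, and the function names are hypothetical.

```python
BLOCK_SIZE = 32  # q4_0 groups weights into blocks of 32


def quantize_block_q4_0(block):
    """Quantize one block of 32 floats to (scale, list of codes in 0..15)."""
    assert len(block) == BLOCK_SIZE
    # Take the element with the largest magnitude, keeping its sign.
    extreme = max(block, key=abs)
    scale = extreme / -8.0  # the extreme value maps to code 0
    inv = 1.0 / scale if scale else 0.0
    # Shift by 8 so codes are unsigned, round to nearest, clamp to 4 bits.
    codes = [min(15, int(x * inv + 8.5)) for x in block]
    return scale, codes


def dequantize_block_q4_0(scale, codes):
    """Reconstruct approximate floats from a scale and its 4-bit codes."""
    return [(q - 8) * scale for q in codes]


def bits_per_weight():
    """Storage cost: 2 bytes of fp16 scale + 16 bytes of packed codes per block."""
    return (2 + BLOCK_SIZE // 2) * 8 / BLOCK_SIZE
```

Each weight is reconstructed to within one quantization step (`abs(scale)`) of its original value, which is the source of the accuracy loss measured in section 3.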

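2. Deployment

Among the HTTP servers listed by llama.cpp, I found a project that lets you call GGUF models cleanly from Python.

Project: llama-cpp-python

The following script runs inference (it can still be run inside the docker container, since llama-cpp-python is already installed by the Dockerfile):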

from llama_cpp import Llama

model = Llama(
    model_path='your_gguf_file.gguf',
    n_gpu_layers=32,  # number of layers to offload to the GPU
    n_ctx=2048,  # context window size in tokens
)

output = model('your_input', max_tokens=32, stop=["Q:", "\n"])
output = output['choices'][0]['text'].strip()

Here is the full form of the output, taken from the llama-cpp-python example:

{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
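Given that structure, a small helper can pull out the fields that usually matter. This is a minimal sketch; `summarize_completion` is a hypothetical name, not part of llama-cpp-python, and it assumes the dictionary has the shape shown in the example output.

```python
def summarize_completion(output: dict) -> dict:
    """Extract the generated text, stop reason and token counts from a completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# A trimmed copy of the example output, used only for illustration.
sample = {
    "choices": [
        {"text": " Mercury, Venus, Earth. ", "index": 0,
         "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}

summary = summarize_completion(sample)
print(summary["text"], summary["finish_reason"], summary["total_tokens"])
```

Checking `finish_reason` is worthwhile with a small `max_tokens`: as far as I know, a value of "length" instead of "stop" means the generation was cut off by the token limit.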

3. Comparison of quantization results

Here the quantized models are compared using chain-of-thought-hub.

Model: qwen1.5-7B-chat
Quantization: 4-bit
GPU: 4060Ti-16G

model	gptq-no-desc-act	gptq-desc-act	awq	gguf	awq-gguf
MMLU	0.5580	0.5912	0.5601	0.5597	0.5466
time	3741.81	3745.25	5181.86	3124.77	3091.46
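To read the table more easily, the differences can be computed relative to one baseline. The sketch below copies the numbers from the table and picks plain gguf as the baseline, which is my own arbitrary choice. The time column is used as given, since its unit is not recorded here.

```python
# (MMLU score, time) per method, copied from the table above.
results = {
    "gptq-no-desc-act": (0.5580, 3741.81),
    "gptq-desc-act": (0.5912, 3745.25),
    "awq": (0.5601, 5181.86),
    "gguf": (0.5597, 3124.77),
    "awq-gguf": (0.5466, 3091.46),
}


def compare(results, baseline):
    """Return {method: (MMLU difference, time ratio)} relative to the baseline."""
    base_score, base_time = results[baseline]
    return {
        name: (round(score - base_score, 4), round(time / base_time, 3))
        for name, (score, time) in results.items()
    }


best_mmlu = max(results, key=lambda k: results[k][0])
fastest = min(results, key=lambda k: results[k][1])

for name, (delta, ratio) in compare(results, "gguf").items():
    print(f"{name:18s} MMLU {delta:+.4f}  time x{ratio:.3f}")
print("highest MMLU:", best_mmlu, "| fastest:", fastest)
```

On these numbers the two gguf variants are the fastest, while gptq-desc-act has the highest MMLU score.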

I have not yet managed to export a GPTQ model to GGUF; I will try again later.

Thanks to the following blog:
https://qwen.readthedocs.io/zh-cn/latest/index.html
