
Accelerating Qwen (Tongyi Qianwen) Inference with vLLM: Lessons Learned

1. Introduction

1.1. Overview

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

    • State-of-the-art serving throughput
    • Efficient management of attention key and value memory with PagedAttention
    • Continuous batching of incoming requests
    • Optimized CUDA kernels

vLLM is flexible and easy to use with:

    • Seamless integration with popular Hugging Face models
    • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
    • Tensor parallelism support for distributed inference
    • Streaming outputs
    • OpenAI-compatible API server

1.2. GitHub Projects

Original project: https://github.com/vllm-project/vllm (supports AWQ quantization; GPTQ quantization not yet supported)

Extension project: https://github.com/chu-tianxiang/vllm-gptq (adds GPTQ quantization support)

2. Architecture

Official documentation: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog

Chinese write-up: vLLM框架原理——PagedAttention - 知乎 (Zhihu)

2.1. Official Benchmarks

(The benchmark charts are in the vLLM blog post linked above, which reports up to 24x higher serving throughput than HuggingFace Transformers and up to 3.5x higher than HuggingFace Text Generation Inference.)

2.2. Key Features

2.2.1. PagedAttention

Problem: because of fragmentation and over-reservation, existing systems waste 60%-80% of KV cache memory.

Solution: split each sequence's tokens into fixed-size blocks and map these logical blocks onto physical memory, so blocks need not be contiguous and are filled on demand.

Highlights: efficient memory utilization and sharing.

[Figure: efficient memory utilization (physical blocks on the x-axis, filled slots on the y-axis)]

[Figure: efficient memory sharing across sequences]
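To make the block mapping concrete, below is a minimal illustrative Python sketch of a paged KV cache block table. This is not vLLM's actual implementation; the names (`PagedKVCache`, `BLOCK_SIZE`) are invented for illustration, and the block size is shrunk for readability (vLLM's default is 16 tokens per block):

    # Illustrative sketch of PagedAttention-style block mapping (not vLLM's real code).
    BLOCK_SIZE = 4  # tokens per block; vLLM's default block size is 16

    class PagedKVCache:
        def __init__(self, num_physical_blocks):
            self.free_blocks = list(range(num_physical_blocks))  # pool of physical blocks
            self.ref_counts = [0] * num_physical_blocks          # enables sharing / copy-on-write
            self.block_tables = {}                               # seq_id -> list of physical block ids

        def append_token(self, seq_id, num_tokens_so_far):
            """Allocate a new physical block only when the last block is full."""
            table = self.block_tables.setdefault(seq_id, [])
            if num_tokens_so_far % BLOCK_SIZE == 0:  # last block is full, or sequence is new
                block = self.free_blocks.pop()
                self.ref_counts[block] += 1
                table.append(block)
            return table[-1]  # physical block that stores the new token's KV entries

        def fork(self, parent_id, child_id):
            """Share the parent's blocks with a child sequence (e.g. parallel sampling)."""
            self.block_tables[child_id] = list(self.block_tables[parent_id])
            for block in self.block_tables[parent_id]:
                self.ref_counts[block] += 1  # a write to a shared block triggers copy-on-write

Because blocks need not be contiguous, waste is bounded by one partially filled block per sequence, and forked sequences share the prompt's blocks until they diverge.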

3. Implementation

To use GPTQ quantization, install from the vllm-gptq fork: https://github.com/chu-tianxiang/vllm-gptq

    # run from inside a clone of the vllm-gptq repository
    pip install -e .

3.1. Offline Mode

Install the package first (for GPTQ support, install the vllm-gptq fork from source as described above instead):

    pip install vllm

Offline inference example:

    from vllm import LLM

    # Sample prompts.
    prompts = ["Hello, my name is", "The capital of France is"]

    # Create an LLM from a local Qwen GPTQ checkpoint.
    llm = LLM(
        model="/data/jupyter/LLM/models/Qwen-14B-Chat-Int4-hf",
        trust_remote_code=True,
        quantization="gptq",
        gpu_memory_utilization=0.5,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts)
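As a short follow-up, decoding behavior can be controlled with vLLM's public `SamplingParams` API, and the generated text read from the returned objects. The parameter values below are illustrative, not recommendations:

    from vllm import SamplingParams

    # Illustrative decoding settings.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}")
        print(f"Generated: {output.outputs[0].text!r}")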

3.2. Serving Mode

    # Specify the model name or model path
    CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server \
        --model /data/jupyter/LLM/models/Qwen-14B-Chat-Int4-hf/ \
        --trust-remote-code \
        --port 30003 \
        --gpu-memory-utilization 0.5 \
        --tensor-parallel-size 1 \
        --served-model-name Qwen/Qwen-14B-Chat-Int4-hf \
        --quantization gptq
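Once the server is up, you can verify which model it is serving via the standard OpenAI-compatible /v1/models endpoint. A minimal sketch using the third-party `requests` package (the port matches the launch command above):

    import requests

    # List the models served by the local vLLM OpenAI-compatible server.
    resp = requests.get("http://localhost:30003/v1/models")
    print(resp.json())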
    # curl request (the server launched above listens on port 30003)
    curl http://localhost:30003/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen-14B-Chat-Int4-hf",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'
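Because the server speaks the OpenAI API, the same completion request can also be issued with the official `openai` Python package. A minimal sketch assuming openai >= 1.0 (the API key argument is required by the client but ignored by vLLM):

    from openai import OpenAI

    # Point the client at the local vLLM server.
    client = OpenAI(base_url="http://localhost:30003/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="Qwen/Qwen-14B-Chat-Int4-hf",
        prompt="San Francisco is a",
        max_tokens=7,
        temperature=0,
    )
    print(completion.choices[0].text)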
