
Troubleshooting a Qwen (Tongyi Qianwen) model error: FlashAttention only supports Ampere GPUs or newer


When calling the Qwen/Qwen-1_8B-Chat model through LangChain, the following error appeared mid-conversation:

ERROR: object of type 'NoneType' has no len()
Traceback (most recent call last):
File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/base.py", line 385, in acall
    raise e
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/base.py", line 379, in acall
    await self._acall(inputs, run_manager=run_manager)
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/llm.py", line 275, in _acall
    response = await self.agenerate([inputs], run_manager=run_manager)
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain/chains/llm.py", line 142, in agenerate
    return await self.llm.agenerate_prompt(
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 506, in agenerate_prompt
    return await self.agenerate(
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 466, in agenerate
    raise exceptions[0]
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 569, in _agenerate_with_cache
    return await self._agenerate(
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_community/chat_models/openai.py", line 519, in _agenerate
    return await agenerate_from_stream(stream_iter)
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 85, in agenerate_from_stream
    async for chunk in stream:
  File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_community/chat_models/openai.py", line 490, in _astream
    if len(chunk["choices"]) == 0:
TypeError: object of type 'NoneType' has no len()
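For context, this is roughly the kind of call that produces the traceback above. It is only a sketch: it assumes the Qwen model is served behind an OpenAI-compatible endpoint (as Langchain-Chatchat does), and the endpoint URL, port, and model name below are placeholders rather than the original configuration:

import asyncio
from langchain_community.chat_models import ChatOpenAI

async def main():
    # streaming=True exercises the async streaming path (_astream) seen in the traceback
    llm = ChatOpenAI(
        model_name="Qwen-1_8B-Chat",                  # hypothetical deployment name
        openai_api_base="http://127.0.0.1:20000/v1",  # placeholder local endpoint
        openai_api_key="EMPTY",                       # local servers usually ignore the key
        streaming=True,
    )
    async for chunk in llm.astream("你好"):
        print(chunk.content, end="", flush=True)

asyncio.run(main())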

This was puzzling: every other LLM ran fine, only Qwen failed.
I searched through a lot of material; opinions varied, and nothing solved it.
So I read the traceback carefully. The last frame says the problem is in File "/root/anaconda3/envs/chatchat/lib/python3.10/site-packages/langchain_community/chat_models/openai.py", line 490, so let's open that file and look at the source around line 490:

if not isinstance(chunk, dict):
   chunk = chunk.dict()
if len(chunk["choices"]) == 0:
   continue
choice = chunk["choices"][0]

The error is most likely caused by this chunk having nothing usable in choices.
Let's print the chunk and see what it actually contains, so modify the code in this file to:

if not isinstance(chunk, dict):
   chunk = chunk.dict()
print(f'chunk:{chunk}')
if len(chunk["choices"]) == 0:
   continue
choice = chunk["choices"][0]

Running it again, the printed chunk was:

chunk:{'id': None, 'choices': None, 'created': None, 'model': None, 'object': None, 'system_fingerprint': None, 'text': '**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**\n\n(FlashAttention only supports Ampere GPUs or newer.)', 'error_code': 50001}

Finally, the real error message: NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE (FlashAttention only supports Ampere GPUs or newer).
So the real problem lies with flash-attention.
Let's check the Qwen installation notes on Hugging Face:

Dependencies
To run Qwen-1.8B-Chat, make sure the above requirements are met, then run the following pip command to install the dependencies:
pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed

In addition, installing the flash-attention library is recommended (flash attention 2 is now supported) for higher efficiency and lower GPU memory usage.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The installs below are optional and may be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary

Per the docs, flash-attention was already installed, so the problem shouldn't be with the installation itself.
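It's also easy to double-check that flash-attn really is importable in this environment (a quick check of my own, not from the Qwen docs):

import flash_attn
print(flash_attn.__version__)  # confirms the package imports and shows its version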
A QwenLM issue suggests uninstalling flash-attention: https://github.com/QwenLM/Qwen/issues/438
Then I found an explanation of this problem in the Hugging Face community: https://huggingface.co/Qwen/Qwen-7B-Chat/discussions/37:

flash attention is an optional component for accelerating model training and inference, and it only works on NVIDIA GPUs with the Turing, Ampere, Ada, or Hopper architectures (e.g. H100, A100, RTX 3090, T4, RTX 2080); you can run model inference normally without installing flash attention.

Checking this against my own GPU made it clear: my GPU simply isn't supported by flash attention!
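For reference, here is a quick way to check a GPU's compute capability with PyTorch (a sketch added for this write-up, not part of the original debugging; flash-attn 2 requires compute capability 8.0, i.e. Ampere, or newer, while flash-attn 1.x also supported Turing at 7.5):

import torch

# Query the first CUDA device; FlashAttention 2 needs compute capability >= 8.0 (Ampere/sm80+)
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("This GPU is below Ampere; FlashAttention 2 will refuse to run.")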
So the solution is:

pip uninstall flash-attn
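As an alternative to uninstalling (a sketch I did not use myself, based on the use_flash_attn switch in Qwen's remote-code config; treat the exact keyword argument as an assumption for your model revision), flash attention can also be disabled when loading the model directly with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=False,  # assumption: force-disable flash attention instead of uninstalling it
).eval()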