A typical personal computer cannot realistically run the Mixtral 8x7B large language model in float16, so the weights have to be loaded in 4-bit or 8-bit instead.
In actual testing, inference became noticeably faster in 4-bit, while 8-bit inference was still very slow.
The inference framework used is FastChat.
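Under the hood, FastChat loads models through Hugging Face Transformers, and quantized loading is handled by its bitsandbytes integration. For reference, here is a minimal standalone sketch of a 4-bit load (the model path is illustrative, and BitsAndBytesConfig is the recommended way to request quantization in recent transformers versions):

# Sketch: 4-bit load with plain Transformers + bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative path

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # 4-bit weights, fp16 compute
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on GPU/CPU as memory allows
    low_cpu_mem_usage=True,
)

The FastChat change itself lives in the Mistral adapter: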
vi fastchat/model/model_adapter.py
Before the change:
class MistralAdapter(BaseModelAdapter):
    """The model adapter for Mistral AI models"""

    def match(self, model_path: str):
        return "mistral" in model_path.lower() or "mixtral" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id
        return model, tokenizer
After the change:
class MistralAdapter(BaseModelAdapter):
    """The model adapter for Mistral AI models"""

    def match(self, model_path: str):
        return "mistral" in model_path.lower() or "mixtral" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        # model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        # AutoTokenizer and AutoModelForCausalLM are already imported from
        # transformers at the top of model_adapter.py.
        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        if "mixtral" in model_path.lower():
            # Load Mixtral quantized; requires the bitsandbytes package.
            # Swap in load_in_8bit=True instead if preferred (slower in my tests).
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                # attn_implementation="flash_attention_2",
                # load_in_8bit=True,
                load_in_4bit=True,
                **from_pretrained_kwargs,
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                **from_pretrained_kwargs,
            )
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id
        return model, tokenizer
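Note that match() keys off the model path string, so the local directory or Hub ID must contain "mixtral" (case-insensitive) for the 4-bit branch to be taken. After saving the change, FastChat can be launched as usual, for example:

python3 -m fastchat.serve.cli --model-path mistralai/Mixtral-8x7B-Instruct-v0.1

(the --model-path value is illustrative; point it at wherever the Mixtral weights actually live)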
Done!