Parameter-Efficient Fine-Tuning (PEFT) Part 1: A Quick Introduction to BitFit, Prompt Tuning and Prefix Tuning
Parameter-Efficient Fine-Tuning (PEFT) Part 2: A Quick Introduction to P-Tuning and P-Tuning V2
Note: up to this point we have been loading the model in single precision (FP32), so the memory occupied by the model itself is unchanged.
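If GPU memory is tight, the frozen base model can also be loaded in half precision before applying PEFT. A minimal sketch, assuming the Hugging Face model id Langboat/bloom-389m-zh used in this series (a local path works the same way):

import torch
from transformers import AutoModelForCausalLM

# Loading the base model in FP16 roughly halves the memory taken by its weights;
# LoRA itself does not change the base model's footprint.
model = AutoModelForCausalLM.from_pretrained(
    "Langboat/bloom-389m-zh", torch_dtype=torch.half
)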
LoRA avoids the extra inference-time computation that the Prompt-series methods introduce.
Paper: LoRA: Low-Rank Adaptation of Large Language Models (21.06)
The core idea of LoRA is to model the weight update with a low-rank decomposition, so that a large model can be trained indirectly with only a tiny number of trainable parameters.
As shown in the figure below, the trainable branch has the same dimension d as the pretrained layer: the input of dimension d is first projected down to r by one fully connected layer, then mapped back from r to d by another, where r << d and r is the rank of the decomposition. The trainable parameters therefore drop from d × d to d × r + r × d, a large reduction.
After training, the two low-rank matrices can be merged into the original model weights. The merged model is structurally identical to the original one, which avoids the extra inference-time computation that the Prompt-series methods introduce.
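A rough back-of-the-envelope sketch of the saving, assuming a square d × d weight matrix and rank r (the numbers are only illustrative):

# full update: train a d x d delta; LoRA: train B (d x r) and A (r x d)
d, r = 1024, 8
full_update = d * d              # 1,048,576 parameters
lora_update = d * r + r * d      # 16,384 parameters
print(lora_update / full_update) # 0.015625, i.e. about 1.6% of the full update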
We know that the weight matrices of a Transformer include Wq, Wk and Wv, which compute the query, key and value in the attention module, the output projection Wo of multi-head attention, and the weight matrices of the MLP layer. As the figure below shows, LoRA is applied only to the four weight matrices of the attention module, and the ablation experiments find that adapting Wq and Wv together gives the best results.
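For example, to follow that ablation and adapt only the query and value projections, target_modules can be set explicitly. The module names "q_proj" and "v_proj" below are an assumption for a LLaMA-style model; check model.named_modules() for the real names (Bloom, used later in this article, fuses them into a single query_key_value layer):

from peft import LoraConfig, TaskType

# hypothetical module names for a LLaMA-style model; adjust to the actual architecture
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],
)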
Below we walk through the source code under the peft library's default configuration (LoraConfig(task_type=TaskType.CAUSAL_LM)).
from peft import LoraConfig, TaskType, get_peft_model
config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, config)
The LoRA code breaks down into three parts: initialization, the forward pass, and parameter merging.
Let's first look at LoRA initialization. Under the default settings the LoRA module is the linear layer class Linear(nn.Linear, LoraLayer). LoRA only supports nn.Linear, nn.Embedding and nn.Conv2d, and fine-tuning linear layers is the common case.
get_peft_model looks up PeftModelForCausalLM through MODEL_TYPE_TO_PEFT_MODEL_MAPPING and initializes it:
# peft\mapping.py
def get_peft_model(model: PreTrainedModel, peft_config: PeftConfig, adapter_name: str = "default") -> PeftModel:
    ......
    # 1. If the task type is not one of ['SEQ_CLS', 'SEQ_2_SEQ_LM', 'CAUSAL_LM', 'TOKEN_CLS', 'QUESTION_ANS', 'FEATURE_EXTRACTION']
    #    and it is not prompt learning, return a plain PeftModel
    if peft_config.task_type not in MODEL_TYPE_TO_PEFT_MODEL_MAPPING.keys() and not peft_config.is_prompt_learning:
        return PeftModel(model, peft_config, adapter_name=adapter_name)
    # 2. prompt learning
    if peft_config.is_prompt_learning:
        peft_config = _prepare_prompt_learning_config(peft_config, model_config)
    # 3. Look up peft.peft_model.PeftModelForCausalLM through the mapping
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)

# peft\peft_model.py
class PeftModelForCausalLM(PeftModel):
    def __init__(self, model, peft_config: PeftConfig, adapter_name="default"):
        # call the parent class PeftModel's __init__
        super().__init__(model, peft_config, adapter_name)
        self.base_model_prepare_inputs_for_generation = self.base_model.prepare_inputs_for_generation
        ......
# peft\peft_model.py
class PeftModel(PushToHubMixin, torch.nn.Module):
    def __init__(self, model: PreTrainedModel, peft_config: PeftConfig, adapter_name: str = "default"):
        super().__init__()
        self.base_model = model
        self.config = getattr(self.base_model, "config", {"model_type": "custom"})
        self.modules_to_save = None
        self.peft_config = {}
        self.active_adapter = adapter_name
        self.peft_type = peft_config.peft_type
        if not peft_config.is_prompt_learning:
            # not prompt learning, so we enter this branch
            self.peft_config[adapter_name] = peft_config
            # PEFT_TYPE_TO_MODEL_MAPPING gives us peft.tuners.lora.LoraModel,
            # so LoraModel's __init__ is called next
            self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
                self.base_model, self.peft_config, adapter_name
            )
            self.set_additional_trainable_modules(peft_config, adapter_name)
        else:
            self.add_adapter(adapter_name, peft_config)
        ......

# peft\tuners\lora.py
class LoraModel(BaseTuner):
    def __init__(self, model, config, adapter_name) -> None:
        # run BaseTuner's __init__
        super().__init__(model, config, adapter_name)
        ......
Inside BaseTuner's initialization, inject_adapter walks the model, and for each target module it eventually runs new_module = Linear(adapter_name, in_features, out_features, bias=bias, **kwargs), i.e. the initialization of Linear in peft\tuners\lora.py:

# peft\tuners\tuners_utils.py
class BaseTuner(nn.Module, ABC):
    def __init__(self, model, peft_config: Union[PeftConfig, dict[str, PeftConfig]], adapter_name: str) -> None:
        super().__init__()
        self.model = model
        ......
        # calls the inject_adapter method
        self.inject_adapter(self.model, adapter_name)
        # Copy the peft_config in the injected model.
        self.model.peft_config = self.peft_config
        ......

# peft\tuners\lora.py
def inject_adapter(self, model: nn.Module, adapter_name: str):
    peft_config = self.peft_config[adapter_name]
    is_target_modules_in_base_model = False
    key_list = [key for key, _ in model.named_modules()]
    model_config = getattr(model, "config", {"model_type": "custom"})
    if hasattr(model_config, "to_dict"):
        model_config = model_config.to_dict()
    # fill in the default settings for the adapter (shown below)
    peft_config = self._prepare_adapter_config(peft_config, model_config)
    # iterate over every key
    for key in key_list:
        is_target_modules_in_base_model = True
        # e.g. key = 'transformer.h.0.self_attention.query_key_value'
        # parent = BloomAttention(
        #     (query_key_value): Linear(in_features=64, out_features=192, bias=True)
        #     (dense): Linear(in_features=64, out_features=64, bias=True)
        #     (attention_dropout): Dropout(p=0.0, inplace=False)
        # )
        # target = Linear(in_features=64, out_features=192, bias=True)
        # target_name = 'query_key_value'
        parent, target, target_name = _get_submodules(model, key)
        optionnal_kwargs = {
            "loaded_in_8bit": getattr(model, "is_loaded_in_8bit", False),
            "loaded_in_4bit": getattr(model, "is_loaded_in_4bit", False),
            "current_key": key,
        }
        self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optionnal_kwargs)
        ......

# peft\tuners\lora.py
def _create_and_replace(
    self,
    lora_config,
    adapter_name,
    target,
    target_name,
    parent,
    **optionnal_kwargs,
):
    ......
    # TODO: better deal with that
    if isinstance(target, LoraLayer) and isinstance(target, torch.nn.Conv2d):
        target.update_layer_conv2d(
            adapter_name,
            lora_config.r,
            lora_config.lora_alpha,
            lora_config.lora_dropout,
            lora_config.init_lora_weights,
        )
    ......
    else:
        # target is a plain linear layer by default, so this branch is taken
        new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
        # replace the original module with the LoRA-augmented one
        self._replace_module(parent, target_name, new_module, target)
# the default settings for parameter-efficient fine-tuning
LoraConfig(
    peft_type=<PeftType.LORA: 'LORA'>               # PEFT method type
    , auto_mapping=None
    , base_model_name_or_path=''
    , revision=None
    , task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>   # task type
    , inference_mode=False
    , r=8                                 # rank; usually 4, 8 or 16, with smaller values (1, 2) for small datasets
    , target_modules=['query_key_value']  # modules LoRA is applied to; defaults to query_key_value for Bloom, regex is supported
    , lora_alpha=8                        # scaling factor: ΔW is scaled by alpha/r, i.e. scaling[adapter_name] = lora_alpha / r (1 here)
    , lora_dropout=0.0                    # dropout rate of the LoRA layers
    , fan_in_fan_out=False
    , bias='none'                         # 'none', 'all' or 'lora_only'; with 'all'/'lora_only' the corresponding biases are updated during training
    , modules_to_save=None                # extra modules to train fully and save alongside the adapter
    , init_lora_weights=True
    , layers_to_transform=None
    , layers_pattern=None
)
In Linear's initialization, the key call is update_layer, which creates the trainable rank-r matrices lora_A and lora_B (the source of both is shown further below). After initialization, new_module looks like this:
new_module = Linear(
in_features=64, out_features=192, bias=True
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=64, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=192, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
The module above is Linear after initialization, i.e. new_module. Once it is built, the original module is replaced with it:
# before replacement
BloomAttention(
  (query_key_value): Linear(in_features=64, out_features=192, bias=True)
  (dense): Linear(in_features=64, out_features=64, bias=True)
  (attention_dropout): Dropout(p=0.0, inplace=False)
)

# the replacement is done by the following code
# (the _create_and_replace method in peft\tuners\lora.py)
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
self._replace_module(parent, target_name, new_module, target)

# after replacement
BloomAttention(
  (query_key_value): Linear(
    in_features=64, out_features=192, bias=True
    (lora_dropout): ModuleDict(
      (default): Identity()
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=64, out_features=8, bias=False)
    )
    (lora_B): ModuleDict(
      (default): Linear(in_features=8, out_features=192, bias=False)
    )
    (lora_embedding_A): ParameterDict()
    (lora_embedding_B): ParameterDict()
  )
  (dense): Linear(in_features=64, out_features=64, bias=True)
  (attention_dropout): Dropout(p=0.0, inplace=False)
)
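From the replaced layer's printout we can count how few parameters LoRA adds to this (toy-sized) layer, a quick sanity check:

# lora_A: 64 -> 8 and lora_B: 8 -> 192, both without bias
lora_params = 64 * 8 + 8 * 192   # 2,048 new trainable parameters
base_params = 64 * 192           # 12,288 frozen parameters in the original weight
print(lora_params / base_params) # ~0.167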
# peft\tuners\lora.py
class Linear(nn.Linear, LoraLayer):
    # Lora implemented in a dense layer
    def __init__(
        self,
        adapter_name: str,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        is_target_conv_1d_layer: bool = False,
        **kwargs,
    ):
        init_lora_weights = kwargs.pop("init_lora_weights", True)
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoraLayer.__init__(self, in_features=in_features, out_features=out_features)
        # Freezing the pre-trained weight matrix
        self.weight.requires_grad = False
        self.fan_in_fan_out = fan_in_fan_out
        if fan_in_fan_out:
            self.weight.data = self.weight.data.T
        nn.Linear.reset_parameters(self)
        # the key call
        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
        self.active_adapter = adapter_name
        self.is_target_conv_1d_layer = is_target_conv_1d_layer
    ......
def update_layer(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
    self.r[adapter_name] = r
    self.lora_alpha[adapter_name] = lora_alpha
    if lora_dropout > 0.0:
        # dropout
        lora_dropout_layer = nn.Dropout(p=lora_dropout)
    else:
        # the default setting
        lora_dropout_layer = nn.Identity()
    self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
    # Actual trainable parameters
    if r > 0:
        # if the rank is greater than 0, initialize the trainable rank-r matrices A and B
        self.lora_A.update(nn.ModuleDict({adapter_name: nn.Linear(self.in_features, r, bias=False)}))
        self.lora_B.update(nn.ModuleDict({adapter_name: nn.Linear(r, self.out_features, bias=False)}))
        # lora_alpha balances the LoRA update against the base model;
        # scaling defaults to 1 here (lora_alpha / r = 8 / 8)
        self.scaling[adapter_name] = lora_alpha / r
    if init_lora_weights:
        self.reset_lora_parameters(adapter_name)
    self.to(self.weight.device)
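For completeness, reset_lora_parameters (called at the end of update_layer when init_lora_weights=True) initializes A with a Kaiming-uniform distribution and B with zeros, so the initial update BA is the zero matrix and the wrapped model starts out exactly equal to the base model. A simplified sketch of that initialization (the exact peft source may differ slightly between versions):

import math
import torch.nn as nn

def reset_lora_parameters_sketch(lora_A: nn.Linear, lora_B: nn.Linear):
    # A: standard Kaiming-uniform init; B: zeros, so B @ A == 0 at the start of training
    nn.init.kaiming_uniform_(lora_A.weight, a=math.sqrt(5))
    nn.init.zeros_(lora_B.weight)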
After injection, the whole wrapped model looks like this:

LoraModel(
  (model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 64)
      (word_embeddings_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-1): 2 x BloomBlock(
          (input_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            # LoRA fine-tunes the query_key_value projection of BloomAttention
            (query_key_value): Linear(
              in_features=64, out_features=192, bias=True
              (lora_dropout): ModuleDict(
                (default): Identity()
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=64, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=192, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
            )
            (dense): Linear(in_features=64, out_features=64, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=64, out_features=256, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=256, out_features=64, bias=True)
          )
        )
      )
      (ln_f): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_features=64, out_features=250880, bias=False)
  )
)
Next, the forward pass of Linear in peft\tuners\lora.py is shown below. By default self.active_adapter = 'default'.

# peft\tuners\lora.py
def forward(self, x: torch.Tensor):
    previous_dtype = x.dtype
    if self.active_adapter not in self.lora_A.keys():
        return F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    if self.disable_adapters:
        if self.r[self.active_adapter] > 0 and self.merged:
            self.unmerge()
        result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    # adapters are enabled, rank r > 0, and the weights are not merged
    elif self.r[self.active_adapter] > 0 and not self.merged:
        # 1. result has shape torch.Size([1, 48, 192])
        result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        x = x.to(self.lora_A[self.active_adapter].weight.dtype)
        # 2. the core code
        # x has shape torch.Size([1, 48, 64])
        # self.lora_A['default'] = Linear(in_features=64, out_features=8, bias=False)
        # after self.lora_A, x has shape torch.Size([1, 48, 8])
        # self.lora_B['default'] = Linear(in_features=8, out_features=192, bias=False)
        # after self.lora_B, x has shape torch.Size([1, 48, 192])
        # it is then added to result, i.e. h = Wx + BAx
        result += (
            self.lora_B[self.active_adapter](
                self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
            )
            * self.scaling[self.active_adapter]
        )
    else:
        result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    result = result.to(previous_dtype)
    return result
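To make the shape bookkeeping in the comments concrete, here is a small standalone sketch (plain torch, not the peft source) that reproduces h = Wx + (alpha/r)·BAx with the same sizes and checks that merging the weights gives the identical result:

import torch

d_in, d_out, r, alpha = 64, 192, 8, 8
W = torch.randn(d_out, d_in)         # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01      # lora_A.weight, shape (r, d_in)
B = torch.zeros(d_out, r)            # lora_B.weight, shape (d_out, r), zero-initialized
x = torch.randn(1, 48, d_in)         # same batch/sequence shape as in the comments above

scaling = alpha / r                  # 1.0 for the default config
h_lora = x @ W.T + scaling * ((x @ A.T) @ B.T)   # h = Wx + (alpha/r) * BAx
h_merged = x @ (W + scaling * (B @ A)).T         # merged weight W' = W + (alpha/r) * BA

print(torch.allclose(h_lora, h_merged, atol=1e-5))  # True: merging changes nothing at inference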
As before, we only need to add the peft code after loading the original model and before setting up the trainer.
from peft import LoraConfig, TaskType, get_peft_model
config = LoraConfig(
task_type=TaskType.CAUSAL_LM
# , target_modules=".*\.1.*query_key_value"  # modules to apply LoRA to (regex supported)
# , modules_to_save=["word_embeddings"]  # extra modules to train fully and save
, modules_to_save=None
)
model = get_peft_model(model, config)
# print the trainable parameters
model.print_trainable_parameters()
trainable params: 786,432 || all params: 346,555,392 || trainable%: 0.22692822508443325
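These 786,432 trainable parameters line up with the LoRA shapes, assuming bloom-389m-zh has 24 Transformer layers with hidden size 1024 (so query_key_value maps 1024 -> 3072):

# assumed model dimensions; verify against model.config for your checkpoint
hidden, qkv_out, r, n_layers = 1024, 3 * 1024, 8, 24
per_layer = hidden * r + r * qkv_out   # lora_A + lora_B = 32,768
print(per_layer * n_layers)            # 786,432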
(base) root@autodl-container-adbc11ae52-f2ebff02:~# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 35%   59C    P2   143W / 250W |   2810MiB / 11264MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
After LoRA training is finished, the two low-rank matrices can be merged into the original model weights. The merged model is structurally identical to the original model, which avoids the extra inference-time computation introduced by the Prompt-series methods.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# 1. load the base model
model_path = r'/root/autodl-fs/models/langboat/bloom-389m-zh'
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 2. load the PeftModel
p_model = PeftModel.from_pretrained(model, model_id="./chatbot/checkpoint-500/")

# 3. merge the LoRA weights into the base model
merge_model = p_model.merge_and_unload()

# 4. save the full model; it is much larger now and can be loaded directly with AutoModelForCausalLM next time
merge_model.save_pretrained("./chatbot/merge_model")
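After merging, the saved directory behaves like an ordinary Transformers checkpoint. A minimal usage sketch following the paths above (the prompt is arbitrary and only for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

# load the merged model directly; peft is no longer required
merge_model = AutoModelForCausalLM.from_pretrained("./chatbot/merge_model")
tokenizer = AutoTokenizer.from_pretrained("/root/autodl-fs/models/langboat/bloom-389m-zh")

inputs = tokenizer("考试有哪些技巧？", return_tensors="pt")   # arbitrary example prompt
outputs = merge_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))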
Paper: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (23.03)
LoRA can reach performance close to full fine-tuning, but it still has some issues.
As figure (a) below shows, putting all the trainable parameters into the FFN works better than putting them into the attention matrices; as figure (b) shows, fine-tuning the parameters of the higher layers works better than fine-tuning the lower layers. The authors therefore propose AdaLoRA, which adaptively allocates the parameter budget across weight matrices according to their importance scores.
AdaLoRA consists of two main components:
(i) SVD-based adaptation: the incremental update is parameterized in the form of a singular value decomposition, ΔW = PΛQ (a conceptual sketch follows below).
(ii) Importance-aware rank allocation: redundant singular values are pruned according to importance scores.
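A conceptual sketch of the SVD-style parameterization from the paper (not the peft implementation): the update is written as ΔW = PΛQ, where Λ holds trainable "singular values" that can be pruned, and P, Q are kept approximately orthogonal with a regularizer.

import torch

d_out, d_in, r = 192, 64, 8
P = torch.randn(d_out, r, requires_grad=True)   # left singular-vector matrix
lam = torch.zeros(r, requires_grad=True)        # trainable singular values (prunable)
Q = torch.randn(r, d_in, requires_grad=True)    # right singular-vector matrix

delta_W = P @ torch.diag(lam) @ Q               # ΔW = P Λ Q

# orthogonality regularizer R(P, Q) = ||PᵀP - I||² + ||QQᵀ - I||²
I = torch.eye(r)
orth_reg = ((P.T @ P - I) ** 2).sum() + ((Q @ Q.T - I) ** 2).sum()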
For the detailed principles, refer to the article "AdaLoRA (Adaptive LoRA) 详解".
We will not analyze the AdaLoRA source code here and will go straight to lightweight fine-tuning; readers who are interested can dig into the source themselves.
Again, we only need to add the peft code after loading the original model and before setting up the trainer.
from peft import get_peft_model, TaskType, AdaLoraConfig

config = AdaLoraConfig(task_type=TaskType.CAUSAL_LM
                       , r=8
                       , lora_alpha=8
                       , lora_dropout=0
                       , target_modules=["query_key_value"]
                       )
model = get_peft_model(model, config)

# print the trainable parameters
model.print_trainable_parameters()

trainable params: 1,179,936 || all params: 346,948,920 || trainable%: 0.3400892557901607
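AdaLoraConfig also exposes the budget-allocation schedule. The field names below exist in the peft version used here but may change across releases, so treat this as a hedged sketch rather than a definitive configuration:

from peft import AdaLoraConfig, TaskType

config = AdaLoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["query_key_value"],
    init_r=12,    # initial rank given to every target matrix
    target_r=8,   # average target rank after budget allocation
    tinit=200,    # warm-up steps before any pruning starts
    tfinal=500,   # final steps during which the budget stays fixed
    deltaT=10,    # prune/reallocate every deltaT steps
)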
(base) root@autodl-container-adbc11ae52-f2ebff02:~/autodl-tmp/transformers-code/03-PEFT# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 33%   54C    P2   119W / 250W |   2816MiB / 11264MiB |     32%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+