LoRA: Low-Rank Adaptation of Large Language Models.
The "rank" here has its usual meaning: the maximum number of linearly independent rows or columns.
Use case: LoRA replaces the conventional full fine-tuning of a large model.
Advantages: only a small number of adapter parameters need to be trained and saved, and the low-rank update can be merged back into the pretrained weights, so there is no extra latency at inference time.
For a pretrained weight matrix $W_0 \in \mathbb{R}^{d\times k}$, LoRA constrains its update to a low-rank decomposition:

$$W^{(\text{new})} = W_0 + \Delta W = W_0 + BA \tag{1}$$
where $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, and $r \ll \min(d, k)$.
The module's output then becomes

$$h = W_0 x + \Delta W x = W_0 x + BAx \tag{2}$$
Tip: matrix multiplication is associative, i.e. $(AB)C = A(BC)$, so $BAx$ can be evaluated as $B(Ax)$ without ever materializing the $d \times k$ matrix $BA$.
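A minimal runnable sketch of this arithmetic (the sizes d, k, r below are made up purely for illustration):

import torch

# Hypothetical sizes; float64 keeps the equality check numerically tight.
d, k, r = 768, 768, 8
W0 = torch.randn(d, k, dtype=torch.float64)
A = torch.randn(r, k, dtype=torch.float64)
B = torch.randn(d, r, dtype=torch.float64)  # the reference code starts B at zero; random here just to exercise the identity
x = torch.randn(k, dtype=torch.float64)

h_merged = (W0 + B @ A) @ x      # materializes the full d x k update Delta W = BA
h_lazy = W0 @ x + B @ (A @ x)    # associativity: never forms BA explicitly
assert torch.allclose(h_merged, h_lazy)

# Trainable parameters: r*(d+k) = 12,288 for A and B, versus d*k = 589,824 for a dense update.
print(r * (d + k), d * k)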
The reference implementation first defines a LoRALayer class, which simply keeps a few self.xx fields around.
The most commonly used case, Linear, is then discussed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer():
    def __init__(
        self,
        r: int,
        lora_alpha: int,
        lora_dropout: float,
        merge_weights: bool,
    ):
        # Only keeps the hyper-parameters around as self.xx fields.
        self.r = r
        self.lora_alpha = lora_alpha
        self.lora_dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0. else (lambda x: x)
        self.merge_weights = merge_weights


class Linear(nn.Linear, LoRALayer):
    def __init__(self, in_features: int, out_features: int, r: int = 0,
                 lora_alpha: int = 1, lora_dropout: float = 0.,
                 merge_weights: bool = True, **kwargs):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha,
                           lora_dropout=lora_dropout, merge_weights=merge_weights)
        if r > 0:
            # A is r x in_features, B is out_features x r; in the full loralib,
            # lora_A is re-initialized (Kaiming uniform) while lora_B stays zero,
            # so Delta W = BA starts from zero.
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r

    def forward(self, x: torch.Tensor):
        # The full loralib applies a transpose helper here for fan_in_fan_out layouts;
        # the plain case uses the weight directly.
        result = F.linear(x, self.weight, bias=self.bias)
        if self.r > 0:
            # The @ operator maps to the __matmul__(self, x) implementation.
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1)
                       @ self.lora_B.transpose(0, 1)) * self.scaling
        return result
To use it, take every place the original model constructs an nn.Linear and replace it with lora.Linear. That is all there is to it.
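A minimal usage sketch (the sizes are arbitrary; lora.Linear refers to the official loralib class, or equivalently the simplified Linear above):

import torch
import loralib as lora

layer = lora.Linear(128, 64, r=8, lora_alpha=16)   # drop-in replacement for nn.Linear(128, 64)
x = torch.randn(4, 128)

# Because lora_B is zero-initialized, the initial output matches a plain linear map
# using the same (frozen) weight and bias.
plain = torch.nn.functional.linear(x, layer.weight, layer.bias)
assert torch.allclose(layer(x), plain)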
Because lora.Linear uses multiple inheritance, the original linear layer's weight is still there under the same attribute name, so blindly restoring an existing checkpoint causes no mismatch.
At the same time, the adapter parameters carry 'lora_A' and 'lora_B' in their names, so whether saving a checkpoint or selecting which parameters get gradients, they can be singled out by a simple string match.
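For example, a small sketch of helpers built on that naming convention (the official loralib ships similar utilities; the function names here are just illustrative):

import torch.nn as nn

def mark_only_lora_as_trainable(model: nn.Module) -> None:
    # Freeze everything except parameters whose name contains "lora_".
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name

def lora_state_dict(model: nn.Module) -> dict:
    # Keep only the adapter weights; the frozen backbone is restored from the original checkpoint.
    return {k: v for k, v in model.state_dict().items() if "lora_" in k}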
There are two examples; in both, LoRA is applied to the q/k/v projection parameters of the multi-head attention inside the transformer block.
The first is the adaptation of the official GPT2Model.
class Attention(nn.Module):
    def __init__(self, nx, n_ctx, config, scale=False):
        super(Attention, self).__init__()
        n_state = nx  # in Attention: n_state=768 (nx=n_embd)
        # [switch nx => n_state from Block to Attention to keep identical to TF implem]
        assert n_state % config.n_head == 0
        self.register_buffer("bias", torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx))
        self.n_head = config.n_head
        self.split_size = n_state
        self.scale = scale

        self.c_attn = lora.MergedLinear(
            nx, n_state * 3,
            r=config.lora_attn_dim,
            lora_alpha=config.lora_attn_alpha,
            lora_dropout=config.lora_dropout,
            enable_lora=[True, False, True],
            fan_in_fan_out=True,
            merge_weights=False
        )
        self.c_proj = Conv1D(n_state, nx)

        self.config = config
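A note on the call above: c_attn is the fused q/k/v projection of width 3 * n_state, and enable_lora=[True, False, True] applies the low-rank update only to the query and value slices. A toy illustration of how that fused output splits (sizes and names are made up):

import torch

n_state, batch, seq = 8, 2, 4                 # toy sizes, illustration only
qkv = torch.randn(batch, seq, 3 * n_state)    # stand-in for self.c_attn(x)
q, k, v = qkv.split(n_state, dim=-1)          # q and v carry the BA update, k is left untouched
assert q.shape == k.shape == v.shape == (batch, seq, n_state)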
The RoBERTa adaptation is even more direct.
class RobertaSelfAttention(nn.Module):
    def __init__(self, config):
        if config.apply_lora:
            self.query = lora.Linear(config.hidden_size, self.all_head_size,
                                     config.lora_r, lora_alpha=config.lora_alpha)
        else:
            self.query = nn.Linear(config.hidden_size, self.all_head_size)
As these examples show, LoRA only steps into the linear layer's $XW^T$ computation; the bias is then added back unchanged.
After that, softmax normalizes the attention weights, which are multiplied with the value vectors, and then layer norm follows; none of these steps is touched.
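To make that concrete, a rough sketch of where the adapted projections sit in standard attention (plain nn.Linear layers stand in for the projections; all names and sizes are illustrative):

import math
import torch

batch, seq, d = 2, 5, 16                 # toy sizes, illustration only
x = torch.randn(batch, seq, d)
W_q, W_k, W_v = (torch.nn.Linear(d, d) for _ in range(3))

q = W_q(x)   # the LoRA-adapted projection in the examples above
k = W_k(x)   # plain projection in the GPT-2 example (enable_lora is False for k)
v = W_v(x)   # LoRA-adapted as well

scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
out = scores @ v   # the softmax, the value product, and the later layer norm are untouched by LoRA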
Finally, the same idea as a TensorFlow 1.x (tf.contrib.keras) Dense subclass:

# coding: utf-8
import tensorflow as tf
from tensorflow.contrib import keras
from tensorflow.python.framework import tensor_shape


class Dense(keras.layers.Dense):
    """
    Also named Dense, the same name as keras.layers.Dense, so that the weight
    variable names are not affected.
    """

    def __init__(self, r=32, lora_alpha=1, **kwargs):
        if 'input_dim' in kwargs:
            self.in_features = kwargs['input_dim']
        if 'units' in kwargs:
            self.out_features = kwargs['units']
        self.r = r
        self.lora_alpha = lora_alpha
        # Python 2 compatibility: force float division so scaling does not truncate to 0.
        self.scaling = lora_alpha * 1.0 / r
        self.lora_A = None
        self.lora_B = None
        keras.layers.Dense.__init__(self, **kwargs)

    def build(self, input_shape):
        keras.layers.Dense.build(self, input_shape)
        input_shape = tensor_shape.TensorShape(input_shape)
        if not hasattr(self, 'in_features'):
            self.in_features = input_shape[-1].value
        if not hasattr(self, 'out_features'):
            self.out_features = self.units
        self.lora_A = self.add_weight(name='lora_A',
                                      shape=[self.r, self.in_features],
                                      initializer=keras.initializers.he_uniform())
        self.lora_B = self.add_weight(name='lora_B',
                                      shape=[self.out_features, self.r],
                                      initializer=keras.initializers.Zeros())

    def call(self, inputs):
        result = keras.layers.Dense.call(self, inputs)
        # x @ A^T @ B^T, scaled, exactly as in equation (2).
        tmp = tf.matmul(inputs, tf.transpose(self.lora_A, perm=[1, 0]))
        tmp = tf.matmul(tmp, tf.transpose(self.lora_B, perm=[1, 0]))
        result += tmp * self.scaling
        return result

    def __repr__(self):
        return "Dense(r={}, lora_A={}, lora_B={})".format(self.r, self.lora_A, self.lora_B)
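A brief usage sketch under those assumptions (TF 1.x; the Dense class above is assumed to be in scope, and the layer sizes are arbitrary):

from tensorflow.contrib import keras

model = keras.models.Sequential([
    Dense(units=64, input_shape=(128,), activation='relu', r=8, lora_alpha=16),
    Dense(units=10, r=8, lora_alpha=16),
])
model.summary()   # kernel/bias variable names match a plain keras.layers.Dense; lora_A / lora_B are the only additions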