This post walks through the structure of ChatGLM2-6B using the model source code published on Hugging Face, and breaks the model down with the help of related papers and blog posts.
The ChatGLM2-6B model has 28 GLM layers (each composed of self-attention and an MLP) and 32 attention heads, and it uses Multi-Query Attention; the number of hidden layers is 28. Positional information is injected with rotary position embedding (RoPE), the activation function is SwiGLU, and normalization uses RMSNorm.
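These hyperparameters can be double-checked against the configuration shipped with the Hugging Face checkpoint. The snippet below is only a minimal sketch: the attribute names (num_layers, num_attention_heads, multi_query_attention) are assumptions based on the repository's config.json and may differ between revisions.

from transformers import AutoConfig

# Assumption: attribute names follow THUDM/chatglm2-6b's config.json and may change across revisions.
config = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
print(config.num_layers)             # expected: 28 GLM blocks
print(config.num_attention_heads)    # expected: 32 heads
print(config.multi_query_attention)  # expected: True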
ChatGLMModel (assume the input X has shape 3x5)
\operatorname{SwiGLU}(x, W, V, b, c, \beta)=\operatorname{Swish}_{\beta}(x W+b) \otimes(x V+c)
where $\operatorname{Swish}_\beta(x)=x \sigma(\beta x)$ and $\beta$ is a fixed constant, usually 1.
This corresponds to the following source code in chatglm2-6b (F is torch.nn.functional):

def swiglu(x):
    # split the input into two halves along the last dimension
    x = torch.chunk(x, 2, dim=-1)
    # F.silu(x) = x * sigmoid(x), i.e. Swish with beta = 1
    return F.silu(x[0]) * x[1]
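In the formula, W and V are two separate projections, while chatglm2-6b fuses them into one dense layer and splits its output with torch.chunk. The following minimal sketch (layer sizes are made up; biases are omitted) checks that the two views agree, using the fact that F.silu is Swish with β = 1:

import torch
import torch.nn.functional as F

d_in, d_ff = 8, 16                     # made-up sizes for illustration
x = torch.randn(3, d_in)
W = torch.randn(d_in, d_ff)            # projection feeding the Swish branch
V = torch.randn(d_in, d_ff)            # projection feeding the gating branch

# Formula view: SwiGLU(x) = Swish_1(xW) ⊗ (xV)
out_formula = F.silu(x @ W) * (x @ V)

# chatglm2-6b view: one fused projection whose output is chunked into the two branches
h = x @ torch.cat([W, V], dim=-1)      # shape [3, 2 * d_ff]
h0, h1 = torch.chunk(h, 2, dim=-1)
out_fused = F.silu(h0) * h1

assert torch.allclose(out_formula, out_fused, atol=1e-5)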
Rotary position embedding: RoPE
The purpose of rotary position embedding is to exploit the relative positions between tokens.
Assume that the inner product between a query vector $\boldsymbol{q}_m$ and a key vector $\boldsymbol{k}_n$ can be expressed by a function $g$ whose inputs are the word embeddings $\boldsymbol{x}_m, \boldsymbol{x}_n$ and their relative position $m-n$:
\left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right)
This turns the original absolute position encoding into a relative one; what remains is to find a suitable $g$. The paper by Su Jianlin et al. (RoFormer) proposes the following solution; the detailed derivation can also be found in the author's blog.
f_q\left(\boldsymbol{x}_m, m\right)=\left(W_q \boldsymbol{x}_m\right) e^{i m \theta}
f_k\left(\boldsymbol{x}_n, n\right)=\left(W_k \boldsymbol{x}_n\right) e^{i n \theta}
g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right)=\operatorname{Re}\left[\left(W_q \boldsymbol{x}_m\right)\left(W_k \boldsymbol{x}_n\right)^{*} e^{i(m-n) \theta}\right]
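As a quick sanity check of these formulas, in the 2D case the rotation can be carried out with complex numbers: multiplying the query by e^{imθ} and the key by e^{inθ} makes their real inner product depend only on m − n. A minimal sketch (the values of q, k and θ are arbitrary):

import cmath

theta = 0.3                       # arbitrary rotation frequency
q = 1.2 - 0.7j                    # a 2D query written as a complex number
k = 0.4 + 2.1j                    # a 2D key written as a complex number

def rel_score(m, n):
    # Re[(q e^{imθ}) (k e^{inθ})^*] = Re[q k^* e^{i(m-n)θ}]
    fq = q * cmath.exp(1j * m * theta)
    fk = k * cmath.exp(1j * n * theta)
    return (fq * fk.conjugate()).real

# position pairs with the same relative distance m - n give the same score
print(rel_score(5, 2), rel_score(10, 7), rel_score(3, 0))   # all (approximately) equal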
import torch
import torch.nn as nn


class RotaryEmbedding(nn.Module):
    def __init__(self, dim, original_impl=False, device=None, dtype=None):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
        self.register_buffer("inv_freq", inv_freq)
        self.dim = dim
        self.original_impl = original_impl

    def forward_impl(
            self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
    ):
        """Enhanced Transformer with Rotary Position Embedding.

        Derived from: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/rope/__init__.py.
        MIT License: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license.
        """
        # $\Theta = {\theta_i = 10000^{\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]}$
        theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=dtype, device=device) / n_elem))

        # Create position indexes `[0, 1, ..., seq_len - 1]`
        seq_idx = torch.arange(seq_len, dtype=dtype, device=device)

        # Calculate the product of position index and $\theta_i$
        idx_theta = torch.outer(seq_idx, theta).float()

        # cache[s, i] = (cos(s * theta_i), sin(s * theta_i))
        cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)
        return cache

    def forward(self, max_seq_len, offset=0):
        return self.forward_impl(
            max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
        )


def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # x: [sq, b, np, hn]
    # np: number of attention heads (partitions); hn: hidden size per head
    sq, b, np, hn = x.size(0), x.size(1), x.size(2), x.size(3)
    rot_dim = rope_cache.shape[-2] * 2
    # only the first rot_dim channels are rotated; the rest pass through unchanged
    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    # truncate the cache to support variable sequence lengths
    rope_cache = rope_cache[:sq]
    xshaped = x.reshape(sq, -1, np, rot_dim // 2, 2)
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
    # the complex rotation written out on (even, odd) channel pairs
    x_out2 = torch.stack(
        [
            xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
            xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1],
        ],
        -1,
    )
    x_out2 = x_out2.flatten(3)
    return torch.cat((x_out2, x_pass), dim=-1)
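A small usage sketch of the two pieces above. The sizes are made up; the only structural point mirrored from chatglm2-6b is that the rope cache covers only part of each head's channels, and the rest passes through unrotated.

# sq (sequence length) = 5, b (batch) = 2, np (heads) = 4, hn (per-head dim) = 16
rotary = RotaryEmbedding(dim=8, dtype=torch.float32)   # cache covers the first 8 channels
rope_cache = rotary(max_seq_len=5)                     # shape [5, 4, 2]: cos/sin per position and frequency
x = torch.randn(5, 2, 4, 16)                           # [sq, b, np, hn]
x_rot = apply_rotary_pos_emb(x, rope_cache)
print(x_rot.shape)                                     # torch.Size([5, 2, 4, 16])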
Multi-query attention is a variant of multi-head attention in which all heads share a single set of keys and values (each head keeps its own query). Its main benefit is cutting the memory taken by the KV cache and reducing the cost of incremental decoding, as the rough estimate below shows.
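A back-of-the-envelope sketch of the KV-cache saving. The 32 heads come from the description above; the per-head dimension of 128, the sequence length and fp16 storage are assumptions for illustration only.

heads, head_dim, seq_len, batch, bytes_fp16 = 32, 128, 2048, 1, 2
mha_cache = 2 * batch * seq_len * heads * head_dim * bytes_fp16   # K and V kept for every head
mqa_cache = 2 * batch * seq_len * 1 * head_dim * bytes_fp16       # one shared K/V head
print(mha_cache / 2**20, "MiB vs", mqa_cache / 2**20, "MiB")      # 32.0 MiB vs 1.0 MiB, a 32x reduction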
The multi-head attention formulas are:
\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
\operatorname{MultiHead}(Q, K, V)=\operatorname{Concat}\left(\operatorname{head}_1, \ldots, \operatorname{head}_h\right) W^O, \quad \text{where } \operatorname{head}_i=\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)
# The following is from the paper "Fast Transformer Decoding: One Write-Head is All You Need"
import tensorflow as tf


def MultiheadAttentionBatched(X, M, mask, P_q, P_k, P_v, P_o):
    """Multi-head Attention.

    Args:
        X: a tensor with shape [b, n, d]
        M: a tensor with shape [b, m, d]
        mask: a tensor with shape [b, h, n, m]
        P_q: a tensor with shape [h, d, k]
        P_k: a tensor with shape [h, d, k]
        P_v: a tensor with shape [h, d, v]
        P_o: a tensor with shape [h, d, v]
    Returns:
        Y: a tensor with shape [b, n, d]
    """
    # b: batch size; m, n: sequence lengths; h: number of heads
    # k, v: key / value dimensions; d: hidden size
    Q = tf.einsum("bnd,hdk->bhnk", X, P_q)
    K = tf.einsum("bmd,hdk->bhmk", M, P_k)
    V = tf.einsum("bmd,hdv->bhmv", M, P_v)
    logits = tf.einsum("bhnk,bhmk->bhnm", Q, K)
    weights = tf.nn.softmax(logits + mask)
    O = tf.einsum("bhnm,bhmv->bhnv", weights, V)
    Y = tf.einsum("bhnv,hdv->bnd", O, P_o)
    return Y


def MultiqueryAttentionBatched(X, M, mask, P_q, P_k, P_v, P_o):
    """Multi-query Attention: all heads share one key head and one value head.

    Args:
        X: a tensor with shape [b, n, d]
        M: a tensor with shape [b, m, d]
        mask: a tensor with shape [b, h, n, m]
        P_q: a tensor with shape [h, d, k]
        P_k: a tensor with shape [d, k]
        P_v: a tensor with shape [d, v]
        P_o: a tensor with shape [h, d, v]
    Returns:
        Y: a tensor with shape [b, n, d]
    """
    # b: batch size; m, n: sequence lengths; h: number of heads
    # k, v: key / value dimensions; d: hidden size
    Q = tf.einsum("bnd,hdk->bhnk", X, P_q)
    K = tf.einsum("bmd,dk->bmk", M, P_k)
    V = tf.einsum("bmd,dv->bmv", M, P_v)
    logits = tf.einsum("bhnk,bmk->bhnm", Q, K)
    weights = tf.nn.softmax(logits + mask)
    O = tf.einsum("bhnm,bmv->bhnv", weights, V)
    Y = tf.einsum("bhnv,hdv->bnd", O, P_o)
    return Y
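For reference, here is the same multi-query idea written in PyTorch, with the 1/√d_k scaling from the Attention formula above added. The names (multi_query_attention, W_q, ...) and shapes are purely illustrative; this is a sketch, not the actual chatglm2-6b implementation.

import torch

def multi_query_attention(X, M, mask, W_q, W_k, W_v, W_o):
    """Multi-query attention: h query heads share a single K/V head.

    X: [b, n, d], M: [b, m, d], mask: [b, h, n, m] (additive, e.g. 0 / -inf)
    W_q: [h, d, k], W_k: [d, k], W_v: [d, v], W_o: [h, v, d]
    """
    Q = torch.einsum("bnd,hdk->bhnk", X, W_q)            # per-head queries
    K = torch.einsum("bmd,dk->bmk", M, W_k)              # single shared key head
    V = torch.einsum("bmd,dv->bmv", M, W_v)              # single shared value head
    logits = torch.einsum("bhnk,bmk->bhnm", Q, K) / K.shape[-1] ** 0.5
    weights = torch.softmax(logits + mask, dim=-1)
    O = torch.einsum("bhnm,bmv->bhnv", weights, V)
    return torch.einsum("bhnv,hvd->bnd", O, W_o)

b, n, m, d, h, k, v = 2, 4, 6, 32, 8, 16, 16             # made-up sizes
X, M = torch.randn(b, n, d), torch.randn(b, m, d)
mask = torch.zeros(b, h, n, m)
out = multi_query_attention(X, M, mask,
                            torch.randn(h, d, k), torch.randn(d, k),
                            torch.randn(d, v), torch.randn(h, v, d))
print(out.shape)                                         # torch.Size([2, 4, 32])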
For attention masking, chatglm2-6b still follows the scheme used by GLM-10B:
Part A tokens can attend to each other, but cannot attend to any
tokens in B. Part B tokens can attend to Part A and antecedents in B,
but cannot attend to any subsequent tokens in B. To enable
autoregressive generation, each span is padded with special tokens
[START] and [END], for input and output respectively. In this way, our
model automatically learns a bidirectional encoder (for Part A) and a
unidirectional decoder (for Part B) in a unified model. (GLM, 2022)
In other words, Part A tokens can attend to each other but cannot attend to any token in Part B, while Part B tokens can attend to Part A and to earlier tokens in B, but not to later tokens in B.
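A minimal sketch of building this kind of mask (Part A bidirectional, Part B causal). The boolean convention (True = allowed to attend) and the function name are illustrative, not taken from the chatglm2-6b source:

import torch

def glm_attention_mask(len_a: int, len_b: int) -> torch.Tensor:
    """Return a [len_a + len_b, len_a + len_b] boolean mask, True = may attend."""
    total = len_a + len_b
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :len_a] = True                                        # everyone can see Part A
    causal = torch.tril(torch.ones(len_b, len_b, dtype=torch.bool))
    mask[len_a:, len_a:] = causal                                 # Part B sees itself causally
    mask[:len_a, len_a:] = False                                  # Part A never sees Part B
    return mask

print(glm_attention_mask(3, 2).int())
# rows = queries, columns = keys:
# 1 1 1 0 0
# 1 1 1 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1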