The Transformer model was proposed by Ashish Vaswani and colleagues on the Google team in the paper "Attention Is All You Need" (June 2017), and it has since become the model of choice in NLP. The Transformer abandons the sequential structure of RNNs in favor of the Self-Attention mechanism, which lets the model be trained in parallel and make full use of the global information in the training data; Seq2seq models equipped with a Transformer have brought significant improvements across NLP tasks. This article uses a large number of diagrams to explain how the Transformer works and the operational details of its components as clearly as possible, and a complete, runnable code example is given at the end.
The core mechanism of the Transformer is Self-Attention. The Self-Attention mechanism is inspired by human visual attention: when perceiving a scene, people tend to focus on its salient objects, and, to make sensible use of limited visual processing resources, they select a specific part of the visual field and concentrate on it. The main purpose of an attention mechanism is therefore to assign attention weights to the input, that is, to decide which parts of the input to attend to and to allocate the limited processing resources to the important parts.
Self-Attention works as shown in the figure above. Given input word embedding vectors $a^1,a^2,a^3 \in \mathbb{R}^{d_l \times 1}$, each input vector $a^i, i\in \{1,2,3\}$ is linearly transformed by the matrices $W^q\in \mathbb{R}^{d_k \times d_l}$, $W^k\in \mathbb{R}^{d_k \times d_l}$, $W^v\in \mathbb{R}^{d_l\times d_l}$ to obtain the Query vector $q^i\in\mathbb{R}^{d_k \times 1}$, the Key vector $k^i\in \mathbb{R}^{d_k \times 1}$, and the Value vector $v^i\in \mathbb{R}^{d_l \times 1}$, that is
$$\begin{cases} q^i = W^q \cdot a^i \\ k^i = W^k \cdot a^i \\ v^i = W^v \cdot a^i \end{cases},\quad i \in \{1,2,3\}$$
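Before moving on to Multi-Head Attention, here is a minimal PyTorch sketch of the single-head computation. The toy sizes $d_l=6$, $d_k=4$ and the random inputs are assumptions chosen purely for illustration; the scores are scaled by $\sqrt{d_k}$ as in the original paper.

```python
import torch

torch.manual_seed(0)
d_l, d_k, n = 6, 4, 3          # assumed toy sizes: embedding dim, key dim, 3 input vectors

A  = torch.randn(d_l, n)       # columns are the word embeddings a^1, a^2, a^3
Wq = torch.randn(d_k, d_l)     # W^q: (d_k x d_l)
Wk = torch.randn(d_k, d_l)     # W^k: (d_k x d_l)
Wv = torch.randn(d_l, d_l)     # W^v: (d_l x d_l)

Q = Wq @ A                     # q^i = W^q a^i, stacked as columns -> (d_k x n)
K = Wk @ A                     # k^i = W^k a^i                     -> (d_k x n)
V = Wv @ A                     # v^i = W^v a^i                     -> (d_l x n)

# scaled dot-product attention: alpha[i, j] = softmax_j(q^i . k^j / sqrt(d_k))
scores = (Q.T @ K) / d_k ** 0.5            # (n x n) matrix of q^i . k^j
alpha  = torch.softmax(scores, dim=-1)     # each row sums to 1
B = V @ alpha.T                            # column i is b^i = sum_j alpha[i, j] v^j
print(B.shape)                             # torch.Size([6, 3])
```

Because every output column $b^i$ is obtained from the same matrices $Q$, $K$, $V$ with no dependency between output positions, the whole computation reduces to a handful of matrix products, which is why Self-Attention parallelizes so well.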
Multi-Head Attention works very much like Self-Attention. For ease of visualization the figure uses 2 heads; if Multi-Head were instead set to 8 heads, each of $q^i,k^i,v^i, i\in\{1,2,3\}$ in the figure above would branch into 8 at the next step. Given input word embedding vectors $a^1,a^2,a^3 \in \mathbb{R}^{d_l \times 1}$, each input vector $a^i, i\in \{1,2,3\}$ first undergoes a linear transformation by the matrices $W^q\in \mathbb{R}^{d_k \times d_l}$, $W^k\in \mathbb{R}^{d_k \times d_l}$, $W^v\in \mathbb{R}^{d_l\times d_l}$ to give the Query vector $q^i\in\mathbb{R}^{d_k \times 1}$, the Key vector $k^i\in \mathbb{R}^{d_k \times 1}$, and the Value vector $v^i\in \mathbb{R}^{d_l \times 1}$. The Query vector $q^i$ then undergoes a second linear transformation by the matrices $W^{q1}\in \mathbb{R}^{d_m \times d_k}$ and $W^{q2}\in \mathbb{R}^{d_m\times d_k}$ to give $q^{i1}\in \mathbb{R}^{d_m \times 1}$ and $q^{i2}\in \mathbb{R}^{d_m\times 1}$; similarly, the Key vector $k^i$ is transformed by $W^{k1}\in \mathbb{R}^{d_m \times d_k}$ and $W^{k2}\in \mathbb{R}^{d_m\times d_k}$ to give $k^{i1}\in \mathbb{R}^{d_m\times 1}$ and $k^{i2}\in \mathbb{R}^{d_m\times 1}$, and the Value vector $v^i$ is transformed by $W^{v1}\in \mathbb{R}^{\frac{d_l}{2}\times d_l}$ and $W^{v2}\in \mathbb{R}^{\frac{d_l}{2}\times d_l}$ to give $v^{i1}\in \mathbb{R}^{\frac{d_l}{2}\times 1}$ and $v^{i2}\in \mathbb{R}^{\frac{d_l}{2}\times 1}$. The exact formulas are:
$$\begin{cases} q^{ih} = W^{qh}\cdot W^{q}\cdot a^i \\ k^{ih} = W^{kh}\cdot W^{k}\cdot a^i \\ v^{ih} = W^{vh}\cdot W^{v}\cdot a^i \end{cases},\quad i\in\{1,2,3\},\ h\in\{1,2\}$$
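A minimal sketch of this two-step transformation for the Query path with 2 heads; the sizes $d_l=6$, $d_k=4$, $d_m=3$ are assumptions for illustration only, and the Key and Value paths follow the same pattern.

```python
import torch

torch.manual_seed(0)
d_l, d_k, d_m, n = 6, 4, 3, 3     # assumed toy sizes; n = 3 input vectors

A   = torch.randn(d_l, n)         # columns are a^1, a^2, a^3
Wq  = torch.randn(d_k, d_l)       # first linear map, shared by all heads
Wq1 = torch.randn(d_m, d_k)       # second linear map for head 1
Wq2 = torch.randn(d_m, d_k)       # second linear map for head 2

Q  = Wq @ A                       # q^i    = W^q a^i          -> (d_k x n)
Q1 = Wq1 @ Q                      # q^{i1} = W^{q1} W^q a^i   -> (d_m x n)
Q2 = Wq2 @ Q                      # q^{i2} = W^{q2} W^q a^i   -> (d_m x n)
print(Q1.shape, Q2.shape)         # torch.Size([3, 3]) torch.Size([3, 3])
```

Each head $h$ then runs scaled dot-product attention on its own triples $(q^{ih}, k^{ih}, v^{ih})$, and the per-head outputs are concatenated and passed through one more linear layer to form the final output.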
As shown in the left half of the figure below, each output vector $b^i, i \in \{1,2,3,4\}$ of Self-Attention aggregates information from all of the input vectors $a^i, i \in \{1,2,3,4\}$; since every $b^i$ is computed from the same set of inputs with no sequential dependency between outputs, Self-Attention can be parallelized in practice. As shown in the right half of the figure below, each output vector $b^i$ of Masked Self-Attention only uses the information of the input vectors $a^i$ that are already known. For example, $b^1$ depends only on $a^1$; $b^2$ depends on $a^1$ and $a^2$; $b^3$ depends on $a^1$, $a^2$ and $a^3$; and $b^4$ depends on $a^1$, $a^2$, $a^3$ and $a^4$. Masked Self-Attention also appears inside the Transformer, in the Decoder.
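A minimal sketch of the masking step, continuing the toy matrix notation from above (the sizes and random inputs are assumptions): an upper-triangular mask sets the score of every "future" position to a very large negative number before the softmax, so $b^i$ receives zero weight from any $a^j$ with $j>i$.

```python
import torch

torch.manual_seed(0)
n, d_k = 4, 4                                  # 4 positions, assumed key dimension
scores = torch.randn(n, n) / d_k ** 0.5        # stand-in for q^i . k^j / sqrt(d_k)

causal = torch.tril(torch.ones(n, n))          # 1 on and below the diagonal
masked = scores.masked_fill(causal == 0, float("-1e20"))
alpha  = torch.softmax(masked, dim=-1)
print(alpha)                                   # row i has non-zero weights only for j <= i
```

This is exactly what `make_trg_mask` and `masked_fill` do in the full implementation at the end of the article.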
The above has dissected the core of the Transformer, the self-attention mechanism, in detail; next the Transformer model architecture is introduced. The Transformer model consists of two modules, an Encoder and a Decoder, as shown in the diagram below. To show the internal operations of the Transformer more clearly, the diagram explains the Transformer from the viewpoint of matrix operations.

The Encoder module operates as follows:
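In brief, and consistent with the `TransformerBlock` code at the end of the article, each Encoder layer applies multi-head self-attention to its input $X$ and then a position-wise feed-forward network, each sub-layer wrapped in a residual connection and layer normalization (dropout omitted; the notation below is a summary of mine, not taken from the figure):
$$Z' = \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X, X, X)\big),\qquad Z = \mathrm{LayerNorm}\big(Z' + \mathrm{FFN}(Z')\big)$$
Stacking `num_layers` such layers on top of the word embeddings plus position embeddings gives the Encoder output $Z$, which is later fed to every Decoder layer.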
The Decoder module operates as follows:
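Again in brief, and consistent with the `DecoderBlock` code at the end of the article, each Decoder layer first runs masked multi-head self-attention over the target representation $Y$, then attends over the Encoder output $Z$ with the Query taken from the Decoder side, and finally applies the feed-forward network, each sub-layer again with a residual connection and layer normalization (dropout omitted; the notation is mine):
$$Y' = \mathrm{LayerNorm}\big(Y + \mathrm{MaskedMultiHead}(Y, Y, Y)\big)$$
$$Y'' = \mathrm{LayerNorm}\big(Y' + \mathrm{MultiHead}(Q{=}Y',\,K{=}Z,\,V{=}Z)\big),\qquad \mathrm{out} = \mathrm{LayerNorm}\big(Y'' + \mathrm{FFN}(Y'')\big)$$
A final linear layer (`fc_out` in the code) maps the last layer's output to scores over the target vocabulary.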
The complete Transformer code example below comes from a (foreign) blogger's video, and a few of its details are discussed here in light of the explanation above. From the Multi-Head Attention diagram above, strictly speaking Multi-Head Attention performs two linear transformations on the way to the attention distribution. Given an input vector $x\in \mathbb{R}^{256\times 1}$, the first "linear transformation" in the program below simply assigns $x$ to $q$, $k$ and $v$, which causes no ambiguity here. The second linear transformation produces the multiple heads: with $\mathrm{Head}=8$, in principle $q$ should be transformed by 8 matrices $W^{q1},\cdots,W^{q8}$ to obtain $q^{1},\cdots,q^{8}$; likewise $k$ should be transformed by $W^{k1},\cdots,W^{k8}$ to obtain $k^{1},\cdots,k^{8}$, and $v$ by $W^{v1},\cdots,W^{v8}$ to obtain $v^{1},\cdots,v^{8}$. Implemented that way, the program would need to define 24 separate weight matrices, which is cumbersome. The program below instead defines the weights in a simpler way that realizes the same kind of multi-Head linear transformation; take the vector $q = (q_1,\cdots, q_{256})^{\top}\in \mathbb{R}^{256 \times 1}$ as an example:
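What follows is my reading of how the code realizes this, using the block sizes implied by `embed_size = 256` and `heads = 8`: the 256-dimensional vector is reshaped into 8 sub-vectors of dimension $\mathrm{head\_dim}=32$,
$$q = (q_1,\dots,q_{256})^{\top} \;\longmapsto\; q^{h} = (q_{32(h-1)+1},\dots,q_{32h})^{\top}\in\mathbb{R}^{32\times 1},\quad h=1,\dots,8,$$
and every sub-vector is then passed through the same `nn.Linear(self.head_dim, self.head_dim)` layer. This is equivalent to choosing each $W^{qh}$ to act only on the $h$-th 32-dimensional block of $q$, with one shared $32\times 32$ weight block, so a single small weight matrix per projection replaces the 24 separate matrices. The Key and Value vectors are split in exactly the same way.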
import torch
import torch.nn as nn
import os


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert (self.head_dim * heads == embed_size), "Embed size needs to be div by heads"
        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        # split embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
        # queries shape: (N, query_len, heads, heads_dim)
        # keys shape: (N, key_len, heads, heads_dim)
        # energy shape: (N, heads, query_len, key_len)
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        attention = torch.softmax(energy / (self.embed_size ** (1 / 2)), dim=3)
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(N, query_len, self.heads * self.head_dim)
        # attention shape: (N, heads, query_len, key_len)
        # values shape: (N, value_len, heads, heads_dim)
        # out shape after reshape: (N, query_len, heads * head_dim)
        out = self.fc_out(out)
        return out


class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out


class Encoder(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        embed_size,
        num_layers,
        heads,
        device,
        forward_expansion,
        dropout,
        max_length,
    ):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList(
            [
                TransformerBlock(
                    embed_size,
                    heads,
                    dropout=dropout,
                    forward_expansion=forward_expansion,
                )
                for _ in range(num_layers)
            ]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))
        for layer in self.layers:
            out = layer(out, out, out, mask)
        return out


class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(
            embed_size, heads, dropout, forward_expansion
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        attention = self.attention(x, x, x, trg_mask)
        query = self.dropout(self.norm(attention + x))
        out = self.transformer_block(value, key, query, src_mask)
        return out


class Decoder(nn.Module):
    def __init__(
        self,
        trg_vocab_size,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        device,
        max_length,
    ):
        super(Decoder, self).__init__()
        self.device = device
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList(
            [DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
             for _ in range(num_layers)]
        )
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout((self.word_embedding(x) + self.position_embedding(positions)))
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)
        out = self.fc_out(x)
        return out


class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        embed_size=256,
        num_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0,
        device="cuda",
        max_length=100
    ):
        super(Transformer, self).__init__()
        self.encoder = Encoder(
            src_vocab_size,
            embed_size,
            num_layers,
            heads,
            device,
            forward_expansion,
            dropout,
            max_length
        )
        self.decoder = Decoder(
            trg_vocab_size,
            embed_size,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            device,
            max_length
        )
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        # (N, 1, 1, src_len)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out


if __name__ == '__main__':
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(device)
    trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)
    src_pad_idx = 0
    trg_pad_idx = 0
    src_vocab_size = 10
    trg_vocab_size = 10
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, device=device).to(device)
    out = model(x, trg[:, :-1])
    print(out.shape)
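If run as-is, `x` has shape $(2, 9)$ and `trg[:, :-1]` has shape $(2, 7)$, so the Decoder output `out` has shape $(N, \mathrm{trg\_len}, \mathrm{trg\_vocab\_size}) = (2, 7, 10)$ and the script should print `torch.Size([2, 7, 10])`.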