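Multi-head attention is a core component of the Transformer architecture, which has achieved notable success in machine translation and other natural language processing (NLP) tasks. It was introduced to strengthen the model by letting it attend to different parts of the input sequence from different perspectives, and so capture information at more than one level.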
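In multi-head attention, the number-of-heads parameter (num_heads in the code below) is the number of independent attention sub-layers that run in parallel. Each head performs attention (self-attention or general attention) on its own; the head outputs are then concatenated and passed through a linear transformation to produce the final output.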
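To make the role of the head count concrete, here is a small sketch that uses PyTorch's built-in nn.MultiheadAttention (not the class defined later in this article). The tensor sizes and the head counts tried are illustrative assumptions. It suggests that changing the number of heads changes the per-head dimension, but not the output shape or the parameter count:

```python
import torch
import torch.nn as nn

d_model = 512                      # illustrative feature dimension
x = torch.randn(2, 10, d_model)    # (batch_size, sequence_length, d_model)

results = []
with torch.no_grad():
    for num_heads in (1, 4, 8):
        # batch_first=True makes the module accept (batch, seq, feature) inputs
        mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                    batch_first=True)
        out, _ = mha(x, x, x)      # self-attention: query = key = value
        n_params = sum(p.numel() for p in mha.parameters())
        results.append((num_heads, d_model // num_heads, tuple(out.shape), n_params))

for num_heads, d_k, shape, n_params in results:
    print(f"heads={num_heads} d_k={d_k} output={shape} params={n_params}")
```

The heads divide the same d_model features among themselves, which is why the total size of the layer stays fixed.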
The steps of multi-head attention, in detail:
Linear projections:
Suppose the input dimension is \(d_{model}\), the number of heads is \(h\), and the dimension of each head is \(d_k = d_{model} / h\).
For an input \(\mathbf{X}\), we have:
\[ \mathbf{Q}_i = \mathbf{X} \mathbf{W}_i^Q, \quad \mathbf{K}_i = \mathbf{X} \mathbf{W}_i^K, \quad \mathbf{V}_i = \mathbf{X} \mathbf{W}_i^V \]
where \(i\) denotes the \(i\)-th head and \(\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V\) are the linear projection matrices, each of shape \(d_{model} \times d_k\).
Compute attention:
The formula for scaled dot-product attention is:
\[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}}\right) \mathbf{V} \]
Each head applies it to its own projections: \(\text{head}_i = \text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)\).
Concatenation:
With \(h\) heads, each producing an output of dimension \(d_k\), the concatenated output has dimension \(h \times d_k = d_{model}\).
Output linear transformation:
\[ \text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) \mathbf{W}^O \]
where \(\mathbf{W}^O\) is the output projection matrix.
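The four steps above can be sketched head by head, with one explicit set of matrices per head so the code lines up with the formulas. This is a minimal illustration, not an efficient implementation: the function name multi_head_attention, the random weights, and the small sizes are assumptions chosen for readability, and the input is a single unbatched sequence used as query, key, and value (self-attention).

```python
import math
import torch


def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Per-head self-attention that mirrors the formulas step by step.

    x:             (seq_len, d_model) input X
    w_q, w_k, w_v: (h, d_model, d_k) stacks of W_i^Q, W_i^K, W_i^V
    w_o:           (h * d_k, d_model) output projection W^O
    """
    num_heads, _, d_k = w_q.shape
    heads = []
    for i in range(num_heads):
        # Step 1: per-head linear projections Q_i, K_i, V_i
        q_i, k_i, v_i = x @ w_q[i], x @ w_k[i], x @ w_v[i]
        # Step 2: scaled dot-product attention for head i
        scores = q_i @ k_i.T / math.sqrt(d_k)        # (seq_len, seq_len)
        weights = torch.softmax(scores, dim=-1)      # each row sums to 1
        heads.append(weights @ v_i)                  # (seq_len, d_k)
    # Step 3: concatenate the heads along the feature axis
    concat = torch.cat(heads, dim=-1)                # (seq_len, h * d_k)
    # Step 4: output linear transformation
    return concat @ w_o                              # (seq_len, d_model)


torch.manual_seed(0)
seq_len, d_model, h = 4, 8, 2        # illustrative sizes
d_k = d_model // h
x = torch.randn(seq_len, d_model)
w_q = torch.randn(h, d_model, d_k)
w_k = torch.randn(h, d_model, d_k)
w_v = torch.randn(h, d_model, d_k)
w_o = torch.randn(h * d_k, d_model)

out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)
```

The explicit loop over heads is only for clarity; practical implementations usually compute all heads in one batched matrix multiplication.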
The following simple PyTorch example implements multi-head attention:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads  # dimension of each head

        # Each d_model -> d_model projection covers all heads at once
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        batch_size = query.size(0)

        # Linear projections, then split the feature axis into heads:
        # (batch, seq, d_model) -> (batch, num_heads, seq, d_k)
        query = self.query_linear(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        key = self.key_linear(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        value = self.value_linear(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention, computed for all heads in parallel
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.d_k)
        attention = F.softmax(scores, dim=-1)
        output = torch.matmul(attention, value)

        # Concatenate the heads and apply the output projection:
        # (batch, num_heads, seq, d_k) -> (batch, seq, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.out_linear(output)
        return output


# Example usage
d_model = 512
num_heads = 8
batch_size = 64
sequence_length = 10

mha = MultiHeadAttention(d_model, num_heads)
query = torch.randn(batch_size, sequence_length, d_model)
key = torch.randn(batch_size, sequence_length, d_model)
value = torch.randn(batch_size, sequence_length, d_model)

output = mha(query, key, value)
print(output.shape)  # torch.Size([64, 10, 512])
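In this example:
- d_model is the feature dimension of the input and the output.
- num_heads is the number of heads.
- d_k is the dimension of each head.
- query, key, and value each have shape (batch_size, sequence_length, d_model).
- The output also has shape (batch_size, sequence_length, d_model).

By running several independent attention heads in parallel, multi-head attention strengthens the model's representational and learning capacity.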
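One detail of the PyTorch example that may not be obvious is how a single d_model-to-d_model nn.Linear stands in for the \(h\) separate matrices \(\mathbf{W}_i^Q\) in the formulas. The sketch below, with small illustrative sizes that are assumptions of this example, checks that splitting the fused projection with view and transpose gives the same result as projecting with a per-head slice of the weight. Note that nn.Linear also adds a bias by default, which the formulas omit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4   # illustrative sizes
d_k = d_model // num_heads

x = torch.randn(batch_size, seq_len, d_model)
query_linear = nn.Linear(d_model, d_model)   # fused projection, as in the class

with torch.no_grad():
    # Fused path: project once, then split the feature axis into heads.
    fused = query_linear(x).view(batch_size, -1, num_heads, d_k).transpose(1, 2)
    # fused has shape (batch_size, num_heads, seq_len, d_k)

    # Per-head path: nn.Linear stores weight as (out_features, in_features)
    # and computes x @ weight.T + bias, so head i owns output rows
    # [i * d_k, (i + 1) * d_k) of the weight and bias.
    for i in range(num_heads):
        rows = slice(i * d_k, (i + 1) * d_k)
        w_i = query_linear.weight[rows].T    # (d_model, d_k), the role of W_i^Q
        b_i = query_linear.bias[rows]        # (d_k,)
        q_i = x @ w_i + b_i                  # (batch_size, seq_len, d_k)
        assert torch.allclose(fused[:, i], q_i, atol=1e-5)

print("fused projection matches the per-head projections")
```

Fusing the heads into one matrix is a design choice for efficiency: one large matrix multiplication replaces \(h\) small ones, while the computation stays the same.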