1 INTRODUCTION
Various X-formers improve upon the vanilla Transformer from several perspectives: model efficiency, model generalization, and model adaptation.
2 BACKGROUND
2.1 vanilla Transformer
The vanilla Transformer is a sequence-to-sequence model consisting of an encoder and a decoder, each of which is a stack of L identical blocks.
Each encoder block is composed of a multi-head self-attention module and a position-wise FFN. To build a deeper model, a residual connection is employed around each module, followed by layer normalization.
Compared with the encoder block, each decoder block additionally inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN, and its self-attention is masked so that each position cannot attend to subsequent positions.
2.1.1 Attention Modules
The scaled dot-product attention used by the Transformer is given by
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V = AV \qquad (1)$$
Query-Key-Value (QKV) model: $Q \in \mathbb{R}^{N \times D_k}$, $K \in \mathbb{R}^{M \times D_k}$, $V \in \mathbb{R}^{M \times D_v}$.
N, M: lengths of the queries and keys (or values); $D_k$, $D_v$: dimensions of the keys (or queries) and values.
A: the attention matrix.
The factor $\sqrt{D_k}$ in Eq. (1) is introduced to alleviate the gradient vanishing problem of the softmax function.
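As a concrete illustration, here is a minimal NumPy sketch of Eq. (1); the function name `scaled_dot_product_attention` and the toy shapes are my own choices, not from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v)
    D_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D_k))   # attention matrix, shape (N, M)
    return A @ V                          # output, shape (N, D_v)

# toy usage
Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 16)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 16)
```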
Multi-head attention:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V) \qquad (2)$$
$$\mathrm{MultiHeadAttn}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^O \qquad (3)$$
Eq. (2) projects the queries, keys, and values from $D_m$ dimensions down to $D_k$, $D_k$, and $D_v$ dimensions; Eq. (3) concatenates the H heads and projects the result back to $D_m$ dimensions.
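A short sketch of Eqs. (2)–(3), building on `scaled_dot_product_attention` from the sketch above; the random matrices merely stand in for the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$, and the dimensions are illustrative.

```python
def multi_head_attention(Q, K, V, H=4, D_m=8, D_k=2, D_v=2):
    # Q, K, V: (·, D_m) inputs in model dimension
    heads = []
    for _ in range(H):
        W_q, W_k, W_v = (np.random.randn(D_m, D_k),
                         np.random.randn(D_m, D_k),
                         np.random.randn(D_m, D_v))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = np.random.randn(H * D_v, D_m)
    return np.concatenate(heads, axis=-1) @ W_o   # back to (·, D_m)

X = np.random.randn(5, 8)
out = multi_head_attention(X, X, X)   # (5, 8)
```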
Categories:
Depending on the source of the queries, keys, and values, attention falls into three types: self-attention (in the encoder), masked self-attention (in the decoder), and cross-attention (queries from the decoder, keys and values from the encoder outputs).
2.1.2 Position-wise FFN.
This is essentially a fully connected feed-forward module applied to each position independently:
$$\mathrm{FFN}(H') = \mathrm{ReLU}(H'W^1 + b^1)\,W^2 + b^2$$
where $H'$ is the output of the previous layer.
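A tiny sketch of the position-wise FFN; the inner dimension of 4·D_m is the common default, and the weight shapes here are illustrative.

```python
import numpy as np

def position_wise_ffn(H, W1, b1, W2, b2):
    # applied to each position (each row of H) independently
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2   # ReLU(H W1 + b1) W2 + b2

D_m, D_f = 8, 32                      # inner dimension commonly 4 * D_m
H_prev = np.random.randn(5, D_m)      # H': outputs of the previous layer
out = position_wise_ffn(H_prev,
                        np.random.randn(D_m, D_f), np.zeros(D_f),
                        np.random.randn(D_f, D_m), np.zeros(D_m))
```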
2.1.3 Residual Connection and Normalization.
A residual connection is added around each module, followed by layer normalization.
2.1.4 Position Encodings.
The Transformer itself is oblivious to positional information, so additional position encodings are needed to compensate.
2.2 Model Usage
2.3 Model Analysis
D: hidden dimension
T: input sequence length
Self-attention has complexity O(T²·D), while the position-wise FFN has complexity O(T·D²): self-attention dominates for long sequences, whereas the FFN dominates when the hidden dimension is large.
2.4 Comparing Transformer to Other Network Types
2.4.1 Analysis of Self-Attention.
2.4.2 In Terms of Inductive Bias
3 TAXONOMY OF TRANSFORMERS
4 ATTENTION
Self-attention faces two main challenges: computational complexity (quadratic in the sequence length) and the lack of structural prior.
Improvements to the attention mechanism fall into six directions: sparse attention, linearized attention, prototype and memory compression, low-rank self-attention, attention with prior, and improved multi-head mechanisms.
4.1 Sparse Attention (two classes: position-based sparse attention and content-based sparse attention)
In the standard self-attention mechanism, every token needs to attend to all other tokens; in practice, however, the attention matrix A is sparse for most data points.
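A minimal sketch of position-based sparsity: attention logits outside an allowed pattern (here a band/local window plus position 0 as a "global" token, both illustrative choices) are masked to −∞ before the softmax. Note that this mask-based sketch still computes the full T×T score matrix; the real efficiency gain comes from computing only the allowed entries.

```python
import numpy as np

def band_sparse_attention(Q, K, V, window=2):
    # Q, K: (T, D_k), V: (T, D_v); each position attends only to positions
    # within `window` steps of itself, plus position 0 as a global token.
    T, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)
    i = np.arange(T)[:, None]; j = np.arange(T)[None, :]
    allowed = (np.abs(i - j) <= window) | (j == 0)
    scores = np.where(allowed, scores, -np.inf)       # mask out disallowed positions
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V
```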
4.1.1 Position-based Sparse Attention.
4.1.1.1 Atomic Sparse Attention.
4.1.1.2 Compound Sparse Attention
4.1.1.3 Extended Sparse Attention
4.1.2 Content-based Sparse Attention.
Cluster the queries and keys on the same set of centroid vectors (Routing Transformer).
(I did not fully understand the meaning of this formula.)
b: number of hash buckets
R: a random matrix of size $[D_k, b/2]$
The LSH (locality-sensitive hashing) function is computed as $h(x) = \arg\max([xR; -xR])$; queries and keys that fall into the same bucket attend to each other.
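A sketch of the LSH bucketing step only, following the hash function above; grouping same-bucket queries/keys and attending within buckets is omitted.

```python
import numpy as np

def lsh_bucket(x, R):
    # x: (T, D_k) queries/keys; R: (D_k, b // 2) random matrix
    proj = x @ R                                                        # (T, b/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)   # bucket id in [0, b)

D_k, b = 8, 4
R = np.random.randn(D_k, b // 2)
buckets = lsh_bucket(np.random.randn(10, D_k), R)   # one bucket id per position
```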
4.2 Linearized Attention
Linearized attention reduces the complexity of computing attention from $\mathcal{O}(T^2)$ to $\mathcal{O}(T)$ by approximating $\mathrm{softmax}(QK^\top)V$ with $\phi(Q)\big(\phi(K)^\top V\big)$.
$Z = \sum_j \phi(k_j)$: the normalizer term.
$\phi(\cdot)$: the feature map, applied row-wise.
Deriving the formula from a vector perspective gives more insight into linearized attention:
$$z_i^\top = \frac{\sum_j \big(\phi(q_i)^\top \phi(k_j)\big)\, v_j^\top}{\sum_j \phi(q_i)^\top \phi(k_j)} = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)} = \frac{\phi(q_i)^\top S}{\phi(q_i)^\top Z}, \qquad S = \sum_j \phi(k_j)\, v_j^\top .$$
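Putting the derivation into code: a minimal sketch of linearized attention using $\phi(x) = \mathrm{elu}(x) + 1$ as the feature map (this particular choice of feature map is an assumption here; §4.2.1 discusses alternatives). The cost is linear in both sequence lengths.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, element-wise; keeps all features positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v)
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)
    S = phi_K.T @ V           # (D_k, D_v): sum_j phi(k_j) v_j^T
    Z = phi_K.sum(axis=0)     # (D_k,):     sum_j phi(k_j)
    return (phi_Q @ S) / (phi_Q @ Z)[:, None]   # (N, D_v)
```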
4.2.1 Feature Maps
Performer [18]: the first version's random feature maps do not guarantee non-negative attention scores, which leads to instability and abnormal behavior.
Performer [19]: the second version guarantees unbiased estimation and non-negative outputs, and is more stable than Performer [18].
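To make the contrast concrete, here is a sketch of positive random features for the (unnormalized) softmax kernel, using the identity $\exp(q^\top k) = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[e^{w^\top q - \|q\|^2/2}\, e^{w^\top k - \|k\|^2/2}\big]$; the features are strictly positive, so the estimated attention scores are too. The orthogonal-random-matrix trick and scaling details of the actual Performer are omitted.

```python
import numpy as np

def positive_random_features(X, W):
    # X: (T, d); W: (m, d) with rows drawn from N(0, I_d)
    # phi(x)_i = exp(w_i . x - ||x||^2 / 2) / sqrt(m)  -- strictly positive
    sq_norm = 0.5 * (X ** 2).sum(axis=-1, keepdims=True)
    return np.exp(X @ W.T - sq_norm) / np.sqrt(W.shape[0])

d, m, T = 8, 512, 5
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))
W = rng.normal(size=(m, d))
approx = positive_random_features(Q, W) @ positive_random_features(K, W).T
exact = np.exp(Q @ K.T)   # unbiased Monte-Carlo estimate of this kernel, always >= 0
```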
4.2.2 Aggregation Rule:
4.3 Query Prototyping and Memory Compression
4.3.1 Attention with Prototype Queries
decreasing the number of queries with query prototyping
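A minimal sketch of the idea, under my own simplifying assumptions: prototypes are obtained by strided mean-pooling of the queries, attention is computed only for the prototypes, and each prototype's output is copied back to the query positions it represents.

```python
import numpy as np

def prototype_query_attention(Q, K, V, stride=2):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v); assumes N % stride == 0
    Q_p = Q.reshape(-1, stride, Q.shape[1]).mean(axis=1)      # (N/stride, D_k) prototypes
    scores = Q_p @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    out_p = (A / A.sum(axis=-1, keepdims=True)) @ V           # outputs for prototypes only
    return np.repeat(out_p, stride, axis=0)                   # copy back to all N positions
```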
4.3.2 Attention with Compressed Key-Value Memory
Reducing the number of key-value pairs before applying the attention mechanism, e.g., by compressing the keys and values with a strided convolution.
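A minimal sketch of compressing the key-value memory before attention; non-overlapping mean pooling stands in for a learned strided convolution, and the stride is an illustrative choice.

```python
import numpy as np

def compress_memory(X, stride=2):
    # stand-in for a learned strided convolution: non-overlapping mean pooling
    M = (X.shape[0] // stride) * stride
    return X[:M].reshape(-1, stride, X.shape[1]).mean(axis=1)

def attention_with_compressed_memory(Q, K, V, stride=2):
    # only M/stride key-value pairs remain to be attended to
    K_c, V_c = compress_memory(K, stride), compress_memory(V, stride)
    scores = Q @ K_c.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V_c
```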
4.4 Low-rank Self-Attention
4.4.1 Low-rank Parameterization
Restrict the dimension $D_k$ so that the attention matrix is parameterized with low rank.
4.4.2 Low-rank Approximation
kernel approximation with random feature maps
Nyström method.
4.5 Attention with Prior
4.5.1 Prior that Models Locality
A higher prior score indicates a higher prior probability of attending to the corresponding position (a sketch follows the examples below).
Yang et al. [156]
Gaussian Transformer
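A sketch of a locality prior: a Gaussian bias centered on each query's own position is added to the attention logits before the softmax, so nearby positions receive higher prior probability. The fixed σ and the centering choice are illustrative assumptions; the works above predict or parameterize these quantities instead.

```python
import numpy as np

def attention_with_locality_prior(Q, K, V, sigma=2.0):
    # Q, K: (T, D_k), V: (T, D_v); self-attention over a single sequence
    T, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)
    i = np.arange(T)[:, None]; j = np.arange(T)[None, :]
    prior = -((j - i) ** 2) / (2.0 * sigma ** 2)   # closer to 0 (higher) near the query position
    logits = scores + prior
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V
```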
4.5.2 Prior from Lower Modules.
Attention distributions tend to be similar in adjacent layers, so the previous layer's attention scores can serve as a prior for the current layer:
$$\hat{A}^{(l)} = w_1 \cdot A^{(l)} + w_2 \cdot g\big(A^{(l-1)}\big)$$
$A^{(l)}$: attention scores of the $l$-th layer;
$w_1, w_2$: weights applied to the scores from the adjacent layers;
$g(\cdot)$: a function translating the previous layer's scores into a prior.
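A sketch of blending the current layer's scores with the previous layer's, following the formula above. Taking g as the identity and w1 = w2 = 0.5 is an illustrative simplification; the actual methods learn or fix these components differently.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_lower_layer_prior(Q, K, V, prev_scores, w1=0.5, w2=0.5, g=lambda s: s):
    # prev_scores: unnormalized attention scores A^(l-1) from the previous layer, shape (N, M)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # A^(l)
    blended = w1 * scores + w2 * g(prev_scores)    # \hat{A}^(l)
    return softmax(blended) @ V, scores            # also return scores for the next layer
```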
4.5.3 Prior as Multi-task Adapters
$\oplus$: direct sum;
$A_j$: trainable parameters;
$\gamma(\cdot), \beta(\cdot)$: Feature Wise Linear Modulation (FiLM) functions.
4.5.4 Attention with Only Prior
4.6 Improved Multi-Head Mechanism
4.6.1 Head Behavior Modeling.
4.6.2 Multi-head with Restricted Spans
• Locality. Restricting the attention span introduces an explicit locality bias, which helps when local dependencies dominate.
• Efficiency. With an appropriate implementation, the model can scale to long sequences without quadratic memory and computation (see the sketch below).
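A sketch of per-head restricted spans: each head masks out positions beyond its own window before the softmax. The hard windows and the shared Q/K/V across heads (head-specific projections omitted for brevity) are simplifying assumptions; adaptive-span methods learn a soft mask instead.

```python
import numpy as np

def restricted_span_heads(Q, K, V, spans=(1, 2, 4, 8)):
    # Q, K: (T, D_k), V: (T, D_v); one hard local window per head, outputs concatenated
    T, D_k = Q.shape
    i = np.arange(T)[:, None]; j = np.arange(T)[None, :]
    outputs = []
    for span in spans:                                # one window size per head
        scores = Q @ K.T / np.sqrt(D_k)
        scores = np.where(np.abs(i - j) <= span, scores, -np.inf)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        outputs.append((A / A.sum(axis=-1, keepdims=True)) @ V)
    return np.concatenate(outputs, axis=-1)           # (T, len(spans) * D_v)
```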
4.6.3 Multi-head with Refined Aggregation.
The goal is to best balance translation performance and computational efficiency.
4.6.4 Other Modifications.