1 INTRODUCTION
Various X-formers improve upon the vanilla Transformer from several perspectives: model efficiency, model generalization, and model adaptation.
2 BACKGROUND
2.1 vanilla Transformer
The vanilla Transformer is a sequence-to-sequence model consisting of an encoder and a decoder, each of which is a stack of L identical blocks.
Each encoder block is composed of a multi-head self-attention module and a position-wise FFN. To build a deeper model, a residual connection is employed around each module, followed by layer normalization.
Compared with the encoder block, each decoder block additionally inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN, and its self-attention is masked so that each position cannot attend to subsequent positions.
2.1.1 Attention Modules
The scaled dot-product attention used by the Transformer is given by
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V = AV \qquad (1)$$
Query-Key-Value (QKV) model: $Q \in \mathbb{R}^{N \times D_k}$, $K \in \mathbb{R}^{M \times D_k}$, $V \in \mathbb{R}^{M \times D_v}$.
N, M: lengths of the queries and keys (or values); $D_k$, $D_v$: dimensions of the keys (or queries) and values.
A: the attention matrix.
The factor $\sqrt{D_k}$ in Eq. (1) is introduced to alleviate the gradient vanishing problem of the softmax function.
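As a concrete illustration, here is a minimal NumPy sketch of Eq. (1); the function name `scaled_dot_product_attention` and the toy shapes are my own choices, not from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v)
    D_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D_k))   # attention matrix, shape (N, M)
    return A @ V                          # output, shape (N, D_v)

# toy usage
Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 16)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 16)
```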
Multi-head attention:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V) \qquad (2)$$
$$\mathrm{MultiHeadAttn}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^O \qquad (3)$$
Eq. (2) projects the queries, keys, and values from $D_m$ dimensions down to $D_k$, $D_k$, and $D_v$ dimensions; Eq. (3) concatenates the H heads and projects the result back to $D_m$ dimensions.
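A short sketch of Eqs. (2)–(3), building on `scaled_dot_product_attention` from the sketch above; the random matrices merely stand in for the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$, and the dimensions are illustrative.

```python
def multi_head_attention(Q, K, V, H=4, D_m=8, D_k=2, D_v=2):
    # Q, K, V: (·, D_m) inputs in model dimension
    heads = []
    for _ in range(H):
        W_q, W_k, W_v = (np.random.randn(D_m, D_k),
                         np.random.randn(D_m, D_k),
                         np.random.randn(D_m, D_v))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = np.random.randn(H * D_v, D_m)
    return np.concatenate(heads, axis=-1) @ W_o   # back to (·, D_m)

X = np.random.randn(5, 8)
out = multi_head_attention(X, X, X)   # (5, 8)
```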
Categories:
Depending on the source of the queries, keys, and values, attention falls into three types: self-attention (in the encoder), masked self-attention (in the decoder), and cross-attention (queries from the decoder, keys and values from the encoder outputs).
2.1.2 Position-wise FFN.
This is essentially a fully connected feed-forward module applied to each position independently:
$$\mathrm{FFN}(H') = \mathrm{ReLU}(H'W^1 + b^1)\,W^2 + b^2$$
where $H'$ is the output of the previous layer.
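A tiny sketch of the position-wise FFN; the inner dimension of 4·D_m is the common default, and the weight shapes here are illustrative.

```python
import numpy as np

def position_wise_ffn(H, W1, b1, W2, b2):
    # applied to each position (each row of H) independently
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2   # ReLU(H W1 + b1) W2 + b2

D_m, D_f = 8, 32                      # inner dimension commonly 4 * D_m
H_prev = np.random.randn(5, D_m)      # H': outputs of the previous layer
out = position_wise_ffn(H_prev,
                        np.random.randn(D_m, D_f), np.zeros(D_f),
                        np.random.randn(D_f, D_m), np.zeros(D_m))
```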
2.1.3 Residual Connection and Normalization.
A residual connection is added around each module, followed by layer normalization.
2.1.4 Position Encodings.
The Transformer itself is oblivious to positional information, so additional position encodings are needed to compensate.
2.2 Model Usage
2.3 Model Analysis
D: hidden dimension
T: input sequence length
Self-attention has complexity O(T²·D), while the position-wise FFN has complexity O(T·D²): self-attention dominates for long sequences, whereas the FFN dominates when the hidden dimension is large.
2.4 Comparing Transformer to Other Network Types
2.4.1 Analysis of Self-Attention.
2.4.2 In Terms of Inductive Bias
3 TAXONOMY OF TRANSFORMERS
4 ATTENTION
Self-attention faces two main challenges: computational complexity (quadratic in the sequence length) and the lack of structural prior.
Improvements to the attention mechanism fall into six directions: sparse attention, linearized attention, prototype and memory compression, low-rank self-attention, attention with prior, and improved multi-head mechanisms.
4.1 Sparse Attention (two classes: position-based sparse attention and content-based sparse attention)
In the standard self-attention mechanism, every token needs to attend to all other tokens; in practice, however, the attention matrix A is sparse for most data points.
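A minimal sketch of position-based sparsity: attention logits outside an allowed pattern (here a band/local window plus position 0 as a "global" token, both illustrative choices) are masked to −∞ before the softmax. Note that this mask-based sketch still computes the full T×T score matrix; the real efficiency gain comes from computing only the allowed entries.

```python
import numpy as np

def band_sparse_attention(Q, K, V, window=2):
    # Q, K: (T, D_k), V: (T, D_v); each position attends only to positions
    # within `window` steps of itself, plus position 0 as a global token.
    T, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)
    i = np.arange(T)[:, None]; j = np.arange(T)[None, :]
    allowed = (np.abs(i - j) <= window) | (j == 0)
    scores = np.where(allowed, scores, -np.inf)       # mask out disallowed positions
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V
```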
4.1.1 Position-based Sparse Attention.
4.1.1.1 Atomic Sparse Attention.
4.1.1.2 Compound Sparse Attention
4.1.1.3 Extended Sparse Attention
4.1.2 Content-based Sparse Attention.
Cluster the queries and keys on the same set of centroid vectors (Routing Transformer).
(I did not fully understand the meaning of this formula.)
b: number of hash buckets
R: a random matrix of size $[D_k, b/2]$
The LSH (locality-sensitive hashing) function is computed as $h(x) = \arg\max([xR; -xR])$; queries and keys that fall into the same bucket attend to each other.
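A sketch of the LSH bucketing step only, following the hash function above; grouping same-bucket queries/keys and attending within buckets is omitted.

```python
import numpy as np

def lsh_bucket(x, R):
    # x: (T, D_k) queries/keys; R: (D_k, b // 2) random matrix
    proj = x @ R                                                        # (T, b/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)   # bucket id in [0, b)

D_k, b = 8, 4
R = np.random.randn(D_k, b // 2)
buckets = lsh_bucket(np.random.randn(10, D_k), R)   # one bucket id per position
```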
4.2 Linearized Attention
Linearized attention reduces the complexity of computing attention from $\mathcal{O}(T^2)$ to $\mathcal{O}(T)$ by approximating $\mathrm{softmax}(QK^\top)V$ with $\phi(Q)\big(\phi(K)^\top V\big)$.
$Z = \sum_j \phi(k_j)$: the normalizer term.
$\phi(\cdot)$: the feature map, applied row-wise.
Deriving the formula from a vector perspective gives more insight into linearized attention:
$$z_i^\top = \frac{\sum_j \big(\phi(q_i)^\top \phi(k_j)\big)\, v_j^\top}{\sum_j \phi(q_i)^\top \phi(k_j)} = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)} = \frac{\phi(q_i)^\top S}{\phi(q_i)^\top Z}, \qquad S = \sum_j \phi(k_j)\, v_j^\top .$$
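Putting the derivation into code: a minimal sketch of linearized attention using $\phi(x) = \mathrm{elu}(x) + 1$ as the feature map (this particular choice of feature map is an assumption here; §4.2.1 discusses alternatives). The cost is linear in both sequence lengths.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, element-wise; keeps all features positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v)
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)
    S = phi_K.T @ V           # (D_k, D_v): sum_j phi(k_j) v_j^T
    Z = phi_K.sum(axis=0)     # (D_k,):     sum_j phi(k_j)
    return (phi_Q @ S) / (phi_Q @ Z)[:, None]   # (N, D_v)
```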
4.2.1 Feature Maps
Performer [18]: the first version's random feature maps do not guarantee non-negative attention scores, which leads to instability and abnormal behavior.
Performer [19]: the second version guarantees unbiased estimation and non-negative outputs, and is more stable than Performer [18].
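To make the contrast concrete, here is a sketch of positive random features for the (unnormalized) softmax kernel, using the identity $\exp(q^\top k) = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[e^{w^\top q - \|q\|^2/2}\, e^{w^\top k - \|k\|^2/2}\big]$; the features are strictly positive, so the estimated attention scores are too. The orthogonal-random-matrix trick and scaling details of the actual Performer are omitted.

```python
import numpy as np

def positive_random_features(X, W):
    # X: (T, d); W: (m, d) with rows drawn from N(0, I_d)
    # phi(x)_i = exp(w_i . x - ||x||^2 / 2) / sqrt(m)  -- strictly positive
    sq_norm = 0.5 * (X ** 2).sum(axis=-1, keepdims=True)
    return np.exp(X @ W.T - sq_norm) / np.sqrt(W.shape[0])

d, m, T = 8, 512, 5
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))
W = rng.normal(size=(m, d))
approx = positive_random_features(Q, W) @ positive_random_features(K, W).T
exact = np.exp(Q @ K.T)   # unbiased Monte-Carlo estimate of this kernel, always >= 0
```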
4.2.2 Aggregation Rule:
4.3 Query Prototyping and Memory Compression
4.3.1 Attention with Prototype Queries
decreasing the number of queries with query prototyping
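A minimal sketch of the idea, under my own simplifying assumptions: prototypes are obtained by strided mean-pooling of the queries, attention is computed only for the prototypes, and each prototype's output is copied back to the query positions it represents.

```python
import numpy as np

def prototype_query_attention(Q, K, V, stride=2):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v); assumes N % stride == 0
    Q_p = Q.reshape(-1, stride, Q.shape[1]).mean(axis=1)      # (N/stride, D_k) prototypes
    scores = Q_p @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    out_p = (A / A.sum(axis=-1, keepdims=True)) @ V           # outputs for prototypes only
    return np.repeat(out_p, stride, axis=0)                   # copy back to all N positions
```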
4.3.2 Attention with Compressed Key-Value Memory
Reducing the number of key-value pairs before applying the attention mechanism, e.g., by compressing the keys and values with a strided convolution.
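A minimal sketch of compressing the key-value memory before attention; non-overlapping mean pooling stands in for a learned strided convolution, and the stride is an illustrative choice.

```python
import numpy as np

def compress_memory(X, stride=2):
    # stand-in for a learned strided convolution: non-overlapping mean pooling
    M = (X.shape[0] // stride) * stride
    return X[:M].reshape(-1, stride, X.shape[1]).mean(axis=1)

def attention_with_compressed_memory(Q, K, V, stride=2):
    # only M/stride key-value pairs remain to be attended to
    K_c, V_c = compress_memory(K, stride), compress_memory(V, stride)
    scores = Q @ K_c.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V_c
```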
4.4 Low-rank Self-Attention
4.4.1 Low-rank Parameterization
Restrict the dimension $D_k$ so that the attention matrix is parameterized with low rank.
4.4.2 Low-rank Approximation
kernel approximation with random feature maps
Nyström method.
4.5 Attention with Prior
4.5.1 Prior that Models Locality
A higher prior score indicates a higher prior probability of attending to the corresponding position (a sketch follows the examples below).
Yang et al. [156]
Gaussian Transformer
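A sketch of a locality prior: a Gaussian bias centered on each query's own position is added to the attention logits before the softmax, so nearby positions receive higher prior probability. The fixed σ and the centering choice are illustrative assumptions; the works above predict or parameterize these quantities instead.

```python
import numpy as np

def attention_with_locality_prior(Q, K, V, sigma=2.0):
    # Q, K: (T, D_k), V: (T, D_v); self-attention over a single sequence
    T, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)
    i = np.arange(T)[:, None]; j = np.arange(T)[None, :]
    prior = -((j - i) ** 2) / (2.0 * sigma ** 2)   # closer to 0 (higher) near the query position
    logits = scores + prior
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V
```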
4.5.2 Prior from Lower Modules.
Attention distributions tend to be similar in adjacent layers, so the previous layer's attention scores can serve as a prior for the current layer:
$$\hat{A}^{(l)} = w_1 \cdot A^{(l)} + w_2 \cdot g\big(A^{(l-1)}\big)$$
$A^{(l)}$: attention scores of the $l$-th layer;
$w_1, w_2$: weights applied to the scores from the adjacent layers;
$g(\cdot)$: a function translating the previous layer's scores into a prior.
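A sketch of blending the current layer's scores with the previous layer's, following the formula above. Taking g as the identity and w1 = w2 = 0.5 is an illustrative simplification; the actual methods learn or fix these components differently.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_lower_layer_prior(Q, K, V, prev_scores, w1=0.5, w2=0.5, g=lambda s: s):
    # prev_scores: unnormalized attention scores A^(l-1) from the previous layer, shape (N, M)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # A^(l)
    blended = w1 * scores + w2 * g(prev_scores)    # \hat{A}^(l)
    return softmax(blended) @ V, scores            # also return scores for the next layer
```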
4.5.3 Prior as Multi-task Adapters
$\oplus$: direct sum;
$A_j$: trainable parameters;
$\gamma(\cdot), \beta(\cdot)$: Feature Wise Linear Modulation (FiLM) functions.
4.5.4 Attention with Only Prior
4.6 Improved Multi-Head Mechanism
4.6.1 Head Behavior Modeling.
4.6.2 Multi-head with Restricted Spans
• Locality. Restricting the attention span introduces an explicit locality bias, which helps when local dependencies dominate.
• Efficiency. With an appropriate implementation, the model can scale to long sequences without quadratic memory and computation (see the sketch below).
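A sketch of per-head restricted spans: each head masks out positions beyond its own window before the softmax. The hard windows and the shared Q/K/V across heads (head-specific projections omitted for brevity) are simplifying assumptions; adaptive-span methods learn a soft mask instead.

```python
import numpy as np

def restricted_span_heads(Q, K, V, spans=(1, 2, 4, 8)):
    # Q, K: (T, D_k), V: (T, D_v); one hard local window per head, outputs concatenated
    T, D_k = Q.shape
    i = np.arange(T)[:, None]; j = np.arange(T)[None, :]
    outputs = []
    for span in spans:                                # one window size per head
        scores = Q @ K.T / np.sqrt(D_k)
        scores = np.where(np.abs(i - j) <= span, scores, -np.inf)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        outputs.append((A / A.sum(axis=-1, keepdims=True)) @ V)
    return np.concatenate(outputs, axis=-1)           # (T, len(spans) * D_v)
```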
4.6.3 Multi-head with Refined Aggregation.
The goal is to best balance translation performance and computational efficiency.
4.6.4 Other Modifications.