
A Survey of Transformers: Study Notes


1 Introduction

Various X-formers improve on the vanilla Transformer from several perspectives: model efficiency, model generalization, and model adaptation.

2 Background

2.1 vanilla Transformer

The vanilla Transformer is a sequence-to-sequence model consisting of an encoder and a decoder, each of which is a stack of L identical blocks.

Each encoder block is composed of a multi-head self-attention module and a position-wise FFN. To build a deeper network, a residual connection is employed around each module, followed by layer normalization.

Each decoder block additionally inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN.

2.1.1 Attention Modules


The scaled dot-product attention used by the Transformer is given by

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D_k}}\right)V = AV \qquad (1)$$

Query-Key-Value (QKV) model:

$Q \in \mathbb{R}^{N \times D_k}$; $K \in \mathbb{R}^{M \times D_k}$; $V \in \mathbb{R}^{M \times D_v}$

$N$, $M$: the lengths of the queries and of the keys (or values); $D_k$, $D_v$: the dimensions of the keys (or queries) and of the values.

$A$: the attention matrix.

The factor $\sqrt{D_k}$ in Eq. (1) is there to alleviate the gradient vanishing problem of the softmax function.
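A minimal NumPy sketch of Eq. (1). The shapes follow the definitions above (N queries, M key-value pairs); the random inputs are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(Dk)) V."""
    Dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(Dk)      # (N, M) unnormalized scores
    A = softmax(scores, axis=-1)        # (N, M) attention matrix
    return A @ V                        # (N, Dv)

# Toy shapes: N = 4 queries, M = 6 key-value pairs, Dk = 8, Dv = 16.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```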


Multi-head attention:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (2)$$

$$\mathrm{MultiHeadAttn}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\,W^O \qquad (3)$$

Eq. (2) projects the queries, keys, and values from dimension $D_m$ down to $D_k$, $D_k$, and $D_v$ respectively; Eq. (3) concatenates the head outputs and projects the result back to $D_m$.
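A sketch of Eqs. (2)-(3) for the self-attention case, assuming H heads with $D_k = D_v = D_m / H$ (a common choice); the random projection matrices stand in for learned parameters.

```python
import numpy as np

def attention(Q, K, V):
    # Eq. (1), reused per head.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    S = np.exp(S - S.max(axis=-1, keepdims=True))
    return (S / S.sum(axis=-1, keepdims=True)) @ V

def multi_head_self_attention(X, H, rng):
    """Eqs. (2)-(3): project X into H heads, attend, concatenate, project back to Dm."""
    N, Dm = X.shape
    Dk = Dv = Dm // H
    heads = []
    for _ in range(H):
        WQ = rng.normal(size=(Dm, Dk))   # per-head projections (Eq. 2)
        WK = rng.normal(size=(Dm, Dk))
        WV = rng.normal(size=(Dm, Dv))
        heads.append(attention(X @ WQ, X @ WK, X @ WV))
    WO = rng.normal(size=(H * Dv, Dm))   # project the concatenation back to Dm (Eq. 3)
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))             # N = 5 tokens, Dm = 64
print(multi_head_self_attention(X, H=8, rng=rng).shape)  # (5, 64)
```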


Classification, based on where the queries, keys, and values come from:

  • Self-attention: in Eq. (1), Q = K = V = X, the output of the previous layer.
  • Masked self-attention: in the decoder, each position is only allowed to attend to earlier positions, which permits parallel training.
  • Cross-attention: the queries are projected from the outputs of the previous (decoder) layer, whereas the keys and values are projected from the outputs of the encoder.


2.1.2 Position-wise FFN.

This is in essence a fully connected feed-forward module, applied to each position separately and identically:

$$\mathrm{FFN}(H') = \mathrm{ReLU}(H'W^1 + b^1)W^2 + b^2$$

where $H'$ is the output of the previous layer.
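A small sketch of the position-wise FFN; the model dimension $D_m$ and inner dimension $D_f$ below are illustrative hyperparameters, and the random weights stand in for learned ones.

```python
import numpy as np

def position_wise_ffn(H_prev, W1, b1, W2, b2):
    """FFN(H') = ReLU(H' W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(H_prev @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
Dm, Df, T = 64, 256, 10                      # model dim, inner dim, sequence length
W1, b1 = rng.normal(size=(Dm, Df)), np.zeros(Df)
W2, b2 = rng.normal(size=(Df, Dm)), np.zeros(Dm)
H_prev = rng.normal(size=(T, Dm))            # H' from the previous layer
print(position_wise_ffn(H_prev, W1, b1, W2, b2).shape)  # (10, 64)
```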


2.1.3 Residual Connection and Normalization.

A residual connection is applied around each module, followed by layer normalization.
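A minimal sketch of the post-LN wrapping used in the vanilla Transformer (LayerNorm applied to the residual sum); the learnable gain and bias of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-position layer normalization (gain/bias omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_sublayer(x, sublayer):
    """Post-LN wrapping: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Example: wrap an identity "sublayer" just to show the data flow.
x = np.random.default_rng(0).normal(size=(10, 64))
print(post_ln_sublayer(x, lambda h: h).shape)  # (10, 64)
```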

2.1.4 Position Encodings.

The Transformer itself is oblivious to positional information, so extra work (position encodings) is needed to compensate.
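A sketch of the sinusoidal position encoding used in the vanilla Transformer, $PE(t, 2i) = \sin(t / 10000^{2i/D_m})$ and $PE(t, 2i+1) = \cos(t / 10000^{2i/D_m})$; the resulting matrix is added to the token embeddings.

```python
import numpy as np

def sinusoidal_position_encoding(T, Dm):
    """Sinusoidal position encoding of shape (T, Dm), assuming Dm is even."""
    t = np.arange(T)[:, None]                 # positions 0..T-1
    i = np.arange(0, Dm, 2)[None, :]          # even dimension indices
    angles = t / np.power(10000.0, i / Dm)
    pe = np.zeros((T, Dm))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                 # added to the token embeddings

print(sinusoidal_position_encoding(T=50, Dm=64).shape)  # (50, 64)
```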

2.2 Model Usage 

  •  Encoder-Decoder: sequence-to-sequence tasks.
  •  Encoder only: used for classification.
  •  Decoder only: used for sequence generation.

2.3 Model Analysis

 D: hidden dimension

 T: input sequence length

Self-attention costs $O(T^2 \cdot D)$, whereas the position-wise FFN costs $O(T \cdot D^2)$, so self-attention dominates for long sequences and the FFN dominates for large model dimensions.

2.4 Comparing Transformer to Other Network Types

2.4.1 Analysis of Self-Attention.

  •  Compared with fully connected layers, self-attention is more parameter-efficient and handles variable sequence lengths more flexibly.
  •  Compared with convolutional layers, which need a deep stack of layers before the receptive field becomes global, self-attention can capture global dependencies with a constant number of layers.
  •  Compared with recurrent layers, self-attention parallelizes much better.

2.4.2 In Terms of Inductive Bias

  • Convolutional networks: an inductive bias of translation invariance via shared local kernels;
  • Recurrent networks: an inductive bias of temporal invariance via their Markovian structure;
  • Transformer: makes few assumptions about the structure of the data, which gives more flexibility but makes it prone to overfitting on small datasets;
  • GNN: the Transformer can be viewed as a GNN defined over a complete directed graph (with self-loops) where each input token is a node in the graph; the main difference is that the Transformer uses no prior knowledge about how the inputs are structured.

3 TAXONOMY OF TRANSFORMERS

 4 ATTENTION

Self-attention faces two major challenges: computational complexity and the lack of a structural prior.

Improvements to the attention mechanism fall into six directions: sparse attention, linearized attention, prototyping and memory compression, low-rank self-attention, attention with prior, and improved multi-head mechanisms.


4.1 Sparse Attention (two classes: position-based sparse attention and content-based sparse attention)

In the standard self-attention mechanism, every token needs to attend to all other tokens. In practice, however, the attention matrix A is sparse for most data points.
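A minimal sketch of position-based sparsity: a band (sliding-window) mask restricts each token to a local neighborhood, so most entries of A are masked out. The window size is illustrative, and this dense-mask version only shows the attention pattern; efficient implementations avoid computing the masked entries altogether.

```python
import numpy as np

def band_mask(T, window):
    """Boolean mask: token i may only attend to tokens j with |i - j| <= window."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)            # disallowed positions get a very low score
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = scores / scores.sum(axis=-1, keepdims=True)  # banded (sparse) attention matrix
    return A @ V

rng = np.random.default_rng(0)
T, D = 16, 8
Q, K, V = rng.normal(size=(T, D)), rng.normal(size=(T, D)), rng.normal(size=(T, D))
print(masked_attention(Q, K, V, band_mask(T, window=2)).shape)  # (16, 8)
```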

 4.1.1 Position-based Sparse Attention.

4.1.1.1 Atomic Sparse Attention.

4.1.1.2 Compound Sparse Attention.

4.1.1.3 Extended Sparse Attention.

4.1.2 Content-based Sparse Attention. 

  •  Routing Transformer:

Queries and keys are clustered on the same set of centroid vectors; each query then only attends to the keys assigned to the same cluster.

(I did not fully understand the formula here.)

 

  •  Reformer:

Uses locality-sensitive hashing (LSH) to bucket queries and keys, where

b: the number of buckets;

R: a random matrix of size $[D_k, b/2]$;

and the LSH function is computed as $h(x) = \arg\max([xR; -xR])$ (see the sketch after this list).

  •  Sparse Adaptive Connection (SAC)
  •  Sparse Sinkhorn Attention 
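A small sketch of the LSH hash described for the Reformer above: a random projection R of size $[D_k, b/2]$ and $h(x) = \arg\max([xR; -xR])$. The bucketing itself (sorting, chunking, attending within buckets) is omitted, and the inputs are random stand-ins.

```python
import numpy as np

def lsh_hash(X, R):
    """h(x) = argmax([xR ; -xR]); similar rows are likely to land in the same bucket."""
    proj = X @ R                                   # (n, b/2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)  # bucket ids in [0, b)

rng = np.random.default_rng(0)
Dk, b, n = 16, 8, 10
R = rng.normal(size=(Dk, b // 2))                  # random projection matrix
X = rng.normal(size=(n, Dk))                       # queries (shared with keys in Reformer)
print(lsh_hash(X, R))                              # one bucket id per row
```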

4.2 Linearized Attention

Computational complexity:

  • Standard self-attention computes $\mathrm{softmax}(QK^\top)V$, which materializes the $T \times T$ attention matrix, so time and memory grow as $O(T^2)$.

  • Linearized attention approximates the exponential kernel $\exp(q^\top k)$ with a factorized kernel $\phi(q)^\top \phi(k)$, so the output can be written as $\phi(Q)\,(\phi(K)^\top V)$ and computed in $O(T)$ by first accumulating $\phi(K)^\top V$; the normalizer $Z$ for query $q_i$ is $\phi(q_i)^\top \sum_j \phi(k_j)$.

$\phi$: a row-wise feature map.

Deriving the formula in vector form gives more insight into linearized attention: for each position $i$,

$$z_i = \frac{\sum_j \phi(q_i)^\top \phi(k_j)\, v_j^\top}{\sum_j \phi(q_i)^\top \phi(k_j)} = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)},$$

where the sums over $j$ can be maintained as a running memory, which also enables RNN-style incremental (autoregressive) decoding.
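A sketch contrasting the two computation orders, using $\phi(x) = \mathrm{elu}(x) + 1$ (the Linear Transformer feature map mentioned in 4.2.1): computing $\phi(K)^\top V$ first avoids ever forming the $T \times T$ matrix.

```python
import numpy as np

def elu_feature_map(x):
    # Row-wise feature map phi(x) = elu(x) + 1 (keeps scores non-negative).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V):
    """softmax(QK^T)V approximated by phi(Q)[phi(K)^T V] / (phi(Q)[phi(K)^T 1])."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    S = Kf.T @ V                     # (D_phi, Dv): aggregated "memory", built in O(T)
    z = Kf.sum(axis=0)               # (D_phi,): normalizer accumulator
    num = Qf @ S                     # (T, Dv)
    den = (Qf @ z)[:, None]          # (T, 1)
    return num / den

rng = np.random.default_rng(0)
T, D = 32, 16
Q, K, V = rng.normal(size=(T, D)), rng.normal(size=(T, D)), rng.normal(size=(T, D))
print(linearized_attention(Q, K, V).shape)  # (32, 16)
```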

 

 4.2.1 Feature Maps

  • The feature map used by the Linear Transformer: $\phi(x) = \mathrm{elu}(x) + 1$.

  •  Performer [18, 19] uses random feature maps that approximate the scoring function of the Transformer.

     Performer [18]: its trigonometric random feature maps cannot guarantee non-negative attention scores, which leads to instability and abnormal behaviors.

     Performer [19]: uses positive random feature maps that guarantee unbiased estimation and non-negative outputs, and is therefore more stable than Performer [18].

  • Schlag et al.

 4.2.2 Aggregation Rule:

  • RFA: introduces a gating mechanism to the summation.
  • Schlag et al.: enlarge the capacity in a write-and-remove fashion.

4.3 Query Prototyping and Memory Compression

4.3.1 Attention with Prototype Queries

Decreasing the number of queries with query prototyping.

4.3.2 Attention with Compressed Key-Value Memory

Reducing the number of the key-value pairs before applying the attention mechanism.

  • Liu et al. propose Memory Compressed Attention (MCA), which reduces the number of keys and values using a strided convolution.
  • Set Transformer [70] and Luna [90]: use a number of external trainable global nodes to summarize information from the inputs.
  • Linformer [142]: utilizes linear projections to project keys and values from length $n$ down to a smaller length $n_k$.
  • Poolingformer [165]: adopts a two-level attention that combines a sliding-window attention with a compressed memory attention.
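A hedged sketch of the Linformer-style idea above: keys and values of length $n$ are linearly projected down to $n_k$ rows before standard attention, so the score matrix is $n \times n_k$ instead of $n \times n$. The projection matrix E here is a random stand-in for a learned one.

```python
import numpy as np

def compressed_kv_attention(Q, K, V, E):
    """Project K, V from length n down to nk rows (E: nk x n), then attend normally."""
    Kc, Vc = E @ K, E @ V                               # (nk, Dk), (nk, Dv)
    scores = Q @ Kc.T / np.sqrt(Q.shape[-1])            # (n, nk) instead of (n, n)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = scores / scores.sum(axis=-1, keepdims=True)
    return A @ Vc

rng = np.random.default_rng(0)
n, nk, D = 128, 16, 32
Q, K, V = rng.normal(size=(n, D)), rng.normal(size=(n, D)), rng.normal(size=(n, D))
E = rng.normal(size=(nk, n))                            # learned in Linformer; random here
print(compressed_kv_attention(Q, K, V, E).shape)        # (128, 32)
```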

4.4 Low-rank Self-Attention

4.4.1 Low-rank Parameterization

Restrict the dimension $D_k$ so that the attention matrix is explicitly low-rank.

4.4.2 Low-rank Approximation

kernel approximation with random feature maps

Nyström method. 

4.5 Attention with Prior

 4.5.1 Prior that Models Locality

A higher $G_{ij}$ indicates a higher prior probability that position $i$ attends to position $j$; the prior is typically added to the unnormalized attention scores before the softmax.

Yang et al. [156] 

Gaussian Transformer
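A sketch of a locality prior in this spirit: a position-based bias G (here a simple Gaussian over the distance |i - j|, an illustrative choice) is added to the unnormalized scores before the softmax, so nearby positions receive higher prior probability.

```python
import numpy as np

def attention_with_locality_prior(Q, K, V, sigma=2.0):
    """Add a Gaussian locality bias G_ij = -(i - j)^2 / (2 sigma^2) to the scores."""
    T = Q.shape[0]
    idx = np.arange(T)
    G = -((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2)   # locality prior
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + G                      # generated scores + prior
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = scores / scores.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
T, D = 12, 8
Q, K, V = rng.normal(size=(T, D)), rng.normal(size=(T, D)), rng.normal(size=(T, D))
print(attention_with_locality_prior(Q, K, V).shape)  # (12, 8)
```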

4.5.2 Prior from Lower Modules.
Attention distributions tend to be similar in adjacent layers, so the previous layer's attention scores can serve as a prior for the current layer:

$$\hat{A}^{(l)} = w_1\, A^{(l)} + w_2\, g\!\left(A^{(l-1)}\right)$$

$A^{(l)}$: the attention scores of the $l$-th layer;

$w_1, w_2$: weights on the scores of the adjacent layers;

$g$: a function that translates the previous layer's scores into a prior. A small sketch follows the list below.

  • Predictive Attention Transformer: $w_1 = \alpha$, $w_2 = 1 - \alpha$; $g$ is a convolutional layer.
  • Realformer: $w_1 = w_2 = 1$; $g$ is the identity map.
  • Lazyformer: computes the attention map only once for every pair of adjacent layers, which is equivalent to setting $g(\cdot)$ to the identity map and switching between the settings $w_1 = 0, w_2 = 1$ and $w_1 = 1, w_2 = 0$ alternately.
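A minimal sketch of the combination rule above, applied here to unnormalized scores (whether the mixing happens before or after the softmax varies by method). With $w_1 = w_2 = 1$ and $g$ the identity, it reduces to the Realformer-style setting.

```python
import numpy as np

def scores_with_lower_layer_prior(scores_l, scores_prev, w1=1.0, w2=1.0, g=lambda a: a):
    """A_hat^(l) = w1 * A^(l) + w2 * g(A^(l-1))."""
    return w1 * scores_l + w2 * g(scores_prev)

rng = np.random.default_rng(0)
T = 6
scores_l = rng.normal(size=(T, T))       # raw attention scores of layer l
scores_prev = rng.normal(size=(T, T))    # raw attention scores of layer l-1
mixed = scores_with_lower_layer_prior(scores_l, scores_prev)   # Realformer-style setting
print(mixed.shape)  # (6, 6)
```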

4.5.3 Prior as Multi-task Adapters

 

$\oplus$: direct sum;

$A_j$: trainable parameters;

$\beta_j, \gamma_j$: Feature-Wise Linear Modulation functions.

4.5.4 Attention with Only Prior

  •  Zhang et al.: use only a discrete uniform distribution as the source of the attention distribution.
  •  You et al.: utilize a Gaussian distribution as the hardcoded attention distribution for attention calculation.
  •  Synthesizer: replaces the generated attention scores with (1) learnable, randomly initialized attention scores, and (2) attention scores output by a feed-forward network that is conditioned only on the querying input itself.

4.6 Improved Multi-Head Mechanism

4.6.1 Head Behavior Modeling.

  • Li et al.: introduce a regularization term into the loss function to encourage diversity among attention heads.
  • Deshpande and Narasimhan: use an auxiliary loss, the Frobenius norm between the attention distribution maps and predefined attention patterns.
  • Talking-head Attention: a talking-head mechanism.
  • Collaborative Multi-head Attention:

 4.6.2 Multi-head with Restricted Spans

Restricting the attention spans brings two benefits:

• Locality.
• Efficiency.

 4.6.3 Multi-head with Refined Aggregation.

The goal is to best balance translation performance and computational efficiency.

4.6.4 Other Modifications.
