In the previous blog, I discussed the training data of LLMs and their data scheduling methods. This blog focuses on another important aspect of LLMs: model architecture. Because model architectures are complex and diverse, I am writing this as a bilingual blog: this version is in English, and a Chinese version has already been released. The blog is based on Datawhale materials and a nice survey.
The Transformer architecture has become the dominant framework for building a wide range of LLMs, enabling language models to scale to hundreds or even thousands of billions of parameters. Broadly speaking, the prevalent architectures of current LLMs can be roughly classified into three main types: encoder-decoder, causal decoder, and prefix decoder. A summary table comparing these architectures can be found in the survey.
The stability of training is a significant challenge for pre-training LLMs. Normalization is a widely used strategy to address this issue and stabilize the training of neural networks. In the original Transformer, LayerNorm is utilized. However, several advanced normalization techniques have been proposed as alternatives to LayerNorm, such as RMSNorm and DeepNorm.
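To make the difference concrete, below is a minimal PyTorch sketch of RMSNorm (the class name and epsilon value are my own illustrative choices, not taken from any particular codebase). Unlike LayerNorm, it only re-scales by the root mean square of the activations, dropping the mean subtraction and the bias term.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """A minimal RMSNorm sketch: re-scale by the root mean square, no mean/bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reciprocal root mean square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```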
In addition to the normalization method, the normalization position also plays a crucial role in LLMs. There are generally three choices for the normalization position: post-LN, pre-LN, and sandwich-LN.
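The difference between these positions is easiest to see in code. The sketch below contrasts post-LN and pre-LN residual blocks (sandwich-LN additionally normalizes the sub-layer output before the residual addition); `sublayer` is only a placeholder name standing for either the attention or the feed-forward module.

```python
import torch.nn as nn

def post_ln_block(x, sublayer, norm: nn.LayerNorm):
    # Post-LN (original Transformer): normalize after the residual addition.
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm: nn.LayerNorm):
    # Pre-LN (common in recent LLMs): normalize the sub-layer input,
    # leaving the residual path untouched, which tends to stabilize training.
    return x + sublayer(norm(x))
```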
Properly setting activation functions in feed-forward networks is crucial for achieving good performance. GeLU activations are widely used in existing LLMs. Additionally, variants of GeLU activation have been utilized in the latest LLMs, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice. However, compared to GeLU, they require additional parameters (about 50%) in the feed-forward networks.
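As a rough illustration of where those extra parameters come from, here is a SwiGLU feed-forward sketch (module and parameter names are mine): the gating introduces a third weight matrix, which is the source of the roughly 50% parameter increase mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """A SwiGLU feed-forward sketch: Swish-gated linear unit, then down-projection."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, hidden, bias=False)    # the extra third matrix
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = Swish(x W_gate) * (x W_up), then project back to d_model.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```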
As the self-attention modules in Transformers are permutation equivariant, position embeddings (PE) are employed to inject absolute or relative position information for modeling sequences.
Absolute Position Embedding: In the original Transformer, absolute position embeddings are used. At the bottom of the encoder and decoder stacks, the absolute positional embeddings are added to the input embeddings. The original Transformer proposed two variants of absolute position embeddings, namely sinusoidal and learned position embeddings, with the latter being commonly used in existing pre-trained language models.
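For reference, the sinusoidal variant can be generated with a few lines of code; the sketch below is illustrative and follows the standard formulation (sin on even dimensions, cos on odd ones) rather than any specific implementation.

```python
import math
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) table of sinusoidal absolute position embeddings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                     * (-math.log(10000.0) / d_model))                   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * freq)  # even dimensions
    pe[:, 1::2] = torch.cos(position * freq)  # odd dimensions
    return pe  # added to the token embeddings at the bottom of the model
```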
Relative Position Embedding: Unlike absolute position embeddings, relative positional embeddings are generated based on the offsets between keys and queries. A popular variant of relative PE was introduced in Transformer-XL. The calculation of attention scores between keys and queries has been modified to introduce learnable embeddings corresponding to relative positions.
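The Transformer-XL formulation is somewhat involved, so the sketch below shows a simplified relative-position bias (closer in spirit to the T5-style bias than to Transformer-XL itself, and only meant to illustrate the idea): a learnable embedding indexed by the key-query offset is added to the attention scores.

```python
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    """A simplified learnable relative-position bias added to attention scores."""
    def __init__(self, max_distance: int, num_heads: int):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        # Offsets (key index - query index), clipped to the modeled range.
        rel = torch.arange(k_len)[None, :] - torch.arange(q_len)[:, None]
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)  # (num_heads, q_len, k_len)
```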
Rotary Position Embedding (RoPE): It sets rotation matrices based on the absolute position of each key or query, so that the scores between keys and queries can be computed with relative position information. RoPE combines each consecutive pair of elements in the query and key vectors into one dimension, resulting in $\frac{d}{2}$ dimensions for an original $d$-dimensional embedding. For each dimension $i \in \{1,\dots,\frac{d}{2}\}$, the pair of involved elements is rotated by the angle $t \cdot \theta_i$, where $t$ denotes the position index and $\theta_i$ is the basis of that dimension. Following sinusoidal position embeddings, RoPE defines the basis $\theta_i$ as an exponentiation of the base $b$ (set to 10000 by default):

$$\Theta = \{\theta_i = b^{-2(i-1)/d} \mid i \in \{1,2,\dots,d/2\}\}$$
Furthermore, a recent study defines the distance required to rotate one full cycle ($2\pi$) in each dimension as its wavelength:

$$\lambda_i = 2\pi b^{2(i-1)/d} = 2\pi/\theta_i$$
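Putting the formulas together, a bare-bones RoPE implementation looks roughly like the sketch below (function name and shapes are illustrative): each consecutive pair of elements of a query or key vector is rotated by the angle $t \cdot \theta_i$.

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive element pairs of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape                                                 # d must be even
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # theta_i = b^{-2(i-1)/d}
    t = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # positions (seq_len, 1)
    angles = t * theta                                                   # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                      # the paired elements
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out  # applied to both queries and keys before the dot product
```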
Because of its outstanding performance and long-term decay property, RoPE has been widely embraced in the latest LLMs. Building upon RoPE, xPos enhances the translation invariance and length extrapolation of the Transformer. At each dimension of the rotation angle vector, xPos introduces a special exponential decay that diminishes as the basis grows, thereby mitigating the instability during training as the distance increases.
Because RoPE is important yet not easy to understand, a nice blog illustrates it well (warning: it requires solid math): https://zhuanlan.zhihu.com/p/647109286.
ALiBi: It is designed to enhance the extrapolation capability of the Transformer. Similar to relative position embedding, it biases attention scores using a penalty based on the distances between keys and queries. Unlike relative positional embedding methods, the penalty scores in ALiBi are predefined without any trainable parameters. Empirical results have demonstrated that ALiBi outperforms several popular position embedding methods, particularly on longer sequences. Furthermore, it has been shown that ALiBi can also enhance training stability in BLOOM.
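Since the ALiBi penalty is fixed rather than learned, it can be precomputed once per head, as in the following illustrative sketch (the simple slope formula assumes the number of heads is a power of two):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) bias: a linear penalty on q-k distance."""
    # Head-specific slopes form a geometric sequence, e.g. 2^-1 ... 2^-8 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = j - i, which is non-positive for the previous tokens
    # that remain visible under a causal mask.
    distance = (torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]).float()
    return slopes[:, None, None] * distance[None, :, :]  # added to attention scores
```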
The attention mechanism is a crucial element of the Transformer, enabling tokens across the sequence to interact and compute representations of the input and output sequences.
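At its core this interaction is just scaled dot-product attention; a compact, illustrative version is sketched below.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (..., seq_len, d_head); returns the attended values."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g. a causal mask
    return F.softmax(scores, dim=-1) @ v
```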
In summary, existing literature suggests the following detailed configurations for stronger generalization and training stability: choose pre-RMSNorm for layer normalization, and SwiGLU or GeGLU as the activation function. Additionally, it is recommended not to use LN immediately after embedding layers, as this may lead to performance degradation. Regarding position embeddings, RoPE or ALiBi is a better choice, especially for better performance on long sequences.
END