The BERT base model consists of a stack of bidirectional Transformer encoders: 12 layers, 768 hidden units, and 12 attention heads, for a total of about 110M parameters (roughly 1.1 × 10^8).
1. Embedding parameters
Vocabulary size: vocab_size = 30522
Hidden size: hidden_size = 768 (i.e., the word-vector dimension d_model = 768)
Maximum input length: max_position_embeddings = 512
Token embedding parameters: 30522 × 768
Position embedding parameters: 512 × 768
Segment embedding parameters: 2 × 768 (two sentence types, 0 and 1, to distinguish the first and second sentence)
Therefore, the total embedding parameters = (30522 + 512 + 2) × 768 = 23,835,648 ≈ 22.7M (here and below, 1M is counted as 2^20 parameters)
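The number above can be reproduced with a few lines of plain Python; the variable names below are just for illustration:

```python
# Sanity check of the embedding parameter count (plain arithmetic, no framework needed).
vocab_size = 30522        # WordPiece vocabulary
max_position = 512        # maximum input length
type_vocab_size = 2       # segment ids 0 and 1
hidden_size = 768         # d_model

embedding_params = (vocab_size + max_position + type_vocab_size) * hidden_size
print(embedding_params)           # 23835648
print(embedding_params / 2**20)   # ~22.7, i.e. ~22.7M when 1M = 2**20
```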
2. Multi-head attention parameters
Suppose the input X has shape (x, 768), where x is the sequence length. For each head, Q = X·W^Q has shape (x, 64), K = X·W^K has shape (x, 64), and V = X·W^V has shape (x, 64).
Q·K^T has shape (x, x), and V has shape (x, 64).
Z = softmax(Q·K^T / 8)·V has shape (x, 64), where 8 = √d_k = √64. The outputs of the 12 heads are concatenated and passed through another linear transformation (the accompanying figure, not reproduced here, shows an 8-head example):
Z_concat has shape (x, 64 × 12) = (x, 768)
Hence W^O has shape (768, 768), and the final Z has shape (x, 768), consistent with the input.
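The shape bookkeeping above can be checked with a minimal NumPy sketch of a 12-head attention block (random weights, biases omitted; all names here are illustrative, not BERT's actual variable names):

```python
import numpy as np

x_len, d_model, n_heads = 10, 768, 12       # x = sequence length (arbitrary here)
d_k = d_model // n_heads                    # 64

X = np.random.randn(x_len, d_model)
W_q = np.random.randn(n_heads, d_model, d_k)
W_k = np.random.randn(n_heads, d_model, d_k)
W_v = np.random.randn(n_heads, d_model, d_k)
W_o = np.random.randn(d_model, d_model)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for h in range(n_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]        # each (x, 64)
    Z_h = softmax(Q @ K.T / np.sqrt(d_k)) @ V           # (x, 64); sqrt(64) = 8
    heads.append(Z_h)

Z_concat = np.concatenate(heads, axis=-1)               # (x, 768)
Z = Z_concat @ W_o                                      # (x, 768), same shape as X
print(Z.shape)                                          # (10, 768)
```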
Each of the weight matrices W^Q / W^K / W^V has shape (768, 768/12 = 64).
The weight matrix W^O has shape (768, 768).
Therefore, the parameters of one 12-head multi-head attention block are: 768 × 64 × 3 × 12 + 768 × 768 = 2,359,296
Therefore, the multi-head attention parameters over 12 layers are: 2,359,296 × 12 = 28,311,552 ≈ 27M
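And the corresponding parameter count, matching the numbers above (weights only, biases excluded as in the text):

```python
# Multi-head attention parameter count for BERT base (weights only, no biases).
d_model, n_heads, n_layers = 768, 12, 12
d_k = d_model // n_heads                                      # 64

per_layer = d_model * d_k * 3 * n_heads + d_model * d_model   # W_Q/W_K/W_V for 12 heads + W_O
print(per_layer)                # 2359296
print(per_layer * n_layers)     # 28311552, i.e. ~27M when 1M = 2**20
```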
3. Fully connected layer (FeedForward) parameters
The feed-forward network consists of two fully connected layers. The formula given in the paper is:
FFN(x) = max(0, xW1 + b1)W2 + b2
It involves two weight matrices, W1 and W2. BERT follows the usual convention of setting the inner layer size to 4 × d_model = 3072, so W1 and W2 have shapes (768, 3072) and (3072, 768) respectively.
Therefore, the fully connected layer parameters over 12 layers are: 12 × (2 × 768 × 3072) = 56,623,104 ≈ 54M (biases not counted)
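A small sketch of the FFN and its (bias-free) parameter count, again with illustrative names and random weights:

```python
import numpy as np

d_model, d_ff, n_layers = 768, 3072, 12

# FFN(x) = max(0, x W1 + b1) W2 + b2 -- random weights, shapes only.
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
x = np.random.randn(10, d_model)                 # 10 is an arbitrary sequence length
y = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(y.shape)                                   # (10, 768), same as the input

ffn_weights = n_layers * 2 * d_model * d_ff      # W1 and W2 over 12 layers, biases excluded
print(ffn_weights)                               # 56623104, i.e. ~54M when 1M = 2**20
```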
4. LayerNorm parameters
Each LayerNorm layer has two parameter vectors, gamma and beta. LayerNorm is used in three places: after the embedding layer, after multi-head attention, and after the feed-forward network.
Therefore, the LayerNorm parameters are: 768 × 2 + (768 × 2) × 12 + (768 × 2) × 12 = 38,400 ≈ 37.5K
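The same number in code:

```python
# LayerNorm parameters: gamma and beta (768 each) after the embedding layer,
# plus one LayerNorm after attention and one after the FFN in each of the 12 layers.
hidden_size, n_layers = 768, 12
ln_params = hidden_size * 2 + (hidden_size * 2) * n_layers + (hidden_size * 2) * n_layers
print(ln_params)    # 38400, i.e. ~37.5K when 1K = 1024
```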
5. Conclusion
BERT base uses a 12-layer encoder, so the final parameter count is:
1) Word embedding parameters (including LayerNorm) = (30522 + 512 + 2) × 768 + 768 × 2
2) 12 × (multi-head attention parameters + fully connected layer parameters + LayerNorm parameters)
 = 12 × ((768 × 64 × 3 × 12 + 768 × 768) + (768 × 3072 × 2) + (768 × 2 × 2))
Total: 108,808,704 ≈ 104M
Note: the parameters discussed in this article are only those of the encoder. The parameters of the two tasks built on top of the encoder, next sentence prediction and MLM (768 × 2 and 768 × 768 respectively, about 0.5M in total), are not included; biases are also omitted because they contribute very few parameters.
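Putting the pieces together, the total can be reproduced in a few lines. The commented-out part is an optional cross-check using the Hugging Face transformers library (an assumption on my part: its default BertConfig matches BERT base, and its count also includes biases and the pooler, so it comes out slightly higher):

```python
# Total parameter count for the BERT base encoder (weights only, as derived above).
hidden, heads, layers = 768, 12, 12
vocab, max_pos, types, d_ff = 30522, 512, 2, 3072
d_k = hidden // heads

embeddings = (vocab + max_pos + types) * hidden + hidden * 2   # token/pos/segment + LayerNorm
per_layer = (
    hidden * d_k * 3 * heads + hidden * hidden   # multi-head attention: W_Q/W_K/W_V + W_O
    + 2 * hidden * d_ff                          # FFN: W1 and W2
    + hidden * 2 * 2                             # two LayerNorms (gamma and beta)
)
total = embeddings + layers * per_layer
print(total)             # 108808704, i.e. ~104M when 1M = 2**20

# Optional cross-check (requires the transformers library; counts biases and the pooler too):
# from transformers import BertConfig, BertModel
# model = BertModel(BertConfig())
# print(sum(p.numel() for p in model.parameters()))    # roughly 109.5M
```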
Transformer Encoder-Decoder Architecture
The BERT model contains only the encoder block of the transformer architecture. Let's look at the individual elements of an encoder block for BERT to visualize the number of weight matrices as well as the bias vectors. The given configuration L = 12 means there will be 12 layers of self attention, H = 768 means that the embedding dimension of individual tokens will be 768, and A = 12 means there will be 12 attention heads in one layer of self attention. The encoder block performs the following sequence of operations:
The input is a sequence of tokens represented as a matrix of dimension s × d, where s is the sequence length and d is the embedding dimension. The input representation of each token is the sum of its token embedding, token type embedding, and position embedding, each a d-dimensional vector. In the BERT model, the first set of parameters is the vocabulary embeddings. BERT uses WordPiece [2] embeddings with a vocabulary of 30522 tokens, each of 768 dimensions.
Embedding layer normalization. One weight (scale) vector and one bias vector.
Multi-head self attention. There are h heads, and each head has three matrices corresponding to the query, key, and value projections. The first dimension of these matrices is the embedding dimension and the second dimension is the embedding dimension divided by the number of attention heads. In addition, there is one more matrix that transforms the concatenated values produced by the attention heads into the final token representation (see the sketch after this list).
Residual connection and layer normalization. One weight (scale) vector and one bias vector.
Position-wise feed-forward network with one hidden layer, which corresponds to two weight matrices and two bias vectors. The paper specifies the number of hidden units as four times the embedding dimension.
Residual connection and layer normalization. One weight (scale) vector and one bias vector.
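One way to see exactly these weight matrices and bias vectors is to instantiate a randomly initialized BERT base model with the Hugging Face transformers library (an assumption here, not something the original text uses) and print the parameter shapes of the embeddings and of a single encoder layer:

```python
from transformers import BertConfig, BertModel

# The default BertConfig corresponds to BERT base: L = 12, H = 768, A = 12.
model = BertModel(BertConfig())

print("--- embeddings ---")
for name, p in model.embeddings.named_parameters():
    print(f"{name:45s} {tuple(p.shape)}")

print("--- one encoder layer ---")
for name, p in model.encoder.layer[0].named_parameters():
    print(f"{name:45s} {tuple(p.shape)}")

# Expected entries: the query/key/value and attention-output matrices of shape (768, 768)
# with their bias vectors, the two FFN matrices of shape (3072, 768) and (768, 3072)
# (PyTorch stores Linear weights as (out_features, in_features)), and the LayerNorm
# weight/bias vectors of shape (768,).
```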
Let's calculate the actual number of parameters by associating the right dimensions to the weight matrices and bias vectors for the BERT base model.
Embedding Matrices: