```python
import torch

# Load the saved BERT checkpoint (a state_dict) onto the CPU
fy = torch.load("pytorch_bert_model.bin", map_location=torch.device('cpu'))
for name in fy.keys():
    print(name + ' ' + str(list(fy[name].size())))
```
The output is as follows:
- bert.embeddings.word_embeddings.weight [28996, 768]
- bert.embeddings.position_embeddings.weight [512, 768]
- bert.embeddings.token_type_embeddings.weight [2, 768]
- bert.embeddings.LayerNorm.weight [768]
- bert.embeddings.LayerNorm.bias [768]
- bert.encoder.layer.0.attention.self.query.weight [768, 768]
- bert.encoder.layer.0.attention.self.query.bias [768]
- bert.encoder.layer.0.attention.self.key.weight [768, 768]
- bert.encoder.layer.0.attention.self.key.bias [768]
- bert.encoder.layer.0.attention.self.value.weight [768, 768]
- bert.encoder.layer.0.attention.self.value.bias [768]
- bert.encoder.layer.0.attention.output.dense.weight [768, 768]
- bert.encoder.layer.0.attention.output.dense.bias [768]
- bert.encoder.layer.0.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.0.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.0.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.0.intermediate.dense.bias [3072]
- bert.encoder.layer.0.output.dense.weight [768, 3072]
- bert.encoder.layer.0.output.dense.bias [768]
- bert.encoder.layer.0.output.LayerNorm.weight [768]
- bert.encoder.layer.0.output.LayerNorm.bias [768]
- bert.encoder.layer.1.attention.self.query.weight [768, 768]
- bert.encoder.layer.1.attention.self.query.bias [768]
- bert.encoder.layer.1.attention.self.key.weight [768, 768]
- bert.encoder.layer.1.attention.self.key.bias [768]
- bert.encoder.layer.1.attention.self.value.weight [768, 768]
- bert.encoder.layer.1.attention.self.value.bias [768]
- bert.encoder.layer.1.attention.output.dense.weight [768, 768]
- bert.encoder.layer.1.attention.output.dense.bias [768]
- bert.encoder.layer.1.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.1.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.1.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.1.intermediate.dense.bias [3072]
- bert.encoder.layer.1.output.dense.weight [768, 3072]
- bert.encoder.layer.1.output.dense.bias [768]
- bert.encoder.layer.1.output.LayerNorm.weight [768]
- bert.encoder.layer.1.output.LayerNorm.bias [768]
- bert.encoder.layer.2.attention.self.query.weight [768, 768]
- bert.encoder.layer.2.attention.self.query.bias [768]
- bert.encoder.layer.2.attention.self.key.weight [768, 768]
- bert.encoder.layer.2.attention.self.key.bias [768]
- bert.encoder.layer.2.attention.self.value.weight [768, 768]
- bert.encoder.layer.2.attention.self.value.bias [768]
- bert.encoder.layer.2.attention.output.dense.weight [768, 768]
- bert.encoder.layer.2.attention.output.dense.bias [768]
- bert.encoder.layer.2.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.2.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.2.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.2.intermediate.dense.bias [3072]
- bert.encoder.layer.2.output.dense.weight [768, 3072]
- bert.encoder.layer.2.output.dense.bias [768]
- bert.encoder.layer.2.output.LayerNorm.weight [768]
- bert.encoder.layer.2.output.LayerNorm.bias [768]
- bert.encoder.layer.3.attention.self.query.weight [768, 768]
- bert.encoder.layer.3.attention.self.query.bias [768]
- bert.encoder.layer.3.attention.self.key.weight [768, 768]
- bert.encoder.layer.3.attention.self.key.bias [768]
- bert.encoder.layer.3.attention.self.value.weight [768, 768]
- bert.encoder.layer.3.attention.self.value.bias [768]
- bert.encoder.layer.3.attention.output.dense.weight [768, 768]
- bert.encoder.layer.3.attention.output.dense.bias [768]
- bert.encoder.layer.3.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.3.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.3.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.3.intermediate.dense.bias [3072]
- bert.encoder.layer.3.output.dense.weight [768, 3072]
- bert.encoder.layer.3.output.dense.bias [768]
- bert.encoder.layer.3.output.LayerNorm.weight [768]
- bert.encoder.layer.3.output.LayerNorm.bias [768]
- bert.encoder.layer.4.attention.self.query.weight [768, 768]
- bert.encoder.layer.4.attention.self.query.bias [768]
- bert.encoder.layer.4.attention.self.key.weight [768, 768]
- bert.encoder.layer.4.attention.self.key.bias [768]
- bert.encoder.layer.4.attention.self.value.weight [768, 768]
- bert.encoder.layer.4.attention.self.value.bias [768]
- bert.encoder.layer.4.attention.output.dense.weight [768, 768]
- bert.encoder.layer.4.attention.output.dense.bias [768]
- bert.encoder.layer.4.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.4.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.4.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.4.intermediate.dense.bias [3072]
- bert.encoder.layer.4.output.dense.weight [768, 3072]
- bert.encoder.layer.4.output.dense.bias [768]
- bert.encoder.layer.4.output.LayerNorm.weight [768]
- bert.encoder.layer.4.output.LayerNorm.bias [768]
- bert.encoder.layer.5.attention.self.query.weight [768, 768]
- bert.encoder.layer.5.attention.self.query.bias [768]
- bert.encoder.layer.5.attention.self.key.weight [768, 768]
- bert.encoder.layer.5.attention.self.key.bias [768]
- bert.encoder.layer.5.attention.self.value.weight [768, 768]
- bert.encoder.layer.5.attention.self.value.bias [768]
- bert.encoder.layer.5.attention.output.dense.weight [768, 768]
- bert.encoder.layer.5.attention.output.dense.bias [768]
- bert.encoder.layer.5.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.5.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.5.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.5.intermediate.dense.bias [3072]
- bert.encoder.layer.5.output.dense.weight [768, 3072]
- bert.encoder.layer.5.output.dense.bias [768]
- bert.encoder.layer.5.output.LayerNorm.weight [768]
- bert.encoder.layer.5.output.LayerNorm.bias [768]
- bert.encoder.layer.6.attention.self.query.weight [768, 768]
- bert.encoder.layer.6.attention.self.query.bias [768]
- bert.encoder.layer.6.attention.self.key.weight [768, 768]
- bert.encoder.layer.6.attention.self.key.bias [768]
- bert.encoder.layer.6.attention.self.value.weight [768, 768]
- bert.encoder.layer.6.attention.self.value.bias [768]
- bert.encoder.layer.6.attention.output.dense.weight [768, 768]
- bert.encoder.layer.6.attention.output.dense.bias [768]
- bert.encoder.layer.6.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.6.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.6.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.6.intermediate.dense.bias [3072]
- bert.encoder.layer.6.output.dense.weight [768, 3072]
- bert.encoder.layer.6.output.dense.bias [768]
- bert.encoder.layer.6.output.LayerNorm.weight [768]
- bert.encoder.layer.6.output.LayerNorm.bias [768]
- bert.encoder.layer.7.attention.self.query.weight [768, 768]
- bert.encoder.layer.7.attention.self.query.bias [768]
- bert.encoder.layer.7.attention.self.key.weight [768, 768]
- bert.encoder.layer.7.attention.self.key.bias [768]
- bert.encoder.layer.7.attention.self.value.weight [768, 768]
- bert.encoder.layer.7.attention.self.value.bias [768]
- bert.encoder.layer.7.attention.output.dense.weight [768, 768]
- bert.encoder.layer.7.attention.output.dense.bias [768]
- bert.encoder.layer.7.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.7.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.7.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.7.intermediate.dense.bias [3072]
- bert.encoder.layer.7.output.dense.weight [768, 3072]
- bert.encoder.layer.7.output.dense.bias [768]
- bert.encoder.layer.7.output.LayerNorm.weight [768]
- bert.encoder.layer.7.output.LayerNorm.bias [768]
- bert.encoder.layer.8.attention.self.query.weight [768, 768]
- ........
- ........
- bert.encoder.layer.11.attention.self.query.weight [768, 768]
- bert.encoder.layer.11.attention.self.query.bias [768]
- bert.encoder.layer.11.attention.self.key.weight [768, 768]
- bert.encoder.layer.11.attention.self.key.bias [768]
- bert.encoder.layer.11.attention.self.value.weight [768, 768]
- bert.encoder.layer.11.attention.self.value.bias [768]
- bert.encoder.layer.11.attention.output.dense.weight [768, 768]
- bert.encoder.layer.11.attention.output.dense.bias [768]
- bert.encoder.layer.11.attention.output.LayerNorm.weight [768]
- bert.encoder.layer.11.attention.output.LayerNorm.bias [768]
- bert.encoder.layer.11.intermediate.dense.weight [3072, 768]
- bert.encoder.layer.11.intermediate.dense.bias [3072]
- bert.encoder.layer.11.output.dense.weight [768, 3072]
- bert.encoder.layer.11.output.dense.bias [768]
- bert.encoder.layer.11.output.LayerNorm.weight [768]
- bert.encoder.layer.11.output.LayerNorm.bias [768]
- bert.pooler.dense.weight [768, 768]
- bert.pooler.dense.bias [768]
- classifier.weight [1, 768]
- classifier.bias [1]
BERT-BASE: L=12, H=768, A=12, total parameters ≈ 110M
Here L denotes the number of layers (i.e., the number of Transformer blocks), H the hidden size (the width of each layer), and A the number of self-attention heads. In all experiments the feed-forward/filter size is set to 4H, i.e., 3072 when H=768 (this is what the intermediate layers above correspond to) and 4096 when H=1024.
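As a quick sanity check, both the 4H relation and the ~110M figure can be read directly off the state_dict; this is a minimal sketch that reuses the `fy` dict loaded in the snippet at the top:

```python
# H is the embedding width, 4H is the intermediate (feed-forward) width
hidden = fy["bert.embeddings.word_embeddings.weight"].size(1)       # 768
ffn = fy["bert.encoder.layer.0.intermediate.dense.weight"].size(0)  # 3072
assert ffn == 4 * hidden

# Sum the element counts of every tensor in the checkpoint
total = sum(t.numel() for t in fy.values())
print(f"total parameters: {total / 1e6:.1f}M")  # ~108M here, i.e. the ~110M usually quoted for BERT-BASE
```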
q, k, and v denote the query, key, and value matrices, respectively. If <k, v> is viewed as a key-value pair, the attention mechanism computes a weighted sum over the Value vectors of the input elements, and q together with the Keys is used to compute the weight coefficient of each corresponding Value.
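A minimal sketch of scaled dot-product attention (the standard formulation, not the exact BERT source) that makes this "weighted sum of Values" concrete:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # weights = softmax(q @ k^T / sqrt(d_k)); output = weighted sum of the values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy example: 1 sequence, 4 tokens, per-head dimension 64
q = torch.randn(1, 4, 64)
k = torch.randn(1, 4, 64)
v = torch.randn(1, 4, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 4, 64])
```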
The base version (bert_base) has 12 encoder layers, while the large version has 24. It also has wide feed-forward networks (hidden sizes of 768 and 1024 respectively, expanded to 4H inside each layer) and many attention heads (12 and 16 respectively).
bert_base has 12 attention heads, each of dimension 64, so the output dimension of every encoder layer is 12*64=768.
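A small illustration (with made-up tensor sizes) of how the 768-dimensional hidden state is split into 12 heads of 64 dimensions and concatenated back:

```python
import torch

batch, seq_len, hidden, num_heads = 1, 8, 768, 12
head_dim = hidden // num_heads  # 768 / 12 = 64

x = torch.randn(batch, seq_len, hidden)
# [batch, seq, 768] -> [batch, 12 heads, seq, 64]
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([1, 12, 8, 64])
# Concatenating the heads back recovers the 12*64 = 768 output dimension
merged = heads.transpose(1, 2).reshape(batch, seq_len, hidden)
print(merged.shape)  # torch.Size([1, 8, 768])
```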
What does the flow after Attention actually look like in the source code?
The dense layers can simply be understood as dimension transformations (linear projections).
Layer Norm normalizes the vector within a layer, and it is used together with a ResNet-style skip connection: the former is a regularization technique for sequence models that counteracts covariate shift, while the latter prevents vanishing gradients during optimization.
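A simplified sketch (not the Hugging Face source verbatim) of this post-attention flow, mirroring the attention.output / intermediate / output parameter names printed above:

```python
import torch
import torch.nn as nn

class BertLayerTail(nn.Module):
    """What follows self-attention in each encoder layer (simplified)."""
    def __init__(self, hidden=768, intermediate=3072):
        super().__init__()
        # attention.output.dense + attention.output.LayerNorm
        self.attn_out_dense = nn.Linear(hidden, hidden)
        self.attn_out_norm = nn.LayerNorm(hidden)
        # intermediate.dense: expand H -> 4H (with GELU)
        self.intermediate = nn.Linear(hidden, intermediate)
        # output.dense + output.LayerNorm: project 4H -> H
        self.out_dense = nn.Linear(intermediate, hidden)
        self.out_norm = nn.LayerNorm(hidden)

    def forward(self, attn_output, attn_input):
        # skip connection around the attention sub-layer, then LayerNorm
        h = self.attn_out_norm(self.attn_out_dense(attn_output) + attn_input)
        # feed-forward sub-layer with its own skip connection and LayerNorm
        ff = self.out_dense(torch.nn.functional.gelu(self.intermediate(h)))
        return self.out_norm(ff + h)
```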