

Analysis of the BERT model's parameter count

Using the BERT model from Hugging Face transformers, we count the model's parameters and analyze how they break down.

Loading the Hugging Face model

    import torch
    from transformers import BertTokenizer, BertModel

    bertModel = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True, output_attentions=True)
    total = sum(p.numel() for p in bertModel.parameters())
    print("total param:", total)

    Output:
    total param: 102267648

The code above counts the total number of parameters in the model; the output is 102267648.

Below, the BERT model's parameter count is analyzed in three parts.

1. The embedding layer

    BERT uses three kinds of embeddings: word embeddings, position embeddings, and sentence (token-type) embeddings. In the bert-base-chinese model, the vocabulary size is 21128, the embedding dimension is 768, and the maximum sequence length L is 512.

word embedding parameters: 21128*768
position embedding parameters: 512*768
sentence embedding parameters: 2*768
At the end of the embedding block there is a LayerNorm layer, which contributes 768+768 parameters: the scale $\alpha$ and shift $\beta$ in the LN formula.

Total parameters in the embedding layer:
21128*768 + 512*768 + 2*768 + 768 + 768 = 16622592
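
This figure can be verified against the real model by summing the parameters whose names start with embeddings. A minimal sketch, assuming the bertModel loaded in the snippet above is in scope:

# Sum word/position/token-type embeddings plus the embedding LayerNorm.
emb_total = sum(p.numel() for name, p in bertModel.named_parameters()
                if name.startswith("embeddings"))
print("embedding params:", emb_total)  # expected: 16622592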

2. The self-attention layers

    There are 12 self-attention layers, each consisting of two parts: multi-head attention and a LayerNorm layer.

Multi-head attention contains three projection matrices Q, K, V and one output projection. Each of Q, K, V has 768*12*64 weights plus a 768-dimensional bias: the first 768 is the embedding dimension, 12 is the number of heads, and 64 is the per-head dimension (12*64 = 768). The outputs of the heads are concatenated and passed through an additional output projection with 768*768 + 768 parameters.
LayerNorm parameters: 768 + 768
Parameters in one self-attention layer:
(768*12*64 + 768)*3 + 768*768 + 768 + 768 + 768 = 2363904
Across 12 layers: 2363904 * 12 = 28366848
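
The same name-based check works for a single attention block; in the Hugging Face module tree, encoder.layer.0.attention covers the Q/K/V projections, the output projection, and the attention LayerNorm. A minimal sketch, again assuming bertModel is in scope:

attn_total = sum(p.numel() for name, p in bertModel.named_parameters()
                 if name.startswith("encoder.layer.0.attention"))
print("attention params (one layer):", attn_total)        # expected: 2363904
print("attention params (12 layers):", attn_total * 12)   # expected: 28366848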

3. The feed-forward layers

    There are 12 feed-forward layers, each consisting of two parts: the feed-forward network and a LayerNorm layer.

The feed-forward network computes $W_2(W_1X+b_1)+b_2$, i.e. two linear layers: $W_1$ maps 768 -> 3072 (768*4) and $W_2$ maps 3072 -> 768. $W_1$ has 768*3072 + 3072 parameters and $W_2$ has 3072*768 + 768.
LayerNorm parameters: 768 + 768
Parameters in one feed-forward layer:
(768*768*4 + 768*4) + (768*4*768 + 768) + 768 + 768 = 4723968
Across 12 layers: 4723968 * 12 = 56687616
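
In the Hugging Face implementation, $W_1$ lives under encoder.layer.N.intermediate and $W_2$ together with the LayerNorm under encoder.layer.N.output, so a per-layer check can be sketched as follows (assuming bertModel is in scope):

ffn_total = sum(p.numel() for name, p in bertModel.named_parameters()
                if name.startswith("encoder.layer.0.intermediate")
                or name.startswith("encoder.layer.0.output"))
print("feedforward params (one layer):", ffn_total)        # expected: 4723968
print("feedforward params (12 layers):", ffn_total * 12)   # expected: 56687616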

Parameter totals

embedding: 16622592
self-attention: 28366848
feedforward: 56687616
After the encoder there is also a pooler layer, a 768*768 dense layer (768*768 weights + 768 biases). It produces the vector for the first special token [CLS] in each training example, which is used to compute the loss of BERT's NSP (next sentence prediction) task.

total = 16622592 + 28366848 + 56687616 + 768*768 + 768 = 102267648
This matches the PyTorch count.
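
The whole breakdown can also be reproduced in one pass by grouping parameter names by their top-level module; a minimal sketch, assuming bertModel is in scope (here encoder combines the self-attention and feed-forward totals):

from collections import defaultdict

groups = defaultdict(int)
for name, p in bertModel.named_parameters():
    groups[name.split(".")[0]] += p.numel()  # top-level module: embeddings / encoder / pooler
print(dict(groups))
# expected: {'embeddings': 16622592, 'encoder': 85054464, 'pooler': 590592}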

If anything above is unclear, it helps to look at the parameters of each layer of the BERT model.

The name and shape of every parameter in the model are listed below:

for name, param in bertModel.named_parameters():
    print(name)
    print(param.shape)

# Output:

embeddings.word_embeddings.weight
torch.Size([21128, 768])
embeddings.position_embeddings.weight
torch.Size([512, 768])
embeddings.token_type_embeddings.weight
torch.Size([2, 768])
embeddings.LayerNorm.weight
torch.Size([768])
embeddings.LayerNorm.bias
torch.Size([768])
encoder.layer.0.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.0.attention.self.query.bias
torch.Size([768])
encoder.layer.0.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.0.attention.self.key.bias
torch.Size([768])
encoder.layer.0.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.0.attention.self.value.bias
torch.Size([768])
encoder.layer.0.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.0.attention.output.dense.bias
torch.Size([768])
encoder.layer.0.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.0.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.0.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.0.intermediate.dense.bias
torch.Size([3072])
encoder.layer.0.output.dense.weight
torch.Size([768, 3072])
encoder.layer.0.output.dense.bias
torch.Size([768])
encoder.layer.0.output.LayerNorm.weight
torch.Size([768])
encoder.layer.0.output.LayerNorm.bias
torch.Size([768])
encoder.layer.1.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.1.attention.self.query.bias
torch.Size([768])
encoder.layer.1.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.1.attention.self.key.bias
torch.Size([768])
encoder.layer.1.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.1.attention.self.value.bias
torch.Size([768])
encoder.layer.1.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.1.attention.output.dense.bias
torch.Size([768])
encoder.layer.1.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.1.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.1.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.1.intermediate.dense.bias
torch.Size([3072])
encoder.layer.1.output.dense.weight
torch.Size([768, 3072])
encoder.layer.1.output.dense.bias
torch.Size([768])
encoder.layer.1.output.LayerNorm.weight
torch.Size([768])
encoder.layer.1.output.LayerNorm.bias
torch.Size([768])
encoder.layer.2.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.2.attention.self.query.bias
torch.Size([768])
encoder.layer.2.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.2.attention.self.key.bias
torch.Size([768])
encoder.layer.2.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.2.attention.self.value.bias
torch.Size([768])
encoder.layer.2.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.2.attention.output.dense.bias
torch.Size([768])
encoder.layer.2.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.2.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.2.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.2.intermediate.dense.bias
torch.Size([3072])
encoder.layer.2.output.dense.weight
torch.Size([768, 3072])
encoder.layer.2.output.dense.bias
torch.Size([768])
encoder.layer.2.output.LayerNorm.weight
torch.Size([768])
encoder.layer.2.output.LayerNorm.bias
torch.Size([768])
encoder.layer.3.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.3.attention.self.query.bias
torch.Size([768])
encoder.layer.3.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.3.attention.self.key.bias
torch.Size([768])
encoder.layer.3.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.3.attention.self.value.bias
torch.Size([768])
encoder.layer.3.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.3.attention.output.dense.bias
torch.Size([768])
encoder.layer.3.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.3.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.3.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.3.intermediate.dense.bias
torch.Size([3072])
encoder.layer.3.output.dense.weight
torch.Size([768, 3072])
encoder.layer.3.output.dense.bias
torch.Size([768])
encoder.layer.3.output.LayerNorm.weight
torch.Size([768])
encoder.layer.3.output.LayerNorm.bias
torch.Size([768])
encoder.layer.4.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.4.attention.self.query.bias
torch.Size([768])
encoder.layer.4.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.4.attention.self.key.bias
torch.Size([768])
encoder.layer.4.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.4.attention.self.value.bias
torch.Size([768])
encoder.layer.4.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.4.attention.output.dense.bias
torch.Size([768])
encoder.layer.4.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.4.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.4.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.4.intermediate.dense.bias
torch.Size([3072])
encoder.layer.4.output.dense.weight
torch.Size([768, 3072])
encoder.layer.4.output.dense.bias
torch.Size([768])
encoder.layer.4.output.LayerNorm.weight
torch.Size([768])
encoder.layer.4.output.LayerNorm.bias
torch.Size([768])
encoder.layer.5.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.5.attention.self.query.bias
torch.Size([768])
encoder.layer.5.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.5.attention.self.key.bias
torch.Size([768])
encoder.layer.5.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.5.attention.self.value.bias
torch.Size([768])
encoder.layer.5.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.5.attention.output.dense.bias
torch.Size([768])
encoder.layer.5.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.5.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.5.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.5.intermediate.dense.bias
torch.Size([3072])
encoder.layer.5.output.dense.weight
torch.Size([768, 3072])
encoder.layer.5.output.dense.bias
torch.Size([768])
encoder.layer.5.output.LayerNorm.weight
torch.Size([768])
encoder.layer.5.output.LayerNorm.bias
torch.Size([768])
encoder.layer.6.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.6.attention.self.query.bias
torch.Size([768])
encoder.layer.6.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.6.attention.self.key.bias
torch.Size([768])
encoder.layer.6.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.6.attention.self.value.bias
torch.Size([768])
encoder.layer.6.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.6.attention.output.dense.bias
torch.Size([768])
encoder.layer.6.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.6.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.6.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.6.intermediate.dense.bias
torch.Size([3072])
encoder.layer.6.output.dense.weight
torch.Size([768, 3072])
encoder.layer.6.output.dense.bias
torch.Size([768])
encoder.layer.6.output.LayerNorm.weight
torch.Size([768])
encoder.layer.6.output.LayerNorm.bias
torch.Size([768])
encoder.layer.7.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.7.attention.self.query.bias
torch.Size([768])
encoder.layer.7.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.7.attention.self.key.bias
torch.Size([768])
encoder.layer.7.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.7.attention.self.value.bias
torch.Size([768])
encoder.layer.7.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.7.attention.output.dense.bias
torch.Size([768])
encoder.layer.7.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.7.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.7.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.7.intermediate.dense.bias
torch.Size([3072])
encoder.layer.7.output.dense.weight
torch.Size([768, 3072])
encoder.layer.7.output.dense.bias
torch.Size([768])
encoder.layer.7.output.LayerNorm.weight
torch.Size([768])
encoder.layer.7.output.LayerNorm.bias
torch.Size([768])
encoder.layer.8.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.8.attention.self.query.bias
torch.Size([768])
encoder.layer.8.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.8.attention.self.key.bias
torch.Size([768])
encoder.layer.8.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.8.attention.self.value.bias
torch.Size([768])
encoder.layer.8.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.8.attention.output.dense.bias
torch.Size([768])
encoder.layer.8.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.8.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.8.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.8.intermediate.dense.bias
torch.Size([3072])
encoder.layer.8.output.dense.weight
torch.Size([768, 3072])
encoder.layer.8.output.dense.bias
torch.Size([768])
encoder.layer.8.output.LayerNorm.weight
torch.Size([768])
encoder.layer.8.output.LayerNorm.bias
torch.Size([768])
encoder.layer.9.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.9.attention.self.query.bias
torch.Size([768])
encoder.layer.9.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.9.attention.self.key.bias
torch.Size([768])
encoder.layer.9.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.9.attention.self.value.bias
torch.Size([768])
encoder.layer.9.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.9.attention.output.dense.bias
torch.Size([768])
encoder.layer.9.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.9.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.9.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.9.intermediate.dense.bias
torch.Size([3072])
encoder.layer.9.output.dense.weight
torch.Size([768, 3072])
encoder.layer.9.output.dense.bias
torch.Size([768])
encoder.layer.9.output.LayerNorm.weight
torch.Size([768])
encoder.layer.9.output.LayerNorm.bias
torch.Size([768])
encoder.layer.10.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.10.attention.self.query.bias
torch.Size([768])
encoder.layer.10.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.10.attention.self.key.bias
torch.Size([768])
encoder.layer.10.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.10.attention.self.value.bias
torch.Size([768])
encoder.layer.10.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.10.attention.output.dense.bias
torch.Size([768])
encoder.layer.10.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.10.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.10.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.10.intermediate.dense.bias
torch.Size([3072])
encoder.layer.10.output.dense.weight
torch.Size([768, 3072])
encoder.layer.10.output.dense.bias
torch.Size([768])
encoder.layer.10.output.LayerNorm.weight
torch.Size([768])
encoder.layer.10.output.LayerNorm.bias
torch.Size([768])
encoder.layer.11.attention.self.query.weight
torch.Size([768, 768])
encoder.layer.11.attention.self.query.bias
torch.Size([768])
encoder.layer.11.attention.self.key.weight
torch.Size([768, 768])
encoder.layer.11.attention.self.key.bias
torch.Size([768])
encoder.layer.11.attention.self.value.weight
torch.Size([768, 768])
encoder.layer.11.attention.self.value.bias
torch.Size([768])
encoder.layer.11.attention.output.dense.weight
torch.Size([768, 768])
encoder.layer.11.attention.output.dense.bias
torch.Size([768])
encoder.layer.11.attention.output.LayerNorm.weight
torch.Size([768])
encoder.layer.11.attention.output.LayerNorm.bias
torch.Size([768])
encoder.layer.11.intermediate.dense.weight
torch.Size([3072, 768])
encoder.layer.11.intermediate.dense.bias
torch.Size([3072])
encoder.layer.11.output.dense.weight
torch.Size([768, 3072])
encoder.layer.11.output.dense.bias
torch.Size([768])
encoder.layer.11.output.LayerNorm.weight
torch.Size([768])
encoder.layer.11.output.LayerNorm.bias
torch.Size([768])
pooler.dense.weight
torch.Size([768, 768])
pooler.dense.bias
torch.Size([768])
