Thanks to PaddleViT, an algorithm development and experimentation platform that provides state-of-the-art Vision Transformer (ViT) models and related tools.

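Thanks to PASSL, the PaddlePaddle self-supervised learning library. PASSL is a PaddlePaddle-based vision library for state-of-the-art self-supervised learning research. It aims to shorten the research cycle of self-supervised learning, from designing a new self-supervised task to evaluating the learned representations.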


Thanks to the paper "BEiT: BERT Pre-Training of Image Transformers" (arXiv), the original authors' code, and the original README (included in this project).


See also: a section-by-section breakdown of the Paddle BEiT code, and the BEiT README from the paper's original authors.

2. Introduction: PaddleViT - PaddlePaddle Vision Transformers



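PaddlePaddle Vision Transformers (PaddleViT, or PPViT) is a collection of vision models and tools built on recent deep learning techniques. It provides cutting-edge algorithms and models based on vision Transformers, visual attention mechanisms, and MLPs. PaddleViT also integrates the related layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.1+.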


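PaddleViT provides models and tools for several vision tasks, including image classification, object detection, semantic segmentation, and GANs. Each model architecture is defined in its own Python module so that users can modify it and start experiments quickly. Pretrained weights are available for download, and you can fine-tune them on your own dataset. PaddleViT also integrates commonly used tools and modules such as custom datasets, data preprocessing, performance metrics, and DDP.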



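Without further ado: on ImageNet with 224×224 training images, BEiT reaches a baseline accuracy of Acc@1 85.2%, which put it among the most accurate models at the time of writing.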

Table columns: Models, Model Size, Image Size, ImageNet accuracy



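BEiT is BERT for images. It is similar to ViT; the difference is that during training, random masking is applied to the image patches, so that the model learns to predict the correct visual tokens of an image even when the input is corrupted. BERT's key idea is self-supervised learning through masked prediction, and BEiT carries that idea over to images.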
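
To make the masking step concrete, here is a minimal sketch that marks a random subset of the 14×14 patch grid as masked. The function name, the 0.4 mask ratio, and the uniform sampling are assumptions chosen for illustration; they are not taken from the text above, and the actual BEiT masking strategy may differ.

```python
import random


def random_patch_mask(grid_h=14, grid_w=14, mask_ratio=0.4, seed=0):
    """Return one 0/1 flag per patch (1 = masked). Illustrative only."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(num_patches * mask_ratio)
    # Pick the masked patch positions uniformly at random (an assumption).
    masked = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked else 0 for i in range(num_patches)]


mask = random_patch_mask()
print(len(mask), sum(mask))  # 196 patches in total, 78 of them masked
```

During pre-training, the model would see the image with the flagged patches replaced by a mask embedding and be asked to predict the visual tokens at those positions.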



The original authors pre-trained for 800 epochs on 16 GPUs with a batch size of 2k, which took five days. Quoting the paper: "The pre-training runs for about 500k steps (i.e., 800 epochs) with 2k batch size. Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999 is employed for optimization. The learning rate is set to 1.5e-3, with a warmup of 10 epochs, and cosine learning rate decay. The weight decay is 0.05. We employ stochastic depth (Huang et al., 2016) with a 0.1 rate, and disable dropout. The 500k training steps take about five days using 16 Nvidia Tesla V100 32GB GPU cards."
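
The learning-rate recipe quoted above (peak 1.5e-3, 10 warmup epochs, cosine decay over 800 epochs) can be sketched as a small function. The linear shape of the warmup and the final learning rate of 0 are assumptions; the quote does not specify them.

```python
import math


def lr_at_epoch(epoch, base_lr=1.5e-3, warmup_epochs=10, total_epochs=800, end_lr=0.0):
    """Learning rate at a given epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 0 up to base_lr.
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 5, 10, 405, 800):
    print(epoch, lr_at_epoch(epoch))
```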


1) Modify the code so that it runs on a single machine with a single GPU.


2) Use a smaller dataset, which cuts the training time dramatically.

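If everyone ran the full dataset, each run would take roughly 576 hours on a single V100. That is too much compute: even with the 4-GPU V100 environment that AI Studio now supports, training would still take about 6 days and put too much load on the platform. Since the goal here is learning, we reduce the amount of data.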
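
The 6-day figure follows from the 576 single-GPU hours if we assume ideal linear scaling across the 4 GPUs (an assumption; real multi-GPU speedups are usually somewhat lower):

```python
single_gpu_hours = 576   # estimate quoted above for one V100
num_gpus = 4             # AI Studio 4-GPU environment

# Ideal linear scaling: the work is split evenly across the GPUs.
wall_clock_hours = single_gpu_hours / num_gpus
wall_clock_days = wall_clock_hours / 24
print(wall_clock_hours, wall_clock_days)  # 144.0 6.0
```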

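Two datasets are used. One is the official Cifar100 dataset, which takes about 24 hours to train on a single GPU. The other is a 10-class food dataset, for which 100 epochs take only about 2 hours.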

3. Hands-on BEiT Training




    !pip install pip -Uq
    !pip install yacs
    !pip install jikuai




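To use your own dataset, you can split it with the jikuai package. Install it with pip install jikuai, then run the commands below from the directory where you want the dataset list files to be written.

    from jikuai.dataset import Dataset
    # Path to the dataset: the parent directory of the per-class folders
    dataset = Dataset("/home/aistudio/BEiT/aifood/images")
    # Generate the training and test list files; the argument is the split ratio
    dataset.paddleclasout(0.8)

The generated files are named train.txt and eval.txt by default; rename them manually to train_list.txt and val_list.txt, which are the names the BEiT code expects.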
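
The manual rename can also be scripted. The helper below is a hypothetical convenience, not part of jikuai or the BEiT code; it only assumes the default file names mentioned above.

```python
from pathlib import Path


def rename_list_files(list_dir):
    """Rename train.txt/eval.txt to the names the BEiT scripts look for.

    Hypothetical helper: returns the list of new file names it created.
    """
    mapping = {"train.txt": "train_list.txt", "eval.txt": "val_list.txt"}
    renamed = []
    for old_name, new_name in mapping.items():
        src = Path(list_dir) / old_name
        if src.exists():
            src.rename(src.with_name(new_name))
            renamed.append(new_name)
    return renamed


# Example call (path taken from this tutorial):
# rename_list_files("/home/aistudio/BEiT/aifood")
```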

    print("Unpacking the dataset...")
    !cd ~/BEiT && tar -xzf /home/aistudio/data/data21994/aifood.tar.gz
    print("Dataset unpacked")
    %cd ~/BEiT/aifood
    from jikuai.dataset import Dataset
    # Path to the dataset: the parent directory of the per-class folders
    dataset = Dataset("/home/aistudio/BEiT/aifood/images")
    # Generate the training and test list files; the argument is the split ratio
    dataset.paddleclastxt(0.8)
    %cd ~/
    print("Dataset list files generated")





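To use your own dataset with a different number of classes, change _C.MODEL.NUM_CLASSES = 10 in config.py to your class count. The dataset location can be set with a command-line argument, e.g. -data_path='/home/aistudio/BEiT/aifood/'; the directory only needs to contain the two files train_list.txt and val_list.txt.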
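
Before launching a long run it can help to confirm that the data directory really contains both list files. The check below is a hypothetical pre-flight helper, not part of the BEiT code.

```python
import os

REQUIRED_LISTS = ("train_list.txt", "val_list.txt")


def missing_list_files(data_path):
    """Return the required list files that are absent from data_path."""
    return [
        name for name in REQUIRED_LISTS
        if not os.path.isfile(os.path.join(data_path, name))
    ]


# Example: an empty result means the directory is ready for -data_path.
print(missing_list_files("/home/aistudio/BEiT/aifood/"))
```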



For reference, the original authors pre-trained on ImageNet with 16 GPUs, a 2k batch size, and 800 epochs, which took about five days. The run below uses the small food dataset instead.


    print("Starting training, estimated time 2.2 hours...")
    !cd ~/BEiT/ && sh run_train.sh


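Fine-tuning from the pretrained model usually needs only about 10 more epochs. On this food dataset, accuracy already reaches Avg Acc@1: 0.9531 after 5 epochs of training, and 0.9860 after 20 epochs. BEiT is clearly a strong choice for competitions.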

2022-05-11 09:04:45,478 MASTER_LOG Step[0000/0016], Avg Loss: 0.3924, Avg Acc@1: 0.9531, Avg Acc@5: 1.0000

2022-05-11 09:54:03,719 MASTER_LOG ----- Epoch[020/020], Validation Loss: 0.2302, Validation Acc@1: 0.9860, Validation Acc@5: 1.0000, time: 9.06
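
To track accuracy across a run, the log lines above can be parsed with a regular expression. This is a small sketch that only assumes the two line formats shown here; other log lines may need a different pattern.

```python
import re

# Matches both "Avg Acc@1: ..., Avg Acc@5: ..." and
# "Validation Acc@1: ..., Validation Acc@5: ..." as seen in the logs above.
ACC_PATTERN = re.compile(
    r"Acc@1: (?P<acc1>\d+\.\d+), (?:Validation |Avg )?Acc@5: (?P<acc5>\d+\.\d+)"
)


def parse_accuracy(line):
    """Return (acc1, acc5) as floats, or None if the line has no accuracy."""
    match = ACC_PATTERN.search(line)
    if match is None:
        return None
    return float(match.group("acc1")), float(match.group("acc5"))


train_line = ("2022-05-11 09:04:45,478 MASTER_LOG Step[0000/0016], "
              "Avg Loss: 0.3924, Avg Acc@1: 0.9531, Avg Acc@5: 1.0000")
val_line = ("2022-05-11 09:54:03,719 MASTER_LOG ----- Epoch[020/020], "
            "Validation Loss: 0.2302, Validation Acc@1: 0.9860, "
            "Validation Acc@5: 1.0000, time: 9.06")
print(parse_accuracy(train_line))  # (0.9531, 1.0)
print(parse_accuracy(val_line))    # (0.986, 1.0)
```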

    !cd ~/BEiT/ && python main_gpu_finetune.py \
        -cfg='./configs/finetunebeit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=64 \
        -data_path='/home/aistudio/BEiT/aifood/' \
        -pretrained="/home/aistudio/data/data144298/beit_base_patch16_224_ft22kto1k.pdparams" \
        -amp


Evaluating the model we trained ourselves for 100 epochs gives Validation Acc@1: 0.5330, Validation Acc@5: 0.9350.

Evaluating the official pretrained model gives Validation Acc@1: 0.0690, Validation Acc@5: 0.3630.

Evaluating our own fine-tuned model gives Validation Acc@1: 0.1130, Validation Acc@5: 0.5470.



    # Evaluate the model we trained ourselves for 100 epochs
    !cd ~/BEiT/ && python main_gpu_finetune.py \
        -cfg='./configs/beit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=256 \
        -data_path='/home/aistudio/BEiT/aifood/' \
        -eval \
        -pretrained='/home/aistudio/BEiT/output/train-20220511-00-46/Epoch-100-Loss-0.9632747001647949.pdparams' \
        -amp

    # Evaluate the official pretrained model
    !cd ~/BEiT/ && python main_gpu_finetune.py \
        -cfg='./configs/beit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=256 \
        -data_path='/home/aistudio/BEiT/aifood/' \
        -eval \
        -pretrained='/home/aistudio/data/data144298/beit_base_patch16_224_ft22kto1k.pdparams' \
        -amp

    # Evaluate the model after fine-tuning
    !cd ~/BEiT/ && python main_gpu_finetune.py \
        -cfg='./configs/beit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=256 \
        -data_path='/home/aistudio/BEiT/aifood/' \
        -eval \
        -pretrained='/home/aistudio/BEiT/output/train-20220511-09-34/Epoch-15-Loss-0.2563522930145264.pdparams' \
        -amp






    import numpy as np
    np.random.seed(42)

    # Copyright (c) 2021 PPViT Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    """
    Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth
    """
    import numpy as np
    import paddle
    import paddle.nn as nn


    class DropPath(nn.Layer):
        """DropPath class"""
        def __init__(self, drop_prob=None):
            super().__init__()
            self.drop_prob = drop_prob

        def drop_path(self, inputs):
            """drop path op
            Uses self.drop_prob (drop probability) and self.training (current mode).
            Args:
                inputs: tensor with arbitrary shape
            Returns:
                output: output tensor after drop path
            """
            # if prob is None/0 or eval mode, return original input
            if not self.drop_prob or not self.training:
                return inputs
            keep_prob = 1 - self.drop_prob
            keep_prob = paddle.to_tensor(keep_prob, dtype='float32')
            shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1)  # shape=(N, 1, 1, 1)
            random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype)
            random_tensor = random_tensor.floor()  # per-sample 0/1 mask
            output = inputs.divide(keep_prob) * random_tensor  # divide to keep same output expectation
            return output

        def forward(self, inputs):
            return self.drop_path(inputs)


    def main():
        tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32')
        dp = DropPath(0.5)
        out = dp(tmp)
        print(out.shape)


    if __name__ == "__main__":
        main()
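
The mask trick in drop_path (floor of keep_prob plus a uniform random number) can be checked without Paddle. The sketch below is a plain-Python re-implementation for illustration only; the function name and the list-of-rows input are assumptions, and it is not the PaddleViT code.

```python
import math
import random


def drop_path_rows(rows, drop_prob, rng):
    """Drop whole rows (samples) with probability drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    out = []
    for row in rows:
        # rng.random() is in [0, 1), so the floor is 1 with probability
        # keep_prob and 0 otherwise: one 0/1 flag per sample.
        keep = math.floor(keep_prob + rng.random())
        out.append([value / keep_prob * keep for value in row])
    return out


rng = random.Random(0)
rows = [[1.0, 1.0] for _ in range(10000)]
dropped = drop_path_rows(rows, drop_prob=0.5, rng=rng)
mean = sum(row[0] for row in dropped) / len(dropped)
print(mean)  # close to 1.0: dividing by keep_prob preserves the expectation
```

Each surviving sample is scaled by 1 / keep_prob, so the expected value of the output matches the input even though about half of the samples are zeroed.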

    # Copyright (c) 2021 PPViT Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    """
    BEiT in Paddle
    A Paddle Implementation of BEiT as described in:
    "BEiT: BERT Pre-Training of Image Transformers"
    - Paper Link: https://arxiv.org/abs/2106.08254
    """
    import math
    import copy
    from functools import partial
    import numpy as np
    import paddle
    import paddle.nn as nn
    import paddle.nn.functional as F
    # from droppath import DropPath  # DropPath is defined in the cell above

    trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02)
    zeros_ = nn.initializer.Constant(value=0.0)
    ones_ = nn.initializer.Constant(value=1.0)


    class Mlp(nn.Layer):
        """MLP module
        MLP using nn.Linear and activation is GELU, dropout is applied.
        Ops: fc1 -> act -> dropout -> fc2 -> dropout
        """
        def __init__(self,
                     in_features,
                     hidden_features=None,
                     out_features=None,
                     act_layer=nn.GELU,
                     drop=0.0):
            super().__init__()
            out_features = out_features or in_features
            hidden_features = hidden_features or in_features
            self.fc1 = nn.Linear(in_features, hidden_features)
            self.act = act_layer()
            self.fc2 = nn.Linear(hidden_features, out_features)
            self.drop = nn.Dropout(drop)

        def forward(self, x):
            x = self.fc1(x)
            x = self.act(x)
            x = self.drop(x)
            x = self.fc2(x)
            x = self.drop(x)
            return x


    def main():
        tmp = paddle.to_tensor(np.random.rand(8, 16), dtype='float32')
        mlp = Mlp(16, 32, 512)
        out = mlp(tmp)
        print(out.shape)


    if __name__ == "__main__":
        main()


    class PatchEmbed(nn.Layer):
        """2D Image to Patch Embedding
        Apply patch embeddings on input images. Embeddings is implemented using a Conv2D op.
        """
        def __init__(self,
                     img_size=224,
                     patch_size=16,
                     in_chans=3,
                     embed_dim=768,
                     norm_layer=None,
                     flatten=True):
            super().__init__()
            img_size = (img_size, img_size)
            patch_size = (patch_size, patch_size)
            self.img_size = img_size
            self.patch_size = patch_size
            self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
            self.num_patches = self.grid_size[0] * self.grid_size[1]
            self.flatten = flatten
            # A conv with kernel_size == stride == patch_size projects each patch independently
            self.proj = nn.Conv2D(
                in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
            )
            self.norm = norm_layer(embed_dim) if norm_layer else Identity()

        def forward(self, x):
            B, C, H, W = x.shape
            assert (
                H == self.img_size[0] and W == self.img_size[1]
            ), f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})"
            x = self.proj(x)
            if self.flatten:
                x = x.flatten(2).transpose((0, 2, 1))  # BCHW -> BNC
            x = self.norm(x)
            return x


    class Identity(nn.Layer):
        """Identity layer
        The output of this layer is the input without any change.
        Use this layer to avoid if condition in some forward methods
        """
        def forward(self, inputs):
            return inputs


    def main():
        import numpy as np
        tmp = paddle.to_tensor(np.random.rand(16, 3, 224, 224), dtype=paddle.float32)
        patchembed = PatchEmbed(flatten=True)
        out = patchembed(tmp)
        print(out.shape)


    if __name__ == "__main__":
        main()
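
The output shape printed by main() follows directly from the grid arithmetic in PatchEmbed. A small stand-alone calculation (the helper name is hypothetical) makes it explicit:

```python
def patch_embed_shape(batch, img_size=224, patch_size=16, embed_dim=768):
    """Shape of the PatchEmbed output for a square image, as [B, N, C]."""
    grid = img_size // patch_size   # patches per side
    num_patches = grid * grid       # sequence length N
    return [batch, num_patches, embed_dim]


print(patch_embed_shape(16))  # [16, 196, 768]
```

With the defaults, a 224×224 image is cut into a 14×14 grid, so each image becomes a sequence of 196 patch embeddings of width 768.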


    class Attention(nn.Layer):
        """Attention Layer"""
        def __init__(self,
                     dim,
                     num_heads=8,
                     qkv_bias=False,
                     attn_drop=0.0,
                     proj_drop=0.0,
                     window_size=None,
                     attn_head_dim=None):
            super().__init__()
            self.num_heads = num_heads
            head_dim = dim // num_heads
            if attn_head_dim is not None:
                head_dim = attn_head_dim
            all_head_dim = head_dim * self.num_heads
            self.scale = head_dim ** -0.5
            self.qkv = nn.Linear(dim, all_head_dim * 3, bias_attr=False)
            if qkv_bias:
                # only q and v get a learnable bias; the k bias is fixed to zero in forward()
                self.q_bias = paddle.create_parameter(
                    shape=[all_head_dim], dtype="float32", default_initializer=zeros_
                )
                self.v_bias = paddle.create_parameter(
                    shape=[all_head_dim], dtype="float32", default_initializer=zeros_
                )
            else:
                self.q_bias = None
                self.v_bias = None
            if window_size:
                self.window_size = window_size
                self.num_relative_distance = (2 * window_size[0] - 1) * (
                    2 * window_size[1] - 1
                ) + 3
                self.relative_position_bias_table = paddle.create_parameter(
                    shape=[self.num_relative_distance, num_heads],
                    dtype="float32",
                    default_initializer=zeros_,
                )  # 2*Wh-1 * 2*Ww-1, nH
                # the 3 extra entries: cls to token & token to cls & cls to cls
                # get pair-wise relative position index for each token inside the window
                coords_h = paddle.arange(window_size[0])
                coords_w = paddle.arange(window_size[1])
                coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
                coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
                relative_coords = coords_flatten.unsqueeze(
                    axis=2
                ) - coords_flatten.unsqueeze(
                    axis=1
                )  # 2, Wh*Ww, Wh*Ww
                relative_coords = relative_coords.transpose([1, 2, 0])  # Wh*Ww, Wh*Ww, 2
                relative_coords[:, :, 0] += window_size[0] - 1  # shift to start from 0
                relative_coords[:, :, 1] += window_size[1] - 1
                relative_coords[:, :, 0] *= 2 * window_size[1] - 1
                relative_position_index = paddle.zeros(
                    [
                        window_size[0] * window_size[1] + 1,
                        window_size[0] * window_size[1] + 1,
                    ],
                    dtype=relative_coords.dtype,
                )
                # Wh*Ww, Wh*Ww
                relative_position_index[1:, 1:] = relative_coords.sum(-1)
                relative_position_index[0, 0:] = self.num_relative_distance - 3
                relative_position_index[0:, 0] = self.num_relative_distance - 2
                relative_position_index[0, 0] = self.num_relative_distance - 1
                self.register_buffer("relative_position_index", relative_position_index)
            else:
                self.window_size = None
                self.relative_position_bias_table = None
                self.relative_position_index = None
            self.attn_drop = nn.Dropout(attn_drop)
            self.proj = nn.Linear(all_head_dim, dim)
            self.proj_drop = nn.Dropout(proj_drop)

        def forward(self, x, rel_pos_bias):
            B, N, C = x.shape
            qkv_bias = None
            if self.q_bias is not None:
                qkv_bias = paddle.concat(
                    (self.q_bias, paddle.zeros_like(self.v_bias), self.v_bias)
                )
            qkv = F.linear(x=x, weight=self.qkv.weight, bias=qkv_bias)
            # paddle.shape(x) keeps the batch/sequence dims dynamic
            qkv = qkv.reshape([paddle.shape(x)[0], paddle.shape(x)[1], 3, self.num_heads, -1]).transpose([2, 0, 3, 1, 4])
            q, k, v = qkv[0], qkv[1], qkv[2]
            q = q * self.scale
            attn = q @ k.transpose([0, 1, 3, 2])
            if self.relative_position_bias_table is not None:
                relative_position_bias = self.relative_position_bias_table[
                    self.relative_position_index.reshape([-1])
                ].reshape(
                    [
                        self.window_size[0] * self.window_size[1] + 1,
                        self.window_size[0] * self.window_size[1] + 1,
                        -1,
                    ]
                )  # Wh*Ww,Wh*Ww,nH
                relative_position_bias = relative_position_bias.transpose(
                    [2, 0, 1]
                )  # nH, Wh*Ww, Wh*Ww
                attn = attn + relative_position_bias.unsqueeze(axis=0)
            if rel_pos_bias is not None:
                attn = attn + rel_pos_bias
            attn = F.softmax(attn, axis=-1)
            attn = self.attn_drop(attn)
            x = (attn @ v).transpose([0, 2, 1, 3]).reshape([paddle.shape(x)[0], paddle.shape(x)[1], -1])
            x = self.proj(x)
            x = self.proj_drop(x)
            return x


    def main():
        import numpy as np
        tmp = paddle.to_tensor(np.random.rand(196, 16, 768), dtype=paddle.float32)
        attention = Attention(dim=768)
        # a scalar bias is passed here only to exercise the rel_pos_bias code path
        out = attention(tmp, rel_pos_bias=0.1)
        print(out.shape)


    if __name__ == "__main__":
        main()
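
The core of forward() is scaled dot-product attention with an additive bias on the scores. The plain-Python sketch below (single head, lists instead of tensors, hypothetical function names) illustrates that step only; it is not the Paddle implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, bias=None):
    """q, k, v: lists of N vectors; bias: optional N x N list added to the scores."""
    scale = len(q[0]) ** -0.5
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) * scale for k_j in k]
        if bias is not None:
            scores = [s + bias[i][j] for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0], [4.0]]
plain = attention(q, k, v)
shifted = attention(q, k, v, bias=[[0.1, 0.1], [0.1, 0.1]])
print(plain, shifted)
```

Because softmax is invariant to adding the same constant to every score in a row, a constant bias such as the scalar 0.1 used in main() leaves the output unchanged up to floating-point rounding; only a bias that differs between positions changes the attention weights.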


    class Block(nn.Layer):
        def __init__(self,
                     dim,
                     num_heads,
                     mlp_ratio=4.0,
                     qkv_bias=False,
                     drop=0.0,
                     attn_drop=0.0,
                     drop_path=0.0,
                     init_values=None,
                     act_layer=nn.GELU,
                     norm_layer=nn.LayerNorm,
                     window_size=None,
                     attn_head_dim=None):
            super().__init__()
            self.norm1 = norm_layer(dim)
            self.attn = Attention(
                dim,
                num_heads=num_heads,
                qkv_bias=qkv_bias,
                attn_drop=attn_drop,
                proj_drop=drop,
                window_size=window_size,
                attn_head_dim=attn_head_dim,
            )
            self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity()
            self.norm2 = norm_layer(dim)
            mlp_hidden_dim = int(dim * mlp_ratio)
            self.mlp = Mlp(
                in_features=dim,
                hidden_features=mlp_hidden_dim,
                act_layer=act_layer,
                drop=drop,
            )
            if init_values:
                # learnable per-channel scales for the two residual branches
                self.gamma_1 = paddle.create_parameter(
                    shape=[dim],
                    dtype="float32",
                    default_initializer=nn.initializer.Constant(value=init_values),
                )
                self.gamma_2 = paddle.create_parameter(
                    shape=[dim],
                    dtype="float32",
                    default_initializer=nn.initializer.Constant(value=init_values),
                )
            else:
                self.gamma_1, self.gamma_2 = None, None

        def forward(self, x, rel_pos_bias):
            if self.gamma_1 is None:
                x = x + self.drop_path(self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias))
                x = x + self.drop_path(self.mlp(self.norm2(x)))
            else:
                x = x + self.drop_path(
                    self.gamma_1 * self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias)
                )
                x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
            return x


    def main():
        import numpy as np
        tmp = paddle.to_tensor(np.random.rand(196, 16, 768), dtype=paddle.float32)
        block = Block(dim=768, num_heads=12)
        out = block(tmp, rel_pos_bias=0.1)
        print(out.shape)


    if __name__ == "__main__":
        main()



    class RelativePositionBias(nn.Layer):
        def __init__(self, window_size, num_heads):
            super().__init__()
            self.window_size = window_size
            self.num_relative_distance = (2 * window_size[0] - 1) * (
                2 * window_size[1] - 1
            ) + 3
            self.relative_position_bias_table = paddle.create_parameter(
                shape=[self.num_relative_distance, num_heads],
                dtype="float32",
                default_initializer=zeros_,
            )  # 2*Wh-1 * 2*Ww-1, nH
            # the 3 extra entries: cls to token & token to cls & cls to cls
            # get pair-wise relative position index for each token inside the window
            coords_h = paddle.arange(window_size[0])
            coords_w = paddle.arange(window_size[1])
            coords = paddle.stack(paddle.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
            coords_flatten = paddle.flatten(coords, 1)  # 2, Wh*Ww
            relative_coords = coords_flatten.unsqueeze(axis=2) - coords_flatten.unsqueeze(
                axis=1
            )  # 2, Wh*Ww, Wh*Ww
            relative_coords = relative_coords.transpose([1, 2, 0])  # Wh*Ww, Wh*Ww, 2
            relative_coords[:, :, 0] += window_size[0] - 1  # shift to start from 0
            relative_coords[:, :, 1] += window_size[1] - 1
            relative_coords[:, :, 0] *= 2 * window_size[1] - 1
            # integer dtype (as in Attention above) so the buffer can be used as an index
            relative_position_index = paddle.zeros(
                [window_size[0] * window_size[1] + 1, window_size[0] * window_size[1] + 1],
                dtype=relative_coords.dtype,
            )
            relative_position_index[1:, 1:] = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
            relative_position_index[0, 0:] = self.num_relative_distance - 3
            relative_position_index[0:, 0] = self.num_relative_distance - 2
            relative_position_index[0, 0] = self.num_relative_distance - 1
            self.register_buffer("relative_position_index", relative_position_index)

        def forward(self):
            # reshape takes the target shape as a single list
            relative_position_bias = self.relative_position_bias_table[
                self.relative_position_index.reshape([-1])].reshape(
                    [self.window_size[0] * self.window_size[1] + 1,
                     self.window_size[0] * self.window_size[1] + 1, -1])  # Wh*Ww,Wh*Ww,nH
            return relative_position_bias.transpose([2, 0, 1])  # nH, Wh*Ww, Wh*Ww
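
The index arithmetic above is easier to follow on a tiny window. The sketch below rebuilds the same index table with plain Python lists; it mirrors the logic of the class for illustration and is not the Paddle code itself (the function name is hypothetical).

```python
def relative_position_index(wh, ww):
    """Return (num_relative_distance, index table) for a wh x ww patch grid.

    Row/column 0 of the table belongs to the cls token; the rest are patches
    in row-major order.
    """
    num_rel = (2 * wh - 1) * (2 * ww - 1) + 3
    coords = [(h, w) for h in range(wh) for w in range(ww)]
    n = wh * ww + 1
    index = [[0] * n for _ in range(n)]
    for i, (hi, wi) in enumerate(coords):
        for j, (hj, wj) in enumerate(coords):
            dh = hi - hj + wh - 1          # shift row offset to start from 0
            dw = wi - wj + ww - 1          # shift column offset to start from 0
            index[i + 1][j + 1] = dh * (2 * ww - 1) + dw
    for j in range(n):
        index[0][j] = num_rel - 3          # cls to token
    for i in range(n):
        index[i][0] = num_rel - 2          # token to cls
    index[0][0] = num_rel - 1              # cls to cls
    return num_rel, index


num_rel, index = relative_position_index(2, 2)
print(num_rel)  # 12
for row in index:
    print(row)
```

For the 14×14 grid of the base model the table has (2·14 − 1)² + 3 = 732 entries per head, shared by all pairs of positions with the same relative offset.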


    class Beit(nn.Layer):
        """Beit Layer"""
        def __init__(self,
                     img_size=224,
                     patch_size=16,
                     in_chans=3,
                     num_classes=1000,
                     embed_dim=768,
                     depth=12,
                     num_heads=12,
                     mlp_ratio=4.0,
                     qkv_bias=True,
                     drop_rate=0.0,
                     attn_drop_rate=0.0,
                     drop_path_rate=0.0,
                     norm_layer=partial(nn.LayerNorm, epsilon=1e-6),
                     init_values=None,
                     use_abs_pos_emb=True,
                     use_rel_pos_bias=False,
                     use_shared_rel_pos_bias=False,
                     use_mean_pooling=True,
                     init_scale=0.001):
            super().__init__()
            self.num_classes = num_classes
            # num_features for consistency with other models
            self.num_features = self.embed_dim = embed_dim
            self.patch_embed = PatchEmbed(
                img_size=img_size,
                patch_size=patch_size,
                in_chans=in_chans,
                embed_dim=embed_dim,
            )
            num_patches = self.patch_embed.num_patches
            self.cls_token = paddle.create_parameter(
                shape=[1, 1, embed_dim],
                dtype="float32",
                default_initializer=trunc_normal_,
            )
            if use_abs_pos_emb:
                self.pos_embed = paddle.create_parameter(
                    shape=[1, num_patches + 1, embed_dim],
                    dtype="float32",
                    default_initializer=trunc_normal_,
                )
            else:
                self.pos_embed = None
            self.pos_drop = nn.Dropout(p=drop_rate)
            if use_shared_rel_pos_bias:
                self.rel_pos_bias = RelativePositionBias(
                    window_size=self.patch_embed.grid_size, num_heads=num_heads
                )
            else:
                self.rel_pos_bias = None
            # stochastic depth decay rule
            dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, depth)]
            self.use_rel_pos_bias = use_rel_pos_bias
            self.blocks = nn.LayerList(
                [
                    Block(
                        dim=embed_dim,
                        num_heads=num_heads,
                        mlp_ratio=mlp_ratio,
                        qkv_bias=qkv_bias,
                        drop=drop_rate,
                        attn_drop=attn_drop_rate,
                        drop_path=dpr[i],
                        norm_layer=norm_layer,
                        init_values=init_values,
                        window_size=self.patch_embed.grid_size if use_rel_pos_bias else None,
                    )
                    for i in range(depth)
                ]
            )
            self.norm = Identity() if use_mean_pooling else norm_layer(embed_dim)
            self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
            self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity()
            self.apply(self._init_weights)
            self.fix_init_weight()
            if isinstance(self.head, nn.Linear):
                trunc_normal_(self.head.weight)
                self.head.weight.set_value(
                    self.head.weight.multiply(paddle.to_tensor(init_scale))
                )
                self.head.bias.set_value(
                    self.head.bias.multiply(paddle.to_tensor(init_scale))
                )

        def fix_init_weight(self):
            def rescale(param, layer_id):
                param.set_value(param.divide(paddle.to_tensor(math.sqrt(2.0 * layer_id))))

            for layer_id, layer in enumerate(self.blocks):
                rescale(layer.attn.proj.weight, layer_id + 1)
                rescale(layer.mlp.fc2.weight, layer_id + 1)

        def _init_weights(self, m):
            if isinstance(m, nn.Linear):
                trunc_normal_(m.weight)
                if m.bias is not None:
                    zeros_(m.bias)
            elif isinstance(m, nn.LayerNorm):
                zeros_(m.bias)
                ones_(m.weight)

        def get_num_layers(self):
            return len(self.blocks)

        def get_classifier(self):
            return self.head

        def reset_classifier(self, num_classes):
            self.num_classes = num_classes
            self.head = (
                nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else Identity()
            )

        def forward_features(self, x):
            x = self.patch_embed(x)
            # paddle.shape(x)[0] keeps the batch dimension dynamic
            cls_tokens = self.cls_token.expand([paddle.shape(x)[0], 1, self.embed_dim])
            x = paddle.concat((cls_tokens, x), axis=1)
            if self.pos_embed is not None:
                x = x + self.pos_embed
            x = self.pos_drop(x)
            rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
            for blk in self.blocks:
                x = blk(x, rel_pos_bias=rel_pos_bias)
            x = self.norm(x)
            if self.fc_norm is not None:
                # mean pooling over the patch tokens (cls token excluded)
                t = x[:, 1:, :]
                return self.fc_norm(t.mean(1))
            return x[:, 0]

        def forward(self, x):
            x = self.forward_features(x)
            x = self.head(x)
            return x


    def build_beit(config):
        """ build beit from config"""
        model = Beit(
            img_size=config.DATA.IMAGE_SIZE,
            num_classes=config.MODEL.NUM_CLASSES,
            patch_size=config.MODEL.PATCH_SIZE,
            embed_dim=config.MODEL.EMBED_DIM,
            depth=config.MODEL.DEPTH,
            num_heads=config.MODEL.NUM_HEADS,
            mlp_ratio=config.MODEL.MLP_RATIO,
            use_abs_pos_emb=config.MODEL.USE_ABS_POS_EMB,
            use_rel_pos_bias=config.MODEL.USE_REL_POS_BIAS,
            init_values=config.MODEL.INIT_VALUES,
            qkv_bias=config.MODEL.QKV_BIAS,
        )
        return model
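
fix_init_weight divides the output projections of block number layer_id (counted from 1) by sqrt(2 · layer_id), so deeper blocks start with smaller residual contributions. A quick plain-Python calculation of those factors (the helper name is hypothetical):

```python
import math


def rescale_factors(depth=12):
    """Multiplicative factor applied to attn.proj and mlp.fc2 weights per block."""
    return [1.0 / math.sqrt(2.0 * layer_id) for layer_id in range(1, depth + 1)]


factors = rescale_factors()
print([round(f, 4) for f in factors])
```

The first block is scaled by about 0.71 and the twelfth by about 0.20.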


!pip install yacs -q

  1. # Copyright (c) 2021 PPViT Authors. All Rights Reserved.
  2. #
  3. # Licensed under the Apache License, Version 2.0 (the "License");
  4. # you may not use this file except in compliance with the License.
  5. # You may obtain a copy of the License at
  6. #
  7. # http://www.apache.org/licenses/LICENSE-2.0
  8. #
  9. # Unless required by applicable law or agreed to in writing, software
  10. # distributed under the License is distributed on an "AS IS" BASIS,
  11. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  12. # See the License for the specific language governing permissions and
  13. # limitations under the License.
  14. """Configuration
  15. Configurations for (1) data processing, (2) model archtecture, and (3) training settings, etc.
  16. Config can be set by .yaml file or by argparser
  17. """
  18. import os
  19. from yacs.config import CfgNode as CN
  20. import yaml
  21. _C = CN()
  22. _C.BASE = ['']
  23. # data settings
  24. _C.DATA = CN()
  25. _C.DATA.BATCH_SIZE = 256 # train batch_size on single GPU
  26. _C.DATA.BATCH_SIZE_EVAL = None # (disabled in update_config) val batch_size on single GPU
  27. _C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset
  28. _C.DATA.DATASET = 'imagenet2012' # dataset name, currently only support imagenet2012
  29. _C.DATA.IMAGE_SIZE = 224 # input image size e.g., 224
  30. _C.DATA.SECOND_IMAGE_SIZE = 112 # 2nd input image size e.g., 112
  31. _C.DATA.IMAGE_CHANNELS = 3 # input image channels: e.g., 3
  32. _C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode
  33. _C.DATA.NUM_WORKERS = 1 # number of data loading threads
  34. _C.DATA.IMAGENET_MEAN = [0.5, 0.5, 0.5] # [0.485, 0.456, 0.406] # imagenet mean values
  35. _C.DATA.IMAGENET_STD = [0.5, 0.5, 0.5] # [0.229, 0.224, 0.225] # imagenet std values
  36. # model general settings
  37. _C.MODEL = CN()
  38. _C.MODEL.TYPE = 'beit'
  39. _C.MODEL.VAE_TYPE = 'dall-e'
  40. _C.MODEL.NAME = 'beit'
  41. _C.MODEL.RESUME = None # full model path for resume training
  42. _C.MODEL.PRETRAINED = None # full model path for finetuning
  43. _C.MODEL.NUM_CLASSES = 10 # num of classes for classifier # 1000
  44. _C.MODEL.DROPOUT = 0.0
  46. _C.MODEL.DROPPATH = 0.1
  47. # model transformer settings
  48. _C.MODEL.PATCH_SIZE = 16
  49. _C.MODEL.EMBED_DIM = 768
  50. _C.MODEL.NUM_HEADS = 12
  51. _C.MODEL.ATTN_HEAD_SIZE = None # if None, use embed_dim // num_heads as head dim
  52. _C.MODEL.DEPTH = 12
  53. _C.MODEL.QK_SCALE = None
  54. _C.MODEL.QKV_BIAS = True
  55. _C.MODEL.MLP_RATIO = 4.0 # for cait class_token ratio also set to MLP_RATIO
  56. _C.MODEL.USE_ABS_POS_EMB = False
  58. _C.MODEL.INIT_VALUES = 1e-4
  59. # training settings
  60. _C.TRAIN = CN()
  62. _C.TRAIN.NUM_EPOCHS = 100
  64. _C.TRAIN.WEIGHT_DECAY = 0.05
  65. _C.TRAIN.LAYER_DECAY = 0.65
  66. _C.TRAIN.BASE_LR = 4e-3
  68. _C.TRAIN.END_LR = 1e-6
  69. _C.TRAIN.GRAD_CLIP = None
  72. # optimizer
  76. _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999)
  77. # model ema
  78. _C.TRAIN.MODEL_EMA = True
  79. _C.TRAIN.MODEL_EMA_DECAY = 0.9999
  81. # data augmentation (optional, check datasets.py)
  82. _C.TRAIN.SMOOTHING = 0.1
  83. _C.TRAIN.COLOR_JITTER = 0.4 # if both auto augment and rand augment are False, use color jitter
  84. _C.TRAIN.AUTO_AUGMENT = False # rand augment is used if both rand and auto augment are set True
  87. _C.TRAIN.RAND_AUGMENT_MAGNITUDE = 9 # scale from 0 to 9
  88. # mixup params (optional, check datasets.py)
  89. _C.TRAIN.MIXUP_ALPHA = 0.8
  90. _C.TRAIN.MIXUP_PROB = 1.0
  92. _C.TRAIN.MIXUP_MODE = 'batch'
  95. # random erase params (optional, check datasets.py)
  97. _C.TRAIN.RANDOM_ERASE_MODE = 'pixel'
  100. # misc
  101. _C.SAVE = "./output" # output folder, saves logs and weights
  102. _C.SAVE_FREQ = 15 # freq to save chpt
_C.REPORT_FREQ = 20 # freq to logging info
_C.VALIDATE_FREQ = 1 # freq to do validation
_C.SEED = 0 # random seed
_C.EVAL = False # run evaluation only
_C.AMP = False # auto mix precision training


def _update_config_from_file(config, cfg_file):
    """Load cfg file (.yaml) and update config object

    Args:
        config: config object
        cfg_file: config file (.yaml)
    Return:
        None
    """
    config.defrost()
    with open(cfg_file, 'r') as infile:
        yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader)
    for cfg in yaml_cfg.setdefault('BASE', ['']):
        if cfg:
            _update_config_from_file(
                config, os.path.join(os.path.dirname(cfg_file), cfg)
            )
    config.merge_from_file(cfg_file)
    config.freeze()


def update_config(config, args):
    """Update config by ArgumentParser

    Configs that are often used can be updated from arguments
    Args:
        args: ArgumentParser contains options
    Return:
        config: updated config
    """
    if args.cfg:
        _update_config_from_file(config, args.cfg)
    config.defrost()
    if args.dataset:
        config.DATA.DATASET = args.dataset
    if args.batch_size:
        config.DATA.BATCH_SIZE = args.batch_size
        config.DATA.BATCH_SIZE_EVAL = args.batch_size
    if args.batch_size_eval:
        config.DATA.BATCH_SIZE_EVAL = args.batch_size_eval
    if args.image_size:
        config.DATA.IMAGE_SIZE = args.image_size
    if args.accum_iter:
        config.TRAIN.ACCUM_ITER = args.accum_iter
    if args.data_path:
        config.DATA.DATA_PATH = args.data_path
    if args.output:
        config.SAVE = args.output
    if args.eval:
        config.EVAL = True
    if args.pretrained:
        config.MODEL.PRETRAINED = args.pretrained
    if args.resume:
        config.MODEL.RESUME = args.resume
    if args.last_epoch:
        config.TRAIN.LAST_EPOCH = args.last_epoch
    if args.amp: # only for training
        config.AMP = not config.EVAL
    # config.freeze()
    return config


def get_config(cfg_file=None):
    """Return a clone of config and optionally overwrite it from yaml file"""
    config = _C.clone()
    if cfg_file:
        _update_config_from_file(config, cfg_file)
    return config
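
The recursive `BASE` handling in `_update_config_from_file` can be hard to follow from the yacs calls alone. Below is a minimal sketch of the same idea using plain dictionaries: base files are applied first, then the current file overrides them. The in-memory `FILES` table, the file names, and the helpers `merge` and `update_from_file` are hypothetical stand-ins, not part of PaddleViT or yacs, and unlike the real code the sketch skips the `BASE` key when merging.

```python
# Hypothetical in-memory stand-ins for .yaml files (not real PaddleViT configs).
FILES = {
    "base.yaml": {"DATA": {"IMAGE_SIZE": 224, "BATCH_SIZE": 256}},
    "finetune.yaml": {"BASE": ["base.yaml"], "DATA": {"BATCH_SIZE": 32}},
}


def merge(dst, src):
    """Recursively copy the values of src into dst, overriding existing keys."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            merge(dst[key], value)
        else:
            dst[key] = value


def update_from_file(config, name):
    """Apply every file listed under BASE first, then the file itself."""
    cfg = FILES[name]
    for base in cfg.get("BASE", []):
        if base:
            update_from_file(config, base)
    merge(config, {k: v for k, v in cfg.items() if k != "BASE"})


# Defaults play the role of the _C node.
config = {"DATA": {"IMAGE_SIZE": 224, "BATCH_SIZE": 8}, "SEED": 0}
update_from_file(config, "finetune.yaml")
print(config)
```

Because the base file is merged before the file that names it, the more specific file always wins: here `BATCH_SIZE` ends up as 32, while `IMAGE_SIZE` and `SEED` keep their earlier values.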



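The only change needed is in the call that assigns arguments: pass parse_args an explicit list containing at least one argument. In a notebook, calling parser.parse_args() with no list parses the kernel's own command line instead, which typically fails. arguments = parser.parse_args(['-cfg', "beit_base_patch16_224.yaml"])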

import argparse


def get_arguments():
    """return arguments, this will overwrite the config by (1) yaml file (2) argument values"""
    parser = argparse.ArgumentParser('BEiT finetune')
    parser.add_argument('-cfg', type=str, default=None)
    parser.add_argument('-dataset', type=str, default=None)
    parser.add_argument('-data_path', type=str, default=None)
    parser.add_argument('-output', type=str, default=None)
    parser.add_argument('-batch_size', type=int, default=None)
    parser.add_argument('-batch_size_eval', type=int, default=None)
    parser.add_argument('-image_size', type=int, default=None)
    parser.add_argument('-accum_iter', type=int, default=None)
    parser.add_argument('-pretrained', type=str, default=None)
    parser.add_argument('-resume', type=str, default=None)
    parser.add_argument('-last_epoch', type=int, default=None)
    parser.add_argument('-eval', action='store_true')
    parser.add_argument('-amp', action='store_true')
    # Pass the argument list explicitly instead of reading sys.argv
    arguments = parser.parse_args(['-cfg', "BEiT/beit_base_patch16_224.yaml"])
    return arguments


config = update_config(get_config(), get_arguments())
build_model = build_beit
model = build_model(config)
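
The explicit-argument-list pattern above can be tried in isolation. This is a minimal sketch with only three of the options; the helper name `get_demo_arguments` is hypothetical, and nothing here touches the real config.

```python
import argparse


def get_demo_arguments(argv):
    """Parse an explicit list of arguments instead of reading sys.argv."""
    parser = argparse.ArgumentParser('demo')
    parser.add_argument('-cfg', type=str, default=None)
    parser.add_argument('-batch_size', type=int, default=None)
    parser.add_argument('-eval', action='store_true')
    return parser.parse_args(argv)


args = get_demo_arguments(['-cfg', 'beit_base_patch16_224.yaml'])
print(args.cfg, args.batch_size, args.eval)
# prints: beit_base_patch16_224.yaml None False
```

Options that are not in the list keep their defaults (`None` or `False`), which is why `update_config` only overrides the config fields whose arguments were actually given.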


Feeding a random Tensor to the model shows that the output shape is [8, 1000], where 8 is the batch_size and 1000 is the number of classes.


images = paddle.randn([8, 3, 224, 224])
label = 2
output = model(images)
print(output.shape)






drop_path_rate = 0.5
depth = 8
# depth evenly spaced values from 0 to drop_path_rate (both ends included)
tmp = paddle.linspace(0, drop_path_rate, depth)
print(tmp)
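
To see what the `paddle.linspace` call above computes without running Paddle, here is a plain-Python sketch. The helper `linspace` is a hypothetical re-implementation, and reading the values as per-block stochastic-depth rates is an assumption based on how `drop_path_rate` and `depth` are usually used in Transformer code, not something the notebook states.

```python
def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop, both ends included."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


drop_path_rate = 0.5
depth = 8
rates = linspace(0.0, drop_path_rate, depth)
for block_index, rate in enumerate(rates):
    print(block_index, round(rate, 4))
```

The first value is 0 and the last is `drop_path_rate`, so under that reading deeper blocks would be dropped more often than shallow ones.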




import paddle

# x = paddle.randn((3, 2), dtype="float32")
x = paddle.ones([3, 2]) * 2
# x: [[2. 2.]
#     [2. 2.]
#     [2. 2.]]
weight = paddle.full(shape=[2, 4], fill_value=0.5, dtype="float32", name="weight")
weight = weight * 4
# weight: [[2. 2. 2. 2.]
#          [2. 2. 2. 2.]]
bias = paddle.ones(shape=[4], dtype="float32", name="bias")
bias = bias + 0.88
# bias: [1.88 1.88 1.88 1.88]
y = paddle.nn.functional.linear(x, weight, bias)
# y = x @ weight + bias, so every entry is 2*2 + 2*2 + 1.88 = 9.88
print(x.shape, y.shape)
# element-wise comparison with the explicit matmul form
print(y == paddle.matmul(x, weight) + bias)
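
The arithmetic behind the `linear` call above can be checked by hand. This is a plain-Python sketch of y = x · weight + bias on nested lists, using the same values as the Paddle snippet; the helper `linear` here is a hypothetical stand-in, not the Paddle function, and it assumes the weight is laid out as [in_features, out_features] as in the snippet.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for nested lists; weight is [in, out]."""
    in_features = len(weight)
    out_features = len(bias)
    return [
        [sum(row[k] * weight[k][j] for k in range(in_features)) + bias[j]
         for j in range(out_features)]
        for row in x
    ]


x = [[2.0, 2.0] for _ in range(3)]      # shape [3, 2]
weight = [[2.0] * 4 for _ in range(2)]  # shape [2, 4]
bias = [1.88] * 4                       # shape [4]
y = linear(x, weight, bias)
print(len(y), len(y[0]))  # prints: 3 4
```

Each output entry sums two products of 2 × 2 and adds the bias, giving approximately 9.88 in every position of the [3, 4] result.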



window_size = [3, 4]
coords_h = paddle.arange(window_size[0])
coords_w = paddle.arange(window_size[1])
# print(coords_h, coords_w)
coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # 2, Wh, Ww
print(coords)
coords_flatten = paddle.flatten(coords, 1) # 2, Wh*Ww
print(coords_flatten)

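The flattened coordinate tensor is unsqueezed once at axis=2 and once at axis=1, and the two results are subtracted. Broadcasting turns shapes [2, Wh*Ww, 1] and [2, 1, Wh*Ww] into a 3-D tensor of relative coordinates with shape [2, Wh*Ww, Wh*Ww].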

relative_coords = coords_flatten.unsqueeze(axis=2) - coords_flatten.unsqueeze(axis=1)
relative_coords
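
The broadcasted subtraction above is easier to picture with explicit loops. Below is a plain-Python sketch for the same 3 × 4 window; it assumes the row-major ordering that `meshgrid` followed by `flatten` produces in the snippet, and the names simply mirror the Paddle variables.

```python
window_size = (3, 4)
Wh, Ww = window_size
n = Wh * Ww

# Flattened coordinates: row 0 holds the h index, row 1 the w index of each position.
coords_flatten = [
    [h for h in range(Wh) for _ in range(Ww)],
    [w for _ in range(Wh) for w in range(Ww)],
]

# relative_coords[c][i][j] = coords_flatten[c][i] - coords_flatten[c][j]
relative_coords = [
    [[axis[i] - axis[j] for j in range(n)] for i in range(n)]
    for axis in coords_flatten
]

print(len(relative_coords), len(relative_coords[0]), len(relative_coords[0][0]))
# prints: 2 12 12
print(relative_coords[0][0][11], relative_coords[1][0][11])
# prints: -2 -3
```

Position 0 is the top-left cell and position 11 the bottom-right one, so their offset is -2 rows and -3 columns; swapping the two positions flips both signs.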


import paddle
import paddle.nn as nn

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))


def init_weights(layer):
    if type(layer) == nn.Linear:
        print('before init weight:', layer.weight.numpy())
        new_weight = paddle.full(shape=layer.weight.shape, dtype=layer.weight.dtype, fill_value=0.9)
        layer.weight.set_value(new_weight)
        print('after init weight:', layer.weight.numpy())


net.apply(init_weights)
print(net.state_dict())


paddle.expand expands x to the shape given by shape; after the expansion, the shape of x matches the requested shape.

import paddle

data = paddle.to_tensor([1, 2, 3], dtype='int32')
out = paddle.expand(data, shape=[2, 3])
print(out)
# [[1, 2, 3], [1, 2, 3]]
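
For the 1-D case shown above, expanding amounts to repeating the row. The following is a plain-Python sketch of just that case; the helper `expand_rows` is hypothetical and covers far less than `paddle.expand` does.

```python
def expand_rows(data, shape):
    """Repeat a 1-D list along a new leading axis to reach shape [rows, cols]."""
    rows, cols = shape
    if len(data) != cols:
        raise ValueError("last dimension of shape must match len(data)")
    return [list(data) for _ in range(rows)]


out = expand_rows([1, 2, 3], [2, 3])
print(out)
# prints: [[1, 2, 3], [1, 2, 3]]
```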


Error: module 'paddlenlp.ops.optimizer' has no attribute 'AdamWDL'


Error: cannot import name 'load_dataset' from 'datasets'

[2022-05-05 22:35:44,247] [ WARNING] - Detected that datasets module was imported before paddlenlp. This may cause PaddleNLP datasets to be unavalible in intranetPlease import paddlenlp before datasets module to avoid download issues
...
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/datasets/dataset.py", line 48, in <module>
    from datasets import load_dataset as origin_load_dataset
ImportError: cannot import name 'load_dataset' from 'datasets' (/home/aistudio/BEiT/datasets.py)

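The traceback shows that paddlenlp's `from datasets import load_dataset` resolves to the project's own /home/aistudio/BEiT/datasets.py, which has no load_dataset. I could not get this to work, so I stopped using paddlenlp altogether: the functions I needed are written out separately in a tmpadam directory and imported with import tmpadam, and the training code then uses optimizer = tmpadam.AdamWDL.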
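
The path in the ImportError points at a local file, which suggests that the project's `datasets.py` shadows the installed `datasets` package. This is a minimal sketch that reproduces that kind of shadowing in a temporary directory. The module name `demo_datasets` and both directories are hypothetical, chosen so the demo cannot interfere with any real package.

```python
import importlib
import os
import sys
import tempfile


def write_file(path, text):
    """Create parent directories as needed and write text to path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)


root = tempfile.mkdtemp()
site_dir = os.path.join(root, "site")        # stands in for site-packages
project_dir = os.path.join(root, "project")  # stands in for the project folder

# The "library" version provides load_dataset; the local file does not.
write_file(os.path.join(site_dir, "demo_datasets", "__init__.py"),
           "def load_dataset():\n    return 'library'\n")
write_file(os.path.join(project_dir, "demo_datasets.py"), "LOCAL = True\n")

sys.path.insert(0, site_dir)
sys.path.insert(0, project_dir)  # searched first, like the notebook's working directory
importlib.invalidate_caches()

mod = importlib.import_module("demo_datasets")
shadowed = not hasattr(mod, "load_dataset")
module_file = os.path.basename(mod.__file__)
print(module_file, shadowed)
# prints: demo_datasets.py True

# Clean up so the demo leaves no trace in the interpreter state.
sys.path.remove(project_dir)
sys.path.remove(site_dir)
del sys.modules["demo_datasets"]
```

Python takes the first match on `sys.path`, so the local file wins even though a fuller package exists later on the path. One common remedy, which was not the one used here, is to rename the local file.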


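I could not tell whether this was only a display problem or whether the job had really hung. Switching to the unzip command hung as well. In the end I gave up on background-task mode and ran everything in the notebook; fortunately the run only takes about 2 hours, so doing without the background task makes little difference.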


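Running the training raised an error about mismatched shapes. I checked the configuration carefully and found nothing wrong. The cause turned out to be the Mixup function, whose default parameter is num_classes=1000. Passing num_classes=config.TRAIN.NUM_CLASSES in the code fixed the problem.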

if (config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or
        config.TRAIN.CUTMIX_MINMAX is not None):
    mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA,
                     cutmix_alpha=config.TRAIN.CUTMIX_ALPHA,
                     cutmix_minmax=config.TRAIN.CUTMIX_MINMAX,
                     prob=config.TRAIN.MIXUP_PROB,
                     switch_prob=config.TRAIN.MIXUP_SWITCH_PROB,
                     mode=config.TRAIN.MIXUP_MODE,
                     label_smoothing=config.TRAIN.SMOOTHING,
                     num_classes=config.TRAIN.NUM_CLASSES)
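
Why a wrong `num_classes` leads to a shape error can be seen from how mixup builds its soft targets. The following plain-Python sketch assumes the common formulation in which two label-smoothed one-hot vectors are blended with a weight `lam`; the helpers `smoothed_one_hot` and `mixup_target` are hypothetical and are not the `Mixup` class used in the project.

```python
def smoothed_one_hot(label, num_classes, smoothing):
    """One-hot vector with label smoothing spread over all classes."""
    off_value = smoothing / num_classes
    on_value = 1.0 - smoothing + off_value
    return [on_value if c == label else off_value for c in range(num_classes)]


def mixup_target(label_a, label_b, lam, num_classes, smoothing=0.1):
    """Blend the soft targets of two samples with weight lam."""
    a = smoothed_one_hot(label_a, num_classes, smoothing)
    b = smoothed_one_hot(label_b, num_classes, smoothing)
    return [lam * p + (1.0 - lam) * q for p, q in zip(a, b)]


num_model_outputs = 10  # e.g. a 10-class dataset
wrong = mixup_target(3, 7, lam=0.7, num_classes=1000)  # the default that caused the error
right = mixup_target(3, 7, lam=0.7, num_classes=num_model_outputs)
print(len(wrong), len(right))
# prints: 1000 10
```

A 1000-element target cannot be compared against 10 model outputs in the loss, whereas a target built with the dataset's own class count lines up with the logits.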


