Paper title: An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale
Original paper download link: https://arxiv.org/abs/2010.11929
The code in this post directly generates line plots of the training-set and test-set loss and accuracy, which is convenient when writing up a paper.
The Transformer was first applied in NLP, where it achieved enormous success. NLP and CV are the two most widely used application areas of deep learning, and the two fields have long borrowed techniques from each other, so the Transformer's success in NLP naturally led researchers to ask whether it could also be applied to CV. The Vision Transformer was the result, and, as the researchers expected, it made a big splash in CV as well: its structure differs from that of traditional convolutional neural networks and brings new ideas to CV research.
In this post we take a look at the Vision Transformer. In theory it should outperform traditional convolutional neural networks, but only in theory: the actual outcome depends on the dataset and on how the model's hyperparameters are tuned.
First, let's look at how it actually performs on various datasets.
Comparing with the state of the art on popular image-classification benchmarks, the paper reports the mean and standard deviation of accuracy averaged over three fine-tuning runs. The ViT model pre-trained on the JFT-300M dataset outperforms the ResNet-based baselines on all datasets while requiring substantially less compute to pre-train, and ViT pre-trained on the smaller public ImageNet-21k dataset also performs well.
The network structure of the Vision Transformer is as follows:
The overall pipeline is similar to a traditional convolutional neural network: feature extraction first, classification second. The difference lies in how the features are extracted. The input image is first split into fixed-size patches, much like a sliding window that takes one block of pixels every fixed stride.
The patches are then flattened and projected into a sequence of embedding vectors, position information is added to every token (together with a learnable class token), the sequence is passed through a stack of Transformer encoder blocks, and the class token is finally used for classification; a minimal sketch of the patch-splitting step is given right below.
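Here is a minimal sketch of that patch-splitting step, assuming the ViT-B/16 configuration used later in this post (224x224 input, 16x16 patches, 768-dimensional embeddings); it does the same job as the `PatchEmbed` class in the training code below, and the variable names are only for illustration:
- import torch
- from torch import nn
-
- # a 16x16 convolution with stride 16 cuts the image into non-overlapping patches
- # and projects each patch to a 768-dimensional embedding in a single step
- patchify = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)
-
- img = torch.randn(1, 3, 224, 224)  # one RGB image
- tokens = patchify(img)  # [1, 768, 14, 14]
- tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768]: a sequence of 196 patch tokens
- print(tokens.shape)  # torch.Size([1, 196, 768])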
Self-Attention and Multi-Head Attention deserve a dedicated post of their own; they are a bit involved and need some preparation, so I will cover them in detail later. A minimal sketch of scaled dot-product attention is shown below to convey the basic idea.
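As a small teaser, here is a minimal sketch of single-head scaled dot-product attention; the `Attention` class in the code further down runs several such heads in parallel on split channels (multi-head attention). The shapes are only illustrative:
- import torch
-
- def scaled_dot_product_attention(q, k, v):
-     # q, k, v: [batch, seq_len, dim]
-     scale = q.shape[-1] ** -0.5  # 1 / sqrt(d)
-     attn = (q @ k.transpose(-2, -1)) * scale  # [batch, seq_len, seq_len] similarity scores
-     attn = attn.softmax(dim=-1)  # normalize the scores over the keys
-     return attn @ v  # weighted sum of the values
-
- x = torch.randn(2, 197, 768)  # e.g. 196 patch tokens + 1 class token
- out = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v = x
- print(out.shape)  # torch.Size([2, 197, 768])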
Comparing the different Vision Transformer variants with ResNet on several datasets, the Vision Transformer clearly outperforms ResNet.
Training code:
- import torch
- import torchvision.models
-
- from matplotlib import pyplot as plt
- from tqdm import tqdm
- from torch import nn
- from torch.utils.data import DataLoader
- from torchvision.transforms import transforms
- from functools import partial
- from collections import OrderedDict
- from typing import Optional, Callable
- data_transform = {
- "train": transforms.Compose([transforms.RandomResizedCrop(224),
- transforms.RandomHorizontalFlip(),
- transforms.ToTensor(),
- transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]),
- "val": transforms.Compose([transforms.Resize((224, 224)), # cannot 224, must (224, 224)
- transforms.ToTensor(),
- transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])}
-
- train_data = torchvision.datasets.ImageFolder(root = "./data/train" , transform = data_transform["train"])
-
- traindata = DataLoader(dataset=train_data, batch_size=32, shuffle=True, num_workers=0)  # draw the training data in batches of 32 images
-
- test_data = torchvision.datasets.ImageFolder(root = "./data/val" , transform = data_transform["val"])
-
- train_size = len(train_data)  # size of the training set
- test_size = len(test_data)  # size of the test set
- print(train_size)  # print the training-set size, i.e. how many images it contains
- print(test_size)  # print the test-set size, i.e. how many images it contains
- testdata = DataLoader(dataset=test_data, batch_size=32, shuffle=True, num_workers=0)  # draw the test data in batches of 32 images
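- # Note: ImageFolder expects one subfolder per class, and the class names are taken from the subfolder names.
- # The layout below is an assumption based on the cat/dog dataset used in this post -- adapt it to your own data:
- #   ./data/train/cat/xxx.jpg
- #   ./data/train/dog/xxx.jpg
- #   ./data/val/cat/xxx.jpg
- #   ./data/val/dog/xxx.jpg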
-
- device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
- print("using {} device.".format(device))
-
-
- def drop_path(x, drop_prob: float = 0., training: bool = False):
- """
- Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
- This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
- the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
- See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
- changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
- 'survival rate' as the argument.
- """
- if drop_prob == 0. or not training:
- return x
- keep_prob = 1 - drop_prob
- shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
- random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
- random_tensor.floor_() # binarize
- output = x.div(keep_prob) * random_tensor
- return output
-
-
- class DropPath(nn.Module):
- """
- Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
- """
- def __init__(self, drop_prob=None):
- super(DropPath, self).__init__()
- self.drop_prob = drop_prob
-
- def forward(self, x):
- return drop_path(x, self.drop_prob, self.training)
-
-
- class PatchEmbed(nn.Module):
- """
- 2D Image to Patch Embedding
- """
- def __init__(self, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None):
- super().__init__()
- img_size = (img_size, img_size)
- patch_size = (patch_size, patch_size)
- self.img_size = img_size
- self.patch_size = patch_size
- self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
- self.num_patches = self.grid_size[0] * self.grid_size[1]
-
- self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
- self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
-
- def forward(self, x):
- B, C, H, W = x.shape
- assert H == self.img_size[0] and W == self.img_size[1], \
- f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
-
- # flatten: [B, C, H, W] -> [B, C, HW]
- # transpose: [B, C, HW] -> [B, HW, C]
- x = self.proj(x).flatten(2).transpose(1, 2)
- x = self.norm(x)
- return x
-
-
- class Attention(nn.Module):
- def __init__(self,
- dim,  # dimension of the input tokens
- num_heads=8,
- qkv_bias=False,
- qk_scale=None,
- attn_drop_ratio=0.,
- proj_drop_ratio=0.):
- super(Attention, self).__init__()
- self.num_heads = num_heads
- head_dim = dim // num_heads
- self.scale = qk_scale or head_dim ** -0.5
- self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
- self.attn_drop = nn.Dropout(attn_drop_ratio)
- self.proj = nn.Linear(dim, dim)
- self.proj_drop = nn.Dropout(proj_drop_ratio)
-
- def forward(self, x):
- # [batch_size, num_patches + 1, total_embed_dim]
- B, N, C = x.shape
-
- # qkv(): -> [batch_size, num_patches + 1, 3 * total_embed_dim]
- # reshape: -> [batch_size, num_patches + 1, 3, num_heads, embed_dim_per_head]
- # permute: -> [3, batch_size, num_heads, num_patches + 1, embed_dim_per_head]
- qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
- # [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
- q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
-
- # transpose: -> [batch_size, num_heads, embed_dim_per_head, num_patches + 1]
- # @: multiply -> [batch_size, num_heads, num_patches + 1, num_patches + 1]
- attn = (q @ k.transpose(-2, -1)) * self.scale
- attn = attn.softmax(dim=-1)
- attn = self.attn_drop(attn)
-
- # @: multiply -> [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
- # transpose: -> [batch_size, num_patches + 1, num_heads, embed_dim_per_head]
- # reshape: -> [batch_size, num_patches + 1, total_embed_dim]
- x = (attn @ v).transpose(1, 2).reshape(B, N, C)
- x = self.proj(x)
- x = self.proj_drop(x)
- return x
-
-
- class Mlp(nn.Module):
- """
- MLP as used in Vision Transformer, MLP-Mixer and related networks
- """
- def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
- super().__init__()
- out_features = out_features or in_features
- hidden_features = hidden_features or in_features
- self.fc1 = nn.Linear(in_features, hidden_features)
- self.act = act_layer()
- self.fc2 = nn.Linear(hidden_features, out_features)
- self.drop = nn.Dropout(drop)
-
- def forward(self, x):
- x = self.fc1(x)
- x = self.act(x)
- x = self.drop(x)
- x = self.fc2(x)
- x = self.drop(x)
- return x
-
-
- class Block(nn.Module):
- def __init__(self,
- dim,
- num_heads,
- mlp_ratio=4.,
- qkv_bias=False,
- qk_scale=None,
- drop_ratio=0.,
- attn_drop_ratio=0.,
- drop_path_ratio=0.,
- act_layer=nn.GELU,
- norm_layer=nn.LayerNorm):
- super(Block, self).__init__()
- self.norm1 = norm_layer(dim)
- self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
- attn_drop_ratio=attn_drop_ratio, proj_drop_ratio=drop_ratio)
- # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
- self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()
- self.norm2 = norm_layer(dim)
- mlp_hidden_dim = int(dim * mlp_ratio)
- self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop_ratio)
-
- def forward(self, x):
- x = x + self.drop_path(self.attn(self.norm1(x)))
- x = x + self.drop_path(self.mlp(self.norm2(x)))
- return x
-
-
- class VisionTransformer(nn.Module):
- def __init__(self, img_size=224, patch_size=16, in_c=3, num_classes=1000,
- embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True,
- qk_scale=None, representation_size=None, distilled=False, drop_ratio=0.,
- attn_drop_ratio=0., drop_path_ratio=0., embed_layer=PatchEmbed, norm_layer=None,
- act_layer=None):
- """
- Args:
- img_size (int, tuple): input image size
- patch_size (int, tuple): patch size
- in_c (int): number of input channels
- num_classes (int): number of classes for classification head
- embed_dim (int): embedding dimension
- depth (int): depth of transformer
- num_heads (int): number of attention heads
- mlp_ratio (int): ratio of mlp hidden dim to embedding dim
- qkv_bias (bool): enable bias for qkv if True
- qk_scale (float): override default qk scale of head_dim ** -0.5 if set
- representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
- distilled (bool): model includes a distillation token and head as in DeiT models
- drop_ratio (float): dropout rate
- attn_drop_ratio (float): attention dropout rate
- drop_path_ratio (float): stochastic depth rate
- embed_layer (nn.Module): patch embedding layer
- norm_layer: (nn.Module): normalization layer
- """
- super(VisionTransformer, self).__init__()
- self.num_classes = num_classes
- self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
- self.num_tokens = 2 if distilled else 1
- norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
- act_layer = act_layer or nn.GELU
-
- self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_c=in_c, embed_dim=embed_dim)
- num_patches = self.patch_embed.num_patches
-
- self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
- self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
- self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
- self.pos_drop = nn.Dropout(p=drop_ratio)
-
- dpr = [x.item() for x in torch.linspace(0, drop_path_ratio, depth)] # stochastic depth decay rule
- self.blocks = nn.Sequential(*[
- Block(dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
- drop_ratio=drop_ratio, attn_drop_ratio=attn_drop_ratio, drop_path_ratio=dpr[i],
- norm_layer=norm_layer, act_layer=act_layer)
- for i in range(depth)
- ])
- self.norm = norm_layer(embed_dim)
-
- # Representation layer
- if representation_size and not distilled:
- self.has_logits = True
- self.num_features = representation_size
- self.pre_logits = nn.Sequential(OrderedDict([
- ("fc", nn.Linear(embed_dim, representation_size)),
- ("act", nn.Tanh())
- ]))
- else:
- self.has_logits = False
- self.pre_logits = nn.Identity()
-
- # Classifier head(s)
- self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
- self.head_dist = None
- if distilled:
- self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()
-
- # Weight init
- nn.init.trunc_normal_(self.pos_embed, std=0.02)
- if self.dist_token is not None:
- nn.init.trunc_normal_(self.dist_token, std=0.02)
-
- nn.init.trunc_normal_(self.cls_token, std=0.02)
- self.apply(_init_vit_weights)
-
- def forward_features(self, x):
- # [B, C, H, W] -> [B, num_patches, embed_dim]
- x = self.patch_embed(x) # [B, 196, 768]
- # [1, 1, 768] -> [B, 1, 768]
- cls_token = self.cls_token.expand(x.shape[0], -1, -1)
- if self.dist_token is None:
- x = torch.cat((cls_token, x), dim=1) # [B, 197, 768]
- else:
- x = torch.cat((cls_token, self.dist_token.expand(x.shape[0], -1, -1), x), dim=1)
-
- x = self.pos_drop(x + self.pos_embed)
- x = self.blocks(x)
- x = self.norm(x)
- if self.dist_token is None:
- return self.pre_logits(x[:, 0])
- else:
- return x[:, 0], x[:, 1]
-
- def forward(self, x):
- x = self.forward_features(x)
- if self.head_dist is not None:
- x, x_dist = self.head(x[0]), self.head_dist(x[1])
- if self.training and not torch.jit.is_scripting():
- # during inference, return the average of both classifier predictions
- return x, x_dist
- else:
- return (x + x_dist) / 2
- else:
- x = self.head(x)
- return x
-
-
- def _init_vit_weights(m):
- """
- ViT weight initialization
- :param m: module
- """
- if isinstance(m, nn.Linear):
- nn.init.trunc_normal_(m.weight, std=.01)
- if m.bias is not None:
- nn.init.zeros_(m.bias)
- elif isinstance(m, nn.Conv2d):
- nn.init.kaiming_normal_(m.weight, mode="fan_out")
- if m.bias is not None:
- nn.init.zeros_(m.bias)
- elif isinstance(m, nn.LayerNorm):
- nn.init.zeros_(m.bias)
- nn.init.ones_(m.weight)
-
-
- def vit_base_patch16_224(num_classes: int = 1000):
- """
- ViT-Base model (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-1k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- link: https://pan.baidu.com/s/1zqb08naP0RPqqfSXfkB2EA  password: eu9f
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=None,
- num_classes=num_classes)
- return model
-
-
- def vit_base_patch16_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Base model (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch16_224_in21k-e5005f0a.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=768 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_base_patch32_224(num_classes: int = 1000):
- """
- ViT-Base model (ViT-B/32) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-1k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- link: https://pan.baidu.com/s/1hCv0U8pQomwAtHBYc4hmZg  password: s5hl
- """
- model = VisionTransformer(img_size=224,
- patch_size=32,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=None,
- num_classes=num_classes)
- return model
-
-
- def vit_base_patch32_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Base model (ViT-B/32) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch32_224_in21k-8db57226.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=32,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=768 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_large_patch16_224(num_classes: int = 1000):
- """
- ViT-Large model (ViT-L/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-1k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- link: https://pan.baidu.com/s/1cxBgZJJ6qUWPSBNcE4TdRQ  password: qqt8
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=1024,
- depth=24,
- num_heads=16,
- representation_size=None,
- num_classes=num_classes)
- return model
-
-
- def vit_large_patch16_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Large model (ViT-L/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_patch16_224_in21k-606da67d.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=1024,
- depth=24,
- num_heads=16,
- representation_size=1024 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_large_patch32_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Large model (ViT-L/32) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_patch32_224_in21k-9046d2e7.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=32,
- embed_dim=1024,
- depth=24,
- num_heads=16,
- representation_size=1024 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_huge_patch14_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Huge model (ViT-H/14) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- NOTE: converted weights not currently available, too large for github release hosting.
- """
- model = VisionTransformer(img_size=224,
- patch_size=14,
- embed_dim=1280,
- depth=32,
- num_heads=16,
- representation_size=1280 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
-
- vision_transformer = vit_base_patch16_224(num_classes = 2)  # build the model; num_classes is the number of classes in your dataset. I use a two-class cat/dog dataset, so it is 2 -- set it to the number of classes in yours
- # To use a different ViT variant, simply call one of the other constructors defined above instead (e.g. vit_base_patch32_224 or vit_large_patch16_224)
- vision_transformer.to(device)
-
- print(vision_transformer.to(device))  # print the model structure
-
- #
- # test1 = torch.ones(32, 3, 224, 224)  # sanity check: feed a dummy batch of shape [32, 3, 224, 224]
- #
- # test1 = vision_transformer(test1.to(device))  # push the dummy batch through the network
- # print(test1.shape)  # inspect the output shape
-
- epoch = 10  # number of training epochs
- learning = 0.0001  # learning rate
- optimizer = torch.optim.Adam(vision_transformer.parameters(), lr=learning)  # Adam optimizer -- worth reading up on how it works if you are writing a paper
- loss = nn.CrossEntropyLoss()  # loss function: cross-entropy
-
- train_loss_all = []  # list storing the training loss of each epoch
- train_accur_all = []  # list storing the training accuracy of each epoch
- test_loss_all = []  # list storing the test loss of each epoch
- test_accur_all = []  # list storing the test accuracy of each epoch
- for i in range(epoch):  # training loop over epochs
- train_loss = 0  # running training loss for this epoch
- train_num = 0.0  # number of training images seen this epoch
- train_accuracy = 0.0  # running count of correctly classified training images
- vision_transformer.train()  # put the model into training mode
- train_bar = tqdm(traindata)  # progress bar, purely for display
- for step, data in enumerate(train_bar):  # iterate over the batches; enumerate yields (batch index, batch data)
- img, target = data  # split the batch into images and labels
- optimizer.zero_grad()  # clear the gradients from the previous step
- outputs = vision_transformer(img.to(device))  # forward pass; outputs are the raw class scores
-
- loss1 = loss(outputs, target.to(device))  # difference between the network outputs and the true labels target -- this is what we usually call the loss
- outputs = torch.argmax(outputs, 1)  # outputs contains one score per class; the index of the largest score is the predicted class
- loss1.backward()  # backpropagation
- optimizer.step()  # update the weights with the Adam optimizer defined above
- train_loss = train_loss + loss1.item()  # accumulate the loss
- accuracy = torch.sum(outputs == target.to(device))  # count how many predictions match the labels, used for the accuracy
- train_accuracy = train_accuracy + accuracy  # accumulate the number of correct predictions
- train_num += img.size(0)  # accumulate the number of images seen
-
- print("epoch:{} , train-Loss:{} , train-accuracy:{}".format(i + 1, train_loss / train_num, #输出训练情况
- train_accuracy / train_num))
- train_loss_all.append(train_loss / train_num)  # store the training loss for plotting later
- train_accur_all.append(train_accuracy.double().item() / train_num)  # store the training accuracy
- test_loss = 0  # same as above, but for the test set
- test_accuracy = 0.0  # test accuracy
- test_num = 0
- vision_transformer.eval()  # put the model into evaluation mode
- with torch.no_grad():  # disable gradient tracking for evaluation; the main difference from training is that there is no backpropagation
- test_bar = tqdm(testdata)
- for data in test_bar:
- img, target = data
-
- outputs = vision_transformer(img.to(device))
- loss2 = loss(outputs, target.to(device)).cpu()
- outputs = torch.argmax(outputs, 1)
- test_loss = test_loss + loss2.item()
- accuracy = torch.sum(outputs == target.to(device))
- test_accuracy = test_accuracy + accuracy
- test_num += img.size(0)
-
- print("test-Loss:{} , test-accuracy:{}".format(test_loss / test_num, test_accuracy / test_num))
- test_loss_all.append(test_loss / test_num)
- test_accur_all.append(test_accuracy.double().item() / test_num)
-
- # Plotting: draw the lists stored above as curves -- training/test loss and training/test accuracy
- plt.figure(figsize=(12, 4))
- plt.subplot(1, 2, 1)
- plt.plot(range(epoch), train_loss_all,
- "ro-", label="Train loss")
- plt.plot(range(epoch), test_loss_all,
- "bs-", label="test loss")
- plt.legend()
- plt.xlabel("epoch")
- plt.ylabel("Loss")
- plt.subplot(1, 2, 2)
- plt.plot(range(epoch), train_accur_all,
- "ro-", label="Train accur")
- plt.plot(range(epoch), test_accur_all,
- "bs-", label="test accur")
- plt.xlabel("epoch")
- plt.ylabel("acc")
- plt.legend()
- plt.show()
-
- torch.save(vision_transformer, "vision_transformer.pth")
- print("模型已保存")
-
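The script above trains the Vision Transformer from scratch. If you would rather start from the ported pre-trained weights referenced in the constructor docstrings, a minimal sketch along the following lines is one option; note that the checkpoint file name below is an assumption (use whatever weight file you downloaded for the architecture you build), and the classification head is dropped because its shape does not match a 2-class head:
- # Hypothetical sketch: initialize from a downloaded ViT-B/16 checkpoint before training.
- # The file name is an assumption; the keys are expected to follow this implementation's naming.
- weights_path = "vit_base_patch16_224_pretrained.pth"
- state_dict = torch.load(weights_path, map_location=device)
- for key in list(state_dict.keys()):
-     # the pretrained classification head (and pre_logits, if present) do not fit our 2-class head
-     if key.startswith("head") or key.startswith("pre_logits"):
-         del state_dict[key]
- missing, unexpected = vision_transformer.load_state_dict(state_dict, strict=False)
- print("missing keys:", missing)  # should only list the freshly initialized head
- print("unexpected keys:", unexpected)
If you use this, run it right after `vision_transformer` is created and moved to the device, before the training loop starts.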

Prediction code:
- import torch
- from PIL import Image
- from torch import nn
- from torchvision.transforms import transforms
- from typing import Callable, List, Optional
- from torch import nn, Tensor
- from torch.nn import functional as F
- image_path = "1.jpg"  # relative path of the image to classify
- trans = transforms.Compose([transforms.Resize((224, 224)),
- transforms.ToTensor(),
- transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])  # resize to the training input size, convert to a tensor and normalize the same way as during training
- image = Image.open(image_path)  # open the image
- # print(image)  # inspect the image object
- image = image.convert("RGB")  # make sure the image is in RGB format
- image = trans(image)  # apply the resize / tensor / normalize transforms defined above
- # print(image)  # inspect the transformed tensor
- image = torch.unsqueeze(image, dim=0)  # add a batch dimension
-
- classes = ["cat", "dog"]  # class names; I use the cat/dog dataset, so these two -- replace them with your own classes
-
- def drop_path(x, drop_prob: float = 0., training: bool = False):
- """
- Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
- This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
- the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
- See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
- changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
- 'survival rate' as the argument.
- """
- if drop_prob == 0. or not training:
- return x
- keep_prob = 1 - drop_prob
- shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
- random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
- random_tensor.floor_() # binarize
- output = x.div(keep_prob) * random_tensor
- return output
-
-
- class DropPath(nn.Module):
- """
- Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
- """
- def __init__(self, drop_prob=None):
- super(DropPath, self).__init__()
- self.drop_prob = drop_prob
-
- def forward(self, x):
- return drop_path(x, self.drop_prob, self.training)
-
-
- class PatchEmbed(nn.Module):
- """
- 2D Image to Patch Embedding
- """
- def __init__(self, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None):
- super().__init__()
- img_size = (img_size, img_size)
- patch_size = (patch_size, patch_size)
- self.img_size = img_size
- self.patch_size = patch_size
- self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
- self.num_patches = self.grid_size[0] * self.grid_size[1]
-
- self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
- self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
-
- def forward(self, x):
- B, C, H, W = x.shape
- assert H == self.img_size[0] and W == self.img_size[1], \
- f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
-
- # flatten: [B, C, H, W] -> [B, C, HW]
- # transpose: [B, C, HW] -> [B, HW, C]
- x = self.proj(x).flatten(2).transpose(1, 2)
- x = self.norm(x)
- return x
-
-
- class Attention(nn.Module):
- def __init__(self,
- dim,  # dimension of the input tokens
- num_heads=8,
- qkv_bias=False,
- qk_scale=None,
- attn_drop_ratio=0.,
- proj_drop_ratio=0.):
- super(Attention, self).__init__()
- self.num_heads = num_heads
- head_dim = dim // num_heads
- self.scale = qk_scale or head_dim ** -0.5
- self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
- self.attn_drop = nn.Dropout(attn_drop_ratio)
- self.proj = nn.Linear(dim, dim)
- self.proj_drop = nn.Dropout(proj_drop_ratio)
-
- def forward(self, x):
- # [batch_size, num_patches + 1, total_embed_dim]
- B, N, C = x.shape
-
- # qkv(): -> [batch_size, num_patches + 1, 3 * total_embed_dim]
- # reshape: -> [batch_size, num_patches + 1, 3, num_heads, embed_dim_per_head]
- # permute: -> [3, batch_size, num_heads, num_patches + 1, embed_dim_per_head]
- qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
- # [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
- q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
-
- # transpose: -> [batch_size, num_heads, embed_dim_per_head, num_patches + 1]
- # @: multiply -> [batch_size, num_heads, num_patches + 1, num_patches + 1]
- attn = (q @ k.transpose(-2, -1)) * self.scale
- attn = attn.softmax(dim=-1)
- attn = self.attn_drop(attn)
-
- # @: multiply -> [batch_size, num_heads, num_patches + 1, embed_dim_per_head]
- # transpose: -> [batch_size, num_patches + 1, num_heads, embed_dim_per_head]
- # reshape: -> [batch_size, num_patches + 1, total_embed_dim]
- x = (attn @ v).transpose(1, 2).reshape(B, N, C)
- x = self.proj(x)
- x = self.proj_drop(x)
- return x
-
-
- class Mlp(nn.Module):
- """
- MLP as used in Vision Transformer, MLP-Mixer and related networks
- """
- def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
- super().__init__()
- out_features = out_features or in_features
- hidden_features = hidden_features or in_features
- self.fc1 = nn.Linear(in_features, hidden_features)
- self.act = act_layer()
- self.fc2 = nn.Linear(hidden_features, out_features)
- self.drop = nn.Dropout(drop)
-
- def forward(self, x):
- x = self.fc1(x)
- x = self.act(x)
- x = self.drop(x)
- x = self.fc2(x)
- x = self.drop(x)
- return x
-
-
- class Block(nn.Module):
- def __init__(self,
- dim,
- num_heads,
- mlp_ratio=4.,
- qkv_bias=False,
- qk_scale=None,
- drop_ratio=0.,
- attn_drop_ratio=0.,
- drop_path_ratio=0.,
- act_layer=nn.GELU,
- norm_layer=nn.LayerNorm):
- super(Block, self).__init__()
- self.norm1 = norm_layer(dim)
- self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
- attn_drop_ratio=attn_drop_ratio, proj_drop_ratio=drop_ratio)
- # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
- self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()
- self.norm2 = norm_layer(dim)
- mlp_hidden_dim = int(dim * mlp_ratio)
- self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop_ratio)
-
- def forward(self, x):
- x = x + self.drop_path(self.attn(self.norm1(x)))
- x = x + self.drop_path(self.mlp(self.norm2(x)))
- return x
-
-
- class VisionTransformer(nn.Module):
- def __init__(self, img_size=224, patch_size=16, in_c=3, num_classes=1000,
- embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True,
- qk_scale=None, representation_size=None, distilled=False, drop_ratio=0.,
- attn_drop_ratio=0., drop_path_ratio=0., embed_layer=PatchEmbed, norm_layer=None,
- act_layer=None):
- """
- Args:
- img_size (int, tuple): input image size
- patch_size (int, tuple): patch size
- in_c (int): number of input channels
- num_classes (int): number of classes for classification head
- embed_dim (int): embedding dimension
- depth (int): depth of transformer
- num_heads (int): number of attention heads
- mlp_ratio (int): ratio of mlp hidden dim to embedding dim
- qkv_bias (bool): enable bias for qkv if True
- qk_scale (float): override default qk scale of head_dim ** -0.5 if set
- representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
- distilled (bool): model includes a distillation token and head as in DeiT models
- drop_ratio (float): dropout rate
- attn_drop_ratio (float): attention dropout rate
- drop_path_ratio (float): stochastic depth rate
- embed_layer (nn.Module): patch embedding layer
- norm_layer: (nn.Module): normalization layer
- """
- super(VisionTransformer, self).__init__()
- self.num_classes = num_classes
- self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
- self.num_tokens = 2 if distilled else 1
- norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
- act_layer = act_layer or nn.GELU
-
- self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_c=in_c, embed_dim=embed_dim)
- num_patches = self.patch_embed.num_patches
-
- self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
- self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
- self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
- self.pos_drop = nn.Dropout(p=drop_ratio)
-
- dpr = [x.item() for x in torch.linspace(0, drop_path_ratio, depth)] # stochastic depth decay rule
- self.blocks = nn.Sequential(*[
- Block(dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
- drop_ratio=drop_ratio, attn_drop_ratio=attn_drop_ratio, drop_path_ratio=dpr[i],
- norm_layer=norm_layer, act_layer=act_layer)
- for i in range(depth)
- ])
- self.norm = norm_layer(embed_dim)
-
- # Representation layer
- if representation_size and not distilled:
- self.has_logits = True
- self.num_features = representation_size
- self.pre_logits = nn.Sequential(OrderedDict([
- ("fc", nn.Linear(embed_dim, representation_size)),
- ("act", nn.Tanh())
- ]))
- else:
- self.has_logits = False
- self.pre_logits = nn.Identity()
-
- # Classifier head(s)
- self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
- self.head_dist = None
- if distilled:
- self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()
-
- # Weight init
- nn.init.trunc_normal_(self.pos_embed, std=0.02)
- if self.dist_token is not None:
- nn.init.trunc_normal_(self.dist_token, std=0.02)
-
- nn.init.trunc_normal_(self.cls_token, std=0.02)
- self.apply(_init_vit_weights)
-
- def forward_features(self, x):
- # [B, C, H, W] -> [B, num_patches, embed_dim]
- x = self.patch_embed(x) # [B, 196, 768]
- # [1, 1, 768] -> [B, 1, 768]
- cls_token = self.cls_token.expand(x.shape[0], -1, -1)
- if self.dist_token is None:
- x = torch.cat((cls_token, x), dim=1) # [B, 197, 768]
- else:
- x = torch.cat((cls_token, self.dist_token.expand(x.shape[0], -1, -1), x), dim=1)
-
- x = self.pos_drop(x + self.pos_embed)
- x = self.blocks(x)
- x = self.norm(x)
- if self.dist_token is None:
- return self.pre_logits(x[:, 0])
- else:
- return x[:, 0], x[:, 1]
-
- def forward(self, x):
- x = self.forward_features(x)
- if self.head_dist is not None:
- x, x_dist = self.head(x[0]), self.head_dist(x[1])
- if self.training and not torch.jit.is_scripting():
- # during inference, return the average of both classifier predictions
- return x, x_dist
- else:
- return (x + x_dist) / 2
- else:
- x = self.head(x)
- return x
-
-
- def _init_vit_weights(m):
- """
- ViT weight initialization
- :param m: module
- """
- if isinstance(m, nn.Linear):
- nn.init.trunc_normal_(m.weight, std=.01)
- if m.bias is not None:
- nn.init.zeros_(m.bias)
- elif isinstance(m, nn.Conv2d):
- nn.init.kaiming_normal_(m.weight, mode="fan_out")
- if m.bias is not None:
- nn.init.zeros_(m.bias)
- elif isinstance(m, nn.LayerNorm):
- nn.init.zeros_(m.bias)
- nn.init.ones_(m.weight)
-
-
- def vit_base_patch16_224(num_classes: int = 1000):
- """
- ViT-Base model (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-1k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- link: https://pan.baidu.com/s/1zqb08naP0RPqqfSXfkB2EA  password: eu9f
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=None,
- num_classes=num_classes)
- return model
-
-
- def vit_base_patch16_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Base model (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch16_224_in21k-e5005f0a.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=768 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_base_patch32_224(num_classes: int = 1000):
- """
- ViT-Base model (ViT-B/32) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-1k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- link: https://pan.baidu.com/s/1hCv0U8pQomwAtHBYc4hmZg  password: s5hl
- """
- model = VisionTransformer(img_size=224,
- patch_size=32,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=None,
- num_classes=num_classes)
- return model
-
-
- def vit_base_patch32_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Base model (ViT-B/32) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch32_224_in21k-8db57226.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=32,
- embed_dim=768,
- depth=12,
- num_heads=12,
- representation_size=768 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_large_patch16_224(num_classes: int = 1000):
- """
- ViT-Large model (ViT-L/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-1k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- link: https://pan.baidu.com/s/1cxBgZJJ6qUWPSBNcE4TdRQ  password: qqt8
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=1024,
- depth=24,
- num_heads=16,
- representation_size=None,
- num_classes=num_classes)
- return model
-
-
- def vit_large_patch16_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Large model (ViT-L/16) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_patch16_224_in21k-606da67d.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=16,
- embed_dim=1024,
- depth=24,
- num_heads=16,
- representation_size=1024 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_large_patch32_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Large model (ViT-L/32) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- weights ported from official Google JAX impl:
- https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_patch32_224_in21k-9046d2e7.pth
- """
- model = VisionTransformer(img_size=224,
- patch_size=32,
- embed_dim=1024,
- depth=24,
- num_heads=16,
- representation_size=1024 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
- def vit_huge_patch14_224_in21k(num_classes: int = 21843, has_logits: bool = True):
- """
- ViT-Huge model (ViT-H/14) from original paper (https://arxiv.org/abs/2010.11929).
- ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
- NOTE: converted weights not currently available, too large for github release hosting.
- """
- model = VisionTransformer(img_size=224,
- patch_size=14,
- embed_dim=1280,
- depth=32,
- num_heads=16,
- representation_size=1280 if has_logits else None,
- num_classes=num_classes)
- return model
-
-
-
- # Everything above is the network definition; the prediction script needs the class definitions as well so that the saved model can be loaded
- device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # use the GPU if one is available
- print("using {} device.".format(device))
-
-
- model = torch.load("vision_transformer.pth", map_location=device)  # load the saved model onto the current device
- model.eval()  # switch the model to evaluation mode
- with torch.no_grad():  # disable gradient tracking; we are only doing inference
- outputs = model(image.to(device))  # push the image through the network
- # print(model)  # print the model structure
- # print(outputs)  # print the raw output scores
- ans = (outputs.argmax(1)).item()  # the index of the largest score is the predicted class;
- # look up that index in the class list to get the class name
- print("The predicted class of this image is:", classes[ans])
- # print(classes[ans])
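If you also want a confidence value rather than only the class label, the raw scores can be converted into probabilities with softmax. A small optional addition that reuses the `outputs` tensor computed above:
- probs = torch.softmax(outputs, dim=1)  # convert the raw scores into probabilities, shape [1, num_classes]
- conf = probs[0, ans].item()  # probability assigned to the predicted class
- print("confidence: {:.2%}".format(conf))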

The network-building part of the code is commented in detail; if you have any questions, feel free to raise them in the comments section -- thanks!
If you are not sure how to use this code, you can look at my earlier open-source post, which explains the workflow in more detail: 手撕Resnet卷积神经网络-pytorch-详细注释版 (a ResNet implementation with detailed comments that you can run directly by dropping in your own dataset; both training and prediction code are provided). If the code does not run for you, point it out in the comments and I will reply when I see it.
The dataset used in this post is the cat/dog dataset, already split into training and test sets; the download link is given below.
Link: https://pan.baidu.com/s/1_gUznMQnzI0UhvsV7wPgzw
Extraction code: 3ixd
Vision Transformer code download link: https://pan.baidu.com/s/11ViyEvr8Wcj-ubfGlz_LUg
Extraction code: eg43