Paper: 《Learning Transferable Visual Models From Natural Language Supervision》
Code: https://github.com/openai/CLIP
Motivation
The authors' motivation comes from NLP, where pre-training large models on web-scale data with task-agnostic objectives has been revolutionary (e.g., GPT-3). They want to bring this recipe to other domains such as vision. During pre-training CLIP uses contrastive learning, and at test time it uses text prompts for zero-shot transfer. Backed by both a large-scale dataset and large models, CLIP can compete with task-specific, fully supervised models, and there is still considerable room for improvement.
CLIP Overview
CLIP stands for Contrastive Language-Image Pre-training.
The core idea of CLIP is to learn perception from natural-language supervision.
Compared with other training approaches, learning from natural language has several advantages: it is far easier to scale, because the supervision is free text scraped from the web rather than manually curated gold labels, and the learned representation is connected to language, which enables flexible zero-shot transfer to downstream tasks.
(1) Pre-training
CLIP is pre-trained on 400 million (image, text) pairs collected from publicly available sources on the internet.
The CLIP model has two encoders, one per modality:
Text Encoder (a Transformer), which produces the text features text_embedding, of shape [N, embedding_size].
Image Encoder (a ResNet or a Vision Transformer), which produces the visual features visual_embedding, of shape [N, embedding_size].
Because representations from different modalities may have a gap and cannot be compared directly, both are first mapped into the same joint multimodal space; this makes subsequent operations such as similarity computation possible.
CLIP then performs contrastive learning over these text-image pairs. Only the pairs on the diagonal (the blue cells in the figure: $I_1 T_1, I_2 T_2, I_3 T_3, \ldots, I_N T_N$) are matched and serve as positive samples ($N$ of them); all remaining pairs are negatives ($N^2 - N$ of them).
Once positives and negatives are defined, the model can be trained contrastively without any manual annotation, i.e., in an unsupervised fashion.
We take the inner product of visual_embedding and text_embedding to obtain the cosine-similarity matrix between image and text vectors, of size $N \times N$; the more similar an image embedding is to its corresponding text embedding, the larger their inner product.
Training then uses a cross-entropy loss that maps the image and text embeddings from the same pair to nearby positions and embeddings from different pairs to distant positions, so the model learns features shared between images and text.
(2) Inference
At inference time, a prompt template is applied first: the N class names (e.g. "plane", "car", "dog", ..., "bird" in the figure) are substituted for "{object}" in the template "A photo of a {object}", so the N classes yield N sentences. These N sentences are passed through the pre-trained Text Encoder to obtain N text features ($T_1, T_2, T_3, \ldots, T_N$). The test image is then encoded by the Image Encoder, and the class whose text feature is most similar to the image feature is taken as the prediction.
Why use a contrastive objective?
For any given image there are many possible textual descriptions, and these descriptions can differ enormously. If a predictive task (predicting the exact text) were used for pre-training, the space of possible outputs would be huge and training would be very slow.
If the training task is instead turned into a contrastive one, where the model only has to judge whether an image and a text form a matching pair, the task becomes much easier and the constraint much looser. In the paper, simply swapping the predictive objective for a contrastive one improves training efficiency by a factor of 4.
Experimental results
Because CLIP learns textual semantics rather than one-hot, single-class labels, it transfers much better. CLIP not only performs well on the standard ImageNet benchmark; on distribution-shifted sets such as ImageNet Sketch (sketches) and ImageNet-R (renditions such as cartoons), its transfer ability is far better than that of ResNet101, as shown below:
Zero-Shot CLIP means transferring directly to another dataset for evaluation without any fine-tuning.
Linear Probe CLIP means freezing the pre-trained weights, using the model purely as a feature extractor, and training only a final fully connected classification head.
The figure below shows that Zero-Shot CLIP already surpasses the other supervised networks, and that Linear Probe CLIP achieves the best performance in the few-shot setting.
CLIP code walkthrough
The overall training procedure is summarized by the numpy-style pseudocode in Figure 3 of the paper.
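A paraphrased sketch of that pseudocode is reproduced below; l2_normalize and cross_entropy_loss are pseudocode helpers rather than functions from a specific library.

# image_encoder - ResNet or Vision Transformer
# text_encoder  - text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned projection of image features into the joint embedding
# W_t[d_t, d_e] - learned projection of text features into the joint embedding
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric cross-entropy loss; the diagonal pairs are the positives
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2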
CLIP uses a Transformer to encode the text.
Transformer
The Transformer module simply passes the input text embeddings through a series of layers stacked ResidualAttentionBlock modules:
class Transformer(nn.Module):
    def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.width = width
        self.layers = layers
        # `layers` ResidualAttentionBlock modules connected in series
        self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])

    def forward(self, x: torch.Tensor):
        return self.resblocks(x)
ResidualAttentionBlock
The code below implements a standard Transformer encoder block.
For background on the Transformer, see the two companion posts on the attention mechanism / Transformer and on a code-level walkthrough of the Transformer.
A Transformer block consists mainly of multi-head self-attention (Multi-Head Attention), layer normalization (LayerNorm), and a multi-layer perceptron (MLP).
class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head)  # multi-head attention
        self.ln_1 = LayerNorm(d_model)  # layer normalization
        self.mlp = nn.Sequential(OrderedDict([  # feed-forward network
            ("c_fc", nn.Linear(d_model, d_model * 4)),   # first linear layer expands the dimension 4x
            ("gelu", QuickGELU()),                       # fast approximation of the GELU activation
            ("c_proj", nn.Linear(d_model * 4, d_model))  # second linear layer (c_proj) projects back to d_model
        ]))  # this widening increases the representational capacity of the block
        self.ln_2 = LayerNorm(d_model)  # layer normalization
        self.attn_mask = attn_mask      # mask used inside the attention

    def attention(self, x: torch.Tensor):
        self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]

    def forward(self, x: torch.Tensor):
        x = x + self.attention(self.ln_1(x))  # LayerNorm -> multi-head self-attention -> residual connection
        x = x + self.mlp(self.ln_2(x))        # LayerNorm -> feed-forward -> residual connection
        return x
QuickGELU is a fast approximation of the GELU activation:
class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor):
        return x * torch.sigmoid(1.702 * x)
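As a quick shape check for the text Transformer stack, here is a minimal usage sketch built on the classes above (the width, depth, and head count are illustrative values, not CLIP's released configuration). Note that nn.MultiheadAttention defaults to a sequence-first layout, so inputs are (seq_len, batch, width):

import torch

blocks = Transformer(width=512, layers=6, heads=8)
x = torch.randn(77, 4, 512)  # (context_length, batch, width), sequence-first
y = blocks(x)                # shape is preserved: (77, 4, 512)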
In CLIP there are two choices for the image encoder: a Vision Transformer or a ResNet.
VisionTransformer
Vision Transformer (ViT, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale")
The core of the Vision Transformer is still the Transformer introduced above; the difference lies in the input, where the image is split into patches, each patch is projected by a linear layer, and a positional embedding and a class embedding are added.
class VisionTransformer(nn.Module):
    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        # conv1 splits the input image into patches; the kernel size and stride are both patch_size
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)

        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))  # class embedding
        self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))  # positional embedding
        self.ln_pre = LayerNorm(width)  # layer normalization
        self.transformer = Transformer(width, layers, heads)  # Transformer blocks
        self.ln_post = LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor):  # x: (b, 3, h, w)
        # split the image into patches
        x = self.conv1(x)  # shape = [b, width, grid, grid], where grid = h / patch_size
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [b, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [b, grid ** 2, width]
        # prepend the class token
        x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [b, grid ** 2 + 1, width]
        # add the positional embedding
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x)  # LayerNorm
        x = x.permute(1, 0, 2)  # NLD -> LND: [b, grid ** 2 + 1, width] -> [grid ** 2 + 1, b, width]
        x = self.transformer(x)  # multi-head Transformer, [grid ** 2 + 1, b, width]
        x = x.permute(1, 0, 2)  # LND -> NLD: [grid ** 2 + 1, b, width] -> [b, grid ** 2 + 1, width]
        # take the class token
        x = self.ln_post(x[:, 0, :])  # [b, width]
        if self.proj is not None:
            x = x @ self.proj  # [b, output_dim]
        return x  # [b, output_dim]
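A similar shape check for the ViT image encoder; the configuration below roughly mirrors a ViT-B/32-style setup and is an assumption, not values read from a released checkpoint:

vit = VisionTransformer(input_resolution=224, patch_size=32, width=768, layers=12, heads=12, output_dim=512)
imgs = torch.randn(4, 3, 224, 224)  # (batch, channels, height, width)
feats = vit(imgs)                   # (4, 512): one embedding per image; 224/32 = 7, so 7*7 + 1 = 50 tokens internally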
ModifiedResNet
ModifiedResNet is the other implementation of the image encoder.
It is a class similar to torchvision's ResNet but, as its docstring in the repository explains, it contains the following changes:
(1) there are 3 "stem" convolutions instead of 1, with an average pool instead of a max pool;
(2) anti-aliased strided convolutions: an average pool is prepended to every convolution with stride > 1;
(3) the final pooling layer is a QKV attention pooling (AttentionPool2d) instead of an average pool.
class ModifiedResNet(nn.Module):
    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
        super().__init__()
        self.output_dim = output_dim
        self.input_resolution = input_resolution

        # the 3-layer stem
        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)  # (b,3,h,w) -> (b,width/2,h/2,w/2)
        self.bn1 = nn.BatchNorm2d(width // 2)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)  # (b,width/2,h/2,w/2) -> (b,width/2,h/2,w/2)
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)  # (b,width/2,h/2,w/2) -> (b,width,h/2,w/2)
        self.bn3 = nn.BatchNorm2d(width)
        self.relu3 = nn.ReLU(inplace=True)
        self.avgpool = nn.AvgPool2d(2)

        # residual layers
        self._inplanes = width  # this is a *mutable* variable used during construction
        self.layer1 = self._make_layer(width, layers[0])                # layers[0] Bottleneck blocks
        self.layer2 = self._make_layer(width * 2, layers[1], stride=2)  # layers[1] Bottleneck blocks
        self.layer3 = self._make_layer(width * 4, layers[2], stride=2)  # layers[2] Bottleneck blocks
        self.layer4 = self._make_layer(width * 8, layers[3], stride=2)  # layers[3] Bottleneck blocks

        embed_dim = width * 32  # the ResNet feature dimension
        self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)

    def _make_layer(self, planes, blocks, stride=1):
        # `blocks` Bottleneck modules connected in series
        layers = [Bottleneck(self._inplanes, planes, stride)]
        self._inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self._inplanes, planes))
        return nn.Sequential(*layers)

    def forward(self, x):  # x: (b,3,h,w)
        def stem(x):
            x = self.relu1(self.bn1(self.conv1(x)))
            x = self.relu2(self.bn2(self.conv2(x)))
            x = self.relu3(self.bn3(self.conv3(x)))
            x = self.avgpool(x)
            return x

        x = x.type(self.conv1.weight.dtype)  # cast x to the dtype of the weights
        x = stem(x)  # (b,width,h/4,w/4)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.attnpool(x)  # AttentionPool2d
        return x
Bottleneck
layer1 through layer4 of ModifiedResNet are built from Bottleneck blocks:
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1):
        super().__init__()
        # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
        self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu1 = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu2 = nn.ReLU(inplace=True)

        self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu3 = nn.ReLU(inplace=True)

        self.downsample = None
        self.stride = stride

        if stride > 1 or inplanes != planes * Bottleneck.expansion:
            # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
            self.downsample = nn.Sequential(OrderedDict([
                ("-1", nn.AvgPool2d(stride)),
                ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
                ("1", nn.BatchNorm2d(planes * self.expansion))
            ]))

    def forward(self, x: torch.Tensor):
        identity = x

        # 1x1 conv -> BatchNorm2d -> ReLU
        out = self.relu1(self.bn1(self.conv1(x)))    # (b,inplanes,h,w) -> (b,planes,h,w)
        # 3x3 conv -> BatchNorm2d -> ReLU
        out = self.relu2(self.bn2(self.conv2(out)))  # (b,planes,h,w) -> (b,planes,h,w)
        out = self.avgpool(out)                      # 2D average pooling (identity when stride == 1)
        out = self.bn3(self.conv3(out))              # (b,planes,h,w) -> (b,planes*expansion,h,w)

        if self.downsample is not None:
            identity = self.downsample(x)            # downsample the shortcut branch

        out += identity                              # residual connection
        out = self.relu3(out)
        return out                                   # (b,planes*expansion,h,w)
AttentionPool2d
The last layer of ModifiedResNet is an AttentionPool2d.
class AttentionPool2d(nn.Module):
    def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
        self.num_heads = num_heads

    def forward(self, x):  # (b,c,h,w)
        x = x.flatten(start_dim=2).permute(2, 0, 1)  # (b,c,h*w) -> (h*w,b,c)
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (h*w+1,b,c)
        x = x + self.positional_embedding[:, None, :].to(x.dtype)  # add the positional embedding, (h*w+1,b,c)
        x, _ = F.multi_head_attention_forward(  # multi-head attention; the query is the mean token only
            query=x[:1], key=x, value=x,
            embed_dim_to_check=x.shape[-1],
            num_heads=self.num_heads,
            q_proj_weight=self.q_proj.weight,
            k_proj_weight=self.k_proj.weight,
            v_proj_weight=self.v_proj.weight,
            in_proj_weight=None,
            in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
            bias_k=None,
            bias_v=None,
            add_zero_attn=False,
            dropout_p=0,
            out_proj_weight=self.c_proj.weight,
            out_proj_bias=self.c_proj.bias,
            use_separate_proj_weight=True,
            training=self.training,
            need_weights=False
        )
        return x.squeeze(0)
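Putting Bottleneck, AttentionPool2d, and ModifiedResNet together, here is a minimal shape check; the layer counts below follow a ResNet-50-style configuration and are an assumption, not values read from a released checkpoint:

resnet = ModifiedResNet(layers=(3, 4, 6, 3), output_dim=1024, heads=32, input_resolution=224, width=64)
imgs = torch.randn(2, 3, 224, 224)  # (batch, channels, height, width)
feats = resnet(imgs)                # (2, 1024): the attention-pooled embedding for each image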
The core of the CLIP implementation (https://github.com/openai/CLIP) is the CLIP class defined in the clip/model.py file.
__init__
The constructor:
def __init__(self,
             embed_dim: int,
             # vision
             image_resolution: int,
             vision_layers: Union[Tuple[int, int, int, int], int],
             vision_width: int,
             vision_patch_size: int,
             # text
             context_length: int,
             vocab_size: int,
             transformer_width: int,
             transformer_heads: int,
             transformer_layers: int
             ):
    super().__init__()

    self.context_length = context_length

    # two possible image encoders:
    # when vision_layers is a tuple or list, use the ResNet implementation
    if isinstance(vision_layers, (tuple, list)):
        vision_heads = vision_width * 32 // 64
        self.visual = ModifiedResNet(
            layers=vision_layers,
            output_dim=embed_dim,
            heads=vision_heads,
            input_resolution=image_resolution,
            width=vision_width
        )
    else:
        # otherwise encode images with a Vision Transformer
        vision_heads = vision_width // 64
        self.visual = VisionTransformer(
            input_resolution=image_resolution,
            patch_size=vision_patch_size,
            width=vision_width,
            layers=vision_layers,
            heads=vision_heads,
            output_dim=embed_dim
        )

    # the text encoder is a Transformer
    self.transformer = Transformer(
        width=transformer_width,
        layers=transformer_layers,
        heads=transformer_heads,
        attn_mask=self.build_attention_mask()
    )

    self.vocab_size = vocab_size
    self.token_embedding = nn.Embedding(vocab_size, transformer_width)  # vocab_size is the vocabulary size; transformer_width is the dimension each token is embedded into
    self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
    self.ln_final = LayerNorm(transformer_width)

    self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
    self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    self.initialize_parameters()
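The text Transformer is constructed with attn_mask=self.build_attention_mask(), which is not reproduced above; in the repository it builds a causal, additive mask that looks roughly like this:

def build_attention_mask(self):
    # PyTorch uses an additive attention mask: -inf above the diagonal, 0 elsewhere,
    # so each text token attends only to itself and to earlier tokens
    mask = torch.empty(self.context_length, self.context_length)
    mask.fill_(float("-inf"))
    mask.triu_(1)  # zero out the lower diagonal
    return mask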
encode_image
The image encoder: encode_image calls self.visual to encode the image.
def encode_image(self, image):
    # cast the image to the encoder's dtype, then run the image encoder
    return self.visual(image.type(self.dtype))
self.dtype is implemented as follows; it returns the dtype of the conv1 weights of the image encoder.
@property
def dtype(self):
    return self.visual.conv1.weight.dtype
encode_text
The text encoder:
def encode_text(self, text):
    # each tokenized sequence starts with <|startoftext|> and ends with <|endoftext|>
    x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
    x = x + self.positional_embedding.type(self.dtype)  # add the positional embedding
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD, [batch_size, n_ctx, d_model]
    x = self.ln_final(x).type(self.dtype)  # LayerNorm

    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    return x
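A minimal usage sketch of the text path through the public API; the 512-dimensional output assumes the ViT-B/32 checkpoint, and other checkpoints use different embedding sizes:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)  # LongTensor of shape (2, 77)
with torch.no_grad():
    text_features = model.encode_text(tokens)  # shape (2, 512) for ViT-B/32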
The forward function
The forward pass of the CLIP model: encode the image and the text, normalize both sets of features, and then compute similarity scores between the normalized features.
def forward(self, image, text):
    image_features = self.encode_image(image)  # encode the image features
    text_features = self.encode_text(text)     # encode the text features

    # normalize the features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()  # learnable temperature
    logits_per_image = logit_scale * image_features @ text_features.t()  # similarity of each image to each text
    logits_per_text = logits_per_image.t()                               # similarity of each text to each image

    # shape = [global_batch_size, global_batch_size]
    return logits_per_image, logits_per_text
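The repository does not ship the training loop, but given the logits returned by forward, the symmetric cross-entropy loss described in the paper can be sketched as follows; images and texts are hypothetical aligned batches where the i-th image matches the i-th text:

# images: (N, 3, H, W) preprocessed image batch; texts: (N, 77) tokenized text batch
logits_per_image, logits_per_text = model(images, texts)  # both of shape (N, N)
labels = torch.arange(logits_per_image.shape[0], device=logits_per_image.device)  # diagonal pairs are positives
loss_i = torch.nn.functional.cross_entropy(logits_per_image, labels)  # image -> text direction
loss_t = torch.nn.functional.cross_entropy(logits_per_text, labels)   # text -> image direction
loss = (loss_i + loss_t) / 2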
API
The clip package exposes the following API:

clip.available_models(): returns the names of the available CLIP models.
clip.load(name, device=..., jit=False): returns the model and the TorchVision transform (preprocess) that the model expects.
clip.tokenize(text: Union[str, List[str]], context_length=77): returns a LongTensor containing the tokenized sequences of the input texts.

The model returned by clip.load() supports the following methods:

model.encode_image(image: Tensor): given a batch of images, returns the encoded image features.
model.encode_text(text: Tensor): given a batch of text tokens, returns the text features encoded by CLIP.
model(image: Tensor, text: Tensor): given a batch of images and a batch of text tokens, returns two tensors containing the logit scores corresponding to each image and text input; the values are the cosine similarities between the corresponding image and text features, times 100.

Environment setup
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
# Option 1: install directly from GitHub
pip install git+https://github.com/openai/CLIP.git
# Option 2: download the CLIP source code from GitHub, unpack it, and install it from the local folder
cd CLIP-main
pip install -v -e .
Inference test
Compute the similarity scores between one image and several texts:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # load the model

image = preprocess(Image.open("../CLIP.png")).unsqueeze(0).to(device)  # preprocess the image
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)       # tokenize the texts

with torch.no_grad():
    image_features = model.encode_image(image)  # encode the image features
    text_features = model.encode_text(text)     # encode the text features

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("logits_per_image: ", logits_per_image)
print("logits_per_text:", logits_per_text)
print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]
Zero-shot prediction
Predict the class of a single image:
import os
import clip
import torch
from torchvision.datasets import CIFAR100

# load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# compute image and text features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# normalize the features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# compute the cosine similarity and pick the 5 most similar labels
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
Linear-probe evaluation
Evaluate on many images by training a logistic regression classifier on frozen CLIP features:
import os
import clip
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# load the train and test datasets
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)

def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))  # encode the image features
            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# encode the images of the train and test sets
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# training: fit a logistic regression classifier on the frozen features
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# evaluate the classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")  # overall classification accuracy
The following walkthrough is based on the official interactive notebook: https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb
(1) Environment setup
Install the required packages and CLIP:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git
Check the installed torch version:
import numpy as np
import torch
from pkg_resources import packaging
print("Torch version:", torch.__version__)
(2) Load the model
List the pre-trained CLIP models that are available:
import clip
clip.available_models()
Load the CLIP model and print its basic configuration:
model, preprocess = clip.load("ViT-B/32") # 加载模型
model.cuda().eval() # 验证模式
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size
print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}") # 模型的参数量
print("Input resolution:", input_resolution)# 输入图像分辨率大小
print("Context length:", context_length)# 文本长度
print("Vocab size:", vocab_size)# 词汇表大小
(3) Image preprocessing
The preprocessing pipeline resizes the input to 224x224 and applies CenterCrop and Normalization; printing preprocess shows the exact transforms.
preprocess
(4) Text preprocessing
Text is tokenized with a case-insensitive tokenizer, invoked through clip.tokenize(). By default the output is padded to a length of 77 tokens.
clip.tokenize("Hello World!")
(5) Set up the input images and texts
We feed the model 8 example images together with their textual descriptions and compare the similarity between the corresponding features.
Since the tokenizer is case-insensitive, we are free to provide any suitable textual descriptions.
import os
import skimage
import IPython.display
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

from collections import OrderedDict
import torch

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# images in skimage to use and their textual descriptions
descriptions = {
    "page": "a page of text about segmentation",
    "chelsea": "a facial photo of a tabby cat",
    "astronaut": "a portrait of an astronaut with the American flag",
    "rocket": "a rocket standing on a launchpad",
    "motorcycle_right": "a red motorcycle standing in a garage",
    "camera": "a person looking at a camera on a tripod",
    "horse": "a black-and-white silhouette of a horse",
    "coffee": "a cup of coffee on a saucer"
}
The code below displays the test images together with their corresponding text descriptions:
original_images = []
images = []
texts = []
plt.figure(figsize=(16, 5))

for filename in [filename for filename in os.listdir(skimage.data_dir) if filename.endswith(".png") or filename.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue

    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")

    plt.subplot(2, 4, len(images) + 1)
    plt.imshow(image)
    plt.title(f"{filename}\n{descriptions[name]}")
    plt.xticks([])
    plt.yticks([])

    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

plt.tight_layout()
(6) Build the image and text features
We normalize the images, tokenize each text input, and run the forward pass of the model to obtain the image and text features.
image_input = torch.tensor(np.stack(images)).cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()

with torch.no_grad():
    image_features = model.encode_image(image_input).float()  # image features
    text_features = model.encode_text(text_tokens).float()    # text features
(7) Compute cosine similarity
Normalize the features and compute the cosine similarity:
image_features /= image_features.norm(dim=-1, keepdim=True)  # normalize the image features
text_features /= text_features.norm(dim=-1, keepdim=True)    # normalize the text features
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T  # matrix product of normalized features = cosine similarity
Visualize the similarity matrix as a heat map:
count = len(descriptions)

plt.figure(figsize=(20, 14))
plt.imshow(similarity, vmin=0.1, vmax=0.3)
# plt.colorbar()
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)

for side in ["left", "top", "right", "bottom"]:
    plt.gca().spines[side].set_visible(False)

plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

plt.title("Cosine similarity between text and image features", size=20)
As expected, the matched image-text pairs lie on the diagonal and have the highest similarity values.
(8) Zero-shot image classification
from torchvision.datasets import CIFAR100

cifar100 = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True)

text_descriptions = [f"This is a photo of a {label}" for label in cifar100.classes]  # embed the class names into the prompt
text_tokens = clip.tokenize(text_descriptions).cuda()  # tokenize the texts

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()     # encode the texts
    text_features /= text_features.norm(dim=-1, keepdim=True)  # normalize the text features

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)  # scaled cosine similarity -> probabilities
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1)  # take the 5 most similar labels

# visualize the classification results
plt.figure(figsize=(16, 16))

for i, image in enumerate(original_images):
    plt.subplot(4, 4, 2 * i + 1)
    plt.imshow(image)
    plt.axis("off")

    plt.subplot(4, 4, 2 * i + 2)
    y = np.arange(top_probs.shape[-1])
    plt.grid()
    plt.barh(y, top_probs[i])
    plt.gca().invert_yaxis()
    plt.gca().set_axisbelow(True)
    plt.yticks(y, [cifar100.classes[index] for index in top_labels[i].numpy()])
    plt.xlabel("probability")

plt.subplots_adjust(wspace=0.5)
plt.show()