Code: https://github.com/microsoft/RegionCLIP
Source: CVPR 2022 Oral | Microsoft | Pengchuan Zhang
Recently, vision-language models such as CLIP and ALIGN have made major breakthroughs: they learn image-text matching from enormous numbers of image-text pairs and perform well in many settings without any manually annotated labels.
To explore whether this recipe also works in the region-caption setting, the authors built an R-CNN-style object detector on top of a pretrained CLIP model.
Results and current status:
Main idea:
The authors investigate where this gap (between image-level and region-level matching) comes from.
How this paper closes the gap between images and regions:
Challenges:
How to pretrain:
How to run zero-shot inference:
How to transfer to object detection:
The goal of this paper is to learn a region-level visual-semantic space that covers a rich enough object vocabulary to support open-vocabulary object detection.
The overall framework is shown in Figure 2:
1. Visual region representation
Image regions can be produced by an off-the-shelf object localizer (e.g., an RPN) or by dense sliding windows.
The authors generate regions with an RPN trained on human-annotated bounding boxes; box categories are not distinguished here.
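For intuition, the dense sliding-window alternative named above is easy to sketch. The snippet below is purely illustrative (not the repo's RPN); all names and the window sizes are hypothetical:

```python
import torch

def sliding_window_boxes(img_h, img_w, sizes=(64, 128, 256), stride_frac=0.5):
    """Generate dense sliding-window boxes (x1, y1, x2, y2) over an image.

    A hypothetical stand-in for an RPN: every `size` x `size` window,
    stepped by `stride_frac * size`, clipped to the image bounds.
    """
    boxes = []
    for s in sizes:
        stride = max(1, int(s * stride_frac))
        for y in range(0, max(1, img_h - s + 1), stride):
            for x in range(0, max(1, img_w - s + 1), stride):
                boxes.append([x, y, min(x + s, img_w), min(y + s, img_h)])
    return torch.tensor(boxes, dtype=torch.float32)

# e.g. a few hundred candidate regions for a 480x640 image
print(sliding_window_boxes(480, 640).shape)
```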
2. Semantic region representation
A single image usually contains rich semantics and objects from many different categories, and manually annotating data at this scale is impractical.
So the authors first build a large concept pool that covers region-level vocabulary as broadly as possible, as shown in Figure 2; the pool is parsed from text corpora.
With the concept pool in hand, the semantic representation of each region is built by encoding every concept with the CLIP text encoder, giving one embedding $l_j$ per concept; a sketch follows.
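A minimal sketch with the OpenAI `clip` package; the prompt template and the three concepts are illustrative stand-ins for the parsed pool:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)  # any CLIP backbone works here

concepts = ["dog", "kite", "surfboard"]      # the real pool holds thousands of parsed concepts
prompts = clip.tokenize([f"a photo of a {c}" for c in concepts]).to(device)

with torch.no_grad():
    l = model.encode_text(prompts)           # one embedding l_j per concept
    l = l / l.norm(dim=-1, keepdim=True)     # L2-normalize for cosine matching
```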
3. Visual-semantic alignment for regions
① How to align region-text pairs: CLIP is used to build pseudo labels, i.e., the concept with the highest score predicted by the teacher CLIP model serves as the region's description.
The authors use the teacher visual encoder to link regions with text, where the text side is the semantic embeddings: the visual representation $v_i^t$ of region $r_i$ is extracted by the teacher visual encoder $V_t$.
Next, the matching scores between $v_i^t$ and $\{l_j\}$ are computed; the highest-scoring concept is associated with the region, yielding a pseudo label $\{v_i, l_m\}$ for every region.
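A sketch of this pseudo-labeling step (cosine similarity plus argmax; tensor names follow the notation above, shapes are assumptions):

```python
import torch

def pseudo_label_regions(v_t, l):
    """Match teacher region features v_t [N, D] against concept embeddings l [C, D].

    Returns, for each region, the index m of its best-matching concept l_m.
    """
    v_t = v_t / v_t.norm(dim=-1, keepdim=True)
    l = l / l.norm(dim=-1, keepdim=True)
    scores = v_t @ l.t()           # [N, C] matching scores S(v_i^t, l_j)
    return scores.argmax(dim=-1)   # m = argmax_j S(v_i^t, l_j)
```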
② How to pretrain:
Both region-text pairs and web image-text pairs are used.
The region-text pairs are the ones created by the method in ①.
Given the region-text pairs $\{v_i, l_m\}$, the visual encoder is trained with a contrastive loss and a distillation loss; the objective has 3 parts in total.
The region-text contrastive loss is given below, where $\tau$ is a predefined temperature and $N_{r_i}$ is the set of negative textual samples for region $r_i$, i.e., the concepts in the batch that match other regions but not $r_i$:
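Reconstructed from these definitions (following the paper's formulation, with $S(\cdot,\cdot)$ the cosine similarity between a region feature and a concept embedding):

$$L_{cntrst} = \frac{1}{N} \sum_i -\log \frac{\exp\!\big(S(v_i, l_m)/\tau\big)}{\exp\!\big(S(v_i, l_m)/\tau\big) + \sum_{l_k \in N_{r_i}} \exp\!\big(S(v_i, l_k)/\tau\big)}$$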
Besides the contrastive loss, there is a knowledge-distillation term for each image region. The distillation loss is given below, where $q_i^t$ is the soft target from the teacher model and $q_i$ is the student model's prediction:
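Again following the paper's formulation, the soft targets are the teacher's matching distribution over all concepts, and the loss is an L1 distance:

$$q_i^t = \mathrm{softmax}\big(S(v_i^t, l_1), \ldots, S(v_i^t, l_C)\big), \qquad L_{dist} = \frac{1}{N} \sum_i \big\lVert q_i^t - q_i \big\rVert_1$$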
The image-text contrastive loss $L_{cntrst\text{-}img}$ is the region-level loss extended to a special case where ① one box covers the whole image, ② the text description comes from the web caption, and ③ the negative samples are text descriptions of other images.
③ Zero-shot inference
After pretraining, the learned visual encoder can be used directly for region reasoning tasks: obtain regions from an RPN, extract each region's visual representation with the trained visual encoder, then match it against the text embeddings of the vocabulary and take the most similar text.
Experiments show that using the RPN score improves zero-shot inference, so the authors take the mean of the RPN objectness score and the category confidence score as the final score used for matching.
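A sketch of this scoring rule (shapes and names are assumptions; `l` is the concept-embedding matrix from earlier):

```python
import torch

def zero_shot_scores(v, l, objectness):
    """Combine category confidence with RPN objectness.

    v: [N, D] region features, l: [C, D] concept embeddings,
    objectness: [N] RPN objectness scores in [0, 1].
    """
    v = v / v.norm(dim=-1, keepdim=True)
    l = l / l.norm(dim=-1, keepdim=True)
    conf = (v @ l.t()).softmax(dim=-1)       # [N, C] category confidence
    return (conf + objectness[:, None]) / 2  # mean of the two scores
```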
During pretraining, the visual encoder learns from the region-text alignment provided by the teacher model without any manual intervention, so the alignment carries some noise. When a stronger supervision signal is introduced (e.g., human-annotated labels), the visual encoder can be further fine-tuned, as in Figure 2.
How is the pretrained network transferred to an object detector? The authors use it to initialize the detector's visual backbone: an existing RPN localizes the object regions first, and each region is then matched with text.
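In detectron2 terms, the transfer amounts to loading the pretrained checkpoint into the detector before fine-tuning. A rough sketch, assuming `cfg` has the RegionCLIP-specific keys merged in (the checkpoint path is the one used later in this post):

```python
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()  # the real run additionally merges the RegionCLIP config additions
model = build_model(cfg)

# Initialize the visual backbone from the RegionCLIP checkpoint; keys that only
# exist in the detector (e.g., the box head) stay at their fresh initialization.
DetectionCheckpointer(model).load(
    "pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50x4.pth")
```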
Open-vocabulary object detection:
For pretraining, the authors use image-text pairs from Conceptual Captions (CC3M, matching the `pretrained-cc` checkpoints used below).
For the transfer-learning experiments on open-vocabulary detection, training uses the base categories of COCO and of LVIS.
Evaluation uses the standard object-detection metrics AP and AP50.
1. Pretraining
2. Transfer to object detection
3. Zero-shot inference for object detection
RegionCLIP reaches 39.3 AP on the novel categories in open-vocabulary detection.
Environment setup:
https://github.com/microsoft/RegionCLIP/blob/zero-shot/docs/INSTALL.md
Data preparation:
https://github.com/microsoft/RegionCLIP/blob/zero-shot/datasets/README.md
1. Zero-shot evaluation (with GT boxes; this mainly tests region recognition)
```bash
# RN50x4, GT, COCO
python3 ./tools/train_net.py \
  --eval-only \
  --num-gpus 1 \
  --config-file ./configs/COCO-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_ovd_zsinf.yaml \
  MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50x4.pth \
  MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb_rn50x4.pth \
  MODEL.CLIP.CROP_REGION_TYPE GT \
  MODEL.CLIP.MULTIPLY_RPN_SCORE False \
  MODEL.CLIP.TEXT_EMB_DIM 640 \
  MODEL.RESNETS.DEPTH 200 \
  MODEL.ROI_BOX_HEAD.POOLER_RESOLUTION 18
```
Evaluation results for bbox:
| AP | AP50 | AP75 | APs | APm | APl |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 64.946 | 65.451 | 64.903 | 52.922 | 71.390 | 70.545 |
Per-category bbox AP:
| category | AP | category | AP | category | AP |
|:-----------|:-------|:-----------|:-------|:-------------|:-------|
| person | 73.463 | bicycle | 79.673 | car | 84.925 |
| motorcycle | 66.633 | airplane | 97.827 | bus | 85.852 |
| train | 89.867 | truck | 51.154 | boat | 82.183 |
| bench | 42.759 | bird | 73.442 | cat | 76.972 |
| dog | 73.316 | horse | 83.049 | sheep | 90.901 |
| cow | 81.264 | elephant | 93.201 | bear | 84.405 |
| zebra | 95.208 | giraffe | 95.268 | backpack | 46.059 |
| umbrella | 71.930 | handbag | 33.646 | tie | 80.318 |
| suitcase | 65.223 | frisbee | 24.740 | skis | 38.639 |
| snowboard | 14.005 | kite | 59.969 | skateboard | 53.792 |
| surfboard | 44.970 | bottle | 76.109 | cup | 64.295 |
| fork | 53.513 | knife | 19.605 | spoon | 30.435 |
| bowl | 52.905 | banana | 81.173 | apple | 67.057 |
| sandwich | 72.380 | orange | 68.320 | broccoli | 91.791 |
| carrot | 80.076 | pizza | 87.340 | donut | 78.442 |
| cake | 73.971 | chair | 68.974 | couch | 57.402 |
| bed | 56.792 | toilet | 74.647 | tv | 65.258 |
| laptop | 66.412 | mouse | 23.762 | remote | 19.321 |
| keyboard | 50.904 | microwave | 66.310 | oven | 52.184 |
| toaster | 29.439 | sink | 60.021 | refrigerator | 60.273 |
| book | 84.199 | clock | 91.148 | vase | 62.869 |
| scissors | 40.443 | toothbrush | 59.085 | | |
2. Fine-tuning on the 48 COCO base categories
Where the data enters the model: RegionCLIP/detectron2/engine/train_loop.py, line 273
```
# data
[{'file_name': 'datasets/coco/train2017/000000526362.jpg', 'height': 403, 'width': 640, 'image_id': 526362, 'image': tensor([[[67, 70, 74, ..., 56, 56, 56],
[68, 71, 74, ..., 57, 57, 57],
[70, 72, 74, ..., 59, 59, 59],
...,
[69, 71, 75, ..., 88, 83, 80],
[67, 69, 72, ..., 82, 77, 74],
[66, 68, 70, ..., 79, 74, 71]],
[[49, 50, 52, ..., 56, 54, 53],
[51, 52, 54, ..., 56, 54, 52],
[54, 55, 56, ..., 57, 53, 50],
...,
[66, 69, 74, ..., 80, 74, 71],
[63, 65, 69, ..., 74, 68, 65],
[61, 63, 66, ..., 70, 64, 61]],
[[47, 51, 58, ..., 31, 34, 36],
[48, 51, 57, ..., 33, 37, 40],
[50, 52, 55, ..., 37, 43, 46],
...,
[61, 64, 70, ..., 81, 75, 72],
[59, 61, 66, ..., 75, 69, 66],
[57, 59, 63, ..., 71, 65, 62]]], dtype=torch.uint8), 'instances': Instances(num_instances=11, image_height=736, image_width=1169, fields=[gt_boxes: Boxes(tensor([[125.1378, 305.5587, 361.0018, 481.5047],
[558.6907, 319.8769, 689.9657, 380.5649],
[550.3981, 297.8700, 613.7433, 347.7828],
[681.0156, 162.0844, 896.0568, 673.1569],
[104.9726, 271.2607, 152.7920, 347.9289],
[ 53.9933, 272.5574, 119.6947, 468.4102],
[809.8613, 271.8634, 870.2109, 377.9351],
[683.9929, 315.2929, 733.6754, 439.1162],
[456.3666, 289.1585, 561.3574, 395.3945],
[350.7730, 279.0408, 427.6165, 470.8756],
[ 6.0459, 486.8557, 146.3076, 728.2017]])), gt_classes: tensor([ 2, 2, 2, 0, 0, 0, 15, 15, 5, 0, 2])])}]
```
Model:
```
CLIPFastRCNN(
(offline_backbone): ResNet(
(stem): BasicStem(
(conv1): Conv2d(
3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
)
(res2): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv1): Conv2d(
64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
)
(res3): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv1): Conv2d(
256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
)
(res4): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
(conv1): Conv2d(
512, 256, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(4): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(5): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
)
)
(backbone): ModifiedResNet(
(conv1): Conv2d(3, 40, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=40, eps=1e-05)
(conv2): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=40, eps=1e-05)
(conv3): Conv2d(40, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(avgpool): AvgPool2d(kernel_size=2, stride=2, padding=0)
(relu): ReLU(inplace=True)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(80, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(conv2): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(80, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(-1): AvgPool2d(kernel_size=1, stride=1, padding=0)
(0): Conv2d(80, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
)
)
(1): Bottleneck(
(conv1): Conv2d(320, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(conv2): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(80, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(320, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(conv2): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(80, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(320, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(conv2): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=80, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(80, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(relu): ReLU(inplace=True)
)
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(320, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(avgpool): AvgPool2d(kernel_size=2, stride=2, padding=0)
(conv3): Conv2d(160, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(-1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(0): Conv2d(320, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
)
)
(1): Bottleneck(
(conv1): Conv2d(640, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(160, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(640, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(160, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(640, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(160, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(640, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(160, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(640, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(conv2): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=160, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(160, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(relu): ReLU(inplace=True)
)
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(640, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): AvgPool2d(kernel_size=2, stride=2, padding=0)
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(-1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(0): Conv2d(640, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
)
)
(1): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(6): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(7): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(8): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
(9): Bottleneck(
(conv1): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=320, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(320, 1280, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=1280, eps=1e-05)
(relu): ReLU(inplace=True)
)
)
(layer4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1280, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(avgpool): AvgPool2d(kernel_size=2, stride=2, padding=0)
(conv3): Conv2d(640, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(-1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(0): Conv2d(1280, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
)
)
(1): Bottleneck(
(conv1): Conv2d(2560, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(640, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2560, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(640, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(2560, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(640, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(2560, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(640, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(2560, 640, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): FrozenBatchNorm2d(num_features=640, eps=1e-05)
(avgpool): Identity()
(conv3): Conv2d(640, 2560, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): FrozenBatchNorm2d(num_features=2560, eps=1e-05)
(relu): ReLU(inplace=True)
)
)
(attnpool): AttentionPool2d(
(k_proj): Linear(in_features=2560, out_features=2560, bias=True)
(q_proj): Linear(in_features=2560, out_features=2560, bias=True)
(v_proj): Linear(in_features=2560, out_features=2560, bias=True)
(c_proj): Linear(in_features=2560, out_features=640, bias=True)
)
)
(offline_proposal_generator): RPN(
(rpn_head): StandardRPNHead(
(conv): Conv2d(
1024, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(activation): ReLU()
)
(objectness_logits): Conv2d(1024, 15, kernel_size=(1, 1), stride=(1, 1))
(anchor_deltas): Conv2d(1024, 60, kernel_size=(1, 1), stride=(1, 1))
)
(anchor_generator): DefaultAnchorGenerator(
(cell_anchors): BufferList()
)
)
(roi_heads): CLIPRes5ROIHeads(
(pooler): ROIPooler(
(level_poolers): ModuleList(
(0): ROIAlign(output_size=(18, 18), spatial_scale=0.0625, sampling_ratio=0, aligned=True)
)
)
(box_predictor): FastRCNNOutputLayers(
(cls_score): Linear(in_features=640, out_features=48, bias=False)
(cls_bg_score): Linear(in_features=640, out_features=1, bias=False)
(test_cls_score): Linear(in_features=640, out_features=65, bias=False)
(bbox_pred): Linear(in_features=640, out_features=4, bias=True)
)
)
)
```
Step 1: Generate 1000 proposals.
Step 2: Extract features from the input image with the backbone, giving [1, 1280, 46, 73]; this backbone is initialized from the CLIP visual encoder weights (the RN50x4 variant shown in the dump above) and is trained, while the input text is encoded with CLIP's text encoder, which stays frozen.
Step 3: Assign labels to the proposals predicted by the RPN and discard proposals that fail the IoU threshold. Each proposal is matched to a gt box by IoU, and that gt's category becomes the proposal's category. After assignment the number of proposals is usually below 1000, e.g., 512, and each kept proposal carries its box coordinates, objectness score, assigned class, and matched gt box. With 48 classes, the background label is 48; a sketch of the assignment follows.
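A minimal sketch of this IoU-based assignment (pure PyTorch; the 0.5 threshold is illustrative):

```python
import torch
from torchvision.ops import box_iou

def assign_labels(proposals, gt_boxes, gt_classes, num_classes=48, iou_thresh=0.5):
    """Give each proposal [P, 4] the class of its best-overlapping gt box [G, 4].

    Proposals whose best IoU falls below `iou_thresh` become background,
    encoded as label == num_classes (48 here).
    """
    iou = box_iou(proposals, gt_boxes)    # [P, G] pairwise IoU
    best_iou, best_gt = iou.max(dim=1)    # best gt per proposal
    labels = gt_classes[best_gt].clone()
    labels[best_iou < iou_thresh] = num_classes
    return labels, best_gt
```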
Step 4: Use each proposal's box coordinates to crop the matching features out of the feature map with RoIAlign, unifying the output size to 18x18 and yielding a [512, 1280, 18, 18] tensor, where 512 is the number of proposals. The proposal features are then processed by the backbone's res5 stage, giving [512, 2560, 9, 9].
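The cropping step maps directly onto `torchvision.ops.roi_align`; below is a sketch with the shapes from above (the 1/16 `spatial_scale` matches the ROIPooler in the model dump; the random boxes are placeholders):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 1280, 46, 73)        # backbone output for one image
xy = torch.rand(512, 2) * 300              # placeholder proposals
wh = torch.rand(512, 2) * 100 + 10
boxes = [torch.cat([xy, xy + wh], dim=1)]  # valid (x1, y1, x2, y2); one list entry per image

# 18x18 crops at 1/16 resolution -> [512, 1280, 18, 18]
crops = roi_align(feat, boxes, output_size=(18, 18),
                  spatial_scale=1 / 16, sampling_ratio=0, aligned=True)
print(crops.shape)
```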
Step 5: Apply attention pooling to the proposal features, giving [512, 640]; each proposal is represented by a 640-d vector (average pooling is an alternative).
Step 6: Use the pooled features to produce each proposal's classification score and regression deltas; a simplified head is sketched below.
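This is a simplified stand-in for the `FastRCNNOutputLayers` in the dump, not the repo's exact head: pooled features are scored by cosine similarity against the frozen text embeddings plus a learned background embedding (the temperature value is an assumption):

```python
import torch
import torch.nn as nn

class OpenVocabClsHead(nn.Module):
    """Score pooled region features against frozen text embeddings plus a bg row."""

    def __init__(self, text_emb, temperature=0.01):
        super().__init__()
        self.register_buffer("text_emb", text_emb)   # [48, 640], kept frozen
        self.bg_emb = nn.Parameter(torch.randn(1, text_emb.shape[1]))
        self.tau = temperature

    def forward(self, x):                            # x: [512, 640] pooled features
        x = x / x.norm(dim=-1, keepdim=True)
        w = torch.cat([self.text_emb, self.bg_emb], dim=0)
        w = w / w.norm(dim=-1, keepdim=True)
        return x @ w.t() / self.tau                  # [512, 49] class logits

head = OpenVocabClsHead(torch.randn(48, 640))
logits = head(torch.randn(512, 640))
```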
Step 7: Compute the losses. Classification uses a focal loss between each proposal's predicted class and its assigned gt class, with weight 1 for foreground classes and 0.2 for background; regression uses an L1 loss (or GIoU loss) and is computed on foreground proposals only.
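A sketch of the class-weighted softmax focal loss described above (`gamma=2.0` is an assumption; the 0.2 background weight comes from the text):

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, num_classes=48, gamma=2.0, bg_weight=0.2):
    """Softmax focal loss with foreground weight 1.0 and background weight 0.2."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # [P] per-proposal CE
    pt = torch.exp(-ce)                                      # prob of the true class
    focal = (1 - pt) ** gamma * ce
    weight = torch.where(targets == num_classes,             # background label == 48
                         torch.full_like(focal, bg_weight),
                         torch.ones_like(focal))
    return (weight * focal).sum() / weight.sum()

loss = weighted_focal_loss(torch.randn(512, 49), torch.randint(0, 49, (512,)))
```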