[RIS] SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation


1. BaseInfo

Title: SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation
Address: https://www.ijcai.org/proceedings/2023/0144.pdf
Journal/Time: IJCAI 2023
Author: Zhejiang University
Code: https://github.com/NaturalKnight/SLViT
Read: 2024/08/09
Tags: #RIS #Seg

2. Creative Q->A

  1. Visual feature extraction and cross-modal fusion are treated separately, so vision-language alignment is insufficient. -> Language-Guided Multi-Scale Fusion Attention (LMFA)
  2. The two modules run sequentially, so information interaction between them is insufficient. -> Uncertain Region Cross-Scale Enhancement module (URCE)
    Together these form SLViT.

3. Concrete

3.1. Model

[Figure: overall SLViT architecture]

Fusion is performed inside the encoder, similar to LAVT.

3.1.1. Input

Image + text

3.1.2. Backbone

ViT + BERT
The stage-2/3/4 feature maps are obtained by downsampling: a convolution with stride 2 and kernel size 3 × 3, followed by a batch-normalization layer (sketched below).
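A minimal sketch of this downsampling step, assuming standard PyTorch layers (the function name is mine, not the paper's):

```python
import torch.nn as nn

def make_downsample(dim_in: int, dim_out: int) -> nn.Sequential:
    """Stride-2 3x3 conv + BatchNorm, used to derive the stage-2/3/4
    feature maps from the previous stage (a sketch, not the paper's code)."""
    return nn.Sequential(
        nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(dim_out),
    )
```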

  • Language-Guided Multi-Scale Fusion Attention (LMFA)
    • A 5 × 5 convolution extracts preliminary local features.
    • Multi-scale convolutions 1 × k_i: three parallel convolutional branches with different kernel sizes capture local features under different receptive fields, providing a spatial inductive bias for modeling rich local visual information. The kernel sizes are 7, 11, and 21.
    • Cross attention
    • Gating
    • A 1 × 1 convolution after the two branches are fused (see the sketch after the figure below).
      The GatedCrossModalAttention in the released code is the same as PWAM in LAVT.

[Figure: LMFA module]
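Below is a hedged sketch of LMFA's convolutional branches based only on the bullets above. The class and argument names are mine; decomposing each branch into 1 × k and k × 1 strip convolutions is an assumption suggested by the SegNeXt-style kernel sizes, not confirmed by the paper.

```python
import torch
import torch.nn as nn

class MultiScaleConvAttention(nn.Module):
    """Sketch of LMFA's visual branch: a 5x5 depth-wise conv for
    preliminary local features, three strip-conv branches with kernel
    sizes 7/11/21 for multi-scale context, and a 1x1 conv to fuse."""
    def __init__(self, dim: int, kernels=(7, 11, 21)):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in kernels
        )
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.local(x)                              # preliminary local features
        u = u + sum(branch(u) for branch in self.branches)  # multi-scale context
        return self.fuse(u)                            # 1x1 fusion
```

The language-guided cross attention with gating (the GatedCrossModalAttention / PWAM part) would consume these features together with the BERT embeddings; it is omitted from this sketch.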

  • Uncertain Region Cross-Scale Enhancement (URCE)
    [Figure: URCE module]
    Multi-head self-attention (a hedged sketch follows).
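A speculative sketch of what the attention step might look like: pixels whose coarse foreground probability is closest to 0.5 are treated as uncertain and attend to tokens from another scale. The uncertainty rule, top-k selection, and every name here are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class UncertainRegionAttention(nn.Module):
    """Speculative sketch of URCE's attention step. Tokens whose coarse
    foreground probability is closest to 0.5 (most uncertain) act as
    queries over tokens from another scale; the enhanced tokens are
    written back. Selection rule and names are assumptions."""
    def __init__(self, dim: int, heads: int = 8, topk: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.topk = topk

    def forward(self, feat, coarse_logits, context):
        # feat: (B, N, C) tokens at the current scale
        # coarse_logits: (B, N) foreground logits at the same resolution
        # context: (B, M, C) tokens from another scale (keys/values)
        prob = coarse_logits.sigmoid()
        uncertainty = -(prob - 0.5).abs()                  # higher = less confident
        idx = uncertainty.topk(self.topk, dim=1).indices   # (B, topk)
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, feat.size(-1))
        q = torch.gather(feat, 1, gather_idx)              # uncertain tokens
        refined, _ = self.attn(q, context, context)        # cross-scale enhancement
        out = feat.clone()
        out.scatter_(1, gather_idx, q + refined)           # residual write-back
        return out
```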

3.1.3. Neck

3.1.4. Decoder

Hamburger [Geng et al., 2021]: a Hamburger function, a 1 × 1 convolution, and an upsampling function.
Initialization uses ImageNet-22K pre-trained weights from SegNeXt [Guo et al., 2022].
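A minimal sketch of such a Hamburger-style head, assuming the standard bread-ham-bread structure from Geng et al., 2021 with a plain multiplicative-update NMF as the matrix decomposition (the real Hamburger uses a one-step gradient trick; all names and sizes here are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HamburgerHead(nn.Module):
    """Sketch of a Hamburger-style decoder head: lower 1x1 conv ("bread"),
    a low-rank matrix decomposition ("ham", simplified NMF here), upper
    1x1 conv, then a 1x1 classifier and bilinear upsampling."""
    def __init__(self, in_ch, mid_ch=256, num_classes=2, rank=64, steps=6):
        super().__init__()
        self.lower = nn.Conv2d(in_ch, mid_ch, 1)
        self.upper = nn.Conv2d(mid_ch, mid_ch, 1)
        self.cls = nn.Conv2d(mid_ch, num_classes, 1)
        self.rank, self.steps = rank, steps

    def _nmf(self, x, eps=1e-6):
        # x: (B, C, N), non-negative. Factorize x ~ d @ c with
        # multiplicative updates; return the low-rank reconstruction.
        B, C, N = x.shape
        d = x.new_empty(B, C, self.rank).uniform_(0, 1)
        c = x.new_empty(B, self.rank, N).uniform_(0, 1)
        for _ in range(self.steps):
            c = c * (d.transpose(1, 2) @ x) / (d.transpose(1, 2) @ d @ c + eps)
            d = d * (x @ c.transpose(1, 2)) / (d @ (c @ c.transpose(1, 2)) + eps)
        return d @ c

    def forward(self, feat, out_size):
        x = F.relu(self.lower(feat))                 # lower bread, non-negative for NMF
        B, C, H, W = x.shape
        ham = self._nmf(x.flatten(2)).view(B, C, H, W)
        x = x + self.upper(ham)                      # upper bread + residual
        logits = self.cls(x)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)
```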

3.1.5. Loss

Cross-entropy (CE) loss.
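Presumably the standard per-pixel cross-entropy over {background, target}; a minimal sketch:

```python
import torch
import torch.nn as nn

# Per-pixel cross-entropy, the "CE" above: class logits against the
# binary ground-truth mask.
criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, 2, 480, 480)        # (B, classes, H, W)
mask = torch.randint(0, 2, (2, 480, 480))   # (B, H, W), values in {0, 1}
loss = criterion(logits, mask)
```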

3.2. Training

BERT: 12 layers, hidden dimension 768.
In the convolutional branches of LMFA, kernel sizes k1 = 7, k2 = 11, k3 = 21 are used. The remaining weights of the model are randomly initialized.
AdamW optimizer with weight decay 0.01.
The learning rate is initialized to 3e-5 and scheduled by polynomial learning rate decay with a power of 0.9 (sketched below).
60 epochs, batch size 16.
Image size 480 × 480.
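A minimal sketch of this optimizer/schedule setup (the model and iteration count are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, 1)          # stand-in for SLViT
iters_per_epoch = 1000              # placeholder; depends on dataset and batch size
total_iters = 60 * iters_per_epoch  # 60 epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
# Polynomial decay with power 0.9, as stated above.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1 - min(it, total_iters) / total_iters) ** 0.9
)
```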

3.2.1. Resource

3.2.2 Dataset

| Name | Images | References | Referring expressions | Task | Note |
| --- | --- | --- | --- | --- | --- |
| RefCOCO | 19,994 | 50,000 | 142,209 | Referring Expression Segmentation | |
| RefCOCO+ | 19,992 | 49,856 | 141,564 | Referring Expression Segmentation | |
| G-Ref | 26,711 | 54,822 | 104,560 | Referring Expression Segmentation | Longer expressions than the other two; fewer objects |

3.3. Eval

overall intersection-over-union (oIoU),
mean intersection-over-union (mIoU),
precision at the 0.5, 0.7, and 0.9 IoU threshold values (a minimal metric sketch follows).
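A minimal sketch of how these three metrics differ (mask format and names are mine):

```python
import numpy as np

def iou_parts(pred, gt):
    """pred, gt: boolean masks for one sample."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter, union

def evaluate(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    inters, unions, ious = 0, 0, []
    for p, g in zip(preds, gts):
        i, u = iou_parts(p, g)
        inters, unions = inters + i, unions + u
        ious.append(i / max(u, 1))
    oiou = float(inters) / max(unions, 1)   # pool I and U over the whole set
    miou = float(np.mean(ious))             # average the per-sample IoU
    prec = {t: float(np.mean([iou > t for iou in ious])) for t in thresholds}
    return oiou, miou, prec
```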
[Figure: quantitative comparison results]

3.4. Ablation

[Figure: ablation study results]

4. Reference

[Geng et al., 2021] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? arXiv preprint arXiv:2109.04553, 2021.
[Guo et al., 2022] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 2022.
