Title | SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation |
Address | https://www.ijcai.org/proceedings/2023/0144.pdf |
Journal/Time | IJCAI 2023 |
Author | Zhejiang University |
Code | https://github.com/NaturalKnight/SLViT |
Read | 2024/08/09 |
Table | #RIS #Seg |
Similar to LAVT, cross-modal fusion is performed inside the encoder.
Image + text
ViT + BERT
Feature maps for stages 2, 3, and 4 are obtained by downsampling: a 3 × 3 convolution with stride 2, followed by a batch normalization layer.
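The stride-2, 3 × 3 downsampling above halves the spatial resolution at each stage. A minimal sketch of the size arithmetic, assuming padding 1 (so the kernel exactly halves each side) and an initial 4× patch embedding for stage 1 (both assumptions, not stated in the note):

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output-size formula: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Stage resolutions for the 480 x 480 training input, halving at stages 2, 3, 4.
sizes = [480 // 4]  # stage-1 resolution after an assumed 4x patch embed
for _ in range(3):
    sizes.append(conv_out_size(sizes[-1]))
print(sizes)  # → [120, 60, 30, 15]
```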
Hamburger decoder [Geng et al., 2021]: a Hamburger function, a 1 × 1 convolution, and an upsampling function.
Backbone pretrained on ImageNet-22K, following SegNeXt [Guo et al., 2022].
Loss: cross-entropy (CE).
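A minimal sketch of per-pixel cross-entropy on flattened binary masks; the function name and the treatment of the task as pixel-wise binary classification are illustrative assumptions, not the paper's exact formulation:

```python
import math

def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels.

    probs:   predicted foreground probabilities in (0, 1)
    targets: ground-truth labels, 0 or 1
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

print(round(cross_entropy([0.9, 0.2, 0.8], [1, 0, 1]), 4))  # → 0.1839
```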
BERT, 12 layers, hidden dimension 768.
In the convolutional branches of LMFA, kernel sizes k1 = 7, k2 = 11, k3 = 21 are used. The rest of the weights in the model are randomly initialized.
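Large kernels like 7, 11, and 21 keep the feature-map size unchanged when each branch uses "same" zero padding of k // 2; the padding scheme is an assumption here, shown with a naive 1-D convolution for brevity:

```python
def conv1d_same(x, kernel):
    """Naive 1-D convolution with zero 'same' padding (pad = k // 2),
    so the output length equals the input length for odd kernels."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(x))]

x = list(range(32))
for k in (7, 11, 21):  # the LMFA branch kernel sizes from the note above
    avg = [1.0 / k] * k  # averaging kernel, purely illustrative
    assert len(conv1d_same(x, avg)) == len(x)
print("all branch outputs keep the input length")
```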
AdamW optimizer with weight decay 0.01.
The learning rate is initialized as 3e-5 and scheduled by polynomial learning rate decay with a power of 0.9.
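The polynomial decay above is typically lr = base_lr · (1 − t/T)^0.9; whether T counts epochs or iterations is not stated in the note, so the sketch below leaves that as a parameter:

```python
def poly_lr(base_lr: float, step: int, total_steps: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay: base_lr * (1 - step/total)^power."""
    return base_lr * (1 - step / total_steps) ** power

base = 3e-5
for step in (0, 30, 60):  # e.g. decaying over the 60 training epochs
    print(step, poly_lr(base, step, 60))
```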
Epoch 60, Batch Size 16.
Image Size 480 x 480.
Name | Images | References | Referring Expressions | Task | Note |
---|---|---|---|---|---|
RefCOCO | 19,994 | 50,000 | 142,209 | Referring Expression Segmentation | |
RefCOCO+ | 19,992 | 49,856 | 141,564 | | |
G-Ref | 26,711 | 54,822 | 104,560 | | Longer expressions and fewer objects than the two datasets above |
overall intersection-over-union (oIoU),
mean intersection-over-union (mIoU),
precision at the 0.5, 0.7, and 0.9 threshold values.
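A hedged sketch of the three metrics on flattened binary masks (function names are illustrative): oIoU accumulates intersection and union over the whole dataset before dividing, mIoU averages per-sample IoU, and Prec@X is the fraction of samples whose IoU exceeds the threshold.

```python
def iou(pred, gt):
    """IoU of two flattened binary masks (lists of 0/1)."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def metrics(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    # oIoU: cumulative intersection over cumulative union across all samples.
    inter = sum(sum(p & g for p, g in zip(pr, gt)) for pr, gt in zip(preds, gts))
    union = sum(sum(p | g for p, g in zip(pr, gt)) for pr, gt in zip(preds, gts))
    ious = [iou(pr, gt) for pr, gt in zip(preds, gts)]
    oiou = inter / union if union else 1.0
    miou = sum(ious) / len(ious)               # mIoU: mean of per-sample IoUs
    prec = {t: sum(i > t for i in ious) / len(ious) for t in thresholds}
    return oiou, miou, prec

preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts   = [[1, 1, 1, 0], [0, 1, 0, 0]]
print(metrics(preds, gts))  # oIoU 0.4, mIoU ~0.333
```

Note that oIoU weights large objects more heavily than mIoU, which is why papers commonly report both.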
[Geng et al., 2021] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? arXiv preprint arXiv:2109.04553, 2021.
[Guo et al., 2022] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 2022.