Transformers in Vision: A Survey (Reading Notes)
Author: 凡人多烦事01 | 2024-06-14 19:26:50
A survey published by ACM, discussing applications of Transformers in computer vision (CV).
Abstract:
Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, as compared to recurrent networks, e.g., Long Short-Term Memory (LSTM).
Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks, and demonstrates excellent scalability to very large capacity networks and huge datasets.
As a result, Transformer models and their variants have been successfully used for image recognition [11], [12], object detection [13], [14], segmentation [15], image super-resolution [16], video understanding [17], [18], image generation [19], text-image synthesis [20] and visual question answering [21], [22], among several other use cases [23]-[26].
Although attention models have been extensively used in both feed-forward and recurrent networks [27], [28], Transformers are based solely on the attention mechanism and have a unique implementation (i.e., multi-head attention) optimized for parallelization.
Since Transformers assume minimal prior knowledge about the structure of the problem as compared to their convolutional and recurrent counterparts [30]-[32], they are typically pre-trained using pretext tasks on large-scale (unlabelled) datasets [1], [3].
The first one is self-attention, which allows capturing 'long-term' dependencies between sequence elements, as compared to conventional recurrent models that find it challenging to encode such relationships.
The second key idea is that of pre-training on a large (un)labelled corpus in a (self-)supervised manner, and subsequently fine-tuning to the target task with a small labeled dataset [3], [7], [38].
The self-attention mechanism is an integral component of Transformers, which explicitly models the interactions between all entities of a sequence for structured prediction tasks.
For a given entity in the sequence, the self-attention basically computes the dot-product of the query with all keys, which is then normalized using the softmax operator to get the attention scores.
For the Transformer model [1], which is trained to predict the next entity of the sequence, the self-attention blocks used in the decoder are masked to prevent attending to the subsequent future entities.
Basically, while predicting an entity in the sequence, the attention scores of the future entities are set to zero in masked self-attention.
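To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with an optional causal mask; all variable names and shapes are illustrative and not taken from the survey. In practice, "setting future attention scores to zero" is implemented by setting the pre-softmax logits of future positions to negative infinity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """X: (n, d_model) sequence of n entities; returns (n, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # dot-product of each query with all keys
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)    # mask out future entities
    A = softmax(scores)                               # attention scores (each row sums to 1)
    return A @ V                                      # weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 32, 16
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v, causal=True)   # (6, 16)
```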
In order to encapsulate multiple complex relationships amongst different elements in the sequence, the multi-head attention comprises multiple self-attention blocks (h = 8 in the original Transformer model [1]). Each block has its own set of learnable weight matrices {W_Q^i, W_K^i, W_V^i}, where i = 0, ..., (h-1).
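Continuing the NumPy sketch above (and reusing its self_attention helper, rng and X), a rough multi-head version with h = 8 heads, each owning its own {W_q, W_k, W_v}; the concatenation followed by an output projection W_o mirrors the original design, but the shapes here are illustrative.

```python
def multi_head_attention(X, heads, W_o):
    """heads: list of h (W_q, W_k, W_v) tuples, one per self-attention block."""
    outs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outs, axis=-1) @ W_o        # concat the h heads, then project

h = 8                                                 # as in the original Transformer
d_head = d_model // h
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
out = multi_head_attention(X, heads, W_o)             # (n, d_model)
```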
The main difference of self-attention from the convolution operation is that its filters are dynamically calculated, instead of being static filters (that stay the same for any input) as in the case of convolution.
Self-attention is invariant to permutations and changes in the number of input points. As a result, it can easily operate on irregular inputs, as opposed to standard convolution that requires a grid structure.
In fact, self-attention provides the capability to learn global as well as local features, and provides the expressivity to adaptively learn kernel weights as well as the receptive field (similar to deformable convolutions [42]).
2.2 Self-Supervised Pre-training
Self-supervised learning has been very effectively used in the pre-training stage. The self-supervision based pre-training stage has played a crucial role in unleashing the scalability and generalization of Transformer networks.
The basic idea of SSL is to fill in the blanks, i.e., try to predict the occluded data in images, the future or past frames in temporal video sequences, or to solve a defined pretext task.
Self-supervised learning provides a promising learning paradigm since it enables learning from a vast amount of readily available non-annotated data.
The pseudo-labels for the pretext task are automatically generated (without requiring any expensive manual annotations) based on data attributes and the task definition. Therefore, the pretext task definition is a critical choice in SSL.
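As a concrete illustration of such automatic pseudo-label generation, a classic pretext task from the SSL literature (not one singled out in this excerpt) is rotation prediction: the label is derived purely from a data transformation, with no manual annotation.

```python
import numpy as np

def rotation_pretext(images, rng=np.random.default_rng(0)):
    """images: list of (H, W, C) arrays -> (transformed images, pseudo-labels)."""
    rotated, labels = [], []
    for img in images:
        k = int(rng.integers(4))          # pseudo-label: number of 90-degree rotations
        rotated.append(np.rot90(img, k))
        labels.append(k)
    return rotated, labels                # a classifier is then trained to predict k
```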
Existing SSL methods can be categorized based upon their pretext tasks into (a) generative approaches which synthesize images or videos (given conditional inputs), (b) context-based methods which exploit the relationships between image patches or video frames, and (c) cross-modal methods which leverage multiple data modalities.
Bidirectional Encoder Representations from Transformers (BERT) [3] proposed to jointly encode the right and left context of a word in a sentence, thus improving the learned feature representations for textual data in a self-supervised manner.
Masked Language Model (MLM) - A fixed percentage (15%) of words in a sentence are randomly masked and the model is trained to predict these masked words using cross-entropy loss.
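A minimal sketch of the MLM masking step, assuming a toy whitespace tokenizer and a [MASK] token; real BERT operates on WordPiece tokens and replaces a fraction of the selected tokens with random or unchanged words rather than always using [MASK].

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]"):
    tokens = list(tokens)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = random.sample(range(len(tokens)), n_mask)
    targets = {i: tokens[i] for i in positions}   # the words the model must recover
    for i in positions:
        tokens[i] = mask_token
    return tokens, targets

masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
# The model is trained with cross-entropy to predict `targets` at the masked positions.
```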
Next Sentence Prediction (NSP) - Given a pair of sentences, the model predicts a binary label, i.e., whether the pair is valid from the original document or not. The training data for this can easily be generated from any monolingual text corpus.
NSP enables the model to capture sentence-to-sentence relationships, which are crucial in many language modeling tasks such as Question Answering and Natural Language Inference.
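A small sketch of how such NSP training pairs could be generated from any monolingual corpus; the 50/50 split and function name are illustrative assumptions.

```python
import random

def make_nsp_pairs(sentences, num_pairs, seed=0):
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        i = rng.randrange(len(sentences) - 1)
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))   # genuine next sentence
        else:
            j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], 0))       # randomly paired sentence
    return pairs
```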
3. Self-Attention and Transformers in CV
We broadly categorize vision models with self-attention into two categories: the models which use single-head self-attention (Sec. 3.1), and the models which employ multi-head self-attention based Transformer modules into their architectures (Sec. 3.2).
Single-head self-attention based frameworks generally apply global or local self-attention within CNN architectures, utilize matrix factorization to enhance design efficiency, or use vectorized attention models.
This way, the non-local operation is able to capture interactions between any two positions in the feature map regardless of the distance between them.
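A rough NumPy sketch of a non-local operation (embedded-Gaussian form) over a flattened feature map; the projections W_theta, W_phi, W_g and all shapes are illustrative rather than the exact published formulation.

```python
import numpy as np

def non_local(X, W_theta, W_phi, W_g):
    """X: (H*W, C) feature map flattened over spatial positions."""
    theta, phi, g = X @ W_theta, X @ W_phi, X @ W_g
    affinity = theta @ phi.T                                   # similarity between ALL position pairs
    w = np.exp(affinity - affinity.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                      # normalized attention over positions
    return w @ g                                               # every position aggregates every other one

rng = np.random.default_rng(0)
X = rng.normal(size=(14 * 14, 256))
W_theta, W_phi, W_g = (rng.normal(size=(256, 128)) for _ in range(3))
Y = non_local(X, W_theta, W_phi, W_g)                          # (196, 128); distance plays no role
```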
Video classification is an example of a task where long-range interactions between pixels exist both in space and time.
Although the self-attention allows us to model full-image contextual information, it is both memory and compute intensive.
Another shortcoming of the convolutional operator comes from the fact that after training, it applies fixed weights regardless of any changes to the visual input.
Self-attention has therefore been explored as an alternative to convolutional operators.
3.1.2 Using Self-Attention Alone
On the other hand, global attention [1], which attends to all spatial locations of the input, can be computationally intensive and is preferred on down-sampled small images, image patches [11], or for augmenting the convolutional feature space [79].
Below, we discuss these methods by categorizing them into: uniform-scale ViTs having single-scale features through all layers (Sec. 3.2.1), multi-scale ViTs that learn hierarchical features which are more suitable for dense prediction tasks (Sec. 3.2.2), and hybrid designs having convolution operations within ViTs (Sec. 3.2.3).
The original Vision Transformer [11] model belongs to this family, where multi-head self-attention is applied at a consistent scale in the input image and the spatial scale is maintained through the network hierarchy.
Vision Transformer (ViT) [11] (Fig. 6) is the first work to showcase how Transformers can 'altogether' replace standard convolutions in deep neural networks on large-scale image datasets.
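A minimal sketch of the ViT "patchify" step: the image is split into non-overlapping 16x16 patches, each flattened and linearly projected to a token embedding. The projection matrix and sizes below are illustrative; ViT additionally prepends a learnable class token and adds position embeddings, which are omitted here.

```python
import numpy as np

def patchify(img, patch_size, W_embed):
    """img: (H, W, C); returns (num_patches, d_model) patch tokens."""
    H, W, C = img.shape
    P = patch_size
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)       # flatten each patch
    return patches @ W_embed                       # linear projection to token embeddings

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
W_embed = rng.normal(size=(16 * 16 * 3, 768))
tokens = patchify(img, 16, W_embed)                # (196, 768)
```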
Besides using augmentation and regularization procedures common in CNNs, the main contribution of DeiT [12] is a novel native distillation approach for Transformers which uses a CNN as a teacher model (RegNetY-16GF [86]) to train the Transformer model.
In standard ViTs, the number of tokens and the token feature dimension are kept fixed throughout different blocks of the network.
Multi-stage hierarchical ViT designs, in contrast, gradually reduce the number of tokens while progressively increasing the token feature dimension.
These architectures mostly sparsify tokens by merging neighboring tokens and projecting them to a higher-dimensional feature space.
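A sketch of this token-merging step: each 2x2 neighbourhood of tokens is concatenated and linearly projected to a higher channel dimension, halving the spatial resolution. The 4C -> 2C projection mirrors Swin-style patch merging and is an assumption, not a detail quoted from this excerpt.

```python
import numpy as np

def merge_tokens(tokens, H, W, W_proj):
    """tokens: (H*W, C) -> (H//2 * W//2, C_out)."""
    C = tokens.shape[-1]
    x = tokens.reshape(H, W, C)
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape((H // 2) * (W // 2), 4 * C)      # concatenate each 2x2 neighbourhood
    return x @ W_proj                              # project to a higher feature dimension

rng = np.random.default_rng(0)
tokens = rng.normal(size=(56 * 56, 96))            # stage-1 tokens (illustrative sizes)
merged = merge_tokens(tokens, 56, 56, rng.normal(size=(4 * 96, 192)))  # (784, 192)
```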
Some of them are hybrid designs (with both convolution and self-attention operations, see Sec. 3.2.3), while others only employ a pure self-attention based design (discussed next).
Pyramid ViT (PVT) [93] is the first hierarchical design for ViT, and proposes a progressive shrinking pyramid and spatial-reduction attention.
Convolutions do an excellent job at capturing low-level local features in images, and have been explored in multiple hybrid ViT designs, especially at the beginning to "patchify and tokenize" an input image.
Contrastive learning based self-supervised approaches, which have gained significant success for CNN based vision tasks, have also been investigated for ViTs.
3.3 Transformers for Object Detection
Transformer-based modules have been used for object detection in the following manner: (a) Transformer backbones for feature extraction, with an R-CNN based head for detection (see Sec. 3.2.2), (b) a CNN backbone for visual features and a Transformer based decoder for object detection [13], [14], [122], [123] (see Sec. 3.3.1), and (c) a purely Transformer based design for end-to-end object detection [124] (see Sec. 3.3.2).
Detection Transformer (DETR) [13] treats object detection as a set prediction task, i.e., given a set of image features, the objective is to predict the set of object bounding boxes.
DETR uses a set-based loss that enforces bipartite matching between the predictions and the ground-truth boxes.
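A hedged sketch of the matching step using the Hungarian algorithm from SciPy; the cost below is an L1 box distance only, whereas DETR's actual matching cost also combines class probabilities and a generalized IoU term.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, gt_boxes):
    """pred_boxes: (N, 4), gt_boxes: (M, 4) with N >= M; returns matched index pairs."""
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M) L1 cost
    pred_idx, gt_idx = linear_sum_assignment(cost)        # one-to-one assignment minimizing cost
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))  # unmatched predictions -> "no object"
```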
The main advantage of DETR is that it removes the dependence on hand-crafted modules and operations, such as the RPN (region proposal network) and NMS (non-maximal suppression) commonly used in object detection [125]-[129].
The DETR [13] model successfully combines convolutional networks with Transformers [1] to remove hand-crafted design requirements and achieves an end-to-end trainable object detection pipeline.
3.3.2 Detection with Pure Transformers
You Only Look at One Sequence (YOLOS) [124] is a simple, attention-only architecture directly built upon the ViT.
It replaces the class-token in ViT with multiple learnable object query tokens, and the bipartite matching loss is used for object detection, similar to [13].
We note that it is feasible to combine other recent ViTs with Transformer based detection heads as well to create pure ViT based designs [124], and we hope to see more such efforts in the future.
Self-attention can be leveraged for dense prediction tasks like image segmentation, which require modeling rich interactions between pixels.
To tackle these issues, Wang et al. [133] propose the position-sensitive axial-attention where the 2D self-attention mechanism is reformulated as two 1D axial-attention layers, applied to the height-axis and width-axis sequentially (see Fig. 8).
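A simplified NumPy sketch of that factorization (single head, no positional terms): attention is first computed within each column along the height axis, then within each row along the width axis. The helper names and shapes are illustrative; the position-sensitive variant of [133] additionally injects learned positional information, which is omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_1d(seq, Wq, Wk, Wv):
    """seq: (L, C) -> (L, C): attention within a single row or column."""
    Q, K, V = seq @ Wq, seq @ Wk, seq @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def axial_attention(X, Wh, Ww):
    """X: (H, W, C); Wh and Ww are (Wq, Wk, Wv) triples for each axis."""
    X = np.stack([attend_1d(col, *Wh) for col in X.transpose(1, 0, 2)]).transpose(1, 0, 2)
    X = np.stack([attend_1d(row, *Ww) for row in X])   # width-axis attention per row
    return X

rng = np.random.default_rng(0)
C = 32
X = rng.normal(size=(16, 16, C))
Wh = tuple(rng.normal(size=(C, C)) for _ in range(3))
Ww = tuple(rng.normal(size=(C, C)) for _ in range(3))
out = axial_attention(X, Wh, Ww)                       # (16, 16, 32)
```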
Segmentation Transformer (SETR) [134] has a ViT encoder, and two decoder designs based upon progressive upsampling and multi-level feature aggregation. SegFormer [101] has a hierarchical pyramid ViT [93] (without position encoding) as an encoder, and a simple MLP based decoder with an upsampling operation to get the segmentation mask.
Their approach models the joint distribution of the image pixels by factorizing it as a product of pixel-wise conditional distributions.
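In symbols, this autoregressive factorization over pixels x_1, ..., x_n taken in a fixed (e.g., raster) order is:

```latex
p(x) = \prod_{i=1}^{n} p\left(x_i \mid x_1, \ldots, x_{i-1}\right)
```

so each pixel is predicted conditioned only on the previously generated pixels.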
Inspired by the success of the GPT model [5] in the language domain, image GPT (iGPT) [143] demonstrated that such models can be directly used for image generation tasks, and to learn strong features for downstream vision tasks (e.g., image classification).
Ramesh et al. [20] recently proposed DALL·E, a Transformer model capable of generating high-fidelity images from a given text description. The DALL·E model has 12 billion parameters and is trained on a large set of text-image pairs taken from the internet. Before training, images are first resized to 256×256 resolution, and subsequently compressed to a 32×32 grid of latent codes using a pre-trained discrete variational autoencoder [162], [163].
Numerous Transformer-based methods have been proposed for low-level vision tasks, including image super-resolution [16], [19], [164], denoising [19], [165], deraining [19], [165], and colorization [24].
Image restoration requires pixel-to-pixel correspondence from the input to the output images. One major goal of restoration algorithms is to preserve desired fine image details (such as edges and texture) in the restored images.
In contrast, algorithms for low-level vision tasks such as image denoising, super-resolution, and deraining are directly trained on task-specific data, and thereby suffer from these limitations: (i) the small number of images available in task-specific datasets (e.g., the commonly used DIV2K dataset for image super-resolution contains only 2000 images), and (ii) the model trained for one image processing task does not adapt well to other related tasks.
It is capable of performing various image restoration tasks such as super-resolution, denoising, and deraining. The overall architecture of IPT consists of multi-heads and multi-tails to deal with different tasks separately, and a shared encoder-decoder Transformer body.
During training, each task-specific head takes as input a degraded image and generates visual features. These feature maps are divided into small crops and subsequently flattened before feeding them to the Transformer encoder (whose architecture is the same as [1]).
The outputs of the encoder, along with the task-specific embeddings, are given as input to the Transformer decoder. The features from the decoder output are reshaped and passed to the multi-tail that yields restored images.
While the SR methods [167], [170]-[173] that are based on pixel-wise loss functions (e.g., L1, MSE, etc.) yield impressive results in terms of image fidelity metrics such as PSNR and SSIM, they struggle to recover fine texture details and often produce images that are overly smooth and perceptually less pleasant.
The above-mentioned SR approaches follow two distinct (but conflicting) research directions: one maximizing the reconstruction accuracy and the other maximizing the perceptual quality, but never both.
Given a grayscale image, colorization seeks to produce the corresponding colorized sample. It is a one-to-many task, as for a given grayscale input there exist many possibilities in the colorized output space.
Colorization Transformer [24] is a probabilistic model based on a conditional attention mechanism [179]. It divides the image colorization task into three sub-problems and proposes to solve each task sequentially by a different Transformer network.
These layers capture the interaction between each pixel of an input image while being computationally less costly.
3.7 Transformers for Multi-Modal Tasks
Several works in this direction target effective vision-language pre-training (VLP) on large-scale multi-modal datasets to learn generic representations that effectively encode cross-modality relationships (e.g., grounding semantic attributes of a person in a given image).
Several of these models still use CNNs as the vision backbone to extract visual features, while Transformers are mainly used to encode text, followed by the fusion of language and visual features.
The single-stream designs feed the multi-modal inputs to a single Transformer, while the multi-stream designs first use independent Transformers for each modality and later learn cross-modal representations using another Transformer (see Fig. 12).
3.7.1 Multi-Stream Transformers
ViLBERT developed a two-stream architecture where each stream is dedicated to model the vision or language inputs (Fig. 12-h).
The pre-training phase operates in a self-supervised manner, i.e., pretext tasks are created without manual labeling on the large-scale unlabelled dataset.
For the first time in the literature, they propose to learn an end-to-end multi-modal bidirectional Transformer model called PEMT on audio-visual data from unlabeled videos.
Short-term (e.g., 1-3 seconds) video dynamics are encoded using CNNs, followed by a modality-specific Transformer (audio/visual) to model long-term dependencies (e.g., 30 seconds). A multi-modal Transformer is then applied to the modality-specific Transformer outputs to exchange information across visual-linguistic domains.
CLIP [195] is a contrastive approach to learn image representations from text, with a learning objective which maximizes the similarity of correct text-image pair embeddings within a large batch.
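A rough sketch of such a contrastive objective as a symmetric cross-entropy over the image-text similarity matrix; the temperature value, the helper name, and the assumption that the two encoders have already produced the embeddings are all illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, d) embeddings of B matched image-text pairs."""
    img_emb = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature                     # (B, B); diagonal = correct pairs
    idx = np.arange(len(logits))
    logp_img = logits - logsumexp(logits, axis=1, keepdims=True)   # image -> text direction
    logp_txt = logits - logsumexp(logits, axis=0, keepdims=True)   # text -> image direction
    return -(logp_img[idx, idx].mean() + logp_txt[idx, idx].mean()) / 2
```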
Different from two-stream networks like ViLBERT [181] and LXMERT [21], VisualBERT [63] uses a single stack of Transformers to model both the domains (images and text).
The input sequence of text (e.g., a caption) and the visual features corresponding to the object proposals are fed to the Transformer that automatically discovers relations between the two domains.
To address this problem, Object-Semantics Aligned Pre-Training (Oscar) [44] first uses an object detector to obtain object tags (labels), which are then subsequently used as a mechanism to align relevant visual features with the semantic information (Fig. 12-b).
The visual and text features are then separately linearly projected to a shared space, concatenated and fed to a Transformer model (with an architecture similar to DETR) to predict the bounding boxes for objects corresponding to the queries in the grounding text.
Visual Grounding with Transformer [206] has an encoder-decoder architecture, where visual tokens (features extracted from a pretrained CNN model) and text tokens (parsed through an RNN module) are processed in parallel with two distinct branches in the encoder, with cross-modality attention to generate text-guided visual features. The decoder then computes attention between the text queries and visual features and predicts query-specific bounding boxes.
3.8 Video Understanding
3.8.1 Joint Video and Language Modeling
The VideoBERT [17] model leverages Transformer networks and the strength of self-supervised learning to learn effective multi-modal representations.
VideoBERT uses the prediction of masked visual and linguistic tokens as a pretext task (Fig. 12-c). This allows modeling high-level semantics and long-range temporal dependencies, important for video understanding tasks.
The video+text model uses a visual-linguistic alignment task to learn cross-modality relationships. The definition of this pretext task is simple: given the latent state of the [cls] token, the task is to predict whether the sentence is temporally aligned with the sequence of visual tokens.
Neimark et al. [211] propose the Video Transformer Network (VTN), which first obtains frame-wise features using a 2D CNN and applies a Transformer encoder (Longformer [103]) on top to learn temporal relationships.
The classification token is passed through a fully connected layer to recognize actions or events. The advantage of using a Transformer encoder on top of spatial features is two-fold: (a) it allows processing a complete video in a single pass, and (b) it considerably improves training and inference efficiency by avoiding the expensive 3D convolutions.
Multiscale Vision Transformers (MViT) [219] build a feature hierarchy by progressively expanding the channel capacity and reducing the spatio-temporal resolution in videos. They introduce multi-head pooling attention to gradually change the visual resolution in their pyramid structure.
First, the spatio-temporal tokens are extracted, and then efficient factorised versions of self-attention are applied to encode relationships between tokens. However, they require initialization with image-pretrained models to effectively learn the ViT models.
Transformer models have been used to learn set-to-set mappings on this support set [26] or to learn the spatial relationships between a given input query and support set samples [25].
Amortized clustering is a challenging problem that seeks to learn a parametric function that can map an input set of points to their corresponding cluster centers.
3.11 Transformers for 3D Analysis
Transformers provide a promising mechanism to encode rich relationships between 3D data points.
A strength of Transformer models is their flexibility to scale to high parametric complexity. While this is a remarkable property that allows training enormous-sized models, it results in high training and inference costs.
Additionally, these large-scale models require aggressive compression (e.g., distillation) to make them feasible for real-world settings.
Numerous methods have been proposed that make special design choices to perform self-attention more 'efficiently', for instance employing pooling/downsampling in self-attention [97], [219], [249], local window-based attention [36], [250], axial-attention [179], [251], low-rank projection attention [38], [252], [253], kernelizable attention [254], [255], and similarity-clustering based methods [246], [256].
Therefore, there is a pressing need to develop an efficient self-attention mechanism that can be applied to HR images on resource-limited systems without compromising accuracy.
4.2 Requirement of Large Amounts of Data
Since Transformer architectures do not inherently encode inductive biases (prior knowledge) to deal with visual data, they typically require a large amount of training data to figure out the underlying modality-specific rules.
DeiT [12] uses a distillation approach to achieve data efficiency, while T2T (Tokens-to-Token) ViT [35] models local structure by combining spatially close tokens together, thus leading to competitive performance when trained only on ImageNet from scratch (without pre-training).
By smoothing the local loss surface using the sharpness-aware minimizer (SAM) [258], ViTs can be trained with a simple data augmentation scheme (random crop and horizontal flip) [259], instead of employing compute-intensive strong data augmentation strategies, and can outperform their counterpart ResNet models.
Although the initial results from these simple applications are quite encouraging and motivate us to look further into the strengths of self-attention and self-supervised learning, current architectures may still remain better tailored for language problems (with a sequence structure) and need further intuitions to make them more efficient for visual inputs.
While one may argue that architectures like Transformer models should remain generic to be directly applicable across domains, we notice that the high computational and time cost for pre-training such models demands novel design strategies to make their training more affordable on vision problems.
While Neural Architecture Search (NAS) has been well explored for CNNs to find an optimized architecture, it is relatively less explored for Transformers (even for language Transformers [261], [262]).
It will be insightful to further explore the domain-specific design choices (e.g., the contrasting requirements between language and vision domains) using NAS to design more efficient and light-weight models similar to CNNs [87].
Compared with CNNs, ViTs demonstrate strong robustness against texture changes and severe occlusions.
The main challenge is that the attention originating in each layer gets inter-mixed in the subsequent layers in a complex manner, making it difficult to visualize the relative contribution of input tokens towards final predictions.
Further progress in this direction can help in better understanding Transformer models and diagnosing any erroneous behaviors and biases in the decision process. It can also help us design novel architectures that avoid such biases.
Some recent efforts have been reported to compress and accelerate NLP models on embedded systems such as FPGAs [270].
However, such hardware-efficient designs are currently lacking for vision Transformers to enable their seamless deployment in resource-constrained devices.
Inspired by the biological systems that can process information from a diverse range of modalities, the Perceiver model [274] aims to learn a unified model that can process any given input modality without making domain-specific architectural assumptions.
An interesting and open future direction is to achieve total modality-agnosticism in the learning pipeline.
5. Conclusion
Attention has played a key role in delivering efficient and accurate computer vision systems, while simultaneously providing insights into the function of deep neural networks.
Specifically, we include state-of-the-art self-attention models for image recognition, object detection, semantic and instance segmentation, video analysis and classification, visual question answering, visual commonsense reasoning, image captioning, vision-language navigation, clustering, few-shot learning, and 3D data analysis.
We systematically highlight the key strengths and limitations of the existing methods and particularly elaborate on the important future research directions.