Original paper: https://arxiv.org/abs/2304.14394
This paper casts visual tracking as a sequence generation problem and predicts the target bounding box in an autoregressive fashion. It discards elaborately designed head networks in favor of an encoder-decoder transformer architecture: the encoder extracts visual features with a ViT (similar to OSTrack), while the decoder, a causal transformer, autoregressively generates a sequence of bounding-box values.
The encoder takes the template and search images as input. In existing trackers the template image is usually at a lower resolution than the search image; SeqTrack uses the same size for both and finds that including more background in the template helps tracking performance (other works justify small templates as a way to reduce background distraction, so the two explanations are somewhat at odds). The data configuration reflects this, with SIZE: 384 for both TEMPLATE and SEARCH:
```yaml
DATA:
  MAX_SAMPLE_INTERVAL: 400
  MEAN:
    - 0.485
    - 0.456
    - 0.406
  SEARCH:
    CENTER_JITTER: 3.5
    FACTOR: 4.0
    SCALE_JITTER: 0.5
    SIZE: 384
    NUMBER: 1
  STD:
    - 0.229
    - 0.224
    - 0.225
  TEMPLATE:
    CENTER_JITTER: 0
    FACTOR: 4.0
    SCALE_JITTER: 0
    SIZE: 384
    NUMBER: 2
  TRAIN:
    DATASETS_NAME:
      - GOT10K_train_full
    DATASETS_RATIO:
      - 1
    SAMPLE_PER_EPOCH: 30000
```

The bounding box is converted into a discrete sequence [x, y, w, h]: each continuous coordinate is uniformly discretized into an integer in [1, n_bins]. A shared vocabulary V of size 4000 is used, and each word in V corresponds to a learnable embedding that is optimized during training, as in the code below:
```python
class DecoderEmbeddings(nn.Module):
    def __init__(self, vocab_size, hidden_dim, max_position_embeddings, dropout):
        super().__init__()
        # One learnable embedding per vocabulary word.
        self.word_embeddings = nn.Embedding(vocab_size, hidden_dim)
        # Position embeddings are defined here but consumed by the decoder body.
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_dim)
        self.LayerNorm = torch.nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Look up the embedding of each word, then normalize and apply dropout.
        input_embeds = self.word_embeddings(x)
        embeddings = self.LayerNorm(input_embeds)
        embeddings = self.dropout(embeddings)
        return embeddings
```

Finally, a multi-layer perceptron with a softmax maps output embeddings back to words, sampling words from V according to the output embedding.
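To make the round-trip concrete, here is a minimal sketch of coordinate quantization and de-quantization. The function names are mine, and I assume coordinates have been normalized to [0, 1] relative to the search region; whether the integer range is [0, n_bins - 1] or [1, n_bins] is an implementation detail (the sketch uses the former):

```python
import torch

def box_to_tokens(box_xywh, n_bins=4000):
    # box_xywh: [x, y, w, h], normalized to [0, 1] w.r.t. the search region.
    # Uniformly discretize each coordinate into an integer in [0, n_bins - 1].
    return (box_xywh.clamp(0, 1) * (n_bins - 1)).round().long()

def tokens_to_box(tokens, n_bins=4000):
    # Inverse mapping: integer words back to continuous coordinates.
    return tokens.float() / (n_bins - 1)

box = torch.tensor([0.40, 0.55, 0.20, 0.10])
tokens = box_to_tokens(box)        # tensor([1600, 2199,  800,  400])
recovered = tokens_to_box(tokens)  # quantization error <= 1 / (n_bins - 1)
```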
The overall architecture:

SeqTrack architecture. (a) The encoder on the left is a ViT; the decoder on the right is a standard transformer decoder. The encoder extracts visual features, and the decoder uses them to autoregressively generate the bounding-box sequence. (b) Decoder block: the target sequence enters at the bottom, passes through self-attention, then cross-attention with the visual features, and the output target sequence is generated autoregressively.
During decoding, a causal attention mask is applied (much as in NLP, it keeps each token from peeking at later tokens).
Two special tokens are used: start and end. The start token tells the model to begin generating, while the end token signals that generation is complete.
During training, the decoder's input sequence is [start, x, y, w, h] and the target sequence is [x, y, w, h, end] (the standard teacher-forcing setup from NLP).
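A sketch of this construction; the START and END token ids here are assumptions (in the code the vocabulary has bins + 2 entries, with the two extra slots for the special tokens):

```python
import torch

n_bins = 4000
START, END = n_bins, n_bins + 1  # assumed ids of the two special tokens

box_tokens = torch.tensor([1600, 2199, 800, 400])              # [x, y, w, h]
decoder_input = torch.cat([torch.tensor([START]), box_tokens])  # [start, x, y, w, h]
target        = torch.cat([box_tokens, torch.tensor([END])])    # [x, y, w, h, end]
```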
The encoder's output is later projected to the decoder width via a bottleneck layer, `self.bottleneck = nn.Linear(encoder.num_channels, hidden_dim)`. Feature extraction itself concatenates the patch embeddings of the templates and search images and runs them jointly through the ViT blocks:

```python
def forward_features(self, images_list):
    num_template = self.num_template
    template_list = images_list[0:num_template]
    search_list = images_list[num_template:]
    num_search = len(search_list)

    # Patch-embed each template and add its slice of the position embedding.
    z_list = []
    for i in range(num_template):
        z = template_list[i]
        z = self.patch_embed(z)
        z = z + self.pos_embed[:, self.num_patches_search:, :]
        z_list.append(z)
    z_feat = torch.cat(z_list, dim=1)

    # Same for the search images, using the leading slice.
    x_list = []
    for i in range(num_search):
        x = search_list[i]
        x = self.patch_embed(x)
        x = x + self.pos_embed[:, :self.num_patches_search, :]
        x_list.append(x)
    x_feat = torch.cat(x_list, dim=1)

    # Joint attention over the concatenated search + template tokens.
    xz_feat = torch.cat([x_feat, z_feat], dim=1)
    xz = self.pos_drop(xz_feat)

    for blk in self.blocks:  # batch is the first dimension.
        if self.use_checkpoint:
            xz = checkpoint.checkpoint(blk, xz)
        else:
            xz = blk(xz)

    xz = self.norm(xz)  # B, N, C
    return xz
```

The causal mask is generated as follows:

```python
def generate_square_subsequent_mask(sz):
    r"""Generate a square mask for the sequence. Masked positions are filled
    with float('-inf'); unmasked positions are filled with float(0.0)."""
    # Each token can only attend to tokens at or before its own position.
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')) \
                       .masked_fill(mask == 1, float(0.0))
    return mask
```
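For instance, with sz = 3 each position can attend only to itself and earlier positions:

```python
print(generate_square_subsequent_mask(3))
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])
```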
Loss function: cross-entropy, maximizing the log-likelihood of the target tokens conditioned on the preceding subsequence and the input video frames.
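A minimal sketch of this objective, assuming the vocab_embed MLP produces logits of shape (batch, seq_len, vocab_size); the shapes and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 8, 5, 4002   # 4000 bins + start + end
logits = torch.randn(batch, seq_len, vocab_size)          # decoder output -> MLP
targets = torch.randint(0, vocab_size, (batch, seq_len))  # [x, y, w, h, end]

# Per-step cross-entropy over the vocabulary maximizes the log-likelihood
# of the target tokens under teacher forcing.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```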
Online template update and a window penalty are introduced to fuse prior knowledge during inference, further improving accuracy and robustness. The likelihood of the generated tokens is used to automatically select reliable dynamic templates. The new window-penalty strategy works as follows: the discretized coordinates of the center of the current search region are [n_bins / 2, n_bins / 2], i.e., the target center position of the previous frame. When generating x and y, the likelihood of each integer (word) in V is penalized according to its difference from n_bins / 2; the larger the difference, the heavier the penalty, as sketched below.
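A hedged sketch of the idea. The paper only states "larger difference, larger penalty"; using a Hanning window as the weighting is my choice here, not necessarily the repository's exact formula:

```python
import torch

def apply_window_penalty(probs, n_bins=4000):
    # probs: (n_bins,) softmax scores over the coordinate words for x or y.
    # The previous target center maps to word n_bins / 2, so words far from
    # it get down-weighted. hann_window peaks at the center and falls to
    # zero at both ends, which matches the penalty described above.
    window = torch.hann_window(n_bins, periodic=False)
    return probs * window

probs = torch.softmax(torch.randn(4000), dim=0)
penalized = apply_window_penalty(probs)
pred_word = penalized.argmax()  # biased toward the previous center
```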
Create and activate the seqtrack conda environment:

```bash
conda create -n seqtrack python=3.8
conda activate seqtrack
```
The required packages are as follows (install.sh):

```bash
echo "****************** Installing pytorch ******************"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
#conda install -y pytorch=1.11 torchvision torchaudio cudatoolkit=11.3 -c pytorch

echo ""
echo ""
echo "****************** Installing yaml ******************"
pip install PyYAML

echo ""
echo ""
echo "****************** Installing easydict ******************"
pip install easydict

echo ""
echo ""
echo "****************** Installing cython ******************"
pip install cython

echo ""
echo ""
echo "****************** Installing opencv-python ******************"
pip install opencv-python

echo ""
echo ""
echo "****************** Installing pandas ******************"
pip install pandas

echo ""
echo ""
echo "****************** Installing tqdm ******************"
conda install -y tqdm

echo ""
echo ""
echo "****************** Installing coco toolkit ******************"
pip install pycocotools

echo ""
echo ""
echo "****************** Installing jpeg4py python wrapper ******************"
pip install jpeg4py

echo ""
echo ""
echo "****************** Installing tensorboard ******************"
pip install tb-nightly

echo ""
echo ""
echo "****************** Installing tikzplotlib ******************"
pip install tikzplotlib

echo ""
echo ""
echo "****************** Installing thop tool for FLOPs and Params computing ******************"
pip install --upgrade git+https://github.com/Lyken17/pytorch-OpCounter.git

echo ""
echo ""
echo "****************** Installing colorama ******************"
pip install colorama

echo ""
echo ""
echo "****************** Installing lmdb ******************"
pip install lmdb

echo ""
echo ""
echo "****************** Installing scipy ******************"
pip install scipy

echo ""
echo ""
echo "****************** Installing visdom ******************"
pip install visdom

echo ""
echo ""
echo "****************** Installing vot-toolkit python ******************"
pip install git+https://github.com/votchallenge/vot-toolkit-python

echo ""
echo ""
echo "****************** Installing timm ******************"
pip install timm==0.5.4

echo ""
echo ""
echo "****************** Installing yacs ******************"
pip install yacs

echo ""
echo ""

echo "****************** Installation complete! ******************"
```

Run the following command to install everything:

```bash
bash install.sh
```
Add the project path to the environment variables:

```bash
export PYTHONPATH=<absolute_path_of_SeqTrack>:$PYTHONPATH
```
The tracking data should be organized as follows:

```text
${SeqTrack_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- images
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST
```

Run the following command to set up the paths for this project:

```bash
python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir .
```
Train SeqTrack:

```bash
python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training.py --script seqtrack --config seqtrack_b256 --save_dir .
```
Testing and evaluation on the benchmarks are not yet fully done; this part will be followed up later.
The SEQTRACK model class (taking SeqTrack-L256 as an example):

```python
class SEQTRACK(nn.Module):
    """ This is the base class for SeqTrack """
    def __init__(self, encoder, decoder, hidden_dim,
                 bins=1000, feature_type='x', num_frames=1, num_template=1):
        """ Initializes the model.
        Parameters:
            encoder: torch module of the encoder to be used. See encoder.py
            decoder: torch module of the decoder architecture. See decoder.py
        """
        super().__init__()
        self.encoder = encoder
        self.num_patch_x = self.encoder.body.num_patches_search
        self.num_patch_z = self.encoder.body.num_patches_template
        self.side_fx = int(math.sqrt(self.num_patch_x))
        self.side_fz = int(math.sqrt(self.num_patch_z))
        self.hidden_dim = hidden_dim
        # The bottleneck layer, which aligns the dimensions of encoder and decoder.
        self.bottleneck = nn.Linear(encoder.num_channels, hidden_dim)
        self.decoder = decoder
        # MLP head mapping decoder embeddings to vocabulary logits
        # (bins + 2 covers the coordinate words plus start and end).
        self.vocab_embed = MLP(hidden_dim, hidden_dim, bins + 2, 3)

        self.num_frames = num_frames
        self.num_template = num_template
        self.feature_type = feature_type

        # Different types of visual features for the decoder.
        # Since we only use one search image for now, 'x' is the same as 'x_last'.
        if self.feature_type == 'x':
            num_patches = self.num_patch_x * self.num_frames
        elif self.feature_type == 'xz':
            num_patches = self.num_patch_x * self.num_frames + self.num_patch_z * self.num_template
        elif self.feature_type == 'token':
            num_patches = 1
        else:
            raise ValueError('illegal feature type')

        # Position embedding for the decoder.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))
        pos_embed = get_sinusoid_encoding_table(num_patches, self.pos_embed.shape[-1], cls_token=False)
        self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
```

Encoder (ViT):
```python
@register_model
def vit_large_patch16(pretrained=False, pretrain_type='default',
                      search_size=384, template_size=192, **kwargs):
    patch_size = 16
    model = VisionTransformer(
        search_size=search_size, template_size=template_size,
        patch_size=patch_size, num_classes=0,
        embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    cfg_type = 'vit_large_patch16_224_' + pretrain_type
    if pretrain_type == 'scratch':
        # Train from scratch: return early without loading pretrained weights.
        pretrained = False
        return model
    model.default_cfg = default_cfgs[cfg_type]
    if pretrained:
        load_pretrained(model, pretrain_type, num_classes=model.num_classes,
                        in_chans=kwargs.get('in_chans', 3))
    return model
```

Decoder (adapted from DETR's transformer decoder):
```python
class SeqTrackDecoder(nn.Module):

    def __init__(self, d_model=512, nhead=8,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False, bins=1000, num_frames=9):
        super().__init__()
        self.bins = bins
        self.num_frames = num_frames
        self.num_coordinates = 4  # [x, y, w, h]
        # Five tokens per frame: four coordinates plus one special token.
        max_position_embeddings = (self.num_coordinates + 1) * num_frames
        self.embedding = DecoderEmbeddings(bins + 2, d_model,
                                           max_position_embeddings, dropout)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.body = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                       return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead
```
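Putting the pieces together, here is a minimal sketch of greedy autoregressive inference. The decoder call signature and the helper name are illustrative, not the repository's exact API:

```python
import torch

@torch.no_grad()
def greedy_decode(model, xz_feat, start_id, n_steps=4):
    # xz_feat: encoder features after the bottleneck projection.
    # Generates [x, y, w, h] one word at a time; each step feeds the tokens
    # generated so far back into the decoder (the causal mask makes this
    # consistent with training).
    seq = torch.full((xz_feat.size(0), 1), start_id, dtype=torch.long)
    for _ in range(n_steps):
        out = model.decoder(xz_feat, seq)          # (B, len, hidden_dim)
        logits = model.vocab_embed(out[:, -1, :])  # project last step to vocab
        next_word = logits.argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_word], dim=1)
    return seq[:, 1:]  # drop the start token -> [x, y, w, h]
```

At inference time the window penalty and dynamic-template selection described above would be applied to the per-step probabilities before the argmax.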
