
Sequence-to-Sequence Learning for Visual Object Tracking


Notes taken after reading the paper, walking through the main ideas alongside the code.

Paper: https://arxiv.org/abs/2304.14394

The paper casts visual tracking as a sequence generation problem, predicting the target bounding box in an autoregressive fashion. It discards complicated, hand-designed head networks and adopts an encoder-decoder transformer architecture: the encoder is a ViT that extracts visual features (≈ OSTrack), while the decoder is a causal transformer that autoregressively generates a sequence of bounding-box tokens.

Image Representation

The encoder takes the template and the search image as input. In existing trackers the template image usually has a smaller resolution than the search image; SeqTrack uses the same size for both and finds that including more background in the template helps tracking performance (other works justify small templates as a way to reduce background interference, so the two explanations are somewhat inconsistent). The relevant training config:

DATA:
  MAX_SAMPLE_INTERVAL: 400
  MEAN:
    - 0.485
    - 0.456
    - 0.406
  SEARCH:
    CENTER_JITTER: 3.5
    FACTOR: 4.0
    SCALE_JITTER: 0.5
    SIZE: 384
    NUMBER: 1
  STD:
    - 0.229
    - 0.224
    - 0.225
  TEMPLATE:
    CENTER_JITTER: 0
    FACTOR: 4.0
    SCALE_JITTER: 0
    SIZE: 384
    NUMBER: 2
  TRAIN:
    DATASETS_NAME:
      - GOT10K_train_full
    DATASETS_RATIO:
      - 1
    SAMPLE_PER_EPOCH: 30000
Sequence Representation

The bounding box is converted into a discrete sequence [x, y, w, h], with each continuous coordinate uniformly discretized into an integer in [1, n_bins]. A shared vocabulary V (size 4000) is used; each word in V corresponds to a learnable embedding that is optimized during training, as in the following code:

import torch
import torch.nn as nn

class DecoderEmbeddings(nn.Module):
    def __init__(self, vocab_size, hidden_dim, max_position_embeddings, dropout):
        super().__init__()
        # one learnable embedding per word in the vocabulary
        self.word_embeddings = nn.Embedding(vocab_size, hidden_dim)
        # position embeddings are defined here but not added in forward()
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_dim)
        self.LayerNorm = torch.nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # look up the embedding for each token id, then normalize and apply dropout
        input_embeds = self.word_embeddings(x)
        embeddings = self.LayerNorm(input_embeds)
        embeddings = self.dropout(embeddings)
        return embeddings

Finally, a multilayer perceptron with a softmax maps the output embeddings back to words by sampling words from V according to the predicted distribution.
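
As a rough illustration of the coordinate discretization described above (my own sketch, not the repository code; it assumes box coordinates already normalized to [0, 1] and maps them to integer words in [1, n_bins]):

import torch

def box_to_tokens(box_xywh, n_bins=4000):
    # box_xywh: tensor of 4 normalized coordinates [x, y, w, h] in [0, 1]
    # Uniformly discretize each coordinate into an integer word in [1, n_bins].
    return (box_xywh.clamp(0, 1) * (n_bins - 1)).round().long() + 1

def tokens_to_box(tokens, n_bins=4000):
    # Inverse mapping: integer words back to continuous coordinates in [0, 1].
    return (tokens - 1).float() / (n_bins - 1)

box = torch.tensor([0.48, 0.52, 0.20, 0.35])  # [x, y, w, h]
tokens = box_to_tokens(box)                   # four integer words
recovered = tokens_to_box(tokens)             # approximately the original box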

Model Architecture

Its basic architecture looks like this:

SeqTrack architecture. (a) The encoder on the left is a ViT; the decoder on the right is a standard transformer decoder. The encoder extracts visual features, and the decoder uses them to autoregressively generate the bounding-box sequence. (b) Decoder structure: the target sequence enters at the bottom, goes through self-attention first, then cross-attention with the visual features, and the target sequence is generated autoregressively at the output.

A causal attention mask is applied during decoding (much the same as in NLP, preventing each position from peeking at later tokens).
Two special tokens are used: start and end. The start token tells the model to begin generating, while the end token signals that generation is complete.

During training, the decoder's input sequence is [start, x, y, w, h] and the target sequence is [x, y, w, h, end] (standard teacher forcing, as in NLP).
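
For concreteness, a minimal sketch of how the two sequences could be assembled (the start/end token ids here are assumptions; the repository reserves two extra vocabulary slots for them, but the exact indices may differ):

import torch

N_BINS = 4000
START_TOKEN = 0           # assumed id of the start token
END_TOKEN = N_BINS + 1    # assumed id of the end token
# coordinate words occupy [1, N_BINS], so the vocabulary size is N_BINS + 2

def build_sequences(box_tokens):
    # box_tokens: LongTensor of shape (4,) holding the [x, y, w, h] words
    input_seq = torch.cat([torch.tensor([START_TOKEN]), box_tokens])   # [start, x, y, w, h]
    target_seq = torch.cat([box_tokens, torch.tensor([END_TOKEN])])    # [x, y, w, h, end]
    return input_seq, target_seq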

Encoder
  • The cls token used for classification is removed.
  • A linear projection is appended after the last layer to align the feature dimensions of the encoder and decoder:
self.bottleneck = nn.Linear(encoder.num_channels, hidden_dim)
  • Only the features of the search image are fed into the decoder. The encoder forward pass:
def forward_features(self, images_list):
    num_template = self.num_template
    template_list = images_list[0:num_template]
    search_list = images_list[num_template:]
    num_search = len(search_list)

    # Patch-embed each template image and add its position embedding.
    z_list = []
    for i in range(num_template):
        z = template_list[i]
        z = self.patch_embed(z)
        z = z + self.pos_embed[:, self.num_patches_search:, :]
        z_list.append(z)
    z_feat = torch.cat(z_list, dim=1)

    # Patch-embed each search image and add its position embedding.
    x_list = []
    for i in range(num_search):
        x = search_list[i]
        x = self.patch_embed(x)
        x = x + self.pos_embed[:, :self.num_patches_search, :]
        x_list.append(x)
    x_feat = torch.cat(x_list, dim=1)

    # Concatenate search and template tokens and run them through the ViT blocks.
    xz_feat = torch.cat([x_feat, z_feat], dim=1)
    xz = self.pos_drop(xz_feat)
    for blk in self.blocks:  # batch is the first dimension
        if self.use_checkpoint:
            xz = checkpoint.checkpoint(blk, xz)
        else:
            xz = blk(xz)
    xz = self.norm(xz)  # B, N, C
    return xz
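
Note that forward_features returns the concatenated search + template tokens; selecting only the search tokens before feeding the decoder could look like this (a sketch, assuming the token order produced above, with the search tokens first):

xz = self.forward_features(images_list)                    # (B, N_search + N_template, C)
dec_feat = xz[:, :self.num_patches_search * num_search, :]  # keep only the search-image tokens
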
Decoder
  • Receives the word embeddings from the previous block and applies a causal mask so that the output at each sequence position depends only on the elements before it.

The causal mask is generated as follows:

def generate_square_subsequent_mask(sz):
    r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
    Unmasked positions are filled with float(0.0).
    """
    # each token can only see the tokens before it
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
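
For example, for the five-token input sequence [start, x, y, w, h] the mask looks like this (row i is the query for position i and may only attend to columns ≤ i):

mask = generate_square_subsequent_mask(5)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         [0.,   0.,   0., -inf, -inf],
#         [0.,   0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.,   0.]])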

Training and Inference
Training

Loss function: a cross-entropy loss that maximizes the log-likelihood of the target tokens conditioned on the preceding subsequence and the input video frames.
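
A minimal sketch of this loss (my own illustration with stand-in shapes): the decoder produces a distribution over the vocabulary at every step, and cross-entropy against the target words maximizes their log-likelihood.

import torch
import torch.nn.functional as F

B, SEQ_LEN, VOCAB = 8, 5, 4002                  # batch, [x, y, w, h, end], n_bins + 2 words
logits = torch.randn(B, SEQ_LEN, VOCAB)         # stand-in for decoder outputs after the vocab MLP
target = torch.randint(0, VOCAB, (B, SEQ_LEN))  # stand-in for the ground-truth target words

# Minimizing cross-entropy == maximizing the log-likelihood of the target tokens.
loss = F.cross_entropy(logits.flatten(0, 1), target.flatten())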

Inference

Online template update and a window penalty are introduced to inject prior knowledge during inference, further improving accuracy and robustness. The likelihood of the generated tokens is used to automatically select reliable dynamic templates. A new window-penalty strategy is also introduced: the discretized coordinate of the current search region's center is [n_bins / 2, n_bins / 2], i.e. the target center position from the previous frame. When generating x and y, the likelihood of each integer (word) in V is penalized according to its distance from n_bins / 2; the larger the distance, the larger the penalty.
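
The idea can be sketched as follows (my own illustration, not the repository code; a Hanning window centered at n_bins / 2 plays the role of the penalty here, and the exact penalty function used in SeqTrack may differ):

import torch

def window_penalty(coord_probs, n_bins=4000):
    # coord_probs: (n_bins,) likelihoods of the n_bins coordinate words for x or y.
    # A window centered at n_bins / 2 (the previous target center) suppresses words
    # far from the center; the farther away, the stronger the penalty.
    window = torch.hann_window(n_bins, periodic=False)
    return coord_probs * window

probs = torch.softmax(torch.randn(4000), dim=0)  # stand-in for the decoder's likelihoods
word = window_penalty(probs).argmax()            # pick the most likely word after the penalty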

Experiments

Create the seqtrack virtual environment and activate it:

conda create -n seqtrack python=3.8
conda activate seqtrack

The required packages are listed below (install.sh):

  1. echo "****************** Installing pytorch ******************"
  2. pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
  3. #conda install -y pytorch=1.11 torchvision torchaudio cudatoolkit=11.3 -c pytorch
  4. echo ""
  5. echo ""
  6. echo "****************** Installing yaml ******************"
  7. pip install PyYAML
  8. echo ""
  9. echo ""
  10. echo "****************** Installing easydict ******************"
  11. pip install easydict
  12. echo ""
  13. echo ""
  14. echo "****************** Installing cython ******************"
  15. pip install cython
  16. echo ""
  17. echo ""
  18. echo "****************** Installing opencv-python ******************"
  19. pip install opencv-python
  20. echo ""
  21. echo ""
  22. echo "****************** Installing pandas ******************"
  23. pip install pandas
  24. echo ""
  25. echo ""
  26. echo "****************** Installing tqdm ******************"
  27. conda install -y tqdm
  28. echo ""
  29. echo ""
  30. echo "****************** Installing coco toolkit ******************"
  31. pip install pycocotools
  32. echo ""
  33. echo ""
  34. echo "****************** Installing jpeg4py python wrapper ******************"
  35. pip install jpeg4py
  36. echo ""
  37. echo ""
  38. echo "****************** Installing tensorboard ******************"
  39. pip install tb-nightly
  40. echo ""
  41. echo ""
  42. echo "****************** Installing tikzplotlib ******************"
  43. pip install tikzplotlib
  44. echo ""
  45. echo ""
  46. echo "****************** Installing thop tool for FLOPs and Params computing ******************"
  47. pip install --upgrade git+https://github.com/Lyken17/pytorch-OpCounter.git
  48. echo ""
  49. echo ""
  50. echo "****************** Installing colorama ******************"
  51. pip install colorama
  52. echo ""
  53. echo ""
  54. echo "****************** Installing lmdb ******************"
  55. pip install lmdb
  56. echo ""
  57. echo ""
  58. echo "****************** Installing scipy ******************"
  59. pip install scipy
  60. echo ""
  61. echo ""
  62. echo "****************** Installing visdom ******************"
  63. pip install visdom
  64. echo ""
  65. echo ""
  66. echo "****************** Installing vot-toolkit python ******************"
  67. pip install git+https://github.com/votchallenge/vot-toolkit-python
  68. echo ""
  69. echo ""
  70. echo "****************** Installing timm ******************"
  71. pip install timm==0.5.4
  72. echo ""
  73. echo ""
  74. echo "****************** Installing yacs ******************"
  75. pip install yacs
  76. echo ""
  77. echo ""
  78. echo "****************** Installation complete! ******************"

Run the following command to install them:

bash install.sh

Add the project path to the environment variable:

export PYTHONPATH=<absolute_path_of_SeqTrack>:$PYTHONPATH

The tracking data should be organized as follows:

${SeqTrack_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- images
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST

Run the following command to set the paths for this project:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir .

Train SeqTrack:

python -m torch.distributed.launch --nproc_per_node 8 lib/train/run_training.py --script seqtrack --config seqtrack_b256 --save_dir .

Testing and evaluation on the benchmarks has not been fully completed yet; this part will be followed up later.

Code

SeqTrack (taking SeqTrack-L256 as an example):

class SEQTRACK(nn.Module):
    """ This is the base class for SeqTrack """
    def __init__(self, encoder, decoder, hidden_dim,
                 bins=1000, feature_type='x', num_frames=1, num_template=1):
        """ Initializes the model.
        Parameters:
            encoder: torch module of the encoder to be used. See encoder.py
            decoder: torch module of the decoder architecture. See decoder.py
        """
        super().__init__()
        self.encoder = encoder
        self.num_patch_x = self.encoder.body.num_patches_search
        self.num_patch_z = self.encoder.body.num_patches_template
        self.side_fx = int(math.sqrt(self.num_patch_x))
        self.side_fz = int(math.sqrt(self.num_patch_z))
        self.hidden_dim = hidden_dim
        self.bottleneck = nn.Linear(encoder.num_channels, hidden_dim)  # the bottleneck layer, which aligns the dimension of encoder and decoder
        self.decoder = decoder
        self.vocab_embed = MLP(hidden_dim, hidden_dim, bins + 2, 3)
        self.num_frames = num_frames
        self.num_template = num_template
        self.feature_type = feature_type

        # Different types of visual features for the decoder.
        # Since we only use one search image for now, 'x' is the same as 'x_last' here.
        if self.feature_type == 'x':
            num_patches = self.num_patch_x * self.num_frames
        elif self.feature_type == 'xz':
            num_patches = self.num_patch_x * self.num_frames + self.num_patch_z * self.num_template
        elif self.feature_type == 'token':
            num_patches = 1
        else:
            raise ValueError('illegal feature type')

        # position embedding for the decoder
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_dim))
        pos_embed = get_sinusoid_encoding_table(num_patches, self.pos_embed.shape[-1], cls_token=False)
        self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
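
With these pieces (the decoder plus the vocab_embed MLP), inference can be pictured as a greedy autoregressive loop. A hypothetical sketch, not the repository's actual inference code; decoder_step stands in for one decoder call on the encoder features followed by the vocab MLP:

import torch

@torch.no_grad()
def greedy_decode(decoder_step, start_token, num_words=4):
    # decoder_step(seq) is assumed to return (1, vocab) logits for the next word,
    # given the tokens generated so far (and, implicitly, the encoder features).
    seq = torch.tensor([[start_token]])        # begin with the start token
    for _ in range(num_words):                 # generate x, y, w, h one by one
        logits = decoder_step(seq)
        next_word = logits.argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_word], dim=1)
    return seq[:, 1:]                          # drop the start token -> [x, y, w, h]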

Encoder (ViT):

@register_model
def vit_large_patch16(pretrained=False, pretrain_type='default',
                      search_size=384, template_size=192, **kwargs):
    patch_size = 16
    model = VisionTransformer(
        search_size=search_size, template_size=template_size,
        patch_size=patch_size, num_classes=0,
        embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    cfg_type = 'vit_large_patch16_224_' + pretrain_type
    if pretrain_type == 'scratch':
        pretrained = False
        return model
    model.default_cfg = default_cfgs[cfg_type]
    if pretrained:
        load_pretrained(model, pretrain_type, num_classes=model.num_classes, in_chans=kwargs.get('in_chans', 3))
    return model
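
A usage sketch (my own example; it assumes a from-scratch build, which skips loading pretrained weights, and matches the 384 x 384 template/search sizes from the config above):

encoder_body = vit_large_patch16(pretrained=False, pretrain_type='scratch',
                                 search_size=384, template_size=384)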

Decoder (based on DETR's transformer decoder):

class SeqTrackDecoder(nn.Module):
    def __init__(self, d_model=512, nhead=8,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False, bins=1000, num_frames=9):
        super().__init__()
        self.bins = bins
        self.num_frames = num_frames
        self.num_coordinates = 4  # [x, y, w, h]
        max_position_embeddings = (self.num_coordinates + 1) * num_frames
        self.embedding = DecoderEmbeddings(bins + 2, d_model, max_position_embeddings, dropout)
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.body = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                       return_intermediate=return_intermediate_dec)
        self._reset_parameters()
        self.d_model = d_model
        self.nhead = nhead