
Autonomous Driving - BEV Detection Part 3: DETR3D


Paper: DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Code: WangYueFt/detr3d (github.com)

1. Introduction

        DETR3D is one of the important works in 3D object detection, and it pioneered extending DETR from 2D detection into 3D space. It differs from methods such as LSS and BEVDet, which first estimate depth and then lift 2D features into 3D.

        DETR3D first initializes a set of learnable box queries (object queries) and uses them to generate 3D reference points. These 3D reference points are projected back to 2D image coordinates using the camera transformation matrices, and the image features at the projected locations are sampled. Cross-attention between the sampled image features and the object queries then iteratively refines the object queries. Finally, two MLP branches output the classification predictions and the regression predictions, respectively. Positive/negative samples are assigned with the same bipartite matching as in DETR: based on the minimum matching cost, the N predictions that best match the ground-truth boxes are selected from the 900 object queries. Since both the sample assignment and the query-based detection paradigm follow DETR, DETR3D can be regarded as an extension of DETR to 3D.

        Note: overall, DETR3D is also a transformer-style detection framework, but it has no encoder; a conventional convolutional backbone performs the feature extraction.

Figure 1

        Overall steps:

        (1) First, a feature extraction network such as ResNet-50 extracts features from the images captured by the cameras at different viewpoints. (You can loosely think of this as playing the role of the encoder in the transformer framework, although it is structurally completely different; it can be seen as an early attempt, before ViT appeared, at combining transformers with vision tasks.)

        (2) Initialize an Object Query Embedding with nn.Embedding, then use a fully connected MLP to regress a 3D reference point $c_i$ (the center of the i-th box).

        (3) Using the camera intrinsics and extrinsics, project the 3D reference points obtained in (2) (points in the world coordinate frame) onto the feature maps of each camera plane; from here on, the procedure follows 2D DETR-style detection (a minimal sketch of steps (3) and (4) is given right after this list).

Figure 2

        (4) Since each view provides multi-scale features, bilinear interpolation is used to sample the feature maps so that the different resolutions are handled uniformly, giving the interpolated sampling results {F1, F2, F3, F4} at the four scales. (This can be understood as a cross-attention interaction between the multi-scale feature maps and the 2D reference points.)

        (5) Merge the sampled features from the different scales and add them to the original Object Query Embedding to refine the queries.

        (6) After several decoder iterations, regress the class and box position from the final Query Embedding.
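        To make steps (3) and (4) concrete, here is a minimal, self-contained sketch of the core operation: denormalizing the 3D reference points with pc_range, projecting them into every camera with the lidar2img matrices, and bilinearly sampling one feature level with F.grid_sample. The tensor contents are random stand-ins; the pc_range values and the assumed 928x1600 image size (consistent with the stride-8 116x200 feature level below) are illustrative assumptions, not values read from the repo config.

```python
import torch
import torch.nn.functional as F

# stand-in data: 1 sample, 6 cameras, 900 queries, one 256-channel feature level
B, N, Q, C, H, W = 1, 6, 900, 256, 116, 200
img_H, img_W = 928, 1600                       # padded input image size (assumed)
feat = torch.randn(B * N, C, H, W)             # per-camera image features
lidar2img = torch.randn(B, N, 4, 4)            # lidar -> image projection matrices
ref = torch.rand(B, Q, 3)                      # reference points, sigmoid-normalized to [0, 1]
pc_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]

# denormalize to metric lidar coordinates, then append 1 to make them homogeneous
for i in range(3):
    ref[..., i] = ref[..., i] * (pc_range[i + 3] - pc_range[i]) + pc_range[i]
ref_h = torch.cat([ref, torch.ones_like(ref[..., :1])], dim=-1)      # (B, Q, 4)

# project into every camera, then divide by depth (perspective division)
pts = torch.einsum('bnij,bqj->bnqi', lidar2img, ref_h)               # (B, N, Q, 4)
uv = pts[..., :2] / pts[..., 2:3].clamp(min=1e-5)                    # pixel coordinates

# grid_sample expects coordinates in [-1, 1]; normalizing by the *image* size lets
# the same grid be reused for every feature scale
uv[..., 0] = uv[..., 0] / img_W * 2 - 1
uv[..., 1] = uv[..., 1] / img_H * 2 - 1
grid = uv.view(B * N, Q, 1, 2)
sampled = F.grid_sample(feat, grid)                                  # bilinear by default
print(sampled.shape)  # torch.Size([6, 256, 900, 1])
```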

2. Pipeline

2.1 img_backbone + grid_mask + img_neck

2.1.1 Principle

        Here img_backbone is ResNet-50 (a very common image backbone), and grid_mask is a common data-augmentation scheme; if you are interested, see my BEVFormer post. A simplified GridMask sketch is also given at the end of this subsection.

        The resulting multi-scale img_feats is a list containing feature maps of different sizes: [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25].
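        Since GridMask is deferred to the BEVFormer post, here is a minimal sketch of the idea: zeroing out a regular grid of square regions on the input. This is an illustrative simplification under my own parameter names (d, ratio); the actual implementation also randomizes offsets and rotation and applies the mask only with some probability.

```python
import torch

def grid_mask(img: torch.Tensor, d: int = 96, ratio: float = 0.5) -> torch.Tensor:
    """Simplified GridMask: zero out a regular grid of squares. img: (B, C, H, W)."""
    _, _, H, W = img.shape
    keep = torch.ones(H, W, device=img.device)
    m = int(d * ratio)                   # side length of each dropped square
    for y in range(0, H, d):             # tile the image with d x d cells
        for x in range(0, W, d):
            keep[y:y + m, x:x + m] = 0   # drop the top-left m x m block of each cell
    return img * keep
```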

2.1.2 Code

        The extract_img_feat function:

def extract_img_feat(self, img, img_metas):
    """Extract features of images."""
    B = img.size(0)
    if img is not None:
        input_shape = img.shape[-2:]
        # update real input shape of each single img
        for img_meta in img_metas:
            img_meta.update(input_shape=input_shape)
        if img.dim() == 5 and img.size(0) == 1:
            img.squeeze_()
        elif img.dim() == 5 and img.size(0) > 1:
            # fold the camera dimension into the batch: (B, N, C, H, W) -> (B*N, C, H, W)
            B, N, C, H, W = img.size()
            img = img.view(B * N, C, H, W)
        if self.use_grid_mask:
            img = self.grid_mask(img)  # data augmentation
        img_feats = self.img_backbone(img)
        if isinstance(img_feats, dict):
            img_feats = list(img_feats.values())
    else:
        return None
    if self.with_img_neck:
        img_feats = self.img_neck(img_feats)  # neck: multi-scale feature maps
    img_feats_reshaped = []
    for img_feat in img_feats:
        # split the camera dimension back out: (B*N, C, H, W) -> (B, N, C, H, W)
        BN, C, H, W = img_feat.size()
        img_feats_reshaped.append(img_feat.view(B, int(BN / B), C, H, W))
    return img_feats_reshaped

2.2 Decoder

2.2.1 query initialization + reference_points initialization

2.2.1.1 Principle

        The query_embeds here are a set of learnable features initialized by nn.Embedding, with shape [900,512]. torch.split divides them into query and query_pos, representing the query itself and its positional encoding.

        A fully connected layer then maps the [900,256] query_pos (see the code below) to a 3D center point reference_point of shape [900,3], followed by a sigmoid that squashes the coordinates into the range 0-1.

        Note: how should the 900 be understood? My interpretation is that it is num_query, which matches the number of reference_points: we predict 900 bounding boxes, and each box is encoded by a 256-d vector (512 before the split).

# self.query_embedding = nn.Embedding(self.num_query, self.embed_dims * 2)
query_embeds = self.query_embedding.weight  # [900,512]
2.2.1.2 Code
def forward(self,
            mlvl_feats,   # [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
            query_embed,  # [900,512]
            reg_branches=None,  # 6 fully connected regression branches
            **kwargs):    # carries img_metas
    """Forward function for `Detr3DTransformer`.
    Args:
        mlvl_feats (list(Tensor)): Input queries from
            different level. Each element has shape
            [bs, embed_dims, h, w].
        query_embed (Tensor): The query embedding for decoder,
            with shape [num_query, c].
        mlvl_pos_embeds (list(Tensor)): The positional encoding
            of feats from different level, has the shape
            [bs, embed_dims, h, w].
        reg_branches (obj:`nn.ModuleList`): Regression heads for
            feature maps from each decoder layer. Only would
            be passed when `with_box_refine` is True. Default to None.
    Returns:
        tuple[Tensor]: results of decoder containing the following tensor.
            - inter_states: Outputs from decoder. If
                return_intermediate_dec is True output has shape
                (num_dec_layers, bs, num_query, embed_dims), else has
                shape (1, bs, num_query, embed_dims).
            - init_reference_out: The initial value of reference
                points, has shape (bs, num_queries, 4).
            - inter_references_out: The internal value of reference
                points in decoder, has shape
                (num_dec_layers, bs, num_query, embed_dims).
            - enc_outputs_class: The classification score of
                proposals generated from encoder's feature maps,
                has shape (batch, h*w, num_classes).
                Only would be returned when `as_two_stage` is True,
                otherwise None.
            - enc_outputs_coord_unact: The regression results
                generated from encoder's feature maps, has shape
                (batch, h*w, 4). Only would be returned when
                `as_two_stage` is True, otherwise None.
    """
    assert query_embed is not None
    bs = mlvl_feats[0].size(0)  # 1
    query_pos, query = torch.split(query_embed, self.embed_dims, dim=1)  # query: [900,256], query_pos: [900,256]
    query_pos = query_pos.unsqueeze(0).expand(bs, -1, -1)  # [1,900,256]
    query = query.unsqueeze(0).expand(bs, -1, -1)  # [1,900,256]
    reference_points = self.reference_points(query_pos)  # [1,900,3], Linear(in_features=256, out_features=3, bias=True)
    reference_points = reference_points.sigmoid()  # squash the xyz coordinates into [0,1]
    init_reference_out = reference_points  # [1,900,3]
    # decoder
    query = query.permute(1, 0, 2)  # [900,1,256]
    query_pos = query_pos.permute(1, 0, 2)  # [900,1,256]
    inter_states, inter_references = self.decoder(
        query=query,  # [900,1,256]
        key=None,  # None
        value=mlvl_feats,  # [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
        query_pos=query_pos,  # [900,1,256]
        reference_points=reference_points,  # [1,900,3]
        reg_branches=reg_branches,  # 6 fully connected regression branches
        **kwargs)
    # inter_states: [6,900,1,256], inter_references: [6,1,900,3]
    inter_references_out = inter_references  # [6,1,900,3]
    # init_reference_out holds the initial reference points; inter_references_out stacks
    # the refined reference points from every decoder layer
    return inter_states, init_reference_out, inter_references_out

2.2.2 feature_sampling: the 3D-to-2D transform module

2.2.2.1 Principle

        The main job of the feature_sampling function is to sample the multi-scale mlvl_feats at the 3D center points reference_points regressed from the queries, producing the feature values at those points.

        (1) The world coordinate frame defined in the paper is the LiDAR frame (the ego/vehicle frame); points must first be transformed into each camera frame, and then into pixel coordinates.

        Concretely: lidar2img is the combined transformation matrix (it contains R and T as well as the intrinsics). First, the xyz coordinates of reference_points are rescaled with the bounds stored in pc_range (they were previously normalized to 0-1 by the sigmoid), then converted to homogeneous coordinates. The points are replicated 6 times, one copy per camera, and multiplied by the per-camera transformation matrices. Finally, a series of masking, filtering, and scaling operations maps them onto the pixel plane.

        (2) With the 2D center points reference_points_cam on the pixel plane (shape [1,6,900,2]), the multi-scale image features extracted by the backbone are sampled. The bilinear interpolation function F.grid_sample is used: the 900 points are sampled on the feature maps of every scale, [6,256,116,200], [6,256,58,100], [6,256,29,50], [6,256,15,25]. Each scale yields a tensor of shape [6,256,900,1] (with different contents per scale), and these are finally stacked along a new last dimension, giving sampled_feats of shape [1,256,900,6,1,4].
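        A quick aside on why the code below computes (reference_points_cam - 0.5) * 2: F.grid_sample expects sampling locations normalized to [-1, 1] over the input's spatial extent (in (x, y) order), so coordinates that were first divided by the image width/height (landing in [0, 1]) must be remapped. A tiny sanity check:

```python
import torch
import torch.nn.functional as F

# a 1x1x2x2 feature map whose values mark their own position
feat = torch.tensor([[[[1., 2.],
                       [3., 4.]]]])

# grid coordinates live in [-1, 1]; with align_corners=True, (-1, -1) is exactly the top-left pixel
grid = torch.tensor([[[[-1., -1.]]]])                    # (N, H_out, W_out, 2), (x, y) order
print(F.grid_sample(feat, grid, align_corners=True))     # tensor([[[[1.]]]])

# a point at normalized image position (0.5, 0.5) must first be mapped with (p - 0.5) * 2
p = torch.tensor([[[[0.5, 0.5]]]])
print(F.grid_sample(feat, (p - 0.5) * 2, align_corners=True))  # center of the map: 2.5
```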

2.2.2.2 Code
# The crucial feature-sampling function: the 3D-to-2D feature transform
def feature_sampling(mlvl_feats, reference_points, pc_range, img_metas):
    lidar2img = []
    for img_meta in img_metas:
        lidar2img.append(img_meta['lidar2img'])
    lidar2img = np.asarray(lidar2img)  # [1,6,4,4] transformation matrices from the lidar frame to each image plane
    lidar2img = reference_points.new_tensor(lidar2img)  # (B, N, 4, 4) [1,6,4,4] numpy -> tensor
    reference_points = reference_points.clone()  # [1,900,3]
    reference_points_3d = reference_points.clone()  # [1,900,3]
    # pc_range layout: [x_min, y_min, z_min, x_max, y_max, z_max]
    reference_points[..., 0:1] = reference_points[..., 0:1] * (pc_range[3] - pc_range[0]) + pc_range[0]  # rescale x
    reference_points[..., 1:2] = reference_points[..., 1:2] * (pc_range[4] - pc_range[1]) + pc_range[1]  # rescale y
    reference_points[..., 2:3] = reference_points[..., 2:3] * (pc_range[5] - pc_range[2]) + pc_range[2]  # rescale z
    # reference_points (B, num_queries, 4): convert to homogeneous coordinates
    reference_points = torch.cat((reference_points, torch.ones_like(reference_points[..., :1])), -1)
    B, num_query = reference_points.size()[:2]  # B: 1, num_query: 900
    num_cam = lidar2img.size(1)  # 6
    reference_points = reference_points.view(B, 1, num_query, 4).repeat(1, num_cam, 1, 1).unsqueeze(-1)  # [1,6,900,4,1] replicate for the 6 cameras
    lidar2img = lidar2img.view(B, num_cam, 1, 4, 4).repeat(1, 1, num_query, 1, 1)  # [1,6,900,4,4] replicate the 6 projection matrices
    reference_points_cam = torch.matmul(lidar2img, reference_points).squeeze(-1)  # [1,6,900,4] apply the projection matrices
    eps = 1e-5  # threshold
    mask = (reference_points_cam[..., 2:3] > eps)  # [1,6,900,1] keep only points in front of the camera
    reference_points_cam = reference_points_cam[..., 0:2] / torch.maximum(
        reference_points_cam[..., 2:3], torch.ones_like(reference_points_cam[..., 2:3]) * eps)  # [1,6,900,2] perspective division onto the 2D plane
    reference_points_cam[..., 0] /= img_metas[0]['img_shape'][0][1]  # normalize x by image width
    reference_points_cam[..., 1] /= img_metas[0]['img_shape'][0][0]  # normalize y by image height
    reference_points_cam = (reference_points_cam - 0.5) * 2  # [0,1] -> [-1,1] for grid_sample
    mask = (mask & (reference_points_cam[..., 0:1] > -1.0)
                 & (reference_points_cam[..., 0:1] < 1.0)
                 & (reference_points_cam[..., 1:2] > -1.0)
                 & (reference_points_cam[..., 1:2] < 1.0))
    mask = mask.view(B, num_cam, 1, num_query, 1, 1).permute(0, 2, 3, 1, 4, 5)  # [1,1,900,6,1,1]
    mask = torch.nan_to_num(mask)
    sampled_feats = []
    for lvl, feat in enumerate(mlvl_feats):  # iterate over the multi-scale neck features, feat: [1,6,256,116,200]
        B, N, C, H, W = feat.size()
        feat = feat.view(B * N, C, H, W)  # [6,256,116,200]
        reference_points_cam_lvl = reference_points_cam.view(B * N, num_query, 1, 2)  # [6,900,1,2]
        sampled_feat = F.grid_sample(feat, reference_points_cam_lvl)  # [6,256,900,1] bilinear sampling
        sampled_feat = sampled_feat.view(B, N, C, num_query, 1).permute(0, 2, 3, 1, 4)  # [1,256,900,6,1]
        sampled_feats.append(sampled_feat)
    sampled_feats = torch.stack(sampled_feats, -1)  # [1,256,900,6,1,4]
    sampled_feats = sampled_feats.view(B, C, num_query, num_cam, 1, len(mlvl_feats))  # [1,256,900,6,1,4]
    return reference_points_3d, sampled_feats, mask

2.2.3 The Detr3DCrossAtten module

2.2.3.1 Principle

        The main role of the Detr3DCrossAtten module is to let the query interact with the value (the image features) via the reference_points.

        First, query = query + query_pos, i.e. the query plus its positional encoding. The result is fed to a fully connected layer to produce attention_weights, reshaped as [1,900,256] -> [1,900,24] -> [1,1,900,6,1,4].

        Next comes the feature_sampling step from the previous subsection, which returns reference_points_3d and output, where output holds the features sampled from the different image scales. attention_weights (after a sigmoid and masking) weight the output, which is then summed over its last three dimensions, permuted, and passed through a final fully connected projection, giving output of shape [900,1,256].

        Finally, self.position_encoder encodes the 3D point information reference_points_3d into shape [900,1,256]. The returned result is self.dropout(output) + inp_residual + pos_feat, i.e. (the dropout-regularized multi-scale sampled image features) + (the original query) + (the encoded 3D point coordinates).
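        The code below (and the head in section 2.2.4) calls inverse_sigmoid before encoding the reference points. It is simply the logit function with clamping for numerical stability; the following sketch matches what the mmdet utility computes:

```python
import torch

def inverse_sigmoid(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Inverse of sigmoid, log(x / (1 - x)), clamped so the log never sees zero."""
    x = x.clamp(min=0, max=1)
    x1 = x.clamp(min=eps)
    x2 = (1 - x).clamp(min=eps)
    return torch.log(x1 / x2)

p = torch.tensor([0.1, 0.5, 0.9])
print(torch.sigmoid(inverse_sigmoid(p)))  # recovers tensor([0.1000, 0.5000, 0.9000])
```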

2.2.3.2 Code
def forward(self,
            query,  # [900,1,256]
            key,  # None
            value,  # list: [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
            residual=None,  # None
            query_pos=None,  # [900,1,256]
            key_padding_mask=None,  # None
            reference_points=None,  # [1,900,3]
            spatial_shapes=None,  # None
            level_start_index=None,  # None
            **kwargs):
    """Forward Function of Detr3DCrossAtten.
    Args:
        query (Tensor): Query of Transformer with shape
            (num_query, bs, embed_dims).
        key (Tensor): The key tensor with shape
            `(num_key, bs, embed_dims)`.
        value (Tensor): The value tensor with shape
            `(num_key, bs, embed_dims)`. (B, N, C, H, W)
        residual (Tensor): The tensor used for addition, with the
            same shape as `x`. Default None. If None, `x` will be used.
        query_pos (Tensor): The positional encoding for `query`.
            Default: None.
        key_pos (Tensor): The positional encoding for `key`. Default
            None.
        reference_points (Tensor): The normalized reference
            points with shape (bs, num_query, 4),
            all elements is range in [0, 1], top-left (0,0),
            bottom-right (1, 1), including padding area.
            or (N, Length_{query}, num_levels, 4), add
            additional two dimensions is (w, h) to
            form reference boxes.
        key_padding_mask (Tensor): ByteTensor for `query`, with
            shape [bs, num_key].
        spatial_shapes (Tensor): Spatial shape of features in
            different level. With shape (num_levels, 2),
            last dimension represent (h, w).
        level_start_index (Tensor): The start index of each level.
            A tensor has shape (num_levels) and can be represented
            as [0, h_0*w_0, h_0*w_0+h_1*w_1, ...].
    Returns:
        Tensor: forwarded results with shape [num_query, bs, embed_dims].
    """
    if key is None:
        key = query
    if value is None:
        value = key
    if residual is None:
        inp_residual = query  # [900,1,256] kept for the residual connection
    if query_pos is not None:
        query = query + query_pos  # [900,1,256] query plus its positional encoding
    # change to (bs, num_query, embed_dims)
    query = query.permute(1, 0, 2)  # [1,900,256]
    bs, num_query, _ = query.size()  # bs: 1, num_query: 900, _: 256
    # [1,900,256] -> [1,900,24] -> [1,1,900,6,1,4]
    attention_weights = self.attention_weights(query).view(
        bs, 1, num_query, self.num_cams, self.num_points, self.num_levels)
    # reference_points_3d: [1,900,3]
    # output: [1,256,900,6,1,4]
    # mask: [1,1,900,6,1,1]
    reference_points_3d, output, mask = feature_sampling(
        value, reference_points, self.pc_range, kwargs['img_metas'])
    output = torch.nan_to_num(output)  # replace NaN/+inf/-inf with finite numbers
    mask = torch.nan_to_num(mask)
    attention_weights = attention_weights.sigmoid() * mask  # [1,1,900,6,1,4]
    output = output * attention_weights  # [1,256,900,6,1,4]
    output = output.sum(-1).sum(-1).sum(-1)  # [1,256,900] sum out the camera(6)/point(1)/level(4) dims
    output = output.permute(2, 0, 1)  # [900,1,256]
    output = self.output_proj(output)  # [900,1,256]
    # (num_query, bs, embed_dims): the 3D point coordinates are also encoded and
    # added to the sampled image features
    pos_feat = self.position_encoder(inverse_sigmoid(reference_points_3d)).permute(1, 0, 2)  # [1,900,3] -> [1,900,256] -> [900,1,256]
    return self.dropout(output) + inp_residual + pos_feat

2.2.4 Other components

2.2.4.1 Principle

        This subsection covers the forward function of Detr3DHead.

        First, the transformer returns the features hs output by the 6 decoder layers, with shape [6,1,900,256]. For each decoder layer's output hs[lvl] (shape [1,900,256]), two fully connected branches map it to [1,900,10]: outputs_class predicts the class scores, and tmp predicts the box parameters. The box centers in tmp are then shifted by reference (the 3D center point) and rescaled with pc_range ([x_min, y_min, z_min, x_max, y_max, z_max], the xyz coordinate bounds). Finally, the predictions from the different decoder layers are stacked for the loss computation.
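        In equation form, the decode performed inside the loop below is, for the x coordinate of the box center (y uses pc_range indices 1/4; z, stored in channel 4 of tmp, uses indices 2/5):

$$\hat{x} = \sigma\big(\Delta x + \sigma^{-1}(x_{\text{ref}})\big)\cdot(x_{\max} - x_{\min}) + x_{\min}$$

where $\Delta x$ is the raw output of the regression branch, $x_{\text{ref}}$ is the reference point carried over from the previous decoder layer, and $\sigma$ is the sigmoid.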

2.2.4.2 Code
def forward(self, mlvl_feats, img_metas):
    """Forward function.
    Args:
        mlvl_feats (tuple[Tensor]): Features from the upstream
            network, each is a 5D-tensor with shape
            (B, N, C, H, W).
    Returns:
        all_cls_scores (Tensor): Outputs from the classification head,
            shape [nb_dec, bs, num_query, cls_out_channels]. Note
            cls_out_channels should includes background.
        all_bbox_preds (Tensor): Sigmoid outputs from the regression
            head with normalized coordinate format
            (cx, cy, w, l, cz, h, theta, vx, vy).
            Shape [nb_dec, bs, num_query, 9].
    """
    query_embeds = self.query_embedding.weight  # [900,512]
    hs, init_reference, inter_references = self.transformer(
        mlvl_feats,  # [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
        query_embeds,  # [900,512]
        reg_branches=self.reg_branches if self.with_box_refine else None,  # 6 regression branches
        img_metas=img_metas)  # list of image meta info
    hs = hs.permute(0, 2, 1, 3)  # hs: [6,900,1,256] -> [6,1,900,256]; init_reference: [1,900,3]; inter_references: [6,1,900,3]
    outputs_classes = []
    outputs_coords = []
    for lvl in range(hs.shape[0]):  # iterate over each decoder layer's output (per-layer auxiliary predictions)
        if lvl == 0:
            reference = init_reference
        else:
            reference = inter_references[lvl - 1]
        reference = inverse_sigmoid(reference)  # back to logit space
        outputs_class = self.cls_branches[lvl](hs[lvl])  # map each decoder output [1,900,256] to [1,900,10]
        tmp = self.reg_branches[lvl](hs[lvl])  # [1,900,10]
        # TODO: check the shape of reference
        assert reference.shape[-1] == 3
        tmp[..., 0:2] += reference[..., 0:2]  # refine x, y relative to the reference point
        tmp[..., 0:2] = tmp[..., 0:2].sigmoid()
        tmp[..., 4:5] += reference[..., 2:3]  # refine z relative to the reference point
        tmp[..., 4:5] = tmp[..., 4:5].sigmoid()
        tmp[..., 0:1] = (tmp[..., 0:1] * (self.pc_range[3] - self.pc_range[0]) + self.pc_range[0])
        tmp[..., 1:2] = (tmp[..., 1:2] * (self.pc_range[4] - self.pc_range[1]) + self.pc_range[1])
        tmp[..., 4:5] = (tmp[..., 4:5] * (self.pc_range[5] - self.pc_range[2]) + self.pc_range[2])
        # TODO: check if using sigmoid
        outputs_coord = tmp  # box prediction
        outputs_classes.append(outputs_class)  # class predictions
        outputs_coords.append(outputs_coord)  # box predictions
    outputs_classes = torch.stack(outputs_classes)
    outputs_coords = torch.stack(outputs_coords)
    outs = {
        'all_cls_scores': outputs_classes,
        'all_bbox_preds': outputs_coords,
        'enc_cls_scores': None,
        'enc_bbox_preds': None,
    }
    return outs

2.3 Loss

2.3.1 Principle

        The loss here is a fairly conventional detection loss, composed of a classification loss and a regression loss. The one thing that deserves attention is the _get_target_single function, whose main job is to build (fill in) the per-query targets; I recommend working through its code carefully. For intuition about the assigner it relies on, see the matching sketch below.
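        Here is a minimal sketch of the min-cost bipartite matching that the assigner performs between the 900 predictions and the ground-truth boxes, using a toy cost (negated class score plus weighted L1 box distance). This is an illustration, not the repo's actual assigner, whose cost terms (a focal-loss-style class cost and a normalized box L1 cost) and weights differ:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cls_prob, bbox_pred, gt_labels, gt_bboxes, box_weight=0.25):
    """Toy min-cost matching; returns matched (query_idx, gt_idx) pairs.
    cls_prob: (num_query, num_classes) scores; bbox_pred/gt_bboxes: (num_query, D)/(num_gt, D).
    """
    cls_cost = -cls_prob[:, gt_labels]                 # (num_query, num_gt): high score -> low cost
    reg_cost = torch.cdist(bbox_pred, gt_bboxes, p=1)  # pairwise L1 distance
    cost = (cls_cost + box_weight * reg_cost).numpy()
    q_idx, g_idx = linear_sum_assignment(cost)         # optimal one-to-one assignment
    return q_idx, g_idx

# toy usage: 5 queries, 2 GT boxes, 3 classes
q, g = hungarian_match(torch.rand(5, 3), torch.rand(5, 9),
                       torch.tensor([0, 2]), torch.rand(2, 9))
print(q, g)  # the 2 matched query indices and their GT indices
```

        Queries left unmatched become negatives (background), which is exactly what _get_target_single encodes through pos_inds and neg_inds.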

2.3.2 Code

        The loss function:

def loss(self,
         gt_bboxes_list,  # boxes of the 18 GT objects
         gt_labels_list,  # labels of the 18 GT objects
         preds_dicts,  # [[6,1,900,10], [6,1,900,10], None, None]
         gt_bboxes_ignore=None):  # None
    """Loss function.
    Args:
        gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image
            with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
        gt_labels_list (list[Tensor]): Ground truth class indices for each
            image with shape (num_gts, ).
        preds_dicts:
            all_cls_scores (Tensor): Classification score of all
                decoder layers, has shape
                [nb_dec, bs, num_query, cls_out_channels].
            all_bbox_preds (Tensor): Sigmoid regression
                outputs of all decode layers. Each is a 4D-tensor with
                normalized coordinate format (cx, cy, w, h) and shape
                [nb_dec, bs, num_query, 4].
            enc_cls_scores (Tensor): Classification scores of
                points on encode feature map, has shape
                (N, h*w, num_classes). Only be passed when as_two_stage is
                True, otherwise is None.
            enc_bbox_preds (Tensor): Regression results of each points
                on the encode feature map, has shape (N, h*w, 4). Only be
                passed when as_two_stage is True, otherwise is None.
        gt_bboxes_ignore (list[Tensor], optional): Bounding boxes
            which can be ignored for each image. Default None.
    Returns:
        dict[str, Tensor]: A dictionary of loss components.
    """
    assert gt_bboxes_ignore is None, \
        f'{self.__class__.__name__} only supports ' \
        f'for gt_bboxes_ignore setting to None.'
    all_cls_scores = preds_dicts['all_cls_scores']  # [6,1,900,10]
    all_bbox_preds = preds_dicts['all_bbox_preds']  # [6,1,900,10]
    enc_cls_scores = preds_dicts['enc_cls_scores']  # None
    enc_bbox_preds = preds_dicts['enc_bbox_preds']  # None
    num_dec_layers = len(all_cls_scores)  # number of decoder layers: 6
    device = gt_labels_list[0].device
    # gt_bboxes.gravity_center is an (N, 3) tensor holding each box's center (x, y, z);
    # gt_bboxes.tensor[:, 3:] keeps all remaining columns (size and further attributes
    # such as yaw), so the concatenation yields one (N, M) ground-truth tensor per sample.
    gt_bboxes_list = [torch.cat(
        (gt_bboxes.gravity_center, gt_bboxes.tensor[:, 3:]),
        dim=1).to(device) for gt_bboxes in gt_bboxes_list]
    all_gt_bboxes_list = [gt_bboxes_list for _ in range(num_dec_layers)]  # one copy per decoder layer
    all_gt_labels_list = [gt_labels_list for _ in range(num_dec_layers)]  # one copy per decoder layer
    all_gt_bboxes_ignore_list = [gt_bboxes_ignore for _ in range(num_dec_layers)]  # one copy per decoder layer
    losses_cls, losses_bbox = multi_apply(self.loss_single,
                                          all_cls_scores, all_bbox_preds,
                                          all_gt_bboxes_list,
                                          all_gt_labels_list,
                                          all_gt_bboxes_ignore_list)
    loss_dict = dict()
    # loss of proposal generated from encode feature map.
    if enc_cls_scores is not None:
        binary_labels_list = [torch.zeros_like(gt_labels_list[i]) for i in range(len(all_gt_labels_list))]
        enc_loss_cls, enc_losses_bbox = self.loss_single(
            enc_cls_scores, enc_bbox_preds, gt_bboxes_list, binary_labels_list, gt_bboxes_ignore)
        loss_dict['enc_loss_cls'] = enc_loss_cls
        loss_dict['enc_loss_bbox'] = enc_losses_bbox
    # loss from the last decoder layer
    loss_dict['loss_cls'] = losses_cls[-1]
    loss_dict['loss_bbox'] = losses_bbox[-1]
    # loss from other decoder layers
    num_dec_layer = 0
    for loss_cls_i, loss_bbox_i in zip(losses_cls[:-1], losses_bbox[:-1]):
        loss_dict[f'd{num_dec_layer}.loss_cls'] = loss_cls_i
        loss_dict[f'd{num_dec_layer}.loss_bbox'] = loss_bbox_i
        num_dec_layer += 1
    return loss_dict

        The _get_target_single function:

def _get_target_single(self,
                       cls_score,  # [900,10]
                       bbox_pred,  # [900,10]
                       gt_labels,  # [18]
                       gt_bboxes,  # [18,9]
                       gt_bboxes_ignore=None):  # None
    """Compute regression and classification targets for one image.
    Outputs from a single decoder layer of a single feature level are used.
    Args:
        cls_score (Tensor): Box score logits from a single decoder layer
            for one image. Shape [num_query, cls_out_channels].
        bbox_pred (Tensor): Sigmoid outputs from a single decoder layer
            for one image, with normalized coordinate (cx, cy, w, h) and
            shape [num_query, 4].
        gt_bboxes (Tensor): Ground truth bboxes for one image with
            shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
        gt_labels (Tensor): Ground truth class indices for one image
            with shape (num_gts, ).
        gt_bboxes_ignore (Tensor, optional): Bounding boxes
            which can be ignored. Default None.
    Returns:
        tuple[Tensor]: a tuple containing the following for one image.
            - labels (Tensor): Labels of each image.
            - label_weights (Tensor): Label weights of each image.
            - bbox_targets (Tensor): BBox targets of each image.
            - bbox_weights (Tensor): BBox weights of each image.
            - pos_inds (Tensor): Sampled positive indices for each image.
            - neg_inds (Tensor): Sampled negative indices for each image.
    """
    num_bboxes = bbox_pred.size(0)  # 900
    # assigner and sampler: positive/negative sample assignment and sampling.
    # The assigner matches predicted boxes against ground-truth boxes and decides which
    # predictions are positives and which are negatives; the sampler then gathers the
    # matched indices based on the assignment result.
    assign_result = self.assigner.assign(bbox_pred, cls_score, gt_bboxes, gt_labels, gt_bboxes_ignore)
    sampling_result = self.sampler.sample(assign_result, bbox_pred, gt_bboxes)
    pos_inds = sampling_result.pos_inds  # [18] indices of the positive samples
    neg_inds = sampling_result.neg_inds  # [882] indices of the negative samples
    # label targets
    # Initialize a [900] tensor filled with num_classes (used as the background index),
    # then write the true class labels at the positive indices.
    labels = gt_bboxes.new_full((num_bboxes, ), self.num_classes, dtype=torch.long)  # [900]
    labels[pos_inds] = gt_labels[sampling_result.pos_assigned_gt_inds]
    # Label weights for all predictions (positives and negatives) are set to 1 here,
    # though in some setups one might want to weight negatives differently.
    label_weights = gt_bboxes.new_ones(num_bboxes)  # [900]
    # bbox targets
    # Initialize the box targets with zeros, keeping only the first 9 channels
    # (the 9 box parameters), plus an all-zero weight tensor whose positive
    # rows are then set to 1.0.
    bbox_targets = torch.zeros_like(bbox_pred)[..., :9]  # [900,9]
    bbox_weights = torch.zeros_like(bbox_pred)  # [900,10]
    bbox_weights[pos_inds] = 1.0
    # DETR
    # Write the ground-truth box parameters at the positive indices.
    bbox_targets[pos_inds] = sampling_result.pos_gt_bboxes
    # Note: the returned tensors mix real targets (for the positives) with
    # initialization values (for the negatives).
    return (labels, label_weights, bbox_targets, bbox_weights, pos_inds, neg_inds)

        The get_targets function:

def get_targets(self,
                cls_scores_list,  # [900,10]
                bbox_preds_list,  # [900,10]
                gt_bboxes_list,  # [18,9]: 9 box parameters (xyz center, w/l/h size, and remaining attributes such as yaw)
                gt_labels_list,  # 18 labels
                gt_bboxes_ignore_list=None):
    """Compute regression and classification targets for a batch image.
    Outputs from a single decoder layer of a single feature level are used.
    Args:
        cls_scores_list (list[Tensor]): Box score logits from a single
            decoder layer for each image with shape [num_query,
            cls_out_channels].
        bbox_preds_list (list[Tensor]): Sigmoid outputs from a single
            decoder layer for each image, with normalized coordinate
            (cx, cy, w, h) and shape [num_query, 4].
        gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image
            with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
        gt_labels_list (list[Tensor]): Ground truth class indices for each
            image with shape (num_gts, ).
        gt_bboxes_ignore_list (list[Tensor], optional): Bounding
            boxes which can be ignored for each image. Default None.
    Returns:
        tuple: a tuple containing the following targets.
            - labels_list (list[Tensor]): Labels for all images.
            - label_weights_list (list[Tensor]): Label weights for all
                images.
            - bbox_targets_list (list[Tensor]): BBox targets for all
                images.
            - bbox_weights_list (list[Tensor]): BBox weights for all
                images.
            - num_total_pos (int): Number of positive samples in all
                images.
            - num_total_neg (int): Number of negative samples in all
                images.
    """
    assert gt_bboxes_ignore_list is None, \
        'Only supports for gt_bboxes_ignore setting to None.'
    num_imgs = len(cls_scores_list)  # 1
    gt_bboxes_ignore_list = [gt_bboxes_ignore_list for _ in range(num_imgs)]  # [None]
    (labels_list, label_weights_list, bbox_targets_list,
     bbox_weights_list, pos_inds_list, neg_inds_list) = multi_apply(
         self._get_target_single, cls_scores_list, bbox_preds_list,
         gt_labels_list, gt_bboxes_list, gt_bboxes_ignore_list)
    num_total_pos = sum((inds.numel() for inds in pos_inds_list))  # 18
    num_total_neg = sum((inds.numel() for inds in neg_inds_list))  # 882
    return (labels_list, label_weights_list, bbox_targets_list,
            bbox_weights_list, num_total_pos, num_total_neg)


        The loss_single function:

def loss_single(self,
                cls_scores,  # [1,900,10]
                bbox_preds,  # [1,900,10]
                gt_bboxes_list,  # list of [18,9]: xyz center, w/l/h size, and remaining attributes such as yaw
                gt_labels_list,  # list of 18 labels
                gt_bboxes_ignore_list=None):  # None
    """Loss function for outputs from a single decoder layer of a single
    feature level.
    Args:
        cls_scores (Tensor): Box score logits from a single decoder layer
            for all images. Shape [bs, num_query, cls_out_channels].
        bbox_preds (Tensor): Sigmoid outputs from a single decoder layer
            for all images, with normalized coordinate (cx, cy, w, h) and
            shape [bs, num_query, 4].
        gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image
            with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
        gt_labels_list (list[Tensor]): Ground truth class indices for each
            image with shape (num_gts, ).
        gt_bboxes_ignore_list (list[Tensor], optional): Bounding
            boxes which can be ignored for each image. Default None.
    Returns:
        dict[str, Tensor]: A dictionary of loss components for outputs from
            a single decoder layer.
    """
    num_imgs = cls_scores.size(0)  # 1
    cls_scores_list = [cls_scores[i] for i in range(num_imgs)]  # [900,10]
    bbox_preds_list = [bbox_preds[i] for i in range(num_imgs)]  # [900,10]
    cls_reg_targets = self.get_targets(cls_scores_list, bbox_preds_list,
                                       gt_bboxes_list, gt_labels_list,
                                       gt_bboxes_ignore_list)
    # cls_reg_targets is a 6-element tuple holding the ground-truth targets
    (labels_list, label_weights_list, bbox_targets_list, bbox_weights_list,
     num_total_pos, num_total_neg) = cls_reg_targets
    labels = torch.cat(labels_list, 0)  # [900] ground-truth labels
    label_weights = torch.cat(label_weights_list, 0)  # [900] label weights
    bbox_targets = torch.cat(bbox_targets_list, 0)  # [900,9] ground-truth boxes
    bbox_weights = torch.cat(bbox_weights_list, 0)  # [900,10] box weights
    # predicted classification scores
    cls_scores = cls_scores.reshape(-1, self.cls_out_channels)  # [900,10]
    # construct weighted avg_factor to match with the official DETR repo
    cls_avg_factor = num_total_pos * 1.0 + num_total_neg * self.bg_cls_weight  # 18
    if self.sync_cls_avg_factor:
        cls_avg_factor = reduce_mean(cls_scores.new_tensor([cls_avg_factor]))
    # classification loss
    cls_avg_factor = max(cls_avg_factor, 1)  # 18
    loss_cls = self.loss_cls(cls_scores, labels, label_weights, avg_factor=cls_avg_factor)  # e.g. 2.2571
    # Compute the average number of gt boxes across all gpus, for
    # normalization purposes
    num_total_pos = loss_cls.new_tensor([num_total_pos])
    num_total_pos = torch.clamp(reduce_mean(num_total_pos), min=1).item()  # 18.0
    # regression L1 loss
    bbox_preds = bbox_preds.reshape(-1, bbox_preds.size(-1))  # [900,10]
    normalized_bbox_targets = normalize_bbox(bbox_targets, self.pc_range)  # [900,10]
    isnotnan = torch.isfinite(normalized_bbox_targets).all(dim=-1)  # boolean mask
    bbox_weights = bbox_weights * self.code_weights  # [900,10]
    loss_bbox = self.loss_bbox(
        bbox_preds[isnotnan, :10], normalized_bbox_targets[isnotnan, :10],
        bbox_weights[isnotnan, :10], avg_factor=num_total_pos)
    loss_cls = torch.nan_to_num(loss_cls)
    loss_bbox = torch.nan_to_num(loss_bbox)
    return loss_cls, loss_bbox

Summary

        (1) DETR3D is an adaptation of 2D DETR: it initializes 3D object queries, projects their reference points onto the 2D pixel planes to interact with the multi-view image features, and predicts the 3D positions of objects.

        (2) It is fundamentally different from depth-estimation-based BEV approaches such as LSS and BEVDet.

        (3) BEVFormer can be seen as an improved version of DETR3D, combining DETR3D with the BEV paradigm; in my view, it sits between DETR3D and the dense BEV approaches.

