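Having previously studied machine learning, neural networks (RNNs), and PyTorch, I worked through two hands-on demos.
After installing the GPU driver, nvidia-smi showed no GPU information.
I had clearly selected a driver in the driver tab of the built-in Software Center, yet nvidia-smi could not find the GPU. After asking GPT, it turned out the driver had never been loaded, because I had Secure Boot signature verification enabled. Disabling Secure Boot fixed it.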
First, for switching conda and pip to mirror sources, see: https://blog.csdn.net/h904798869/article/details/131719404
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple <package-name>
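The first major error was a compilation failure when building mmdetection.
After a long search, I found the root cause: my system could not find a valid CUDA installation.
Here are two posts that explain how PyTorch, CUDA, and the GPU driver relate to each other.
Troubleshooting basically comes down to checking the GPU driver, CUDA version, and PyTorch version against each other.
https://zhuanlan.zhihu.com/p/91334380 (I think this one explains it best)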
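Use nvcc -V (or nvcc --version) to see the CUDA toolkit version.
Use conda list to confirm the versions of the other libraries.
lspci | grep -i nvidia shows the GPU model.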
import torch

print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('Number of CUDA devices:', torch.cuda.device_count())
    print('Current CUDA device index:', torch.cuda.current_device())
    print('Current CUDA device name:', torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print('CUDA is not available')

import torch, torchvision
print(torch.__version__, torch.cuda.is_available())

# CUDA directory that PyTorch actually uses (CUDA_HOME, picked up when compiling extensions)
import torch.utils.cpp_extension
print(torch.utils.cpp_extension.CUDA_HOME)

# CUDA version this PyTorch release was compiled with
import torch
print(torch.version.cuda)
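If PyTorch imports successfully but torch.cuda.is_available() returns False, the CUDA device is unavailable. You can look up a matching driver on the NVIDIA website: https://www.nvidia.com/Download/index.aspx
My CUDA toolkit and NVIDIA driver versions were compatible with each other, but the CUDA version was too high.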
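The command-line checks above can be collected into one small script. This is a minimal sketch using only the Python standard library; run_check and collect_gpu_diagnostics are names made up for illustration, and a tool that is not installed is simply reported as missing.

```python
import shutil
import subprocess


def run_check(command):
    """Run a command and return its output, or a note if it is unavailable.

    Hypothetical helper for illustration; not part of any library.
    """
    if shutil.which(command[0]) is None:
        return f"{command[0]}: not found on PATH"
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{command[0]}: failed to run ({exc})"
    output = (result.stdout or result.stderr).strip()
    return output or f"{command[0]}: no output (exit code {result.returncode})"


def collect_gpu_diagnostics():
    """Gather driver, toolkit, and hardware information in one dictionary."""
    return {
        "driver (nvidia-smi)": run_check(["nvidia-smi"]),
        "toolkit (nvcc -V)": run_check(["nvcc", "-V"]),
        "hardware (lspci)": run_check(["lspci"]),
    }


if __name__ == "__main__":
    for name, report in collect_gpu_diagnostics().items():
        print(f"== {name} ==")
        print(report)
```

If nvidia-smi is reported as missing or prints an error while lspci still lists an NVIDIA device, the driver is probably not loaded, which matches the Secure Boot problem described earlier.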
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
git clone git@github.com:facebookresearch/detectron2.git
pip install .
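MMDetection is an open-source computer vision library from OpenMMLab for object detection. It provides a rich set of detection algorithms and models, including Faster R-CNN and Mask R-CNN.
MMSegmentation is an open-source computer vision library from OpenMMLab for image segmentation. It includes many segmentation algorithms and models, such as U-Net and DeepLabV3.
Detectron2 is an open-source object detection library developed by Facebook AI Research (FAIR). It is the successor to the original Detectron and provides a flexible, modular framework for building computer vision models, especially for object detection and instance segmentation.
Download the vehicle CAN bus data, vehicle pose data, map and scene data, and camera and lidar sensor data.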
python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes --version v1.0-mini --canbus ./data
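If this raises an error, remove the tools prefix in front of every data_converter import here.
BEV reproduction tutorial. The results are shown below; they do not look very accurate yet.
Once the environment and data are ready, follow the tutorial. My GPU can run the small and tiny configs; just replace base with one of those.
The model uses config files to manage parameters: bevformer_XXX.py holds the parameters, the model itself is assembled in bevformer_head.py, and the components can be found in modules.
The key part is BEVFormerLayer in the encoder, which contains the authors' TemporalSelfAttention and SpatialCrossAttention, plus deformable attention.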
# --------------------------------------------- # Copyright (c) OpenMMLab. All rights reserved. # --------------------------------------------- # Modified by Zhiqi Li # --------------------------------------------- from projects.mmdet3d_plugin.models.utils.bricks import run_time from .multi_scale_deformable_attn_function import MultiScaleDeformableAttnFunction_fp32 from mmcv.ops.multi_scale_deform_attn import multi_scale_deformable_attn_pytorch import warnings import torch import torch.nn as nn from mmcv.cnn import xavier_init, constant_init from mmcv.cnn.bricks.registry import ATTENTION import math from mmcv.runner.base_module import BaseModule, ModuleList, Sequential from mmcv.utils import (ConfigDict, build_from_cfg, deprecated_api_warning, to_2tuple) from mmcv.utils import ext_loader ext_module = ext_loader.load_ext( '_ext', ['ms_deform_attn_backward', 'ms_deform_attn_forward']) @ATTENTION.register_module() class TemporalSelfAttention(BaseModule): """An attention module used in BEVFormer based on Deformable-Detr. `Deformable DETR: Deformable Transformers for End-to-End Object Detection. <https://arxiv.org/pdf/2010.04159.pdf>`_. Args: embed_dims (int): The embedding dimension of Attention. Default: 256. num_heads (int): Parallel attention heads. Default: 64. num_levels (int): The number of feature map used in Attention. Default: 4. num_points (int): The number of sampling points for each query in each head. Default: 4. im2col_step (int): The step used in image_to_column. Default: 64. dropout (float): A Dropout layer on `inp_identity`. Default: 0.1. batch_first (bool): Key, Query and Value are shape of (batch, n, embed_dim) or (n, batch, embed_dim). Default to True. norm_cfg (dict): Config dict for normalization layer. Default: None. init_cfg (obj:`mmcv.ConfigDict`): The Config for initialization. Default: None. num_bev_queue (int): In this version, we only use one history BEV and one currenct BEV. the length of BEV queue is 2. 
""" # embed_dims (int): 注意力机制的嵌入维度。 # num_heads (int): 注意力机制中并行的注意头数。 # num_levels (int): 使用的特征图的数量。 # num_points (int): 每个注意头中每个查询点的采样点数。 # im2col_step (int): 在图像到列矩阵转换中使用的步长。 # dropout (float): 应用于 inp_identity 的 Dropout 层的丢弃率。 # batch_first (bool): Key、Query 和 Value 的形状是否为 (batch, n, embed_dim) 或 (n, batch, embed_dim)。 # norm_cfg (dict): 用于规范化层的配置字典。 # init_cfg (obj: mmcv.ConfigDict): 用于初始化的配置对象。 # num_bev_queue (int): 在这个版本中,我们只使用一个历史 Bird's Eye View(BEV)和一个当前 BEV。BEV 队列的长度为 2。 def __init__(self, embed_dims=256, num_heads=8, num_levels=4, num_points=4, num_bev_queue=2, im2col_step=64, dropout=0.1, batch_first=True, norm_cfg=None, init_cfg=None): super().__init__(init_cfg) if embed_dims % num_heads != 0:#检查 embed_dims 特征维度是否可以被 num_heads 多头数量 整除,否则引发错误。 raise ValueError(f'embed_dims must be divisible by num_heads, ' f'but got {embed_dims} and {num_heads}') dim_per_head = embed_dims // num_heads # 多头注意力量 划分特征 self.norm_cfg = norm_cfg self.dropout = nn.Dropout(dropout) self.batch_first = batch_first self.fp16_enabled = False # you'd better set dim_per_head to a power of 2 # which is more efficient in the CUDA implementation def _is_power_of_2(n): if (not isinstance(n, int)) or (n < 0): raise ValueError( 'invalid input for _is_power_of_2: {} (type: {})'.format( n, type(n))) return (n & (n - 1) == 0) and n != 0 if not _is_power_of_2(dim_per_head): warnings.warn( "You'd better set embed_dims in " 'MultiScaleDeformAttention to make ' 'the dimension of each attention head a power of 2 ' 'which is more efficient in our CUDA implementation.') self.im2col_step = im2col_step self.embed_dims = embed_dims self.num_levels = num_levels self.num_heads = num_heads self.num_points = num_points self.num_bev_queue = num_bev_queue # 用于生成采样偏移的线性层。 self.sampling_offsets = nn.Linear( embed_dims*self.num_bev_queue, num_bev_queue*num_heads * num_levels * num_points * 2) # 用于生成注意力权重的线性层 self.attention_weights = nn.Linear(embed_dims*self.num_bev_queue, num_bev_queue*num_heads * num_levels 
* num_points) #: 用于投影值的线性层。 self.value_proj = nn.Linear(embed_dims, embed_dims) #用于输出投影的线性层。 self.output_proj = nn.Linear(embed_dims, embed_dims) self.init_weights() def init_weights(self): """Default initialization for Parameters of Module.""" constant_init(self.sampling_offsets, 0.) thetas = torch.arange( self.num_heads, dtype=torch.float32) * (2.0 * math.pi / self.num_heads) grid_init = torch.stack([thetas.cos(), thetas.sin()], -1) grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]).view( self.num_heads, 1, 1, 2).repeat(1, self.num_levels*self.num_bev_queue, self.num_points, 1) for i in range(self.num_points): grid_init[:, :, i, :] *= i + 1 self.sampling_offsets.bias.data = grid_init.view(-1) #用于将参数初始化为常量值。 constant_init(self.attention_weights, val=0., bias=0.) #用于使用 Xavier 初始化参数 xavier_init(self.value_proj, distribution='uniform', bias=0.) xavier_init(self.output_proj, distribution='uniform', bias=0.) self._is_init = True def forward(self, query, key=None, value=None, identity=None, query_pos=None, key_padding_mask=None, reference_points=None, spatial_shapes=None, level_start_index=None, flag='decoder', **kwargs): """Forward Function of MultiScaleDeformAttention. Args: query (Tensor): Query of Transformer with shape (num_query, bs, embed_dims). key (Tensor): The key tensor with shape `(num_key, bs, embed_dims)`. value (Tensor): The value tensor with shape `(num_key, bs, embed_dims)`. identity (Tensor): The tensor used for addition, with the same shape as `query`. Default None. If None, `query` will be used. query_pos (Tensor): The positional encoding for `query`. Default: None. key_pos (Tensor): The positional encoding for `key`. Default None. reference_points (Tensor): The normalized reference points with shape (bs, num_query, num_levels, 2), all elements is range in [0, 1], top-left (0,0), bottom-right (1, 1), including padding area. or (N, Length_{query}, num_levels, 4), add additional two dimensions is (w, h) to form reference boxes. 
key_padding_mask (Tensor): ByteTensor for `query`, with shape [bs, num_key]. spatial_shapes (Tensor): Spatial shape of features in different levels. With shape (num_levels, 2), last dimension represents (h, w). level_start_index (Tensor): The start index of each level. A tensor has shape ``(num_levels, )`` and can be represented as [0, h_0*w_0, h_0*w_0+h_1*w_1, ...]. Returns: Tensor: forwarded results with shape [num_query, bs, embed_dims]. """ # 输入参数: # query (Tensor): Transformer 的查询张量,形状为 (num_query, bs, embed_dims)。 # key (Tensor): 键张量,形状为 (num_key, bs, embed_dims)。 # value (Tensor): 值张量,形状为 (num_key, bs, embed_dims)。 # identity (Tensor): 用于加法的张量,与 query 形状相同。如果为None,将使用 query。 # query_pos (Tensor): 用于 query 的位置编码。 # key_padding_mask (Tensor): 用于 query 的 ByteTensor,形状为 [bs, num_key]。 # reference_points (Tensor): 归一化的参考点,形状为 (bs, num_query, num_levels, 2),或 (N, Length_{query}, num_levels, 4)。这用于变形注意力。 # spatial_shapes (Tensor): 不同层级中特征的空间形状,形状为 (num_levels, 2),其中最后一个维度表示 (h, w)。 # level_start_index (Tensor): 每个层级的起始索引,形状为 (num_levels,)。 # 输出: # 返回值: 形状为 [num_query, bs, embed_dims] 的张量,表示前向传播的结果。 # flag: 一个字符串参数,可能用于指定这个操作是在编码器(encoder)还是解码器(decoder)中。 if value is None: assert self.batch_first bs, len_bev, c = query.shape # (num_query, bs, embed_dims) value = torch.stack([query, query], 1).reshape(bs*2, len_bev, c) #获取 query 张量的形状信息,并利用 torch.stack 和 reshape 函数将其复制为 value 张量 # value = torch.cat([query, query], 0) if identity is None: identity = query if query_pos is not None: query = query + query_pos # 将位置编码加入到q中 if not self.batch_first: # change to (bs, num_query ,embed_dims) query = query.permute(1, 0, 2) value = value.permute(1, 0, 2) #按照惯例整理顺序 bs, num_query, embed_dims = query.shape _, num_value, _ = value.shape# (num_key, bs, embed_dims) assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum() == num_value # (num_levels, 2),其中最后一个维度表示 (h, w) ???没看懂 assert self.num_bev_queue == 2 query = torch.cat([value[:bs], query], -1) value = self.value_proj(value) #将 
query 连接到 value 的前部分,并对 value 应用 self.value_proj # 我的理解,由于value是上一时刻和上上bev的信息,如此增加模型在进行自注意力计算时对上下文的理解,而线性变换 #将输入数据映射到一个更高维度的空间,以便提高模型的表示能力 #gpt说如果一个查询需要依赖较远的位置的信息,通过将值信息添加到查询前面, # 可以使得模型更容易捕捉到这些长距离的依赖关系,提高了模型对整个序列的建模能力。 if key_padding_mask is not None: value = value.masked_fill(key_padding_mask[..., None], 0.0) #如果存在 key_padding_mask,则使用 masked_fill 将 value 进行填充 value = value.reshape(bs*self.num_bev_queue, num_value, self.num_heads, -1) sampling_offsets = self.sampling_offsets(query) sampling_offsets = sampling_offsets.view( bs, num_query, self.num_heads, self.num_bev_queue, self.num_levels, self.num_points, 2) attention_weights = self.attention_weights(query).view( bs, num_query, self.num_heads, self.num_bev_queue, self.num_levels * self.num_points) attention_weights = attention_weights.softmax(-1) # 利用线性层计算采样offset和权重,并且把权重归一化 attention_weights = attention_weights.view(bs, num_query, self.num_heads, self.num_bev_queue, self.num_levels, self.num_points) attention_weights = attention_weights.permute(0, 3, 1, 2, 4, 5)\ .reshape(bs*self.num_bev_queue, num_query, self.num_heads, self.num_levels, self.num_points).contiguous() sampling_offsets = sampling_offsets.permute(0, 3, 1, 2, 4, 5, 6)\ .reshape(bs*self.num_bev_queue, num_query, self.num_heads, self.num_levels, self.num_points, 2) #根据 reference_points 的形状不同(2 或 4),计算 sampling_locations if reference_points.shape[-1] == 2: offset_normalizer = torch.stack( [spatial_shapes[..., 1], spatial_shapes[..., 0]], -1) sampling_locations = reference_points[:, :, None, :, None, :] \ + sampling_offsets \ / offset_normalizer[None, None, None, :, None, :] #NONE插入维度 #将采样偏移 sampling_offsets 转化为相对于输入空间的实际位置。 elif reference_points.shape[-1] == 4: sampling_locations = reference_points[:, :, None, :, None, :2] \ + sampling_offsets / self.num_points \ * reference_points[:, :, None, :, None, 2:] \ * 0.5 else: raise ValueError( f'Last dim of reference_points must be' f' 2 or 4, but get {reference_points.shape[-1]} instead.') if 
torch.cuda.is_available() and value.is_cuda: # using fp16 deformable attention is unstable because it performs many sum operations if value.dtype == torch.float16: MultiScaleDeformableAttnFunction = MultiScaleDeformableAttnFunction_fp32 else: MultiScaleDeformableAttnFunction = MultiScaleDeformableAttnFunction_fp32 output = MultiScaleDeformableAttnFunction.apply( value, spatial_shapes, level_start_index, sampling_locations, attention_weights, self.im2col_step) else: output = multi_scale_deformable_attn_pytorch( value, spatial_shapes, sampling_locations, attention_weights) # output shape (bs*num_bev_queue, num_query, embed_dims) # (bs*num_bev_queue, num_query, embed_dims)-> (num_query, embed_dims, bs*num_bev_queue) output = output.permute(1, 2, 0) # fuse history value and current value # (num_query, embed_dims, bs*num_bev_queue)-> (num_query, embed_dims, bs, num_bev_queue) output = output.view(num_query, embed_dims, bs, self.num_bev_queue) output = output.mean(-1) #计算 output 张量中每个元素在最后一个维度上的均值 # (num_query, embed_dims, bs)-> (bs, num_query, embed_dims) output = output.permute(2, 0, 1) output = self.output_proj(output) # out再整一次变换 if not self.batch_first: output = output.permute(1, 0, 2) return self.dropout(output) + identity # 加一次dropout防止过拟合,同时引入残差连接或者跳跃连接,从而帮助梯度传播以及加速模型的训练
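To make the tensor bookkeeping in TemporalSelfAttention easier to follow, here is a shape-only sketch. It reproduces just two steps from the forward pass above: stacking the query twice to build value when no history BEV is passed in, and averaging the two queue entries at the end. The sizes (bs=2, len_bev=6, c=8) are toy values chosen for illustration, and a random tensor stands in for the deformable attention output.

```python
import torch

bs, len_bev, c = 2, 6, 8   # toy sizes, chosen only for illustration
num_bev_queue = 2          # one history BEV + one current BEV

query = torch.randn(bs, len_bev, c)   # batch_first layout

# Step 1: with no history BEV, value is the query stacked with itself.
value = torch.stack([query, query], 1).reshape(bs * 2, len_bev, c)
print(value.shape)   # torch.Size([4, 6, 8])

# Stand-in for the deformable attention output, which has shape
# (bs * num_bev_queue, num_query, embed_dims) in the original code.
output = torch.randn(bs * num_bev_queue, len_bev, c)

# Step 2: fuse history and current results by averaging over the queue.
output = output.permute(1, 2, 0)                      # (num_query, embed_dims, bs*queue)
output = output.view(len_bev, c, bs, num_bev_queue)   # split the last axis
fused = output.mean(-1)                               # (num_query, embed_dims, bs)
fused = fused.permute(2, 0, 1)                        # (bs, num_query, embed_dims)
print(fused.shape)   # torch.Size([2, 6, 8])
```

The mean over the last axis is what merges the history-BEV result and the current-BEV result into a single BEV feature before output_proj is applied.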
# --------------------------------------------- # Copyright (c) OpenMMLab. All rights reserved. # --------------------------------------------- # Modified by Zhiqi Li # --------------------------------------------- from mmcv.ops.multi_scale_deform_attn import multi_scale_deformable_attn_pytorch import warnings import torch import torch.nn as nn import torch.nn.functional as F from mmcv.cnn import xavier_init, constant_init from mmcv.cnn.bricks.registry import (ATTENTION, TRANSFORMER_LAYER, TRANSFORMER_LAYER_SEQUENCE) from mmcv.cnn.bricks.transformer import build_attention import math from mmcv.runner import force_fp32, auto_fp16 from mmcv.runner.base_module import BaseModule, ModuleList, Sequential from mmcv.utils import ext_loader from .multi_scale_deformable_attn_function import MultiScaleDeformableAttnFunction_fp32, \ MultiScaleDeformableAttnFunction_fp16 from projects.mmdet3d_plugin.models.utils.bricks import run_time ext_module = ext_loader.load_ext( '_ext', ['ms_deform_attn_backward', 'ms_deform_attn_forward']) @ATTENTION.register_module() class SpatialCrossAttention(BaseModule): """An attention module used in BEVFormer. Args: embed_dims (int): The embedding dimension of Attention. Default: 256. 是bev线性变换后注意里特征数量 num_cams (int): The number of cameras 摄像头的数量 dropout (float): A Dropout layer on `inp_residual`. Default: 0.. 为了防止过拟合dropout层参数 init_cfg (obj:`mmcv.ConfigDict`): The Config for initialization. Default: None. 初始化参数 deformable_attention: (dict): The config for the deformable attention used in SCA. 
SCA的可变性注意力参数 """ def __init__(self, embed_dims=256, num_cams=6, pc_range=None, dropout=0.1, init_cfg=None, batch_first=False, deformable_attention=dict( type='MSDeformableAttention3D', embed_dims=256, num_levels=4), **kwargs ): super(SpatialCrossAttention, self).__init__(init_cfg) self.init_cfg = init_cfg self.dropout = nn.Dropout(dropout) self.pc_range = pc_range self.fp16_enabled = False self.deformable_attention = build_attention(deformable_attention) self.embed_dims = embed_dims self.num_cams = num_cams self.output_proj = nn.Linear(embed_dims, embed_dims) self.batch_first = batch_first self.init_weight() def init_weight(self): """Default initialization for Parameters of Module.""" xavier_init(self.output_proj, distribution='uniform', bias=0.) #以上初始化和TSA基本一致,没有笔记内容 @force_fp32(apply_to=('query', 'key', 'value', 'query_pos', 'reference_points_cam')) def forward(self, query, key, value, residual=None, query_pos=None, key_padding_mask=None, reference_points=None, spatial_shapes=None, reference_points_cam=None, bev_mask=None, level_start_index=None, flag='encoder', **kwargs): """Forward Function of Detr3DCrossAtten. Args: query (Tensor): Query of Transformer with shape (num_query, bs, embed_dims). #Q #网上解释num_query类似于 DETR里面的 object_queries,也就是最多预测多少个目标 key (Tensor): The key tensor with shape `(num_key, bs, embed_dims)`. # k value (Tensor): The value tensor with shape `(num_key, bs, embed_dims)`. (B, N, C, H, W) residual (Tensor): The tensor used for addition, with the same shape as `x`. Default None. If None, `x` will be used. #残差 query_pos (Tensor): The positional encoding for `query`. Default: None. key_pos (Tensor): The positional encoding for `key`. Default None. # q 和 k的位置编码 reference_points (Tensor): The normalized reference points with shape (bs, num_query, 4), all elements is range in [0, 1], top-left (0,0), bottom-right (1, 1), including padding area. or (N, Length_{query}, num_levels, 4), add additional two dimensions is (w, h) to form reference boxes. 
# 参考点归一化 数据标准化( Standardization )是将数据转换为均值为0,方差为1的数据,也就是将数据按比例缩放, 使得其分布具有标准正态分布。 数据归一化( Normalization ) 是将数据转换为满足0≤x≤1的数据,也就是将数据缩放到 [0,1]区间。 #num_levels:The number of feature map used in Attention 被用于注意力的特征地图的数量 key_padding_mask (Tensor): ByteTensor for `query`, with shape [bs, num_key]. #k 注意力掩码 spatial_shapes (Tensor): Spatial shape of features in different level. With shape (num_levels, 2), last dimension represent (h, w). #空间形状,在不同level的特征的空间形状,最后一个维度2是(h,w) level_start_index (Tensor): The start index of each level. A tensor has shape (num_levels) and can be represented as [0, h_0*w_0, h_0*w_0+h_1*w_1, ...]. # 开始遍历的index Returns: Tensor: forwarded results with shape [num_query, bs, embed_dims]. # 返回查询数量 批次 特征数量 """ if key is None: key = query if value is None: value = key if residual is None: inp_residual = query # 残差链接网络传输值被初始化为query slots = torch.zeros_like(query) # 以 query的shape初始化 slot if query_pos is not None: query = query + query_pos # 同样地把线性层学习到的query位置编码和query叠加到一起 bs, num_query, _ = query.size()#(num_query, bs, embed_dims) #这里是不是错了? 在input地方的备注维度顺序不同? D = reference_points_cam.size(3) indexes = [] for i, mask_per_img in enumerate(bev_mask): index_query_per_img = mask_per_img[0].sum(-1).nonzero().squeeze(-1) indexes.append(index_query_per_img) max_len = max([len(each) for each in indexes]) # 每个特征点对应一个mask点,特征点的值为false,就可以将其在注意力中抛弃 # 举例子说明:如果mask_per_img =m torch.tensor([[1, 0, 1, 0],[1, 1, 0, 1]]) # sum_per_img = mask_per_img.sum(-1) 得到tensor[2,3] # nonzero_indices = sum_per_img.nonzero() 得到tensor [[0],[1]] # index_query_per_img = nonzero_indices.squeeze(-1)去除上一步操作后多出来的维度 # 得到[0,1] # 最后用indexes 储存计算好的indices # each camera only interacts with its corresponding BEV queries. This step can greatly save GPU memory. 
queries_rebatch = query.new_zeros( [bs, self.num_cams, max_len, self.embed_dims]) reference_points_rebatch = reference_points_cam.new_zeros( [bs, self.num_cams, max_len, D, 2]) for j in range(bs): for i, reference_points_per_img in enumerate(reference_points_cam): index_query_per_img = indexes[i] queries_rebatch[j, i, :len(index_query_per_img)] = query[j, index_query_per_img] reference_points_rebatch[j, i, :len(index_query_per_img)] = reference_points_per_img[j, index_query_per_img] #重新计算q和reference point 根据上一步计算的index num_cams, l, bs, embed_dims = key.shape key = key.permute(2, 0, 1, 3).reshape( bs * self.num_cams, l, self.embed_dims) value = value.permute(2, 0, 1, 3).reshape( bs * self.num_cams, l, self.embed_dims) queries = self.deformable_attention(query=queries_rebatch.view(bs*self.num_cams, max_len, self.embed_dims), key=key, value=value, reference_points=reference_points_rebatch.view(bs*self.num_cams, max_len, D, 2), spatial_shapes=spatial_shapes, level_start_index=level_start_index).view(bs, self.num_cams, max_len, self.embed_dims) # 使用可变形注意力 for j in range(bs): for i, index_query_per_img in enumerate(indexes): slots[j, index_query_per_img] += queries[j, i, :len(index_query_per_img)] # 用计算好的 queries和indexed更新slots count = bev_mask.sum(-1) > 0 # 将bev_mask 按照最后一个维度相加 判断是否大于0 结果储存在count中 count = count.permute(1, 2, 0).sum(-1) count = torch.clamp(count, min=1.0) # 将count的元素的 最小值设为1 slots = slots / count[..., None] slots = self.output_proj(slots) return self.dropout(slots) + inp_residual # [num_query, bs, embed_dims]. @ATTENTION.register_module() class MSDeformableAttention3D(BaseModule): """An attention module used in BEVFormer based on Deformable-Detr. `Deformable DETR: Deformable Transformers for End-to-End Object Detection. <https://arxiv.org/pdf/2010.04159.pdf>`_. Args: embed_dims (int): The embedding dimension of Attention. Default: 256. num_heads (int): Parallel attention heads. Default: 64. num_levels (int): The number of feature map used in Attention. 
Default: 4. num_points (int): The number of sampling points for each query in each head. Default: 4. im2col_step (int): The step used in image_to_column. Default: 64. dropout (float): A Dropout layer on `inp_identity`. Default: 0.1. batch_first (bool): Key, Query and Value are shape of (batch, n, embed_dim) or (n, batch, embed_dim). Default to False. norm_cfg (dict): Config dict for normalization layer. Default: None. init_cfg (obj:`mmcv.ConfigDict`): The Config for initialization. Default: None. """ # embed_dims(嵌入维度):注意力机制中的嵌入维度。默认为256,影响了注意力机制中的向量表示维度。 # num_heads(注意力头数):并行的注意力头数。默认为64,控制了注意力机制中多头注意力的并行数量。 # num_levels(特征图数量):注意力中使用的特征图数量。默认为4,影响了注意力机制中特征图的层级数。 # num_points(采样点数):每个注意力头中每个查询点的采样点数。默认为4,决定了每个头部的注意力机制对查询点进行采样的数量。 # im2col_step(image_to_column 步长):在 image_to_column 操作中使用的步长。 # dropout(丢弃率):应用于 inp_identity 的 Dropout 层的丢弃率。默认为0.1,用于在训练中随机丢弃输入张量中的一部分元素,以防止过拟合。 # batch_first(批次优先):用于指定输入张量的维度顺序。如果为 True,表示输入张量的形状是(batch, n, embed_dim),否则为 (n, batch, embed_dim)。默认为 False。 # norm_cfg(归一化层配置):用于归一化层的配置字典。默认为 None, # init_cfg(初始化配置):初始化配置的配置对象。 def __init__(self, embed_dims=256, num_heads=8, num_levels=4, num_points=8, im2col_step=64, dropout=0.1, batch_first=True, norm_cfg=None, init_cfg=None): super().__init__(init_cfg) if embed_dims % num_heads != 0: raise ValueError(f'embed_dims must be divisible by num_heads, ' f'but got {embed_dims} and {num_heads}') dim_per_head = embed_dims // num_heads # 每一个头的特征数量 self.norm_cfg = norm_cfg self.batch_first = batch_first self.output_proj = None self.fp16_enabled = False # you'd better set dim_per_head to a power of 2 # which is more efficient in the CUDA implementation def _is_power_of_2(n): if (not isinstance(n, int)) or (n < 0): raise ValueError( 'invalid input for _is_power_of_2: {} (type: {})'.format( n, type(n))) return (n & (n - 1) == 0) and n != 0 if not _is_power_of_2(dim_per_head): warnings.warn( "You'd better set embed_dims in " 'MultiScaleDeformAttention to make ' 'the dimension of each attention head a 
power of 2 ' 'which is more efficient in our CUDA implementation.') self.im2col_step = im2col_step self.embed_dims = embed_dims self.num_levels = num_levels self.num_heads = num_heads self.num_points = num_points self.sampling_offsets = nn.Linear( embed_dims, num_heads * num_levels * num_points * 2) self.attention_weights = nn.Linear(embed_dims, num_heads * num_levels * num_points) self.value_proj = nn.Linear(embed_dims, embed_dims) # 同 TSA的注意力 self.init_weights() def init_weights(self): """Default initialization for Parameters of Module.""" constant_init(self.sampling_offsets, 0.) #极坐标网格构建 # 创建一个0到2pi 等分为8分的tensor thetas = torch.arange( self.num_heads, dtype=torch.float32) * (2.0 * math.pi / self.num_heads) # 初始化grid #利用三角函数计算每个角度对应的余弦和正弦值,然后通过torch.stack在最后一个维度 #将这两个值堆叠在一起形成一个形状为(num_heads, 2)的张量。 # 这个张量的每一行表示一个角度对应的极坐标中的(x, y)坐标, # 使用grid_init.abs().max(-1, keepdim=True)[0]计算每个行向量的绝对值中的最大值, # 并在最后一个维度上保持维度。然后,将grid_init除以这个最大值,实现归一化。 # 最后,通过view函数将结果变形成形状为(num_heads, 1, 1, 2)的张量 #最终的输出是一个形状为(num_heads, 1, 1, 2)的张量, # 表示了num_heads个头部的极坐标网格。每个头部的网格用一个(x, y)坐标表示, # 这个坐标在单位圆上,且在整个num_heads中均匀分布 grid_init = torch.stack([thetas.cos(), thetas.sin()], -1) grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]).view( self.num_heads, 1, 1, 2).repeat(1, self.num_levels, self.num_points, 1) ## 遍历第二个维度上,通过这种方式记录是第几个采样点的极坐标 for i in range(self.num_points): grid_init[:, :, i, :] *= i + 1 #grid_init.view(-1) 将 grid_init 张量展平为一个一维张量 self.sampling_offsets.bias.data = grid_init.view(-1) constant_init(self.attention_weights, val=0., bias=0.) xavier_init(self.value_proj, distribution='uniform', bias=0.) xavier_init(self.output_proj, distribution='uniform', bias=0.) self._is_init = True def forward(self, query, key=None, value=None, identity=None, query_pos=None, key_padding_mask=None, reference_points=None, spatial_shapes=None, level_start_index=None, **kwargs): """Forward Function of MultiScaleDeformAttention. 
Args: query (Tensor): Query of Transformer with shape ( bs, num_query, embed_dims). key (Tensor): The key tensor with shape `(bs, num_key, embed_dims)`. value (Tensor): The value tensor with shape `(bs, num_key, embed_dims)`. identity (Tensor): The tensor used for addition, with the same shape as `query`. Default None. If None, `query` will be used. query_pos (Tensor): The positional encoding for `query`. Default: None. key_pos (Tensor): The positional encoding for `key`. Default None. reference_points (Tensor): The normalized reference points with shape (bs, num_query, num_levels, 2), all elements is range in [0, 1], top-left (0,0), bottom-right (1, 1), including padding area. or (N, Length_{query}, num_levels, 4), add additional two dimensions is (w, h) to form reference boxes. key_padding_mask (Tensor): ByteTensor for `query`, with shape [bs, num_key]. spatial_shapes (Tensor): Spatial shape of features in different levels. With shape (num_levels, 2), last dimension represents (h, w). level_start_index (Tensor): The start index of each level. A tensor has shape ``(num_levels, )`` and can be represented as [0, h_0*w_0, h_0*w_0+h_1*w_1, ...]. Returns: Tensor: forwarded results with shape [num_query, bs, embed_dims]. 
""" if value is None: value = query if identity is None: identity = query if query_pos is not None: query = query + query_pos if not self.batch_first: # change to (bs, num_query ,embed_dims) query = query.permute(1, 0, 2) value = value.permute(1, 0, 2) bs, num_query, _ = query.shape bs, num_value, _ = value.shape assert (spatial_shapes[:, 0] * spatial_shapes[:, 1]).sum() == num_value value = self.value_proj(value) if key_padding_mask is not None: value = value.masked_fill(key_padding_mask[..., None], 0.0) value = value.view(bs, num_value, self.num_heads, -1) sampling_offsets = self.sampling_offsets(query).view( bs, num_query, self.num_heads, self.num_levels, self.num_points, 2) attention_weights = self.attention_weights(query).view( bs, num_query, self.num_heads, self.num_levels * self.num_points) attention_weights = attention_weights.softmax(-1) attention_weights = attention_weights.view(bs, num_query, self.num_heads, self.num_levels, self.num_points) if reference_points.shape[-1] == 2: """ For each BEV query, it owns `num_Z_anchors` in 3D space that having different heights. After proejcting, each BEV query has `num_Z_anchors` reference points in each 2D image. For each referent point, we sample `num_points` sampling points. For `num_Z_anchors` reference points, it has overall `num_points * num_Z_anchors` sampling points. 
""" offset_normalizer = torch.stack( [spatial_shapes[..., 1], spatial_shapes[..., 0]], -1) bs, num_query, num_Z_anchors, xy = reference_points.shape reference_points = reference_points[:, :, None, None, None, :, :] sampling_offsets = sampling_offsets / \ offset_normalizer[None, None, None, :, None, :] bs, num_query, num_heads, num_levels, num_all_points, xy = sampling_offsets.shape sampling_offsets = sampling_offsets.view( bs, num_query, num_heads, num_levels, num_all_points // num_Z_anchors, num_Z_anchors, xy) sampling_locations = reference_points + sampling_offsets bs, num_query, num_heads, num_levels, num_points, num_Z_anchors, xy = sampling_locations.shape assert num_all_points == num_points * num_Z_anchors sampling_locations = sampling_locations.view( bs, num_query, num_heads, num_levels, num_all_points, xy) elif reference_points.shape[-1] == 4: assert False else: raise ValueError( f'Last dim of reference_points must be' f' 2 or 4, but get {reference_points.shape[-1]} instead.') # sampling_locations.shape: bs, num_query, num_heads, num_levels, num_all_points, 2 # attention_weights.shape: bs, num_query, num_heads, num_levels, num_all_points #准备步骤基本可TSA相同 if torch.cuda.is_available() and value.is_cuda: if value.dtype == torch.float16: MultiScaleDeformableAttnFunction = MultiScaleDeformableAttnFunction_fp32 else: MultiScaleDeformableAttnFunction = MultiScaleDeformableAttnFunction_fp32 output = MultiScaleDeformableAttnFunction.apply( value, spatial_shapes, level_start_index, sampling_locations, attention_weights, self.im2col_step) else: output = multi_scale_deformable_attn_pytorch( value, spatial_shapes, sampling_locations, attention_weights) if not self.batch_first: output = output.permute(1, 0, 2) return output
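An Embedding layer in deep learning is mainly used to map high-dimensional discrete data into a low-dimensional continuous space.
An Embedding layer maps discrete inputs such as words or class labels to real-valued vectors of a fixed dimension. This lets the model handle such data better, because a continuous vector can carry more information than a bare index.
An Embedding layer learns task-appropriate feature representations from the training data. As a result, similar classes or words end up with similar representations in the embedding space, which helps the model generalize.
An Embedding layer maps high-dimensional discrete data into a low-dimensional continuous space. This helps reduce the number of model parameters and improves training and inference efficiency.
Compared with sparse representations such as one-hot encoding, an Embedding layer gives a dense representation in which every dimension carries information. This reduces storage requirements and conveys what the model has learned more efficiently.
• Continuous vector: every element can be any real number, not just an integer. In an embedding layer this gives a more flexible and expressive representation.
• Low-dimensional vector: the vector has relatively few dimensions. For embeddings, this reduces the number of model parameters while keeping the important features. Low-dimensional representations are usually easier for a model to learn and generalize from.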
import torch
import torch.nn as nn

# Suppose the vocabulary size is 10 and each word has embedding dimension 3
vocab_size = 10
embedding_dim = 3

# Create an Embedding layer
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Define an input containing the indices of three words
input_indices = torch.tensor([1, 5, 9], dtype=torch.long)

# Pass the input through the embedding layer to get the embedding vectors
embedded_vector = embedding_layer(input_indices)

# Print the embedding vectors
print(embedded_vector)
tensor([[-2.8465, 0.1365, -0.4851],
[ 0.4402, -0.3163, -0.8770],
[-0.4027, -0.1626, 0.3808]], grad_fn=<EmbeddingBackward0>)
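"Each word has embedding dimension 3" means that the embedding layer assigns every word a vector with 3 components. An embedding vector is a real-valued vector that represents the semantic information the model has learned for each word in the vocabulary.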
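One way to see why an embedding layer is a compact alternative to one-hot encoding: picking row i of the weight table gives the same result as multiplying a one-hot vector by that table. The following is a plain-Python sketch with a made-up 4x3 table (no PyTorch involved); the helper names are hypothetical.

```python
# A made-up embedding table: 4 "words", each with a 3-dimensional vector.
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a one-hot list such as [0.0, 0.0, 1.0, 0.0]."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_table(index, table):
    """Multiply a one-hot row vector by the table (the sparse route)."""
    vec = one_hot(index, len(table))
    dim = len(table[0])
    return [sum(vec[r] * table[r][d] for r in range(len(table))) for d in range(dim)]


def lookup(index, table):
    """Pick the row directly (what an embedding layer does)."""
    return list(table[index])


print(lookup(2, embedding_table))               # [0.7, 0.8, 0.9]
print(one_hot_times_table(2, embedding_table))  # [0.7, 0.8, 0.9]
```

The lookup touches one row, while the one-hot route multiplies by every row only to discard all but one, which is why the dense table is the cheaper representation.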
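Xavier initialization (uniform version): draw the weight matrix W from a uniform distribution on [-a, a], where a = sqrt(6 / (n + m)). Here n is the number of neurons in the previous layer and m is the number of neurons in the current layer.
Xavier initialization (Gaussian version): draw the weight matrix W from a Gaussian distribution with mean 0 and variance = 2 / (n + m). Explanation of the core design idea: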
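Consider a network with an input layer (100 neurons, as implied by the first calculation) - hidden layer 1 (80 neurons) - hidden layer 2 (60 neurons) - hidden layer 3 (40 neurons) - output layer (10 neurons). We now use Xavier initialization for the weights of every layer. For hidden layer 1, the previous layer has 100 neurons and the layer itself has 80, so a = sqrt(6 / (100 + 80)) ≈ 0.183. We can then draw hidden layer 1's weight matrix from the uniform distribution [-0.183, 0.183]. Next, compute the range for hidden layer 2. The previous layer is hidden layer 1 with 80 neurons, and hidden layer 2 itself has 60. Using the same formula, a = sqrt(6 / (80 + 60)) ≈ 0.207. We then draw hidden layer 2's weight matrix from the uniform distribution [-0.207, 0.207]. The ranges for hidden layer 3 and the output layer are computed and applied in the same way.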
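The per-layer ranges can be computed mechanically from a = sqrt(6 / (n + m)). Below is a minimal sketch, assuming the layer sizes 100-80-60-40-10 (the input size of 100 is taken from the first calculation in the text) and using only the standard library. xavier_uniform_bound and init_layer are illustrative names, not library functions.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out):
    """Bound a of the uniform range [-a, a]: a = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def init_layer(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix uniformly from [-a, a]."""
    a = xavier_uniform_bound(fan_in, fan_out)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


layer_sizes = [100, 80, 60, 40, 10]   # input, three hidden layers, output
rng = random.Random(0)                # fixed seed so runs are repeatable

bounds = []
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    a = xavier_uniform_bound(fan_in, fan_out)
    weights = init_layer(fan_in, fan_out, rng)
    bounds.append(a)
    print(f"{fan_in:>3} -> {fan_out:<3} a = {a:.3f}, weights: {len(weights)} x {len(weights[0])}")
```

With these sizes the bounds come out to roughly 0.183, 0.207, 0.245, and 0.346. The range widens as the layers get narrower, which keeps the variance of the activations roughly constant from layer to layer.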
tmp_prev_bev = prev_bev[:, i].reshape(
bev_h, bev_w, -1).permute(2, 0, 1)
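A quick way to understand this line is to track shapes on a toy tensor. The sketch below assumes prev_bev has shape (bev_h * bev_w, bs, embed_dims), which is my reading of the surrounding code and not stated in the snippet itself; the sizes are arbitrary.

```python
import torch

bev_h, bev_w, bs, embed_dims = 3, 4, 2, 5   # toy sizes for illustration

# Assumed layout: one row per BEV grid cell, then batch, then channels.
prev_bev = torch.arange(bev_h * bev_w * bs * embed_dims, dtype=torch.float32)
prev_bev = prev_bev.reshape(bev_h * bev_w, bs, embed_dims)

i = 0  # index of the sample in the batch

# prev_bev[:, i] keeps every grid cell but only sample i: (bev_h*bev_w, C)
sample = prev_bev[:, i]
print(sample.shape)        # torch.Size([12, 5])

# reshape unflattens the grid axis: (bev_h, bev_w, C)
grid = sample.reshape(bev_h, bev_w, -1)
print(grid.shape)          # torch.Size([3, 4, 5])

# permute moves channels first, an image-like layout: (C, bev_h, bev_w)
tmp_prev_bev = grid.permute(2, 0, 1)
print(tmp_prev_bev.shape)  # torch.Size([5, 3, 4])

# The feature vector of grid cell (row 1, col 2) is unchanged by the reshuffle.
flat_index = 1 * bev_w + 2
print(torch.equal(tmp_prev_bev[:, 1, 2], prev_bev[flat_index, i]))  # True
```

The -1 in reshape lets PyTorch infer the channel size, so the same line works for any embed_dims.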
This uses syntax I was not familiar with; GPT explained it as follows:
prev_bev[:, i]: selects all rows of the prev_bev tensor (
