Paper: https://arxiv.org/abs/2005.12872
Code: https://github.com/facebookresearch/detr
Published at: ECCV 2020 | Meta AI
Excellent bilibili video walkthrough (highly recommended!!!): https://www.bilibili.com/video/BV1GB4y1X72R/?spm_id_from=333.788&vd_source=21011151235423b801d3f3ae98b91e94
A collection of DETR-family detectors from IDEA: https://github.com/IDEA-Research/awesome-detection-transformer#toolbox
detrex toolbox: https://github.com/IDEA-Research/detrex
Results:
Contribution: DETR turns parts of the detection pipeline that used to be hand-crafted and non-learnable into learnable components.
Object detection predicts the location and category of every object of interest in an image. Most popular CNN-based detectors predict indirectly, regressing and classifying objects from a large number of proposals, anchors, or window centers, so their results are affected by post-processing such as NMS. To simplify this, the authors propose a direct, end-to-end method: the input is an image and the output is directly the final set of predictions, with no post-processing.
The problem with NMS:
NMS is used by both anchor-based and anchor-free detectors, and it is not supported on every hardware platform, so an end-to-end detection architecture that needs no post-processing is highly desirable.
DETR's contributions:
DETR pioneers the use of the Transformer's global modeling ability and casts object detection as a set prediction problem. Because of this global modeling, it does not need to predict a huge number of candidate boxes: the output is directly the final set of detections, with no NMS post-processing, which simplifies both training and deployment.
The paper repeatedly stresses that it frames detection as a set prediction problem: detection is essentially predicting a set of boxes with their locations and categories, the objects differ from box to box, and the output set differs from image to image, so the task is "given an image, predict its set of boxes".
DETR makes detection an end-to-end framework and removes the components that depended heavily on prior knowledge, namely NMS and anchors. Without anchor generation there is no need to predict so many boxes or to prune redundant ones, and far fewer hyperparameters need tuning, so the whole pipeline becomes very simple.
DETR introduces two things:
The two key features of DETR:
The DETR architecture is shown in Figure 1. DEtection TRansformer (DETR) directly predicts all objects at once and is trained end-to-end with a single set loss.
DETR training stage (requires the bipartite-matching loss):
DETR inference stage (does not require the bipartite-matching loss):
The DETR architecture, shown in Figure 2, consists of three parts:
Training example:
Decoder details:
Backbone:
Transformer encoder:
Transformer decoder:
Prediction feed-forward networks (FFNs)
Auxiliary decoding losses
What is bipartite matching:
Bipartite matching finds the assignment between two sets (e.g. workers and jobs, or here predictions and ground-truth boxes) that minimizes the total cost; the Hungarian algorithm solves it efficiently.
How DETR uses bipartite matching:
The matching cost matrix is fed into the linear_sum_assignment function of the scipy library, which returns the optimal assignment. The constraint here is stricter than ordinary matching: the assignment must be one-to-one, i.e. each predicted box is matched to at most one gt box, and it is exactly this property that removes the need for NMS post-processing. DETR always outputs N predictions in one shot, where N (100 by default) is far larger than the number of objects in the image. To match these 100 predictions with the gt boxes, the authors use bipartite matching, and only the matched predictions go on to contribute classification and regression losses.
Step 1: one-to-one matching between predicted boxes and gt boxes (Hungarian matching)
Bipartite matching finds the assignment with the lowest total cost. The cost of pairing each object query with each gt box fills a cost matrix; that cost is a loss made up of a classification term and a box-location term, i.e. $L_{match}$. Once every (object query, gt) pair has its cost, the linear_sum_assignment function returns the optimal assignment.
To obtain the optimal bipartite matching, the authors search over permutations $\sigma$ of the N object queries for the one with the lowest total cost:
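$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} L_{match}(y_i, \hat{y}_{\sigma(i)})$$
(Eq. (1) in the paper, where $\mathfrak{S}_N$ is the set of permutations of the N slots.)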
$L_{match}$ is the matching cost between a gt and an object query; $L_{match}(y_i, \hat{y}_{\sigma(i)})$ is defined as follows:
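$$L_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, L_{box}(b_i, \hat{b}_{\sigma(i)})$$
That is, for every non-empty (non-background) gt, the cost rewards a high predicted probability for the true class and penalizes disagreement between the boxes.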
This matching cost considers both the class prediction and the similarity of the boxes:
The i-th ground truth can be written as $y_i=(c_i, b_i)$, where $c_i$ is the class label and $b_i\in[0, 1]^4$ is the box center and width/height, normalized to the image size
For the prediction at index $\sigma(i)$, the predicted probability of class $c_i$ is written $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box is $\hat{b}_{\sigma(i)}$
This matching procedure plays the same role as proposal matching or anchor matching in earlier detectors; the key difference is that the assignment is one-to-one, with no duplicates left over, and it is precisely this one-to-one matching that makes NMS unnecessary. A small sketch of this step with scipy is shown below.
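A minimal sketch of the matching step, assuming a cost built only from the class probability and a pairwise L1 box distance; the official matcher (models/matcher.py in the DETR repo) additionally includes a GIoU cost term, and the weights 1 and 5 follow its defaults:

import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """Toy matcher for a single image.
    pred_logits: [N, num_classes+1], pred_boxes: [N, 4] in normalized (cx, cy, w, h)
    gt_labels: [M], gt_boxes: [M, 4]
    Returns (pred_idx, gt_idx): the one-to-one assignment with the lowest total cost.
    """
    prob = pred_logits.softmax(-1)                        # [N, num_classes+1]
    cls_cost = -prob[:, gt_labels]                        # [N, M]: higher class prob -> lower cost
    bbox_cost = torch.cdist(pred_boxes, gt_boxes, p=1)    # [N, M]: pairwise L1 distance
    cost = cost_class * cls_cost + cost_bbox * bbox_cost  # total matching cost matrix
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx

# usage: 100 object queries, 3 gt objects -> 3 matched (query, gt) pairs
pred_logits, pred_boxes = torch.randn(100, 92), torch.rand(100, 4)
gt_labels, gt_boxes = torch.tensor([2, 7, 2]), torch.rand(3, 4)
print(hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes))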
Step 2: compute the loss on the one-to-one matched (prediction, gt) pairs
DETR's box regression loss:
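The box loss is a linear combination of an L1 term and a generalized IoU (GIoU) term, $L_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou} L_{giou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}\|b_i - \hat{b}_{\sigma(i)}\|_1$, with $\lambda_{iou}=2$ and $\lambda_{L1}=5$ in the paper; the GIoU term keeps the loss comparable across box scales. Below is a minimal sketch assuming boxes already converted to (x1, y1, x2, y2); note that the repo computes the L1 term on normalized (cx, cy, w, h) boxes and ships the official GIoU in util/box_ops.py:

import torch

def generalized_box_iou(boxes1, boxes2):
    """Pairwise GIoU for boxes in (x1, y1, x2, y2) format, shapes [N, 4] and [M, 4]."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])  # intersection top-left
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    union = area1[:, None] + area2[None, :] - inter
    iou = inter / union
    lt_c = torch.min(boxes1[:, None, :2], boxes2[None, :, :2])  # enclosing box top-left
    rb_c = torch.max(boxes1[:, None, 2:], boxes2[None, :, 2:])  # enclosing box bottom-right
    area_c = (rb_c - lt_c).clamp(min=0).prod(-1)
    return iou - (area_c - union) / area_c                      # [N, M]

def box_loss(pred_xyxy, gt_xyxy, l1_weight=5.0, giou_weight=2.0):
    """Box loss for already-matched pairs (same ordering in both tensors)."""
    giou = torch.diag(generalized_box_iou(pred_xyxy, gt_xyxy))  # GIoU of each matched pair
    loss_giou = (1 - giou).mean()
    loss_l1 = torch.nn.functional.l1_loss(pred_xyxy, gt_xyxy)
    return l1_weight * loss_l1 + giou_weight * loss_giou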
1、Effect of the number of encoder layers:
Table 2 shows the effect of the number of encoder layers: using the encoder improves AP by 3.9 points. The authors conjecture that this is because the encoder captures the global scene, which helps disentangle different objects.
Figure 3 visualizes the attention map of the last encoder layer. Each red dot marks a query location, and the highlighted regions show which positions are most correlated with that dot. Self-attention already does this remarkably well: at this stage the individual instances are already well separated, which simplifies object extraction and localization for the decoder.
2、Effect of the number of decoder layers:
Figure 4 shows how AP and AP50 change as decoder layers are added: every additional layer brings some improvement, for a total gain of +8.2 / +9.5 points.
Effect of NMS:
Even under such severe occlusion, the feet of the large elephant are attributed to the blue query and the feet of the small elephant to the yellow one
Even with the zebras occluding each other this heavily, the decoder can still separate the features belonging to different zebras
The authors therefore argue that the Transformer encoder and decoder are both indispensable:
3、Importance of FFN
4、Importance of positional encoding
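DETR adds a fixed 2D sine/cosine spatial positional encoding to the backbone features before the encoder and reuses it in the decoder's cross-attention. Below is a minimal standalone sketch in the spirit of the repo's PositionEmbeddingSine; it is not the exact implementation (the real one normalizes coordinates with the padding mask and interleaves sin/cos per frequency):

import torch

def sine_position_encoding(h, w, dim=256, temperature=10000.0):
    """2D sine/cosine positional encoding: half the channels encode y, half encode x."""
    half = dim // 2
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1)            # [h, 1]
    x = torch.arange(w, dtype=torch.float32).unsqueeze(1)            # [w, 1]
    freq = temperature ** (2 * torch.arange(half // 2) / half)       # [half/2] frequencies
    y_emb = torch.cat([torch.sin(y / freq), torch.cos(y / freq)], dim=1)  # [h, half]
    x_emb = torch.cat([torch.sin(x / freq), torch.cos(x / freq)], dim=1)  # [w, half]
    pos = torch.cat([
        y_emb[:, None, :].expand(h, w, half),                        # row (y) part
        x_emb[None, :, :].expand(h, w, half),                        # column (x) part
    ], dim=-1)                                                       # [h, w, dim]
    return pos.permute(2, 0, 1)                                      # [dim, h, w]

pos = sine_position_encoding(25, 38)   # matches the 25x38 feature map in the demo below
print(pos.shape)                       # torch.Size([256, 25, 38])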
5、Decoder output slot analysis, i.e. visualization of the object queries
Figure 7 visualizes the box predictions of 20 prediction slots on the COCO 2017 validation set (there are 100 object queries in total, only 20 are plotted); each plot corresponds to one object query
What do the object queries actually learn:
Object queries are somewhat similar to anchors:
DETR learns a different specialization for each query slot.
We can see that each slot focuses on a different region and object size, and all slots also have a mode for predicting image-wide boxes (the aligned red dots), which the authors attribute to the distribution of boxes in COCO.
Below is the simplified inference code from the paper:
import torch
from torch import nn
from torchvision.models import resnet50
class DETR(nn.Module):
def __init__(self, num_classes, hidden_dim, nheads,
num_encoder_layers, num_decoder_layers):
super().__init__()
# We take only convolutional layers from ResNet-50 model
self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
self.conv = nn.Conv2d(2048, hidden_dim, 1)
self.transformer = nn.Transformer(hidden_dim, nheads, num_encoder_layers, num_decoder_layers)
self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
self.linear_bbox = nn.Linear(hidden_dim, 4)
self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
def forward(self, inputs):
        x = self.backbone(inputs) # inputs=[1, 3, 800, 1200], x=[1, 2048, 25, 38]
h = self.conv(x) # h=[1, 256, 25, 38]
H, W = h.shape[-2:] # H=25, W=38
pos = torch.cat([
self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),], dim=-1).flatten(0, 1).unsqueeze(1) # pos=[950, 1, 256]
h = self.transformer(pos + h.flatten(2).permute(2, 0, 1), self.query_pos.unsqueeze(1)) # h=[100, 1, 256]
return self.linear_class(h), self.linear_bbox(h).sigmoid() # [100, 1, 92], [100, 1, 4]
detr = DETR(num_classes=91, hidden_dim=256, nheads=8, num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
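Continuing from the snippet above (and assuming a trained checkpoint has been loaded instead of random weights), a minimal post-processing sketch: softmax over the 92 logits, drop the last "no object" class, keep only confident queries, and rescale the normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2); the 0.7 threshold is illustrative, not from the paper:

probs = logits.softmax(-1)[:, 0, :-1]      # [100, 91]: take batch element 0, drop no-object class
scores, labels = probs.max(-1)             # best class per query
keep = scores > 0.7                        # confidence threshold (illustrative)

img_h, img_w = 800, 1200                   # size of the input image above
cx, cy, w, h = bboxes[:, 0][keep].unbind(-1)
boxes_xyxy = torch.stack([
    (cx - 0.5 * w) * img_w, (cy - 0.5 * h) * img_h,
    (cx + 0.5 * w) * img_w, (cy + 0.5 * h) * img_h,
], dim=-1)
print(labels[keep], scores[keep], boxes_xyxy)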
Training: download the detr code, put the COCO images under the dataset directory, and start training with:
python main.py
DETR model structure: detr.py
Input: the raw image
build backbone:
def build_backbone(args):
position_embedding = build_position_encoding(args) # PositionEmbeddingSine()
train_backbone = args.lr_backbone > 0 # True
return_interm_layers = args.masks # False
backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
model = Joiner(backbone, position_embedding) #(0) backbone() (1) PositionEmbeddingSine()
model.num_channels = backbone.num_channels # 2048
return model
def build_transformer(args):
return Transformer(
d_model=args.hidden_dim,
dropout=args.dropout,
nhead=args.nheads,
dim_feedforward=args.dim_feedforward,
num_encoder_layers=args.enc_layers,
num_decoder_layers=args.dec_layers,
normalize_before=args.pre_norm,
return_intermediate_dec=True,
)
class Transformer(nn.Module):
def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
activation="relu", normalize_before=False,
return_intermediate_dec=False):
super().__init__()
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
dropout, activation, normalize_before)
encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
dropout, activation, normalize_before)
decoder_norm = nn.LayerNorm(d_model)
self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
return_intermediate=return_intermediate_dec)
self._reset_parameters()
self.d_model = d_model
self.nhead = nhead
def _reset_parameters(self):
for p in self.parameters():
if p.dim() > 1:
nn.init.xavier_uniform_(p)
def forward(self, src, mask, query_embed, pos_embed):
# flatten NxCxHxW to HWxNxC
bs, c, h, w = src.shape
src = src.flatten(2).permute(2, 0, 1)
pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
mask = mask.flatten(1)
tgt = torch.zeros_like(query_embed)
memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
pos=pos_embed, query_pos=query_embed)
return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
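To make the shape handling in forward concrete, here is a small standalone sketch of the same flatten/permute steps, assuming batch size 2, a 25x38 feature map, and 100 object queries (the same shapes as in the examples above):

import torch

bs, c, h, w = 2, 256, 25, 38
src = torch.randn(bs, c, h, w)                       # projected backbone feature
pos_embed = torch.randn(bs, c, h, w)                 # spatial positional encoding
query_embed = torch.randn(100, c)                    # learned object queries (nn.Embedding weight)
mask = torch.zeros(bs, h, w, dtype=torch.bool)       # padding mask, False = valid pixel

src = src.flatten(2).permute(2, 0, 1)                # [950, 2, 256]  (HW, N, C)
pos_embed = pos_embed.flatten(2).permute(2, 0, 1)    # [950, 2, 256]
query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)  # [100, 2, 256], shared across the batch
mask = mask.flatten(1)                               # [2, 950]
tgt = torch.zeros_like(query_embed)                  # decoder input starts from zeros
print(src.shape, query_embed.shape, mask.shape)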
The individual modules of the Transformer above look like this:
encoder_layer:
TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
self.encoder
TransformerEncoder(
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
)
)
decoder layer:
TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
self.decoder
TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
Training loop: engine.py
def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
data_loader: Iterable, optimizer: torch.optim.Optimizer,
device: torch.device, epoch: int, max_norm: float = 0):
model.train()
criterion.train()
metric_logger = utils.MetricLogger(delimiter=" ")
metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
metric_logger.add_meter('class_error', utils.SmoothedValue(window_size=1, fmt='{value:.2f}'))
header = 'Epoch: [{}]'.format(epoch)
print_freq = 10
for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
samples = samples.to(device) # samples.tensors.shape=[2, 3, 736, 920]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
outputs = model(samples) # outputs.keys(): ['pred_logits', 'pred_boxes', 'aux_outputs']
# outputs['pred_logits'].shape=[2, 100, 92], outputs['pred_boxes'].shape=[2, 100, 4]
# outputs['aux_outputs'][0]['pred_logits'].shape = [2, 100, 92]
# outputs['aux_outputs'][0]['pred_boxes'].shape = [2, 100, 4]
loss_dict = criterion(outputs, targets)
weight_dict = criterion.weight_dict
losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
# reduce losses over all GPUs for logging purposes
loss_dict_reduced = utils.reduce_dict(loss_dict)
loss_dict_reduced_unscaled = {f'{k}_unscaled': v
for k, v in loss_dict_reduced.items()}
loss_dict_reduced_scaled = {k: v * weight_dict[k]
for k, v in loss_dict_reduced.items() if k in weight_dict}
losses_reduced_scaled = sum(loss_dict_reduced_scaled.values())
loss_value = losses_reduced_scaled.item()
if not math.isfinite(loss_value):
print("Loss is {}, stopping training".format(loss_value))
print(loss_dict_reduced)
sys.exit(1)
optimizer.zero_grad()
losses.backward()
if max_norm > 0:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()
metric_logger.update(loss=loss_value, **loss_dict_reduced_scaled, **loss_dict_reduced_unscaled)
metric_logger.update(class_error=loss_dict_reduced['class_error'])
metric_logger.update(lr=optimizer.param_groups[0]["lr"])
# gather the stats from all processes
metric_logger.synchronize_between_processes()
print("Averaged stats:", metric_logger)
return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
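For reference, criterion.weight_dict pairs each loss term with its coefficient. Roughly, and as a sketch only (based on the repo's default coefficients of 1 / 5 / 2 for classification / L1 box / GIoU, with one extra copy of each term per auxiliary decoder layer):

weight_dict = {'loss_ce': 1, 'loss_bbox': 5, 'loss_giou': 2}
# auxiliary losses: the same terms for decoder layers 0..4 (the last layer uses the keys above)
aux_weight_dict = {f'{k}_{i}': v for i in range(5) for k, v in weight_dict.items()}
weight_dict.update(aux_weight_dict)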
The targets in the code above look like this (len(targets) = 2, i.e. batch size 2):
({'boxes': tensor([[0.4567, 0.4356, 0.3446, 0.2930],
[0.8345, 0.4459, 0.3310, 0.3111],
[0.4484, 0.0582, 0.3947, 0.1164],
[0.8436, 0.0502, 0.3128, 0.1005],
[0.7735, 0.5084, 0.0982, 0.0816],
[0.1184, 0.4742, 0.2369, 0.2107],
[0.4505, 0.4412, 0.3054, 0.2844],
[0.8727, 0.4563, 0.1025, 0.0892],
[0.1160, 0.0405, 0.2319, 0.0810],
[0.8345, 0.4266, 0.3310, 0.2843]]),
'labels': tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'image_id': tensor([382006]),
'area': tensor([45937.5547, 46857.6406, 20902.2129, 14303.7432, 3646.2932, 22717.0332, 39523.7812, 4161.9199, 8552.1719, 42826.0391]),
'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
'orig_size': tensor([429, 640]),
'size': tensor([711, 640])},
{'boxes': tensor([[0.7152, 0.3759, 0.0934, 0.0804],
[0.8129, 0.3777, 0.0576, 0.0562],
[0.7702, 0.3866, 0.0350, 0.0511],
[0.7828, 0.6463, 0.0743, 0.2580],
[0.8836, 0.5753, 0.1511, 0.2977],
[0.9162, 0.6202, 0.0880, 0.3273],
[0.8424, 0.3788, 0.0254, 0.0398],
[0.9716, 0.3712, 0.0569, 0.0707],
[0.0615, 0.4210, 0.0242, 0.0645],
[0.8655, 0.3775, 0.0398, 0.0368],
[0.8884, 0.3701, 0.0349, 0.0329],
[0.9365, 0.3673, 0.0144, 0.0221],
[0.5147, 0.1537, 0.0220, 0.0541],
[0.9175, 0.1185, 0.0294, 0.0438],
[0.0675, 0.0934, 0.0223, 0.0596],
[0.9125, 0.3683, 0.0185, 0.0347],
[0.9905, 0.3934, 0.0191, 0.0373],
[0.5370, 0.1577, 0.0211, 0.0553]]),
'labels': tensor([2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'image_id': tensor([565286]),
'area': tensor([ 3843.2710, 1264.0637, 753.2280, 8374.9131, 11725.9863, 11407.3760, 373.8589, 1965.6881, 649.5052, 625.6856, 483.2297, 117.1264, 733.8372, 727.9525, 758.3719, 307.3998, 365.8919, 712.7703]),
'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
'orig_size': tensor([512, 640]),
'size': tensor([736, 920])})