__version__ = '8.0.110'
Prior knowledge
Unified notation
a = na = 3hw = h*w = num_anchor = 8400 = 80*80 + 40*40 + 20*20. Since YOLOv8 is anchor-free, this is simply the total number of anchor points (num_total_anchor).
batch size: bs = b = 32
lhw = 80*80, mhw = 40*40, shw = 20*20 (the three feature-map sizes)
n_max_boxes = max_num_obj = 22, roughly max(batch.len(label))
len(label) is the number of labels of one image in the batch, and max(batch.len(label)) picks the largest of these values.
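To make the notation above concrete, here is a minimal sketch (not the ultralytics source; the name sketch_make_anchors and the 640x640 input with strides 8/16/32 are assumptions for this example) showing where 8400 comes from and roughly what make_anchors returns:

import torch

def sketch_make_anchors(img_size=640, strides=(8, 16, 32), offset=0.5):
    anchor_points, stride_tensor = [], []
    for s in strides:
        h = w = img_size // s                              # 80, 40, 20
        sx = torch.arange(w, dtype=torch.float) + offset   # cell-center x, in feature-map units
        sy = torch.arange(h, dtype=torch.float) + offset   # cell-center y
        sy, sx = torch.meshgrid(sy, sx, indexing='ij')
        anchor_points.append(torch.stack((sx, sy), -1).view(-1, 2))
        stride_tensor.append(torch.full((h * w, 1), float(s)))
    return torch.cat(anchor_points), torch.cat(stride_tensor)

points, strides = sketch_make_anchors()
print(points.shape, strides.shape)  # torch.Size([8400, 2]) torch.Size([8400, 1])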
Loss function code walkthrough
'''
Logic:
1. Concatenate the three prediction feature maps and split them, giving pred_scores (bs, 3hw, 2) and pred_distri (bs, 3hw, 64)
2. Generate anchor_points [3hw, 2] and stride_tensor [3hw, 1]
3. Generate gt_labels and gt_bboxes, both zero-padded
4. Decode pred_distri into xyxy, giving pred_bboxes (bs, 3hw, 4)
5. Use TAL to assign positive/negative samples, producing target_bboxes and target_scores
   (both aligned one-to-one with the predicted bboxes/scores); fg_mask is the final assignment result
6. Compute cls_loss and box_loss
'''
# Criterion class for computing training losses
class Loss:

    def __init__(self, model):  # model must be de-paralleled
        device = next(model.parameters()).device  # get model device
        h = model.args  # hyperparameters

        m = model.model[-1]  # Detect() module
        self.bce = nn.BCEWithLogitsLoss(reduction='none')
        self.hyp = h
        self.stride = m.stride  # model strides
        self.nc = m.nc  # number of classes
        self.no = m.no  # the n in each head's output [b, n, h, w]: 2 (class) + 4 (box) * 16 (reg_max) = 66
        self.reg_max = m.reg_max  # dfl reg_max
        self.device = device

        self.use_dfl = m.reg_max > 1

        self.assigner = TaskAlignedAssigner(topk=10, num_classes=self.nc, alpha=0.5, beta=6.0)
        self.bbox_loss = BboxLoss(m.reg_max - 1, use_dfl=self.use_dfl).to(device)
        self.proj = torch.arange(m.reg_max, dtype=torch.float, device=device)

    def preprocess(self, targets, batch_size, scale_tensor):
        """
        Build a tensor composed of targets padded with zeros.
        counts is the number of labels per image in the batch; the maximum count determines out,
        an all-zero tensor of shape [Batch, max_labels, 5].
        The labels (cls, xyxy) of each image are copied into out[..., :5]: out[..., 0] is cls, out[..., 1:5] is xyxy.
        For images with fewer than max_labels labels, the remaining rows out[B, len(targets):, :] stay all zeros.
        Args:
            targets: torch.Size([na, 6])
            batch_size: int
            scale_tensor: tensor([640, 640, 640, 640])
        return:
            out: torch.Size([Batch, max_labels, 5])
        """
        """Preprocesses the target counts and matches with the input batch size to output a tensor."""
        if targets.shape[0] == 0:
            out = torch.zeros(batch_size, 0, 5, device=self.device)
        else:
            i = targets[:, 0]  # image index
            _, counts = i.unique(return_counts=True)  # number of occurrences of each image index, torch.Size([32])
            counts = counts.to(dtype=torch.int32)
            out = torch.zeros(batch_size, counts.max(), 5, device=self.device)
            for j in range(batch_size):
                matches = i == j
                n = matches.sum()
                if n:
                    # copy the labels (cls, xyxy) of the matching image into out[..., :5];
                    # out[..., 0] is cls, out[..., 1:5] is xyxy
                    # j is the image index, n is its number of labels, out[j, n:] stays all zeros
                    out[j, :n] = targets[matches, 1:]
            # scale_tensor: [640, 640, 640, 640]
            # mul_ multiplies out[..., 1:5] element-wise by scale_tensor
            out[..., 1:5] = xywh2xyxy(out[..., 1:5].mul_(scale_tensor))
        return out

    def bbox_decode(self, anchor_points, pred_dist):
        # bbox_decode decodes the actual xyxy box coordinates from the predicted distribution (pred_dist) and the anchor points.
        """Decode predicted object bounding box coordinates from anchor points and distribution."""
        if self.use_dfl:
            b, a, c = pred_dist.shape  # batch, anchors, channels
            # before the matmul, pred_dist has shape [b, a, 4, 16] and self.proj has shape [16]; the result has shape [b, a, 4]
            # if this is unclear, try a = torch.ones((1, 3, 4, 16)) matmul'd with b = torch.rand(16)
            pred_dist = pred_dist.view(b, a, 4, c // 4).softmax(3).matmul(self.proj.type(pred_dist.dtype))
            # pred_dist = pred_dist.view(b, a, c // 4, 4).transpose(2,3).softmax(3).matmul(self.proj.type(pred_dist.dtype))
            # pred_dist = (pred_dist.view(b, a, c // 4, 4).softmax(2) * self.proj.type(pred_dist.dtype).view(1, 1, -1, 1)).sum(2)
        return dist2bbox(pred_dist, anchor_points, xywh=False)

    # assume a two-class detector ['flame', 'smoke']
    # if the dataset uses mosaic augmentation, the label count per batch_idx no longer matches the label count of the image in im_file
    # list((32, )) denotes a list; (32, ) is shape-like notation
    # stride: the downsampling factor of each head's feature map relative to the model input image
    def __call__(self, preds, batch):
        """
        Args:
            preds: [tensor(Size([32, 66, 80, 80])), tensor(Size([32, 66, 40, 40])), tensor(Size([32, 66, 20, 20]))]
            batch: {'im_file': list((32, )), 'ori_shape': list((32, )), 'resized_shape': list((32, )),
                    'img': tensor(Size[32, 3, 640, 640]), 'cls': tensor(Size[211, 1]),
                    'bboxes': tensor(Size[211, 4]), 'batch_idx': Size[211]}
        """
        """Calculate the sum of the loss for box, cls and dfl multiplied by batch size."""
        loss = torch.zeros(3, device=self.device)  # box, cls, dfl
        feats = preds[1] if isinstance(preds, tuple) else preds  # preds is a list, so feats = preds
        # feats[0].shape[0]: bs    self.no: 66
        # xi: tensor(Size[bs, 66, w*h]); the three heads' feature maps are 20, 40, 80 (large, medium, small objects)
        # pred_distri: tensor(Size[bs, 64, 8400])    pred_scores: tensor(Size[bs, 2, 8400])
        # pred_distri holds the box-distance distributions (decoded to box coordinates later); pred_scores holds the classification scores
        pred_distri, pred_scores = torch.cat([xi.view(feats[0].shape[0], self.no, -1) for xi in feats], 2).split(
            (self.reg_max * 4, self.nc), 1)

        pred_scores = pred_scores.permute(0, 2, 1).contiguous()  # (bs, 3hw, 2)
        pred_distri = pred_distri.permute(0, 2, 1).contiguous()  # (bs, 3hw, 64)

        dtype = pred_scores.dtype  # dtype: torch.float16
        batch_size = pred_scores.shape[0]  # batch_size: 32
        imgsz = torch.tensor(feats[0].shape[2:], device=self.device, dtype=dtype) * self.stride[0]  # image size (h,w), imgsz: [640, 640]
        # anchor_points: torch.Size([8400, 2]); the anchor center points of the three feature maps 80*80, 40*40, 20*20
        # stride_tensor: torch.Size([8400, 1]); the stride corresponding to each center point
        anchor_points, stride_tensor = make_anchors(feats, self.stride, 0.5)

        # targets
        # targets: shape [na, 6]; here na is the total number of labels across the 32 mosaic'd images, 6 is (batch_idx, cls, xyxy)
        targets = torch.cat((batch['batch_idx'].view(-1, 1), batch['cls'].view(-1, 1), batch['bboxes']), 1)
        # targets: torch.Size([bs, n_max_boxes, 5]); here 22 is the maximum label count among the 32 images
        targets = self.preprocess(targets.to(self.device), batch_size, scale_tensor=imgsz[[1, 0, 1, 0]])
        # gt_labels: torch.Size([bs, n_max_boxes, 1]), gt_bboxes: torch.Size([bs, n_max_boxes, 4])
        gt_labels, gt_bboxes = targets.split((1, 4), 2)  # cls, xyxy
        # mask_gt: torch.Size([bs, n_max_boxes, 1])
        # marks whether a gt exists: sum the xyxy values, then gt_(0) sets entries greater than 0 to 1, otherwise 0
        mask_gt = gt_bboxes.sum(2, keepdim=True).gt_(0)

        # pboxes
        # decode pred_distri into box coordinates, pred_bboxes: torch.Size([bs, a, 4])
        pred_bboxes = self.bbox_decode(anchor_points, pred_distri)  # xyxy, (bs, h*w, 4)

        # fg_mask: torch.Size([bs, h*w]); True marks positive samples, False marks negatives
        # target_bboxes: torch.Size([bs, h*w, 4]), aligned one-to-one with pred_bboxes
        # target_bboxes still contains negatives among the h*w entries; negatives do not join the bbox loss and are handled later
        # target_scores: shape (bs, h*w, num_class); for negatives all num_class values are 0
        # e.g. bs=1, h*w=2, num_class=2, target_scores: tensor([[0, 0], [0, 1]]) (the 1 may actually be a fraction)
        # [0, 0] means pd0 is a negative; [0, 1] means pd1 is a positive matched to gt1,
        # i.e. for the two classes {'0': flame, '1': smoke} it corresponds to '1': smoke
        _, target_bboxes, target_scores, fg_mask, _ = self.assigner(
            pred_scores.detach().sigmoid(), (pred_bboxes.detach() * stride_tensor).type(gt_bboxes.dtype),
            anchor_points * stride_tensor, gt_labels, gt_bboxes, mask_gt)

        # used when averaging the classification loss
        # target_scores_sum: tensor(734.898, device='cuda:0')
        # the sum has a fractional part because target_scores was multiplied by a dynamic weight at the end of the assignment
        target_scores_sum = max(target_scores.sum(), 1)

        # cls loss
        # loss[1] = self.varifocal_loss(pred_scores, target_scores, target_labels) / target_scores_sum  # VFL way
        # BCE loss: both positives and negatives take part in the classification loss
        # see https://blog.csdn.net/shilichangtin/article/details/135185583 for details
        # dividing by target_scores_sum averages the loss
        loss[1] = self.bce(pred_scores, target_scores.to(dtype)).sum() / target_scores_sum  # BCE

        # bbox loss
        # only positive samples take part in the bbox loss
        if fg_mask.sum():
            target_bboxes /= stride_tensor
            loss[0], loss[2] = self.bbox_loss(pred_distri, pred_bboxes, anchor_points, target_bboxes, target_scores,
                                              target_scores_sum, fg_mask)

        loss[0] *= self.hyp.box  # box gain
        loss[1] *= self.hyp.cls  # cls gain
        loss[2] *= self.hyp.dfl  # dfl gain

        return loss.sum() * batch_size, loss.detach()  # loss(box, cls, dfl)
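The comment inside bbox_decode suggests playing with a toy matmul; here is a small standalone illustration of that softmax-then-matmul expectation trick (all tensor names below are made up for the example):

import torch

b, a, reg_max = 1, 3, 16
pred_dist = torch.randn(b, a, 4 * reg_max)            # one 16-bin distribution per box edge
proj = torch.arange(reg_max, dtype=torch.float)       # [0, 1, ..., 15]

probs = pred_dist.view(b, a, 4, reg_max).softmax(3)   # normalize each edge's 16 bins
dist = probs.matmul(proj)                              # expectation over the bins -> [b, a, 4]
print(dist.shape)                                      # torch.Size([1, 3, 4])

# the same result written explicitly as a weighted sum
dist_manual = (probs * proj).sum(-1)
print(torch.allclose(dist, dist_manual))               # True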
Bbox loss
class BboxLoss(nn.Module):

    def __init__(self, reg_max, use_dfl=False):
        """Initialize the BboxLoss module with regularization maximum and DFL settings."""
        super().__init__()
        self.reg_max = reg_max
        self.use_dfl = use_dfl

    def forward(self, pred_dist, pred_bboxes, anchor_points, target_bboxes, target_scores, target_scores_sum, fg_mask):
        """IoU loss."""
        # weight: torch.Size([1700, 1])
        weight = target_scores.sum(-1)[fg_mask].unsqueeze(-1)
        # iou: torch.Size([1700, 1])
        # only the positive samples take part in the box loss
        iou = bbox_iou(pred_bboxes[fg_mask], target_bboxes[fg_mask], xywh=False, CIoU=True)
        # analogous to averaging the classification loss; note again that target_scores may be fractional, so weight is fractional too
        loss_iou = ((1.0 - iou) * weight).sum() / target_scores_sum

        # DFL loss
        # DFL loss paper explanation: https://blog.csdn.net/shilichangtin/article/details/135505430
        if self.use_dfl:
            target_ltrb = bbox2dist(anchor_points, target_bboxes, self.reg_max)
            loss_dfl = self._df_loss(pred_dist[fg_mask].view(-1, self.reg_max + 1), target_ltrb[fg_mask]) * weight
            loss_dfl = loss_dfl.sum() / target_scores_sum
        else:
            loss_dfl = torch.tensor(0.0).to(pred_dist.device)

        return loss_iou, loss_dfl
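BboxLoss.forward calls self._df_loss, which is not shown in this post. Roughly, it does the following (a sketch reconstructed from memory of ultralytics 8.0.x, written as a standalone function here, so verify against the source): each continuous target distance is split between its two neighbouring integer bins, and a cross-entropy is taken against each bin, weighted by how close the target is to it.

import torch
import torch.nn.functional as F

def df_loss_sketch(pred_dist, target):
    """Distribution Focal Loss: CE against the two integer bins surrounding each continuous target."""
    tl = target.long()            # left integer bin
    tr = tl + 1                   # right integer bin
    wl = tr - target              # weight of the left bin (larger when the target is closer to it)
    wr = 1 - wl                   # weight of the right bin
    return (F.cross_entropy(pred_dist, tl.view(-1), reduction='none').view(tl.shape) * wl +
            F.cross_entropy(pred_dist, tr.view(-1), reduction='none').view(tl.shape) * wr).mean(-1, keepdim=True)

# pred_dist: one 16-bin distribution per edge of each positive anchor; target: continuous ltrb distances
print(df_loss_sketch(torch.randn(3 * 4, 16), torch.rand(3, 4) * 14.99).shape)  # torch.Size([3, 1])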
DFL loss explained
Detection head code
class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""
    dynamic = False  # force grid reconstruction
    export = False  # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):  # detection layer
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], self.nc)  # channels
        # bbox branch
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch)
        # cls branch
        self.cv3 = nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
        # DFL: converts the probability distribution into offsets from the anchor points to the box edges
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        shape = x[0].shape  # BCHW
        for i in range(self.nl):
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
        # during training, x is returned directly
        if self.training:
            return x
        elif self.dynamic or self.shape != shape:
            self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
            self.shape = shape

        x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
        if self.export and self.format in ('saved_model', 'pb', 'tflite', 'edgetpu', 'tfjs'):  # avoid TF FlexSplitV ops
            box = x_cat[:, :self.reg_max * 4]
            cls = x_cat[:, self.reg_max * 4:]
        else:
            box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)
        dbox = dist2bbox(self.dfl(box), self.anchors.unsqueeze(0), xywh=True, dim=1) * self.strides

        y = torch.cat((dbox, cls.sigmoid()), 1)
        return y if self.export else (y, x)
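Both the head above and bbox_decode in the loss call dist2bbox. As a rough sketch of what it does (the name dist2bbox_sketch is made up for illustration; check the ultralytics source for the exact implementation), it turns (left, top, right, bottom) distances plus anchor points into xyxy or xywh boxes:

import torch

def dist2bbox_sketch(distance, anchor_points, xywh=True, dim=-1):
    """Convert (l, t, r, b) distances from anchor points into boxes."""
    lt, rb = distance.chunk(2, dim)        # split into (left, top) and (right, bottom)
    x1y1 = anchor_points - lt              # top-left corner
    x2y2 = anchor_points + rb              # bottom-right corner
    if xywh:
        c_xy = (x1y1 + x2y2) / 2           # box center
        wh = x2y2 - x1y1                   # box width/height
        return torch.cat((c_xy, wh), dim)
    return torch.cat((x1y1, x2y2), dim)    # xyxy, as used in the loss

# an anchor at (10.5, 10.5) with distances (2, 3, 4, 5) -> box (8.5, 7.5, 14.5, 15.5)
print(dist2bbox_sketch(torch.tensor([[2., 3., 4., 5.]]), torch.tensor([[10.5, 10.5]]), xywh=False))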
Explanation of the line self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()
class DFL(nn.Module):
    """
    Integral module of Distribution Focal Loss (DFL).
    Proposed in Generalized Focal Loss https://ieeexplore.ieee.org/document/9792391
    """

    def __init__(self, c1=16):
        """Initialize a convolutional layer with a given number of input channels."""
        super().__init__()
        # build a conv2d whose fixed weights are [0, 1, ..., 7, 8, ..., 15] (since reg_max is 16)
        self.conv = nn.Conv2d(c1, 1, 1, bias=False).requires_grad_(False)
        x = torch.arange(c1, dtype=torch.float)
        self.conv.weight.data[:] = nn.Parameter(x.view(1, c1, 1, 1))
        self.c1 = c1

    def forward(self, x):
        """Applies a transformer layer on input tensor 'x' and returns a tensor."""
        b, c, a = x.shape  # batch, channels, anchors
        # convert the probability distribution into offsets from the anchor points to the box edges
        return self.conv(x.view(b, 4, self.c1, a).transpose(2, 1).softmax(1)).view(b, 4, a)
        # return self.conv(x.view(b, self.c1, 4, a).softmax(1)).view(b, 4, a)
What the self.dfl above computes is essentially
\hat{y}=\sum_{i=0}^{n}P(y_i)y_i=\sum_{i=0}^{15}P(y_i)y_i=0\times{P(y_0)}+1\times{P(y_1)}+\dots+7\times{P(y_7)}+8\times{P(y_8)}+\dots+15\times{P(y_{15})}
This $\hat{y}$ is the offset from an anchor point to a bbox edge.
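A quick numeric check of the DFL module (purely illustrative numbers): feed a distribution that puts all its weight on bins 3 and 4 of one edge, and the module outputs their average, 3.5, which is exactly the expectation above.

import torch
import torch.nn as nn

c1, b, a = 16, 1, 1                        # reg_max bins, 1 image, 1 anchor
conv = nn.Conv2d(c1, 1, 1, bias=False).requires_grad_(False)
conv.weight.data[:] = torch.arange(c1, dtype=torch.float).view(1, c1, 1, 1)

# logits for 4 edges x 16 bins; make the first edge prefer bins 3 and 4 equally
x = torch.full((b, 4 * c1, a), -1e9)
x[0, 3, 0] = 0.0                           # bin 3 of the first edge
x[0, 4, 0] = 0.0                           # bin 4 of the first edge

probs = x.view(b, 4, c1, a).transpose(2, 1).softmax(1)   # softmax over the 16 bins
print(conv(probs).view(b, 4, a)[0, 0, 0])                # ~3.5 = 0.5*3 + 0.5*4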
Training process
Computation flow:
1. During training, the detection head class Detect(nn.Module) directly returns the raw distributions x (which, once the three levels are flattened and concatenated in the loss, has shape [bs, num_anchor, 4 * reg_max + cls_num]; num_anchor is the total number of anchor points across the three detection layers, normally 80*80 + 40*40 + 20*20 = 8400).
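To make the shape bookkeeping of step 1 concrete, here is a standalone toy run (random tensors stand in for the three head outputs; nc is assumed to be 2 as in the flame/smoke example):

import torch

bs, nc, reg_max = 32, 2, 16
no = nc + 4 * reg_max                                           # 66
feats = [torch.randn(bs, no, s, s) for s in (80, 40, 20)]       # training-mode head outputs

x_cat = torch.cat([xi.view(bs, no, -1) for xi in feats], 2)     # (32, 66, 8400)
pred_distri, pred_scores = x_cat.split((4 * reg_max, nc), 1)    # (32, 64, 8400), (32, 2, 8400)
pred_scores = pred_scores.permute(0, 2, 1).contiguous()         # (32, 8400, 2)
pred_distri = pred_distri.permute(0, 2, 1).contiguous()         # (32, 8400, 64)
print(pred_scores.shape, pred_distri.shape)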
2. Then target_box is converted to target_dist, i.e. the gt box is converted into the distances from the anchor points to its four edges.
target_ltrb = bbox2dist(anchor_points, target_bboxes, self.reg_max)
def bbox2dist(anchor_points, bbox, reg_max):
"""Transform bbox(xyxy) to dist(ltrb)."""
x1y1, x2y2 = bbox.chunk(2, -1)
    # anchor_points - x1y1: distance from the anchor points to the left-top corner
    # x2y2 - anchor_points: distance from the anchor points to the right-bottom corner
return torch.cat((anchor_points - x1y1, x2y2 - anchor_points), -1).clamp_(0, reg_max - 0.01) # dist (lt, rb)
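A small worked example of step 2 (arbitrary numbers): an anchor point at (12.5, 20.5) in feature-map units and a gt box (10, 18, 30, 40) give ltrb distances (2.5, 2.5, 17.5, 19.5), and the clamp caps the last two just below reg_max (here reg_max is 15, i.e. m.reg_max - 1 as passed into BboxLoss):

import torch

def bbox2dist_sketch(anchor_points, bbox, reg_max):
    """Same logic as bbox2dist above, repeated so the example is standalone."""
    x1y1, x2y2 = bbox.chunk(2, -1)
    return torch.cat((anchor_points - x1y1, x2y2 - anchor_points), -1).clamp_(0, reg_max - 0.01)

anchor = torch.tensor([[12.5, 20.5]])            # anchor center, feature-map units
gt = torch.tensor([[10., 18., 30., 40.]])        # gt box (xyxy), already divided by the stride
print(bbox2dist_sketch(anchor, gt, reg_max=15))  # tensor([[ 2.5,  2.5, 14.99, 14.99]]) -- r and b are clamped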
3. Compute the DFL loss
# pred_dist[fg_mask]: compute only for the positive samples
# view(-1, self.reg_max + 1): the + 1 is needed because BboxLoss was initialized with m.reg_max - 1, which already subtracted 1
loss_dfl = self._df_loss(pred_dist[fg_mask].view(-1, self.reg_max + 1), target_ltrb[fg_mask]) * weight
Note that pred_dist[fg_mask].view(-1, self.reg_max + 1) has a shape like [num_pos * 4, reg_max], because F.cross_entropy expects an [N, C] input, i.e. one set of bin scores per edge. For a detailed code walkthrough see the DFL loss part of the paper-explanation post.
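To see how a fractional target distance is handled inside that cross-entropy, here is a toy single-edge example (numbers chosen arbitrarily): a target of 2.6 is split between bins 2 and 3 with weights 0.4 and 0.6.

import torch
import torch.nn.functional as F

reg_max = 16
pred = torch.randn(1, reg_max)        # logits of one edge's 16-bin distribution
target = torch.tensor([2.6])          # continuous distance target for that edge

tl = target.long()                    # left bin: 2
tr = tl + 1                           # right bin: 3
wl = tr - target                      # 0.4 -> weight of bin 2
wr = 1 - wl                           # 0.6 -> weight of bin 3

loss = (F.cross_entropy(pred, tl, reduction='none') * wl +
        F.cross_entropy(pred, tr, reduction='none') * wr)
print(loss)                           # loss for this single edge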
Prediction process
1. The model first produces a probability distribution over reg_max (default 16) bins corresponding to $\{0, 1, \dots, 7, 8, \dots, 14, 15\}$; these values are the candidate distances from an anchor point to a bbox edge. Suppose the model predicts $\{0.01, 0.05, \dots, 0.12, 0.23, \dots, 0.01, 0.34\}$: this means the probability that the distance from the anchor point to the bbox edge is 0 equals 0.01, and the probability that the distance is 15 equals 0.34.
2. Then the self.dfl from the detection head code above computes the expectation $\hat{y}$ of the distance from the anchor point to the bbox edge; this $\hat{y}$ is the model's final predicted distance from the anchor point to the bbox edge. The expectation is at most 15, i.e. the largest distance (in feature-map units) the model can predict from an anchor point to a bbox edge is 15.
YOLOv5 used BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['cls_pw']], device=device)), which defaults to reduction='mean', so the loss is averaged immediately after being computed.
When YOLOv8's detection Loss is initialized, the code is self.bce = nn.BCEWithLogitsLoss(reduction='none') (for how reduction='none' behaves, see my torch.nn.BCEWithLogitsLoss usage post), so the loss has to be averaged manually by dividing by target_scores_sum. Only positives contribute to this denominator, since the scores of negatives are all 0 and the sum effectively excludes them, whereas self.bce(pred_scores, target_scores.to(dtype)) produces values for both positives and negatives, so both take part in the .sum(); this seems somewhat unfair, and I still don't fully understand it (YOLOv5, with reduction='mean', simply divides by the total element count). After reading the GFL paper, I found this division by target_scores_sum is mentioned there too (see the last part of my paper-explanation post), except that GFL does not divide by the sum of target_scores but directly by the number of positive samples $N_{pos}$.
The later box_loss computation also uses target_scores_sum and target_scores for averaging, just like cls_loss (only positive samples take part in the box loss, so the concern above does not arise there).
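A small comparison of the two normalizations discussed above, with toy numbers (not real training values):

import torch
import torch.nn as nn

pred_scores = torch.randn(2, 5, 2)                 # (bs, num_anchor, nc) logits
target_scores = torch.zeros(2, 5, 2)
target_scores[0, 1, 1] = 0.7                       # one positive with a soft (TAL-weighted) target
target_scores[1, 3, 0] = 0.4                       # another positive

bce_none = nn.BCEWithLogitsLoss(reduction='none')
target_scores_sum = max(target_scores.sum(), 1)    # 1.1 -> denominator built from positives only
loss_v8 = bce_none(pred_scores, target_scores).sum() / target_scores_sum

bce_mean = nn.BCEWithLogitsLoss(reduction='mean')  # YOLOv5-style: divide by the total element count
loss_v5_style = bce_mean(pred_scores, target_scores)
print(loss_v8, loss_v5_style)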