1.1 Traditional Object Detection Algorithms


1.2 Detic Object Detection

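The current bottleneck in object detection: further gains in detector performance are limited by the amount of available data. Conventional object detection needs large-scale annotated training data to perform well, and that data is expensive and time-consuming to collect.
Advantages of image classification data: compared with detection data, image classification datasets are usually much larger and easier to obtain. They cover a far broader set of categories, which makes it possible to reason over a much larger vocabulary.
Detic's solution: to overcome this data limitation, the authors propose Detic, a new way of training object detectors. Its core idea is to use image classification datasets to train the detector's classification head.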

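  • Simplicity: Detic's most distinctive feature is its simplicity. It is easier to implement than earlier weakly supervised methods of the same kind.
  • Broad applicability: earlier methods usually needed a complex prediction and box-class assignment process and could only be trained on specific detector architectures. Detic is easy to implement and works with most detection architectures and backbones.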

1.3 Overview of the Detic Algorithm


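At present, annotated data for object detection is far scarcer than for image classification. LVIS (Large Vocabulary Instance Segmentation), for example, contains about 120,000 images covering more than 1,000 categories, and OpenImages has 1.8 million images with more than 500 categories. By contrast, ImageNet, an early image classification dataset, already contained more than 21,000 categories and 14 million images ten years ago.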

1.4 Problems Identified by Detic and Their Solutions




1.5 Detic's Loss Function


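  • Traditional methods usually rely on a prediction-based label-to-box assignment.
  • They first select a set of proposals, then assign each image-level class label (such as person or basketball) to those proposals.
  • Because there is no region-level supervision, this assignment easily introduces errors, especially with overlapping objects or cluttered scenes.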
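
To make the prediction-based assignment above concrete, here is a minimal sketch of one simple variant: each image-level label goes to the proposal whose current predicted score for that class is highest. The function name and the toy scores are illustrative assumptions, and real weakly supervised detectors differ in the details.

```python
def assign_by_prediction(proposal_scores, image_labels):
    """Assign each image-level label to the proposal that scores highest for it.

    proposal_scores: one list of per-class scores per proposal.
    image_labels: class indices known to be present in the image.
    Returns a dict {class_index: proposal_index}.
    """
    assignment = {}
    for cls in image_labels:
        best = max(range(len(proposal_scores)), key=lambda i: proposal_scores[i][cls])
        assignment[cls] = best
    return assignment


# Toy example: 3 proposals, 2 classes (hypothetical scores).
scores = [
    [0.2, 0.7],  # proposal 0
    [0.6, 0.1],  # proposal 1
    [0.3, 0.4],  # proposal 2
]
assignment = assign_by_prediction(scores, image_labels=[0, 1])
print(assignment)  # {0: 1, 1: 0}
```

The assignment depends entirely on the current scores, so an inaccurate detector early in training produces wrong assignments. That is the error source the list above describes.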


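  • Detic takes a completely different approach: it picks the proposal with the largest area, which usually covers almost the whole image.
  • It then assigns the image-level class labels to that largest proposal.
  • The key point is that Detic no longer depends on prediction-based, per-proposal label assignment. It treats the image as a whole and labels it accordingly.
  • This removes the error-prone label-to-box assignment step, simplifies training, and improves performance, particularly when detecting novel categories.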


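On classification images the loss is $L_{\text{max-size}} = \mathrm{BCE}(W f_j, c)$ with $j = \arg\max_j \mathrm{size}(b_j)$, where $f_j$ is the RoI feature of the largest proposal $b_j$, $c$ is the class assigned to that proposal (that is, the image-level class), and $W$ is the classifier weight. Adding the losses used by a conventional object detector on box-annotated images gives Detic's final loss.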
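
The loss described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the helper names, the toy boxes, features and weights, and the choice to average the binary cross-entropy over classes are all assumptions.

```python
import math


def box_area(box):
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def max_size_loss(boxes, roi_features, W, image_labels):
    """Binary cross-entropy between the largest proposal's logits and the image labels.

    boxes: list of (x0, y0, x1, y1) proposals.
    roi_features: one feature vector f per proposal.
    W: classifier weights, one row per class.
    image_labels: set of class indices present in the image.
    Returns (index of the largest proposal, loss averaged over classes).
    """
    j = max(range(len(boxes)), key=lambda i: box_area(boxes[i]))
    f = roi_features[j]
    loss = 0.0
    for cls, w in enumerate(W):
        logit = sum(wi * fi for wi, fi in zip(w, f))
        p = sigmoid(logit)
        target = 1.0 if cls in image_labels else 0.0
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return j, loss / len(W)


# Hypothetical toy data: 3 proposals, 2-dimensional features, 2 classes.
boxes = [(10, 10, 50, 50), (0, 0, 200, 150), (30, 40, 90, 120)]
features = [[1.0, 0.0], [0.5, -0.5], [0.0, 1.0]]
W = [[2.0, 0.0], [0.0, 2.0]]
j, loss = max_size_loss(boxes, features, W, image_labels={0})
print(j, loss)  # j is 1: the second box has the largest area
```

Only the classification head receives gradient from this term. Box regression and the RPN are still trained on box-annotated detection images.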



import argparse
import os
from typing import Dict, List, Tuple
import torch
from torch import Tensor, nn

import detectron2.data.transforms as T
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import build_detection_test_loader, detection_utils
from detectron2.evaluation import COCOEvaluator, inference_on_dataset, print_csv_format
from detectron2.export import (
    STABLE_ONNX_OPSET_VERSION,
    TracingAdapter,
    dump_torchscript_IR,
    scripting_with_instances,
)
from detectron2.modeling import GeneralizedRCNN, RetinaNet, build_model
from detectron2.modeling.postprocessing import detector_postprocess
from detectron2.projects.point_rend import add_pointrend_config
from detectron2.structures import Boxes
from detectron2.utils.env import TORCH_VERSION
from detectron2.utils.file_io import PathManager
from detectron2.utils.logger import setup_logger

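def setup_cfg(args):
    cfg = get_cfg()
    # cuda context is initialized before creating dataloader, so we don't fork anymore
    cfg.DATALOADER.NUM_WORKERS = 0
    add_pointrend_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    return cfg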

def export_caffe2_tracing(cfg, torch_model, inputs):
    from detectron2.export import Caffe2Tracer

    tracer = Caffe2Tracer(cfg, torch_model, inputs)
    if args.format == "caffe2":
        caffe2_model = tracer.export_caffe2()
        # draw the caffe2 graph
        caffe2_model.save_graph(os.path.join(args.output, "model.svg"), inputs=inputs)
        return caffe2_model
    elif args.format == "onnx":
        import onnx

        onnx_model = tracer.export_onnx()
        onnx.save(onnx_model, os.path.join(args.output, "model.onnx"))
    elif args.format == "torchscript":
        ts_model = tracer.export_torchscript()
        with PathManager.open(os.path.join(args.output, "model.ts"), "wb") as f:
            torch.jit.save(ts_model, f)
        dump_torchscript_IR(ts_model, args.output)

# experimental. API not yet final
def export_scripting(torch_model):
    assert TORCH_VERSION >= (1, 8)
    fields = {
        "proposal_boxes": Boxes,
        "objectness_logits": Tensor,
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
        "pred_masks": Tensor,
        "pred_keypoints": torch.Tensor,
        "pred_keypoint_heatmaps": torch.Tensor,
    }
    assert args.format == "torchscript", "Scripting only supports torchscript format."

    class ScriptableAdapterBase(nn.Module):
        # Use this adapter to workaround https://github.com/pytorch/pytorch/issues/46944
        # by not retuning instances but dicts. Otherwise the exported model is not deployable
        def __init__(self):
            super().__init__()  # required before assigning submodules to an nn.Module
            self.model = torch_model
            self.eval()

    if isinstance(torch_model, GeneralizedRCNN):

        class ScriptableAdapter(ScriptableAdapterBase):
            def forward(self, inputs: Tuple[Dict[str, torch.Tensor]]) -> List[Dict[str, Tensor]]:
                instances = self.model.inference(inputs, do_postprocess=False)
                return [i.get_fields() for i in instances]

    else:

        class ScriptableAdapter(ScriptableAdapterBase):
            def forward(self, inputs: Tuple[Dict[str, torch.Tensor]]) -> List[Dict[str, Tensor]]:
                instances = self.model(inputs)
                return [i.get_fields() for i in instances]

    ts_model = scripting_with_instances(ScriptableAdapter(), fields)
    with PathManager.open(os.path.join(args.output, "model.ts"), "wb") as f:
        torch.jit.save(ts_model, f)
    dump_torchscript_IR(ts_model, args.output)
    # TODO inference in Python now missing postprocessing glue code
    return None

# experimental. API not yet final
def export_tracing(torch_model, inputs):
    assert TORCH_VERSION >= (1, 8)
    image = inputs[0]["image"]
    inputs = [{"image": image}]  # remove other unused keys

    if isinstance(torch_model, GeneralizedRCNN):

        def inference(model, inputs):
            # use do_postprocess=False so it returns ROI mask
            inst = model.inference(inputs, do_postprocess=False)[0]
            return [{"instances": inst}]

    else:
        inference = None  # assume that we just call the model directly

    traceable_model = TracingAdapter(torch_model, inputs, inference)

    if args.format == "torchscript":
        ts_model = torch.jit.trace(traceable_model, (image,))
        with PathManager.open(os.path.join(args.output, "model.ts"), "wb") as f:
            torch.jit.save(ts_model, f)
        dump_torchscript_IR(ts_model, args.output)
    elif args.format == "onnx":
        with PathManager.open(os.path.join(args.output, "model.onnx"), "wb") as f:
            torch.onnx.export(traceable_model, (image,), f, opset_version=STABLE_ONNX_OPSET_VERSION)
    logger.info("Inputs schema: " + str(traceable_model.inputs_schema))
    logger.info("Outputs schema: " + str(traceable_model.outputs_schema))

    if args.format != "torchscript":
        return None
    if not isinstance(torch_model, (GeneralizedRCNN, RetinaNet)):
        return None

    def eval_wrapper(inputs):
        """
        The exported model does not contain the final resize step, which is typically
        unused in deployment but needed for evaluation. We add it manually here.
        """
        input = inputs[0]
        instances = traceable_model.outputs_schema(ts_model(input["image"]))[0]["instances"]
        postprocessed = detector_postprocess(instances, input["height"], input["width"])
        return [{"instances": postprocessed}]

    return eval_wrapper

def get_sample_inputs(args):

    if args.sample_image is None:
        # get a first batch from dataset
        data_loader = build_detection_test_loader(cfg, cfg.DATASETS.TEST[0])
        first_batch = next(iter(data_loader))
        return first_batch
    else:
        # get a sample data
        original_image = detection_utils.read_image(args.sample_image, format=cfg.INPUT.FORMAT)
        # Do same preprocessing as DefaultPredictor
        aug = T.ResizeShortestEdge(
            [cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST], cfg.INPUT.MAX_SIZE_TEST
        )
        height, width = original_image.shape[:2]
        image = aug.get_transform(original_image).apply_image(original_image)
        image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))

        inputs = {"image": image, "height": height, "width": width}

        # Sample ready
        sample_inputs = [inputs]
        return sample_inputs

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Export a model for deployment.")
    parser.add_argument(
        "--format",
        choices=["caffe2", "onnx", "torchscript"],
        help="output format",
        default="torchscript",
    )
    parser.add_argument(
        "--export-method",
        choices=["caffe2_tracing", "tracing", "scripting"],
        help="Method to export models",
        default="tracing",
    )
    parser.add_argument("--config-file", default="", metavar="FILE", help="path to config file")
    parser.add_argument("--sample-image", default=None, type=str, help="sample image for input")
    parser.add_argument("--run-eval", action="store_true")
    parser.add_argument("--output", help="output directory for the converted model")
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )
    args = parser.parse_args()
    logger = setup_logger()
    logger.info("Command line arguments: " + str(args))
    # Disable re-specialization on new shapes. Otherwise --run-eval will be slow

    cfg = setup_cfg(args)

    # create a torch model
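    torch_model = build_model(cfg)
    # load the trained weights; otherwise a randomly initialized model is exported
    DetectionCheckpointer(torch_model).resume_or_load(cfg.MODEL.WEIGHTS)
    torch_model.eval()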

    # convert and save model
    if args.export_method == "caffe2_tracing":
        sample_inputs = get_sample_inputs(args)
        exported_model = export_caffe2_tracing(cfg, torch_model, sample_inputs)
    elif args.export_method == "scripting":
        exported_model = export_scripting(torch_model)
    elif args.export_method == "tracing":
        sample_inputs = get_sample_inputs(args)
        exported_model = export_tracing(torch_model, sample_inputs)

    # run evaluation with the converted model
    if args.run_eval:
        assert exported_model is not None, (
            "Python inference is not yet implemented for "
            f"export_method={args.export_method}, format={args.format}."
        )
        logger.info("Running evaluation ... this takes a long time if you export to CPU.")
        dataset = cfg.DATASETS.TEST[0]
        data_loader = build_detection_test_loader(cfg, dataset)
        # NOTE: hard-coded evaluator. change to the evaluator for your dataset
        evaluator = COCOEvaluator(dataset, output_dir=args.output)
        metrics = inference_on_dataset(exported_model, data_loader, evaluator)
        print_csv_format(metrics)
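
The branches of the export script above only handle certain pairs of export method and output format. The sketch below collects those pairs in a lookup table so that an invalid pair can be rejected up front. The table and function names are hypothetical helpers and not part of detectron2; the pairs themselves are read off the `if`/`elif` branches and the assertion in `export_scripting`.

```python
# Method/format pairs for which the export script above has a branch.
SUPPORTED_FORMATS = {
    "caffe2_tracing": {"caffe2", "onnx", "torchscript"},
    "tracing": {"torchscript", "onnx"},
    "scripting": {"torchscript"},
}

# Pairs for which the script can return a Python-callable model for --run-eval.
# For ("tracing", "torchscript") the model must also be a GeneralizedRCNN or RetinaNet.
EVAL_CAPABLE = {("caffe2_tracing", "caffe2"), ("tracing", "torchscript")}


def check_export_combo(export_method, fmt):
    """Return True if the script has a branch for this method/format pair."""
    return fmt in SUPPORTED_FORMATS.get(export_method, set())


def may_support_run_eval(export_method, fmt):
    """Return True if the pair can yield a non-None exported_model."""
    return (export_method, fmt) in EVAL_CAPABLE


print(check_export_combo("tracing", "onnx"))       # True
print(check_export_combo("scripting", "onnx"))     # False
print(may_support_run_eval("tracing", "onnx"))     # False
```

For ONNX deployment this means exporting with `--export-method tracing --format onnx` and validating the result in a separate runtime, because `--run-eval` has no Python path for that pair.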

3. Model Deployment in C++

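The model can be deployed in C++ with either ONNX Runtime or OpenCV's DNN module. ONNX Runtime is an open-source, high-performance inference engine, developed by Microsoft, for running deep learning models on different hardware platforms. It is part of the Open Neural Network Exchange (ONNX) ecosystem and runs models exported to ONNX from frameworks such as PyTorch and TensorFlow. Its main features are: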

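Cross-platform support: ONNX Runtime supports many hardware targets, including CPUs, GPUs, and edge devices.
High performance: ONNX Runtime is optimized for fast inference, is suitable for real-time applications, and handles large models efficiently.
Open source: ONNX Runtime is an open-source project and can be customized and extended as needed.
Cross-framework compatibility: ONNX Runtime runs models produced by many deep learning frameworks once they are converted to ONNX.
Lightweight: ONNX Runtime has relatively modest memory and compute requirements, so it can run on embedded and edge devices.
ONNX standard support: ONNX Runtime follows the ONNX standard, an open model representation that supports model interoperability and portability.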
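
As a small illustration of the cross-platform point, the sketch below picks execution providers in priority order and falls back to the CPU when CUDA is unavailable. `pick_providers` is a hypothetical helper. With onnxruntime installed, the list of available providers would come from `onnxruntime.get_available_providers()`, and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred execution providers that are available, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("none of the preferred execution providers is available")
    return chosen


# CPU-only machine: only the CPU provider is reported as available.
print(pick_providers(["CPUExecutionProvider"]))
# ['CPUExecutionProvider']

# Machine with a CUDA build: CUDA is tried first, CPU stays as a fallback.
print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
```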

#include "RAMDetic.h"

#include <fstream>
#include <iostream>

// Returns true if the file can be opened for reading.
bool file_exists(std::string& name)
{
	std::ifstream f(name.c_str());
	return f.good();
}

int RAMDetic::init_model(std::string model_path, std::string names_path)
{
	if (!(file_exists(model_path) && file_exists(names_path)))
	{
		std::cout << "model or class name file does not exist !" << std::endl;
		return -1;
	}

	// GPU acceleration; comment this line out if CUDA is not installed
	OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, 0);

	// Windows: the session constructor takes a wide-character path
	std::wstring widestr = std::wstring(model_path.begin(), model_path.end());
	ort_session = new Ort::Session(env, widestr.c_str(), sessionOptions);
	// Linux: ort_session = new Ort::Session(env, model_path.c_str(), sessionOptions);

	size_t numInputNodes = ort_session->GetInputCount();
	size_t numOutputNodes = ort_session->GetOutputCount();
	Ort::AllocatorWithDefaultOptions allocator;
	for (size_t i = 0; i < numInputNodes; i++)
	{
		input_names.push_back(ort_session->GetInputName(i, allocator));
		// Newer ONNX Runtime releases use GetInputNameAllocated instead:
		//AllocatedStringPtr input_name_Ptr = ort_session->GetInputNameAllocated(i, allocator);
		Ort::TypeInfo input_type_info = ort_session->GetInputTypeInfo(i);
		auto input_tensor_info = input_type_info.GetTensorTypeAndShapeInfo();
		auto input_dims = input_tensor_info.GetShape();
	}
	for (size_t i = 0; i < numOutputNodes; i++)
	{
		output_names.push_back(ort_session->GetOutputName(i, allocator));
		//AllocatedStringPtr output_name_Ptr = ort_session->GetOutputNameAllocated(i, allocator);
		Ort::TypeInfo output_type_info = ort_session->GetOutputTypeInfo(i);
		auto output_tensor_info = output_type_info.GetTensorTypeAndShapeInfo();
		auto output_dims = output_tensor_info.GetShape();
	}

	// Read one class name per line
	std::ifstream ifs(names_path);
	std::string line;
	while (getline(ifs, line))
	{
		this->class_names.push_back(line);
	}

	return 0;
}

void RAMDetic::preprocess(cv::Mat &srcimg)
{
	cv::Mat dstimg;
	cvtColor(srcimg, dstimg, cv::COLOR_BGR2RGB);
	int im_h = srcimg.rows;
	int im_w = srcimg.cols;
	float oh, ow, scale;
	// Scale the shorter side to max_size
	if (im_h < im_w)
	{
		scale = (float)max_size / (float)im_h;
		oh = max_size;
		ow = scale * (float)im_w;
	}
	else
	{
		// width is the shorter side here, so scale by im_w (matches the Python version)
		scale = (float)max_size / (float)im_w;
		oh = scale * (float)im_h;
		ow = max_size;
	}
	// Cap the longer side at max_size
	float max_hw = std::max(oh, ow);
	if (max_hw > max_size)
	{
		scale = (float)max_size / max_hw;
		oh *= scale;
		ow *= scale;
	}

	// fx and fy are 0 so that the explicit size is used; interpolation is the 6th argument
	resize(dstimg, dstimg, cv::Size(int(ow + 0.5), int(oh + 0.5)), 0, 0, cv::INTER_LINEAR);

	this->inpHeight = dstimg.rows;
	this->inpWidth = dstimg.cols;
	this->input_image_.resize(this->inpWidth * this->inpHeight * dstimg.channels());
	// Repack interleaved HWC pixels into planar CHW order
	int k = 0;
	for (int c = 0; c < 3; c++)
	{
		for (int i = 0; i < this->inpHeight; i++)
		{
			for (int j = 0; j < this->inpWidth; j++)
			{
				float pix = dstimg.ptr<uchar>(i)[j * 3 + c];
				this->input_image_[k] = pix;
				k++;
			}
		}
	}
}

std::vector<BoxInfo> RAMDetic::detect(cv::Mat &srcimg, int _max_size)
{
	max_size = _max_size;
	int im_h = srcimg.rows;
	int im_w = srcimg.cols;
	// Fill input_image_, inpHeight and inpWidth before building the input tensor
	this->preprocess(srcimg);
	std::array<int64_t, 4> input_shape_{ 1, 3, this->inpHeight, this->inpWidth };

	auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
	Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info,
		input_image_.data(), input_image_.size(), input_shape_.data(), input_shape_.size());

	// Run inference
	std::vector<Ort::Value> ort_outputs = ort_session->Run(Ort::RunOptions{ nullptr },
		&input_names[0], &input_tensor_, 1, output_names.data(), output_names.size());

	const float* pred_boxes = ort_outputs[0].GetTensorMutableData<float>();
	const float* scores = ort_outputs[1].GetTensorMutableData<float>();
	const int* pred_classes = ort_outputs[2].GetTensorMutableData<int>();
	//const float *pred_masks = ort_outputs[3].GetTensorMutableData<float>();

	int num_box = ort_outputs[0].GetTensorTypeAndShapeInfo().GetShape()[0];
	// Map boxes from network-input coordinates back to the original image
	const float scale_x = float(im_w) / float(inpWidth);
	const float scale_y = float(im_h) / float(inpHeight);
	std::vector<BoxInfo> preds;
	for (int i = 0; i < num_box; i++)
	{
		float xmin = pred_boxes[i * 4] * scale_x;
		float ymin = pred_boxes[i * 4 + 1] * scale_y;
		float xmax = pred_boxes[i * 4 + 2] * scale_x;
		float ymax = pred_boxes[i * 4 + 3] * scale_y;
		xmin = std::min(std::max(xmin, 0.f), float(im_w));
		ymin = std::min(std::max(ymin, 0.f), float(im_h));
		xmax = std::min(std::max(xmax, 0.f), float(im_w));
		ymax = std::min(std::max(ymax, 0.f), float(im_h));

		// Drop boxes that are empty after clipping
		const float threshold = 0;
		const float width = xmax - xmin;
		const float height = ymax - ymin;
		if (width > threshold && height > threshold)
		{
			preds.push_back({ int(xmin), int(ymin), int(xmax), int(ymax), scores[i], class_names[pred_classes[i]] });
		}
	}
	return preds;
}

void RAMDetic::draw(cv::Mat& cv_src, cv::Mat& cv_dst)
{
	if (cv_src.empty())
	{
		return;
	}
	cv_dst = cv_src.clone();
	std::vector<BoxInfo> preds = detect(cv_dst);

	cv::RNG rng(12345); // random number generator for box colors

	for (size_t i = 0; i < preds.size(); ++i)
	{
		int b = rng.uniform(0, 255);
		int g = rng.uniform(0, 255);
		int r = rng.uniform(0, 255);
		cv::rectangle(cv_dst, cv::Point(preds[i].xmin, preds[i].ymin),
			cv::Point(preds[i].xmax, preds[i].ymax), cv::Scalar(b, g, r), 2);
		std::string label = cv::format("%.2f", preds[i].score);

		cv::putText(cv_dst, label, cv::Point(preds[i].xmin,
			preds[i].ymin - 5), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(b, g, r), 1);
	}
}

void RAMDetic::detect_video(std::string input_path, std::string output_path)
{
	cv::VideoCapture video_capture(input_path);

	if (!video_capture.isOpened())
	{
		std::cout << "Can not open video: " << input_path << "\n";
		return;
	}

	cv::Size S = cv::Size((int)video_capture.get(cv::CAP_PROP_FRAME_WIDTH),
		(int)video_capture.get(cv::CAP_PROP_FRAME_HEIGHT));

	cv::VideoWriter output_video(output_path, cv::VideoWriter::fourcc('m', 'p', '4', 'v'), 25.0, S);

	if (!output_video.isOpened())
	{
		std::cout << "Can not open writer: " << output_path << "\n";
		return;
	}

	// Run detection on every frame and write the annotated frame out
	cv::Mat cv_mat;
	while (video_capture.read(cv_mat))
	{
		cv::Mat bg_upsample;

		draw(cv_mat, bg_upsample);

		output_video << bg_upsample;
	}
}
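
The triple loop in `preprocess` converts OpenCV's interleaved pixel layout (height, width, channel) into the planar layout (channel, height, width) that the network input expects. A minimal pure-Python sketch of the same index arithmetic, using a hypothetical helper name and a tiny 1x2 image:

```python
def hwc_to_chw(pixels, height, width, channels=3):
    """Reorder a flat interleaved (HWC) buffer into planar (CHW) order."""
    out = [0.0] * (height * width * channels)
    k = 0
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                # pixel (i, j) starts at (i * width + j) * channels in the HWC buffer
                out[k] = float(pixels[(i * width + j) * channels + c])
                k += 1
    return out


# 1x2 RGB image: pixel 0 = (1, 2, 3), pixel 1 = (4, 5, 6)
planar = hwc_to_chw([1, 2, 3, 4, 5, 6], height=1, width=2)
print(planar)  # [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
```

With NumPy, the Python script below gets the same result from `dstimg.transpose(2, 0, 1)`.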

import argparse
import cv2
import numpy as np
import onnxruntime as ort

class Detic():
    def __init__(self, modelpath, detection_width=800, confThreshold=0.8):
        # net = cv2.dnn.readNet(modelpath)
        so = ort.SessionOptions()
        so.log_severity_level = 3
        self.session = ort.InferenceSession(modelpath, so)
        model_inputs = self.session.get_inputs()
        self.input_name = model_inputs[0].name
        self.max_size = detection_width
        self.confThreshold = confThreshold
        # one class name per line; the path is hard-coded relative to the working directory
        with open('models/class_names.txt') as f:
            self.class_names = [line.strip() for line in f]
        self.assigned_colors = np.random.randint(0,high=256, size=(len(self.class_names), 3)).tolist()

    def preprocess(self, srcimg):
        im_h, im_w, _ = srcimg.shape
        dstimg = cv2.cvtColor(srcimg, cv2.COLOR_BGR2RGB)
        if im_h < im_w:
            scale = self.max_size / im_h
            oh, ow = self.max_size, scale * im_w
        else:
            scale = self.max_size / im_w
            oh, ow = scale * im_h, self.max_size

        max_hw = max(oh, ow)
        if max_hw > self.max_size:
            scale = self.max_size / max_hw
            oh *= scale
            ow *= scale
        ow = int(ow + 0.5)
        oh = int(oh + 0.5)
        dstimg = cv2.resize(dstimg, (ow, oh))
        return dstimg

    def post_processing(self, pred_boxes, scores, pred_classes, pred_masks, im_hw, pred_hw):
        scale_x, scale_y = (im_hw[1] / pred_hw[1], im_hw[0] / pred_hw[0])

        pred_boxes[:, 0::2] *= scale_x
        pred_boxes[:, 1::2] *= scale_y
        pred_boxes[:, [0, 2]] = np.clip(pred_boxes[:, [0, 2]], 0, im_hw[1])
        pred_boxes[:, [1, 3]] = np.clip(pred_boxes[:, [1, 3]], 0, im_hw[0])

        threshold = 0
        widths = pred_boxes[:, 2] - pred_boxes[:, 0]
        heights = pred_boxes[:, 3] - pred_boxes[:, 1]
        keep = (widths > threshold) & (heights > threshold)

        pred_boxes = pred_boxes[keep]
        scores = scores[keep]
        pred_classes = pred_classes[keep]
        pred_masks = pred_masks[keep]

        # mask_threshold = 0.5
        # pred_masks = paste_masks_in_image(
        #     pred_masks[:, 0, :, :], pred_boxes,
        #     (im_hw[0], im_hw[1]), mask_threshold
        # )

        pred = {
            'pred_boxes': pred_boxes,
            'scores': scores,
            'pred_classes': pred_classes,
            'pred_masks': pred_masks,
        }
        return pred

    def draw_predictions(self, img, predictions):
        height, width = img.shape[:2]
        default_font_size = int(max(np.sqrt(height * width) // 90, 10))
        boxes = predictions["pred_boxes"].astype(np.int64)
        scores = predictions["scores"]
        classes_id = predictions["pred_classes"].tolist()
        # masks = predictions["pred_masks"].astype(np.uint8)
        num_instances = len(boxes)
        print('detect', num_instances, 'instances')
        for i in range(num_instances):
            # convert NumPy integers to plain ints for the OpenCV drawing calls
            x0, y0, x1, y1 = (int(v) for v in boxes[i])
            color = self.assigned_colors[classes_id[i]]
            cv2.rectangle(img, (x0, y0), (x1, y1), color=color,thickness=default_font_size // 4)
            text = "{} {:.0f}%".format(self.class_names[classes_id[i]], round(scores[i],2) * 100)
            cv2.putText(img, text, (x0, y0 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, thickness=1, lineType=cv2.LINE_AA)
        return img

    def detect(self, srcimg):
        im_h, im_w = srcimg.shape[:2]
        dstimg = self.preprocess(srcimg)
        pred_hw = dstimg.shape[:2]
        input_image = np.expand_dims(dstimg.transpose(2, 0, 1), axis=0).astype(np.float32)

        # Inference
        pred_boxes, scores, pred_classes, pred_masks = self.session.run(None, {self.input_name: input_image})
        preds = self.post_processing(pred_boxes, scores, pred_classes, pred_masks, (im_h, im_w), pred_hw)
        return preds

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--imgpath", type=str, default='desk.jpg', help="image path")
    parser.add_argument("--confThreshold", default=0.5, type=float, help='class confidence')
    parser.add_argument("--modelpath", type=str, default='models/Detic_896.onnx', help="onnxmodel path")
    args = parser.parse_args()

    mynet = Detic(args.modelpath, confThreshold=args.confThreshold)
    srcimg = cv2.imread(args.imgpath)
    preds = mynet.detect(srcimg)
    srcimg = mynet.draw_predictions(srcimg, preds)

    # cv2.imwrite('result.jpg', srcimg)
    winName = 'Deep learning Detic in ONNXRuntime'
    cv2.namedWindow(winName, cv2.WINDOW_NORMAL)
    cv2.imshow(winName, srcimg)
    cv2.waitKey(0)  # keep the window open until a key is pressed
    cv2.destroyAllWindows()
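
The resize rule in `preprocess` and the box rescaling in `post_processing` can be checked without any model or image. The sketch below reproduces both with plain numbers. The function names and the example sizes are illustrative assumptions, not part of the original script.

```python
def resized_shape(im_h, im_w, max_size):
    """Scale the shorter side to max_size, then cap the longer side at max_size."""
    if im_h < im_w:
        scale = max_size / im_h
        oh, ow = float(max_size), scale * im_w
    else:
        scale = max_size / im_w
        oh, ow = scale * im_h, float(max_size)
    longest = max(oh, ow)
    if longest > max_size:
        scale = max_size / longest
        oh, ow = oh * scale, ow * scale
    return int(oh + 0.5), int(ow + 0.5)


def clip(value, upper):
    return min(max(value, 0.0), float(upper))


def rescale_box(box, im_hw, pred_hw):
    """Map a box from network-input coordinates to the original image and clip it."""
    scale_x = im_hw[1] / pred_hw[1]
    scale_y = im_hw[0] / pred_hw[0]
    x0, y0, x1, y1 = box
    return (clip(x0 * scale_x, im_hw[1]), clip(y0 * scale_y, im_hw[0]),
            clip(x1 * scale_x, im_hw[1]), clip(y1 * scale_y, im_hw[0]))


# A 480x640 image with max_size=800 is fed to the network as 600x800.
pred_hw = resized_shape(480, 640, 800)
print(pred_hw)  # (600, 800)

# A box that extends past the right edge is scaled back and clipped to the image width.
print(rescale_box((100.0, 50.0, 900.0, 300.0), (480, 640), pred_hw))
```

Because the same `max_size` serves as both the short-side target and the long-side cap, the longer side of the network input always ends up equal to `max_size`.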
