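The most commonly used evaluation libraries today are listed below. Before they existed, people generally implemented model evaluation themselves, with no unified standard.
- torchmetrics: implements a fairly rich set of metrics, though the coverage is not exhaustive; for example, some classic detection and segmentation metrics are missing. As the veteran among evaluation libraries it has been polished over several years, its design is fairly mature, and there is a lot worth learning from:
  - The add_state method gives users maximum flexibility to declare which variables must be synchronized across processes in a distributed evaluation, and how each one is synchronized.
  - Metric inherits from nn.Module, so it can reuse nn.Module interfaces such as state_dict and load_state_dict, which makes it possible to serialize and deserialize a Metric.
  - It supports features such as higher_is_better, is_differentiable and fp16.
- torcheval: the metrics library from the PyTorch team. It closely resembles torchmetrics (close enough to look somewhat derivative), and its Metric base class reads like a simplified version of the torchmetrics one. The difference is that process synchronization is factored out into a separate sync_and_compute function. The Metric itself carries little synchronization logic, which makes it easier to understand and maintain, and sync_and_compute leaves room for custom synchronization strategies later.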
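To make the decoupling concrete, here is a minimal sketch in plain Python. It is not torcheval's code: ToyAccuracy and sync_and_compute_sketch are hypothetical names, and the "processes" are simulated as a list of per-rank metric objects. A real implementation would gather the states over a process group instead.

```python
from dataclasses import dataclass


@dataclass
class ToyAccuracy:
    """Hypothetical metric that only knows how to update, merge and compute."""
    correct: int = 0
    total: int = 0

    def update(self, preds, targets):
        self.correct += sum(int(p == t) for p, t in zip(preds, targets))
        self.total += len(targets)
        return self

    def merge_state(self, others):
        # Fold the states of other replicas into this one.
        for other in others:
            self.correct += other.correct
            self.total += other.total
        return self

    def compute(self):
        return self.correct / self.total if self.total else 0.0


def sync_and_compute_sketch(replicas):
    """Synchronization lives outside the metric: merge into a copy, then compute."""
    merged = ToyAccuracy(replicas[0].correct, replicas[0].total)
    merged.merge_state(replicas[1:])
    return merged.compute()


# Two simulated ranks, each seeing a different shard of the data.
rank0 = ToyAccuracy().update([0, 1, 1], [0, 1, 0])  # 2 of 3 correct
rank1 = ToyAccuracy().update([2, 2], [2, 1])        # 1 of 2 correct
global_accuracy = sync_and_compute_sketch([rank0, rank1])
print(global_accuracy)  # 0.6
```

Because the metric only exposes its state and a merge rule, the synchronization function can be swapped without touching the metric class.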
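- huggingface/evaluate: splits evaluation into three categories, metrics / comparisons / measurements, which correspond to model evaluation, comparison of model outputs, and dataset statistics. Each metric lives in its own repo and ships an app.py so that it can be used on Hugging Face Spaces.
  - The Metric base class is fairly simple. Each process caches its inputs to a file, and before the final computation the files are read and concatenated with huggingface/datasets; that is how process synchronization, and therefore distributed evaluation, is achieved. In my view this is a shortcut: whatever the metric, it caches the raw model predictions and ground truth, and because communication goes through files, multi-node distributed evaluation is not supported.
  - Most of the implemented metrics are NLP-related, and many simply call third-party libraries; Accuracy, for example, calls sklearn.metrics.accuracy_score directly.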
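The file-based approach described above can be sketched with the standard library alone. This is an illustration, not the evaluate library's code: the file layout, cache_inputs and compute_accuracy are all assumptions, and two ranks are simulated inside one process.

```python
import json
import tempfile
from pathlib import Path


def cache_inputs(cache_dir, rank, preds, refs):
    """Each process appends its raw predictions and references to its own file."""
    path = Path(cache_dir) / f"rank_{rank}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for p, r in zip(preds, refs):
            f.write(json.dumps({"pred": p, "ref": r}) + "\n")


def compute_accuracy(cache_dir):
    """Read and concatenate every rank's file, then compute on the full data."""
    rows = []
    for path in sorted(Path(cache_dir).glob("rank_*.jsonl")):
        with path.open(encoding="utf-8") as f:
            rows.extend(json.loads(line) for line in f)
    return sum(row["pred"] == row["ref"] for row in rows) / len(rows)


with tempfile.TemporaryDirectory() as cache_dir:
    cache_inputs(cache_dir, rank=0, preds=[0, 1, 1], refs=[0, 1, 0])
    cache_inputs(cache_dir, rank=1, preds=[2, 2], refs=[2, 1])
    file_based_accuracy = compute_accuracy(cache_dir)

print(file_based_accuracy)  # 0.6
```

The sketch only works when every process can see the same cache_dir, which is the limitation the paragraph above points at.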
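- mmeval: positioned as a cross-framework evaluation library. The goal is for different codebases, and different training frameworks, to share one evaluation tool. Compared with torchmetrics, mmeval adds metrics for detection, segmentation and similar tasks, so its metric coverage is more complete.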
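TorchMetrics implements 100+ PyTorch metrics and provides an easy-to-use API for creating custom ones. Implemented metrics such as Accuracy, Recall, Precision and MSE work out of the box, and metrics that are not yet implemented can easily be written as custom metrics. Its main features are:
- A standardized interface that improves reproducibility
- Support for distributed training
- Automatic accumulation over batches
- Automatic synchronization across multiple devices
- Consistency: it gives the same result wherever it runs (CPU, GPU or TPU)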
TorchMetrics installation: pip install torchmetrics, or conda install -c conda-forge torchmetrics
Dependencies for the TorchMetrics plotting interface: pip install matplotlib, or pip install 'torchmetrics[visual]'
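Almost every functional metric in TorchMetrics has a corresponding class-based version (the underlying Metric class inherits from torch.nn.Module), and the class-based version calls the functional one internally. A class-based metric holds one or more internal metric states (similar to the parameters of a PyTorch module), which lets it offer extra functionality: accumulation over multiple batches, automatic synchronization across devices, and metric arithmetic (TorchMetrics supports most of Python's built-in arithmetic, logical and bitwise operators).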
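Why accumulate state instead of averaging per-batch results? A small worked example in plain Python, with made-up batch counts, shows that the two disagree as soon as batch sizes differ:

```python
# (correct, total) per batch; the last batch is smaller than the others
batches = [(8, 10), (9, 10), (1, 4)]

# Naive: average the per-batch accuracies
mean_of_batch_acc = sum(c / t for c, t in batches) / len(batches)

# State-based: accumulate the counts and divide once at the end
correct = sum(c for c, _ in batches)
total = sum(t for _, t in batches)
accumulated_acc = correct / total

print(round(mean_of_batch_acc, 4))  # 0.65
print(accumulated_acc)              # 0.75
```

The accumulated value is the accuracy over all 24 samples; the naive mean over-weights the small last batch.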
For each batch, the ground truth Y and the prediction Y_PRED are passed to the torchmetrics metric object, which computes the metric for that batch and stores what it needs (internally this is called the state). Once all batches have been processed (that is, an epoch is complete), the final result, computed over all batches, can be read back from the metric object. Every metric object inherits from the Metric class, which has 4 key methods:
- metric.forward(pred, target): updates the metric state and returns the metric computed on the current batch. metric(pred, target) is equivalent.
- metric.update(pred, target): like forward, but it does not return a result; it only folds the batch into the state. Prefer it when the per-batch value is not needed, because skipping the batch result makes it faster.
- metric.compute(): returns the final result computed over all batches. In other words, forward is roughly update + compute.
- metric.reset(): resets the state so the metric is ready for the next validation phase.

import torch
import torchmetrics

# initialize metric
metric = torchmetrics.Accuracy(task="multiclass", num_classes=5)

# move the metric to the device where the computation should take place
device = "cuda" if torch.cuda.is_available() else "cpu"
metric.to(device)

n_batches = 10
for i in range(n_batches):
    # simulate a classification problem
    preds = torch.randn(10, 5).softmax(dim=-1).to(device)  # (10, 5) probabilities; an argmax is still needed to get labels
    target = torch.randint(5, (10,)).to(device)  # (10,)

    # metric on current batch
    acc = metric(preds, target)
    print(f"Accuracy on batch {i}: {acc}")

# metric on all batches using custom accumulation
acc = metric.compute()
print(f"Accuracy on all data: {acc}")

# Resetting internal state such that metric ready for new data
metric.reset()

# Sample output (the inputs are random, so the values differ between runs):
# Accuracy on batch 0: 0.30000001192092896
# Accuracy on batch 1: 0.20000000298023224
# Accuracy on batch 2: 0.30000001192092896
# Accuracy on batch 3: 0.10000000149011612
# Accuracy on batch 4: 0.10000000149011612
# Accuracy on batch 5: 0.10000000149011612
# Accuracy on batch 6: 0.10000000149011612
# Accuracy on batch 7: 0.30000001192092896
# Accuracy on batch 8: 0.10000000149011612
# Accuracy on batch 9: 0.4000000059604645
# Accuracy on all data: 0.20000000298023224
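The relationship between the four methods can be sketched with a toy class in plain Python. ToyMeanMetric is hypothetical and tracks a running mean rather than accuracy; it is only meant to show that forward returns a batch-local value while still folding the batch into the global state.

```python
class ToyMeanMetric:
    """Hypothetical stand-in for a stateful metric; tracks a running mean."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear everything accumulated so far.
        self.sum = 0.0
        self.count = 0

    def update(self, values):
        # Only fold the batch into the state; nothing is returned.
        self.sum += sum(values)
        self.count += len(values)

    def compute(self):
        # Result over everything seen since the last reset.
        return self.sum / self.count

    def forward(self, values):
        # Batch-only result, while the global state still gets updated.
        batch = ToyMeanMetric()
        batch.update(values)
        self.update(values)
        return batch.compute()

    __call__ = forward


metric = ToyMeanMetric()
batch_value = metric([1.0, 3.0])  # mean of this batch only: 2.0
metric.update([5.0])              # accumulates silently
epoch_value = metric.compute()    # mean over all three values: 3.0
metric.reset()
```

After reset() the state is empty again, so the same object can be reused for the next epoch.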
The internal state must be reset between epochs and must not be shared between training, validation and testing. It is therefore strongly recommended to create a separate metric instance for each mode, as in the following example:

from torchmetrics.classification import Accuracy

# `epochs`, `num_classes`, `model`, `train_data` and `valid_data` are assumed
# to be defined by the surrounding training script
train_accuracy = Accuracy(task="multiclass", num_classes=num_classes)
valid_accuracy = Accuracy(task="multiclass", num_classes=num_classes)

for epoch in range(epochs):
    for i, (x, y) in enumerate(train_data):
        y_hat = model(x)

        # training step accuracy
        batch_acc = train_accuracy(y_hat, y)
        print(f"Accuracy of batch {i} is {batch_acc}")

    for x, y in valid_data:
        y_hat = model(x)
        valid_accuracy.update(y_hat, y)

    # total accuracy over all training batches
    total_train_accuracy = train_accuracy.compute()

    # total accuracy over all validation batches
    total_valid_accuracy = valid_accuracy.compute()

    print(f"Training acc for epoch {epoch}: {total_train_accuracy}")
    print(f"Validation acc for epoch {epoch}: {total_valid_accuracy}")

    # Reset metric states after each epoch
    train_accuracy.reset()
    valid_accuracy.reset()
A custom metric can be implemented with the TorchMetrics API by subclassing the torchmetrics.Metric base class and implementing the following methods:
- __init__: call self.add_state here for every internal state the metric computation needs
- update: put the logic that updates the metric states here
- compute: put the final metric computation here

import torch
from torchmetrics import Metric

class MyAccuracy(Metric):
    def __init__(self):
        # remember to call super
        super().__init__()
        # call `self.add_state` for every internal state that is needed for the metric computation
        # dist_reduce_fx indicates the function that should be used to reduce
        # state from multiple processes
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # extract predicted class index for computing accuracy
        preds = preds.argmax(dim=-1)
        assert preds.shape == target.shape
        # update metric states
        self.correct += torch.sum(preds == target)
        self.total += target.numel()

    def compute(self) -> torch.Tensor:
        # compute final result
        return self.correct.float() / self.total

my_metric = MyAccuracy()
preds = torch.randn(10, 5).softmax(dim=-1)
target = torch.randint(5, (10,))
print(my_metric(preds, target))
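What a reduction such as dist_reduce_fx="sum" does across processes can be illustrated without any distributed setup. The sketch below is plain Python, not torchmetrics internals: reduce_states is a hypothetical helper and the per-rank values are invented. It contrasts a counter-style state reduced with "sum" against a list-style state reduced by concatenation ("cat").

```python
def reduce_states(per_rank_states, reduce_fx):
    """Combine one named state across ranks with the given reduction."""
    if reduce_fx == "sum":
        return sum(per_rank_states)
    if reduce_fx == "cat":
        return [x for state in per_rank_states for x in state]
    raise ValueError(f"unsupported reduction: {reduce_fx}")


# Counter-style states (like `correct` and `total` above) reduce with "sum".
correct = reduce_states([3, 5], "sum")   # 8
total = reduce_states([10, 10], "sum")   # 20
accuracy = correct / total               # 0.4

# A list-style state (raw scores, e.g. for a median) reduces with "cat".
scores = reduce_states([[0.2, 0.9], [0.5]], "cat")
median = sorted(scores)[len(scores) // 2]  # 0.5
```

Counters are cheap to synchronize because only a few numbers travel between processes; list states grow with the dataset, which is worth keeping in mind when designing a custom metric.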
Without Metric, a hand-rolled version (inheriting from nn.Module) looks like this:

from torch import nn

class CTCGreedyDecode(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, preds, labels, label_lengths):
        # tensor (T, N, C) --> numpy (N, T, C)
        preds = preds.permute(1, 0, 2).detach().cpu().numpy()
        labels = labels.cpu().numpy()
        label_lengths = label_lengths.cpu().numpy()
        # get_gt_labels and cal_acc are the author's own helpers; their definitions are not shown here
        gt_labels = get_gt_labels(labels, label_lengths)
        acc = cal_acc(preds, gt_labels)
        return acc
Several metrics can be grouped and evaluated together with MetricCollection:

import torch
from torchmetrics import MetricCollection, Accuracy, Precision, Recall

target = torch.tensor([0, 2, 0, 2, 0, 1, 0, 2])
preds = torch.tensor([2, 1, 2, 0, 1, 2, 2, 2])
metric_collection = MetricCollection([
    Accuracy(task="multiclass", num_classes=3),
    Precision(task="multiclass", num_classes=3, average='macro'),
    Recall(task="multiclass", num_classes=3, average='macro')
])
print(metric_collection(preds, target))

# Output:
# {'MulticlassAccuracy': tensor(0.1250),
#  'MulticlassPrecision': tensor(0.0667),
#  'MulticlassRecall': tensor(0.1111)}
Metrics and devices:

import torch
from torch import nn
from torchmetrics import MetricCollection
from torchmetrics.classification import BinaryAccuracy

target = torch.tensor([1, 1, 0, 0], device=torch.device("cuda", 0))
preds = torch.tensor([0, 1, 0, 0], device=torch.device("cuda", 0))

# Metric states are always initialized on cpu, and need to be moved to the correct device
metric = BinaryAccuracy().to(torch.device("cuda", 0))
out = metric(preds, target)
print(out.device)  # cuda:0

# When properly defined inside a Module or LightningModule, the metric is moved to the
# same device automatically, provided it is correctly identified as a child module of
# the model (check the .children() attribute of the model)
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        ...
        # valid ways metrics will be identified as child modules
        self.metric1 = BinaryAccuracy()
        self.metric2 = nn.ModuleList([BinaryAccuracy()])
        self.metric3 = nn.ModuleDict({'accuracy': BinaryAccuracy()})
        self.metric4 = MetricCollection([BinaryAccuracy()])  # torchmetrics built-in collection class

    def forward(self, batch):
        data, target = batch
        # self.net is a placeholder for the actual network, created in the elided part of __init__
        preds = self.net(data)
        ...
        val1 = self.metric1(preds, target)
        val2 = self.metric2[0](preds, target)
        val3 = self.metric3['accuracy'](preds, target)
        val4 = self.metric4(preds, target)
Using a metric with DistributedDataParallel (DDP):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torchmetrics

def metric_ddp(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # initialize metric
    metric = torchmetrics.classification.Accuracy(task="multiclass", num_classes=5)

    # define a model and append your metric to it
    # this allows metric states to be placed on correct accelerators when
    # .to(device) is called on the model
    model = nn.Linear(10, 10)
    model.metric = metric
    model = model.to(rank)

    # initialize DDP
    model = DDP(model, device_ids=[rank])

    n_epochs = 5
    # this shows iteration over multiple training epochs
    for n in range(n_epochs):
        # this will be replaced by a DataLoader with a DistributedSampler
        n_batches = 10
        for i in range(n_batches):
            # simulate a classification problem; inputs go to the same device as the metric
            preds = torch.randn(10, 5).softmax(dim=-1).to(rank)
            target = torch.randint(5, (10,)).to(rank)

            # metric on current batch
            acc = metric(preds, target)
            if rank == 0:  # print only for rank 0
                print(f"Accuracy on batch {i}: {acc}")

        # metric on all batches and all accelerators using custom accumulation
        # accuracy is same across both accelerators
        acc = metric.compute()
        print(f"Accuracy on all data: {acc}, accelerator rank: {rank}")

        # Resetting internal state such that metric ready for new data
        metric.reset()

    # cleanup
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # number of gpus to parallelize over
    mp.spawn(metric_ddp, args=(world_size,), nprocs=world_size, join=True)
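Deep learning tasks commonly involve two kinds of classification problem, multi-label and multi-class; the main difference is how many labels each instance may carry.
- In multi-class classification, each instance belongs to exactly one class. In handwritten digit recognition, for example, each image is assigned a single digit (one of 0 to 9). The problem can be viewed as a discrete choice. The binary and multiclass tasks mentioned above both fall under multi-class classification.
- In multi-label classification, each instance can be given several labels. In music classification, for example, a song can belong to several genres at once, such as "rock" and "classical".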
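The difference shows up directly in how targets are encoded. The following plain-Python sketch uses toy data and an assumed threshold of 0.5; it is an illustration of the two encodings, not library code.

```python
# Multi-class: one class index per sample
mc_target = [0, 1, 2]
mc_pred = [0, 2, 1]
mc_accuracy = sum(p == t for p, t in zip(mc_pred, mc_target)) / len(mc_target)

# Multi-label: one 0/1 flag per label, and several flags may be on at once
ml_target = [[0, 1, 0], [1, 0, 1]]
ml_scores = [[0.11, 0.22, 0.84], [0.73, 0.33, 0.92]]
threshold = 0.5  # assumed cut-off for turning scores into flags
ml_pred = [[int(s > threshold) for s in row] for row in ml_scores]

# Count every (sample, label) decision
hits = sum(p == t for pr, tr in zip(ml_pred, ml_target) for p, t in zip(pr, tr))
ml_accuracy = hits / (len(ml_target) * len(ml_target[0]))

print(ml_pred)                 # [[0, 0, 1], [1, 0, 1]]
print(round(mc_accuracy, 4))   # 0.3333
print(round(ml_accuracy, 4))   # 0.6667
```

A multi-class prediction is right or wrong as a whole, whereas a multi-label prediction is scored flag by flag.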
# Default arguments of the Accuracy wrapper: the task type selects which class is actually used
def __new__(  # type: ignore[misc]
    cls,
    task: Literal["binary", "multiclass", "multilabel"],
    threshold: float = 0.5,  # used for the binary and multilabel tasks; multiclass uses argmax internally
    num_classes: Optional[int] = None,
    num_labels: Optional[int] = None,
    average: Optional[Literal["micro", "macro", "weighted", "none"]] = "micro",
    multidim_average: Literal["global", "samplewise"] = "global",
    top_k: Optional[int] = 1,
    ignore_index: Optional[int] = None,
    validate_args: bool = True,
    **kwargs: Any,
) -> Metric:
    ...

# demo
import torch
from torchmetrics import Accuracy

# Binary inputs
binary_preds = torch.tensor([0, 1, 1])
binary_target = torch.tensor([1, 0, 1])
accuracy = Accuracy(task="binary")  # threshold: 0.5
binary_acc = accuracy(binary_preds, binary_target)
print(binary_acc)  # tensor(0.3333)

# Multi-class inputs
mc_preds = torch.tensor([0, 2, 1])
mc_target = torch.tensor([0, 1, 2])
mc_accuracy = Accuracy(task="multiclass", num_classes=3)
mc_acc = mc_accuracy(mc_preds, mc_target)
print(mc_acc)  # tensor(0.3333)

# Multi-class inputs with probabilities; top-k or argmax is applied internally first
mc_preds_probs = torch.tensor([[0.8, 0.2, 0], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]])
mc_target_probs = torch.tensor([0, 1, 2])
mc_accuracy = Accuracy(task="multiclass", num_classes=3, top_k=2)  # default top_k=1
mc_acc_logits = mc_accuracy(mc_preds_probs, mc_target_probs)
print(mc_acc_logits)  # tensor(0.6667)

# Multi-label inputs
ml_preds = torch.tensor([[0.11, 0.22, 0.84], [0.73, 0.33, 0.92]])
ml_target = torch.tensor([[0, 1, 0], [1, 0, 1]])
ml_accuracy = Accuracy(task="multilabel", num_labels=3)
ml_acc = ml_accuracy(ml_preds, ml_target)
print(ml_acc)  # tensor(0.6667)

# Excerpt from the library: how tp/fp/fn/tn are computed internally for multiclass with average="micro"
elif average == "micro":
    preds = preds.flatten()
    target = target.flatten()
    if ignore_index is not None:
        idx = target != ignore_index
        preds = preds[idx]
        target = target[idx]
    tp = (preds == target).sum()
    fp = (preds != target).sum()
    fn = (preds != target).sum()
    tn = num_classes * preds.numel() - (fp + fn + tp)
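The micro branch above can be replayed by hand on the three-sample multi-class demo. This is a plain-Python sketch: micro_stats is a hypothetical helper, and taking accuracy as tp / (tp + fn) is an assumption about the final reduction step, which the excerpt does not show.

```python
def micro_stats(preds, target, num_classes):
    """tp/fp/fn/tn for multiclass labels, mirroring the 'micro' branch above."""
    tp = sum(p == t for p, t in zip(preds, target))
    fp = sum(p != t for p, t in zip(preds, target))
    # Each wrong prediction is a false positive for the predicted class
    # and a false negative for the true class, so the two counts match.
    fn = fp
    tn = num_classes * len(preds) - (fp + fn + tp)
    return tp, fp, fn, tn


tp, fp, fn, tn = micro_stats([0, 2, 1], [0, 1, 2], num_classes=3)
micro_accuracy = tp / (tp + fn)  # assumed reduction

print(tp, fp, fn, tn)            # 1 2 2 4
print(round(micro_accuracy, 4))  # 0.3333
```

The four counts add up to num_classes * N = 9, one one-vs-rest decision per sample and class.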
Inputs and outputs of MulticlassAccuracy's forward and update methods:
As input to forward and update the metric accepts the following input:
- preds (torch.Tensor): An int tensor of shape (N, ...) or float tensor of shape (N, C, ...). If preds is a floating point tensor, torch.argmax is applied along the C dimension to automatically convert probabilities/logits into an int tensor.
- target (torch.Tensor): An int tensor of shape (N, ...)
As output to forward and compute the metric returns the following output:
- mca (torch.Tensor): A tensor with the accuracy score whose returned shape depends on the average and multidim_average arguments:
  - If multidim_average is set to global:
    - If average='micro'/'macro'/'weighted', the output will be a scalar tensor
    - If average=None/'none', the shape will be (C,)
  - If multidim_average is set to samplewise:
    - If average='micro'/'macro'/'weighted', the shape will be (N,)
    - If average=None/'none', the shape will be (N, C)
The parameters of MulticlassAccuracy are:
- num_classes: Integer specifying the number of classes
- average: Defines the reduction that is applied over labels. Should be one of the following:
  - micro: Sum statistics over all labels
  - macro: Calculate statistics for each label and average them
  - weighted: Calculates statistics for each label and computes weighted average using their support
  - "none" or None: Calculates statistic for each label and applies no reduction
- top_k: Number of highest probability or logit score predictions considered to find the correct label. Only works when preds contain probabilities/logits.
- multidim_average: Defines how additional dimensions ... should be handled. Should be one of the following:
  - global: Additional dimensions are flattened along the batch dimension
  - samplewise: Statistic will be calculated independently for each sample on the N axis. The statistics in this case are calculated over the additional dimensions.
- ignore_index: Specifies a target value that is ignored and does not contribute to the metric calculation
- validate_args: bool indicating if input arguments and tensors should be validated for correctness. Set to False for faster computations.
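The four average modes are easiest to compare on one small dataset. The sketch below is plain Python on made-up labels and assumes that the per-class statistic is "correct predictions of a class divided by the number of samples of that class"; per_class_accuracy is a hypothetical helper, not a library function.

```python
from collections import Counter


def per_class_accuracy(preds, target, num_classes):
    """Return per-class scores and the support (sample count) of each class."""
    support = Counter(target)
    correct = Counter(t for p, t in zip(preds, target) if p == t)
    scores = [correct[c] / support[c] if support[c] else 0.0 for c in range(num_classes)]
    return scores, [support[c] for c in range(num_classes)]


target = [0, 2, 0, 2, 0, 1, 0, 2]
preds = [2, 1, 2, 0, 1, 2, 2, 2]
scores, support = per_class_accuracy(preds, target, num_classes=3)

none_avg = scores                                   # no reduction, one value per class
macro = sum(scores) / len(scores)                   # plain mean over classes
weighted = sum(s * n for s, n in zip(scores, support)) / sum(support)
micro = sum(p == t for p, t in zip(preds, target)) / len(target)

print([round(s, 4) for s in none_avg])  # [0.0, 0.0, 0.3333]
print(round(macro, 4))                  # 0.1111
print(round(weighted, 4))               # 0.125
print(round(micro, 4))                  # 0.125
```

Macro treats every class equally, so the rare class 1 pulls the score down as much as the frequent class 0; micro and weighted follow the sample counts instead.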
Mean squared error (MSE):
import torch
from torchmetrics import MeanSquaredError
target = torch.tensor([0., 1, 2, 3])
preds = torch.tensor([0., 1, 2, 1])
mean_squared_error = MeanSquaredError()
mse_error = mean_squared_error(preds, target)
print(mse_error) # tensor(1.)
Mean absolute error (MAE, i.e. L1 Loss):
import torch
from torchmetrics import MeanAbsoluteError
target = torch.tensor([3.0, -0.5, 2.0, 7.0])
preds = torch.tensor([2.5, 0.0, 2.0, 8.0])
mean_absolute_error = MeanAbsoluteError()
mae_error = mean_absolute_error(preds, target)
print(mae_error) # tensor(0.5000)
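Both results above can be checked by hand. A plain-Python version of the two formulas, using the same numbers as the torchmetrics calls (mse and mae here are local helper names):

```python
def mse(preds, target):
    """Mean of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(preds, target)) / len(target)


def mae(preds, target):
    """Mean of absolute differences."""
    return sum(abs(p - t) for p, t in zip(preds, target)) / len(target)


mse_value = mse([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 2.0, 3.0])
mae_value = mae([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0])

print(mse_value)  # 1.0
print(mae_value)  # 0.5
```

In the MSE case only the last element differs (by 2), so the result is 4 / 4 = 1; in the MAE case the absolute errors are 0.5, 0.5, 0 and 1, giving 2 / 4 = 0.5.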
Cosine similarity:
import torch
from torchmetrics import CosineSimilarity
# float tensors are used so that the norm computation is well defined
target = torch.tensor([[0., 1.], [1., 1.]])
preds = torch.tensor([[0., 1.], [0., 1.]])
# reduction: how to reduce over the batch dimension using 'sum', 'mean' or 'none'
# (taking the individual scores)
cosine_similarity = CosineSimilarity(reduction='mean')  # the default is 'sum'
out = cosine_similarity(preds, target)
print(out) # tensor(0.8536)
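The value 0.8536 can be reproduced row by row. A plain-Python sketch with the same two rows (cosine is a local helper name):

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


target = [[0, 1], [1, 1]]
preds = [[0, 1], [0, 1]]

per_row = [cosine(p, t) for p, t in zip(preds, target)]
mean_similarity = sum(per_row) / len(per_row)

print([round(x, 4) for x in per_row])  # [1.0, 0.7071]
print(round(mean_similarity, 4))       # 0.8536
```

The first row is identical in both tensors (similarity 1), the second is 45 degrees apart (similarity 1/sqrt(2)), and the 'mean' reduction averages the two.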
KL divergence:
import torch
from torchmetrics import KLDivergence
p = torch.tensor([[0.36, 0.48, 0.16]])
q = torch.tensor([[1 / 3, 1 / 3, 1 / 3]])
kl_divergence = KLDivergence()
out = kl_divergence(p, q)
print(out) # tensor(0.0853)
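The same number falls out of the textbook formula KL(p || q) = sum_i p_i * log(p_i / q_i) with the natural logarithm. A plain-Python check (kl_divergence is a local helper name):

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.36, 0.48, 0.16]
q = [1 / 3, 1 / 3, 1 / 3]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(round(kl_pq, 4))  # 0.0853
```

KL divergence is not symmetric: kl_qp, computed with the arguments swapped, is a different number, so the argument order matters.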
mAP stands for mean Average Precision. It is obtained by averaging the average precision (AP) of every detected class, and AP is the area under the PR (precision-recall) curve.

# MeanAveragePrecision constructor arguments
def __init__(
    self,
    box_format: Literal["xyxy", "xywh", "cxcywh"] = "xyxy",
    iou_type: Union[Literal["bbox", "segm"], Tuple[str]] = "bbox",
    iou_thresholds: Optional[List[float]] = None,
    rec_thresholds: Optional[List[float]] = None,
    max_detection_thresholds: Optional[List[int]] = None,
    class_metrics: bool = False,
    extended_summary: bool = False,
    average: Literal["macro", "micro"] = "macro",
    backend: Literal["pycocotools", "faster_coco_eval"] = "pycocotools",
    **kwargs: Any,
) -> None:
    ...

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision  # pip install pycocotools
# IoU variants for detection
from torchmetrics.detection.ciou import CompleteIntersectionOverUnion
from torchmetrics.detection.diou import DistanceIntersectionOverUnion
from torchmetrics.detection.giou import GeneralizedIntersectionOverUnion
from torchmetrics.detection.iou import IntersectionOverUnion
from pprint import pprint

preds = [
    dict(
        boxes=torch.tensor([[258.0, 41.0, 606.0, 285.0]]),
        scores=torch.tensor([0.536]),
        labels=torch.tensor([0]),
    )
]
target = [
    dict(
        boxes=torch.tensor([[214.0, 41.0, 562.0, 285.0]]),
        labels=torch.tensor([0]),
    )
]
metric = MeanAveragePrecision()
out = metric(preds, target)
pprint(out)

# Output:
# {'classes': tensor(0, dtype=torch.int32),
#  'map': tensor(0.6000),
#  'map_50': tensor(1.),
#  'map_75': tensor(1.),
#  'map_large': tensor(0.6000),
#  'map_medium': tensor(-1.),
#  'map_per_class': tensor(-1.),
#  'map_small': tensor(-1.),
#  'mar_1': tensor(0.6000),
#  'mar_10': tensor(0.6000),
#  'mar_100': tensor(0.6000),
#  'mar_100_per_class': tensor(-1.),
#  'mar_large': tensor(0.6000),
#  'mar_medium': tensor(-1.),
#  'mar_small': tensor(-1.)}
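The numbers in the output above (map 0.6, map_50 and map_75 equal to 1) can be traced back to the IoU of the two boxes. The sketch below is plain Python and makes two assumptions: the thresholds are the COCO-style 0.50 to 0.95 in steps of 0.05, and with a single prediction and a single ground-truth box the AP at a threshold is 1 when the IoU clears it and 0 otherwise. box_iou_xyxy is a local helper, not a library function.

```python
def box_iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


pred_box = (258.0, 41.0, 606.0, 285.0)
gt_box = (214.0, 41.0, 562.0, 285.0)
iou = box_iou_xyxy(pred_box, gt_box)

# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95
thresholds = [0.5 + 0.05 * i for i in range(10)]
matched = [iou >= t for t in thresholds]
single_box_map = sum(matched) / len(thresholds)

print(round(iou, 4))    # 0.7755
print(sum(matched))     # 6
print(single_box_map)   # 0.6
```

An IoU of about 0.78 passes the six thresholds from 0.50 to 0.75 and fails the four from 0.80 upward, which matches map_50 = map_75 = 1 and an overall map of 0.6.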
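MMEval is an evaluation library for machine learning algorithms. It provides efficient and accurate distributed evaluation and supports multiple machine learning framework backends. Its features:
- A rich set of metrics covering the sub-fields of computer vision
- Support for several distributed communication libraries, for efficient and accurate distributed evaluation
- Support for several machine learning frameworks, dispatching automatically to the matching implementation based on the input
Installation and usage example:

pip install mmeval

from mmeval import Accuracy
import numpy as np

accuracy = Accuracy()

# Option 1: call the Accuracy instance directly to compute the metric.
labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}

# Option 2: accumulate several batches of data, then compute the metric.
for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    # call `add` to store the intermediate results of the metric computation
    accuracy.add(predicts, labels)

# call compute to calculate the metric
accuracy.compute()
# {'top1': ...}

# call reset to clear the stored intermediate results
accuracy.reset()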
References:
1. torchmetrics repository: https://github.com/Lightning-AI/torchmetrics
2. torchmetrics documentation: https://lightning.ai/docs/torchmetrics/stable/
3. torcheval repository: https://github.com/pytorch/torcheval
4. torcheval documentation: https://pytorch.org/torcheval/stable/
5. huggingface/evaluate repository: https://github.com/huggingface/evaluate
6. huggingface/evaluate documentation: https://huggingface.co/docs/evaluate/index
7. mmeval repository: https://github.com/open-mmlab/mmeval
8. mmeval documentation: https://mmeval.readthedocs.io/zh-cn/latest/
9. "PyTorch指标计算库TorchMetrics详解" (a detailed guide to the PyTorch metrics library TorchMetrics)