MLPerf is a benchmark suite that is used to evaluate training and inference performance of on-premises and cloud platforms. MLPerf is intended as an independent, objective performance yardstick for software frameworks, hardware platforms, and cloud platforms for machine learning.
The goal of MLPerf is to give developers a way to evaluate hardware architectures and the wide range of advancing machine learning frameworks.
The training benchmarks measure the time it takes to train machine learning models to a target level of accuracy.
The inference benchmarks measure how quickly a trained neural network can perform inference tasks on new data.
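To make the training metric concrete, the toy sketch below stops a wall-clock timer once a target quality is reached. The training and evaluation logic here is a fake stand-in, not part of any MLPerf reference implementation; the 75.9% Top-1 target is only quoted as the published ResNet-50 example.

```python
import time

# Purely illustrative: a real benchmark trains an actual model; here "training"
# just nudges a fake accuracy upward so the loop terminates quickly.
TARGET_ACCURACY = 0.759          # MLPerf Training's ResNet-50 Top-1 target, as an example

def train_one_epoch(accuracy):
    time.sleep(0.01)             # stand-in for a real epoch of work
    return min(accuracy + 0.1, 0.99)

accuracy = 0.0
epochs = 0
start = time.time()
while accuracy < TARGET_ACCURACY:
    accuracy = train_one_epoch(accuracy)
    epochs += 1
time_to_train = time.time() - start

# The reported MLPerf Training metric is this wall-clock time-to-train.
print(f"reached {accuracy:.3f} after {epochs} epochs in {time_to_train:.2f}s")
```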
The MLPerf inference benchmark is intended as an objective way to measure inference performance in both the data center and the edge. Each benchmark has four measurement scenarios: server, offline, single-stream, and multi-stream. The following figure depicts the basic structure of the MLPerf Inference benchmark and the order that data flows through the system:
Server and offline scenarios are most relevant for data center use cases, while single-stream and multi-stream scenarios evaluate the workloads of edge devices.
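As a concrete (and heavily simplified) sketch of that flow, the snippet below wires a dummy system under test (SUT) and query sample library (QSL) into the official LoadGen via its Python bindings (`mlperf_loadgen`, built from the inference repository). The callback names and the empty responses are illustrative only, and the exact constructor signatures can differ between LoadGen versions.

```python
import mlperf_loadgen as lg

# Hypothetical dataset size; a real QSL maps these indices to preprocessed samples.
TOTAL_SAMPLES = 1024

def load_query_samples(sample_indices):
    # A real SUT would load/preprocess the listed samples into memory here.
    pass

def unload_query_samples(sample_indices):
    pass

def issue_queries(query_samples):
    # LoadGen hands over QuerySample objects; run inference and signal completion.
    responses = []
    for sample in query_samples:
        # ... run the model on the sample referenced by sample.index ...
        responses.append(lg.QuerySampleResponse(sample.id, 0, 0))  # no output payload attached
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.SingleStream
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(TOTAL_SAMPLES, TOTAL_SAMPLES, load_query_samples, unload_query_samples)
lg.StartTest(sut, qsl, settings)  # LoadGen drives the scenario's query pattern
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```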
MLPerf Results Show Advances in Machine Learning Inference Performance and Efficiency | MLCommons
Today, MLCommons®, an open engineering consortium, released new results for three MLPerf™ benchmark suites - Inference v2.0, Mobile v2.0, and Tiny v0.7. These three benchmark suites measure the performance of inference - applying a trained machine learning model to new data. Inference enables adding intelligence to a wide range of applications and systems. Collectively, these benchmark suites scale from ultra-low power devices that draw just a few microwatts all the way up to the most powerful datacenter computing platforms. The latest MLPerf results demonstrate wide industry participation, an emphasis on energy efficiency, and up to 3.3X greater performance, ultimately paving the way for more capable intelligent systems to benefit society at large.
The MLPerf Mobile benchmark suite targets smartphones, tablets, notebooks, and other client systems, with the latest submissions highlighting an average 2X performance gain over the previous round.
MLPerf Mobile v2.0 includes a new image segmentation model, MOSAIC, that was developed by Google Research with feedback from MLCommons. The MLPerf Mobile application and the corresponding source code, which incorporates the latest updates and submitting vendors’ backends, are expected to be available in the second quarter of 2022.
MLPerf Training v2.0 is the sixth instantiation for training and consists of eight different workloads covering a broad diversity of use cases, including vision, language, recommenders, and reinforcement learning.
MLPerf Inference v2.0 tested seven different use cases across seven different kinds of neural networks. Three of these use cases were for computer vision, one for recommender systems, two for language processing, and one for medical imaging.
Image Classification:
MLPerf Training uses ResNet-50 v1.5 with the ImageNet dataset.
MLPerf Inference uses ResNet-50 v1.5 with the ImageNet dataset.
Object Detection (lightweight):
MLPerf Training uses Single-Shot Detector (SSD) on the COCO 2017 dataset.
MLPerf Inference uses SSD-MobileNet-v1 (low-res, 0.09 MP) with the COCO 2017 dataset.
Object Detection (heavyweight):
MLPerf Training uses Mask R-CNN on the COCO 2014 dataset.
MLPerf Inference uses SSD-ResNet-34 (high-res, 1.44 MP) with the COCO 2017 dataset.
Medical Image Segmentation:
MLPerf Training uses 3D U-Net with the KiTS19 dataset.
MLPerf Inference uses 3D U-Net with the KiTS19 dataset.
Speech Recognition:
MLPerf Training uses RNN-T on the LibriSpeech dataset.
MLPerf Inference uses RNN-T on the LibriSpeech dataset.
Natural Language Processing:
MLPerf Training uses Bidirectional Encoder Representations from Transformers (BERT) on the Wikipedia 2020/01/01 dataset.
MLPerf Inference uses BERT with the SQuAD v1.1 dataset.
Recommendation:
MLPerf Training uses the Deep Learning Recommendation Model (DLRM) on the Criteo 1TB dataset.
MLPerf Inference uses DLRM on the Criteo 1TB dataset.
Reinforcement Learning:
MLPerf Training uses a MiniGo benchmark based on gameplay (there is no corresponding inference benchmark).
The latest branch for MLPerf Inference is MLPerf Inference v2.0 (submission 02/25/2022)
GitHub - mlcommons/inference: Reference implementations of MLPerf™ inference benchmarks
Use Case | Model | Reference App | Framework | Dataset |
---|---|---|---|---|
Image Classification | resnet50-v1.5 | vision/classification_and_detection | tensorflow, pytorch, onnx | imagenet2012 |
Object Detection (Lightweight) | ssd-mobilenet 300x300 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 300x300 |
Object Detection (Heavyweight) | ssd-resnet34 1200x1200 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 1200x1200 |
Natural Language Processing (NLP) | bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 |
Recommendation | dlrm | recommendation/dlrm | pytorch, tensorflow(?), onnx(?) | Criteo Terabyte |
Biomedical Image Segmentation | 3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 |
Automatic Speech Recognition (ASR) | rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus |
This benchmark suite measures how fast systems can process inputs and produce results using a trained model.
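As a rough illustration of what a reference backend does per query, the snippet below loads an ONNX model with ONNX Runtime and runs a single dummy inference. The file name `resnet50_v1.onnx` is a placeholder for the model download referenced in the repository, and the random input merely stands in for a preprocessed ImageNet sample.

```python
import numpy as np
import onnxruntime as ort

# Placeholder path: the repo's README links the official ResNet-50 v1.5 ONNX model.
session = ort.InferenceSession("resnet50_v1.onnx", providers=["CPUExecutionProvider"])

# Query the model for its input name and shape instead of hard-coding them.
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]   # replace dynamic dims with 1

# A random tensor stands in for a preprocessed ImageNet image.
dummy = np.random.rand(*shape).astype(np.float32)
outputs = session.run(None, {inp.name: dummy})
print("output shapes:", [o.shape for o in outputs])
```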
GitHub - mlcommons/mobile_app_open: Mobile App Open
GitHub - mlcommons/mobile_models: MLPerf™ Mobile models
GitHub - mlcommons/mobile_datasets: Scripts to create performance-only test sets for MLPerf™ Mobile
GitHub - mlcommons/mobile_results_v2.0
In order to enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described below. A given scenario is evaluated by a standard load generator generating inference requests in a particular pattern and measuring a specific metric.
Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
---|---|---|---|---|---|---|
Single stream | LoadGen sends next query as soon as SUT completes the previous query | 1024 queries and 60 seconds | 1 | None | 90% | 90%-ile measured latency |
Multiple stream (1.1 and earlier) | LoadGen sends a new query every latency constraint if the SUT has completed the prior query, otherwise the new query is dropped and is counted as one overtime query | 270,336 queries and 60 seconds | Variable, see metric | Benchmark specific | 99% | Maximum number of inferences per query supported |
Multiple stream (2.0 and later) | LoadGen sends the next query as soon as the SUT completes the previous query | 270,336 queries and 600 seconds | 8 | None | 99% | 99%-ile measured latency |
Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60 seconds | 1 | Benchmark specific | 99% | Maximum Poisson throughput parameter supported |
Offline | LoadGen sends all queries to the SUT at start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |
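To illustrate how a submitter selects one of these scenarios, the sketch below sets the corresponding fields on LoadGen's `TestSettings` via the Python bindings. The numeric values are placeholders rather than compliant settings; in the reference apps these values normally come from `mlperf.conf`/`user.conf` via `TestSettings.FromConfig`.

```python
import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.mode = lg.TestMode.PerformanceOnly

# Single-stream: one sample per query; metric is 90th-percentile latency.
settings.scenario = lg.TestScenario.SingleStream
settings.single_stream_expected_latency_ns = 10_000_000   # placeholder 10 ms estimate

# Multi-stream (v2.0 and later): 8 samples per query; metric is 99th-percentile latency.
# settings.scenario = lg.TestScenario.MultiStream
# settings.multi_stream_samples_per_query = 8

# Server: Poisson query arrivals; metric is the highest sustainable target QPS.
# settings.scenario = lg.TestScenario.Server
# settings.server_target_qps = 100.0        # placeholder

# Offline: all samples issued at once; metric is measured throughput.
# settings.scenario = lg.TestScenario.Offline
# settings.offline_expected_qps = 1000.0    # placeholder
```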
Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth):
Area | Task | Model | Dataset | Mode | Quality |
---|---|---|---|---|---|
Vision | Image classification | MobileNetEdgeTPU | ImageNet | Single-stream, Offline | 98% of FP32 (Top1: 76.19%) |
Vision | Object detection | MobileDETs | MS-COCO 2017 | Single-stream | 95% of FP32 (mAP: 0.285) |
Vision | Segmentation | DeepLabV3+ (MobileNetV2) | ADE20K (32 classes, 512x512) | Single-stream | 97% of FP32 (32-class mIOU: 54.8) |
Vision | Segmentation, MOSAIC | MOSAIC | ADE20K (32 classes, 512x512) | Single-stream | 96% of FP32 (32-class mIOU: 59.8) |
Language | Language processing | MobileBERT | SQuAD 1.1 | Single-stream | 93% of FP32 (F1 score: 90.5) |
Each Mobile benchmark requires the Single-stream scenario. The image classification benchmark also permits an optional Offline scenario.
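Reading the Quality column as "the listed percentage of the FP32 reference score shown in parentheses", the minimum passing scores work out as in the sketch below; this is an illustrative calculation, not an official compliance checker.

```python
# FP32 reference scores and required fractions, taken from the table above.
benchmarks = {
    "MobileNetEdgeTPU / ImageNet Top-1 (%)":  (76.19, 0.98),
    "MobileDETs / COCO mAP":                  (0.285, 0.95),
    "DeepLabV3+ (MobileNetV2) / ADE20K mIOU": (54.8,  0.97),
    "MOSAIC / ADE20K mIOU":                   (59.8,  0.96),
    "MobileBERT / SQuAD 1.1 F1":              (90.5,  0.93),
}

for name, (fp32_score, fraction) in benchmarks.items():
    threshold = fp32_score * fraction
    print(f"{name}: must reach at least {threshold:.3f} ({fraction:.0%} of {fp32_score})")
```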
MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. MLPerf has two Divisions that allow different levels of flexibility during reimplementation.
The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation.
The Open division is intended to foster innovation and allows using a different model or retraining.