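Quantization is the process of approximating the continuous values of a signal with a finite set of discrete values; it can be understood as a form of information compression. In computer systems, quantization goes by several related names, of which "low precision" is probably the most general. Regular precision usually means storing model weights in FP32 (32-bit single-precision floating point); low precision refers to numeric formats such as FP16 (half-precision floating point) and INT8 (8-bit fixed-point integer). At present, low precision usually means INT8, so some people also call quantization "fixed-point conversion". Strictly speaking, though, fixed-point conversion covers a narrower scope: it refers specifically to linear quantization whose scale is a power of two, which is a more practical quantization method.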
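The power-of-two case can be sketched as follows. This is a minimal illustration, assuming signed 8-bit storage, round-to-nearest, and saturation at the range limits; the function names `toFixedPoint` and `fromFixedPoint` are hypothetical and not taken from any library.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize x to signed 8-bit fixed point with `fracBits` fractional bits,
// i.e. a linear quantizer whose scale is 2^-fracBits (assumed convention).
std::int8_t toFixedPoint(float x, int fracBits)
{
    assert(fracBits >= 0 && fracBits <= 7);
    // Dividing by 2^-fracBits is the same as multiplying by 2^fracBits.
    const long q = std::lround(std::ldexp(x, fracBits));
    // Saturate to the INT8 range instead of wrapping around.
    const long clamped = q < -128 ? -128 : (q > 127 ? 127 : q);
    return static_cast<std::int8_t>(clamped);
}

// Map the stored integer back to a real value: q * 2^-fracBits.
float fromFixedPoint(std::int8_t q, int fracBits)
{
    return std::ldexp(static_cast<float>(q), -fracBits);
}
```

With 4 fractional bits the scale is 1/16, so 1.5 is stored as 24 and recovered exactly, while values outside roughly [-8, 8) saturate. Because the scale is a power of two, hardware can apply it with a bit shift rather than a multiplication.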
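Existing deep learning frameworks such as TensorFlow, PyTorch, Caffe, and MXNet typically use FP32 to represent weights, biases, and activations when training deep neural networks. As deep learning models have become more accurate, computation has become more complex, and compute cost and memory requirements have steadily grown. AlexNet, with only 8 layers, needs 61 million parameters and 729 million floating-point operations, and takes about 233 MB of memory. The later VGG-16 has 138 million parameters and 15.6 billion floating-point operations, and needs about 553 MB of memory. To overcome the vanishing gradient problem in deep networks, He et al. proposed ResNet, which was the first to achieve a top-5 classification error below 5% in the ILSVRC competition; even the relatively shallow ResNet-50 has 25 million parameters and 4.12 billion floating-point operations, and takes about 102 MB of memory.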
Network | Model Size (MB) | GFLOPs |
AlexNet | 214 | 0.72 |
VGG-13 | 532 | 11.3 |
VGG-16 | 552 | 15.6 |
VGG-19 | 576 | 19.6 |
ResNet-50 | 102 | 4.12 |
ResNet-101 | 178 | 7.84 |
ResNet-152 | 240 | 23.1 |
GoogLeNet | 27 | 1.6 |
InceptionV3 | 89 | 6 |
MobileNet | 38 | 0.58 |
SqueezeNet | 30 | 0.84 |
Table 1 Model size and number of floating-point operations for different models
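The link between parameter count, numeric width, and storage can be sketched as below. This is a rough estimate under the assumption that only the weights are counted and that every parameter uses the same bit width; the function name `weightSizeMiB` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Approximate storage for the weights alone, in MiB (2^20 bytes).
// bitsPerParam is 32 for FP32, 16 for FP16, 8 for INT8.
double weightSizeMiB(std::uint64_t numParams, int bitsPerParam)
{
    assert(bitsPerParam > 0);
    const double bytes = static_cast<double>(numParams) * bitsPerParam / 8.0;
    return bytes / (1024.0 * 1024.0);
}
```

For example, `weightSizeMiB(61000000, 32)` gives about 232.7, which is close to the roughly 233 MB quoted for AlexNet, and the same parameter count at 8 bits needs about a quarter of that. Published model sizes can differ from this estimate because file formats add metadata and parameter counts are rounded.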
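A huge number of parameters means more memory for storage, and a growing number of floating-point operations means higher training cost and longer computation time. This severely limits deployment on resource-constrained devices such as smartphones and smart wristbands. As Table 2 shows, inference on a Samsung Galaxy S6 takes far longer than on a desktop-class Titan X GPU, so real-time performance is poor and falls short of what practical applications require.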
Model | Samsung Galaxy S6 | Titan X |
AlexNet | 117 | 0.54 |
GoogLeNet | 273 | 1.83 |
VGG-16 | 1926 | 10.67 |
Table 2 Inference time of different models on different devices (unit: ms)
Figure 1 Memory footprint and compute energy cost of different numeric precisions
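When group=1, per-group quantization is equivalent to per-layer quantization; when group=num_filters (that is, a depthwise convolution), per-group quantization is equivalent to per-channel quantization.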
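The equivalence can be seen in a small sketch that computes one scale per group of filters. It assumes symmetric INT8 quantization with scale = max|w| / 127 and a row-major weight layout of num_filters × filter_size; the function name `groupScales` and these conventions are illustrative choices, not something the text prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One symmetric INT8 scale per group of consecutive filters.
// weights: row-major [numFilters x filterSize]; groups must divide numFilters.
std::vector<float> groupScales(const std::vector<float>& weights, std::size_t numFilters, std::size_t groups)
{
    assert(numFilters > 0 && groups > 0);
    assert(numFilters % groups == 0 && weights.size() % numFilters == 0);
    const std::size_t perGroup = weights.size() / groups; // elements per group
    std::vector<float> scales(groups);
    for (std::size_t g = 0; g < groups; ++g)
    {
        float maxAbs = 0.0f;
        for (std::size_t i = g * perGroup; i < (g + 1) * perGroup; ++i)
        {
            maxAbs = std::max(maxAbs, std::fabs(weights[i]));
        }
        scales[g] = maxAbs / 127.0f; // assumed symmetric INT8 convention
    }
    return scales;
}
```

Calling it with `groups = 1` returns a single scale for the whole layer (per-layer quantization), while `groups = numFilters` returns one scale per filter (per-channel quantization). Values in between trade scale storage against how tightly each scale fits its weights.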
Figure 2 Schematic of the dimensions of a deep learning network
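Post-training quantization (PTQ) requires no retraining, which makes it a lightweight quantization method. In most cases, PTQ is enough to reach INT8 quantization with accuracy close to FP32. It has limits, however, especially when activations are quantized to lower bit widths such as 4 bits or 2 bits. That is where quantization during training (quantization-aware training) comes in.
The quantization process has two parts: converting the model from FP32 to INT8, and running inference in INT8. This section explains the arithmetic behind both parts. Without an understanding of the underlying arithmetic, the details of quantization tend to be confusing.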
Item | Fixed-point | Floating-point |
Format | IIIII.FFFFF | significand×base^exponent |
Decimal | 12345.78901, 00123.90000 | 1.234578901×10^4, 1.239×10^2 |
Hex | A1C7D.FF014, 00000.000FD | A.1C7DFF014×16^4, F.D×16^-4 |
Binary | 10111.01011, 00110.00000 | 1.011101011×2^4, 1.1×2^2 |
Table 3 Formats and examples of fixed-point and floating-point numbers
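Among the built-in data types of an instruction set, fixed-point numbers are integers and floating-point numbers use a binary format. Generally speaking, fixed-point numbers at the instruction-set level are contiguous: they are integers, so the gap between two adjacent representable values is always 1. Floating-point numbers represent real numbers, and their spacing is determined by the exponent, which gives them a very wide value range. It also follows that floating-point spacing is not uniform: every exponent range contains the same number of representable values, so the closer a value is to zero, the more precisely it is represented. For example, [1,2) contains as many floating-point values as [0.5,1), [2,4), [4,8), and so on. In addition, a fixed-point value matches the integer it is meant to represent exactly, whereas a floating-point value generally deviates from the real number it is meant to represent.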
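The uneven spacing is easy to observe directly. The sketch below uses `std::nextafter` from the C++ standard library to measure the gap between an FP32 value and its next representable neighbor; the helper name `spacingAbove` is made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between x and the next representable FP32 value above it.
float spacingAbove(float x)
{
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For FP32, which has 23 explicit mantissa bits, the gap is 2^-24 in [0.5,1), 2^-23 in [1,2), and 2^-20 in [8,16). Each of those intervals therefore holds the same number of values (2^23), spread over a range that doubles with every step of the exponent.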
Type | Value range | Number of representable values |
FP32 | [-(2-2^-23)×2^127, (2-2^-23)×2^127] | about 2^32 |
INT32 | [-2^31, 2^31-1] | 2^32 |
Table 4 Value range and number of representable values of FP32 and INT32
Figure 3 Correspondence between floating-point and fixed-point numbers
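Exponent | True value range | Representable floating-point values (illustration with 1 mantissa bit) | Maximum error |
0 | [2^0,2^1) | {1,1.5} | approx. 0.5 |
3 | [2^3,2^4) | {8,12} | approx. 4 |
Table 5 Example of how floating-point spacing varies with the exponent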
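In the linear quantization formula, X is the original FP32 value; Z is the zero point of the mapping; S is the scale factor; the rounding operator is a mathematical function that rounds to an integer, which may be round-to-nearest, ceiling, floor, and so on; and the result is the quantized integer value.
Depending on whether the parameter Z is zero, linear quantization falls into two classes: symmetric quantization and asymmetric quantization. TensorRT uses symmetric quantization, that is, Z=0.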
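A minimal sketch of linear quantization in both variants follows. It assumes the common convention q = round(X / S) + Z with saturation to the integer range, and X ≈ S × (q − Z) for dequantization; the exact formula in the original figure is not reproduced here, and the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Quantization parameters: real value ~= scale * (q - zeroPoint).
struct QParams
{
    float scale;
    int zeroPoint;
};

// Symmetric signed INT8: zero point fixed at 0, integer range [-127, 127].
QParams symmetricParams(float maxAbs)
{
    assert(maxAbs > 0.0f);
    return {maxAbs / 127.0f, 0};
}

// Asymmetric unsigned 8-bit: maps [minVal, maxVal] onto [0, 255].
QParams asymmetricParams(float minVal, float maxVal)
{
    assert(maxVal > minVal);
    const float scale = (maxVal - minVal) / 255.0f;
    const int zeroPoint = static_cast<int>(std::lround(-minVal / scale));
    return {scale, zeroPoint};
}

// Round to nearest, shift by the zero point, then saturate to [qmin, qmax].
int quantize(float x, QParams p, int qmin, int qmax)
{
    const int q = static_cast<int>(std::lround(x / p.scale)) + p.zeroPoint;
    return std::max(qmin, std::min(qmax, q));
}

// Recover the approximate real value from the stored integer.
float dequantize(int q, QParams p)
{
    return p.scale * static_cast<float>(q - p.zeroPoint);
}
```

With `symmetricParams`, zero in FP32 always maps to integer zero, which keeps the arithmetic simple at the cost of wasting range when the data is one-sided. With `asymmetricParams`, the zero point shifts the integer range so that a skewed interval such as [-10, 245] uses all 256 levels.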
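Figure 4 Symmetric signed quantization, symmetric unsigned quantization, and asymmetric quantization
S | FP32 value range represented by INT8=1 | Maximum error | Quantization range |
10 | [10,20) | approx. 10 | [-1280,1280) |
100 | [100,200) | approx. 100 | [-12800,12800) |
Table 6 Effect of different scale factors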
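At its core, NVIDIA® TensorRT™ is a C++ library that facilitates high-performance inference on NVIDIA graphics processing units (GPUs). It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, and MXNet. It focuses specifically on running an already trained network quickly and efficiently on a GPU to generate results. Some training frameworks, such as TensorFlow, have integrated TensorRT so that it can be used to accelerate inference within the framework.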
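Figure 5 TensorRT is a programmable inference accelerator
Figure 6 GPU models supported by TensorRT quantization
Figure 7 Network layers supported by TensorRT quantization
Figure 8 Platforms and compilers supported by TensorRT quantization
Figure 9 The official TensorRT documentation states that the bias can be omitted
Figure 10 FP32 convolution layer inference flow
Figure 11 FP32 convolution layer inference flow: detailed view
Figure 12 INT8 convolution layer inference flow: quantization
Figure 13 INT8 convolution layer inference flow: activation and requantization
Figure 14 INT8 convolution layer inference flow: dequantization
Figure 15 INT8 convolution layer inference flow: requantization with bias
Figure 16 Official pseudocode
Figure 17 Official INT8 calibration diagram
Activation distribution before calibration
Figure 18 Activation distributions obtained from the same batch of data at different layers of different networks (official)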
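This calls for a quantitative metric. Recalling Section 5.4.3 above, the commonly used approaches are exponential smoothing, histogram truncation, and KL-divergence calibration; TensorRT uses KL-divergence calibration.
Principle of KL-divergence calibration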
See "如何通俗的解释交叉熵与相对熵?" (How can cross entropy and relative entropy be explained in plain terms?) on Zhihu (zhihu.com)
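The problem is now simple. Our goal is to change the quantization range, which in effect changes the true distribution, so that the relative entropy between the modified distribution after quantization and the distribution before quantization is as small as possible.
Implementation procedure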
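The search for a clipping threshold can be sketched as below. This is a simplified illustration of the general idea, not TensorRT's actual implementation: it assumes the activations have already been collected into a histogram of absolute values, tries every candidate number of bins to keep, folds the clipped outliers into the last kept bin of the reference distribution, and picks the candidate with the smallest KL divergence. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// KL(P || Q) for two unnormalized histograms of equal length.
// Returns infinity if Q is empty in a bin where P has mass.
double klDivergence(const std::vector<double>& p, const std::vector<double>& q)
{
    assert(p.size() == q.size());
    double sumP = 0.0, sumQ = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
    {
        sumP += p[i];
        sumQ += q[i];
    }
    double kl = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
    {
        if (p[i] <= 0.0)
        {
            continue;
        }
        if (q[i] <= 0.0)
        {
            return std::numeric_limits<double>::infinity();
        }
        kl += (p[i] / sumP) * std::log((p[i] / sumP) / (q[i] / sumQ));
    }
    return kl;
}

// Merge `bins` into `levels` buckets, then spread each bucket's mass evenly
// over the bins that were non-empty, giving a same-length approximation.
std::vector<double> quantizeHistogram(const std::vector<double>& bins, std::size_t levels)
{
    assert(levels > 0 && bins.size() >= levels);
    std::vector<double> q(bins.size(), 0.0);
    for (std::size_t l = 0; l < levels; ++l)
    {
        const std::size_t begin = l * bins.size() / levels;
        const std::size_t end = (l + 1) * bins.size() / levels;
        double mass = 0.0;
        std::size_t nonEmpty = 0;
        for (std::size_t i = begin; i < end; ++i)
        {
            mass += bins[i];
            nonEmpty += (bins[i] > 0.0) ? 1 : 0;
        }
        for (std::size_t i = begin; i < end; ++i)
        {
            if (bins[i] > 0.0)
            {
                q[i] = mass / static_cast<double>(nonEmpty);
            }
        }
    }
    return q;
}

// Returns how many histogram bins to keep; everything beyond is clipped.
std::size_t findBestBinCount(const std::vector<double>& hist, std::size_t levels = 128)
{
    assert(levels > 0 && hist.size() >= levels);
    std::size_t best = hist.size();
    double bestKl = std::numeric_limits<double>::infinity();
    for (std::size_t n = levels; n <= hist.size(); ++n)
    {
        const std::vector<double> clipped(hist.begin(), hist.begin() + static_cast<std::ptrdiff_t>(n));
        std::vector<double> p = clipped; // reference distribution
        for (std::size_t i = n; i < hist.size(); ++i)
        {
            p[n - 1] += hist[i]; // fold clipped outliers into the last kept bin
        }
        const double kl = klDivergence(p, quantizeHistogram(clipped, levels));
        if (kl < bestKl)
        {
            bestKl = kl;
            best = n;
        }
    }
    return best;
}
```

Multiplying the returned bin count by the histogram's bin width gives a clipping threshold, and under the symmetric INT8 convention assumed in this sketch the scale would be that threshold divided by 127. Production implementations typically add smoothing for empty bins and other refinements that are omitted here.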
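Figure 19 Official pseudocode for KL calibration
Data distribution after calibration
Figure 20 Distribution after calibration 1 (official)
Figure 21 Distribution after calibration 2 (official)
Figure 22 Distribution after calibration 3 (official)
Figure 23 Accuracy after quantization (official)
Figure 24 Speedup after quantization (official)
Figure 25 TensorRT workflow (official)
Figure 26 TensorRT INT8 program flowchart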
- #include "BatchStream.h"
- #include "NvInfer.h"
- //! \class EntropyCalibratorImpl
- //!
- //! \brief Implements common functionality for Entropy calibrators.
- //!
- template <typename TBatchStream>
- class EntropyCalibratorImpl
- {
- public:
- EntropyCalibratorImpl(
- TBatchStream stream, int firstBatch, std::string networkName, const char* inputBlobName, bool readCache = true)
- : mStream{stream}
- , mCalibrationTableName("CalibrationTable" + networkName)
- , mInputBlobName(inputBlobName)
- , mReadCache(readCache)
- {
- nvinfer1::Dims dims = mStream.getDims();
- mInputCount = samplesCommon::volume(dims);
- CHECK(cudaMalloc(&mDeviceInput, mInputCount * sizeof(float)));
- mStream.reset(firstBatch);
- }
- virtual ~EntropyCalibratorImpl()
- {
- CHECK(cudaFree(mDeviceInput));
- }
- int getBatchSize() const noexcept
- {
- return mStream.getBatchSize();
- }
- bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept
- {
- if (!mStream.next())
- {
- return false;
- }
- CHECK(cudaMemcpy(mDeviceInput, mStream.getBatch(), mInputCount * sizeof(float), cudaMemcpyHostToDevice));
- ASSERT(!strcmp(names[0], mInputBlobName));
- bindings[0] = mDeviceInput;
- return true;
- }
- const void* readCalibrationCache(size_t& length) noexcept
- {
- mCalibrationCache.clear();
- std::ifstream input(mCalibrationTableName, std::ios::binary);
- input >> std::noskipws;
- if (mReadCache && input.good())
- {
- std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
- std::back_inserter(mCalibrationCache));
- }
- length = mCalibrationCache.size();
- return length ? mCalibrationCache.data() : nullptr;
- }
- void writeCalibrationCache(const void* cache, size_t length) noexcept
- {
- std::ofstream output(mCalibrationTableName, std::ios::binary);
- output.write(reinterpret_cast<const char*>(cache), length);
- }
- private:
- TBatchStream mStream;
- size_t mInputCount;
- std::string mCalibrationTableName;
- const char* mInputBlobName;
- bool mReadCache{true};
- void* mDeviceInput{nullptr};
- std::vector<char> mCalibrationCache;
- };
- //! \class Int8EntropyCalibrator2
- //!
- //! \brief Implements Entropy calibrator 2.
- //! CalibrationAlgoType is kENTROPY_CALIBRATION_2.
- //!
- template <typename TBatchStream>
- class Int8EntropyCalibrator2 : public IInt8EntropyCalibrator2
- {
- public:
- Int8EntropyCalibrator2(
- TBatchStream stream, int firstBatch, const char* networkName, const char* inputBlobName, bool readCache = true)
- : mImpl(stream, firstBatch, networkName, inputBlobName, readCache)
- {
- }
- int getBatchSize() const noexcept override
- {
- return mImpl.getBatchSize();
- }
- bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
- {
- return mImpl.getBatch(bindings, names, nbBindings);
- }
- const void* readCalibrationCache(size_t& length) noexcept override
- {
- return mImpl.readCalibrationCache(length);
- }
- void writeCalibrationCache(const void* cache, size_t length) noexcept override
- {
- mImpl.writeCalibrationCache(cache, length);
- }
- private:
- EntropyCalibratorImpl<TBatchStream> mImpl;
- };
- #ifndef BATCH_STREAM_H
- #define BATCH_STREAM_H
- #include "NvInfer.h"
- #include "common.h"
- #include <algorithm>
- #include <stdio.h>
- #include <vector>
- class IBatchStream
- {
- public:
- virtual void reset(int firstBatch) = 0;
- virtual bool next() = 0;
- virtual void skip(int skipCount) = 0;
- virtual float* getBatch() = 0;
- virtual float* getLabels() = 0;
- virtual int getBatchesRead() const = 0;
- virtual int getBatchSize() const = 0;
- virtual nvinfer1::Dims getDims() const = 0;
- };
- class MNISTBatchStream : public IBatchStream
- {
- public:
- MNISTBatchStream(int batchSize, int maxBatches, const std::string& dataFile, const std::string& labelsFile,
- const std::vector<std::string>& directories)
- : mBatchSize{batchSize}
- , mMaxBatches{maxBatches}
- , mDims{3, {1, 28, 28}} //!< We already know the dimensions of MNIST images.
- {
- readDataFile(locateFile(dataFile, directories));
- readLabelsFile(locateFile(labelsFile, directories));
- }
- void reset(int firstBatch) override
- {
- mBatchCount = firstBatch;
- }
- bool next() override
- {
- if (mBatchCount >= mMaxBatches)
- {
- return false;
- }
- ++mBatchCount;
- return true;
- }
- void skip(int skipCount) override
- {
- mBatchCount += skipCount;
- }
- float* getBatch() override
- {
- return mData.data() + (mBatchCount * mBatchSize * samplesCommon::volume(mDims));
- }
- float* getLabels() override
- {
- return mLabels.data() + (mBatchCount * mBatchSize);
- }
- int getBatchesRead() const override
- {
- return mBatchCount;
- }
- int getBatchSize() const override
- {
- return mBatchSize;
- }
- nvinfer1::Dims getDims() const override
- {
- return Dims{4, {mBatchSize, mDims.d[0], mDims.d[1], mDims.d[2]}};
- }
- private:
- void readDataFile(const std::string& dataFilePath)
- {
- std::ifstream file{dataFilePath.c_str(), std::ios::binary};
- int magicNumber, numImages, imageH, imageW;
- file.read(reinterpret_cast<char*>(&magicNumber), sizeof(magicNumber));
- // All values in the MNIST files are big endian.
- magicNumber = samplesCommon::swapEndianness(magicNumber);
- ASSERT(magicNumber == 2051 && "Magic Number does not match the expected value for an MNIST image set");
- // Read number of images and dimensions
- file.read(reinterpret_cast<char*>(&numImages), sizeof(numImages));
- file.read(reinterpret_cast<char*>(&imageH), sizeof(imageH));
- file.read(reinterpret_cast<char*>(&imageW), sizeof(imageW));
- numImages = samplesCommon::swapEndianness(numImages);
- imageH = samplesCommon::swapEndianness(imageH);
- imageW = samplesCommon::swapEndianness(imageW);
- // The MNIST data is made up of unsigned bytes, so we need to cast to float and normalize.
- int numElements = numImages * imageH * imageW;
- std::vector<uint8_t> rawData(numElements);
- file.read(reinterpret_cast<char*>(rawData.data()), numElements * sizeof(uint8_t));
- mData.resize(numElements);
- std::transform(
- rawData.begin(), rawData.end(), mData.begin(), [](uint8_t val) { return static_cast<float>(val) / 255.f; });
- }
- void readLabelsFile(const std::string& labelsFilePath)
- {
- std::ifstream file{labelsFilePath.c_str(), std::ios::binary};
- int magicNumber, numImages;
- file.read(reinterpret_cast<char*>(&magicNumber), sizeof(magicNumber));
- // All values in the MNIST files are big endian.
- magicNumber = samplesCommon::swapEndianness(magicNumber);
- ASSERT(magicNumber == 2049 && "Magic Number does not match the expected value for an MNIST labels file");
- file.read(reinterpret_cast<char*>(&numImages), sizeof(numImages));
- numImages = samplesCommon::swapEndianness(numImages);
- std::vector<uint8_t> rawLabels(numImages);
- file.read(reinterpret_cast<char*>(rawLabels.data()), numImages * sizeof(uint8_t));
- mLabels.resize(numImages);
- std::transform(
- rawLabels.begin(), rawLabels.end(), mLabels.begin(), [](uint8_t val) { return static_cast<float>(val); });
- }
- int mBatchSize{0};
- int mBatchCount{0}; //!< The batch that will be read on the next invocation of next()
- int mMaxBatches{0};
- Dims mDims{};
- std::vector<float> mData{};
- std::vector<float> mLabels{};
- };
- #endif // BATCH_STREAM_H
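Finally, to test the effect of quantization in practice, I selected models of different complexity: AlexNet, ResNet50, and VGG13. For their parameter counts and FLOPs (the number of floating-point operations, not to be confused with FLOPS, floating-point operations per second), see Table 1 in Chapter 2 and Figure 27 below.
(a) Parameter count comparison
(b) Computation comparison
Figure 27 Comparison of the three models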
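In addition, power draw and GPU memory usage are measured with NVIDIA's command-line tool. Entering the following command in the Anaconda Prompt shows detailed GPU information, see Figure 28.
nvidia-smi -l 2 # -l sets the refresh interval in seconds
Figure 28 GPU information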
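Name: GPU model; the GPU in the figure is a Tesla T4
Disp.A: Display Active, whether the display on the GPU has been initialized
Volatile GPU-Util: GPU utilization
Uncorr. ECC: related to ECC (error checking and correction); the column reports the number of uncorrected ECC errors, or N/A when ECC is not available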
The results can also be written straight to a file with the following command; for details see "GPU之nvidia-smi命令详解" on Jianshu (jianshu.com)
nvidia-smi -l 2 --format=csv --filename=gpucost.csv --query-gpu=timestamp,memory.total,memory.used
Figure 29 Engine size
Figure 30 Runtime power consumption
Figure 31 GPU memory usage
Figure 32 Inference speed
Accuracy (%), BatchSize=5; the number in parentheses after INT8 is the dataset size
Network | FP32 | FP16 | INT8 (100) | INT8 (300) | INT8 (500) | INT8 (700) | INT8 (900) |
AlexNet | 87.84% | 87.73% | 88.64% | 88.52% | 88.52% | 88.52% | 88.41% |
ResNet50 | 97.61% | 97.61% | 97.73% | 97.61% | 97.61% | 97.50% | 97.39% |
VGG-13 | 97.39% | 97.39% | 97.39% | 97.27% | 97.16% | 97.16% | 97.05% |
Accuracy change from FP32 to INT8 (%); the FP32 and FP16 columns repeat the accuracies above for reference
Network | FP32 | FP16 | INT8 (100) | INT8 (300) | INT8 (500) | INT8 (700) | INT8 (900) |
AlexNet | 87.84% | 87.73% | 0.80% | 0.68% | 0.68% | 0.68% | 0.57% |
ResNet50 | 97.61% | 97.61% | 0.11% | 0.00% | 0.00% | -0.11% | -0.23% |
VGG-13 | 97.39% | 97.39% | 0.00% | -0.11% | -0.23% | -0.23% | -0.34% |
Table 7 Detection accuracy test
Figure 33 Detection accuracy test
Per-image inference time (us), DataSize=500
Network | BatchSize=1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
AlexNet | 132.199 | 122.701 | 122.666 | 119.574 | 122.919 | 121.464 | 142.217 | 177.197 |
ResNet50 | 476.866 | 446.152 | 436.84 | 405.513 | 448.45 | 473.267 | 454.716 | 433.675 |
VGG-13 | 617.144 | 618.869 | 614.11 | 609.547 | 625.358 | 633.083 | 634.107 | 691.968 |
Change in per-image inference time relative to BatchSize=1 (%); positive values mean a shorter time
Network | BatchSize=1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
AlexNet | 0.00% | 7.18% | 7.21% | 9.55% | 7.02% | 8.12% | -7.58% | -34.04% |
ResNet50 | 0.00% | 6.44% | 8.39% | 14.96% | 5.96% | 0.75% | 4.64% | 9.06% |
VGG-13 | 0.00% | -0.28% | 0.49% | 1.23% | -1.33% | -2.58% | -2.75% | -12.12% |
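The change rates in the table appear to be computed against the BatchSize=1 column, which can be checked with a one-line helper. The function name `latencyReductionPercent` is made up for this check, and the formula is inferred from the numbers rather than stated by the author.

```cpp
#include <cassert>

// Relative reduction in per-image latency versus a baseline, in percent.
// Positive means faster than the baseline, negative means slower.
double latencyReductionPercent(double baselineUs, double measuredUs)
{
    assert(baselineUs > 0.0);
    return (baselineUs - measuredUs) / baselineUs * 100.0;
}
```

For AlexNet, `latencyReductionPercent(132.199, 122.701)` is about 7.18 and `latencyReductionPercent(132.199, 177.197)` is about -34.04, matching the BatchSize=2 and BatchSize=128 entries.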
- //!
- //! SampleINT8.cpp
- //! This file contains the implementation of the sample. It creates the network using
- //! the caffe model.
- //! It can be run with the following command line:
- //! Command: ./sample_int8 [-h or --help] [-d=/path/to/data/dir or --datadir=/path/to/data/dir]
- //!
- #include "BatchStream.h"
- #include "EntropyCalibrator.h"
- #include "argsParser.h"
- #include "buffers.h"
- #include "common.h"
- #include "logger.h"
- #include "NvCaffeParser.h"
- #include "NvInfer.h"
- #include <cuda_runtime_api.h>
- #include <cstdlib>
- #include <fstream>
- #include <iostream>
- #include <sstream>
- using samplesCommon::SampleUniquePtr;
- const std::string gSampleName = "TensorRT.sample_int8";
- //!
- //! \brief The SampleINT8Params structure groups the additional parameters required by
- //! the INT8 sample.
- //!
- struct SampleINT8Params : public samplesCommon::CaffeSampleParams
- {
- int nbCalBatches; //!< The number of batches for calibration
- int calBatchSize; //!< The calibration batch size
- std::string networkName; //!< The name of the network
- };
- //! \brief The SampleINT8 class implements the INT8 sample
- //!
- //! \details It creates the network using a caffe model
- //!
- class SampleINT8
- {
- public:
- SampleINT8(const SampleINT8Params& params)
- : mParams(params)
- , mEngine(nullptr)
- {
- initLibNvInferPlugins(&sample::gLogger.getTRTLogger(), "");
- }
- //!
- //! \brief Function builds the network engine
- //!
- bool build(DataType dataType);
- //!
- //! \brief Runs the TensorRT inference engine for this sample
- //!
- bool infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches);
- //!
- //! \brief Cleans up any state created in the sample class
- //!
- bool teardown();
- private:
- SampleINT8Params mParams; //!< The parameters for the sample.
- nvinfer1::Dims mInputDims; //!< The dimensions of the input to the network.
- std::shared_ptr<nvinfer1::ICudaEngine> mEngine; //!< The TensorRT engine used to run the network
- //!
- //! \brief Parses a Caffe model and creates a TensorRT network
- //!
- bool constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
- SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
- SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, DataType dataType);
- //!
- //! \brief Reads the input and stores it in a managed buffer
- //!
- bool processInput(const samplesCommon::BufferManager& buffers, const float* data);
- //!
- //! \brief Scores model
- //!
- int calculateScore(
- const samplesCommon::BufferManager& buffers, float* labels, int batchSize, int outputSize, int threshold);
- };
- //!
- //! \brief Creates the network, configures the builder and creates the network engine
- //!
- //! \details This function creates the network by parsing the caffe model and builds
- //! the engine that will be used to run the model (mEngine)
- //!
- //! \return Returns true if the engine was created successfully and false otherwise
- //!
- bool SampleINT8::build(DataType dataType)
- {
- auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
- if (!builder)
- {
- return false;
- }
- if ((dataType == DataType::kINT8 && !builder->platformHasFastInt8())
- || (dataType == DataType::kHALF && !builder->platformHasFastFp16()))
- {
- return false;
- }
- auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
- if (!network)
- {
- return false;
- }
- auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
- if (!config)
- {
- return false;
- }
- auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
- if (!parser)
- {
- return false;
- }
- auto constructed = constructNetwork(builder, network, config, parser, dataType);
- if (!constructed)
- {
- return false;
- }
- ASSERT(network->getNbInputs() == 1);
- mInputDims = network->getInput(0)->getDimensions();
- ASSERT(mInputDims.nbDims == 3);
- return true;
- }
- //!
- //! \brief Uses a caffe parser to create the network and marks the
- //! output layers
- //!
- //! \param network Pointer to the network that will be populated with the network
- //!
- //! \param builder Pointer to the engine builder
- //!
- bool SampleINT8::constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
- SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
- SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, DataType dataType)
- {
- mEngine = nullptr;
- const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor
- = parser->parse(locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
- locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(), *network,
- dataType == DataType::kINT8 ? DataType::kFLOAT : dataType);
- for (auto& s : mParams.outputTensorNames)
- {
- network->markOutput(*blobNameToTensor->find(s.c_str()));
- }
- // Calibrator life time needs to last until after the engine is built.
- std::unique_ptr<IInt8Calibrator> calibrator;
- config->setAvgTimingIterations(1);
- config->setMinTimingIterations(1);
- config->setMaxWorkspaceSize(1_GiB);
- if (dataType == DataType::kHALF)
- {
- config->setFlag(BuilderFlag::kFP16);
- }
- if (dataType == DataType::kINT8)
- {
- config->setFlag(BuilderFlag::kINT8);
- }
- builder->setMaxBatchSize(mParams.batchSize);
- if (dataType == DataType::kINT8)
- {
- MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches, "train-images-idx3-ubyte",
- "train-labels-idx1-ubyte", mParams.dataDirs);
- calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
- calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));
- config->setInt8Calibrator(calibrator.get());
- }
- if (mParams.dlaCore >= 0)
- {
- samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);
- if (mParams.batchSize > builder->getMaxDLABatchSize())
- {
- sample::gLogError << "Requested batch size " << mParams.batchSize
- << " is greater than the max DLA batch size of " << builder->getMaxDLABatchSize()
- << ". Reducing batch size accordingly." << std::endl;
- return false;
- }
- }
- // CUDA stream used for profiling by the builder.
- auto profileStream = samplesCommon::makeCudaStream();
- if (!profileStream)
- {
- return false;
- }
- config->setProfileStream(*profileStream);
- SampleUniquePtr<IHostMemory> plan{builder->buildSerializedNetwork(*network, *config)};
- if (!plan)
- {
- return false;
- }
- SampleUniquePtr<IRuntime> runtime{createInferRuntime(sample::gLogger.getTRTLogger())};
- if (!runtime)
- {
- return false;
- }
- mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
- runtime->deserializeCudaEngine(plan->data(), plan->size()), samplesCommon::InferDeleter());
- if (!mEngine)
- {
- return false;
- }
- return true;
- }
- //!
- //! \brief Runs the TensorRT inference engine for this sample
- //!
- //! \details This function is the main execution function of the sample. It allocates the buffer,
- //! sets inputs and executes the engine.
- //!
- bool SampleINT8::infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches)
- {
- float ms{0.0f};
- // Create RAII buffer manager object
- samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);
- auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
- if (!context)
- {
- return false;
- }
- MNISTBatchStream batchStream(mParams.batchSize, nbScoreBatches + firstScoreBatch, "train-images-idx3-ubyte",
- "train-labels-idx1-ubyte", mParams.dataDirs);
- batchStream.skip(firstScoreBatch);
- Dims outputDims = context->getEngine().getBindingDimensions(
- context->getEngine().getBindingIndex(mParams.outputTensorNames[0].c_str()));
- int64_t outputSize = samplesCommon::volume(outputDims);
- int top1{0}, top5{0};
- float totalTime{0.0f};
- while (batchStream.next())
- {
- // Read the input data into the managed buffers
- ASSERT(mParams.inputTensorNames.size() == 1);
- if (!processInput(buffers, batchStream.getBatch()))
- {
- return false;
- }
- // Memcpy from host input buffers to device input buffers
- buffers.copyInputToDevice();
- cudaStream_t stream;
- CHECK(cudaStreamCreate(&stream));
- // Use CUDA events to measure inference time
- cudaEvent_t start, end;
- CHECK(cudaEventCreateWithFlags(&start, cudaEventBlockingSync));
- CHECK(cudaEventCreateWithFlags(&end, cudaEventBlockingSync));
- cudaEventRecord(start, stream);
- bool status = context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr);
- if (!status)
- {
- return false;
- }
- cudaEventRecord(end, stream);
- cudaEventSynchronize(end);
- cudaEventElapsedTime(&ms, start, end);
- cudaEventDestroy(start);
- cudaEventDestroy(end);
- totalTime += ms;
- // Memcpy from device output buffers to host output buffers
- buffers.copyOutputToHost();
- CHECK(cudaStreamDestroy(stream));
- top1 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 1);
- top5 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 5);
- if (batchStream.getBatchesRead() % 100 == 0)
- {
- sample::gLogInfo << "Processing next set of max 100 batches" << std::endl;
- }
- }
- int imagesRead = (batchStream.getBatchesRead() - firstScoreBatch) * mParams.batchSize;
- score[0] = float(top1) / float(imagesRead);
- score[1] = float(top5) / float(imagesRead);
- sample::gLogInfo << "Top1: " << score[0] << ", Top5: " << score[1] << std::endl;
- sample::gLogInfo << "Processing " << imagesRead << " images averaged " << totalTime / imagesRead << " ms/image and "
- << totalTime / batchStream.getBatchesRead() << " ms/batch." << std::endl;
- return true;
- }
- //!
- //! \brief Cleans up any state created in the sample class
- //!
- bool SampleINT8::teardown()
- {
- //! Clean up the libprotobuf files as the parsing is complete
- //! \note It is not safe to use any other part of the protocol buffers library after
- //! ShutdownProtobufLibrary() has been called.
- nvcaffeparser1::shutdownProtobufLibrary();
- return true;
- }
- //!
- //! \brief Reads the input and stores it in a managed buffer
- //!
- bool SampleINT8::processInput(const samplesCommon::BufferManager& buffers, const float* data)
- {
- // Fill data buffer
- float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mParams.inputTensorNames[0]));
- std::memcpy(hostDataBuffer, data, mParams.batchSize * samplesCommon::volume(mInputDims) * sizeof(float));
- return true;
- }
- //!
- //! \brief Scores model
- //!
- int SampleINT8::calculateScore(
- const samplesCommon::BufferManager& buffers, float* labels, int batchSize, int outputSize, int threshold)
- {
- float* probs = static_cast<float*>(buffers.getHostBuffer(mParams.outputTensorNames[0]));
- int success = 0;
- for (int i = 0; i < batchSize; i++)
- {
- float *prob = probs + outputSize * i, correct = prob[(int) labels[i]];
- int better = 0;
- for (int j = 0; j < outputSize; j++)
- {
- if (prob[j] >= correct)
- {
- better++;
- }
- }
- if (better <= threshold)
- {
- success++;
- }
- }
- return success;
- }
- //!
- //! \brief Initializes members of the params struct using the command line args
- //!
- SampleINT8Params initializeSampleParams(const samplesCommon::Args& args, int batchSize)
- {
- SampleINT8Params params;
- // Use directories provided by the user, in addition to default directories.
- params.dataDirs = args.dataDirs;
- params.dataDirs.emplace_back("data/mnist/");
- params.dataDirs.emplace_back("int8/mnist/");
- params.dataDirs.emplace_back("samples/mnist/");
- params.dataDirs.emplace_back("data/samples/mnist/");
- params.dataDirs.emplace_back("data/int8/mnist/");
- params.dataDirs.emplace_back("data/int8_samples/mnist/");
- params.batchSize = batchSize;
- params.dlaCore = args.useDLACore;
- params.nbCalBatches = 10;
- params.calBatchSize = 50;
- params.inputTensorNames.push_back("data");
- params.outputTensorNames.push_back("prob");
- params.prototxtFileName = "deploy.prototxt";
- params.weightsFileName = "mnist_lenet.caffemodel";
- params.networkName = "mnist";
- return params;
- }
- //!
- //! \brief Prints the help information for running this sample
- //!
- void printHelpInfo()
- {
- std::cout << "Usage: ./sample_int8 [-h or --help] [-d or --datadir=<path to data directory>] "
- "[--useDLACore=<int>]"
- << std::endl;
- std::cout << "--help, -h Display help information" << std::endl;
- std::cout << "--datadir Specify path to a data directory, overriding the default. This option can be used "
- "multiple times to add multiple directories."
- << std::endl;
- std::cout << "--useDLACore=N Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
- "where n is the number of DLA engines on the platform."
- << std::endl;
- std::cout << "batch=N Set batch size (default = 32)." << std::endl;
- std::cout << "start=N Set the first batch to be scored (default = 16). All batches before this batch will "
- "be used for calibration."
- << std::endl;
- std::cout << "score=N Set the number of batches to be scored (default = 1800)." << std::endl;
- }
- int main(int argc, char** argv)
- {
- if (argc >= 2 && (!strncmp(argv[1], "--help", 6) || !strncmp(argv[1], "-h", 2)))
- {
- printHelpInfo();
- return EXIT_SUCCESS;
- }
- // By default we score over 57600 images starting at 512, so we don't score those used to search calibration
- int batchSize = 32;
- int firstScoreBatch = 16;
- int nbScoreBatches = 1800;
- // Parse extra arguments
- for (int i = 1; i < argc; ++i)
- {
- if (!strncmp(argv[i], "batch=", 6))
- {
- batchSize = atoi(argv[i] + 6);
- }
- else if (!strncmp(argv[i], "start=", 6))
- {
- firstScoreBatch = atoi(argv[i] + 6);
- }
- else if (!strncmp(argv[i], "score=", 6))
- {
- nbScoreBatches = atoi(argv[i] + 6);
- }
- }
- if (batchSize > 128)
- {
- sample::gLogError << "Please provide batch size <= 128" << std::endl;
- return EXIT_FAILURE;
- }
- if ((firstScoreBatch + nbScoreBatches) * batchSize > 60000)
- {
- sample::gLogError << "Only 60000 images available" << std::endl;
- return EXIT_FAILURE;
- }
- samplesCommon::Args args;
- samplesCommon::parseArgs(args, argc, argv);
- SampleINT8 sample(initializeSampleParams(args, batchSize));
- auto sampleTest = sample::gLogger.defineTest(gSampleName, argc, argv);
- sample::gLogger.reportTestStart(sampleTest);
- sample::gLogInfo << "Building and running a GPU inference engine for INT8 sample" << std::endl;
- std::vector<std::string> dataTypeNames = {"FP32", "FP16", "INT8"};
- std::vector<std::string> topNames = {"Top1", "Top5"};
- std::vector<DataType> dataTypes = {DataType::kFLOAT, DataType::kHALF, DataType::kINT8};
- std::vector<std::vector<float>> scores(3, std::vector<float>(2, 0.0f));
- for (size_t i = 0; i < dataTypes.size(); i++)
- {
- sample::gLogInfo << dataTypeNames[i] << " run:" << nbScoreBatches << " batches of size " << batchSize
- << " starting at " << firstScoreBatch << std::endl;
- if (!sample.build(dataTypes[i]))
- {
- if (!samplesCommon::isDataTypeSupported(dataTypes[i]))
- {
- sample::gLogWarning << "Skipping " << dataTypeNames[i]
- << " since the platform does not support this data type." << std::endl;
- continue;
- }
- return sample::gLogger.reportFail(sampleTest);
- }
- if (!sample.infer(scores[i], firstScoreBatch, nbScoreBatches))
- {
- return sample::gLogger.reportFail(sampleTest);
- }
- }
- auto isApproximatelyEqual = [](float a, float b, double tolerance) { return (std::abs(a - b) <= tolerance); };
- const double tolerance{0.01};
- const double goldenMNIST{0.99};
- if ((scores[0][0] < goldenMNIST) || (scores[0][1] < goldenMNIST))
- {
- sample::gLogError << "FP32 accuracy is less than 99%: Top1 = " << scores[0][0] << ", Top5 = " << scores[0][1]
- << "." << std::endl;
- return sample::gLogger.reportFail(sampleTest);
- }
- for (unsigned i = 0; i < topNames.size(); i++)
- {
- for (unsigned j = 1; j < dataTypes.size(); j++)
- {
- if (scores[j][i] != 0.0f && !isApproximatelyEqual(scores[0][i], scores[j][i], tolerance))
- {
- sample::gLogError << "FP32(" << scores[0][i] << ") and " << dataTypeNames[j] << "(" << scores[j][i]
- << ") " << topNames[i] << " accuracy differ by more than " << tolerance << "."
- << std::endl;
- return sample::gLogger.reportFail(sampleTest);
- }
- }
- }
- if (!sample.teardown())
- {
- return sample::gLogger.reportFail(sampleTest);
- }
- return sample::gLogger.reportPass(sampleTest);
- }