
【论文翻译】卷积神经网络研究综述


论文题目:卷积神经网络研究综述
论文来源:卷积神经网络研究综述
翻译人:BDML@CQUT实验室

卷积神经网络研究综述

Review of Convolutional Neural Network

周飞燕 金林鹏 董军

摘 要

作为一个十余年来快速发展的崭新领域,深度学习受到了越来越多研究者的关注,它在特征提取和模型拟合上都有着相较于浅层模型显然的优势。深度学习善于从原始输入数据中挖掘越来越抽象的分布式特征表示,而这些表示具有良好的泛化能力。它解决了过去人工智能中被认为难以解决的一些问题。且随着训练数据集数量的显著增长以及芯片处理能力的剧增,它在目标检测和计算机视觉、自然语言处理、语音识别和语义分析等领域成效卓然,因此也促进了人工智能的发展。深度学习是包含多级非线性变换的层级机器学习方法,深层神经网络是目前的主要形式,其神经元间的连接模式受启发于动物视觉皮层组织,而卷积神经网络则是其中一种经典而广泛应用的网络结构。卷积神经网络的局部连接、权值共享及池化操作等特性使之可以有效地降低网络的复杂度,减少训练参数的数目,使模型对平移、扭曲、缩放具有一定程度的不变性,并具有强鲁棒性和容错能力,且也易于训练和优化网络结构。基于这些优越的特性,它在各种信号和信息处理任务中的性能优于标准的全连接神经网络。本文首先概述了卷积神经网络的发展历史,然后分别描述了神经元模型、多层感知器的结构。接着,详细分析了卷积神经网络的结构,包括卷积层、取样层、全连接层,它们发挥着不同的作用。然后,讨论了网中网结构、空间变换网络等改进的卷积神经网络。同时,还分别介绍了卷积神经网络的监督学习、无监督学习训练方法以及一些常用的开源工具。此外,本文以图像分类、人脸识别、音频检索、心电图分类及目标检测等为例,对卷积神经网络的应用作了归纳。卷积神经网络与递归神经网络的集成是一个途径。为了给读者以尽可能多的借鉴,本文还设计并试验了不同参数及不同深度的卷积神经网络以图把握各参数间的相互关系及不同参数设置对结果的影响。最后,给出了卷积神经网络及其应用中待解决的若干问题。
关键词 :卷积神经网络;深度学习;网络结构;训练方法;领域数据

Abstract

As a new and rapidly growing field for more than ten years, deep learning has gained more and more attention from researchers. Compared with shallow architectures, it has great advantages in both feature extraction and model fitting. It is very good at discovering increasingly abstract, distributed feature representations with strong generalization ability from the raw input data, and it has successfully solved some problems that were considered difficult to solve in artificial intelligence in the past. Furthermore, with the dramatic growth in the size of training data sets and the drastic increase in chip processing capability, deep learning has made significant progress and has been used in a broad range of applications such as object detection, computer vision, natural language processing, speech recognition and semantic parsing, thus also promoting the advancement of artificial intelligence. Deep learning, which consists of multiple levels of non-linear transformations, is a hierarchical machine learning method. The deep neural network is the main form of present deep learning methods, in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex, and the convolutional neural network that has been widely used is a classic kind of deep neural network. Its characteristics such as local connections, shared weights and pooling can reduce the complexity of the network model and the number of training parameters; they also give the model some degree of invariance to shift, distortion and scale as well as strong robustness and fault tolerance, and make it easy to train and optimize the network structure. Based on these predominant characteristics, it has been shown to outperform standard fully connected neural networks in a variety of signal and information processing tasks. In this paper, first of all, the historical development of the convolutional neural network is summarized. After that, the structures of a neuron model and the multilayer perceptron are described. Later on, a detailed analysis of the convolutional neural network architecture, which is composed of a number of convolutional layers and pooling layers followed by fully connected layers, is given; different kinds of layers play different roles. Then, a few improved variants of the convolutional neural network, such as network in network and spatial transformer networks, are described. Meanwhile, the supervised and unsupervised training methods of convolutional neural networks and some widely used open source tools are introduced, respectively. In addition, the applications of convolutional neural networks to image classification, face recognition, audio retrieval, electrocardiogram classification, object detection and so on are summarized. Integrating convolutional neural networks with recurrent neural networks is one possible approach. Finally, convolutional neural network structures with different parameters and different depths are designed; through a series of experiments, the relations between these parameters and the influence of different parameter settings on the results are revealed. Some remaining issues of the convolutional neural network and its applications are also given.
Key words: convolutional neural network; deep learning; network structure; training method; domain data

1 引 言

1 Introduction

人工神经元网络(Artificial Neural Network,ANN)是对生物神经网络的一种模拟和近似,是由大量神经元通过相互连接而构成的自适应非线性动态网络系统。1943 年,心理学家 McCulloch 和数理逻辑学家 Pitts 提出了神经元的第一个数学模型—MP 模型[1]。MP 模型具有开创意义,为后来的研究工作提供了依据。到了上世纪 50 年代末、60 年代初,Rosenblatt 在 MP 模型的基础之上增加学习功能,提出了单层感知器模型,第一次把神经网络的研究付诸实践[2-3]。但是单层感知器网络模型不能够处理线性不可分问题。直至 1986 年,Rumelhart 和 Hinton 等提出了一种按误差逆传播算法训练的多层前馈网络—反向传播网络(Back Propagation Network,简称 BP 网络),解决了原来一些单层感知器所不能解决的问题[4]。由于在 90 年代,各种浅层机器学习模型相继被提出,较经典的如支持向量机[5],而且当增加神经网络的层数时传统的 BP 网络会遇到局部最优、过拟合及梯度扩散等问题,这些使得深度模型的研究被搁置。

An artificial neural network (ANN) is a simulation of and approximation to biological neural networks; it is an adaptive nonlinear dynamic system composed of a large number of interconnected neurons. In 1943, the psychologist McCulloch and the mathematical logician Pitts proposed the first mathematical model of a neuron, the MP model [1]. The MP model was pioneering work and provided the basis for later research. In the late 1950s and early 1960s, Rosenblatt added a learning capability to the MP model and proposed the single-layer perceptron, putting the study of neural networks into practice for the first time [2-3]. However, the single-layer perceptron cannot handle linearly inseparable problems. It was not until 1986 that Rumelhart, Hinton and others proposed a multi-layer feedforward network trained by the error back-propagation algorithm, the back propagation network (BP network), which solved some problems that single-layer perceptrons could not solve [4]. In the 1990s, various shallow machine learning models were proposed, a classic one being the support vector machine [5]; moreover, when the number of layers is increased, the traditional BP network runs into problems such as local optima, overfitting and gradient diffusion, so research on deep models was shelved for a time.

2006 年,Hinton 等人[6]在《Science》上发文,其主要观点有:1)多隐层的人工神经网络具有优异的特征学习能力;2)可通过“逐层预训练”(layer-wise pre-training)来有效克服深层神经网络在训练上的困难,从此引出了深度学习(Deep Learning)的研究,同时也掀起了人工神经网络的又一热潮[7]。在深度学习的逐层预训练算法中首先将无监督学习应用于网络每一层的预训练,每次只无监督训练一层,并将该层的训练结果作为其下一层的输入,然后再用有监督学习(BP 算法)微调预训练好的网络[8-10]。这种深度学习预训练方法在手写体数字识别或者行人检测中,特别是当标注样本数量有限时能使识别效果或者检测效果得到显著提升[11]。Bengio[12]系统地介绍了深度学习所包含的网络结构和学习方法。目前,常用的深度学习模型有深度置信网络(Deep Belief Network,DBN)、层叠自动去噪编码机(Stacked Denoising Autoencoders,SDA)、卷积神经网络(Convolutional Neural Network,CNN)[19-20]等。2016 年 1 月 28 日,英国《Nature》杂志以封面文章形式报道:谷歌旗下人工智能公司深灵(DeepMind)开发的 AlphaGo 以 5 比 0 战胜了卫冕欧洲冠军—本以为大概十年后人工智能才能做到[21]。AlphaGo 主要采用价值网络(value networks)来评估棋盘的位置,用策略网络(policy networks)来选择下棋步法,这两种网络都是深层神经网络模型,AlphaGo 所取得的成果是深度学习带来的人工智能的又一次突破,这也说明了深度学习具有强大的潜力。

In 2006, Hinton et al. [6] published a paper in Science whose main points are: 1) an artificial neural network with multiple hidden layers has an excellent feature learning ability; 2) the difficulty of training a deep neural network can be effectively overcome by layer-wise pre-training. This started the study of deep learning and set off another wave of research on artificial neural networks [7]. In the layer-wise pre-training algorithm of deep learning, unsupervised learning is first applied to pre-train each layer of the network, one layer at a time, and the result of each layer is used as the input of the next layer; the pre-trained network is then fine-tuned by supervised learning (the BP algorithm) [8-10]. This pre-training method can significantly improve recognition or detection performance in handwritten digit recognition and pedestrian detection, especially when the number of labelled samples is limited [11]. Bengio [12] systematically introduced the network structures and learning methods of deep learning. Commonly used deep learning models include the deep belief network (DBN), stacked denoising autoencoders (SDA), and the convolutional neural network (CNN) [19-20]. On 28 January 2016, Nature reported in a cover article that AlphaGo, developed by Google's artificial intelligence company DeepMind, defeated the reigning European champion 5-0, something artificial intelligence had been expected to achieve only about a decade later [21]. AlphaGo mainly uses value networks to evaluate board positions and policy networks to select moves; both are deep neural network models. The achievement of AlphaGo is another breakthrough in artificial intelligence brought about by deep learning, which shows its great potential.

事实上,早在 2006 年以前就已有人提出一种学习效率很高的深度学习模型—卷积神经网络。在上世纪 80、90 年代,一些研究者发表了 CNN 的相关研究工作,且在几个模式识别领域尤其是手写数字识别中取得了良好的识别效果[22- 23]。然而此时的CNN 只适合做小图片的识别,对于大规模数据,识别效果不佳[7]。直至 2012 年,Krizhevsky 等使用扩展了深度的CNN在ImageNet 大规模视觉识别挑战竞赛(ImageNet Large Scale Visual Recognition Challenge,LSVRC)中取得了当时最佳的分类效果,使得 CNN 越来越受到研究者们的重视[24]。

In fact, a deep learning model with high learning efficiency, the convolutional neural network, had already been proposed before 2006. In the 1980s and 1990s, a number of researchers published work on CNNs and achieved good recognition results in several pattern recognition fields, especially handwritten digit recognition [22-23]. At that time, however, CNNs were only suitable for recognizing small images and did not perform well on large-scale data [7]. It was not until 2012, when Krizhevsky et al. used a CNN of extended depth to obtain the best classification result of that time in the ImageNet Large Scale Visual Recognition Challenge (LSVRC), that CNNs began to receive more and more attention from researchers [24].

2 卷积神经网络概述

2 Overview of Convolutional Neural Networks

2.1 神经元

2.1 Neurons

神经元是人工神经网络的基本处理单元,一般是多输入单输出的单元,其结构模型如下图 1 所示:

图 1 神经元模型

其中,xi表示输入信号,n 个输入信号同时输入神经元 j。 wij表示输入信号与神经元 j 连接的权重值,bj表示神经元的内部状态即偏置值,yj为神经元的输出。输入与输出之间的对应关系可用下式表示:

f(.)为激励函数,其选择可以有很多种,可以是线性纠正函数(rectified linear unit, ReLU)[25],sigmoid 函数、tanh(x)函数、径向基函数等[26]。

Neurons are the basic processing units of artificial neural networks, generally multi-input single-output units, and their structural models are shown in figure 1 below:

Fig.1 neuron model

Here xi denotes the input signals, and the n input signals are fed into neuron j simultaneously; wij denotes the weight of the connection between input signal xi and neuron j, bj denotes the internal state of the neuron, i.e. its bias value, and yj is the output of the neuron. The correspondence between input and output can be expressed as follows:

There are many possible choices for the activation function f(.): it can be the rectified linear unit (ReLU) [25], the sigmoid function, the tanh(x) function, a radial basis function, and so on [26].
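To make the neuron's input-output relation yj = f(Σi wij·xi + bj) and the choice of activation function concrete, here is a minimal NumPy sketch (illustrative only; the function and variable names below are not from the paper):

```python
import numpy as np

def neuron_output(x, w, b, activation="relu"):
    """Single neuron: weighted sum of the inputs plus a bias, passed through f(.)."""
    z = np.dot(w, x) + b               # z = sum_i w_ij * x_i + b_j
    if activation == "relu":           # rectified linear unit
        return np.maximum(0.0, z)
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))
    if activation == "tanh":
        return np.tanh(z)
    raise ValueError("unknown activation")

# example: n = 4 input signals feeding neuron j
x = np.array([0.5, -1.0, 2.0, 0.1])
w = np.random.randn(4) * 0.1           # connection weights w_ij
b = 0.0                                # bias b_j
print(neuron_output(x, w, b, "relu"))
```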

2.2 多层感知器

2.2 Multi-layer perceptron

多层感知器(Multilayer Perceptron,MLP)是由输入层、隐含层(一层或者多层)及输出层构成的神经网络模型,它可以解决单层感知器不能解决的线性不可分问题。图 2 是含有 2 个隐含层的多层感知器网络拓扑结构图。

图 2 多层感知器结构图

输入层神经元接收输入信号,隐含层和输出层的每一个神经元与之相邻层的所有神经元连接,即全连接,同一层的神经元间不相连。图 2 中,有箭头的线段表示神经元间的连接和信号传输的方向,且每个连接都有一个连接权值。隐含层和输出层中每一个神经元的输入为前一层所有神经元输出值的加权和。假设 Xml是 MLP 中第l层第 m个神经元的输入值, Yml和Bml分别为该神经元输出值和偏置值, 为该神经元与第l-1层第i个神经元的连接权值,则有:

当多层感知器用于分类时,其输入神经元个数为输入信号的维数,输出神经元个数为类别数,隐含层个数及隐层神经元个数视具体情况而定。但在实际应用中,由于受到参数学习效率影响,一般使用不超过 3 层的浅层模型。BP 算法可分为两个阶段:前向传播和后向传播,其后向传播始于 MLP 的输出层。以图 2 为例,则损失函数为[27]:

其中第l层为输出层,tj为输出层第j个神经元的期望输出,对损失函数求一阶偏导,则网络权值更新公式为:

其中,η为学习率。

The multilayer perceptron (MLP) is a neural network model composed of an input layer, one or more hidden layers, and an output layer; it can solve linearly inseparable problems that the single-layer perceptron cannot solve. Fig. 2 shows the topology of a multilayer perceptron with two hidden layers.

Fig .2 Structure Chart of Multi-layer Perceptron

The input-layer neurons receive the input signals. Every neuron in the hidden and output layers is connected to all neurons of the adjacent layers (full connection), while neurons within the same layer are not connected to each other. In Fig. 2, the arrowed line segments indicate the connections between neurons and the direction of signal transmission, and each connection has a connection weight. The input of each neuron in the hidden and output layers is the weighted sum of the outputs of all neurons in the previous layer. Suppose Xml is the input of the m-th neuron in layer l of the MLP, and Yml and Bml are the output and bias of that neuron; with the connection weight between this neuron and the i-th neuron in layer l-1, we have:

When the multilayer perceptron is used for classification, the number of input neurons equals the dimension of the input signal, the number of output neurons equals the number of classes, and the number of hidden layers and of hidden neurons depends on the specific task. In practical applications, however, shallow models with no more than three layers are generally used because of the limited efficiency of parameter learning. The BP algorithm consists of two stages, forward propagation and backward propagation, with backward propagation starting from the output layer of the MLP. Taking Fig. 2 as an example, the loss function is [27]:

Here layer l is the output layer and tj is the expected output of the j-th neuron in the output layer. Taking the first-order partial derivative of the loss function gives the weight update formula of the network, where η is the learning rate.
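The two BP stages and the gradient-descent weight update w ← w − η·∂E/∂w can be illustrated with a small NumPy sketch of one training step for a toy MLP with sigmoid units and a squared-error loss (the layer sizes and the learning rate are illustrative assumptions, not values from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy MLP with one hidden layer; loss E = 0.5 * sum_j (t_j - y_j)^2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (8, 4)), np.zeros(8)   # input(4) -> hidden(8)
W2, b2 = rng.normal(0, 0.1, (3, 8)), np.zeros(3)   # hidden(8) -> output(3)
eta = 0.1                                          # learning rate

x = rng.normal(size=4)
t = np.array([1.0, 0.0, 0.0])                      # expected (one-hot) output

# forward propagation
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)

# backward propagation, starting from the output layer
delta2 = (y - t) * y * (1 - y)                     # dE/dz at the output layer
delta1 = (W2.T @ delta2) * h * (1 - h)             # dE/dz at the hidden layer

# weight update: w <- w - eta * dE/dw
W2 -= eta * np.outer(delta2, h); b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x); b1 -= eta * delta1
```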

2.3 卷积神经网络

2.3 Convolutional Neural Networks

1962 年,生物学家 Hubel 和 Wiesel 通过对猫脑视觉皮层的研究,发现在视觉皮层中存在一系列复杂构造的细胞,这些细胞对视觉输入空间的局部区域很敏感,它们被称为“感受野”[28]。感受野以某种方式覆盖整个视觉域,它在输入空间中起局部作用,因而能够更好地挖掘出存在于自然图像中强烈的局部空间相关性。文献[28]将被称为感受野的这些细胞分为简单细胞和复杂细胞两种类型。根据 Hubel-Wiesel 的层级模型,在视觉皮层中的神经网络有一个层级结构:LGB(外侧膝状体)→简单细胞→复杂细胞→低阶超复杂细胞→高阶超复杂细胞[29]。低阶超复杂细胞与高阶超复杂细胞之间的神经网络结构类似于简单细胞和复杂细胞间的神经网络结构。在该层级结构中,处于较高阶段的细胞通常会有这样一个倾向:对刺激模式更复杂的特征进行选择性响应;同时也具有一个更大的感受野,对刺激模式位置的移动也更不敏感。1980 年,Fukushima 根据 Hubel 和 Wiesel 的层级模型提出了结构与之类似的神经认知机(Neocognitron)[29]。神经认知机采用简单细胞层(S-layer,S 层)和复杂细胞层(C-layer,C 层)交替组成,其中 S 层与 Hubel-Wiesel 层级模型中的简单细胞层或者低阶超复杂细胞层相对应,C 层对应于复杂细胞层或者高阶超复杂细胞层。S 层能够最大程度地响应感受野内的特定边缘刺激,提取其输入层的局部特征,C 层对来自确切位置的刺激具有局部不敏感性。尽管在神经认知机中没有像 BP 算法那样的全局监督学习过程可利用,但它仍可认为是 CNN 的第一个工程实现网络,卷积和下采样分别受启发于 Hubel-Wiesel 概念的简单细胞和复杂细胞,它能够准确识别具有位移和轻微形变的输入模式[29-30]。随后,LeCun 等基于 Fukushima 的研究工作使用误差梯度回传方法设计并训练了 CNN(该模型称为 LeNet-5),LeNet-5 是经典的 CNN 结构,后续有许多工作基于此进行改进,它在一些模式识别领域中取得了良好的分类效果[19]。

CNN 的基本结构由输入层、卷积层、取样层、全连接层及输出层构成。卷积层和取样层一般会取若干个,采用卷积层和取样层交替设置,即一个卷积层连接一个取样层,取样层后再连接一个卷积层,依此类推。由于卷积层中输出特征面的每个神经元与其输入进行局部连接,并通过对应的连接权值与局部输入进行加权求和再加上偏置值,得到该神经元输入值,该过程等同于卷积过程,卷积神经网络也由此而得名[19]。

In 1962, the biologists Hubel and Wiesel, studying the visual cortex of the cat, discovered a series of cells with complex structures in the visual cortex. These cells are sensitive to local regions of the visual input space and are called receptive fields [28]. The receptive fields cover the whole visual field in some way and act locally in the input space, so they can better exploit the strong local spatial correlations present in natural images. Reference [28] divides these cells into two types, simple cells and complex cells. According to the Hubel-Wiesel hierarchical model, the neural network in the visual cortex has a hierarchical structure: LGB (lateral geniculate body) → simple cells → complex cells → lower-order hypercomplex cells → higher-order hypercomplex cells [29]. The network structure between lower-order and higher-order hypercomplex cells is similar to that between simple cells and complex cells. In this hierarchy, cells at higher stages tend to respond selectively to more complex features of the stimulus pattern; they also have larger receptive fields and are less sensitive to shifts in the position of the stimulus pattern. In 1980, Fukushima proposed the neocognitron, whose structure is similar to the hierarchical model of Hubel and Wiesel [29]. The neocognitron consists of alternating simple cell layers (S-layers) and complex cell layers (C-layers); the S-layer corresponds to the simple cell layer or the lower-order hypercomplex cell layer of the Hubel-Wiesel model, and the C-layer corresponds to the complex cell layer or the higher-order hypercomplex cell layer. The S-layer responds maximally to specific edge stimuli within its receptive field and extracts local features of its input layer, while the C-layer is locally insensitive to the exact position of the stimuli. Although no globally supervised learning procedure such as the BP algorithm was available in the neocognitron, it can still be regarded as the first engineering implementation of a CNN; its convolution and downsampling are inspired by the simple and complex cells of the Hubel-Wiesel concept, and it can accurately recognize input patterns with displacement and slight deformation [29-30]. Later, building on Fukushima's work, LeCun et al. designed and trained a CNN (called LeNet-5) using the error gradient back-propagation method. LeNet-5 is a classic CNN structure on which much subsequent work has been based, and it achieved good classification results in several pattern recognition fields [19].

The basic structure of a CNN consists of an input layer, convolutional layers, pooling (sampling) layers, fully connected layers and an output layer. Several convolutional and pooling layers are usually used, arranged alternately: a convolutional layer is followed by a pooling layer, the pooling layer is followed by another convolutional layer, and so on. Each neuron of an output feature map in a convolutional layer is locally connected to its input, and its input value is obtained as the weighted sum of the local input and the corresponding connection weights plus a bias; this process is equivalent to a convolution, which is how the convolutional neural network got its name [19].

2.3.1 卷积层
2.3.1 Convolutional Layers

卷积层(convolutional layer)由多个特征面(Feature Map)组成,每个特征面由多个神经元组成,它的每一个神经元通过卷积核与上一层特征面的局部区域相连。卷积核是一个权值矩阵(如对于二维而言可为 3×3 或 5×5 矩阵)[19,31]。CNN 的卷积层通过卷积操作提取输入的不同特征,第一层卷积层提取低级特征如边缘、线条、角落,更高层的卷积层提取更高级的特征①。为了能够更好地理解 CNN,下面以一维 CNN(1D CNN)为例,2D 和 3D CNN 可依此进行拓展。图 3 所示为 1D CNN 的卷积层和取样层结构示意图,最顶层为取样层,中间层为卷积层,最底层为卷积层的输入层。

图 3.卷积层与取样层结构示意图

由图 3 可看出卷积层的神经元被组织到各个特 征面中,每个神经元通过一组权值被连接到上一层特征面的局部区域,即卷积层中的神经元与其输入层中的特征面进行局部连接[11]。然后将该局部加权和传递给一个非线性函数如 ReLU 函数即可获得卷积层中每个神经元的输出值。在同一个输入特征面和同一个输出特征面中,CNN 的权值共享,如图 3 所示,权值共享发生在同一种颜色当中,不同颜色权值不共享。通过权值共享可以减小模型复杂度,使得网络更易于训练。以图 3 中卷积层的输出特征面 1 和其输入层的输入特征面 1 为例,其中 表示输入特征面 m 第 i 个神经元与输出特征面 n 第 j 个神经元的连接权值。此外卷积核的滑动步长即卷积核每一次平移的距离也是卷积层中一个重要的参数。在图 3 中,设置卷积核在上一层的滑动步长为 1,卷积核大小为 1*3。 CNN 中每一个卷积层的每个输出特征面的大小(即神经元的个数)oMapN 满足如下关系[32]:
oMapN = (iMapN - CWindow) / CInterval + 1    (6)
其中,iMapN 表示每一个输入特征面的大小,CWindow 为卷积核的大小,CInterval 表示卷积核在其上一层的滑动步长。通常情况下,要保证(6)式能够整除,否则需对 CNN 网络结构作额外处理。每个卷积层可训练参数数目 CParams 满足下式[32]:

oMap 为每个卷积层输出特征面的个数,iMap 为输入特征面个数。1 表示偏置,在同一个输出特征面中偏置也共享。假设卷积层中输出特征面 n 第 k 个神经元的输出值为 ,而 表示其输入特征面 m 第 h 个神经元的输出值,以图 3 为例,则[32]:

上式中,Bn 为输出特征面 n 的偏置值,f_cov(.)为非线性激励函数。在传统的 CNN 中,激励函数一般使用饱和非线性函数(saturating nonlinearity)如 sigmoid 函数、tanh 函数等。相比较于饱和非线性函数,不饱和非线性函数(non-saturating nonlinearity)能够解决梯度爆炸/梯度消失问题,同时其也能够加快收敛速度[33]。Jarrett 等[34]探讨了卷积网络中不同的纠正非线性函数(rectified nonlinearity,包括 max(0,x)非线性函数),通过实验发现它们能够显著提升卷积网络的性能,文献[25]也验证了这一结论。因此在目前的 CNN 结构中常用不饱和非线性函数作为卷积层的激励函数,如 ReLU 函数。ReLU 函数的计算公式如下所示[24-25]:

图 4 中红色的为 ReLU 曲线,蓝色为 tanh 曲线。对于 ReLU 而言,如果输入大于 0,则输出与输入相等,否则输出为 0。从图 4 可以看出,使用 ReLU 函数,输出不会随着输入的逐渐增加而趋于饱和。

Chen 在其报告中分析了影响 CNN 性能的 3 个因素:层数、特征面的数目及网络组织①。该报告使用 9 种结构的 CNN 进行中文手写体识别实验,通过统计测试结果得到具有较小卷积核的 CNN 结构的一些结论:1)增加网络的深度能够提升准确率;2)增加特征面的数目也可以提升准确率;3)增加一个卷积层比增加一个全连接层能获得一个更高的准确率。文献[35]指出深度网络结构具有两个优点:1)可以促进特征的重复利用;2)能够获取高层表达中更抽象的特征,由于更抽象的概念可根据抽象性更弱的概念来构造,因此深度结构能够获取更抽象的表达,例如在 CNN 中通过池化操作来建立这种抽象,更抽象的概念通常对输入的大部分局部变化具有不变性。He 等人[36]探讨了在限定计算复杂度和时间上如何平衡 CNN 网络结构中深度、特征面数目、卷积核大小等因素的问题。文献[36]首先研究了深度(Depth)与卷积核大小间的关系,采用较小的卷积核替代较大的卷积核,同时增加网络深度来增加复杂度,通过实验结果表明网络深度比卷积核大小更重要;当时间复杂度大致相同时,具有更小卷积核且深度更深的 CNN 结构比具有更大卷积核同时深度更浅的 CNN 结构能够获得更好的实验结果。其次,该文献也研究了网络深度和特征面数目间的关系,CNN 网络结构设置为:在增加网络深度时适当减少特征面的数目,同时卷积核的大小保持不变,实验结果表明,深度越深,网络的性能越好;然而随着深度的增加,网络性能也达到饱和。此外,该文献还通过固定网络深度研究了特征面数目和卷积核大小间的关系,通过实验对比,发现特征面数目和卷积核大小的优先级差不多,其发挥的作用均没有网络深度大。由于过度地减小特征面的数目或者卷积核大小会损害网络的准确性,同时过度地增加网络深度也会降低网络准确性,因此如果网络深度很深,那么准确性会达到饱和甚至下降。

在 CNN 结构中,深度越深、特征面数目越多,则网络能够表示的特征空间也就越大、网络学习能力也越强,然而也会使网络的计算更复杂,极易出现过拟合的现象。因而,在实际应用中应当适当选取网络深度、特征面数目、卷积核的大小及卷积时滑动的步长,以使训练能够获得一个好的模型同时还能减少训练时间。

A convolutional layer is composed of several feature maps, each of which consists of many neurons; every neuron is connected through a convolution kernel to a local region of a feature map in the previous layer. The convolution kernel is a weight matrix (for example a 3×3 or 5×5 matrix in the two-dimensional case) [19,31]. The convolutional layers of a CNN extract different features of the input by convolution: the first convolutional layer extracts low-level features such as edges, lines and corners, while higher convolutional layers extract higher-level features. To make CNNs easier to understand, a one-dimensional CNN (1D CNN) is taken as an example below; 2D and 3D CNNs can be extended accordingly. Fig. 3 shows the structure of a convolutional layer and a pooling layer of a 1D CNN: the top layer is the pooling layer, the middle layer is the convolutional layer, and the bottom layer is the input of the convolutional layer.

Fig.3 Schematic illustration of convolution layer and sampling layer structure

As can be seen from Fig. 3, the neurons of the convolutional layer are organized into feature maps, and each neuron is connected through a set of weights to a local region of a feature map in the previous layer; that is, the neurons of the convolutional layer are locally connected to the feature maps of their input layer [11]. This local weighted sum is then passed through a nonlinear function such as the ReLU function to obtain the output value of each neuron in the convolutional layer. Within the same input feature map and the same output feature map, the weights of the CNN are shared: as shown in Fig. 3, weight sharing occurs within the same colour, while weights of different colours are not shared. Weight sharing reduces the model complexity and makes the network easier to train. Take output feature map 1 of the convolutional layer and input feature map 1 of its input layer in Fig. 3 as an example, where the weight denotes the connection between the i-th neuron of input feature map m and the j-th neuron of output feature map n. In addition, the sliding stride of the convolution kernel, i.e. the distance the kernel moves at each step, is another important parameter of the convolutional layer. In Fig. 3 the stride of the kernel over the previous layer is set to 1 and the kernel size is 1×3. The size (i.e. the number of neurons) oMapN of each output feature map of every convolutional layer in a CNN satisfies the following relation [32]:
oMapN = (iMapN - CWindow) / CInterval + 1    (6)
Here iMapN denotes the size of each input feature map, CWindow is the size of the convolution kernel, and CInterval is the sliding stride of the kernel over the previous layer. Normally Eq. (6) should divide exactly; otherwise additional processing of the CNN structure is required. The number of trainable parameters CParams of each convolutional layer satisfies the following formula [32]:

Here oMap is the number of output feature maps of the convolutional layer and iMap is the number of input feature maps; the 1 denotes the bias, which is also shared within the same output feature map. Suppose the output value of the k-th neuron of output feature map n in the convolutional layer is given, and that the output of the h-th neuron of input feature map m is also given; then, taking Fig. 3 as an example [32]:

In the above formula, Bn is the bias of output feature map n and f_cov(.) is the nonlinear activation function. In traditional CNNs the activation function is usually a saturating nonlinearity such as the sigmoid or tanh function. Compared with saturating nonlinearities, non-saturating nonlinearities can alleviate the exploding/vanishing gradient problem and also speed up convergence [33]. Jarrett et al. [34] investigated different rectified nonlinearities (including the max(0, x) nonlinearity) in convolutional networks and found experimentally that they significantly improve performance, a conclusion also verified in [25]. Therefore, non-saturating nonlinearities such as the ReLU function are now commonly used as the activation function of convolutional layers. The ReLU function is computed as follows [24-25]:

In Fig. 4 the red curve is ReLU and the blue curve is tanh. For ReLU, if the input is greater than 0 the output equals the input; otherwise the output is 0. As can be seen from Fig. 4, with the ReLU function the output does not saturate as the input gradually increases.

Fig. 4 ReLU and tanh function curves
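The local connections, weight sharing, stride, the output size of Eq. (6), the parameter-count formula and the ReLU activation described above can be illustrated with a minimal 1D convolutional layer in NumPy (an illustrative sketch only; names such as conv1d_layer are ours, not from the paper):

```python
import numpy as np

def conv1d_layer(x_maps, kernels, biases, stride=1):
    """1D convolutional layer: iMap input feature maps -> oMap output feature maps.

    x_maps : (iMap, iMapN) input feature maps
    kernels: (oMap, iMap, CWindow) shared weights (one kernel slice per input map)
    biases : (oMap,) one shared bias per output feature map
    """
    iMap, iMapN = x_maps.shape
    oMap, _, CWindow = kernels.shape
    oMapN = (iMapN - CWindow) // stride + 1        # output size, as in Eq. (6)
    out = np.zeros((oMap, oMapN))
    for n in range(oMap):                          # each output feature map
        for j in range(oMapN):                     # each neuron: local connection + shared weights
            window = x_maps[:, j * stride: j * stride + CWindow]
            out[n, j] = np.sum(kernels[n] * window) + biases[n]
    return np.maximum(out, 0.0)                    # ReLU activation

iMap, oMap, iMapN, CWindow = 1, 2, 9, 3
x = np.random.randn(iMap, iMapN)
k = np.random.randn(oMap, iMap, CWindow) * 0.1
b = np.zeros(oMap)
y = conv1d_layer(x, k, b, stride=1)
print(y.shape)                                     # (2, 7): oMapN = (9 - 3)/1 + 1 = 7
print(oMap * (iMap * CWindow + 1))                 # trainable params: CParams = oMap*(iMap*CWindow + 1)
```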

In a report, Chen analyzed three factors that affect CNN performance: the number of layers, the number of feature maps, and the network organization. The report used CNNs with nine different structures for Chinese handwriting recognition experiments, and from the statistical test results several conclusions were drawn for CNN structures with small convolution kernels: 1) increasing the depth of the network improves accuracy; 2) increasing the number of feature maps also improves accuracy; 3) adding a convolutional layer gives a higher accuracy than adding a fully connected layer. Reference [35] points out that deep network structures have two advantages: 1) they promote the reuse of features; 2) they can obtain more abstract features in high-level representations. Since more abstract concepts can be constructed from less abstract ones, deep structures can obtain more abstract representations, for example through the pooling operation in CNNs; more abstract concepts are usually invariant to most local changes of the input. He et al. [36] discussed how to balance the depth, the number of feature maps, the kernel size and other factors of a CNN under constrained computational complexity and time. Reference [36] first studied the relation between depth and kernel size by replacing larger kernels with smaller ones while increasing the depth to keep the complexity; the results show that network depth is more important than kernel size, and that with roughly the same time complexity, a deeper CNN with smaller kernels obtains better results than a shallower CNN with larger kernels. Secondly, the relation between depth and the number of feature maps was studied by appropriately reducing the number of feature maps while increasing the depth, with the kernel size unchanged; the results show that the deeper the network, the better the performance, although the performance saturates as the depth keeps increasing. In addition, with the depth fixed, the relation between the number of feature maps and the kernel size was studied; the experiments show that the two have roughly the same priority, and neither plays as large a role as network depth. Since excessively reducing the number of feature maps or the kernel size harms accuracy, while excessively increasing the depth also lowers accuracy, the accuracy saturates or even declines when the network is very deep.

In a CNN structure, the deeper the network and the more feature maps it has, the larger the feature space it can represent and the stronger its learning ability; however, the computation also becomes more complex and overfitting occurs easily. Therefore, in practical applications the network depth, the number of feature maps, the kernel size and the convolution stride should be chosen appropriately, so that training can yield a good model while the training time is also reduced.

2.3.2 取样层
2.3.2 Sampling Layer

取样层(pooling layer,也称为池化层)紧跟在卷积层之后,同样由多个特征面组成,它的每一个特征面唯一对应于其上一层的一个特征面,不会改变特征面的个数。如图 3,卷积层是取样层的输入层,卷积层的一个特征面与取样层中的一个特征面唯一对应,且取样层的神经元也与其输入层的局部接受域相连,不同神经元局部接受域不重叠。取样层旨在通过降低特征面的分辨率来获得具有空间不变性的特征[37]。取样层起到二次提取特征的作用,它的每个神经元对局部接受域进行池化操作。常用的池化方法有最大池化(max-pooling)即取局部接受域中值最大的点、均值池化(mean pooling)即对局部接受域中的所有值求均值、随机池化(stochastic pooling)[38-39]。文献[40]给出了关于最大池化和均值池化详细的理论分析,通过分析得出以下一些预测:1)最大池化特别适用于分离非常稀疏的特征;2)使用局部区域内所有的采样点去执行池化操作也许不是最优的,例如均值池化就利用了局部接受域内的所有采样点。文献[41]比较了最大池化和均值池化两种方法,通过实验发现:当分类层采用线性分类器如线性 SVM 时,最大池化方法比均值池化能够获得一个更好的分类性能。随机池化方法是对局部接受域采样点按照其值大小赋予概率值,再根据概率值大小随机选择,该池化方法确保了特征面中不是最大激励的神经元也能够被利用到[37]。随机池化具有最大池化的优点,同时由于随机性它能够避免过拟合。此外,还有混合池化(mixed pooling)、空间金字塔池化(spatial pyramid pooling)、频谱池化(spectral pooling)等池化方法[37]。在通常所采用的池化方法中,取样层的同一个特征面不同神经元与上一层的局部接受域不重叠,然而也可以采用重叠池化(overlapping pooling)的方法。所谓重叠池化方法就是相邻的池化窗口间有重叠区域。文献[24]采用重叠池化框架使 top-1 和 top-5 的错误率分别降低了 0.4%和 0.3%,与无重叠池化框架相比,其泛化能力更强,更不易产生过拟合。设取样层中第 n 个输出特征面第 l 个神经元的输出值为 t_nl,同样以图 3 为例,则有:

t_nq表示取样层的第n个输入特征面第q个神经元的输出值,f_sub(.) 可为取最大值函数、取均值函数等。

取样层在上一层滑动的窗口也称为取样核。事实上,CNN 中的卷积核与取样核相当于 Hubel-Wiesel 模型[28]中感受野在工程上的实现,卷积层用来模拟 Hubel-Wiesel 理论的简单细胞,取样层模拟该理论的复杂细胞。CNN 中每个取样层的每一个输出特征面的大小(神经元个数)DoMapN 为[33]:

其中,取样核的大小为 DWindow,在图 3 中 DWindow=2。取样层通过减少卷积层间的连接数量,即通过池化操作神经元数量减少,降低了网络模型的计算量。

The pooling layer (also called the sampling layer) follows the convolutional layer and is likewise composed of several feature maps; each of its feature maps corresponds uniquely to one feature map of the previous layer, so the number of feature maps is not changed. In Fig. 3 the convolutional layer is the input of the pooling layer, one feature map of the convolutional layer corresponds uniquely to one feature map of the pooling layer, and the neurons of the pooling layer are also connected to local receptive fields of their input layer, with the receptive fields of different neurons not overlapping. The pooling layer aims to obtain features with spatial invariance by reducing the resolution of the feature maps [37]. It acts as a second stage of feature extraction, each of its neurons pooling over a local receptive field. Commonly used pooling methods include max-pooling, which takes the largest value in the local receptive field, mean pooling, which averages all values in the local receptive field, and stochastic pooling [38-39]. Reference [40] gives a detailed theoretical analysis of max and mean pooling and draws the following predictions: 1) max-pooling is particularly well suited to separating very sparse features; 2) using all samples of a local region to perform pooling may not be optimal; mean pooling, for example, uses all samples of the local receptive field. Reference [41] compares max-pooling and mean pooling and finds experimentally that when the classification layer uses a linear classifier such as a linear SVM, max-pooling achieves better classification performance than mean pooling. Stochastic pooling assigns probabilities to the samples of the local receptive field according to their values and then selects randomly according to these probabilities; it ensures that neurons of the feature map that are not the maximum activation can also be used [37]. Stochastic pooling has the advantages of max-pooling and, because of its randomness, can also help avoid overfitting. In addition there are other pooling methods such as mixed pooling, spatial pyramid pooling and spectral pooling [37]. In the commonly used pooling methods, the local receptive fields of different neurons of the same feature map of the pooling layer do not overlap, but overlapping pooling can also be used, in which adjacent pooling windows have overlapping regions. Reference [24] used an overlapping pooling framework and reduced the top-1 and top-5 error rates by 0.4% and 0.3% respectively; compared with non-overlapping pooling it generalizes better and is less prone to overfitting. Let the output of the l-th neuron of the n-th output feature map of the pooling layer be t_nl; again taking Fig. 3 as an example, we have:

Here t_nq denotes the output of the q-th neuron of the n-th input feature map of the pooling layer, and f_sub(.) can be the max function, the mean function, etc.

The window that the pooling layer slides over the previous layer is also called the pooling (sampling) kernel. In fact, the convolution kernels and pooling kernels in a CNN are an engineering implementation of the receptive fields of the Hubel-Wiesel model [28]: the convolutional layer simulates the simple cells of the Hubel-Wiesel theory and the pooling layer simulates its complex cells. The size (number of neurons) DoMapN of each output feature map of every pooling layer in a CNN is [33]:

Here the size of the pooling kernel is DWindow; in Fig. 3 DWindow = 2. By reducing the number of connections between convolutional layers, i.e. reducing the number of neurons through the pooling operation, the pooling layer lowers the computational cost of the network model.
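A minimal sketch of non-overlapping max and mean pooling over one feature map, including the resulting output size, may help make the above concrete (illustrative code only, not from the paper):

```python
import numpy as np

def pool1d(feature_map, DWindow=2, mode="max"):
    """Non-overlapping pooling over one feature map of the previous (convolutional) layer."""
    N = len(feature_map)
    DoMapN = N // DWindow                  # output size after pooling with kernel DWindow
    windows = feature_map[:DoMapN * DWindow].reshape(DoMapN, DWindow)
    if mode == "max":                      # max-pooling: largest value in each receptive field
        return windows.max(axis=1)
    if mode == "mean":                     # mean pooling: average over the receptive field
        return windows.mean(axis=1)
    raise ValueError("unknown pooling mode")

c_out = np.array([0.3, 1.2, -0.5, 0.8, 2.0, 0.1])
print(pool1d(c_out, 2, "max"))    # [1.2 0.8 2. ]
print(pool1d(c_out, 2, "mean"))   # [0.75 0.15 1.05]
```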

2.3.3 全连接层
2.3.3 Fully Connected Layer

在 CNN 结构中,经多个卷积层和取样层后,连接着 1 个或 1 个以上的全连接层。与 MLP 类似,全连接层中的每个神经元与其前一层的所有神经元进行全连接。全连接层可以整合卷积层或者取样层中具有类别区分性的局部信息[42]。为了提升 CNN网络性能,全连接层每个神经元的激励函数一般采用 ReLU 函数[43]。最后一层全连接层的输出值被传递给一个输出层,可以采用 softmax 逻辑回归 (softmax regression)进行分类,该层也可称为 softmax 层(softmax layer)。对于一个具体的分类任务,选择一个合适的损失函数是十分重要的,文献[37]介绍了 CNN 几种常用的损失函数并分析了它们各自的特点。通常,CNN 的全连接层与 MLP 结构一样,CNN 的训练算法也多采用 BP 算法。

当一个大的前馈神经网络训练一个小的数据集时,由于它的高容量,它在留存测试数据(held-out test data,也可称为校验集)上通常表现不佳[30]。为了避免训练过拟合,常在全连接层中采用正则化方法—dropout 技术,即使得隐层神经元的输出值以 0.5 的概率变为 0,通过该技术部分隐层节点失效,这些节点不参加 CNN 的前向传播过程,也不会参加后向传播过程[24,30]。对于每次输入到网络中的样本,由于 dropout 技术的随机性,它对应的网络结构不相同,但是所有的这些结构共享权值[24]。由于一个神经元不能依赖于其它特定神经元而存在,所以这种技术降低了神经元间相互适应的复杂性,使神经元学习得到更鲁棒的特征[24]。目前,关于 CNN 的研究大都采用 ReLU+dropout 技术,并取得了很好的分类性能[24,44-45]。

In a CNN structure, one or more fully connected layers follow the multiple convolutional and pooling layers. As in an MLP, each neuron of a fully connected layer is fully connected to all neurons of the previous layer. The fully connected layer can integrate the class-discriminative local information of the convolutional or pooling layers [42]. To improve performance, the ReLU function is generally used as the activation function of each neuron in the fully connected layers [43]. The output of the last fully connected layer is passed to an output layer, where softmax regression can be used for classification; this layer is therefore also called the softmax layer. For a specific classification task it is very important to choose a suitable loss function; reference [37] introduces several commonly used loss functions of CNNs and analyzes their characteristics. Usually the fully connected layers of a CNN have the same structure as an MLP, and CNNs are mostly trained with the BP algorithm.

When a large feedforward neural network is trained on a small data set, it usually performs poorly on held-out test data (also called the validation set) because of its high capacity [30]. To avoid overfitting, a regularization method called dropout is often applied in the fully connected layers: the outputs of hidden neurons are set to 0 with probability 0.5, so that part of the hidden nodes are disabled; these nodes take part neither in the forward propagation of the CNN nor in the backward propagation [24,30]. For each sample fed into the network, the corresponding network structure is different because of the randomness of dropout, but all these structures share the same weights [24]. Since a neuron cannot depend on the presence of other particular neurons, this technique reduces the complex co-adaptation of neurons and makes them learn more robust features [24]. At present most research on CNNs adopts the ReLU+dropout technique and has achieved good classification performance [24,44-45].
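The following NumPy sketch illustrates a fully connected layer with ReLU followed by dropout with probability 0.5. It uses the common "inverted dropout" formulation, which rescales the surviving activations during training; this is a widespread implementation choice and an assumption here, not the exact procedure of [24]:

```python
import numpy as np

def fully_connected(x, W, b, p_drop=0.5, training=True):
    """Fully connected layer with ReLU activation and dropout on its outputs (training only)."""
    h = np.maximum(W @ x + b, 0.0)                 # ReLU activation
    if training and p_drop > 0:
        mask = (np.random.rand(h.size) >= p_drop)  # each unit is dropped with probability p_drop
        h = h * mask / (1.0 - p_drop)              # inverted dropout keeps the expected activation
    return h

x = np.random.randn(128)
W, b = np.random.randn(64, 128) * 0.05, np.zeros(64)
print(fully_connected(x, W, b).shape)              # (64,)
```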

2.3.4 特征面
2.3.4 Feature Maps

特征面数目作为 CNN 的一个重要参数,它通常是根据实际应用进行设置的,如果特征面个数过少,可能会使一些有利于网络学习的特征被忽略掉,从而不利于网络的学习;但是如果特征面个数过多,可训练参数个数及网络训练时间也会增加,这同样不利于学习网络模型。文献[46]提出了一种理论方法用于确定最佳的特征面数目,然而该方法仅对极小的接受域有效,它不能够推广到任意大小的接受域。该文献通过实验发现:与每层特征面数目均相同的 CNN 结构相比,金字塔架构(该网络结构的特征面数目按倍数增加)更能有效利用计算资源。目前,对于 CNN 网络特征面数目的设定通常采用的是人工设置方法,然后进行实验并观察所得训练模型的分类性能,最终根据网络训练时间和分类性能来选取特征面数目。

The number of feature maps is an important parameter of a CNN and is usually set according to the actual application. If it is too small, some features beneficial to network learning may be ignored, which harms learning; if it is too large, the number of trainable parameters and the training time increase, which is likewise unfavourable for learning the network model. Reference [46] proposes a theoretical method for determining the optimal number of feature maps, but it is only valid for very small receptive fields and cannot be generalized to receptive fields of arbitrary size. That work also found experimentally that, compared with a CNN whose layers all have the same number of feature maps, a pyramid architecture (in which the number of feature maps increases by a multiple from layer to layer) uses computing resources more efficiently. At present the number of feature maps of a CNN is usually set manually; experiments are then run, the classification performance of the trained models is observed, and the number of feature maps is finally chosen according to the training time and the classification performance.

2.3.5 CNN 结构的进一步说明
2.3.5 Further Description of the CNN Structure

CNN 的实现过程实际上已经包含了特征提取过程,以图 5、图 6 为例直观地显示 CNN 提取的特征。文献[47]采用 CNN 进行指纹方向场评估,图 5 为其模型结构。图 5 共有 3 个卷积层(C1,C3,C5)、2 个取样层(M2,M4)、1 个全连接层(F6)和 1 个输出层(O7)。卷积层通过卷积操作提取其前一层的各种不同的局部特征,由图 5 可看出,C1 层提取输入图像的边缘、轮廓特征,可看成是边缘检测器。取样层的作用是在语义上把相似的特征合并起来,取样层通过池化操作使得特征对噪声和变形具有鲁棒性[11]。从图上可看出,各层所提取的特征以增强的方式从不同角度表现原始图像,并且随着层数的增加,其表现形式越来越抽象[48]。全连接层 F6 中的每个神经元与其前一层进行全连接,该层将前期所提取的各种局部特征综合起来,最后通过输出层得到每个类别的后验概率。从模式分类角度来说,满足 Fisher 判别准则的特征最有利于分类,通过正则化方法(dropout 方法),网络参数得到有效调整,从而使全连接层提取的特征尽量满足 Fisher 判别准则,最终有利于分类[48]。图 6 给出了 CNN 提取心电图(electrocardiogram,ECG)特征的过程,首先通过卷积单元 A1、B1、C1(其中每个卷积单元包括一个卷积层和一个取样层)提取特征,最后由全连接层汇总所有局部特征。由图中也可以看出,层数越高,特征的表现形式也越抽象,显然,这些特征并没有临床诊断的物理意义,仅仅是数理值[48]。

图 5 指纹经过 CNN 的中间层输出特征

图 6 ECG 经过 CNN 的中间层输出特征

The implementation of a CNN in fact already includes the feature extraction process; Fig. 5 and Fig. 6 are used as examples to show the extracted features intuitively. Reference [47] uses a CNN to estimate the fingerprint orientation field, and Fig. 5 shows its model structure, which has three convolutional layers (C1, C3, C5), two pooling layers (M2, M4), one fully connected layer (F6) and one output layer (O7). The convolutional layers extract various local features of their previous layer by convolution; as can be seen from Fig. 5, the C1 layer extracts edge and contour features of the input image and can be regarded as an edge detector. The role of the pooling layers is to merge semantically similar features; through the pooling operation they make the features robust to noise and deformation [11]. It can be seen from the figure that the features extracted by each layer represent the original image from different angles in an enhanced manner, and as the number of layers increases the representation becomes more and more abstract [48]. Each neuron of the fully connected layer F6 is fully connected to the previous layer; this layer combines the various local features extracted earlier, and finally the posterior probability of each class is obtained through the output layer. From the point of view of pattern classification, features satisfying the Fisher criterion are most favourable for classification; through the regularization method (dropout), the network parameters are effectively adjusted so that the features extracted by the fully connected layer satisfy the Fisher criterion as far as possible, which ultimately benefits classification [48]. Fig. 6 shows the process of a CNN extracting electrocardiogram (ECG) features: features are first extracted by the convolution units A1, B1 and C1 (each convolution unit contains a convolutional layer and a pooling layer), and finally all local features are aggregated by the fully connected layer. It can also be seen from the figure that the higher the layer, the more abstract the representation of the features; obviously, these features have no physical meaning for clinical diagnosis and are merely mathematical values [48].

2.3.6 与传统的模式识别算法相比
2.3.6 Compared With Traditional Pattern Recognition Algorithms

卷积神经网络的本质就是每一个卷积层包含一定数量的特征面或者卷积核[46]。与传统 MLP 相比,CNN 中卷积层的权值共享使网络中可训练的参数变少,降低了网络模型复杂度,减少过拟合,从而获得了一个更好的泛化能力[49]。同时,在 CNN 结构中使用池化操作使模型中的神经元个数大大减少,对输入空间的平移不变性也更具有鲁棒性[49]。而且 CNN 结构的可拓展性很强,它可以采用很深的层数。深度模型具有更强的表达能力,它能够处理更复杂的分类问题。总的来说,CNN 的局部连接、权值共享和池化操作使其比传统 MLP 具有更少的连接和参数,从而更易于训练。

The essence of the convolutional neural network is that each convolutional layer contains a certain number of feature maps or convolution kernels [46]. Compared with a traditional MLP, the weight sharing of the convolutional layers reduces the number of trainable parameters, lowers the complexity of the network model and reduces overfitting, thus yielding better generalization ability [49]. Meanwhile, the pooling operation in the CNN structure greatly reduces the number of neurons in the model and makes it more robust to translations of the input space [49]. Moreover, the CNN structure is highly extensible and can use very deep architectures. Deep models have stronger expressive power and can handle more complex classification problems. In general, the local connections, weight sharing and pooling of a CNN give it fewer connections and parameters than a traditional MLP, so it is easier to train.

3 CNN 的一些改进算法

3 Some Improved CNN Algorithms

3.1 NIN 结构

3.1 NIN Structure

CNN 中的卷积滤波器是一种广义线性模型(Generalized Linear Model,GLM),GLM 的抽象水平比较低,但通过抽象却可以得到对同一概念的不同变体保持不变的特征[50]。Lin 等人[50]提出了一种 Network In Network(NIN)网络模型,该模型使用微型神经网络(micro neural network)代替传统 CNN 的卷积过程,同时还采用全局平均取样层来替换传统 CNN 的全连接层,它可以增强神经网络的表示能力。微神经网络主要是采用 MLP 模型,如下图 7 所示。

图 7 线性卷积层与MLP 卷积层对比

(b)图是 NIN 结构的非线性卷积层,是用 MLP 来取代原来的 GLM。NIN 通过在输入中滑动微型神经网络得到卷积层的特征面。与卷积的权值共享类似,MLP 对同一个特征面的所有局部感受野也共享,即对于同一个特征面 MLP 相同。文献[50]之所以选择MLP,是考虑到 MLP 采用 BP 算法进行训练,能与CNN 结构融合,同时 MLP 也是一种深度模型,具有特征重用的思想。MLP 卷积层能够处理更复杂的非线性问题,提取更加抽象的特征。在传统的 CNN结构中全连接层的参数过多,易于过拟合,因此它严重依赖于 dropout 正则化技术。NIN 结构采用全局平均池化代替原来的全连接层,使模型的参数大大减少。它通过全局平均池化方法对最后一个 MLP卷积层的每个特征面求取均值,再将这些数值连接成向量,最后输入到 softmax 分类层中。全局平均池化可看成是一个结构性的正则化算子(structural regularizer),它可以增强特征面与类别的一致性。 在全局平均取样层中没有需要优化的参数,因此能够避免过拟合。此外,全局平均取样层对空间信息进行求和,因此对输入的空间变化具有更强的鲁棒性。Lin等人[50]将该算法应用于MNIST及SVHN 等数据集中,验证了该算法的有效性。Xu 等人[51]结合 NIN 结构提出了 ML-DNN 模型,使用与文献[50] 相同的数据库,与稀疏编码等方法比较,表明了该模型的优越性。

The convolution filter in a CNN is a generalized linear model (GLM). The level of abstraction of the GLM is rather low, although abstraction can yield features that remain invariant to different variants of the same concept [50]. Lin et al. [50] proposed the Network In Network (NIN) model, which uses a micro neural network to replace the convolution process of a traditional CNN and uses a global average pooling layer to replace the fully connected layers of a traditional CNN; this can enhance the representational ability of the neural network. The micro network mainly adopts the MLP model, as shown in Fig. 7 below.

Fig.7 Comparison of Linear Convolution Layer and MLP Convolution Layer

Panel (b) shows the nonlinear convolutional layer of the NIN structure, in which an MLP replaces the original GLM. NIN obtains the feature maps of the convolutional layer by sliding the micro network over the input. Similar to the weight sharing of convolution, the MLP is shared by all local receptive fields of the same feature map, i.e. the same MLP is used for one feature map. Reference [50] chose the MLP because it can be trained with the BP algorithm and can therefore be integrated with the CNN structure, and because the MLP is itself a deep model that embodies the idea of feature reuse. The MLP convolutional layer can handle more complex nonlinear problems and extract more abstract features. In a traditional CNN the fully connected layers have too many parameters and overfit easily, so they rely heavily on dropout regularization. The NIN structure uses global average pooling instead of the fully connected layers, which greatly reduces the number of model parameters: global average pooling computes the mean of every feature map of the last MLP convolutional layer, concatenates these values into a vector, and feeds the vector into the softmax classification layer. Global average pooling can be regarded as a structural regularizer that enforces consistency between feature maps and categories. There are no parameters to optimize in the global average pooling layer, so overfitting can be avoided; in addition, it sums up the spatial information and is therefore more robust to spatial changes of the input. Lin et al. [50] applied this method to data sets such as MNIST and SVHN and verified its effectiveness. Combining the NIN structure, Xu et al. [51] proposed the ML-DNN model; using the same databases as [50] and comparing with sparse coding and other methods, they showed the superiority of the model.
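A minimal sketch of global average pooling feeding a softmax layer, as NIN uses in place of fully connected layers (illustrative code; the assumption that the number of feature maps equals the number of classes follows the NIN design):

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each feature map of the last MLP-conv layer to a single value;
    the resulting vector goes directly to the softmax layer (no FC weights to learn)."""
    # feature_maps: (num_maps, H, W) -> (num_maps,)
    return feature_maps.mean(axis=(1, 2))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

maps = np.random.randn(10, 6, 6)        # one feature map per class
print(softmax(global_average_pooling(maps)))
```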

3.2 空间变换网络

3.2 Spatial Transformer Networks

尽管 CNN 已经是一个能力强大的分类模型,但是它仍然会受到数据在空间上多样性的影响。 Jaderberg 等人[52]采用一种新的可学习模块—空间变换网络(Spatial Transformer Networks,STNs)来解决此问题,该模块由三个部分组成:本地化网络 (localisation network)、网格生成器(grid generator)及采样器(sampler)。STNs 可用于输入层,也可插入到卷积层或者其它层的后面,不需要改变原 CNN模型的内部结构。STNs 能够自适应地对数据进行空间变换和对齐,使得 CNN 模型对平移、缩放、旋转或者其它变换等保持不变性。此外,STNs 的计算速度很快,几乎不会影响原有 CNN 模型的训练速度。

Although the CNN is already a powerful classification model, it is still affected by the spatial diversity of the data. Jaderberg et al. [52] addressed this problem with a new learnable module, the spatial transformer network (STN), which consists of three parts: a localisation network, a grid generator and a sampler. STNs can be used at the input layer or inserted after convolutional or other layers, without changing the internal structure of the original CNN model. STNs can adaptively transform and align the data spatially, making the CNN model invariant to translation, scaling, rotation and other transformations. Moreover, STNs are computationally fast and hardly affect the training speed of the original CNN model.

3.3 反卷积

3.3 Deconvolution

由 Zeiler[53]等人提出的反卷积网络(Deconvolutional Networks)模型与 CNN 的思想类似,只是在运算上有所不同。CNN 是一种自底而上的方法,其输入信号经过多层的卷积、非线性变换和下采样处理。而反卷积网络中的每层信息是自顶而下的,它对由已学习的滤波器组与特征面进行卷积后得到的特征求和就能重构输入信号。随后,Zeiler 采用反卷积网络可视化 CNN 中各网络层学习得到的特征,以利于分析并改进 CNN 网络结构[54]。反卷积网络也可看成是一个卷积模型,它同样需要进行卷积和池化过程,不同之处在于与 CNN 是一个逆过程。文献[54]模型中的每一个卷积层都加上一个反卷积层。在卷积、非线性函数(ReLU)、池化(max-pooling)后,不仅将输出的特征作为下一层的输入,也将它送给对应的反卷积层。反卷积层需要依次进行 unpooling(采用一种近似的方法求 max-pooling 的逆过程)、矫正(使用非线性函数来保证所有输出均为非负数)及反卷积操作(利用卷积过程中卷积核的转置作为核,与矫正后的特征作卷积运算),然后形成重构特征。通过反卷积技术可视化 CNN 各网络层学习到的特征,Zeiler 还得出以下结论:CNN 学习到的特征对于平移和缩放具有不变性,但是对于旋转操作一般不具有该特性,除非被识别对象具有很强的对称性[54]。Zhao[55]等人提出了一个新的称为 SWWAE 的结构,SWWAE 模型由卷积结构及反卷积结构组成,采用卷积结构对输入进行编码,而反卷积结构用来进行重构。SWWAE 的每一个阶段是一个“内容—位置”(what-where)自动编码机,编码机由一个卷积层及紧随其后的一个 max-pooling 层组成,通过 max-pooling 层产生两个变量集:max-pooling 的输出记为 what 变量,它作为下一层的输入;将 max-pooling 的位置信息记为 where 变量,where 变量要横向传递到反卷积结构中。SWWAE 的损失函数包含三个部分(判别损失、重构损失及中间重构损失)。SWWAE 在各种半监督和有监督任务中取得了很高的准确率,它特别适用于具有大量无标注类别而有标注类别相对少的数据集的情况,该模型也可能适用于与视频相关的任务[55]。

The deconvolutional network model proposed by Zeiler et al. [53] is similar in spirit to the CNN but differs in its operation. The CNN is a bottom-up approach in which the input signal passes through multiple layers of convolution, nonlinear transformation and downsampling. In a deconvolutional network the information of each layer flows top-down: the input signal is reconstructed by summing the features obtained by convolving the learned filter bank with the feature maps. Later, Zeiler used deconvolutional networks to visualize the features learned by each layer of a CNN in order to analyze and improve the CNN structure [54]. The deconvolutional network can also be viewed as a convolutional model that likewise performs convolution and pooling, the difference being that it is the inverse process of the CNN. In the model of [54], every convolutional layer is paired with a deconvolutional layer. After convolution, the nonlinear function (ReLU) and max-pooling, the output features are not only used as the input of the next layer but are also fed to the corresponding deconvolutional layer. The deconvolutional layer performs, in turn, unpooling (an approximate inverse of max-pooling), rectification (a nonlinear function that keeps all outputs non-negative) and deconvolution (the transpose of the convolution kernels used in the convolution process is used as the kernel and convolved with the rectified features), and the reconstructed features are formed. By visualizing the features learned by each layer through deconvolution, Zeiler also concluded that the features learned by a CNN are invariant to translation and scaling but generally not to rotation, unless the recognized object has strong symmetry [54]. Zhao et al. [55] proposed a new structure called SWWAE, which consists of a convolutional structure that encodes the input and a deconvolutional structure used for reconstruction. Each stage of SWWAE is a "what-where" autoencoder: the encoder consists of a convolutional layer followed by a max-pooling layer, and the max-pooling layer produces two sets of variables: the output of max-pooling, recorded as the "what" variables, serves as the input of the next layer, while the positions of the maxima, recorded as the "where" variables, are passed laterally to the deconvolutional structure. The loss function of SWWAE contains three terms (a discriminative loss, a reconstruction loss and an intermediate reconstruction loss). SWWAE achieves high accuracy in various semi-supervised and supervised tasks; it is particularly suitable for data sets with a large number of unlabelled samples and relatively few labelled ones, and the model may also be applicable to video-related tasks [55].
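The "what-where" idea, i.e. max-pooling that also records the positions of the maxima so that an approximate unpooling can put the values back in place, can be sketched in a few lines of NumPy (illustrative only; this is not the authors' implementation):

```python
import numpy as np

def max_pool_with_switches(x, win=2):
    """1D max-pooling that also records the argmax positions (the 'where' variables)."""
    n = len(x) // win
    blocks = x[:n * win].reshape(n, win)
    switches = blocks.argmax(axis=1)                  # position of the max in each window
    return blocks.max(axis=1), switches               # 'what' values and 'where' positions

def unpool(pooled, switches, win=2):
    """Approximate inverse of max-pooling: put each value back at its recorded position."""
    out = np.zeros(len(pooled) * win)
    out[np.arange(len(pooled)) * win + switches] = pooled
    return out

x = np.array([0.2, 0.9, 0.4, 0.1, 0.7, 0.6])
p, s = max_pool_with_switches(x)
print(p, s)            # [0.9 0.4 0.7] [1 0 0]
print(unpool(p, s))    # [0.  0.9 0.4 0.  0.7 0. ]
```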

4 训练方法及开源工具

4 Training Methods and Open Source Tools

4.1 训练方法

4.1 Training Methods

虽然通常都认为如果没有无监督预训练,对深度神经网络进行有监督训练是非常困难的,但 CNN 却是一个特例,它可直接执行有监督学习训练[12]。CNN 通过 BP 算法进行有监督训练,也需经过前向传播和后向传播两个阶段[19]。CNN 开始训练之前,需要采用一些不同的小随机数对网络中所有的权值和偏置值进行随机初始化。使用“小随机数”以保证网络不会因为权过大而进入饱和状态,从而导致训练失败;“不同”用来保证网络可正常地学习训练,如果使用相同的数值初始化权矩阵,那么网络将无能力学习[56]。随机初始化的权值和偏置值的范围可为[-0.5,0.5]或者[-1,1](或者是其它合适的区间)[57]。在实际应用中,无标注的数据远多于有标注的数据,同时对数据进行人工标注也需要耗费较大的人力。但是为了使有监督 CNN 得到充分的训练并获得较好的泛化能力,又需要大量有标注的训练样本,这一定程度上制约了 CNN 在实际中的应用。这也是有监督学习的一个缺欠。

Although it is generally believed that supervised training of a deep neural network is very difficult without unsupervised pre-training, the CNN is an exception and can be trained directly by supervised learning [12]. A CNN is trained by the BP algorithm and also goes through the two stages of forward propagation and backward propagation [19]. Before training starts, all weights and biases of the network must be randomly initialized with different small random numbers. "Small" ensures that the network does not saturate because of excessively large weights, which would cause training to fail; "different" ensures that the network can learn normally, because if the weight matrices were initialized with identical values the network would be unable to learn [56]. The randomly initialized weights and biases can lie in the range [-0.5, 0.5] or [-1, 1] (or another suitable interval) [57]. In practical applications unlabelled data are far more plentiful than labelled data, and labelling data manually requires considerable effort. However, a large number of labelled training samples are needed for a supervised CNN to be sufficiently trained and to generalize well, which restricts the practical application of CNNs to a certain extent; this is a shortcoming of supervised learning.
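A minimal sketch of the random initialization described above, drawing small, different values from an interval such as [-0.5, 0.5] (illustrative code; the kernel shape is an assumption for the example):

```python
import numpy as np

def init_weights(shape, low=-0.5, high=0.5, seed=None):
    """Initialize weights or biases with small, mutually different random numbers in [low, high]."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=shape)

W_conv = init_weights((8, 1, 5))    # e.g. 8 kernels of size 1x5
b_conv = init_weights(8)
print(W_conv.min(), W_conv.max())   # all values stay inside the small interval
```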

事实上,CNN 也可以进行无监督训练。现存的一些无监督学习算法一般都需要调整很多超参数(hyperparameter),这使得它们难以被利用,对此 Ngiam[58]等人提出了一种只需调整一个超参数的无监督学习算法—稀疏滤波(sparse filtering)。稀疏滤波只优化一个简单的代价函数—L2 范数稀疏约束特征,从而得到好的特征表示。在稀疏滤波中,其特征分布矩阵具有如下特点:样本分布稀疏性(population sparsity)、高分散性(high dispersal)、存在稀疏(lifetime sparsity)。文中指出可将稀疏滤波用于深度网络模型中,先用稀疏滤波训练得到一个单层的归一化特征,然后将它们作为第二层的输入来训练第二层,依此类推。通过实验,发现使用稀疏滤波贪心算法逐层训练,可学习到一些很有意义的特征表示。Dong[59]等人将稀疏滤波应用于 CNN 的无监督学习,同时使用该 CNN 模型识别交通工具类型。在文献[59]中,采用稀疏滤波作为预训练,并将 CNN 学习到的高级全局特征和低级局部特征输入到 Softmax 层中进行分类。随后,Dong[60]等人又采用一种半监督学习 CNN 用于交通工具类型识别中。文中采用大量无标注的数据无监督训练卷积层的卷积核,该无监督算法为稀疏拉普拉斯滤波器,再用一定量的有标注数据有监督训练 CNN 输出层的参数,最后通过 BIT-Vehicle 数据库验证该 CNN 模型的可行性。如果数据集中只有少量的标注数据,同时还需要训练一个大的 CNN 网络,传统的做法是首先进行无监督预训练,然后再采用有监督学习(如 BP 算法)进行微调(fine-tuning)。

In fact, CNNs can also be trained without supervision. Existing unsupervised learning algorithms generally require many hyperparameters to be tuned, which makes them hard to use; for this reason Ngiam et al. [58] proposed sparse filtering, an unsupervised learning algorithm that needs only one hyperparameter to be tuned. Sparse filtering optimizes only a simple cost function, an L2-norm-normalized sparsity constraint on the features, and thereby obtains good feature representations. In sparse filtering, the feature distribution matrix has the following properties: population sparsity, high dispersal and lifetime sparsity. The paper points out that sparse filtering can be used in deep network models: a single layer of normalized features is first trained with sparse filtering, these features are then used as the input to train a second layer, and so on. Experiments show that greedy layer-wise training with sparse filtering learns some meaningful feature representations. Dong et al. [59] applied sparse filtering to the unsupervised learning of a CNN and used the CNN model to recognize vehicle types: sparse filtering was used for pre-training, and the high-level global features and low-level local features learned by the CNN were fed into a softmax layer for classification. Later, Dong et al. [60] used a semi-supervised CNN for vehicle type recognition, in which a large amount of unlabelled data was used to train the convolution kernels of the convolutional layers without supervision (the unsupervised algorithm is a sparse Laplacian filter), and a certain amount of labelled data was then used to train the parameters of the CNN output layer with supervision; the feasibility of the CNN model was finally verified on the BIT-Vehicle database. If a data set contains only a small amount of labelled data and a large CNN has to be trained, the traditional practice is to perform unsupervised pre-training first and then fine-tune with supervised learning (e.g. the BP algorithm).
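For reference, the sparse filtering cost can be sketched as follows, following the published description (soft-absolute features, row-wise and column-wise L2 normalization, then an L1 penalty); in practice W would be optimized with an off-the-shelf optimizer such as L-BFGS, and the array sizes below are illustrative assumptions:

```python
import numpy as np

def sparse_filtering_objective(W, X, eps=1e-8):
    """Sparse filtering cost: soft-absolute features, normalize each feature (row)
    across examples, normalize each example (column), then sum the absolute values."""
    F = np.sqrt((W @ X) ** 2 + eps)                              # features: (num_features, num_examples)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)     # normalize each feature row
    F = F / (np.linalg.norm(F, axis=0, keepdims=True) + eps)     # normalize each example column
    return np.abs(F).sum()                                       # the only hyperparameter is the number of features

X = np.random.randn(20, 100)                  # 100 examples of dimension 20
W = np.random.randn(64, 20) * 0.1             # 64 learned features
print(sparse_filtering_objective(W, X))
```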

显性训练是传统的神经网络训练方法,其最大特点是训练过程中有一部分样本不参与 CNN 的误差反向传播过程,将该部分样本称为校验集。在显性训练过程中,为了防止发生过拟合现象,每隔一定时间就用当前分类模型测试校验样本,这也表明了校验集中样本选取的好坏会影响最终分类模型的性能。在 CNN 分类模型中,为了增加训练样本数,可采用“平移起始点”和“加噪”这两种技术[61]。不妨以一个采样点数为 1900 的一维信号为例,设置起始点的范围为[1,200],训练过程中,每个样本随机选定一个起始点,截取其后连续的 1700 个点作为网络的输入参与 BP 训练过程,则 CNN 的输入维数为 1700,显然起始点不同,截取所得的子段也不同。在文献[48]的校验集中,每幅 ECG 的起始点均为 1,实际上起始点也可以不一样,但是在 CNN 的整个训练过程中,必须保持该校验集不变,同时校验集和训练集完全没有交集,其样本为来自不同病人的不同记录。此外,只要对类别的最终判断没有影响,也可通过加噪处理或者对原始数据做某种扭曲变换从而达到增加训练样本的目的。

Explicit training is the traditional neural network training method; its main characteristic is that part of the samples do not take part in the error back-propagation of the CNN, and these samples are called the validation set. During explicit training, to prevent overfitting, the current classification model is tested on the validation samples at regular intervals, which also means that how well the validation samples are chosen affects the performance of the final classification model. In the CNN classification model, two techniques, "shifting the starting point" and "adding noise", can be used to increase the number of training samples [61]. Take a one-dimensional signal with 1900 sampling points as an example and set the range of the starting point to [1, 200]: during training, a starting point is chosen at random for each sample and the following 1700 consecutive points are taken as the network input for BP training, so the input dimension of the CNN is 1700; obviously, different starting points yield different sub-segments. In the validation set of [48] the starting point of every ECG is 1; in fact the starting points could differ, but throughout the training of the CNN the validation set must be kept fixed, and the validation set and the training set have no intersection at all, their samples being different records from different patients. In addition, as long as the final class judgment is not affected, the number of training samples can also be increased by adding noise or applying some distortion to the original data.

在某些应用领域如计算机辅助 ECG 分析中,不同的 ECG 记录或者一维信号之间也可能存在一些表现相似的记录或信号。如果校验样本不典型,即该校验集没有包含全部有差异的个体,则训练所得的分类模型就会存在偏差。由于受到一些现实条件的影响,人工挑选校验样本也并非是件易事。因此在 CNN 的分类过程中,还可以采用隐性训练方法。与显性训练相比,隐性训练方法与它主要的区别是怎样检验当前的分类模型。隐性训练方法从整个训练集中取出一小部分样本用于校验:用于校验的这部分样本不做加噪处理,并且对于每一个样本都截取起始点固定的子段。在实际应用中,这两种训练方法各有优势。实验表明,这种平移起始点和加噪技术对分类性能的提升有很大的帮助,尤其是对于数据不平衡的分类问题[61]。

In some application areas such as computer-aided ECG analysis, there may exist ECG records or one-dimensional signals from different sources that look very similar. If the validation samples are not typical, i.e., the validation set does not cover all the distinct individuals, the trained classification model will be biased. Constrained by practical conditions, manually selecting validation samples is not easy either. Therefore, an implicit training method can also be used in the CNN classification process. Compared with explicit training, the main difference of implicit training is how the current classification model is validated: a small part of the samples is taken from the whole training set for validation; these validation samples are not processed with added noise, and for every sample a sub-segment with a fixed starting point is intercepted. In practical applications, the two training methods have their own advantages. Experiments show that the start-point-shifting and noise-adding techniques are of great help to classification performance, especially for classification problems with imbalanced data [61].

4.2 开源工具

4.2 Open Source Tools

深度学习能够广泛应用于众多研究领域,离不开许多优秀的开源深度学习仿真工具。目前常用的深度学习仿真工具有 Caffe[62]①、Torch②③及 Theano[63]④等。Caffe 是一个基于 C++语言、且关于 CNN 相关算法的架构,它具有出色的 CNN 实现。Caffe 可以在 CPU 及 GPU 上运行,它支持 MATLAB 和 Python 接口。Caffe 提供了一个完整的工具包,用于训练、测试、微调及部署模型。Caffe 允许用户对新数据格式、网络层和损失函数进行拓展;它的运行速度也很快,在单个 K40 或者 Titan GPU 上一天可以训练超过 4 千万张图像;用户还可以通过 Caffe 社区参与开发与讨论。尽管 Caffe 可进行许多拓展,但是由于一些遗留的架构问题,它不善于处理递归神经网络(Recurrent Neural Network,RNN)模型,且 Caffe 的灵活性较差。

Deep learning can be widely applied in many research fields thanks to a number of excellent open source deep learning simulation tools. The commonly used tools include Caffe, Torch and Theano. Caffe is a framework written in C++ for CNN-related algorithms, and it has an excellent CNN implementation. Caffe can run on both CPU and GPU, and it supports MATLAB and Python interfaces. Caffe provides a complete toolkit for training, testing, fine-tuning and deploying models. It allows users to extend new data formats, network layers and loss functions; it also runs fast, being able to train more than 40 million images per day on a single K40 or Titan GPU; users can also participate in development and discussion through the Caffe community. Although Caffe supports many extensions, because of some legacy architecture issues it is not good at handling recurrent neural network (RNN) models, and its flexibility is rather poor.

Torch 是一个支持机器学习算法的科学计算框架。它是采用 Lua 脚本语言和 C 语言编写的。Torch为设计和训练机器学习模型提供了一个灵活的环境,它还可支持 iOS、 Android 等嵌入式平台。最新版本Torch7使CNN的训练速度得到大幅度提升。对于 Torch 的时域卷积,其输入长度可变,这非常有助于自然语言任务。但 Torch 没有 Python 接口。

Torch is a scientific computing framework that supports machine learning algorithms. It is written in the Lua scripting language and C. Torch provides a flexible environment for designing and training machine learning models, and it also supports embedded platforms such as iOS and Android. The latest version, Torch7, greatly improves the training speed of CNN. The temporal convolution of Torch accepts inputs of variable length, which is very helpful for natural language tasks. However, Torch has no Python interface.

Theano 是一个允许用户定义、优化并评价数学表达式的 Python 库。Theano 提供了 NumPy 的大部分功能,可在 GPU 上运行。此外,Theano 能够自动求微分,它尤其适用于基于梯度的方法。Theano 能够很容易且高效地实现 RNN 模型。然而 Theano 的编译过程很慢,导入 Theano 也需要消耗时间。

Theano is a Python library that allows users to define, optimize and evaluate mathematical expressions. Theano provides most of the functionality of NumPy and can run on the GPU. In addition, Theano can perform automatic differentiation, which makes it especially suitable for gradient-based methods. Theano can implement RNN models easily and efficiently. However, the compilation process of Theano is slow, and importing Theano also takes time.
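As a small illustration of the define-optimize-evaluate workflow and the automatic differentiation mentioned above, the following sketch builds a toy expression with Theano; the expression itself is an arbitrary example, not code from the cited work.

```python
import theano
import theano.tensor as T

# Define a symbolic expression, differentiate it automatically, and compile it.
x = T.dvector('x')
loss = T.sum(x ** 2)                     # a toy objective
grad = T.grad(loss, x)                   # automatic differentiation
f = theano.function(inputs=[x], outputs=[loss, grad])

print(f([1.0, 2.0, 3.0]))                # loss = 14.0, gradient = [2. 4. 6.]
```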

Bahrampour[64]等从可拓展性、硬件利用率及速度方面对 Caffe、Torch、Theano、Neon⑤及 TensorFlow⑥等 5 个深度学习软件架构作了比较。其中 Caffe、Torch 及 Theano 是最广泛使用的软件架构。这五个软件架构均可在 CPU 或者 GPU 上运行,但是 Neon 不能使用多线程 CPU,Caffe 需要在安装的时候确定好 CPU 的线程数,TensorFlow、Torch 及 Theano 则可以灵活地选择 CPU 线程数[64]。文献[64]通过实验发现:Torch 与 Theano 是两个最具有拓展性的架构,不仅支持各种深度结构,还支持各种库;在 CPU 上,对于任一深度网络结构的训练和部署,Torch 表现最优,其次是 Theano,Neon 的性能最差;在 GPU 上训练卷积和全连接网络,对于小网络模型 Theano 的训练速度最快,对于较大的网络模型则是 Torch 最快,而对于大的卷积网络 Neon 也非常有竞争力;在 GPU 上训练和部署 RNN 模型,Theano 的性能最好;Caffe 最易于评价标准深度结构的性能;与 Theano 类似,TensorFlow 也是非常灵活的架构,但是它在单个 GPU 上的性能不如其它几个架构。表 1 总结了 Caffe、Torch 及 Theano 所具有的一些特点⑦。Theano 没有预训练的 CNN 模型,所以在 Theano 上不能直接进行 CNN 无监督预训练。


Bahrampour et al. compared five deep learning software frameworks, Caffe, Torch, Theano, Neon and TensorFlow, in terms of extensibility, hardware utilization and speed. Among them, Caffe, Torch and Theano are the most widely used. All five frameworks can run on CPU or GPU, but Neon cannot use a multi-threaded CPU, Caffe requires the number of CPU threads to be determined at installation time, while TensorFlow, Torch and Theano can choose the number of CPU threads flexibly [64]. Through experiments, [64] found that Torch and Theano are the two most extensible frameworks, supporting not only various deep structures but also various libraries; on the CPU, for training and deployment of any deep network structure, Torch performs best, followed by Theano, and Neon performs worst; for training convolutional and fully connected networks on the GPU, Theano is the fastest for small network models, Torch is the fastest for larger models, and Neon is also very competitive for large convolutional networks; for training and deploying RNN models on the GPU, Theano gives the best performance; Caffe is the easiest for evaluating the performance of standard deep structures; similar to Theano, TensorFlow is also a very flexible framework, but its performance on a single GPU is inferior to the other frameworks. Table 1 summarizes some characteristics of Caffe, Torch and Theano⑦. Theano has no pre-trained CNN models, so unsupervised CNN pre-training cannot be carried out directly in Theano.

5 实际应用

5 Practical Application

5.1 图像分类

5.1 Image Classification

近年来,CNN 已被广泛应用于图像处理领域中。Krizhevsky 等人[24]第一次将 CNN 用于 LSVRC-12 竞赛中,通过加深 CNN 模型的深度并采用 ReLU+dropout 技术,取得了当时最好的分类结果(该网络结构也被称为 AlexNet)。AlexNet 模型中包含 5 个卷积层和 2 个全连接层。与传统 CNN 相比:在 AlexNet 中采用 ReLU 代替饱和非线性函数 tanh 函数,降低了模型的计算复杂度,模型的训练速度也提升了几倍;通过 dropout 技术在训练过程中将中间层的一些神经元随机置为零,使模型更具有鲁棒性,也减少了全连接层的过拟合;而且还通过图像平移、图像水平镜像变换、改变图像灰度等方式来增加训练样本,从而减少过拟合。相比于 AlexNet,Szegedy 等人[65]大大增加了 CNN 的深度,提出了一个超过 20 层的 CNN 结构(称为 GoogLeNet)。在 GoogLeNet 结构中采用了 3 种类型的卷积操作(1×1、3×3、5×5),该结构的主要特点是提升了计算资源的利用率,它的参数比文献[24]少了 12 倍,而且 GoogLeNet 的准确率更高,在 LSVRC-14 中获得了图像分类“指定数据”组的第一名。Simonyan 等人[66]在其发表的文章中探讨了“深度”对于 CNN 网络的重要性。该文通过在现有的网络结构中不断增加具有 3×3 卷积核的卷积层来增加网络的深度,实验表明,当权值层数达到 16-19 时,模型的性能能够得到有效提升(文中的模型也称为 VGG 模型)。VGG 模型用具有小卷积核的多个卷积层替换一个具有较大卷积核的卷积层(如用大小均为 3×3 卷积核的三层卷积层代替一层具有 7×7 卷积核的卷积层),这种替换方式减少了参数的数量,而且也能够使决策函数更具有判别性。VGG 模型在 LSVRC-14 竞赛中,得到了图像分类“指定数据”组的第二名,证明了深度在视觉表示中的重要性。但是由于 VGG 与 GoogLeNet 的深度都比较深,所以网络结构比较复杂,训练时间长,而且 VGG 还需要多次微调网络的参数。

In recent years, CNN has been widely used in the field of image processing. Krizhevsky et al. [24] used CNN in the LSVRC-12 competition for the first time; by deepening the CNN model and adopting the ReLU+dropout techniques, they achieved the best classification result at that time (the network is also known as AlexNet). The AlexNet model contains 5 convolutional layers and 2 fully connected layers. Compared with traditional CNN: AlexNet uses ReLU instead of the saturating nonlinear tanh function, which reduces the computational complexity of the model and speeds up training by several times; with the dropout technique, some neurons of the intermediate layers are randomly set to zero during training, which makes the model more robust and reduces overfitting of the fully connected layers; moreover, the training samples are augmented by image translation, horizontal mirroring and changing the image gray level, thereby further reducing overfitting. Compared with AlexNet, Szegedy et al. [65] greatly increased the depth of CNN and proposed a CNN structure of more than 20 layers (called GoogLeNet). Three types of convolution operations (1×1, 3×3, 5×5) are used in the GoogLeNet structure. The main feature of this structure is the improved utilization of computing resources: it has 12 times fewer parameters than [24] and higher accuracy, and it won first place in the "provided data" track of image classification in LSVRC-14. Simonyan et al. [66] explored the importance of "depth" for CNN networks in their published article. In that paper, the depth of the network is increased by continually adding convolutional layers with 3×3 kernels to an existing network structure. Experiments show that when the number of weight layers reaches 16-19, the performance of the model is effectively improved (the model in that paper is also called the VGG model). The VGG model replaces one convolutional layer with a large kernel by multiple convolutional layers with small kernels (for example, three convolutional layers with 3×3 kernels replace one convolutional layer with a 7×7 kernel). This substitution reduces the number of parameters and also makes the decision function more discriminative. In the LSVRC-14 competition the VGG model took second place in the "provided data" track of image classification, proving the importance of depth in visual representations. However, since both VGG and GoogLeNet are very deep, their network structures are complex and training takes a long time; in addition, VGG needs the network parameters to be fine-tuned several times.
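The parameter saving from stacking small kernels can be checked with a quick back-of-the-envelope calculation; the channel count below is an arbitrary assumption used only for illustration.

```python
# Compare parameters (ignoring biases) of one 7x7 convolutional layer with three
# stacked 3x3 layers, both mapping C channels to C channels and both covering a
# 7x7 receptive field. C = 256 is an arbitrary illustrative choice.
C = 256
one_7x7   = 7 * 7 * C * C
three_3x3 = 3 * (3 * 3 * C * C)
print(one_7x7, three_3x3, round(one_7x7 / three_3x3, 2))   # 3211264 1769472 1.81
```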

AlexNet 模型、GoogLeNet 模型与 VGG 模型都在 ImageNet 竞赛中取得了很好的结果,然而它们只能接受固定大小的输入。事实上,CNN 的卷积层不需要固定大小的输入,它可以产生任意大小的特征面,但是它的全连接层需要固定长度的输入,因此 CNN 的输入大小需保持一致的这一限制是源于它的全连接层[67]。为了获得固定大小的输入,需要对输入图像进行裁剪或者缩放,但是这样的变换会破坏输入图像的纵横比及完整的信息等,从而影响识别的准确率。He 等人[67]提出一种 SPP-net 模型,该模型是在 CNN 的最后一个卷积层与第一个全连接层中间加入一个空间金字塔池化(spatial pyramid pooling,SPP)层。SPP 层能够使 CNN 对不同大小的输入产生大小相同的输出,打破了以往 CNN 模型的输入均为固定大小的局限,且该改进的 CNN 模型训练速度较快,在 LSVRC-14 的图像分类比赛中获得第三名。

The AlexNet, GoogLeNet and VGG models all achieved good results in the ImageNet competition; however, they can only accept inputs of a fixed size. In fact, the convolutional layers of a CNN do not require fixed-size inputs and can produce feature maps of any size, but its fully connected layers require inputs of fixed length, so the restriction that the CNN input size must be consistent comes from its fully connected layers [67]. To obtain a fixed-size input, the input image has to be cropped or scaled, but such transformations destroy the aspect ratio and the complete information of the input image, thus affecting recognition accuracy. He et al. [67] proposed the SPP-net model, which inserts a spatial pyramid pooling (SPP) layer between the last convolutional layer and the first fully connected layer of the CNN. The SPP layer enables inputs of different sizes to produce outputs of the same size, breaking the limitation that the input of previous CNN models must be of fixed size. The improved CNN model is also fast to train, and it won third place in the LSVRC-14 image classification competition.
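The following NumPy sketch shows how pyramid pooling turns feature maps of arbitrary size into a fixed-length vector; the pyramid levels and the way the bins are cut are assumptions of this sketch rather than the exact SPP-net configuration.

```python
import numpy as np

def spp(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling sketch: fixed-length output from any H x W input.
    feature_map : (C, H, W) array. Bin boundaries are an assumption of this sketch."""
    C, H, W = feature_map.shape
    out = []
    for n in levels:                               # pool into an n x n grid per level
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, hs[i]:max(hs[i + 1], hs[i] + 1),
                                        ws[j]:max(ws[j + 1], ws[j] + 1)]
                out.append(region.max(axis=(1, 2)))   # max pool each bin
    return np.concatenate(out)                     # length C * (16 + 4 + 1), independent of H, W

print(spp(np.random.rand(8, 13, 17)).shape, spp(np.random.rand(8, 40, 25)).shape)
```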

在层级很深的深度网络模型中,除了存在梯度扩散问题外,还存在着退化问题。批规范化(Batch Normalization,BN)是解决梯度扩散问题的一种有效方法[68]。所谓退化问题就是:随着深度的增加,网络精度达到饱和,然后迅速下降。且该性能的下降不是由过拟合引起的,而是增加网络的深度使得它的训练误差也随之增加[69]。文献[69]采用残差网络(Residual Networks,ResNet)来解决退化问题。ResNet 的主要特点是跨层连接,它通过引入捷径连接技术(Shortcut Connections)将输入跨层传递并与卷积的结果相加。在 ResNet 中只有一个取样层,它连接在最后一个卷积层后面。ResNet 使得底层的网络能够得到充分训练,准确率也随着深度的加深而得到显著提升。将深度为 152 层的 ResNet 用于 LSVRC-15 的图像分类比赛中,它获得了第一名的成绩。在该文献中,还尝试将 ResNet 的深度设置为 1000,并在 CIFAR-10 图像处理数据集中验证该模型。

In very deep network models, besides the gradient diffusion problem, there is also a degradation problem. Batch Normalization (BN) is an effective method for the gradient diffusion problem [68]. The degradation problem is that, as depth increases, the accuracy of the network saturates and then drops rapidly; this drop in performance is not caused by overfitting, but by the fact that increasing the depth of the network also increases its training error [69]. Reference [69] uses residual networks (ResNet) to solve the degradation problem. The main feature of ResNet is cross-layer connection: by introducing shortcut connections, the input is passed across layers and added to the result of the convolution. There is only one pooling layer in ResNet, which is connected after the last convolutional layer. ResNet enables the lower layers of the network to be trained adequately, and the accuracy improves significantly as the depth increases. A 152-layer ResNet was used in the LSVRC-15 image classification competition and won first place. In that paper, the authors also tried setting the depth of ResNet to 1000 and validated the model on the CIFAR-10 image processing dataset.
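A toy NumPy sketch of the shortcut idea described above: the block output is the transformed input plus the input itself. Plain matrix multiplications stand in for the convolutions, and the sizes are arbitrary illustrative choices.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Identity-shortcut residual block sketch: y = F(x, W) + x."""
    out = relu(x @ W1)        # first weight layer + nonlinearity
    out = out @ W2            # second weight layer
    return relu(out + x)      # shortcut connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
W1 = rng.standard_normal((64, 64)) * 0.01
W2 = rng.standard_normal((64, 64)) * 0.01
print(residual_block(x, W1, W2).shape)   # (4, 64)
```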

AlexNet 与 VGG 模型的网络结构为直线型,它们的输入都是从第一个卷积层按单个路径直接流入最后一层。在 BP 训练中预测误差是由最顶层传递到底层的,对于很深的网络模型传递至底层的误差很小,难以优化底层参数[70]。因此,对于 AlexNet 与 VGG 模型,如果它们的深度很深,则将难以优化它们的结构。为了使网络结构能够得到有效训练,GoogLeNet 在多个中间层中加入监督信号。ResNet 则通过捷径连接技术使得输入可以通过多个路径流入最顶层,它大幅度降低了更深层模型的训练难度。如何有效地训练层级很深的深度网络模型仍旧是一个有待好好研究的问题。尽管图像分类任务能够受益于层级较深的卷积网络,但一些方法还是不能很好地处理遮挡或者运动模糊等问题。

The network structures of the AlexNet and VGG models are linear: their input flows from the first convolutional layer to the last layer along a single path. In BP training the prediction error is propagated from the top layer down to the bottom layer, and for very deep models the error that reaches the bottom layers is very small, so it is difficult to optimize the bottom-layer parameters [70]. Therefore, for the AlexNet and VGG models, if they are made very deep, their structures become difficult to optimize. To make the network trainable, GoogLeNet adds supervision signals to several intermediate layers, while ResNet uses shortcut connections so that the input can flow to the top layer through multiple paths, which greatly reduces the training difficulty of deeper models. How to effectively train very deep network models is still a problem that deserves further study. Although the image classification task can benefit from deeper convolutional networks, some methods still cannot handle problems such as occlusion or motion blur well.

Mishkin 等 人 [71] 系 统 地 比 较 了 近 年 来 在 ImageNet 竞赛的大数据中不同 CNN 结构(包括VGG、GoogLeNet)的性能及不同参数选取对 CNN 结构的影响。文中通过实验得到以下一些建议:1) 对于激励函数,可选取没有 BN 的指数线性单元(Exponential Linear Unit,ELU)[37,71]或者有 BN 的ReLU 非线性函数;2)在取样层中采用平均池化及 最大值池化的和比随机池化、单独的平均池化或者最大池化等方法要好;3)相比较于平方根学习率衰减方法(square root)、平方学习率衰减方法(square)或者阶跃学习率衰减方法(step),使用线性学习率衰减方法(linear)更好;4)最小批量大小(mini-batch size)可取 128 或者 256 左右,如果这对于所用 GPU 而言还是太大,那么可按批量大小(batch size)成比例减少学习率;5)目前深度学习的性能高度依赖于数据集的大小。如果训练集大小小于它的最小值,那么模型性能会迅速下降。因此当研究增加训练集大小时,需要检查数据量是否已达到模型所需的最小值;6)由于要人工标注大数据是不切实际的,因此可以用免费的、可用的噪声标注数据(噪声标注表示该数据的标注不一定正确)代替,然而实验表明数据的整洁性比数据量大小更重要;7)如果不能增加输入图像的大小,那么可以减小其后卷积层中的滑动步长,这样也能够得到大致相同的结果。

Mishkin et al. systematically compared the performance of different CNN structures (including VGG and GoogLeNet) on the large-scale ImageNet data of recent competitions and the influence of different parameter choices on CNN structures. From their experiments they give the following suggestions: 1) for the activation function, use the exponential linear unit (ELU) [37,71] without BN, or the ReLU nonlinearity with BN; 2) in the pooling layers, the sum of average pooling and max pooling works better than stochastic pooling, average pooling alone or max pooling alone; 3) a linear learning-rate decay schedule is better than the square-root, square or step decay schedules; 4) the mini-batch size can be around 128 or 256; if this is still too large for the GPU being used, the learning rate can be reduced in proportion to the batch size; 5) the performance of current deep learning is highly dependent on the size of the dataset: if the training set is smaller than a certain minimum, model performance drops rapidly, so when studying an increase of the training set size, one should check whether the amount of data has reached the minimum required by the model; 6) since manually labeling big data is impractical, freely available noisily labeled data (noisy labels mean the annotations are not necessarily correct) can be used instead; experiments show, however, that the cleanliness of the data is more important than its quantity; 7) if the size of the input image cannot be increased, the stride of the subsequent convolutional layers can be reduced, which yields roughly the same results.
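As a small illustration of recommendation 3 above, the sketch below contrasts a linear learning-rate decay with a step decay; the base rate, final rate and epoch counts are illustrative assumptions, not values from the cited study.

```python
# Linear decay interpolates the learning rate from base_lr to final_lr over training,
# while step decay drops it by a fixed factor every few epochs.
def linear_lr(epoch, total_epochs, base_lr=0.01, final_lr=1e-4):
    t = min(epoch / float(total_epochs), 1.0)
    return base_lr + t * (final_lr - base_lr)

def step_lr(epoch, base_lr=0.01, drop=0.1, every=30):
    return base_lr * (drop ** (epoch // every))

for e in (0, 30, 60, 90):
    print(e, round(linear_lr(e, 90), 5), round(step_lr(e), 7))
```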

5.2 人脸识别

5.2 Face Recognition

在人脸识别中,传统的识别路线包括 4 个步骤:检测-对齐-人脸表示-分类。DeepFace[72]也遵循这一技术路线,但是对人脸对齐和人脸表示阶段进行了改进。在 DeepFace 中首先对图像进行 3D 人脸对齐,再输入到深度神经网络中 。DeepFace 的前 3 层(2个卷积层及 1 个取样层)用于提取低级特征(如边缘及纹理信息)。取样层能够使得网络对微小偏移更具有鲁棒性,但是为了减少信息的丢失,DeepFace 的取样层只有 1 层,其紧跟在第一个卷积层后面。DeepFace 的第二个卷积层后紧连着 3 个局部连接层(这 3 个局部连接层卷积核不共享),由于在对齐的人脸图像中不同的区域有不同的局部统计特征,采用不共享的卷积核可减少信息的丢失。DeepFace 具有 2 个全连接层,全连接层可用来捕获人脸图像不同位置的特征之间(如人眼的位置与形状、嘴巴的位置与形状)的相关性。该模型应用于户外人脸检测数据库(Labeled Faces in the Wild,LFW)中,文献[72]取得的人脸识别准确率为 97.35%,接近人眼辨识准确率 97.53%,文中所用方法克服了以往方法的缺点和局限性。然而 DeepFace 的参数个数多于 1.2 亿,其中 95%参数来自 3 个局部连接层及 2 个全连接层,因此 DeepFace对有标注样本的数量要求较高,它需要一个大的有标注数据集。

In face recognition, the traditional recognition pipeline consists of four steps: detection - alignment - face representation - classification. DeepFace [72] also follows this pipeline, but improves the face alignment and face representation stages. In DeepFace, the image is first aligned with a 3D face model and then fed into a deep neural network. The first three layers of DeepFace (2 convolutional layers and 1 pooling layer) are used to extract low-level features such as edges and texture. The pooling layer makes the network more robust to small shifts, but to reduce the loss of information DeepFace has only one pooling layer, which follows the first convolutional layer. The second convolutional layer of DeepFace is followed by three locally connected layers (whose convolution kernels are not shared); since different regions of an aligned face image have different local statistics, using unshared kernels reduces the loss of information. DeepFace has two fully connected layers, which can be used to capture the correlations between features at different positions of the face image (for example, the position and shape of the eyes and of the mouth). The model was applied to the Labeled Faces in the Wild (LFW) database, and [72] achieved a face recognition accuracy of 97.35%, close to the human accuracy of 97.53%; the method overcomes the shortcomings and limitations of previous approaches. However, DeepFace has more than 120 million parameters, 95% of which come from the three locally connected layers and the two fully connected layers, so it has a high demand for labeled samples and requires a large annotated dataset.

在 DeepID[73]、DeepID2[74]之后,Sun 等人又相继提出了 DeepID2+[75]、DeepID3[76]。DeepID2+继承了 DeepID2 的结构,它也包含 4 个卷积层(其中第四个卷积层权值不共享),且每个卷积层后均紧随着一个取样层,并作了 3 个方面的改进:1)加大网络结构,每个卷积层的特征面个数增加到了128 个,最终的特征表示也增加到了 512 维;2)增加了训练数据;3)一个具有 512 维的全连接层均与每一个取样层进行全连接,且每一取样层都添加监督信号(由人脸辨识信号和人脸确认信号组成),使用监督信号既能够增加类间变化又能够减少类内变化。DeepID2+在 LFW 上的准确率达到了99.47%。DeepID2+具有 3 个重要的属性:1)它的顶层神经元响应是中度稀疏的,即使将神经元二值化后,仍能获得较好的识别结果,该性质能够最大化网络的辨识能力及图像间的距离;2)高层的神经元对人脸身份以及人脸属性具有很高的选择性;3)高层神经元对局部遮挡具有良好的鲁棒性。以往的许多研究工作为了获得这些引人注目的属性,通常需要对模型加入一些显性的约束,但是DeepID2+通过大数据训练深度模型就能够自动地得到这些属性[75]。DeepID2+的提出不仅能够显著提升人脸识别的性能,还能够帮助人们理解深度模型及其网络连接,且对稀疏表示、属性学习和遮挡处理等研究也起一定的指导作用[75]。Sun 等人[76]分别重建了VGG网络和GoogLeNet网络,得到DeepID3 net1 网络和 DeepID3 net2 网络(将它们称为 DeepID3)。DeepID3 继承了 DeepID2+的一些特点,包括在最后几个特征提取层中它们的权值也不共享,并且为了使网络能够更好地学习中级特征及更易于训练,在网络的一些中间层中也要加入人脸辨识-人脸确认监督信号。然而 DeepID3 的深度更深,且它的非线性特征提取层可达 10-15 层。通过结合DeepID3 net1 网络和 DeepID3 net2 网络,在 LFW上 DeepID3 的人脸识别准确率为 99.53%。尽管DeepID3 的深度要比 DeepID2+深,但是它要比VGG 或者 GoogLeNet 深度浅得多。然而当更正了 LFW 上一些标注错误的数据后,它的准确率与 DeepID2+一样,还需在更大的训练集上进一步研究很深的深度模型的有效性。

After DeepID [73] and DeepID2 [74], Sun et al. successively proposed DeepID2+ [75] and DeepID3 [76]. DeepID2+ inherits the structure of DeepID2: it also contains four convolutional layers (the weights of the fourth convolutional layer are not shared), each followed by a pooling layer, and it makes three improvements: 1) the network is enlarged, the number of feature maps of each convolutional layer is increased to 128, and the final feature representation is increased to 512 dimensions; 2) the training data is increased; 3) a 512-dimensional fully connected layer is fully connected to every pooling layer, and a supervision signal (composed of a face identification signal and a face verification signal) is added to every pooling layer; using these supervision signals can both increase inter-class variation and reduce intra-class variation. DeepID2+ reaches an accuracy of 99.47% on LFW. DeepID2+ has three important properties: 1) the responses of its top-level neurons are moderately sparse, and even after the neurons are binarized, good recognition results can still be obtained; this property maximizes the discriminative ability of the network and the distance between images; 2) the high-level neurons are highly selective for face identities and face attributes; 3) the high-level neurons are robust to local occlusion. In much previous work, explicit constraints usually had to be added to the model to obtain such appealing properties, whereas DeepID2+ obtains them automatically by training a deep model on big data [75]. DeepID2+ not only significantly improves face recognition performance, but also helps people understand deep models and their network connections, and it provides guidance for research on sparse representation, attribute learning and occlusion handling [75]. Sun et al. [76] then rebuilt the VGG network and the GoogLeNet network to obtain the DeepID3 net1 and DeepID3 net2 networks (together referred to as DeepID3). DeepID3 inherits some characteristics of DeepID2+, including unshared weights in the last few feature extraction layers, and identification-verification supervision signals are also added to some intermediate layers so that the network can learn mid-level features better and be easier to train. However, DeepID3 is deeper, with 10-15 nonlinear feature extraction layers. By combining the DeepID3 net1 and DeepID3 net2 networks, DeepID3 achieves a face recognition accuracy of 99.53% on LFW. Although DeepID3 is deeper than DeepID2+, it is much shallower than VGG or GoogLeNet. However, after some mislabeled data on LFW were corrected, its accuracy turned out to be the same as that of DeepID2+, and the effectiveness of very deep models still needs to be further studied on larger training sets.

FaceNet[77]是由 Google 公司提出的一种人脸识别模型,它直接学习从人脸图像到紧致欧式空间的一个映射,使欧式距离直接关联着人脸相似度的一个度量。FaceNet 是一个端对端的学习方法,它通过引入三元组损失函数进行人脸验证、识别和聚类。FaceNet 直接优化与任务相关的三元组损失函数,在训练过程中该损失不仅仅用在最后一层,它也用于多个层中。然而如果选择不合适的三元组损失函数,那么将会影响模型的性能,同时也会使收敛速度变慢,因此三元组损失函数的选择对于FaceNet 性能的提升是很重要的。经 LFW 数据库和 YouTube 人脸数据库测试,FaceNet 得到的识别准确率分别为 99.63%和 95.12%。

FaceNet [77] is a face recognition model proposed by Google. It directly learns a mapping from face images to a compact Euclidean space, so that the Euclidean distance directly corresponds to a measure of face similarity. FaceNet is an end-to-end learning method that performs face verification, recognition and clustering by introducing a triplet loss function. FaceNet directly optimizes the task-related triplet loss; during training this loss is not only applied to the last layer but also to multiple layers. However, if unsuitable triplets are chosen for the loss, the performance of the model suffers and convergence slows down, so the selection of triplets is very important for improving the performance of FaceNet. Tested on the LFW database and the YouTube Faces database, FaceNet achieves recognition accuracies of 99.63% and 95.12%, respectively.
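A minimal NumPy sketch of the triplet loss mentioned above, computed on embedding vectors; the margin, batch size and embedding dimension are illustrative assumptions rather than FaceNet's actual settings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss sketch: pull the positive closer to the anchor than the negative."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)   # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)   # squared distance to different identity
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(1)
a, p, n = (rng.standard_normal((8, 128)) for _ in range(3))   # toy embedding batch
print(triplet_loss(a, p, n))
```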

相比较于 DeepFace、DeepID,FaceNet 不需要进行复杂的 3D 对齐,DeepID 则需要一个简单的 2D 仿射对齐。Parkhi 等人[78]在其文章中也研究了在不同 CNN 结构中人脸对齐对人脸识别准确性的影响。文献[78]通过实验发现:有必要对测试集作精准对齐,训练集则不需太准,且对齐后 FaceNet 的识别准确率比原模型的高。在 LFW 数据库上,DeepFace 系列及 FaceNet 的人脸识别准确率都比较高,但是 CNN 在人脸识别中仍然有许多具有挑战性的问题,如面部特征点定位、人脸姿态等对人脸识别效果的影响,都是需要深入研究的问题[79]。

Compared with DeepFace and DeepID, FaceNet does not need complex 3D alignment, while DeepID needs a simple 2D affine alignment. Parkhi et al. [78] also studied the effect of face alignment on face recognition accuracy for different CNN structures. Through experiments, [78] found that it is necessary to align the test set precisely, while the training set does not need to be aligned so precisely, and that the recognition accuracy of FaceNet after alignment is higher than that of the original model. On the LFW database, the DeepFace series and FaceNet all achieve fairly high face recognition accuracy, but there are still many challenging problems for CNN in face recognition, such as the influence of facial landmark localization, face pose and so on, which all need to be studied in depth [79].

5.3 音频检索

5.3 Audio Retrieval

Hamid等[80-81]结合隐马尔科夫建立了CNN用于识别语音的模型,并在标准 TIMIT 语音数据库上进行实验,实验结果显示该模型的错误率相对于具有相同隐含层数和权值的常规神经网络模型下降了 10%,表明 CNN 模型能够提升语音的识别准确率。在文献[80-81]中,CNN 模型的卷积层均采用了受限权值共享(limited weight sharing ,LWS)技术,该技术能够更好地处理语音特征,然而这种 LWS 方法仅限于单个卷积层,不像大部分的 CNN 研究使用多个卷积层。IBM 和微软公司近年来在CNN 用于识别语音方面也做了大量的研究工作,并发表了一些相关的论文[82-84]。

Hamid et al. [80-81] combined CNN with hidden Markov models to build a speech recognition model and carried out experiments on the standard TIMIT speech database. The results show that the error rate of this model is reduced by 10% compared with a conventional neural network model with the same number of hidden layers and weights, indicating that the CNN model can improve speech recognition accuracy. In [80-81], the convolutional layers of the CNN model adopt limited weight sharing (LWS), which can handle speech features better; however, this LWS approach is limited to a single convolutional layer, unlike most CNN studies which use multiple convolutional layers. IBM and Microsoft have also done a lot of research on CNN for speech recognition in recent years and have published some related papers [82-84].

5.4 ECG 分析

5.4 ECG Analysis

ECG 是目前极为有用的一种心血管系统疾病的临床诊断体征。远程医疗诊断服务系统的产生使得更多的人获得医疗专家的诊断服务,许多研究者包括本课题组多年来一直致力于研究计算机辅助 ECG 分析[85]。Kadi 等人[86]综述了从 2000 年到 2015 年将数据挖掘技术应用于计算机辅助心血管疾病分析的文章。他们根据数据挖掘技术及其性能选出 149 篇文献并进行分析,通过研究发现:从 2000 年到 2015 年,关于使用数据挖掘技术辅助分析心血管疾病的研究数量呈增长趋势;研究人员常将挖掘技术用于分类和预测;相比较于其它数据挖掘技术,神经网络和支持向量机能够获得更高的准确率。该文献的分析结果也说明了神经网络技术在计算机辅助心血管疾病分析中的有效性。然而由于实际应用中 ECG 数据形态复杂多变,将传统的神经网络技术应用于大数据的 ECG 分析中,取得的结果并不是很理想。

ECG is currently a very useful clinical diagnostic sign of cardiovascular diseases. The emergence of telemedicine diagnostic service systems allows more people to obtain the diagnostic services of medical experts, and many researchers, including our group, have been working on computer-aided ECG analysis for many years [85]. Kadi et al. [86] reviewed articles that applied data mining techniques to computer-aided cardiovascular disease analysis from 2000 to 2015. They selected and analyzed 149 papers according to the data mining techniques used and their performance, and found that: from 2000 to 2015, the number of studies using data mining techniques to assist the analysis of cardiovascular diseases shows a growing trend; researchers often use mining techniques for classification and prediction; and compared with other data mining techniques, neural networks and support vector machines achieve higher accuracy. The results of this review also illustrate the effectiveness of neural network techniques in computer-aided cardiovascular disease analysis. However, because the morphology of ECG data in practical applications is complex and variable, applying traditional neural network techniques to large-scale ECG analysis has not produced very satisfactory results.

临床实际应用中,ECG 多数为多导联信号,与二维图像相似。本课题组成员朱[87]针对多导联 ECG 数据,同时考虑到 CNN 的优越特性,提出了一种 ECG-CNN 模型,从目前公开发表的文献可知,该 ECG-CNN 模型也是 CNN 首次应用于 ECG 分类中。ECG-CNN 模型采用具有 3 个卷积层和 3 个取样层的 CNN 结构,其输入数据维数为 8×1800(对应 8 个基本导联的 ECG 采样点数)。ECG-CNN 的第一个卷积核的大小为 8×23,它包含了全部的行,这与 LeNet-5 网络结构在图像中的卷积核大小为 5×5 不一样,图像中的卷积核一般不会包含全部的行。通过采用 ECG-CNN 模型对国际公认的心律失常数据库-MIT-BIH 数据库①(该数据库共 48 条记录)中的 40 条 ECG 记录进行病人内心拍分类,得到的准确率为 99.2%。同时在该文献中还采用 ECG-CNN 模型对本课题组为了面向临床应用而建立的中国心血管疾病数据库[88](Chinese Cardiovascular Disease Database,CCDD http://58.210.56.164:88/ccdd/)的前 251 条记录进行心拍正异常分类,得到的准确率 97.89%。文中将文献[89]和文献[90]作为对照文献,相同数据集上,文献[89]和[90]得到的心拍正异常分类准确率分别为 98.51%和 94.97%。此外文献[87]还采用该算法对 CCDD 数据库的 Set IV 数据集共 11760 条记录进行按记录的病人间正异常分类,最终准确率为 83.49%,文献[89]和[90]在该数据集中得到的准确率分别为 70.15%和 72.14%。从上述对比结果可知,无论是心拍正异常分类还是按记录的病人间正异常分类,ECG-CNN 模型得到的准确率均高于对照文献的准确率。Hakacova 等[91]2012 年统计了市场上一些心电图机的自动诊断结果,总共统计了 576 例 ECG,发现 Philips medical 自动诊断准确率只有 80%,Draeger medical systems 的准确率为 75%,而 3 名普通医生的 ECG 判读准确率为 85%,对比该统计结果及 ECG-CNN 模型所得结果,可知 CNN 在 ECG 分类中的有效性。

In clinical practice, ECGs are mostly multi-lead signals, similar to two-dimensional images. Zhu [87], a member of our group, considering the superior characteristics of CNN, proposed an ECG-CNN model for multi-lead ECG data; as far as the published literature shows, this ECG-CNN model was also the first application of CNN to ECG classification. The ECG-CNN model uses a CNN structure with three convolutional layers and three pooling layers, and its input dimension is 8×1800 (corresponding to the ECG sampling points of the 8 basic leads). The size of the first convolution kernel of ECG-CNN is 8×23, covering all the rows, unlike the 5×5 kernels used on images in the LeNet-5 structure, where a kernel generally does not cover all the rows. Using the ECG-CNN model, 40 ECG records from the internationally recognized arrhythmia database, the MIT-BIH database (48 records in total), were classified at the heartbeat level within patients, and the accuracy obtained was 99.2%. In the same work, the ECG-CNN model was also applied to the first 251 records of the Chinese Cardiovascular Disease Database (CCDD, http://58.210.56.164:88/ccdd/) [88], which our group built for clinical application, to classify heartbeats as normal or abnormal, and the accuracy obtained was 97.89%. References [89] and [90] were used as baselines; on the same dataset their heartbeat normal/abnormal classification accuracies were 98.51% and 94.97%, respectively. In addition, [87] applied the algorithm to the Set IV dataset of the CCDD (11760 records in total) for record-level, inter-patient normal/abnormal classification, and the final accuracy was 83.49%, while [89] and [90] obtained 70.15% and 72.14% on this dataset, respectively. These comparisons show that, whether for heartbeat normal/abnormal classification or for record-level inter-patient normal/abnormal classification, the accuracy of the ECG-CNN model is higher than that of the baseline methods. Hakacova et al. [91] surveyed in 2012 the automatic diagnosis results of some commercial electrocardiographs over a total of 576 ECGs, and found that the automatic diagnosis accuracy of Philips Medical was only 80%, that of Draeger Medical Systems was 75%, while the ECG interpretation accuracy of three ordinary physicians was 85%. Comparing these statistics with the results of the ECG-CNN model shows the effectiveness of CNN in ECG classification.

文献[87]的 ECG-CNN 模型其实也是一种二维CNN,但是 ECG 的导联间数据相关性与导联内数据的相关性不一样,导联内数据具有时间相关性,导联间的数据却是独立的,因此不宜使用二维图像的 CNN 结构应用于 ECG 分类中[48]。据此,金等[48]在 ECG-CNN 模型上做了改进,提出了导联卷积神经网络( Lead Convolutional Neural Network , LCNN)模型。图 8 所示为基于记录分类的 LCNN 结构。

图 8 基于记录分类的 LCNN 结构

The ECG-CNN model of [87] is in fact also a two-dimensional CNN, but the correlation between ECG leads is different from the correlation within a lead: the data within a lead are temporally correlated, whereas the data between leads are independent, so it is not appropriate to apply the two-dimensional-image CNN structure directly to ECG classification [48]. Accordingly, Jin et al. [48] improved the ECG-CNN model and proposed the Lead Convolutional Neural Network (LCNN) model. Figure 8 shows the LCNN structure for record-level classification.

在图 8 中,每个卷积单元 CU 均包含一个卷积层和一个取样层,例如 CU-A1、CU-B1 及 CU-C1 均分别包含一个卷积层和一个取样层,1D-Conv 表示一维卷积运算。对于 8 个导联,每一个导联均有 3 个卷积单元,而且不同导联间的卷积单元是相互独立的。每个导联的数据依次通过 3 个卷积单元,如其中一个导联依次通过卷积单元 CU-A1、CU-B1、CU-C1,然后将每个导联的第三个取样层都连接到同一个全连接层进行信息汇总,最终在逻辑回归层上进行分类。与文献[87]相比,ECG-CNN 模型只有 3 个卷积单元,而图 8 中的 LCNN 结构有 24 个卷积单元。文献[87]中对于连接输入层的卷积层,其卷积核大小为 8×23,图 8 中每一个导联的第一个卷积层的卷积核大小均为 1×18。为了增加训练样本从而降低不同类别 ECG 数据的不平衡性,LCNN 充分利用了 ECG 记录的周期特性,对 ECG 记录进行起始点平移操作,将一条 ECG 记录所有可能的情况都包含进去[48]。在 LCNN 的训练过程中,采用惯性量和变步长的反向 BP 算法[92]。同样在 CCDD 上进行模型验证,经 15 万多条 ECG 记录的测试,LCNN 取得了 83.66%的 ECG 病人间正异常分类准确率,该结果也说明了 LCNN 在实际应用中的有效性。王[93]构建了一个包含个体内时间序列及统计分类的混合分类模型(简称 ECG-MTHC),该模型包含 RR 间期正异常分析、QRS 波群相似度分析、基于数值和形态特征的 SVM 分类模型及 ECG 典型特征分析 4 个分类模块。金等将 ECG-MTHC 模型同样对 CCDD 中的 15 万多条记录进行测试,但是由于有 1 万多条 ECG 记录的中间特征提取出错而无法给出诊断结论,因此 ECG-MTHC 模型只给出了 14 多万条 ECG 的自动诊断结果,其判断准确率为 72.49%,而 LCNN 在该测试数据上的分类结果为 83.72%[48]。与文献[93]相比:1)LCNN 实际上也是一个端对端的学习方法,将中间的卷积层和取样层提取得到的特征输入到全连接层中,最后由 softmax 层进行分类;2)对于较大规模的数据集,LCNN 比 ECG-MTHC 更易于训练;3)由于 LCNN 的深度架构及复杂的网络结构,使它具有很强的非线性拟合能力,克服了 ECG-MTHC 中 SVM 非线性拟合能力受限的缺欠。最终,LCNN 的分类准确率高于 ECG-MTHC 的准确率。周等[94]将 LCNN 作为基分类器提出了一种基于集成学习的室性早搏识别方法,采用该方法对 MIT-BIH 中的 48 条记录进行室性早搏心拍分类得到的准确率为 99.91%;同时该文还注重模拟医生诊断 ECG 的思维过程,采用 LCNN 与室性早搏诊断规则相结合的方法对 CCDD 进行按记录的室性早搏分类,得到 14 多万条记录的测试准确率为 97.87%。

In Figure 8, each convolution unit CU contains one convolutional layer and one pooling layer; for example, CU-A1, CU-B1 and CU-C1 each contain a convolutional layer and a pooling layer, and 1D-Conv denotes the one-dimensional convolution operation. For the 8 leads, each lead has three convolution units, and the convolution units of different leads are independent of each other. The data of each lead pass through three convolution units in turn; for example, one lead passes through CU-A1, CU-B1 and CU-C1 in sequence. The third pooling layer of every lead is then connected to the same fully connected layer for information aggregation, and the final classification is carried out in a logistic regression layer. Compared with [87], the ECG-CNN model has only three convolution units, while the LCNN structure in Figure 8 has 24. In [87] the convolutional layer connected to the input layer has a kernel of size 8×23, whereas in Figure 8 the first convolutional layer of each lead has a kernel of size 1×18. To increase the number of training samples and thus reduce the imbalance among different classes of ECG data, LCNN makes full use of the periodic nature of ECG records and shifts the starting point of each ECG record so that all possible situations of a record are covered [48]. During LCNN training, a BP algorithm with a momentum term and variable step size is adopted [92]. The model was also validated on the CCDD: tested on more than 150,000 ECG records, LCNN achieved an inter-patient normal/abnormal classification accuracy of 83.66%, which also illustrates its effectiveness in practical applications. Wang [93] built a hybrid classification model (referred to as ECG-MTHC) combining intra-individual time series and statistical classification, which consists of four modules: RR-interval normal/abnormal analysis, QRS-complex similarity analysis, an SVM classification model based on numerical and morphological features, and typical ECG feature analysis. Jin et al. tested the ECG-MTHC model on the same 150,000+ CCDD records, but because the intermediate feature extraction failed for more than 10,000 ECG records and no diagnostic conclusion could be given for them, the ECG-MTHC model only produced automatic diagnoses for a little over 140,000 ECGs, with an accuracy of 72.49%, whereas the classification accuracy of LCNN on the same test data was 83.72% [48]. Compared with [93]: 1) LCNN is actually an end-to-end learning method, feeding the features extracted by the intermediate convolutional and pooling layers into the fully connected layer, with a softmax layer performing the final classification; 2) for larger datasets, LCNN is easier to train than ECG-MTHC; 3) thanks to its deep architecture and complex network structure, LCNN has a strong nonlinear fitting ability, overcoming the limited nonlinear fitting ability of the SVM in ECG-MTHC. In the end, the classification accuracy of LCNN is higher than that of ECG-MTHC. Zhou et al. [94] took LCNN as the base classifier and proposed a premature ventricular contraction (PVC) recognition method based on ensemble learning; using this method, the accuracy of PVC heartbeat classification on the 48 records of MIT-BIH is 99.91%. The same paper also pays attention to simulating the reasoning process of physicians when diagnosing ECGs, and combines LCNN with PVC diagnostic rules to perform record-level PVC classification on the CCDD, obtaining a test accuracy of 97.87% on more than 140,000 records.
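The following NumPy sketch illustrates the per-lead branching idea of LCNN described above: each lead has its own (unshared) convolution kernels, and the branch outputs are merged before the shared fully connected layer. The single-stage branch, the ReLU and all sizes are simplifying assumptions; the real LCNN has three convolution units per lead.

```python
import numpy as np

def conv1d_valid(x, k):
    """Valid one-dimensional convolution (correlation) of signal x with kernel k."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def lcnn_forward_sketch(ecg, kernels):
    """Per-lead branches with unshared kernels, concatenated before the FC layer.
    ecg     : (8, L) array, one row per lead
    kernels : list of 8 kernels, one per lead"""
    branches = []
    for lead, k in zip(ecg, kernels):
        f = np.maximum(conv1d_valid(lead, k), 0.0)        # conv + ReLU
        f = f[: len(f) // 2 * 2].reshape(-1, 2).max(1)    # 1x2 max pooling
        branches.append(f)
    return np.concatenate(branches)                       # merged features for the FC layer

rng = np.random.default_rng(0)
ecg = rng.standard_normal((8, 200))
kernels = [rng.standard_normal(18) for _ in range(8)]     # e.g. 1x18 first-layer kernels
print(lcnn_forward_sketch(ecg, kernels).shape)
```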

然而在文献[48,87,94]的 CNN 结构中,它们的全连接层只能接受固定长度的输入,因此在网络训练之前需要将 ECG 记录截取到固定长度。但是在实际应用中,ECG 记录的长度通常不一致,如在 CCDD 中 ECG 记录的长度为 10s-30s,而且有的疾病(如早搏)可以发生在一条记录的前几秒,它也可发生在记录中的中间几秒或者最后几秒,这种截取到固定长度的方式可能会使信息丢失比较严重。

However, in the CNN structures of [48,87,94], the fully connected layers can only accept inputs of fixed length, so ECG records have to be truncated to a fixed length before network training. In practical applications, though, the lengths of ECG records are usually inconsistent; for example, in the CCDD the length of an ECG record is 10s-30s, and some abnormalities (such as premature beats) can occur in the first few seconds of a record, but also in the middle or in the last few seconds, so truncating records to a fixed length may cause a rather serious loss of information.

Zheng[95-96]等人将一种多通道的深层卷积神经网络模型(Multi-Channels Deep Convolution Neural Networks,MC-DCNN)应用于时间序列分类中,每一通道的数据都首先经过一个独立的 CNN 结构,其中每一通道的输入是一个时间序列,然后将每一个 CNN 结构的最后一层卷积层全连接到 MLP 中进行分类,在 BIDMC 充血性心力衰竭数据集上的检测准确率为 94.65%,优于其他一些算法。Kiranyaz[27]等人提出一种基于一维 CNN 的病人内 ECG 分类,该 CNN 结构包含 3 个 CNN 层和 2 个 MLP 层,将 MIT-BIH 数据库中的 44 条记录作为实验数据,得到室性异位心拍(VEB)和室上性异位心拍(SVEB)的分类准确率分别为 99%和 97.6%。然而这些研究工作仅利用了标准数据库中的部分数据,不能够充分体现模型在实际应用中的整体分类性能。

Zheng et al. [95-96] applied a multi-channel deep convolutional neural network model (MC-DCNN) to time series classification: the data of each channel first pass through a separate CNN structure, where the input of each channel is one time series, and then the last convolutional layer of every CNN structure is fully connected to an MLP for classification; the detection accuracy on the BIDMC congestive heart failure dataset is 94.65%, better than some other algorithms. Kiranyaz et al. [27] proposed an intra-patient ECG classification based on one-dimensional CNN; the structure consists of three CNN layers and two MLP layers. Using 44 records of the MIT-BIH database as experimental data, the classification accuracies for ventricular ectopic beats (VEB) and supraventricular ectopic beats (SVEB) are 99% and 97.6%, respectively. However, these studies use only part of the data in the standard databases and cannot fully reflect the overall classification performance of the models in practical applications.

由于不同的时间序列可能需要不同时间尺度上的不同特征表示,但是现有的许多算法没有考虑到这些因素,而且受到高频干扰及随机噪声的影响,在实时时间序列数据中具有判别性的模式通常也会变形。为了克服这些问题,Cui[97]等人提出了一种基于多尺度卷积神经网络的时间序列分类模型(称为 MCNN 模型)。MCNN 模型包含 3 个阶段:变换阶段、局部卷积阶段、全卷积阶段。变换阶段:首先对输入数据分别采用不同的变换(包含时域中的恒等映射、下采样变换以及频域中的光谱变换),假设原始输入数据分别经过上述 3 种变换,则得到 3 种变换数据。局部卷积阶段:将 3 种变换数据作为 3 个并联卷积层的输入(一种变换数据输入到一个卷积层中,这与文献[48]的 LCNN 模型类似),每个卷积层后紧随着一个取样层。全卷积阶段:局部卷积阶段的 3 个取样层连接到同一个卷积层中进行信息汇总,在该阶段中可以采用多个卷积层和多个取样层进行交替设置,最后跟随着一个全连接层及 softmax 层。与文献[48]相比:MCNN 在卷积层中将多通道的数据进行整合,文献[48]则在全连接层中进行信息汇总,MCNN 对卷积核大小及取样核大小的设置也不一样。MCNN 可以处理多元时间序列,它通过将原始数据下采样到不同的时间尺度使其不仅能够提取不同尺度的低级特征还能够提取更高级的特征。CNN 除了用于时间序列分类外,还可以用于时间序列度量学习[98]。

Different time series may require different feature representations at different time scales, but many existing algorithms do not take these factors into account; moreover, affected by high-frequency interference and random noise, discriminative patterns in real-world time series data are often deformed. To overcome these problems, Cui et al. [97] proposed a time series classification model based on multi-scale convolutional neural networks (the MCNN model). The MCNN model consists of three stages: a transformation stage, a local convolution stage and a full convolution stage. Transformation stage: different transformations (identity mapping and down-sampling in the time domain, and a spectral transformation in the frequency domain) are first applied to the input data; after the original input passes through these three transformations, three kinds of transformed data are obtained. Local convolution stage: the three kinds of transformed data are fed into three parallel convolutional layers (one kind of transformed data per convolutional layer, similar to the LCNN model of [48]), each followed by a pooling layer. Full convolution stage: the three pooling layers of the local convolution stage are connected to the same convolutional layer for information aggregation; in this stage several alternating convolutional and pooling layers can be used, finally followed by a fully connected layer and a softmax layer. Compared with [48]: MCNN integrates the multi-channel data in a convolutional layer, whereas [48] aggregates information in the fully connected layer, and the settings of the convolution kernel size and pooling kernel size in MCNN are also different. MCNN can handle multivariate time series; by down-sampling the original data to different time scales it can extract not only low-level features at different scales but also higher-level features. Besides time series classification, CNN can also be used for time series metric learning [98].
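A small NumPy sketch of the MCNN transformation stage described above: an identity mapping, a down-sampled view in the time domain, and a spectral view in the frequency domain; the down-sampling factor and the use of an FFT magnitude are assumptions of this sketch.

```python
import numpy as np

def mcnn_transforms(x, factor=2):
    """Produce the three branch inputs of the transformation stage."""
    identity = x
    downsampled = x[::factor]                 # multi-scale view of the series
    spectral = np.abs(np.fft.rfft(x))         # frequency-domain representation
    return identity, downsampled, spectral    # each branch then feeds its own conv layer

x = np.sin(np.linspace(0, 20, 256)) + 0.1 * np.random.randn(256)
for branch in mcnn_transforms(x):
    print(branch.shape)
```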

5.5 其它应用

5.5 Other Applications

Redmon 等人[99]将目标检测看成是一个回归问题,采用一个具有 24 个卷积层和 2 个全连接层的 CNN 结构进行目标检测(也称为 YOLO)。在 YOLO 中,输入整幅图像,并将图像划分为 7×7 个网格,通过 CNN 预测每个网格的多个包围盒(bounding boxes,用来包裹场景中目标的几何体)及这些包围盒的类别概率。YOLO 将整幅图像作为上下文信息,使得背景误差比较小。YOLO 的检测速度也非常快,在 Titan X 的 GPU 上每秒钟可以处理 45 张图像。然而 YOLO 也存在一些不足:1)因为每个网格只预测两个包围盒且只有一个类别,因此它具有很强的空间约束性,这种约束限制了模型对邻近目标的预测,同时如果小目标数量过多也会影响模型的检测能力;2)对于不包含在训练集中的目标或者有异常比例的目标,它的泛化能力不是很好;3)模型主要的误差仍然是不能精准定位引起的误差。由于 YOLO 不能精准定位,这也使得它的检测精度小于 Faster R-CNN[100]的,但是 YOLO 的速度更快。Faster R-CNN 是候选框网络(Region proposal network,RPN)[100]与 Fast R-CNN[101]结合并共享卷积层特征的网络,它也是基于分类器的方法[79]。由于 YOLO 检测精度不是很高,因此 Liu 等人[102]基于 YOLO 提出了 SSD 模型。SSD 利用了 YOLO 的回归思想同时还借鉴了 Faster R-CNN 的锚点机制(anchor 机制)。它与 YOLO 一样通过回归获取目标位置和类别,不同的是:SSD 预测某个位置采用的是该位置周围的特征。最终,SSD 获得的检测精度与 Faster R-CNN 的差不多,但是 SSD 保持了 YOLO 快速检测的特性。此外,CNN 还可用于短文本聚类[103]、视觉追踪[104]、图像融合[105]等领域中。

Redmon et al. [99] treat object detection as a regression problem and use a CNN structure with 24 convolutional layers and 2 fully connected layers for detection (known as YOLO). In YOLO, the whole image is taken as input and divided into a 7×7 grid; the CNN predicts, for each grid cell, several bounding boxes (the geometry used to enclose objects in the scene) and the class probabilities of these boxes. YOLO uses the whole image as context, so background errors are relatively small. YOLO is also very fast, processing 45 images per second on a Titan X GPU. However, YOLO has some shortcomings: 1) because each grid cell predicts only two bounding boxes and only one class, it imposes a strong spatial constraint, which limits the prediction of nearby objects; the detection ability also suffers when there are many small objects; 2) its generalization ability is not very good for objects that are not contained in the training set or that have unusual aspect ratios; 3) the main source of error is still imprecise localization. Because YOLO cannot localize precisely, its detection accuracy is lower than that of Faster R-CNN [100], but YOLO is faster. Faster R-CNN combines a region proposal network (RPN) [100] with Fast R-CNN [101] and shares the convolutional-layer features; it is also a classifier-based method [79]. Since the detection accuracy of YOLO is not very high, Liu et al. [102] proposed the SSD model based on YOLO. SSD uses the regression idea of YOLO and also borrows the anchor mechanism of Faster R-CNN. Like YOLO, it obtains object locations and classes by regression; the difference is that SSD predicts a position using the features around that position. In the end, the detection accuracy of SSD is similar to that of Faster R-CNN, while keeping YOLO's fast detection. In addition, CNN can also be used in fields such as short text clustering [103], visual tracking [104] and image fusion [105].
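A tiny sketch of YOLO's grid idea described above: the image is split into an S×S grid and each object is assigned to the cell containing its center; S = 7 as in the text, while the box coordinates below are made-up example values.

```python
S = 7   # grid size used by YOLO

def responsible_cell(cx, cy, img_w, img_h, s=S):
    """Return (row, col) of the grid cell responsible for a box centered at (cx, cy)."""
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col

print(responsible_cell(cx=320, cy=240, img_w=448, img_h=448))   # -> (3, 5)
```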

5.6 CNN 的优势

5.6 CNN Advantages

CNN 具有 4 个特点:局部连接、权值共享、池化操作及多层[11]。CNN 能够通过多层非线性变换,从大数据中自动学习特征,从而代替手工设计的特征,且深层的结构使它具有很强的表达能力和学习能力[70]。许多研究实验已经表明了 CNN 结构中深度的重要性,例如从结构来看,AlexNet、VGG、GoogLeNet 及 ResNet 的一个典型的发展趋势是它们的深度越来越深[37]。在 CNN 中,通过增加深度从而增加网络的非线性,使它能够更好地拟合目标函数,获得更好的分布式特征[11]。

CNN has four characteristics: local connections, weight sharing, pooling and multiple layers [11]. Through multiple layers of nonlinear transformation, CNN can automatically learn features from big data, replacing hand-designed features, and its deep structure gives it strong representational and learning ability [70]. Many experiments have shown the importance of depth in CNN structures; for example, a typical development trend of AlexNet, VGG, GoogLeNet and ResNet is that their depth becomes greater and greater [37]. In CNN, increasing the depth increases the nonlinearity of the network, enabling it to fit the objective function better and to obtain better distributed features [11].

6 关于 CNN 参数设置的一些探讨

6 Discussion on CNN parameter setting

6.1 ECG 实验分析

6.1 ECG Experimental Analysis

CNN 在计算机辅助 ECG 分析领域中的研究已初见端倪。本文就 CNN 在计算机辅助 ECG 分析应用中,设计了不同参数及不同深度的 CNN 网络结构,并将不同网络结构的 CNN 模型应用于 MIT-BIH 数据库中的室性早搏心拍分类中。根据各个实验结果,分析了 CNN 各参数间的相互关系及不同参数设置对分类结果的影响。将 MIT-BIH 数据库中 48 条记录的 110109 个心拍划分为 CNN 模型的训练集和测试集,其中随机选取 24100 个心拍作为训练集,其余心拍为测试集,同时采用 BP 算法进行有监督训练(用开源工具 Theano 实现)。每个 CNN 结构的训练集和测试集均分别一样。心拍截取方式与文献[94]一致。本文采用 AUC[48]即 ROC 曲线下的面积来衡量每个 CNN 结构的室性早搏分类性能。一般来说,AUC 值越大,算法分类性能越好。

Research on CNN in the field of computer-aided ECG analysis has begun to emerge. For the application of CNN to computer-aided ECG analysis, this paper designs CNN structures with different parameters and different depths and applies these different CNN models to premature ventricular contraction (PVC) heartbeat classification on the MIT-BIH database. Based on the experimental results, the relationships among the CNN parameters and the influence of different parameter settings on the classification results are analyzed. The 110109 heartbeats of the 48 records in the MIT-BIH database are divided into a training set and a test set for the CNN models: 24100 heartbeats are randomly selected as the training set and the remaining heartbeats form the test set, and supervised training is performed with the BP algorithm (implemented with the open source tool Theano). The training set and test set are the same for every CNN structure, and heartbeats are intercepted in the same way as in [94]. This paper uses the AUC [48], i.e. the area under the ROC curve, to measure the PVC classification performance of each CNN structure. Generally speaking, the larger the AUC, the better the classification performance of the algorithm.
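For reference, the AUC used as the evaluation metric above can be computed from classifier scores with the rank-statistic (Mann-Whitney) formulation sketched below; the toy scores and labels are made-up illustrative values and ties are not handled.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive scores higher than a random negative."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)        # 1-based ranks of the scores
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_score = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # toy classifier outputs
y_true  = [1,   1,   0,    1,   0,   0  ]
print(auc_from_scores(y_score, y_true))     # 1.0 for this perfectly ranked toy case
```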

本文所采用的网络结构深度共有 4 种:深度为 5(含输入层、输出层、全连接层、1 个卷积层及 1 个取样层)、7(含输入层、输出层、全连接层、2个卷积层及 2 个取样层)、9(含输入层、输出层、全连接层、3 个卷积层及 3 个取样层)及 11(含输入层、输出层、全连接层、4 个卷积层及 4 个取样层)。首先讨论卷积核大小对分类性能的影响。实验过程:分别对每一种深度设置 5 个不同的 CNN模型,这 5 个不同的 CNN 模型除卷积核大小外,其他参数如特征面数目、取样核大小、全连接层神经元个数均相同。如表 2 所示:

Four network depths are used in this paper: depth 5 (input layer, output layer, fully connected layer, 1 convolutional layer and 1 pooling layer), depth 7 (input layer, output layer, fully connected layer, 2 convolutional layers and 2 pooling layers), depth 9 (input layer, output layer, fully connected layer, 3 convolutional layers and 3 pooling layers) and depth 11 (input layer, output layer, fully connected layer, 4 convolutional layers and 4 pooling layers). First, the effect of the convolution kernel size on classification performance is discussed. Experimental procedure: five different CNN models are set up for each depth; apart from the convolution kernel size, the other parameters of these five models, such as the number of feature maps, the pooling kernel size and the number of neurons in the fully connected layer, are the same, as shown in Table 2.

表 2 列出了每个卷积层和取样层对应卷积核的大小及取样核的大小。每一行参数构成一个 CNN 模型,表中特征面数目为每个卷积层所采用的特征面个数,由于卷积层与取样层特征面唯一对应,所以卷积层特征面个数确定后,紧跟其后的取样层特征面个数也唯一确定。表 2 的这 5 个 CNN 模型只有卷积核大小不同。从表 2 的分类结果可看出,对于网络深度为 11,随着卷积核变大,AUC 先增加后减小。对于另外 3 组实验:在深度为 5 或者 7 的模型中,随着卷积核的增加,AUC 先减小后增加再减小;深度为 9 的模型,随着卷积核增加,AUC 先减小后趋于平稳再减小。图 9 所示为深度是 5 的 CNN 结构随着卷积核的改变,其分类性能的变化曲线图。通过实验发现,在某一个范围内我们能够找到一个比较合适的卷积核的大小,卷积核过大或者过小均不利于模型的学习。在本实验中,卷积核的大小取值范围在[10,16]时,其模型能够获得一个更好的分类结果。从这 4 组实验的分类结果也可看出:对于卷积核较小的 CNN 结构,增加网络的深度也能够提升模型的分类性能。

Table 2 lists the convolution kernel size and pooling kernel size of each convolutional layer and pooling layer. Each row of parameters constitutes one CNN model. The number of feature maps in the table is the number of feature maps used in each convolutional layer; since each pooling-layer feature map corresponds uniquely to a convolutional-layer feature map, once the number of feature maps of a convolutional layer is determined, the number of feature maps of the following pooling layer is also uniquely determined. The five CNN models in Table 2 differ only in their convolution kernel sizes. The classification results in Table 2 show that, for a network depth of 11, the AUC first increases and then decreases as the kernel becomes larger. For the other three groups of experiments: in the models of depth 5 or 7, as the kernel increases, the AUC first decreases, then increases, then decreases again; for the depth-9 models, as the kernel increases, the AUC first decreases, then levels off, then decreases. Figure 9 shows how the classification performance of the depth-5 CNN structure changes with the convolution kernel. The experiments show that within a certain range a fairly suitable kernel size can be found; a kernel that is too large or too small is not conducive to model learning. In this experiment, the models obtain better classification results when the kernel size is in the range [10,16]. The results of these four groups of experiments also show that, for CNN structures with small convolution kernels, increasing the depth of the network can also improve the classification performance of the model.

为了讨论取样核大小对分类性能的影响,我们同样对每一种深度分别设置 3 个不同的 CNN 模型。类似地,这 3 个 CNN 模型,除了取样核大小外,其他参数设置均相同。由于取样核大小要使公式(11)能够整除,因此对于某一深度的网络,取样核大小不能够随意取值。从几组实验的结果来看,一般来说随着取样核大小的增加,AUC 先增加后减小。从总体来看,随着模型深度的增加,其分类结果也越好。在本实验中,模型通常在取样核大小为2 或者 3 时取得相对较好的分类结果。表 3 列出了深度为 9 的 3 个不同网络结构的 CNN 室性早搏分类结果。
To discuss the effect of the pooling kernel size on classification performance, three different CNN models are likewise set up for each depth. Similarly, these three CNN models have the same parameter settings except for the pooling kernel size. Since the pooling kernel size must make formula (11) give an integer, for a network of a certain depth the pooling kernel size cannot be chosen arbitrarily. The results of these groups of experiments show that, generally speaking, as the pooling kernel size increases, the AUC first increases and then decreases. On the whole, the deeper the model, the better the classification results. In this experiment, the models usually obtain relatively good classification results when the pooling kernel size is 2 or 3. Table 3 lists the PVC classification results of three different CNN structures of depth 9.
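To illustrate the divisibility constraint mentioned above, the sketch below assumes that formula (11) is the usual non-overlapping pooling relation, output length = feature-map length / pooling size, so the pooling size must divide the feature-map length exactly; the 1×1700 input and 1×18 kernel are example values taken from the earlier LCNN description.

```python
# Assuming formula (11) is out_len = in_len / pool_k (non-overlapping pooling),
# only pool sizes that divide the convolutional feature-map length are admissible.
def valid_pool_sizes(feature_len, max_k=6):
    return [k for k in range(2, max_k + 1) if feature_len % k == 0]

conv_out = 1700 - 18 + 1       # e.g. a 1x1700 input convolved with a 1x18 kernel
print(conv_out, valid_pool_sizes(conv_out))    # 1683 -> [3]
```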

为了探讨特征面数目对分类性能的影响,这里我们也对每一种深度分别设置 6 个不同的 CNN 模型。其中,这 6 个CNN 模型除特征面数目外,其他参数设置一样。通过实验发现,如果特征面数目过小,其分类性能较差。这是由于特征面数目过少,使得一些有利于网络学习的特征被忽略掉,因而不利于模型的学习。然而,当特征面数目大于 40时,模型的训练时间大大增加,这同样不利于模型的学习。通过实验可知,本实验中,比较好的特征 面数目选取范围可为[10,35]。表 4 列出了深度为11 的 6 个不同网络结构的 CNN 分类结果,在这 6 个CNN 结构中,只有特征面数目不同,且随着特征面数目的增加,AUC 先增加再减小后增加。

To explore the effect of the number of feature maps on classification performance, six different CNN models are set up for each depth; these six models have the same parameter settings except for the number of feature maps. The experiments show that if the number of feature maps is too small, the classification performance is poor, because too few feature maps cause some features that are useful for network learning to be ignored, which is not conducive to model learning. However, when the number of feature maps is greater than 40, the training time of the model increases greatly, which is also unfavorable for learning. According to the experiments, a good range for the number of feature maps in this experiment is [10,35]. Table 4 lists the classification results of six CNN structures of depth 11; these six structures differ only in the number of feature maps, and as the number of feature maps increases, the AUC first increases, then decreases, then increases again.
表 5 所示为 4 个不同深度的 CNN 模型及其室性早搏分类结果。在每一个 Stage:(1×15)+(1×2)中,1×15 表示卷积层中卷积核大小,1×2 表示紧跟其后的取样层的取样核大小。实验结果表明,随着深度的加深,网络性能也越好。

Table 5 shows four CNN models of different depths and their PVC classification results. In each Stage: (1×15)+(1×2), 1×15 denotes the convolution kernel size of the convolutional layer and 1×2 denotes the pooling kernel size of the pooling layer that follows it. The experimental results show that the deeper the network, the better its performance.

为了探讨 CNN 的深度、卷积核大小、取样核大小及特征面数目之间的关系,我们采用不同的深度、卷积核大小、取样核大小及特征面数目设计了 350 多个不同的 CNN 模型。这些不同的 CNN 模型均利用与上述相同的训练集和测试集进行实验。通过实验发现:1)对于同一深度,特征面数目比卷积核大小更重要,具有更小卷积核及更大特征面数目的 CNN 模型比具有更大卷积核且更小特征面数目的 CNN 模型获得更好的分类结果,这与文献[36]中特征面数目与卷积核大小所发挥的作用相当不太一样,同时也说明了对于不同的数据库,CNN 的分类性能会有些不一样的表现,本小节的实验分析是基于 MIT-BIH 数据库进行的;2)深度比卷积核大小及取样核大小重要;3)随着网络深度的加深,模型分类性能越好;4)对于同一个深度的模型,特征面数目越大,分类性能越好。

To explore the relationships among the depth of the CNN, the convolution kernel size, the pooling kernel size and the number of feature maps, we designed more than 350 different CNN models with different depths, convolution kernel sizes, pooling kernel sizes and numbers of feature maps. All of these models were trained and tested on the same training and test sets as above. The experiments show that: 1) for the same depth, the number of feature maps is more important than the convolution kernel size, and CNN models with smaller kernels and more feature maps obtain better classification results than models with larger kernels and fewer feature maps; this differs considerably from the roles played by the number of feature maps and the kernel size in [36], and also shows that the classification behavior of CNN can differ somewhat across databases (the analysis in this subsection is based on the MIT-BIH database); 2) depth is more important than the convolution kernel size and the pooling kernel size; 3) the deeper the network, the better the classification performance of the model; 4) for models of the same depth, the more feature maps, the better the classification performance.

6.2 脉搏波实验分析

6.2 Experimental Analysis Of Pulse Wave

文献[106]采用两种不同深度的 CNN 结构分别在健康/亚健康数据集及动脉硬化/肺动脉硬化数据集进行分类实验。表 6 为不同 CNN 模型分别在两个数据集上的测试结果。
In [106], two CNN structures of different depths were used to carry out classification experiments on a healthy/sub-healthy dataset and an arteriosclerosis/pulmonary arteriosclerosis dataset, respectively. Table 6 shows the test results of the different CNN models on the two datasets.

表6 中 CNN(7L)表示该 CNN 的深度为 7 层,而 CNN(9L)模型的深度为 9 层。从上述结果也可看出,在两个数据集上 CNN(9L)模型所得各指标均高于 CNN(7L)模型,同时也说明了增加网络的层数可以挖掘脉搏波更深层的特征,深度越深,模型的性能越好。

CNN(7L) in Table 6 indicates that the depth of the CNN is 7 layers, while the depth of the CNN(9L) model is 9 layers. The above results show that on both datasets all the indicators obtained by the CNN(9L) model are higher than those of the CNN(7L) model, which also shows that increasing the number of layers of the network can mine deeper features of the pulse wave: the deeper the depth, the better the performance of the model.

7 总结

7 Conclusion

近年来,CNN 的权值共享、可训练参数少、鲁棒性强等优良特性使其受到了许多研究者的关注。CNN 通过权值共享减少了需要训练的权值个数、降低了网络的计算复杂度,同时通过池化操作使得网络对输入的局部变换具有一定的不变性如平移不变性、缩放不变性等,提升了网络的泛化能力。CNN 将原始数据直接输入到网络中,然后隐性地从训练数据中进行网络学习,避免了手工提取特征、从而导致误差累积,其整个分类过程是自动的。虽然CNN 所具有的这些特点使其已被广泛应用于各种领域中特别是模式识别与人工智能领域,但是 CNN仍有许多工作需要进一步研究:

In recent years, the excellent characteristics of CNN, such as weight sharing, few trainable parameters and strong robustness, have attracted the attention of many researchers. Through weight sharing, CNN reduces the number of weights that need to be trained and lowers the computational complexity of the network; at the same time, the pooling operation gives the network a certain invariance to local transformations of the input, such as translation and scaling, which improves its generalization ability. CNN feeds the raw data directly into the network and then learns implicitly from the training data, which avoids hand-crafted feature extraction and the error accumulation it causes; the whole classification process is automatic. Although these characteristics have led to CNN being widely used in various fields, especially pattern recognition and artificial intelligence, much work on CNN still requires further research:

1)目前所使用的 CNN 模型是 Hubel-Wiesel 模型[28]简化的版本,需进一步挖掘Hubel-Wiesel模型,对它进行深入研究并发现结构特点及一些规律,同时还需引入其它理论使 CNN 能够充分发挥其潜在的优势。

The CNN model used at present is a simplified version of the Hubel-Wiesel model [28]. It is necessary to explore the Hubel-Wiesel model further, study it in depth to discover its structural characteristics and underlying regularities, and at the same time introduce other theories so that CNN can give full play to its potential advantages.

2)尽管 CNN 在许多领域如计算机视觉上已经取得了令人满意的成果,但是仍然不能够很好地理解其基本理论[107]。对于一个具体的任务,仍很难确定哪种网络结构,使用多少层,每一层使用多少个神经元等才是合适的。仍然需要详细的知识来选择合理的值如学习率、正则化的强度等[107]。

Although CNN has achieved satisfactory results in many fields such as computer vision, its basic theory is still not well understood [107]. For a specific task, it is still difficult to determine which network structure is appropriate, how many layers should be used, and how many neurons each layer should contain. Detailed knowledge is still needed to choose reasonable values such as the learning rate and the strength of regularization [107].

3)如果训练数据集与测试数据集的分布不一样,则 CNN 也很难获得一个好的识别结果,特别是对于复杂的数据例如形态复杂多变的临床 ECG数据。因此,需要引入 CNN 模型的自适应技术,可考虑将自适应抽样等应用于 CNN 模型中[16]。

If the distribution of the training dataset differs from that of the test dataset, it is also difficult for CNN to obtain a good recognition result, especially for complex data such as clinical ECG data with complex and changeable morphology. Therefore, adaptive techniques need to be introduced into the CNN model; for example, adaptive sampling could be applied to the CNN model [16].

4)尽管依赖于计算机制的 CNN 模型是否与灵长类视觉系统相似仍待确定,但是通过模仿和纳入灵长类视觉系统也能使 CNN 模型具有进一步提高性能的潜力[107]。

Although it remains to be determined whether CNN models that rely on computational mechanisms are similar to the primate visual system, imitating and incorporating the primate visual system also gives CNN models the potential to further improve their performance [107].

5)目前,CNN 在计算机辅助 ECG 分析领域中,其输入维数需保持一致。为了使输入维数保持一致,需要将原始的数据截取到固定长度,如何截取数据从而使 CNN 发挥其优势是一个值得深入研究的问题。由于 RNN 可以处理长度不等的数据,因此如何将 RNN 与 CNN 相结合,并应用于 ECG 记录分类也是一个值得深入研究的课题。

At present, in the field of computer-aided ECG analysis, the input dimension of a CNN needs to be kept consistent. To keep the input dimension consistent, the original data have to be truncated to a fixed length, and how to truncate the data so that CNN can give full play to its advantages is a problem worthy of further study. Since RNN can process data of unequal lengths, how to combine RNN with CNN and apply the combination to the classification of ECG records is also a topic worthy of further study.
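
One possible form of such a combination, given here only as a minimal sketch in PyTorch with hypothetical layer sizes (it is not a model from the cited literature), is to let convolutional layers extract local morphological features with shared weights and let a recurrent layer such as an LSTM aggregate them over a record of arbitrary length.

```python
import torch
import torch.nn as nn

class CNNRNNClassifier(nn.Module):
    """Hypothetical CNN + RNN hybrid for variable-length 1-D ECG records."""
    def __init__(self, n_classes=2):
        super().__init__()
        # Convolutional front end: local feature extraction with shared weights.
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=15), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(8, 16, kernel_size=15), nn.ReLU(), nn.MaxPool1d(4),
        )
        # Recurrent back end: aggregates a feature sequence of any length.
        self.rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):              # x: (batch, 1, length), length may vary
        f = self.features(x)           # (batch, 16, reduced_length)
        f = f.transpose(1, 2)          # (batch, reduced_length, 16)
        _, (h, _) = self.rnn(f)        # h: (1, batch, 32), last hidden state
        return self.classifier(h[-1])  # class scores

# Records of different lengths can be processed without truncation (batch size 1).
model = CNNRNNClassifier()
for length in (1000, 2500, 4000):      # hypothetical record lengths
    record = torch.randn(1, 1, length)
    print(model(record).shape)         # torch.Size([1, 2])
```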

6)在隐性训练中,如何将整个训练过程中的最佳分类模型保存下来也是一个值得探讨的问题。在文献[48]的隐性训练中,当所有的训练样本在一个训练周期内都参与 BP 反向传播过程后,才输出整个训练中的测试结果,如果此时其准确率是目前为止最高的,则保存当前分类模型。事实上,我们还可以对它做进一步的改进,例如当部分样本进行 BP训练后,就可采用校验样本测试当前的模型,然后判断该模型是否为迄今为止性能最佳的分类模型。

In implicit training, how to save the best classification model obtained during the whole training process is also a problem worth discussing. In the implicit training of reference [48], the test result is output only after all the training samples have gone through the BP back-propagation process within one training epoch; if the accuracy at that point is the highest obtained so far, the current classification model is saved. In fact, this can be improved further: for example, after only part of the samples have been trained with BP, the validation samples can be used to test the current model and to judge whether it is the best-performing classification model obtained so far.
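
The improvement suggested above amounts to checkpointing against validation samples at intervals within each training epoch rather than only at its end. The following is a minimal, self-contained sketch of this idea in PyTorch with synthetic data and a toy model; it is not the procedure of reference [48], and all names and sizes in it are hypothetical.

```python
import copy
import torch
import torch.nn as nn

# Synthetic stand-ins for the data and the model (for illustration only).
torch.manual_seed(0)
x_train, y_train = torch.randn(256, 10), torch.randint(0, 2, (256,))
x_val, y_val = torch.randn(64, 10), torch.randint(0, 2, (64,))
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def validation_accuracy(m):
    with torch.no_grad():
        return (m(x_val).argmax(dim=1) == y_val).float().mean().item()

best_acc, best_state = 0.0, None
batch_size, check_every = 32, 4          # validate every 4 mini-batches
for epoch in range(3):
    for step, start in enumerate(range(0, len(x_train), batch_size)):
        xb = x_train[start:start + batch_size]
        yb = y_train[start:start + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                  # BP on part of the samples only
        optimizer.step()
        # Check the model before the epoch ends, not only after it.
        if (step + 1) % check_every == 0:
            acc = validation_accuracy(model)
            if acc > best_acc:           # keep the best model seen so far
                best_acc = acc
                best_state = copy.deepcopy(model.state_dict())

print(best_acc)  # best_state can later be restored with model.load_state_dict(best_state)
```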

总的来说,CNN 虽然还有许多有待解决的问题,但是这不影响今后它在模式识别与人工智能等领域中进一步的发展与应用,它在未来很长的一段时间内仍然会是人们研究的一个热点。新的理论和技术的纳入以及新成果的不断出现也会使它能够应用于更多新的领域中.

Generally speaking, although CNN still has many problems to be solved, this does not affect its further development and application in fields such as pattern recognition and artificial intelligence, and it will remain a research hot spot for a long time to come. The incorporation of new theories and techniques and the continuous emergence of new results will also enable it to be applied in more new fields.
