  • cudatoolkit = 10.1.243
  • cudnn = 7.6.5
  • tensorflow-gpu = 2.1.0
  • keras-gpu = 2.3.1


4. How to Develop LSTMs in Keras

4.1 Define the Model

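The first step is to define the network. In Keras, a neural network is defined as a sequence of layers, and the container for these layers is the Sequential class. Start by creating an instance of Sequential, then add the layers you need in the order they should be connected. The LSTM recurrent layer is called LSTM(). The fully connected layer that usually follows the LSTM layers and outputs the prediction is called Dense(). For example: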


model = Sequential()
model.add(LSTM(2))
model.add(Dense(1))

The same network can be defined in one step by creating an array of layers and passing it to the Sequential constructor:

layers = [LSTM(2), Dense(1)]
model = Sequential(layers)
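
The first hidden layer in the network must define the number of inputs to expect, that is, the shape of the input layer. Input must be three-dimensional, with shape [samples, time steps, features]: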

  1. Samples. These are the rows in your data. One sample may be one sequence.
  2. Time steps. These are the past observations for a feature, such as lag variables.
  3. Features. These are columns in your data.

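Assuming the data is loaded as a NumPy array, you can convert a 1D or 2D dataset to a 3D dataset with NumPy's reshape function. Suppose a NumPy array holds two columns of input data (X). We can treat the two columns as two time steps of one feature and reshape the array as follows: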

data = data.reshape((data.shape[0], data.shape[1], 1))
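
Alternatively, to treat the two columns as two features observed at a single time step, reshape the array as follows:
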
data = data.reshape((data.shape[0], 1, data.shape[1]))
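
As a quick check of the two reshapes above, the following sketch applies both to a small made-up array and inspects the resulting shapes. The values and the names as_time_steps and as_features are assumptions chosen for illustration; they do not come from the original text.

```python
import numpy as np

# Hypothetical 2D dataset: 4 rows (samples) and 2 columns.
data = np.array([[0.1, 1.0],
                 [0.2, 0.9],
                 [0.3, 0.8],
                 [0.4, 0.7]])

# Columns become time steps: [samples, 2 time steps, 1 feature].
as_time_steps = data.reshape((data.shape[0], data.shape[1], 1))

# Columns become features: [samples, 1 time step, 2 features].
as_features = data.reshape((data.shape[0], 1, data.shape[1]))

print(as_time_steps.shape)
print(as_features.shape)
```

Both arrays hold the same 8 values; only the interpretation of the second and third axes differs.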
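
The shape of the input is declared with the input_shape argument on the first hidden layer, which takes a tuple of the number of time steps and the number of features. For example, for two time steps and one feature:
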
model = Sequential()
model.add(LSTM(5, input_shape=(2,1)))

The input_shape argument of the LSTM layer does not include the number of samples; that dimension is taken from the batch size when the model runs.

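Think of a Sequential model as a pipeline with raw data fed in at one end and predictions coming out at the other. This is a useful container in Keras because concerns traditionally tied to a layer can be split out and added as separate layers, which makes their role in transforming data from input to prediction explicit. For example, the activation function that transforms the summed signal from each neuron in a layer can be extracted and added to the Sequential as a layer-like object called Activation.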

model = Sequential()
model.add(LSTM(5, input_shape=(2,1)))
model.add(Activation('sigmoid'))

The choice of activation function matters most for the output layer, because it defines the format of the predictions. Common choices by problem type:

  1. Regression: Linear activation function, or linear, and a number of neurons matching the number of outputs. This is the default activation function used for neurons in the Dense layer.
  2. Binary Classification (2 class): Logistic activation function, or sigmoid, and one neuron in the output layer.
  3. Multiclass Classification (> 2 class): Softmax activation function, or softmax, and one output neuron per class value, assuming a one hot encoded output pattern.
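
To make the three output activations listed above concrete, here is a minimal NumPy sketch of what each one computes. It is not how Keras implements them internally; the function names and the sample vector z are assumptions for illustration.

```python
import numpy as np

def linear(z):
    # Regression output: the summed signal is passed through unchanged.
    return z

def sigmoid(z):
    # Binary classification output: squashes a value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multiclass output: turns a vector of scores into probabilities.
    # Subtracting the maximum keeps the exponentials numerically stable.
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)

# Hypothetical summed signals for three output neurons.
z = np.array([2.0, 1.0, 0.1])
probs = softmax(z)

print(linear(z))
print(sigmoid(0.0))
print(probs, probs.sum())
```

The softmax outputs sum to 1, which is why one neuron per class is paired with a one hot encoded target.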

4.2 Compile the Model



Compiling the model takes an optimization algorithm and a loss function, here both given by name:

model.compile(optimizer='sgd', loss='mse')


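Alternatively, the optimizer can be created and configured first and then passed to compile:

from keras.optimizers import SGD
# learning_rate replaces the older lr argument name (Keras 2.3.0 and later)
algorithm = SGD(learning_rate=0.1, momentum=0.3)
model.compile(optimizer=algorithm, loss='mse')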

Common loss functions by problem type:

  1. Regression: Mean Squared Error, or mean_squared_error (mse for short).
  2. Binary Classification (2 class): Logarithmic Loss, also called cross entropy, or binary_crossentropy.
  3. Multiclass Classification (> 2 class): Multiclass Logarithmic Loss, or categorical_crossentropy.
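
The three loss functions above can be written out in a few lines of NumPy. This is a sketch of the underlying formulas only, not Keras's own implementation; the function names, the eps clipping constant, and the sample values are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true is one hot encoded; each row of y_pred is a probability vector.
    p = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

reg_loss = mse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))
bin_loss = binary_crossentropy(np.array([1.0]), np.array([0.5]))
cat_loss = categorical_crossentropy(np.array([[0.0, 1.0]]),
                                    np.array([[0.5, 0.5]]))
print(reg_loss, bin_loss, cat_loss)
```

A prediction of 0.5 for the true class gives a cross entropy of ln 2 in both the binary and the categorical case.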


Commonly used optimization algorithms, with the names Keras accepts, are:

  1. Stochastic Gradient Descent, or sgd.
  2. Adam, or adam.
  3. RMSprop, or rmsprop.


Metrics to report during training, in addition to the loss, can be passed as a list of names:

model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['accuracy'])

4.3 Fit the Model



The network is trained by calling fit on the input patterns (X) and the matching outputs (y). Two settings control training:

  1. Epoch: One pass through all samples in the training dataset, updating the network weights along the way. LSTMs may be trained for tens, hundreds, or thousands of epochs.

  2. Batch: A pass through a subset of samples in the training dataset, after which the network weights are updated. One epoch consists of one or more batches.
    Below are some common configurations for the batch size:

    1. batch size=1: Weights are updated after each sample and the procedure is called stochastic gradient descent.
    2. batch size=32: Weights are updated after a specified number of samples and the procedure is called mini-batch gradient descent. Common values are 32, 64, and 128, tailored to the desired efficiency and rate of model updates. If the batch size is not a factor of the number of samples in one epoch, then an additional, smaller batch of the leftover samples is run at the end of the epoch.
    3. batch size=n: Where n is the number of samples in the training dataset. Weights are updated at the end of each epoch and the procedure is called batch gradient descent.
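
The relationship between batch size and the number of weight updates per epoch can be worked out with a little arithmetic. The helper names below and the dataset size of 1000 samples are assumptions for illustration.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # One weight update per batch, including a final partial batch.
    return math.ceil(n_samples / batch_size)

def last_batch_size(n_samples, batch_size):
    # Size of the final batch; smaller when batch_size is not a factor.
    remainder = n_samples % batch_size
    return remainder if remainder else batch_size

n_samples = 1000  # hypothetical training set size
for batch_size in (1, 32, n_samples):
    print(batch_size,
          updates_per_epoch(n_samples, batch_size),
          last_batch_size(n_samples, batch_size))
```

With 1000 samples and a batch size of 32, an epoch runs 31 full batches plus one leftover batch of 8 samples.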

Mini-batch gradient descent with a batch size of 32 is a common configuration for LSTMs. An example of fitting a network is as follows:

model.fit(X, y, batch_size=32, epochs=100)
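
The fit function returns a history object that summarizes the training run. Setting verbose=0 turns off the progress output:
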
history = model.fit(X, y, batch_size=10, epochs=100, verbose=0)

4.4 Evaluate the Model



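Once trained, the network can be evaluated with the evaluate function, which returns the loss followed by any metrics specified when the model was compiled:

loss, accuracy = model.evaluate(X, y)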


As with fit, setting verbose=0 suppresses the progress output:

loss, accuracy = model.evaluate(X, y, verbose=0)

4.5 Make Predictions on the Model

To make predictions on new data with the model, call the predict function:

predictions = model.predict(X)

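Predictions are returned in the format provided by the network's output layer. For a regression problem, a linear activation function yields predictions directly in the format of the problem. For a binary classification problem, the predictions may be an array of probabilities for the first class, which can be converted to 1 or 0 with a decision rule such as rounding. For a multiclass classification problem, the results may be an array of probabilities (assuming a one hot encoded output variable) that needs to be converted to a single class prediction with the argmax function. Alternatively, for classification problems, the predict_classes function converts the predictions to crisp integer class values automatically: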

predictions = model.predict_classes(X)
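
The manual conversions described above can be sketched in NumPy, which is also useful in Keras versions where predict_classes is not available (it was removed from newer TensorFlow releases). The probability arrays below are made up and stand in for the output of model.predict; the 0.5 threshold is an assumption.

```python
import numpy as np

# Hypothetical output of a sigmoid output layer: one probability per sample.
binary_probs = np.array([[0.2], [0.7], [0.5]])
# Threshold at 0.5 to get crisp 0/1 class values.
binary_classes = (binary_probs > 0.5).astype(int)

# Hypothetical output of a softmax output layer: one row per sample,
# one column per class.
multi_probs = np.array([[0.1, 0.7, 0.2],
                        [0.8, 0.1, 0.1]])
# Index of the largest probability in each row is the predicted class.
multi_classes = np.argmax(multi_probs, axis=1)

print(binary_classes.ravel())
print(multi_classes)
```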

As with fit and evaluate, verbose=0 can be passed to predict to suppress its output:

predictions = model.predict(X, verbose=0)

4.6 LSTM State Management


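By default, Keras resets the internal state of the LSTM units after every batch, which is also when the network weights are updated. The choice of batch size therefore trades off three things:

  1. The efficiency of learning, or how many samples are processed before an update.
  2. The speed of learning, or how often weights are updated.
  3. The influence of internal state, or how often internal state is reset.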

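Keras provides the flexibility to decouple the resetting of internal state from updates to the network weights by defining an LSTM layer as stateful. This is done by setting the stateful argument on the LSTM layer to True (if True, the last state for each sample at index i in a batch is used as the initial state for the sample at index i in the following batch). When using stateful LSTM layers, you must also define the batch size as part of the input shape in the network definition by setting the batch_input_shape argument, and the batch size must be a factor of the number of samples in the training dataset. The batch_input_shape argument takes a tuple of batch size, time steps, and features (see the Keras LSTM layer API docs). For example, a stateful LSTM with a batch size of 10, 5 time steps, and 1 feature: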


model.add(LSTM(2, stateful=True, batch_input_shape=(10, 5, 1)))

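A stateful LSTM does not reset its internal state at the end of each batch. Instead, you have fine-grained control over when the state is reset by calling the reset_states function. For example, to reset the internal state at the end of each epoch: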

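for i in range(1000):
    # fit takes batch_size; batch_input_shape belongs on the layer definition
    model.fit(X, y, epochs=1, batch_size=10)
    model.reset_states()

The same batch size used to define the stateful LSTM must also be used when making predictions:

predictions = model.predict(X, batch_size=10)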


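By default, the samples within an epoch are shuffled. This is good practice when working with Multilayer Perceptron neural networks. If you are trying to preserve state across samples, however, the order of samples in the training dataset may matter and must be preserved. This is done by setting the shuffle argument of the fit function to False: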

for i in range(1000):
    # fit takes batch_size; batch_input_shape belongs on the layer definition
    model.fit(X, y, epochs=1, shuffle=False, batch_size=10)
    model.reset_states()
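
Three common scenarios for managing state:

  1. A prediction is made at the end of each sequence and the sequences are independent. State should be reset after each sequence, which is achieved by setting the batch size to 1.
  2. A long sequence is split into multiple subsequences (many samples, each with many time steps). State should be reset after the network has been exposed to the entire sequence, which is achieved by making the LSTM stateful, turning off shuffling of the subsequences, and resetting the state after each epoch.
  3. A very long sequence is split into multiple subsequences (many samples, each with many time steps). Training efficiency matters more than the influence of long-term internal state, so a batch size of 128 samples is used, after which the network weights are updated and the state is reset.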
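
For the second scenario above, the long sequence first has to be cut into ordered subsequences. The sketch below shows one way to do that with NumPy; the helper name split_sequence, the sequence of 50 values, and the choice to drop leftover values are all assumptions for illustration.

```python
import numpy as np

def split_sequence(sequence, n_steps):
    # Cut a 1D sequence into consecutive, non-overlapping subsequences
    # and return them as [samples, time steps, 1 feature].
    sequence = np.asarray(sequence, dtype=float)
    n_samples = len(sequence) // n_steps
    # Drop any trailing values that do not fill a whole subsequence.
    trimmed = sequence[:n_samples * n_steps]
    return trimmed.reshape((n_samples, n_steps, 1))

# Hypothetical long sequence of 50 observations.
long_sequence = np.arange(50) / 50.0
X = split_sequence(long_sequence, n_steps=5)

# A stateful LSTM needs a batch size that is a factor of the sample count.
batch_size = 10
is_valid_batch_size = X.shape[0] % batch_size == 0

print(X.shape, is_valid_batch_size)
```

The resulting shape of (10, 5, 1) matches a batch_input_shape of (10, 5, 1), and because the rows stay in their original order, training with shuffle=False carries state from one subsequence to the next.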

4.7 Examples of Preparing Data


4.7.1 Example of LSTM With Single Input Sample

Consider the case where you have one sequence of multiple time steps and one feature. For example, this could be a sequence of 10 values:

0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0

We can define this sequence of numbers as a NumPy array.

from numpy import array
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

We can then use the reshape() function on the NumPy array to reshape this one-dimensional array into a three-dimensional array with 1 sample, 10 time steps, and 1 feature at each time step. When called on an array, reshape() takes one argument: a tuple defining the new shape of the array. Not every tuple is valid; the new shape must contain exactly as many elements as the original array.

from numpy import array
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
data = data.reshape((1, 10, 1))

This data is now ready to be used as input (X) to the LSTM with an input shape of (10,1).

model = Sequential()
model.add(LSTM(32, input_shape=(10, 1)))
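
The requirement that a reshape must evenly reorganize the data can be seen by trying a shape that does not fit. This is a small sketch; the variable names ok, also_ok, and error_message are assumptions for illustration.

```python
import numpy as np

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# 1 * 10 * 1 == 10 elements, so this shape is valid.
ok = data.reshape((1, 10, 1))

# 2 * 5 * 1 == 10 elements: also valid, read as 2 samples of 5 time steps.
also_ok = data.reshape((2, 5, 1))

# 1 * 3 * 1 == 3 elements does not match the 10 values in the array.
try:
    data.reshape((1, 3, 1))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok.shape, also_ok.shape)
print(error_message)
```

NumPy raises a ValueError whenever the product of the requested dimensions differs from the number of elements in the array.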

4.7.2 Example of LSTM With Multiple Input Features

Consider the case where you have multiple parallel series as input for your model. For example, this could be two parallel series of 10 values:

series 1: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
series 2: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1

We can define these data as a matrix of 2 columns with 10 rows and reshape it into 1 sample, 10 time steps, and 2 features:

from numpy import array
data = array([
[0.1, 1.0],
[0.2, 0.9],
[0.3, 0.8],
[0.4, 0.7],
[0.5, 0.6],
[0.6, 0.5],
[0.7, 0.4],
[0.8, 0.3],
[0.9, 0.2],
[1.0, 0.1]])
data = data.reshape(1, 10, 2)
print(data.shape)

Running the example prints the new 3D shape of the single sample.

(1, 10, 2)

This data is now ready to be used as input (X) to the LSTM with an input shape of (10,2).

model = Sequential()
model.add(LSTM(32, input_shape=(10, 2)))
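
If the two parallel series start out as separate 1D arrays rather than a ready-made matrix, they can be combined before reshaping. The sketch below uses NumPy's column_stack for this; the names series_1, series_2, and X are assumptions for illustration.

```python
import numpy as np

series_1 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
series_2 = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

# Place the series side by side: 10 rows (time steps), 2 columns (features).
data = np.column_stack((series_1, series_2))

# Add the leading samples dimension: [1 sample, 10 time steps, 2 features].
X = data.reshape((1, data.shape[0], data.shape[1]))

print(data.shape)
print(X.shape)
```

Each time step of X then holds one value from each series, which is the layout the LSTM expects for an input shape of (10, 2).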

4.7.3 Tips for LSTM Input

This section lists some final tips to help you when preparing your input data for LSTMs.

  1. The LSTM input layer must be 3D.
  2. The meaning of the 3 input dimensions is: samples, time steps, and features.
  3. The LSTM input layer is defined by the input_shape argument on the first hidden layer.
  4. The input_shape argument takes a tuple of two values that define the number of time steps and features.
  5. The number of samples is assumed to be 1 or more.
  6. The reshape() function on NumPy arrays can be used to reshape your 1D or 2D data to be 3D.
  7. The reshape() function takes a tuple as an argument that defines the new shape.
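
The tips above can be folded into a small check that runs before a model is defined. The helper lstm_input_shape below is hypothetical, not part of Keras; it only verifies that the data is 3D and returns the (time steps, features) tuple to pass as input_shape.

```python
import numpy as np

def lstm_input_shape(X):
    # Hypothetical helper: validate [samples, time steps, features] data
    # and return the tuple expected by the input_shape argument.
    X = np.asarray(X)
    if X.ndim != 3:
        raise ValueError(
            "LSTM input must be 3D [samples, time steps, features], "
            "got %d dimension(s)" % X.ndim)
    return X.shape[1], X.shape[2]

# 1 sample, 10 time steps, 2 features, filled with placeholder zeros.
X = np.zeros((1, 10, 2))
print(lstm_input_shape(X))
```

Passing a 1D or 2D array raises a ValueError, which is a reminder to call reshape() first.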

