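Suppose the dataset contains a time series. An ordinary neural network cannot take such a sequence into account: it does not model the relationship between t1, t2 and t3, and it processes each input independently. In sequential data, however, the samples are correlated, so can a network learn how the order of events in time affects the final result? This is what distinguishes an RNN.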
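To make the idea concrete, here is a minimal sketch of a recurrent update on scalars. It assumes a tanh activation; the weights w_x and w_h and the function name rnn_forward are illustrative choices, not taken from the text. The same three inputs fed in a different order give a different final state, which an order-blind model could not distinguish.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar RNN over a sequence and return the final hidden state."""
    h = 0.0  # initial hidden state
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
    return h

forward = rnn_forward([1.0, 2.0, 3.0])
backward = rnn_forward([3.0, 2.0, 1.0])
print(forward, backward)  # same values, different order, different final state
```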
This **embedding matrix** contains one vector for every word in the training set. Traditionally, such a matrix holds a very large number of word vectors.
Max Sequence Length:
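Let us look at the LSTM unit from a more technical angle. The unit takes the input x(t) and produces the hidden state h(t). In these units the expression for h(t) is considerably more complex than in a classic RNN. The extra components fall into four parts: the input gate, the output gate, the forget gate, and a memory cell.
Each gate takes x(t) and h(t-1) as input (not shown in the figure) and uses them to compute intermediate states. Each intermediate state is fed into a different pipeline, and the information is eventually combined into h(t). For simplicity, we will not derive each gate in detail. The gates can be thought of as separate modules with different functions: the input gate decides how much emphasis to place on each input, the forget gate decides what information to discard, and the output gate determines the final h(t) from the intermediate state.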
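The gate description above can be sketched as a single LSTM step. This follows the commonly used LSTM formulation, reduced to scalars for readability; the parameter dictionary, its keys and the function names are hypothetical, and a real layer would use weight matrices instead of single numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step on scalars. p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre("input"))        # input gate: how much new information to admit
    f = sigmoid(pre("forget"))       # forget gate: how much old memory to keep
    o = sigmoid(pre("output"))       # output gate: how much memory to expose
    g = math.tanh(pre("candidate"))  # candidate memory content
    c_t = f * c_prev + i * g         # updated memory cell
    h_t = o * math.tanh(c_t)         # hidden state h(t)
    return h_t, c_t

# Illustrative parameters: every gate uses the same made-up weights.
params = {name: (0.5, 0.5, 0.0)
          for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Setting the forget gate's bias to a large negative value drives f toward 0, so the previous memory is discarded almost entirely; that is the "what to throw away" role described above.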
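2) Map words to IDs, the usual routine (TensorFlow needs integer IDs so that the vectors can be looked up).
wordsList maps words to IDs (a word's ID is its position in the list), and wordVectors holds the word vectors.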
import numpy as np
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')
wordsList = wordsList.tolist() #Originally loaded as numpy array
wordsList = [word.decode('UTF-8') for word in wordsList] #Decode bytes into UTF-8 strings
wordVectors = np.load('./training_data/wordVectors.npy')
print('Loaded the word vectors!')
print(wordVectors.shape)
Loaded the word list!
Loaded the word vectors!
(400000, 50)
baseballIndex = wordsList.index('baseball')
wordVectors[baseballIndex]
array([-1.93270004, 1.04209995, -0.78514999, 0.91033 , 0.22711 ,
-0.62158 , -1.64929998, 0.07686 , -0.58679998, 0.058831 ,
0.35628 , 0.68915999, -0.50598001, 0.70472997, 1.26639998,
-0.40031001, -0.020687 , 0.80862999, -0.90565997, -0.074054 ,
-0.87674999, -0.62910002, -0.12684999, 0.11524 , -0.55685002,
-1.68260002, -0.26291001, 0.22632 , 0.713 , -1.08280003,
2.12310004, 0.49869001, 0.066711 , -0.48225999, -0.17896999,
0.47699001, 0.16384 , 0.16537 , -0.11506 , -0.15962 ,
-0.94926 , -0.42833 , -0.59456998, 1.35660005, -0.27506 ,
0.19918001, -0.36008 , 0.55667001, -0.70314997, 0.17157 ], dtype=float32)
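Now that we have the vectors, the first step is to take an input sentence and build its vector representation. Suppose the input sentence is "I thought the movie was incredible and inspiring". To get the word vectors we can use TensorFlow's embedding lookup function. It takes two arguments: the embedding matrix (in our case the word-vector matrix) and the indices of the words.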
import tensorflow as tf
maxSeqLength = 10 #Maximum length of sentence
numDimensions = 50 #Dimensions for each word vector (matches wordVectors.shape[1])
firstSentence = np.zeros((maxSeqLength), dtype='int32')
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
#firstSentence[8] and firstSentence[9] are going to be 0
print(firstSentence) #Shows the row index for each word
[ 41 804 201534 1005 15 7446 5 13767 0 0]
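The zeros appear because a maximum length is fixed: sequences shorter than that are padded with 0, and sequences longer than that are truncated.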
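The padding and truncation rule can be sketched in a few lines. The helper name pad_or_truncate and its pad_id parameter are illustrative, not part of the original code.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a list of exactly max_len IDs."""
    clipped = ids[:max_len]                               # drop anything past max_len
    return clipped + [pad_id] * (max_len - len(clipped))  # fill the rest with pad_id

short = pad_or_truncate([41, 804, 201534], 5)
long = pad_or_truncate(list(range(1, 13)), 5)
print(short)  # [41, 804, 201534, 0, 0]
print(long)   # [1, 2, 3, 4, 5]
```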
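The output is a 10 x 50 matrix: 10 words, each with a 50-dimensional vector. The lookup simply fetches the vector that corresponds to each word ID.
with tf.Session() as sess:
    print(tf.nn.embedding_lookup(wordVectors, firstSentence).eval().shape)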
(10, 50)
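Conceptually, an embedding lookup is row selection: each ID picks one row of the embedding matrix. The following plain-Python sketch illustrates the idea on a made-up 4 x 3 matrix; it is not TensorFlow's implementation, and the names embedding_lookup and toy_vectors are only for illustration.

```python
def embedding_lookup(matrix, ids):
    """Pick one row of matrix per ID, preserving the order of ids."""
    return [matrix[i] for i in ids]

# Toy embedding matrix: 4 words, 3 dimensions each (made-up numbers).
toy_vectors = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
sentence_ids = [2, 1, 3, 0, 0]  # a 3-word sentence padded to length 5
embedded = embedding_lookup(toy_vectors, sentence_ids)
print(len(embedded), len(embedded[0]))  # 5 3
```

With a sentence of 10 IDs and 50-dimensional vectors, the same selection yields the 10 x 50 shape reported above.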
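For training we use the IMDB dataset. It contains 25,000 movie reviews: 12,500 positive and 12,500 negative. Each review is stored in a text file, so the first thing we need to do is parse these files. The positive reviews are in one directory and the negative reviews in another.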
from os import listdir
from os.path import isfile, join
positiveFiles = ['./training_data/positiveReviews/' + f for f in listdir('./training_data/positiveReviews/') if isfile(join('./training_data/positiveReviews/', f))]
negativeFiles = ['./training_data/negativeReviews/' + f for f in listdir('./training_data/negativeReviews/') if isfile(join('./training_data/negativeReviews/', f))]
numWords = []
for pf in positiveFiles:
    with open(pf, "r", encoding='utf-8') as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
print('Positive files finished')
for nf in negativeFiles:
    with open(nf, "r", encoding='utf-8') as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
print('Negative files finished')
numFiles = len(numWords)
print('The total number of files is', numFiles)
print('The total number of words in the files is', sum(numWords))
print('The average number of words in the files is', sum(numWords)/len(numWords))
Positive files finished
Negative files finished
The total number of files is 25000
The total number of words in the files is 5844680
The average number of words in the files is 233.7872
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(numWords, 50)
plt.xlabel('Sequence Length')
plt.axis([0, 1200, 0, 8000])
maxSeqLength = 250
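The walkthrough fixes maxSeqLength at 250 right after plotting the length histogram. One way to sanity-check such a choice, sketched here on made-up word counts (in the walkthrough the real list is numWords), is to compute what fraction of reviews fit within a candidate length. The helper name coverage is hypothetical.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Made-up word counts standing in for numWords.
sample_lengths = [120, 180, 233, 250, 260, 400, 95, 610, 150, 210]
for candidate in (100, 250, 500):
    print(candidate, coverage(sample_lengths, candidate))
```

A larger limit keeps more of each review but adds padding to every shorter one, so the choice trades information against computation.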
fname = positiveFiles[3] #Can use any valid index (not just 3)
with open(fname) as f:
    for lines in f:
        print(lines)
This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).
# Remove punctuation, parentheses, question marks, etc., keeping only alphanumeric characters
import re
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")
def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())
firstFile = np.zeros((maxSeqLength), dtype='int32')
with open(fname) as f:
    indexCounter = 0
    line = f.readline()
    cleanedLine = cleanSentences(line)
    split = cleanedLine.split()
    for word in split:
        try:
            firstFile[indexCounter] = wordsList.index(word)
        except ValueError:
            firstFile[indexCounter] = 399999 #Vector for unknown words
        indexCounter = indexCounter + 1
        if indexCounter >= maxSeqLength: #Stop once the array is full
            break
firstFile
array([ 37, 14, 2407, 201534, 96, 37314, 319, 7158,
201534, 6469, 8828, 1085, 47, 9703, 20, 260,
36, 455, 7, 7284, 1139, 3, 26494, 2633,
203, 197, 3941, 12739, 646, 7, 7284, 1139,
3, 11990, 7792, 46, 12608, 646, 7, 7284,
1139, 3, 8593, 81, 36381, 109, 3, 201534,
8735, 807, 2983, 34, 149, 37, 319, 14,
191, 31906, 6, 7, 179, 109, 15402, 32,
36, 5, 4, 2933, 12, 138, 6, 7,
523, 59, 77, 3, 201534, 96, 4246, 30006,
235, 3, 908, 14, 4702, 4571, 47, 36,
201534, 6429, 691, 34, 47, 36, 35404, 900,
192, 91, 4499, 14, 12, 6469, 189, 33,
1784, 1318, 1726, 6, 201534, 410, 41, 835,
10464, 19, 7, 369, 5, 1541, 36, 100,
181, 19, 7, 410, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
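Now we process all 25,000 reviews in the same way. We load the movie training set and obtain a 25000 x 250 matrix. This is computationally very expensive, so you can load the precomputed ID matrix file directly instead.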
# ids = np.zeros((numFiles, maxSeqLength), dtype='int32')
# fileCounter = 0
# for pf in positiveFiles:
#     with open(pf, "r") as f:
#         indexCounter = 0
#         line=f.readline()
#         cleanedLine = cleanSentences(line)
#         split = cleanedLine.split()
#         for word in split:
#             try:
#                 ids[fileCounter][indexCounter] = wordsList.index(word)
#             except ValueError:
#                 ids[fileCounter][indexCounter] = 399999 #Vector for unknown words
#             indexCounter = indexCounter + 1
#             if indexCounter >= maxSeqLength:
#                 break
#         fileCounter = fileCounter + 1
# for nf in negativeFiles:
#     with open(nf, "r") as f:
#         indexCounter = 0
#         line=f.readline()
#         cleanedLine = cleanSentences(line)
#         split = cleanedLine.split()
#         for word in split:
#             try:
#                 ids[fileCounter][indexCounter] = wordsList.index(word)
#             except ValueError:
#                 ids[fileCounter][indexCounter] = 399999 #Vector for unknown words
#             indexCounter = indexCounter + 1
#             if indexCounter >= maxSeqLength:
#                 break
#         fileCounter = fileCounter + 1
# #Pass into embedding function and see if it evaluates.
# np.save('idsMatrix', ids)
ids = np.load('./training_data/idsMatrix.npy')
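Part of the cost of building this matrix plausibly comes from wordsList.index(word), which scans the list on every call. A common alternative, sketched here as an assumption rather than what the original code does, is to build a dictionary once and look words up in it. The names build_index, to_ids and UNKNOWN_ID are illustrative; 399999 mirrors the unknown-word index used in the walkthrough.

```python
UNKNOWN_ID = 399999  # same fallback index as in the walkthrough

def build_index(words):
    """Map each word to its first position in the list (like list.index)."""
    index = {}
    for position, word in enumerate(words):
        index.setdefault(word, position)  # keep the first occurrence only
    return index

def to_ids(tokens, word_to_id, max_len):
    """Convert tokens to a fixed-length list of IDs, padding with 0."""
    ids = [word_to_id.get(tok, UNKNOWN_ID) for tok in tokens[:max_len]]
    return ids + [0] * (max_len - len(ids))

toy_words = ["the", "movie", "was", "great"]  # stand-in for wordsList
word_to_id = build_index(toy_words)
print(to_ids("the movie was zzzz".split(), word_to_id, 6))
# [0, 1, 2, 399999, 0, 0]
```

Dictionary lookups take roughly constant time per word, whereas list.index is linear in the vocabulary size.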
from random import randint

def getTrainBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        if (i % 2 == 0):
            num = randint(1,11499)
            labels.append([1,0])
        else:
            num = randint(13499,24999)
            labels.append([0,1])
        arr[i] = ids[num-1:num]
    return arr, labels

def getTestBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        num = randint(11499,13499)
        if (num <= 12499):
            labels.append([1,0])
        else:
            labels.append([0,1])
        arr[i] = ids[num-1:num]
    return arr, labels
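The index arithmetic in getTrainBatch is easier to see without the arrays. The sketch below reproduces only the row and label selection; sample_train_rows is a hypothetical helper, and the random generator is passed in so the result is repeatable.

```python
import random

def sample_train_rows(batch_size, rng):
    """Pick row numbers and one-hot labels the way getTrainBatch does."""
    rows, labels = [], []
    for i in range(batch_size):
        if i % 2 == 0:
            num = rng.randint(1, 11499)      # drawn from the positive region
            labels.append([1, 0])
        else:
            num = rng.randint(13499, 24999)  # drawn from the negative region
            labels.append([0, 1])
        rows.append(num - 1)                 # ids[num-1:num] selects row num-1
    return rows, labels

rows, labels = sample_train_rows(24, random.Random(0))
print(rows[:4], labels[:4])
```

Even positions are positive and odd positions are negative, so every batch is balanced. The band from 11499 to 13499 is what getTestBatch samples from; because randint includes both endpoints, the values 11499 and 13499 can be drawn by both functions.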
RNN model
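Now we can start building our TensorFlow graph. First we define a few hyperparameters: the batch size, the number of LSTM units, the number of output classes, and the number of training iterations.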
batchSize = 24
lstmUnits = 64
numClasses = 2
iterations = 50000
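As with most TensorFlow graphs, we now need to specify two placeholders, one for the input data and one for the labels. The most important thing about placeholders is getting their dimensions right.
The label placeholder holds a set of values, each either [1, 0] or [0, 1], depending on whether the example is positive or negative. The input placeholder holds an array of integer word IDs.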
import tensorflow as tf
labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])
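Once the input placeholder is set up, we call tf.nn.embedding_lookup() to get the word vectors. The call returns a 3-D tensor whose first dimension is the batch size, the second is the sentence length, and the third is the word-vector length. The figure below makes this clearer: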
data = tf.Variable(tf.zeros([batchSize, maxSeqLength, numDimensions]),dtype=tf.float32)
data = tf.nn.embedding_lookup(wordVectors,input_data)
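Now that the data has the shape we want, let us see how to feed it into the LSTM network. First we call tf.nn.rnn_cell.BasicLSTMCell (used in the code as tf.contrib.rnn.BasicLSTMCell), which takes an integer giving the number of LSTM units. This is a hyperparameter that has to be tuned to find a good value. We then wrap the cell in a dropout layer to reduce overfitting.
Finally, we pass the LSTM cell and the 3-D data to tf.nn.dynamic_rnn, which unrolls the network and builds the full RNN.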
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)
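Stacking LSTMs is a reasonably good architecture choice: the hidden-state output of one LSTM layer becomes the input of the next. Stacking helps the model retain more context, at the cost of many more trainable parameters, longer training time, and a higher risk of overfitting.
The first output of dynamic_rnn holds the hidden state at every time step, and we take the one from the last time step as the final hidden state vector. This vector is reshaped, then multiplied by a final weight matrix and added to a bias term to obtain the final output values.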
weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
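The final projection and the accuracy computation can be written out by hand for a toy batch. This is a plain-Python sketch that assumes the last hidden state has already been extracted; the helpers matvec_add and argmax and all numbers are illustrative.

```python
def matvec_add(vec, weight, bias):
    """Compute vec @ weight + bias for a single example."""
    return [sum(vec[k] * weight[k][j] for k in range(len(vec))) + bias[j]
            for j in range(len(bias))]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda j: values[j])

# Toy setup: 2 examples, 3 hidden units, 2 classes (all numbers made up).
last_hidden = [[0.9, -0.2, 0.1], [-0.5, 0.4, 0.3]]
weight = [[1.0, -1.0], [0.0, 0.5], [0.5, 0.0]]
bias = [0.1, 0.1]
one_hot_labels = [[1, 0], [1, 0]]

logits = [matvec_add(h, weight, bias) for h in last_hidden]
correct = [argmax(p) == argmax(y) for p, y in zip(logits, one_hot_labels)]
batch_accuracy = sum(correct) / len(correct)
print(logits, batch_accuracy)
```

The first example is classified correctly and the second is not, so the batch accuracy is 0.5. The accuracy in the graph is the same average of per-example matches.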
Next we use a standard cross-entropy loss, and for the optimizer we choose Adam with its default learning rate.
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
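For reference, softmax cross-entropy on one-hot labels can be computed by hand as follows. This is a minimal sketch of the standard formula, not TensorFlow's implementation; the function name and the toy logits are illustrative.

```python
import math

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot, log_probs))

batch_logits = [[2.0, 0.0], [0.0, 0.0]]
batch_labels = [[1, 0], [0, 1]]
losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

With two classes, equal logits give a loss of log 2, about 0.693, which is the value to expect from a model that has not yet learned anything.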
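The basic idea of the training loop is as follows. We first define a TensorFlow session, then load a batch of reviews and their labels, and then call the session's run function. It takes two arguments. The first, fetches, specifies the values we want computed; here we want the optimizer to take a step that minimizes the loss. The second, feed_dict, is the data structure that supplies values for our placeholders. We feed one batch of reviews and labels to the model and repeat this over many batches.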
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
for i in range(iterations):
    #Next Batch of reviews
    nextBatch, nextBatchLabels = getTrainBatch()
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
    if (i % 1000 == 0 and i != 0):
        loss_ = sess.run(loss, {input_data: nextBatch, labels: nextBatchLabels})
        accuracy_ = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})
        print("iteration {}/{}...".format(i+1, iterations),
              "loss {}...".format(loss_),
              "accuracy {}...".format(accuracy_))
    #Save the network every 10,000 training iterations
    if (i % 10000 == 0 and i != 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt", global_step=i)
        print("saved to %s" % save_path)
iteration 1001/50000… loss 0.6308178901672363… accuracy 0.5…
iteration 2001/50000… loss 0.7168402671813965… accuracy 0.625…
iteration 3001/50000… loss 0.7420873641967773… accuracy 0.5…
iteration 4001/50000… loss 0.650059700012207… accuracy 0.5416666865348816…
iteration 5001/50000… loss 0.6791467070579529… accuracy 0.5…
iteration 6001/50000… loss 0.6914048790931702… accuracy 0.5416666865348816…
iteration 7001/50000… loss 0.36072710156440735… accuracy 0.8333333134651184…
iteration 8001/50000… loss 0.5486791729927063… accuracy 0.75…
iteration 9001/50000… loss 0.41976991295814514… accuracy 0.7916666865348816…
iteration 10001/50000… loss 0.10224487632513046… accuracy 1.0…
saved to models/pretrained_lstm.ckpt-10000
iteration 11001/50000… loss 0.37682783603668213… accuracy 0.8333333134651184…
iteration 12001/50000… loss 0.266050785779953… accuracy 0.9166666865348816…
iteration 13001/50000… loss 0.40790924429893494… accuracy 0.7916666865348816…
iteration 14001/50000… loss 0.22000855207443237… accuracy 0.875…
iteration 15001/50000… loss 0.49727579951286316… accuracy 0.7916666865348816…
iteration 16001/50000… loss 0.21477992832660675… accuracy 0.9166666865348816…
iteration 17001/50000… loss 0.31636106967926025… accuracy 0.875…
iteration 18001/50000… loss 0.17190784215927124… accuracy 0.9166666865348816…
iteration 19001/50000… loss 0.11049345880746841… accuracy 1.0…
iteration 20001/50000… loss 0.06362085044384003… accuracy 1.0…
saved to models/pretrained_lstm.ckpt-20000
iteration 21001/50000… loss 0.19093847274780273… accuracy 0.9583333134651184…
iteration 22001/50000… loss 0.06586482375860214… accuracy 0.9583333134651184…
iteration 23001/50000… loss 0.02577809803187847… accuracy 1.0…
iteration 24001/50000… loss 0.0732395276427269… accuracy 0.9583333134651184…
iteration 25001/50000… loss 0.30879321694374084… accuracy 0.9583333134651184…
iteration 26001/50000… loss 0.2742778956890106… accuracy 0.9583333134651184…
iteration 27001/50000… loss 0.23742587864398956… accuracy 0.875…
iteration 28001/50000… loss 0.04694415628910065… accuracy 1.0…
iteration 29001/50000… loss 0.031666990369558334… accuracy 1.0…
iteration 30001/50000… loss 0.09171193093061447… accuracy 1.0…
saved to models/pretrained_lstm.ckpt-30000
iteration 31001/50000… loss 0.03852967545390129… accuracy 1.0…
iteration 32001/50000… loss 0.06964454054832458… accuracy 1.0…
iteration 33001/50000… loss 0.12447216361761093… accuracy 0.9583333134651184…
iteration 34001/50000… loss 0.008963108994066715… accuracy 1.0…
iteration 35001/50000… loss 0.04129207879304886… accuracy 0.9583333134651184…
iteration 36001/50000… loss 0.0081111378967762… accuracy 1.0…
iteration 37001/50000… loss 0.022405564785003662… accuracy 1.0…
iteration 38001/50000… loss 0.03473325073719025… accuracy 1.0…
iteration 39001/50000… loss 0.09315425157546997… accuracy 0.9583333134651184…
iteration 40001/50000… loss 0.3166258931159973… accuracy 0.9583333134651184…
saved to models/pretrained_lstm.ckpt-40000
iteration 41001/50000… loss 0.03648881986737251… accuracy 1.0…
iteration 42001/50000… loss 0.2616865932941437… accuracy 0.9583333134651184…
iteration 43001/50000… loss 0.013914794661104679… accuracy 1.0…
iteration 44001/50000… loss 0.020460862666368484… accuracy 1.0…
iteration 45001/50000… loss 0.15876878798007965… accuracy 0.9583333134651184…
iteration 46001/50000… loss 0.007766606751829386… accuracy 1.0…
iteration 47001/50000… loss 0.02079685777425766… accuracy 1.0…
iteration 48001/50000… loss 0.017801295965909958… accuracy 1.0…
iteration 49001/50000… loss 0.017789073288440704… accuracy 1.0…
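Looking at the training output above, the model appears to train well: the loss trends downward and the accuracy keeps approaching 100%. When analyzing these numbers, however, we should keep in mind that the model may already have overfit the training set. Overfitting is a very common problem in machine learning: the model fits the training set too well and generalizes much worse to the test set. In other words, even a model that reaches a training loss of 0 is not necessarily the best one. Early stopping is a common way to prevent overfitting when training LSTMs. The basic idea is to train on the training set while continually measuring performance on the test set. Once the test error stops decreasing or starts to increase, we should stop training, because that signals the network's performance has begun to degrade.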
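The early-stopping rule can be sketched independently of TensorFlow. The function below scans a list of held-out losses and stops after a fixed number of evaluations without improvement; the name find_stopping_point, the patience parameter and the loss values are all made up for illustration.

```python
def find_stopping_point(val_losses, patience=3):
    """Return (best_step, best_loss), stopping after `patience`
    consecutive evaluations that fail to improve on the best loss."""
    best_loss, best_step, waited = float("inf"), -1, 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # performance has stopped improving
    return best_step, best_loss

# Made-up held-out losses, one per evaluation.
held_out_losses = [0.69, 0.55, 0.48, 0.50, 0.47, 0.52, 0.56, 0.61, 0.40]
best_step, best_loss = find_stopping_point(held_out_losses, patience=3)
print(best_step, best_loss)  # 4 0.47
```

In practice one would keep the checkpoint saved at the best step. Note that the late value 0.40 is never seen because training has already stopped, which is the trade-off a patience setting makes.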
Loading a pretrained model uses another TensorFlow object, the Saver, whose restore function we call. It takes two arguments: the current session and the path of the saved model.
sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('models'))
INFO:tensorflow:Restoring parameters from models\pretrained_lstm.ckpt-40000
iterations = 10
for i in range(iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)
Accuracy for this batch: 91.6666686535
Accuracy for this batch: 79.1666686535
Accuracy for this batch: 87.5
Accuracy for this batch: 87.5
Accuracy for this batch: 91.6666686535
Accuracy for this batch: 75.0
Accuracy for this batch: 91.6666686535
Accuracy for this batch: 70.8333313465
Accuracy for this batch: 83.3333313465
Accuracy for this batch: 95.8333313465
