In this post we discuss an English text classification problem; the same approach applies to other binary text classification tasks such as sentiment classification and attribute classification.
We use the IMDB dataset, which contains 50,000 highly polarized reviews from the Internet Movie Database (IMDB). It is split into 25,000 reviews for training and 25,000 for testing, and both splits contain 50% positive and 50% negative reviews.
The dataset comes with download and loading helpers in keras.datasets, so we can simply call them:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Let's print out some basic information about the data:
# Print the dataset size and some sample data
print(train_data.shape)
print(train_data[0])
print(train_labels)
print(test_labels.shape)
As you can see, the training and test sets each contain 25,000 reviews, and each review is an integer vector of varying length. In this dataset every review has already been encoded as numbers, roughly as follows:
First, the 25,000 reviews are tokenized and lemmatized, word frequencies are counted, and words are numbered 1, 2, 3, ... from most to least frequent. Each word in a review is then replaced by its index, producing the integer row vectors we saw above.
For example, suppose the frequency-based index is {"I": 29, "love": 49, "you": 21, "don't": 34, "hurt": 102}. Then the review "I hurt you" becomes the row vector [29, 102, 21].
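The lookup itself is just a dictionary access. Here is a minimal sketch using the toy index above (toy_index is hypothetical; the real IMDB index is far larger and, as noted below, offset by 3):
toy_index = {"I": 29, "love": 49, "you": 21, "don't": 34, "hurt": 102}
review = "I hurt you"
encoded = [toy_index[word] for word in review.split()]
print(encoded)  # [29, 102, 21]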
Keras also ships a dictionary mapping each word to its index, which means we can decode every review back into text.
word_index = imdb.get_word_index()  # word_index is a dict whose keys are words and whose values are their integer indices
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])  # swap keys and values
Now reverse_word_index maps each index back to its word, and we can decode any review.
'''
Decode the first review. Note that we subtract 3 from each index, because 0, 1, and 2
are reserved for "padding", "start of sequence", and "unknown" respectively.
'''
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
Printing the text of the first review gives:
? this film was just brilliant casting location scenery story direction
everyone's really suited the part they played and you could just imagine being there robert ?
is an amazing actor and now the same being director ? father came from the same scottish island as myself
so i loved the fact there was a real connection with this film the witty remarks throughout
the film were great it was just brilliant so much that i bought the film as soon as it was released for ?
and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film
it must have been good and this definitely was also ?
to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list
i think because the stars that play them all grown up are such a big profile for the whole film
but these children are amazing and should be praised for what they have done
don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Since we kept only the 10,000 most frequent words, we initialize each review as a row vector of 10,000 zeros, then walk through the word indices that appear in the review and set the corresponding positions to 1. For example, if a review contains the word indices 12, 250, and 3600, its processed vector is 10,000 zeros with positions 12, 250, and 3600 set to 1.
import numpy as np

# One-hot (multi-hot) encode the lists so every review becomes a tensor of the same size
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))  # len(sequences) rows, 10000 columns of zeros
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # set the positions of the words that appear to 1
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
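As a quick sanity check (the printed shape assumes the full 25,000-review training split), the vectorized data now has a fixed size:
print(x_train.shape)          # (25000, 10000)
print(int(x_train[0].sum()))  # number of distinct word indices in the first review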
We also vectorize the labels:
# Vectorize the labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
In Keras, a model can be built with models.Sequential() or with the functional API. Here we use the most common approach, models.Sequential().
from keras import models, layers

# Build the model
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
A few words on the number of layers, the units per layer, and the activation functions. We build a three-layer network: two hidden layers and one output layer. Since the dataset is small, we use only two hidden layers with relatively few units (16 and 16; unit counts are conventionally powers of 2). Because this is binary classification, the output layer has a single unit with a "sigmoid" activation that outputs the probability of the positive class. Note that sigmoid is the usual choice for binary classification; for multi-class probabilities use "softmax". The hidden layers use "relu"; "tanh" would also work, but "sigmoid" would not (think about why). The derivative of sigmoid reaches its maximum at 0, where it equals 0.25, far less than 1. During backpropagation the chain rule multiplies the derivatives of every layer together, so with sigmoid hidden layers the gradients shrink layer by layer. This "vanishing gradient" problem means the earlier layers barely update and training stalls short of a good fit.
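That 0.25 bound is easy to verify numerically; here is a small sketch:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), which peaks at x = 0
xs = np.linspace(-10, 10, 2001)
grads = sigmoid(xs) * (1 - sigmoid(xs))
print(grads.max())  # 0.25; multiplying such factors across layers shrinks gradients fast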
from keras.optimizers import RMSprop

model.compile(
    optimizer=RMSprop(lr=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy'])
We choose RMSprop, a variant of stochastic gradient descent (SGD), as the optimizer with a learning rate of 0.001; the loss is binary cross-entropy and the evaluation metric is accuracy.
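As a refresher, binary cross-entropy for one sample is -(y*log(p) + (1-y)*log(1-p)). A quick numeric check with made-up values:
import numpy as np

y, p = 1.0, 0.9  # true label and predicted probability (illustrative values)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)  # ~0.105: small loss, because the confident prediction is correct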
To monitor the model during training we need data it never trains on, so we set aside part of the training data as a validation set: the last 15,000 reviews are used to train the model and the first 10,000 reviews are used for validation.
train_partion_data = x_train[10000:]   # last 15,000 reviews: training
train_partion_label = y_train[10000:]
test_partion_data = x_train[:10000]    # first 10,000 reviews: validation
test_partion_label = y_train[:10000]
Now let's train:
history = model.fit(
train_partion_data,
train_partion_label,
epochs=20, batch_size=512,
validation_data=(test_partion_data, test_partion_label))
Training uses the fit function: the first two arguments are the training data and labels, epochs is the number of passes over the data, batch_size is the number of samples per gradient update, and validation_data holds the validation data and labels, on which the model is evaluated after every epoch.
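Once training finishes, nothing stops us from also evaluating on the untouched 25,000-review IMDB test split; a brief sketch:
# Evaluate on the real held-out test set (x_test / y_test from the vectorization step)
test_loss, test_acc = model.evaluate(x_test, y_test)
print(test_loss, test_acc)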
Finally, don't forget to save the trained model so you can resume training later or use it directly for prediction, analysis, and visualization.
print(history.history)
model.save('IMDB.h5')  # save architecture + weights in HDF5 format
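To reuse the saved model later, it can be reloaded with load_model; a minimal sketch:
from keras.models import load_model

# Reload architecture, weights, and optimizer state from the saved file
model = load_model('IMDB.h5')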
The object returned by fit carries a history attribute that records the loss and accuracy of every epoch, which makes it easy to visualize training and spot the epoch where overfitting sets in.
First, compare the output of the first and last epochs:
Epoch 1/20
15000/15000 [==============================] - 8s 513us/step - loss: 0.4879 - acc: 0.7841 - val_loss: 0.3425 - val_acc: 0.8788
...
Epoch 20/20
15000/15000 [==============================] - 3s 170us/step - loss: 0.0016 - acc: 0.9999 - val_loss: 0.7605 - val_acc: 0.8631
After 20 epochs the validation accuracy is 86.31%.
Let's inspect the recorded history to check whether overfitting occurred:
history = {
'val_loss': [0.3797392254829407, 0.300249081659317, 0.3083435884475708, 0.283885223865509, 0.2847259536743164, 0.3144310226917267, 0.31279232678413393, 0.38592003211975096, 0.36343686447143553, 0.3843619570732117, 0.4167306966781616, 0.45070800895690916, 0.46998676981925963, 0.502394838142395, 0.5363822244167328, 0.572349524307251, 0.6167236045837402, 0.6382174592018127, 0.7905949376106263, 0.7077673551559448],
'val_acc': [0.8683999997138977, 0.8898000004768372, 0.8715000001907348, 0.8831000001907349, 0.8867000002861023, 0.8772999998092651, 0.8844999999046326, 0.8650999998092651, 0.8783000001907348, 0.8790000001907349, 0.8768999999046325, 0.8692999998092651, 0.8731999997138977, 0.8721999996185302, 0.8688999997138978, 0.8694999996185303, 0.8648999995231629, 0.8663999996185303, 0.8495000007629394, 0.8646999994277954],
'loss': [0.5085020663420359, 0.30055405753453573, 0.2179573760588964, 0.1750892378250758, 0.1426703708012899, 0.11497294256687164, 0.09784598382711411, 0.08069058032830556, 0.06599647818009059, 0.055482132176558174, 0.045179014571507775, 0.038426268839836124, 0.029788546661535898, 0.02438261934618155, 0.0176644352838397, 0.016922697043418884, 0.009341424687206746, 0.011814989059666792, 0.005609988443056742, 0.005509983973701795],
'acc': [0.7815333335240682, 0.9044666666984558, 0.9285333332697551, 0.9436666668891907, 0.954333333047231, 0.9652000000635783, 0.9706666668256124, 0.9762666667938232, 0.9821333333015442, 0.9851333333015442, 0.9887333332379659, 0.9912000002543131, 0.9928666666666667, 0.9948666666666667, 0.9978666666666667, 0.9970000001589457, 0.9994, 0.9974666666666666, 0.9998000002543131, 0.9996666666666667]
}
The first two entries, 'val_loss' and 'val_acc', are the validation results. Validation accuracy peaks at 0.8898 in the second epoch but ends at only about 0.865 after the last one. That is a clear sign of overfitting. Let's plot the training and validation loss curves for a closer look.
import matplotlib.pyplot as plt

history_dict = history.history  # the same dict printed above

loss_values = history_dict['loss']          # training loss
val_loss_values = history_dict['val_loss']  # validation loss
epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
As expected, the training loss keeps decreasing while the validation loss climbs: the model is overfitting.
To confirm, let's also plot the training and validation accuracy:
import matplotlib.pyplot as plt

history_dict = history.history  # the same dict printed above

acc_values = history_dict['acc']          # training accuracy
val_acc_values = history_dict['val_acc']  # validation accuracy
epochs = range(1, len(acc_values) + 1)

plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
The training accuracy rises steadily while the validation accuracy fluctuates and drops over the final epochs, which confirms the overfitting. When this happens we need countermeasures. The simplest is to train for fewer epochs; there are also stronger tools such as regularization and dropout, which we will discuss later.
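As a rough sketch of those remedies (the 4-epoch budget and the 0.5 dropout rate are assumptions, not tuned values), we could stop near the epoch where val_loss bottomed out and insert Dropout layers:
from keras import models, layers

# Sketch: shorter training plus dropout; epoch count and rate are assumptions
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dropout(0.5))  # randomly zero 50% of activations during training
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_partion_data, train_partion_label,
          epochs=4, batch_size=512,
          validation_data=(test_partion_data, test_partion_label))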