
[Deep Learning with TF2][RNN-LSTM] Text Sentiment Analysis (Data Preprocessing, Training, Prediction)

0. Preface

[Deep Learning, Fundamentals] What Are Recurrent Neural Networks: RNN and LSTM


1. Downloading the Data

Dataset URL: http://ai.stanford.edu/~amaas/data/sentiment/
After downloading and extracting it, you will see two folders, test and train:
Looking inside train, the positive and negative samples are already separated:
neg and pos hold the negative and positive samples respectively, and unsup contains unlabeled samples that can be used later if needed. Have a look at the rest yourself.

Open the pos folder to see what it contains:

It is just a collection of individual text files.

Note that these reviews are generally not short…
The dataset contains 50,000 reviews in total, split evenly between the test and train sets; within each set, pos and neg are also split evenly.
This article starts from the 25,000 reviews in the train folder (the preprocessing code below additionally reads the test folder and re-splits the combined data).

2. About the Training Data

Sentiment analysis is one of the easiest NLP tasks to start with: it is simply a text classification problem in which you judge the sentiment polarity of a piece of text. The simplest version is binary classification, positive vs. negative; a slightly harder one is three-way classification, adding a neutral class; more complex still is sentiment scoring, for example rating a movie from 1 to 5, which is a five-class problem. Fundamentally they are all the same, only with more classes they become harder to learn.
IMDB is a professional movie review site, similar to Douban in China. Its movie review data is a popular dataset for practicing sentiment analysis, and it is widely used in competitions such as Kaggle and in academic research.

In fact, tensorflow.keras ships with a nicely preprocessed version of the IMDB dataset: it can be downloaded with one line of code and trained on without any further processing, with fairly good results. But that would be no fun. In real scenarios the data we get is messy, and we have to learn to read, clean, and filter it ourselves and split it into training and test sets. In my experience, data preprocessing is the real skill. Models are easy to build, and today's frameworks keep making that easier, but preprocessing you have to do by hand. In practice it usually consumes the most time and effort, and it directly affects the final results.

Also, remember that to analyze text we must first turn it into numbers, because a computer does not understand characters, only numbers, so the fully processed text has to be in numeric form. The dataset bundled with tensorflow.keras is already fully numericalized, but it does not provide the lookup dictionary that tells us which word each number represents. That means we can only train a model and look at the metrics; we cannot apply it to other corpora or analyze it more deeply. For these reasons, the dataset recommended above is the raw one: real text, already split into classes for us by the Stanford team, but the numericalization is up to us.
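As a tiny illustration of what "numericalization" means here (a minimal sketch with made-up sentences; the same keras Tokenizer is used in the preprocessing code later in this post):

import tensorflow.keras as keras

# two toy reviews, invented purely for illustration
texts = ["this movie was great", "this movie was terrible"]

t = keras.preprocessing.text.Tokenizer(num_words=1000)
t.fit_on_texts(texts)                      # build the word -> index dictionary
print(t.word_index)                        # e.g. {'this': 1, 'movie': 2, 'was': 3, ...}
print(t.texts_to_sequences(texts))         # each review becomes a list of word indices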

3. The Word Vectors Used (Word2Vec / GloVe)

Google has already trained a Word2Vec model on a very large corpus (roughly 100 billion words); the model contains about 3 million word vectors, each of dimension 300. Ideally we would build our model on these vectors, but the word-vector matrix is too large (about 3.6 GB), so for this exercise we use a more manageable matrix instead, trained with GloVe. It contains 400,000 word vectors, each of dimension 50.

We will load two different data structures: a Python list of 400,000 words (wordsList.npy), and a 400000*50 embedding matrix holding all of the word vectors (wordVectors.npy).

GloVe word vectors, Baidu Netdisk download:
Link: https://pan.baidu.com/s/1PJx_ahSaPfVgMjmLpMz8cw  Extraction code: di2e
After downloading and extracting:
If you want to know how these two files were produced, see my earlier post on that.

About wordsList.npy

This is a Python list of 400,000 words; each word's position in the list matches the position of its word vector in wordVectors.
Example: looking up the word vector for the word baseball.

import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from os import listdir
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')

wordsList = wordsList.tolist()  # Originally loaded as a numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode words from UTF-8 bytes to str
wordVectors = np.load('./training_data/wordVectors.npy')
print('Loaded the word vectors!')

print(len(wordsList))
# print(wordsList)
print(wordVectors.shape)

baseballIndex = wordsList.index('baseball')
print(baseballIndex)
print(wordVectors[baseballIndex])

Output:

Loaded the word list!
Loaded the word vectors!
400000
(400000, 50)
1444
[-1.9327    1.0421   -0.78515   0.91033   0.22711  -0.62158  -1.6493
  0.07686  -0.5868    0.058831  0.35628   0.68916  -0.50598   0.70473
  1.2664   -0.40031  -0.020687  0.80863  -0.90566  -0.074054 -0.87675
 -0.6291   -0.12685   0.11524  -0.55685  -1.6826   -0.26291   0.22632
  0.713    -1.0828    2.1231    0.49869   0.066711 -0.48226  -0.17897
  0.47699   0.16384   0.16537  -0.11506  -0.15962  -0.94926  -0.42833
 -0.59457   1.3566   -0.27506   0.19918  -0.36008   0.55667  -0.70315
  0.17157 ]

About wordVectors.npy

This is the 400000*50 embedding matrix that holds all of the word vectors.
Suppose we have the sentence "I thought the movie was incredible and inspiring": 8 words, padded to a maximum length of 10. What are their word vectors in wordVectors?
The code below looks up the corresponding vectors:

import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from os import listdir
wordsList = np.load('./training_data/wordsList.npy')
print('Loaded the word list!')

wordsList = wordsList.tolist()  # Originally loaded as a numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode words from UTF-8 bytes to str
wordVectors = np.load('./training_data/wordVectors.npy')
maxSeqLength = 10  # Maximum length of sentence
numDimensions = 50  # Dimensions of each word vector (these GloVe vectors are 50-d)
firstSentence = np.zeros((maxSeqLength), dtype='int32')
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
# firstSentence[8] and firstSentence[9] are going to be 0
print(firstSentence.shape)
print(firstSentence)  # Shows the row index for each word
# TF2 runs eagerly, so no Session is needed to evaluate the lookup
print(tf.nn.embedding_lookup(wordVectors, firstSentence).shape)

Output:

(10,)
[    41    804 201534   1005     15   7446      5  13767      0      0]
(10, 50)

4. Data Preprocessing

4.1 The generate_train_data function

4.1.1. Load the raw data.
4.1.2. Drop low-frequency words.
4.1.3. Convert every word to a unique index, since the computer only understands numbers.
4.1.4. Produce a trainData.npz dataset so that training can load it directly; re-reading the raw text files every time is too slow.
4.1.5. Produce a small_word_index dictionary, dict(word -> index). You need this dictionary at prediction time: every word was converted to its own unique index for training, so when you later want to predict on a new review, each of its words must be converted with exactly the same mapping, and that is what this dictionary is for.

4.2 The generate_embedding_matrix function

4.2.1 Use wordVectors, wordsList, and the small_word_index dictionary to build an embedding matrix.
4.2.2 embedding_matrix.npy is a matrix that maps each word index to its word vector.

4.3 The test_load function

Verifies the generated artifacts (trainData.npz, train.npz, test.npz, small_word_index.npy, embedding_matrix.npy). It is not included in the listing below; a small sketch of it is given right after the main code block.

import numpy as np
import os as os
import tensorflow.keras as keras
import time
import re
from sklearn.model_selection import train_test_split

vocab_size = 30000
save_dir = './train_data_new1'
os.makedirs(save_dir, exist_ok=True)  # make sure the output directory exists

# remove html tag like '<br /><br />'
def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub(' ', text)

def clean_str(string):
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)  # it's -> it 's
    string = re.sub(r"\'ve", " \'ve", string) # I've -> I 've
    string = re.sub(r"n\'t", " n\'t", string) # doesn't -> does n't
    string = re.sub(r"\'re", " \'re", string) # you're -> you are
    string = re.sub(r"\'d", " \'d", string)  # you'd -> you 'd
    string = re.sub(r"\'ll", " \'ll", string) # you'll -> you 'll
    string = re.sub(r"\'m", " \'m", string) # I'm -> I 'm
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

def process(text):
    # strip the HTML tags first; clean_str would otherwise remove the angle brackets
    # and leave the bare tag names behind
    text = rm_tags(text)
    text = clean_str(text)  # clean_str also lowercases the text
    return text

def get_data(datapath =r'D:\train_data\aclImdb\aclImdb\train' ):
    pos_files = os.listdir(datapath + '/pos')
    neg_files = os.listdir(datapath + '/neg')
    print(len(pos_files))
    print(len(neg_files))

    pos_all = []
    neg_all = []
    for pf, nf in zip(pos_files, neg_files):
        with open(datapath + '/pos' + '/' + pf, encoding='utf-8') as f:
            s = f.read()
            s = process(s)
            pos_all.append(s)
        with open(datapath + '/neg' + '/' + nf, encoding='utf-8') as f:
            s = f.read()
            s = process(s)
            neg_all.append(s)
    print(len(pos_all))
    print(pos_all[0])

    print(len(neg_all))

    X_orig= np.array(pos_all + neg_all)
    print(X_orig)
    Y_orig = np.array([1 for _ in range(len(pos_all))] + [0 for _ in range(len(neg_all))])
    print("X_orig:", X_orig.shape)
    print("Y_orig:", Y_orig.shape)

    return X_orig, Y_orig

def generate_train_data():
    X_orig, Y_orig =  get_data(r'D:\train_data\aclImdb\aclImdb\train')
    X_orig_test, Y_orig_test = get_data(r'D:\train_data\aclImdb\aclImdb\test')
    X_orig = np.concatenate([X_orig, X_orig_test])
    Y_orig = np.concatenate([Y_orig ,Y_orig_test])

    maxlen = 200
    print("Start fitting the corpus......")
    t = keras.preprocessing.text.Tokenizer(vocab_size)  # setting num_words makes vectorization drop low-frequency words
    tik = time.time()
    t.fit_on_texts(X_orig)  # fit on the whole review corpus to collect word statistics
    tok = time.time()
    word_index = t.word_index  # not affected by vocab_size: contains every word seen
    print(X_orig)
    print('all_vocab_size', len(word_index), type(word_index))
    print(word_index)
    print("Fitting time: ", (tok - tik), 's')
    print("Start vectorizing the sentences.......")
    v_X = t.texts_to_sequences(X_orig)  # affected by vocab_size: words outside the top vocab_size are dropped
    print("Start padding......")
    print(v_X)
    pad_X = keras.preprocessing.sequence.pad_sequences(v_X, maxlen=maxlen, padding='post')
    print(pad_X.shape)
    print("Finished!")

    np.savez(save_dir+'/trainData', x=pad_X, y=Y_orig)
    import copy
    x = list(t.word_counts.items())
    s = sorted(x, key=lambda p: p[1], reverse=True)
    small_word_index = copy.deepcopy(word_index)  # deep copy so the original dict is not modified
    print("Removing less freq words from word-index dict...")
    for item in s[vocab_size:]:
        small_word_index.pop(item[0])
    print("Finished!")
    print(len(small_word_index))
    print(len(word_index))
    np.save(save_dir+'/small_word_index', small_word_index)

def generate_embedding_matrix():
    small_word_index = np.load(save_dir+'/small_word_index.npy', allow_pickle=True)

    wordVectors = np.load('./GloVe/wordVectors.npy')
    wordsList = np.load('./GloVe/wordsList.npy')
    wordsList = [word.decode('UTF-8') for word in wordsList]

    embedding_matrix = np.random.uniform(size=(vocab_size + 1, 50))  # +1 reserves row 0 for the padding index
    print("Transfering to the embedding matrix......")
    for word, index in small_word_index.item().items():
        try:
            word_index = wordsList.index(word)
            word_vector = wordVectors[word_index]
            embedding_matrix[index] = word_vector
        except Exception:
            print("Word: [", word, "] not in wvmodel! Use random embedding instead.")
    print("Finished!")
    print("Embedding matrix shape:\n", embedding_matrix.shape)
    np.save(save_dir+'/embedding_matrix', embedding_matrix)

def generate_test_train():
    trainDataNew = np.load('./train_data_new1/trainData.npz')
    X = trainDataNew['x']
    Y = trainDataNew['y']

    np.random.seed(1)
    random_indexs = np.random.permutation(len(X))
    X = X[random_indexs]
    Y = Y[random_indexs]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
    print("X_train:", X_train.shape)
    print("y_train:", y_train.shape)
    print("X_test:", X_test.shape)
    print("y_test:", y_test.shape)
    np.savez(save_dir + '/train', x=X_train, y=y_train)
    np.savez(save_dir + '/test', x=X_test, y=y_test)

if __name__ == '__main__':
    #get_data(r'D:\train_data\aclImdb\aclImdb_test\train')
    generate_train_data()
    generate_embedding_matrix()
    generate_test_train()
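The test_load helper mentioned in 4.3 is not shown in the listing above. A minimal sketch of what it could look like, assuming the same save_dir and file names used in the code above:

def test_load():
    # load and sanity-check every artifact produced above
    train_data = np.load(save_dir + '/trainData.npz')
    print('trainData:', train_data['x'].shape, train_data['y'].shape)

    train_set = np.load(save_dir + '/train.npz')
    test_set = np.load(save_dir + '/test.npz')
    print('train:', train_set['x'].shape, 'test:', test_set['x'].shape)

    small_word_index = np.load(save_dir + '/small_word_index.npy', allow_pickle=True)
    print('vocabulary size:', len(small_word_index.item()))

    embedding_matrix = np.load(save_dir + '/embedding_matrix.npy')
    print('embedding matrix:', embedding_matrix.shape)  # expected (vocab_size + 1, 50)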


5. Training and Testing the Model

Load the datasets train.npz and test.npz and the embedding matrix embedding_matrix.npy.

For the meaning of the LSTM layer's parameters, see the official documentation.
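As a quick orientation before the full script (a minimal sketch only; the official docs cover the complete signature), these are the constructor arguments the model below relies on:

import tensorflow.keras.layers as layers

# units=128         -> size of the hidden state / output vector
# dropout=0.5       -> dropout applied to the layer's inputs
# return_sequences  -> False (the default): return only the last time step's output
lstm_layer = layers.LSTM(128, dropout=0.5, return_sequences=False)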

import os
import numpy as np
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
import tensorflow as tf
import time

root_folder = './lstm8'
os.makedirs(root_folder, exist_ok=True)  # make sure the checkpoint/log directory exists
def get_dataset():
    train_set = np.load('./train_data_new1/train.npz')
    X_train = train_set['x']
    y_train = train_set['y']
    test_set = np.load('./train_data_new1/test.npz')
    X_test = test_set['x']
    y_test = test_set['y']

    print("X_train:", X_train.shape)
    print("y_train:", y_train.shape)
    print("X_test:", X_test.shape)
    print("y_test:", y_test.shape)
    return X_train, y_train, X_test, y_test

def lstm_model(use_pretrained_wv =True):
    if use_pretrained_wv:
        embedding_matrix = np.load('./train_data_new1/embedding_matrix.npy')
        model = keras.Sequential([
            layers.Embedding(input_dim=30001, output_dim=50, input_length=200 , weights=[embedding_matrix]),
            #layers.LSTM(128),
            #layers.Bidirectional(layers.GRU(128, dropout=0.5)),
            layers.LSTM(128,dropout=0.5),
            #layers.Dropout(0.5),
            layers.Dense(2, activation='softmax')
        ])
    else:
        model = keras.Sequential([
            layers.Embedding(input_dim=30001, output_dim=50, input_length=200),
            layers.LSTM(100),
            layers.Dropout(0.5),
            layers.Dense(2, activation='softmax')
        ])

    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.SparseCategoricalCrossentropy(),
                metrics=[keras.metrics.SparseCategoricalAccuracy()])
    model.summary()
    return model

current_max_loss = 9999  # best (lowest) validation loss seen so far
def train_my_model(model, X_train, y_train):
    weight_dir = root_folder + '/model.h5'

    if os.path.isfile(weight_dir):
        print('load weight')
        model.load_weights(weight_dir)

    def save_weight(epoch, logs):
        global current_max_loss
        if(logs['val_loss'] is not None and  logs['val_loss']< current_max_loss):
            current_max_loss = logs['val_loss']
            print('save_weight', epoch, current_max_loss)
            model.save_weights(weight_dir)


    batch_print_callback = keras.callbacks.LambdaCallback(
        on_epoch_end=save_weight
    )
    callbacks = [
        tf.keras.callbacks.EarlyStopping(patience=4, monitor='loss'),
        batch_print_callback,
        tf.keras.callbacks.TensorBoard(log_dir=root_folder + '/logs')
    ]
    begin = time.time()
    history = model.fit(X_train, y_train, batch_size=128, epochs=25,validation_split=0.1, callbacks= callbacks)
    finish = time.time()
    print("train time: ", (finish - begin), 's')
    import matplotlib.pyplot as plt
    plt.plot(history.history['sparse_categorical_accuracy'])
    plt.plot(history.history['val_sparse_categorical_accuracy'])
    plt.legend(['sparse_categorical_accuracy', 'val_sparse_categorical_accuracy'], loc='upper left')
    plt.show()

def test_my_module(model, X_test, y_test):
    weight_dir = root_folder + '/model.h5'
    if os.path.isfile(weight_dir):
        print('load weight')
        model.load_weights(weight_dir)
    test_result = model.evaluate(X_test, y_test)
    print('test Result', test_result)


def predict_my_module(model):
    small_word_index = np.load('./train_data_new1/small_word_index.npy', allow_pickle=True)

    review_index = np.zeros((1, 200), dtype=int)
    review = "I don't like it"
    #review = "this is bad movie "
    #review = "This is good movie"
    #review = "This isn‘t good movie"
    #review = "i think this is bad movie"
    counter = 0
    for word in review.split():
        try:
            print(word, small_word_index.item()[word])
            review_index[0][counter] = small_word_index.item()[word]
            counter = counter + 1
        except Exception:
            print('Word error', word)
    print(review_index.shape)
    s = model.predict(x=review_index)
    print(s)

if __name__ == '__main__':
    X_train, y_train, x_test, y_test = get_dataset()
    model = lstm_model()
    train_my_model(model, X_train, y_train)
    test_my_module(model,x_test, y_test)
    #predict_my_module(model)
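A design note: the hand-written save_weight callback above keeps the weights of the epoch with the lowest validation loss. Keras' built-in ModelCheckpoint callback does the same bookkeeping; a sketch of the equivalent, reusing the paths defined above:

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=root_folder + '/model.h5',   # same weight file as above
    monitor='val_loss',
    save_best_only=True,                  # only overwrite when val_loss improves
    save_weights_only=True)
# pass it in the callbacks list handed to model.fit(...) instead of batch_print_callback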


Results on the training and validation sets


Accuracy on the test set: 0.8983333

14912/15000 [============================>.] - ETA: 0s - loss: 0.2592 - sparse_categorical_accuracy: 0.8985
15000/15000 [==============================] - 2s 134us/sample - loss: 0.2595 - sparse_categorical_accuracy: 0.8983
test Result [0.25950324280261994, 0.8983333]

Without dropout, test-set accuracy is around 87%.

6. Prediction

You can write a few movie reviews of your own and try them out.

def predict_my_module(model):
    weight_dir = root_folder + '/model.h5'

    if os.path.isfile(weight_dir):
        print('load weight')
        model.load_weights(weight_dir)
    else:
        print("model doesn't exit")
        return

    small_word_index = np.load('./train_data_new1/small_word_index.npy', allow_pickle=True)

    review_index = np.zeros((1, 200), dtype=int)
    review = "I don't like it"
    #review = "this is bad movie "
    #review = "This is good movie"
    #review = "This isn‘t good movie"
    #review = "i think this is bad movie"
    counter = 0
    for word in review.split():
        try:
            print(word, small_word_index.item()[word])
            review_index[0][counter] = small_word_index.item()[word]
            counter = counter + 1
        except Exception:
            print('Word error', word)
    print(review_index.shape)
    s = model.predict(x=review_index)
    print(s)

Prediction result

[[0.42191893 0.5780811 ]]
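The two numbers are the softmax scores over the classes in label order [negative, positive] (get_data built the labels as 0 = neg, 1 = pos), so this review is scored as weakly positive. A small sketch of turning the output into a label; note that, strictly speaking, the review text should also go through the same process()/clean_str cleanup used for training before the word lookup:

import numpy as np

probs = np.array([[0.42191893, 0.5780811]])   # the model.predict output shown above
label = int(np.argmax(probs, axis=1)[0])      # 0 = negative, 1 = positive
print('positive' if label == 1 else 'negative')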

7. Training with the IMDB Dataset Bundled with TensorFlow

If you just want to train a model, you can use the IMDB dataset that ships with TensorFlow directly.

import tensorflow.keras as keras
import tensorflow.keras.layers as layers

num_words = 30000
maxlen = 200
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words)
#print(len(x_train[0]))
#print(x_train[0])
print(x_train.shape, ' ', y_train.shape)
print(x_test.shape, ' ', y_test.shape)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen, padding='post')
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen, padding='post')
#print(x_train[0])
print(x_train.shape, ' ', y_train.shape)
print(x_test.shape, ' ', y_test.shape)

def lstm_model():
    model = keras.Sequential([
        layers.Embedding(input_dim=30000, output_dim=32, input_length=maxlen),
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(1, activation='sigmoid', return_sequences=False)
    ])
    model.compile(optimizer=keras.optimizers.Adam(),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
    return model
model = lstm_model()
model.summary()

history = model.fit(x_train, y_train, batch_size=64, epochs=10,validation_split=0.1)

import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training', 'validation'], loc='upper left')
plt.show()
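One design note on this model: using LSTM(1, activation='sigmoid') as the final layer does work here, but a more conventional head for binary classification would be an LSTM followed by a Dense(1, sigmoid) layer, e.g. (a sketch, with the other hyperparameters kept as above):

model = keras.Sequential([
    layers.Embedding(input_dim=30000, output_dim=32, input_length=maxlen),
    layers.LSTM(32),                        # return only the last time step
    layers.Dense(1, activation='sigmoid')   # binary classification head
])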

8. Training with TextCNN (Convolutional Network)

The results are quite good too: test-set accuracy is also above 87%.
For details, see my post below:
[Deep Learning in Practice] Sentiment Analysis with a Convolutional Neural Network (TextCNN), with code

9. Comparison of Training Results and Summary

| Training setup | Train acc. | Validation acc. | Test acc. | Time |
| --- | --- | --- | --- | --- |
| word2vec (GloVe) vectors as embedding weights, frozen | 98.30% | 88.78% | 87.24% | fairly long |
| No pretrained embedding | 80.58% | 78.54% | 77.54% | longest |
| No pretrained embedding (with BatchNormalization) | 99.58% | 84.80% | 89.14% | 1347 s |
| TF Keras built-in imdb dataset | 98.46% | 87.68% | 85.14% | 464 s |
| word2vec (GloVe) vectors as embedding weights, fine-tuned | ? | ? |  | fastest |

10. Recurrent Neural Network Series

1. RNN series: Introduction to RNN and LSTM
2. RNN series: Summary and understanding of word2vector
3. RNN series: The IMDB movie review dataset, introduction and download
4. RNN series: Text sentiment analysis (data preprocessing, training, prediction)
5. RNN series: Comparing CNN and RNN-LSTM text sentiment analysis with TensorFlow
6. Sentiment analysis with a convolutional neural network (TextCNN), with code

11. References

https://www.oreilly.com/content/perform-sentiment-analysis-with-lstms-using-tensorflow/
https://zhuanlan.zhihu.com/p/63852350
