I tried importing a Typora document directly into CSDN and it turned out to be very convenient, so I converted a previously uploaded resource into this blog post to make it easier to read. Note: the code requires gensim 4.1.2; other versions will raise errors.
The dataset comes from the IMDB sentiment analysis data provided by the Kaggle competition "Bag of Words Meets Bags of Popcorn". The labeled training set contains 25,000 movie reviews: 12,500 positive and 12,500 negative.
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
The main keywords in the positive reviews are shown in the figure below.
The specific top words and their counts are as follows.
The main keywords in the negative reviews are shown in the figure below.
The specific top words and their counts are as follows.
From the four figures above it is easy to see that the top words in positive and negative reviews largely overlap; only a few words such as "well" and "love" are specific to positive reviews, and "bad" to negative reviews. This overlap makes it easy to confuse the sentiment of a text in the later classification stage, which indirectly shows that this classification task is quite challenging.
In addition, the reviews in this dataset are relatively long overall, as shown in the figure.
The minimum, maximum, median, and mean review lengths are shown in the figure below.
From the two figures above we can see that review lengths are mostly concentrated between 50 and 200 words, which gives us an empirical basis for choosing max_len in the later models.
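For reference, the length statistics above can be reproduced with a short script along the following lines (a minimal sketch; it assumes the Kaggle file labeledTrainData.tsv and counts whitespace-separated tokens after stripping HTML):

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Load the labeled training set provided by the Kaggle competition
train = pd.read_csv('../word2vec-nlp-tutorial/labeledTrainData.tsv',
                    header=0, delimiter='\t', quoting=3)

# Review length = number of whitespace-separated tokens after stripping HTML tags
lengths = np.array([len(BeautifulSoup(review, 'html.parser').get_text().split())
                    for review in train['review']])

print('min:', lengths.min())
print('max:', lengths.max())
print('median:', np.median(lengths))
print('mean:', lengths.mean())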
In addition, the test set for this experiment is the 25,000 reviews provided by the Kaggle competition, while the labeled dataset described above is used entirely for training.
The text above is one example of the 25,000 reviews.
Two approaches are used to clean the data. The first uses the BeautifulSoup class from the Python package bs4 to strip the <br /><br /> tags from the text, removes every character that is not an English letter, splits the text on whitespace, and finally removes stop words. The second likewise uses BeautifulSoup from bs4 to strip the <br /><br /> tags and then removes stop words, but keeps punctuation and other special symbols.
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords


def tokenizer(reviews):
    Words = []
    for review in reviews:
        # Strip HTML tags such as <br /><br />
        review_text = BeautifulSoup(review, 'html.parser').get_text()
        # Remove everything that is not an English letter (punctuation, digits, ...)
        review_text = re.sub('[^a-zA-Z]', ' ', review_text)
        # Lowercase and split on whitespace
        words = review_text.lower().split()
        stops = set(stopwords.words('english'))
        # Remove stop words
        words = [w for w in words if w not in stops]
        Words.append(words)
    return Words
Two feature extraction methods are used in this experiment: the traditional TF-IDF approach, and word2vec, a shallow two-layer neural network model.
Due to hardware limitations, the TF-IDF features only keep words with a frequency above 500, which results in feature vectors of dimension 1,648. For a detailed introduction to the TF-IDF algorithm, please refer to the resource mentioned earlier.
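In scikit-learn terms this corresponds to a TfidfVectorizer with min_df=500, consistent with the training script further below (a minimal sketch; texts is assumed to be a list of cleaned review strings):

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only words that appear in at least 500 reviews;
# on this corpus that yields 1,648 features.
tfidf = TfidfVectorizer(min_df=500)
vectors = tfidf.fit_transform(texts)   # texts: list of cleaned review strings
print(vectors.shape)                   # sparse matrix, 1,648 columns on this corpus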
The main feature extraction method in this experiment is the word2vec model. The word vector dimension is set to 100 and 200 respectively, in order to find the better-performing configuration. For a detailed introduction to the word2vec algorithm, please refer to the resource mentioned earlier.
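The corresponding gensim (4.x) training call, as used in the scripts below, looks roughly like this (sentences is assumed to be a list of token lists produced by tokenizer):

from gensim.models.word2vec import Word2Vec

# vector_size is set to 100 or 200 in the experiments
model = Word2Vec(vector_size=100, min_count=10, window=7, workers=4, epochs=1)
model.build_vocab(sentences)   # sentences: list of token lists
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
model.save('model/Word2vec_model_100.pkl')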
The evaluation metric for this task is AUC, defined as the area enclosed by the ROC curve and the coordinate axes, where ROC stands for receiver operating characteristic curve.
The ROC curve is drawn from the samples' true labels and predicted probabilities. Specifically, the x axis of the ROC curve is the false positive rate (FPR) and the y axis is the true positive rate (TPR). So what exactly are the true and false positive rates? In a binary classification problem each sample belongs to one of two classes, which we denote 0 and 1, or equivalently negative and positive. When a classifier makes a probabilistic prediction, a sample whose true label is 0 may be predicted as 0 or 1, and likewise a sample whose true label is 1 may be predicted as 0 or 1, giving four possible outcomes, as shown in the table below.
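|                       | Predicted positive (1) | Predicted negative (0) |
| --------------------- | ---------------------- | ---------------------- |
| Actually positive (1) | TP (true positive)     | FN (false negative)    |
| Actually negative (0) | FP (false positive)    | TN (true negative)     |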
As the table shows, TP is the number of samples predicted positive whose true label is positive; FN is the number predicted negative whose true label is positive; FP is the number predicted positive whose true label is negative; TN is the number predicted negative whose true label is negative. These four counts form a matrix called the confusion matrix. How, then, do we use the confusion matrix to compute the ROC curve?
First, we need to define the following two quantities:
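In terms of the confusion-matrix counts:

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}$$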
FPR is the fraction of all negative samples that are predicted as positive, hence the name false positive rate. It tells us, for a randomly chosen negative sample, the probability that it will be predicted as positive. Naturally we want FPR to be as small as possible.
TPR is the fraction of all positive samples that are predicted as positive, hence the name true positive rate. It tells us, for a randomly chosen positive sample, the probability that it will be predicted as positive. Naturally we want TPR to be as large as possible. Plotting FPR on the x axis and TPR on the y axis gives the coordinate system shown below:
FPR = 0 means FP = 0, i.e. there are no false positives; TPR = 1 means FN = 0, i.e. there are no false negatives. So, as the figure shows, the closer a point is to the top-left corner, the better the model's predictions; reaching the top-left corner itself would be the perfect result.
We know that a binary (0/1) classifier usually outputs a probability that the sample belongs to class 1. How do we then decide whether an input x belongs to class 0 or class 1? We need a threshold: predictions above the threshold are assigned to class 1 and predictions below it to class 0. Different thresholds therefore produce different classifications, different confusion matrices, and hence different FPR and TPR values. As the threshold moves from 0 to 1, we obtain a whole set of (FPR, TPR) pairs; plotting them in the coordinate system above gives the ROC curve.
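As a quick illustration, scikit-learn builds the same curve directly from labels and predicted probabilities; a minimal sketch on made-up toy data:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Each threshold yields one (FPR, TPR) point; together the points form the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, fpr, tpr)))

# AUC is the area under that curve
print('AUC =', roc_auc_score(y_true, y_score))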
The advantage of AUC is that it takes into account the classifier's ability to separate both positive and negative samples, so it still gives a reasonable evaluation when the classes are imbalanced. For example, in fraud detection, let fraud cases be the positive class and suppose they make up only 0.1% of the data. If we evaluated with accuracy, predicting every sample as negative would already achieve 99.9% accuracy. With AUC, however, predicting everything as negative makes TPR and FPR both 0; connecting that point with (0, 0) and (1, 1) gives an AUC of only 0.5, so AUC successfully avoids the problem caused by class imbalance.
Four classification algorithms are implemented in this experiment: Bi-LSTM, TextCNN, CNN+Bi-LSTM, and a support vector machine (SVM).
The penalty coefficient C, the kernel type kernel, and the kernel coefficient gamma are tuned with the GridSearchCV method from the Python package scikit-learn. In addition, the experiment compares the traditional TF-IDF features against word2vec features, and compares removing versus keeping punctuation during preprocessing.
word2vec + SVM:
# -*- coding: utf-8 -*-
# @File : train_svm.py
import codecs
import csv
import multiprocessing
import re

import joblib
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
from nltk.corpus import stopwords
from sklearn import svm
from sklearn.model_selection import GridSearchCV

cpu_count = multiprocessing.cpu_count()
vocab_dim = 100      # word vector dimension
n_iterations = 1
n_exposures = 10     # keep only words that occur at least 10 times
window_size = 7


def loadfile():
    train_data = pd.read_csv('../word2vec-nlp-tutorial/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    test_data = pd.read_csv('../word2vec-nlp-tutorial/testData.tsv', header=0, delimiter='\t', quoting=3)
    unlabeled = pd.read_csv('../word2vec-nlp-tutorial/unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    combined = np.concatenate((train_data['review'], test_data['review'], unlabeled['review']))
    return combined


# Tokenize the reviews. This variant keeps punctuation and special symbols;
# the alternative shown earlier removes everything except English letters
# with re.sub("[^a-zA-Z]", " ", review_text).
def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()   # strip HTML tags
        # keep letters, digits and the punctuation/special symbols that occur in the corpus
        review_text = re.sub("[^a-zA-Z0-9.?!,'\\\()\-\"/&#@=<>\[\]%:;ò{}★çÉ’‘:;ıäßýÖÊ…ìáמ∧ºğ«û+¾½§“”()$¨ÞåðÁčöÈíŻÄÜÐ]", " ", review_text)
        words = review_text.lower().split()                             # lowercase and split on whitespace
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]                    # remove stop words
        Words.append(words)
    return Words


def word2vec_train(combined):
    model = Word2Vec(vector_size=vocab_dim, min_count=n_exposures, window=window_size,
                     workers=cpu_count, epochs=n_iterations)
    model.build_vocab(combined)   # input: list of token lists
    model.train(combined, total_examples=model.corpus_count, epochs=model.epochs)
    model.save('model/Word2vec_model_100_punc.pkl')


# Sentence vector = element-wise average of the word vectors
def fea_sentence(list_w):
    n0 = np.array([0. for i in range(vocab_dim)], dtype=np.float32)
    for i in list_w:
        n0 += i
    fe = n0 / len(list_w)
    return fe.tolist()


def parse_dataset(x_data, word2vec):
    xVec = []
    for x in x_data:
        sentence = []
        for word in x:
            if word in word2vec.wv.key_to_index:
                sentence.append(word2vec.wv[word])                  # in-vocabulary word -> its vector
            else:
                sentence.append([0. for i in range(vocab_dim)])     # out-of-vocabulary word -> zero vector
        xVec.append(fea_sentence(sentence))
    return np.array(xVec)


def get_data(word2vec):
    neg_train = pd.read_csv('data/neg_train.csv', header=None, index_col=None)
    pos_train = pd.read_csv('data/pos_train.csv', header=None, index_col=None)
    x_train = np.concatenate((neg_train[0], pos_train[0]))
    x_train = tokenizer(x_train)
    x_train = parse_dataset(x_train, word2vec)
    y_train = np.concatenate((np.zeros(len(neg_train), dtype=int), np.ones(len(pos_train), dtype=int)))

    x_test = pd.read_csv('data/test_data.csv', header=None, index_col=None)
    x_test = tokenizer(x_test[0])
    x_test = parse_dataset(x_test, word2vec)
    return x_train, y_train, x_test


def train_svm(x_train, y_train):
    svr = svm.SVC(verbose=True)
    parameters = {'kernel': ('linear', 'rbf'),
                  'C': [1, 2, 4],
                  'gamma': [0.125, 0.25, 0.5, 1, 2, 4]}
    clf = GridSearchCV(svr, parameters, scoring='f1')
    clf.fit(x_train, y_train)
    print('Best parameters:')
    # without punctuation: {'kernel': 'rbf', 'C': 4, 'gamma': 0.125}
    print(clf.best_params_)   # with punctuation kept: {'C': 4, 'gamma': 0.25, 'kernel': 'rbf'}
    # alternatively, refit directly with the best parameters:
    # clf = svm.SVC(kernel='rbf', C=4, gamma=0.125, verbose=True)
    # clf.fit(x_train, y_train)
    print('Saving model...')
    joblib.dump(clf, 'model/svm_100_punc.pkl')


if __name__ == '__main__':
    # Train the models and save them
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    # The lines above can be commented out if the word2vec model has already been trained
    print('Loading word2vec model...')
    word2vec = Word2Vec.load('model/Word2vec_model_100_punc.pkl')
    print('Converting data into the model input format...')
    x_train, y_train, x_test = get_data(word2vec)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training SVM model...')
    train_svm(x_train, y_train)
    print('Loading SVM model...')
    model = joblib.load('model/svm_100_punc.pkl')
    y_pred = model.predict(x_test)

    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id), len(y_pred))
    f = codecs.open('data/Submission_svm_100_punc.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], y_pred[i]])
    f.close()
TF-IDF + SVM:
# -*- coding: utf-8 -*-
# @File : train_svm_tfidf.py
import csv
import re

import joblib
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV


# Tokenize the reviews. This variant keeps punctuation and special symbols and
# joins the tokens back into a single string, as required by TfidfVectorizer;
# the alternative shown earlier removes everything except English letters.
def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()   # strip HTML tags
        review_text = re.sub("[^a-zA-Z0-9.?!,'\\\()\-\"/&#@=<>\[\]%:;ò{}★çÉ’‘:;ıäßýÖÊ…ìáמ∧ºğ«û+¾½§“”()$¨ÞåðÁčöÈíŻÄÜÐ]", " ", review_text)
        words = review_text.lower().split()                             # lowercase and split on whitespace
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]                    # remove stop words
        Words.append(' '.join(words))
    return Words


def parse_dataset(x_data):
    x_data = tokenizer(x_data)
    # min_df=100 -> (25000, 6110), too large; min_df=200 -> (25000, 3614)
    tfidfVectorizer = TfidfVectorizer(min_df=500)      # (25000, 1648)
    vectors = tfidfVectorizer.fit_transform(x_data)    # fit and transform the whole corpus
    print(vectors.shape)
    return vectors


def get_data():
    neg_train = pd.read_csv('data/neg_train.csv', header=None, index_col=None)
    pos_train = pd.read_csv('data/pos_train.csv', header=None, index_col=None)
    x_test = pd.read_csv('data/test_data.csv', header=None, index_col=None)
    y_train = np.concatenate((np.zeros(len(neg_train), dtype=int), np.ones(len(pos_train), dtype=int)))
    x = np.concatenate((neg_train[0], pos_train[0], x_test[0]))
    x = parse_dataset(x)
    x_train = x[:-len(x_test[0])]
    x_test = x[-len(x_test[0]):]
    return x_train, y_train, x_test


def train_svm(x_train, y_train):
    svr = svm.SVC(verbose=True)
    parameters = {'C': [1, 2, 4], 'gamma': [0.5, 1, 2]}
    clf = GridSearchCV(svr, parameters, scoring='f1')
    clf.fit(x_train, y_train)
    print('Best parameters:')
    print(clf.best_params_)   # {'C': 4, 'gamma': 2}
    # alternatively, refit directly with chosen parameters:
    # clf = svm.SVC(kernel='rbf', C=1, gamma=1, verbose=True)
    # clf.fit(x_train, y_train)
    print('Saving model...')
    joblib.dump(clf, 'model/svm_tfidf_punc.pkl')


if __name__ == '__main__':
    print('Extracting features...')
    x_train, y_train, x_test = get_data()
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training SVM model...')
    train_svm(x_train, y_train)
    print('Loading SVM model...')
    model = joblib.load('model/svm_tfidf_punc.pkl')
    y_pred = model.predict(x_test)

    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id), len(y_pred))
    f = open('data/Submission_svm_tfidf_punc.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], y_pred[i]])
    f.close()
word2vec + Bi-LSTM:
# -*- coding: utf-8 -*-
# @File : train_bilstm.py
import codecs
import csv
import multiprocessing
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras.models import Sequential, load_model
from keras.layers import Bidirectional
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense
from keras.utils import np_utils

cpu_count = multiprocessing.cpu_count()
vocab_dim = 200      # word vector dimension
n_iterations = 1
n_exposures = 10     # keep only words that occur at least 10 times
window_size = 7
n_epoch = 30
maxlen = 100         # maximum sentence length
batch_size = 64


def loadfile():
    train_data = pd.read_csv('../word2vec-nlp-tutorial/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    test_data = pd.read_csv('../word2vec-nlp-tutorial/testData.tsv', header=0, delimiter='\t', quoting=3)
    unlabeled = pd.read_csv('../word2vec-nlp-tutorial/unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
    combined = np.concatenate((train_data['review'], test_data['review'], unlabeled['review']))
    return combined


# Tokenize the reviews: strip HTML, keep only English letters, lowercase,
# split on whitespace and remove stop words.
def tokenizer(reviews):
    Words = []
    for review in reviews:
        review_text = BeautifulSoup(review, 'html.parser').get_text()
        review_text = re.sub("[^a-zA-Z]", " ", review_text)
        words = review_text.lower().split()
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
        Words.append(words)
    return Words


def word2vec_train(combined):
    model = Word2Vec(vector_size=vocab_dim, min_count=n_exposures, window=window_size,
                     workers=cpu_count, epochs=n_iterations)
    model.build_vocab(combined)   # input: list of token lists
    model.train(combined, total_examples=model.corpus_count, epochs=model.epochs)
    model.save('model/Word2vec_model_200.pkl')


def create_dictionaries(model=None):
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(model.wv.key_to_index, allow_update=True)
    # index 0 is reserved for low-frequency words, hence k + 1
    w2indx = {v: k + 1 for k, v in gensim_dict.items()}          # index of every word with frequency >= 10
    with open('word2index.txt', 'w', encoding='utf8') as f:
        for key in w2indx:
            f.write(str(key) + ' ' + str(w2indx[key]) + '\n')
    w2vec = {word: model.wv[word] for word in w2indx.keys()}     # word vector of every word with frequency >= 10
    return w2indx, w2vec


def parse_dataset(combined, w2indx):
    data = []
    for sentence in combined:
        new_txt = []
        for word in sentence:
            try:
                new_txt.append(w2indx[word])
            except KeyError:
                new_txt.append(0)   # words with frequency < 10 are mapped to index 0
        data.append(new_txt)
    # pad or truncate every sentence to maxlen indices
    data = sequence.pad_sequences(data, maxlen=maxlen)
    return data


def get_data(index_dict, word_vectors):
    n_symbols = len(index_dict) + 1                        # +1 because index 0 is reserved for rare words
    embedding_weights = np.zeros((n_symbols, vocab_dim))   # row 0 (rare words) stays all-zero
    for word, index in index_dict.items():                 # fill rows 1.. with the corresponding word vectors
        embedding_weights[index, :] = word_vectors[word]

    neg_train = pd.read_csv('data/neg_train.csv', header=None, index_col=None)
    pos_train = pd.read_csv('data/pos_train.csv', header=None, index_col=None)
    x_train = np.concatenate((neg_train[0], pos_train[0]))
    x_train = tokenizer(x_train)
    x_train = parse_dataset(x_train, index_dict)
    y_train = np.concatenate((np.zeros(len(neg_train), dtype=int), np.ones(len(pos_train), dtype=int)))
    y_train = np_utils.to_categorical(y_train, num_classes=2)   # one-hot labels, shape [len(y), 2]

    x_test = pd.read_csv('data/test_data.csv', header=None, index_col=None)
    x_test = tokenizer(x_test[0])
    x_test = parse_dataset(x_test, index_dict)
    return n_symbols, embedding_weights, x_train, y_train, x_test


# Define the network structure
def train_bilstm(n_symbols, embedding_weights, x_train, y_train):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                        weights=[embedding_weights], input_length=maxlen))
    model.add(Bidirectional(LSTM(units=50, dropout=0.5, activation='tanh')))
    model.add(Dense(2, activation='softmax'))   # fully connected output layer, 2 classes
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch, verbose=2)
    model.save('model/bilstm_100_200.h5')


if __name__ == '__main__':
    # Train the models and save them
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    # The lines above can be commented out if the word2vec model has already been trained
    print('Loading word2vec model...')
    word2vec = Word2Vec.load('model/Word2vec_model_200.pkl')
    print('Building dictionaries...')
    index_dict, word_vectors = create_dictionaries(model=word2vec)
    print('Converting data into the model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test = get_data(index_dict, word_vectors)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training Bi-LSTM model...')
    train_bilstm(n_symbols, embedding_weights, x_train, y_train)
    print('Loading Bi-LSTM model...')
    model = load_model('model/bilstm_100_200.h5')
    y_pred = model.predict(x_test)

    # Convert the softmax probabilities into 0/1 predictions
    test_result = [int(np.argmax(probs)) for probs in y_pred]

    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id), len(test_result))
    f = codecs.open('data/Submission_bilstm_100_200.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], test_result[i]])
    f.close()
word2vec + TextCNN:
# -*- coding: utf-8 -*-
# @File : train_textcnn.py
import codecs
import csv
import multiprocessing
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras import Input, Model
from keras.preprocessing import sequence
from keras.models import load_model
from keras.layers import Conv1D, MaxPooling1D, concatenate
from keras.layers.embeddings import Embedding
from keras.layers.core import Dense, Dropout, Flatten
from keras.utils import np_utils

cpu_count = multiprocessing.cpu_count()
vocab_dim = 200      # word vector dimension
n_iterations = 1
n_exposures = 10     # keep only words that occur at least 10 times
window_size = 7
n_epoch = 30
maxlen = 200         # maximum sentence length
batch_size = 64

# loadfile, tokenizer, word2vec_train, create_dictionaries, parse_dataset and get_data
# are identical to those in train_bilstm.py above (the word2vec model is likewise
# saved to 'model/Word2vec_model_200.pkl').


def train_cnn(n_symbols, embedding_weights, x_train, y_train):
    # Architecture: embedding -> three conv+pool branches -> concatenate -> flatten -> dropout -> dense
    main_input = Input(shape=(maxlen,), dtype='float64')
    # Embedding layer initialised with the pre-trained word vectors
    embedder = Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                         input_length=maxlen, weights=[embedding_weights])
    embed = embedder(main_input)
    # Convolution kernels of size 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=38)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=37)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=36)(cnn3)
    # Concatenate the outputs of the three branches
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    drop = Dropout(0.5)(flat)
    main_output = Dense(2, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/textcnn_200_200.h5')


if __name__ == '__main__':
    # Train the models and save them
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    # The lines above can be commented out if the word2vec model has already been trained
    print('Loading word2vec model...')
    word2vec = Word2Vec.load('model/Word2vec_model_200.pkl')
    print('Building dictionaries...')
    index_dict, word_vectors = create_dictionaries(model=word2vec)
    print('Converting data into the model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test = get_data(index_dict, word_vectors)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training TextCNN model...')
    train_cnn(n_symbols, embedding_weights, x_train, y_train)
    print('Loading TextCNN model...')
    model = load_model('model/textcnn_200_200.h5')
    y_pred = model.predict(x_test)

    # Convert the softmax probabilities into 0/1 predictions
    test_result = [int(np.argmax(probs)) for probs in y_pred]

    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id), len(test_result))
    f = codecs.open('data/Submission_textcnn_200_200.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], test_result[i]])
    f.close()
word2vec + CNN + Bi-LSTM:
# -*- coding: utf-8 -*-
# @File : train_cnn_bilstm.py
import codecs
import csv
import multiprocessing
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras import Input, Model
from keras.preprocessing import sequence
from keras.models import load_model
from keras.layers import Conv1D, Bidirectional
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Flatten
from keras.utils import np_utils

cpu_count = multiprocessing.cpu_count()
vocab_dim = 100      # word vector dimension
n_iterations = 1
n_exposures = 10     # keep only words that occur at least 10 times
window_size = 7
n_epoch = 30
maxlen = 100         # maximum sentence length
batch_size = 64

# loadfile, tokenizer, word2vec_train, create_dictionaries, parse_dataset and get_data
# are identical to those in train_bilstm.py above, except that word2vec_train here
# saves the model to 'model/Word2vec_model_100.pkl'.


def train_cnn_bilstm(n_symbols, embedding_weights, x_train, y_train):
    # Architecture: embedding -> convolution -> Bi-LSTM -> flatten -> dense
    main_input = Input(shape=(maxlen,), dtype='float64')
    # Embedding layer initialised with the pre-trained word vectors
    embedder = Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                         input_length=maxlen, weights=[embedding_weights])
    embed = embedder(main_input)
    cnn = Conv1D(64, 3, padding='same', strides=1, activation='relu')(embed)
    bilstm = Bidirectional(LSTM(units=50, dropout=0.5, activation='tanh', return_sequences=True))(cnn)
    flat = Flatten()(bilstm)
    main_output = Dense(2, activation='softmax')(flat)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/cnnbilstm_100_100.h5')


if __name__ == '__main__':
    # Train the models and save them
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    # The lines above can be commented out if the word2vec model has already been trained
    print('Loading word2vec model...')
    word2vec = Word2Vec.load('model/Word2vec_model_100.pkl')
    print('Building dictionaries...')
    index_dict, word_vectors = create_dictionaries(model=word2vec)
    print('Converting data into the model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test = get_data(index_dict, word_vectors)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training CNN + Bi-LSTM model...')
    train_cnn_bilstm(n_symbols, embedding_weights, x_train, y_train)
    print('Loading CNN + Bi-LSTM model...')
    model = load_model('model/cnnbilstm_100_100.h5')
    y_pred = model.predict(x_test)

    # Convert the softmax probabilities into 0/1 predictions
    test_result = [int(np.argmax(probs)) for probs in y_pred]

    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id), len(test_result))
    f = codecs.open('data/Submission_cnnbilstm_100_100.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], test_result[i]])
    f.close()
word2vec + Bi-LSTM + Attention:
# -*- coding: utf-8 -*-
# @File : train_bilstm_attention.py
import codecs
import csv
import multiprocessing
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras import backend as K
from keras.models import Sequential, load_model
from keras.layers import Layer, Bidirectional
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense
from keras.utils import np_utils

# Eager execution is enabled by default in TensorFlow 2.x;
# disable it so the custom layer below works with the K backend ops.
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

cpu_count = multiprocessing.cpu_count()
vocab_dim = 100      # word vector dimension
ATT_SIZE = 50        # attention size
n_iterations = 1
n_exposures = 10     # keep only words that occur at least 10 times
window_size = 7
n_epoch = 30
maxlen = 100         # maximum sentence length
batch_size = 64

# loadfile, tokenizer, word2vec_train, create_dictionaries, parse_dataset and get_data
# are identical to those in train_bilstm.py above, except that word2vec_train here
# saves the model to 'model/Word2vec_model.pkl'.


# Custom attention layer
class AttentionLayer(Layer):
    def __init__(self, attention_size=None, **kwargs):
        self.attention_size = attention_size
        super(AttentionLayer, self).__init__(**kwargs)

    def get_config(self):
        config = super().get_config()
        config['attention_size'] = self.attention_size
        return config

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.time_steps = input_shape[1]
        hidden_size = input_shape[2]
        if self.attention_size is None:
            self.attention_size = hidden_size
        self.W = self.add_weight(name='att_weight', shape=(hidden_size, self.attention_size),
                                 initializer='uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        self.V = self.add_weight(name='att_var', shape=(self.attention_size,),
                                 initializer='uniform', trainable=True)
        super(AttentionLayer, self).build(input_shape)

    def call(self, inputs):
        # Score each time step against the context vector V and return the weighted sum
        V = K.reshape(self.V, (-1, 1))
        H = K.tanh(K.dot(inputs, self.W) + self.b)
        score = K.softmax(K.dot(H, V), axis=1)
        outputs = K.sum(score * inputs, axis=1)
        return outputs

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[2]


# Define the network structure
def train_bilstm_att(n_symbols, embedding_weights, x_train, y_train, ATT_SIZE):
    print('Defining a simple Keras model...')
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                        weights=[embedding_weights], input_length=maxlen))
    model.add(Bidirectional(LSTM(units=50, dropout=0.5, return_sequences=True)))
    model.add(AttentionLayer(attention_size=ATT_SIZE))
    model.add(Dense(2, activation='softmax'))   # fully connected output layer, 2 classes
    print('Compiling the model...')
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print('Train...')
    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch)
    model.save('model/bilstmAtt_100_05_att50.h5')


if __name__ == '__main__':
    # Train the models and save them
    print('Loading dataset...')
    combined = loadfile()
    print(len(combined))
    print('Preprocessing...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    word2vec_train(combined)
    # The lines above can be commented out if the word2vec model has already been trained
    print('Loading word2vec model...')
    word2vec = Word2Vec.load('model/Word2vec_model.pkl')
    print('Building dictionaries...')
    index_dict, word_vectors = create_dictionaries(model=word2vec)
    print('Converting data into the model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test = get_data(index_dict, word_vectors)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)
    print('Training Bi-LSTM + Attention model...')
    train_bilstm_att(n_symbols, embedding_weights, x_train, y_train, ATT_SIZE)
    print('Loading Bi-LSTM + Attention model...')
    model = load_model('model/bilstmAtt_100_05_att50.h5', custom_objects={'AttentionLayer': AttentionLayer})
    y_pred = model.predict(x_test)

    # Convert the softmax probabilities into 0/1 predictions
    test_result = [int(np.argmax(probs)) for probs in y_pred]

    id = pd.read_csv('../word2vec-nlp-tutorial/sampleSubmission.csv', header=0)['id']
    print(len(id), len(test_result))
    f = codecs.open('data/Submission_bilstmAtt_100_05_att50.csv', 'w', encoding='utf-8')
    writer = csv.writer(f)
    writer.writerow(['id', 'sentiment'])
    for i in range(len(id)):
        writer.writerow([id[i], test_result[i]])
    f.close()
From this experiment we find that the SVM is best suited to this task, followed by the CNN+Bi-LSTM model, and finally the TextCNN and Bi-LSTM models. We also find that training the SVM takes far less time than training the deep learning models.
From the hyper-parameter tuning we find that the word vector dimension has little effect on the final classification performance, while the maximum sentence length matters more. A likely reason is that a 100-dimensional word2vec vector is already sufficient to represent a word, so increasing the dimension changes little; the choice of maximum sentence length, however, determines how much of a short sentence's representation is zero padding and how many words are truncated from a long sentence, which has a much larger impact on how completely a sentence is represented.
We also find that the traditional TF-IDF features lead to better classification performance than the word2vec features. A likely reason is that the TF-IDF sentence vectors have a much higher dimension; although this increases training time, it also captures the meaning of a sentence more completely, so word2vec performs somewhat worse by comparison.
Finally, removing punctuation during preprocessing gives a noticeably better final classification result than keeping it.