
Building an LSTM Neural Network for Text Sentiment Analysis with PyTorch (with Source Code and Dataset)

If you need the source code and dataset, please like, follow, and bookmark, then leave a request in the comments~~~

I. Introduction to Text Sentiment Analysis

Text sentiment analysis is the process of analyzing, processing, and extracting subjective text that carries emotional coloring, using natural language processing and text-mining techniques.

This post focuses on sentiment classification, also called sentiment polarity analysis: given a piece of text, decide whether its subjective content is affirmative or negative, i.e. positive or negative. This is the most studied task in sentiment analysis. The web contains large amounts of both subjective and objective text; objective text describes things factually and carries no emotional coloring or sentiment tendency. Since sentiment classification targets subjective text with a sentiment tendency, the first step is to separate subjective from objective text, mainly by recognizing sentiment words and experimenting with different text feature representations and classifiers. Performing this subjective/objective classification of web text up front improves both the speed and the accuracy of sentiment classification.

II. About the Dataset

This post uses the IMDB dataset, which contains 50,000 strongly polarized reviews collected from the Internet, split into 25,000 reviews for training and 25,000 reviews for testing.

The corresponding download address for the data is given there.
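To make the loading code in Section VI concrete, here is a small sketch (not from the original post; the root path is an assumption) that checks the expected directory layout of the extracted archive:

import os

root = 'aclImdb'  # assumed path to the extracted archive; adjust to your machine

# Each split has a pos/ and a neg/ folder containing 12,500 review files apiece
for seg in ('train', 'test'):
    for label in ('pos', 'neg'):
        folder = os.path.join(root, seg, label)
        print(seg, label, len(os.listdir(folder)))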

III. Data Preprocessing

Because this dataset is quite small, learning word embeddings directly on it risks overfitting, and the resulting model would not generalize well. Instead, a pretrained word embedding is passed in; here we use the GloVe 6B 100-dimensional pretrained vectors.

(Figure: preview of the glove.6B.100d.txt vector file)
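As a quick sanity check, the GloVe file can also be read directly; each line holds a word followed by its 100 float components. This is a minimal sketch, not part of the original script, and the file path is an assumption:

import numpy as np

glove_path = 'glove.6B.100d.txt'  # assumed local path
glove = {}
with open(glove_path, encoding='utf8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

print(len(glove))            # 400000 words in the 6B vocabulary
print(glove['movie'].shape)  # (100,) -- one 100-dimensional vector per word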

IV. Model Architecture

1. Recurrent Neural Networks (RNN)

An RNN is a class of neural networks designed for sequential (time-series) data. Each layer not only passes its output to the next layer, but also emits a hidden state that the same layer reuses when processing the next time step.

(Figure: RNN network structure)
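A minimal sketch of this recurrence using PyTorch's built-in nn.RNN (the shapes are chosen for illustration only):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=100, hidden_size=64, num_layers=1)

# A batch of 32 sequences, 500 steps each, with 100-dimensional inputs
x = torch.randn(500, 32, 100)   # (seq_len, batch, input_size)
outputs, h_n = rnn(x)

print(outputs.shape)  # torch.Size([500, 32, 64]) -- one hidden state per time step
print(h_n.shape)      # torch.Size([1, 32, 64])   -- final hidden state of the layer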

2. Long Short-Term Memory Networks (LSTM)

However, RNNs run into great difficulty with long-range dependencies: computing the interaction between distant time steps involves repeated multiplication of Jacobian matrices, which leads to vanishing or exploding gradients. LSTM solves this problem effectively. Its main idea is the introduction of gating units and linear connections:

Gating units: selectively retain and output historical information.

Linear connections: let the LSTM better capture dependencies that span large gaps in sequential data.

(Figure: how the LSTM gates operate)
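A minimal sketch of the bidirectional LSTM encoder used in the model below (the hyperparameters mirror the training script, and the random input is purely illustrative):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=100, num_layers=2, bidirectional=True)

x = torch.randn(500, 64, 100)   # (seq_len, batch, embed_size)
states, (h_n, c_n) = lstm(x)

print(states.shape)  # torch.Size([500, 64, 200]) -- forward and backward states concatenated
print(h_n.shape)     # torch.Size([4, 64, 100])   -- num_layers * num_directions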

V. Training the Model

Below, an LSTM model built with PyTorch is trained; the results are shown later in this section.

A GPU (CUDA) is recommended; training on the CPU alone takes quite a long time~~~
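The script below hard-codes the device to 'cpu'; here is a hedged sketch of how to switch to a GPU when one is available:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Then move the model and every batch to that device before the forward pass, e.g.
# net.to(device); feature = feature.to(device); label = label.to(device)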

The training result figures are shown below.

(Figure: loss and accuracy on the test set over training epochs)

As the figures show, the model has largely converged after about 4 to 5 training epochs, so there is no need to train for many more.

(Figure: loss and accuracy on the training set over training epochs)

VI. Code

Part of the source code is shown below.

# coding: utf-8

# In[1]:
import os
import re
import time
import random
import string
import collections
from collections import Counter
from itertools import chain

import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
from torch.autograd import Variable
import torchtext.vocab as torchvocab

import tqdm
import pandas as pd
import gensim
import snowballstemmer
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

# In[2]:
def clean_text(text):
    ## Remove punctuation
    text = text.translate(string.punctuation)

    ## Convert words to lower case and split them
    text = text.lower().split()

    ## Remove stop words
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops and len(w) >= 3]
    text = " ".join(text)

    ## Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    ## Stemming
    text = text.split()
    stemmer = snowballstemmer.stemmer('english')
    stemmed_words = [stemmer.stemWord(word) for word in text]
    text = " ".join(stemmed_words)
    print(text)
    return text

# In[3]:
def readIMDB(path, seg='train'):
    # Read the pos/ and neg/ folders of a split and label the reviews 1 / 0
    pos_or_neg = ['pos', 'neg']
    data = []
    for label in pos_or_neg:
        files = os.listdir(os.path.join(path, seg, label))
        for file in files:
            with open(os.path.join(path, seg, label, file), 'r', encoding='utf8') as rf:
                review = rf.read().replace('\n', '')
                if label == 'pos':
                    data.append([review, 1])
                elif label == 'neg':
                    data.append([review, 0])
    return data

# In[3]:
root = r'C:\Users\Admin\Desktop\aclImdb\aclImdb'
train_data = readIMDB(root)
test_data = readIMDB(root, 'test')

# In[4]:
def tokenizer(text):
    return [tok.lower() for tok in text.split(' ')]

train_tokenized = []
test_tokenized = []
for review, score in train_data:
    train_tokenized.append(tokenizer(review))
for review, score in test_data:
    test_tokenized.append(tokenizer(review))

# In[5]:
vocab = set(chain(*train_tokenized))
vocab_size = len(vocab)

# In[6]:
# Input file: the original GloVe vectors
glove_file = datapath(r'C:\Users\Admin\Desktop\glove.6B.100d.txt')
# Output file: the same vectors converted to word2vec format
tmp_file = get_tmpfile(r'C:\Users\Admin\Desktop\wv.6B.100d.txt')
# call glove2word2vec script
# default way (through CLI): python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>
# Run the conversion
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_file, tmp_file)
# Load the converted file
wvmodel = KeyedVectors.load_word2vec_format(tmp_file)

# In[7]:
word_to_idx = {word: i + 1 for i, word in enumerate(vocab)}
word_to_idx['<unk>'] = 0
idx_to_word = {i + 1: word for i, word in enumerate(vocab)}
idx_to_word[0] = '<unk>'

# In[8]:
def encode_samples(tokenized_samples, vocab):
    # Map each token to its index; unknown tokens map to 0 (<unk>)
    features = []
    for sample in tokenized_samples:
        feature = []
        for token in sample:
            if token in word_to_idx:
                feature.append(word_to_idx[token])
            else:
                feature.append(0)
        features.append(feature)
    return features

# In[9]:
def pad_samples(features, maxlen=500, PAD=0):
    # Truncate or pad every sample to exactly maxlen tokens
    padded_features = []
    for feature in features:
        if len(feature) >= maxlen:
            padded_feature = feature[:maxlen]
        else:
            padded_feature = feature
            while len(padded_feature) < maxlen:
                padded_feature.append(PAD)
        padded_features.append(padded_feature)
    return padded_features

# In[10]:
train_features = torch.tensor(pad_samples(encode_samples(train_tokenized, vocab)))
train_labels = torch.tensor([score for _, score in train_data])
test_features = torch.tensor(pad_samples(encode_samples(test_tokenized, vocab)))
test_labels = torch.tensor([score for _, score in test_data])

# In[13]:
class SentimentNet(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 bidirectional, weight, labels, use_gpu, **kwargs):
        super(SentimentNet, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.use_gpu = use_gpu
        self.bidirectional = bidirectional
        # Embedding layer initialized from the pretrained GloVe weights and frozen
        self.embedding = nn.Embedding.from_pretrained(weight)
        self.embedding.weight.requires_grad = False
        self.encoder = nn.LSTM(input_size=embed_size, hidden_size=self.num_hiddens,
                               num_layers=num_layers, bidirectional=self.bidirectional,
                               dropout=0)
        # The outputs of the first and last time steps are concatenated, so the decoder
        # input is 2 * num_hiddens (4 * num_hiddens for a bidirectional encoder)
        if self.bidirectional:
            self.decoder = nn.Linear(num_hiddens * 4, labels)
        else:
            self.decoder = nn.Linear(num_hiddens * 2, labels)

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # nn.LSTM expects (seq_len, batch, embed_size); inputs are (batch, seq_len)
        states, hidden = self.encoder(embeddings.permute([1, 0, 2]))
        encoding = torch.cat([states[0], states[-1]], dim=1)
        outputs = self.decoder(encoding)
        return outputs

# In[16]:
num_epochs = 5
embed_size = 100
num_hiddens = 100
num_layers = 2
bidirectional = True
batch_size = 64
labels = 2
lr = 0.8
device = torch.device('cpu')
use_gpu = True

# Build the embedding matrix: row i holds the GloVe vector of the word with index i
weight = torch.zeros(vocab_size + 1, embed_size)
for i in range(len(wvmodel.index_to_key)):
    try:
        index = word_to_idx[wvmodel.index_to_key[i]]
    except:
        continue
    weight[index, :] = torch.from_numpy(wvmodel.get_vector(
        idx_to_word[word_to_idx[wvmodel.index_to_key[i]]]))

# In[17]:
net = SentimentNet(vocab_size=(vocab_size + 1), embed_size=embed_size,
                   num_hiddens=num_hiddens, num_layers=num_layers,
                   bidirectional=bidirectional, weight=weight,
                   labels=labels, use_gpu=use_gpu)
net.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=lr)

# In[18]:
train_set = torch.utils.data.TensorDataset(train_features, train_labels)
test_set = torch.utils.data.TensorDataset(test_features, test_labels)
train_iter = torch.utils.data.DataLoader(train_set, batch_size=batch_size,
                                         shuffle=True)
test_iter = torch.utils.data.DataLoader(test_set, batch_size=batch_size,
                                        shuffle=False)

# In[20]:
num_epochs = 20

# In[ ]:
for epoch in range(num_epochs):
    start = time.time()
    train_loss, test_losses = 0, 0
    train_acc, test_acc = 0, 0
    n, m = 0, 0
    for feature, label in train_iter:
        n += 1
        net.zero_grad()
        feature = Variable(feature.cpu())
        label = Variable(label.cpu())
        score = net(feature)
        loss = loss_function(score, label)
        loss.backward()
        optimizer.step()
        train_acc += accuracy_score(torch.argmax(score.cpu().data, dim=1),
                                    label.cpu())
        train_loss += loss
    with torch.no_grad():
        for test_feature, test_label in test_iter:
            m += 1
            test_feature = test_feature.cpu()
            test_score = net(test_feature)
            test_loss = loss_function(test_score, test_label)
            test_acc += accuracy_score(torch.argmax(test_score.cpu().data, dim=1),
                                       test_label.cpu())
            test_losses += test_loss
    end = time.time()
    runtime = end - start
    print('epoch: %d, train loss: %.4f, train acc: %.2f, test loss: %.4f, test acc: %.2f, time: %.2f' %
          (epoch, train_loss.data / n, train_acc / n, test_losses.data / m, test_acc / m, runtime))

# In[ ]:

Writing these posts takes effort. If you found this helpful, please like, follow, and bookmark~~~
