
nlp-beginner task2: Text Classification with Deep Learning, part1 (PyTorch + BOW + textCNN)

Task 2: Text classification with deep learning. Get familiar with PyTorch, rewrite Task 1 with PyTorch, and implement CNN/RNN-based text classification.

https://github.com/FudanNLP/nlp-beginner

Note: none of the code before Section 4 handles batching properly; every epoch updates the parameters 124,848 times (once per training sample), so the results are very poor. Apart from the training loop, though, the rest is basically fine.

For time reasons I only redid the textCNN + GloVe + Adam + dual-embedding variant (Section 3.3); the only real difference is the training code, and that training loop and minibatch setup are the same as in part 2. Take the other results below with a grain of salt.
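For orientation, here is a minimal sketch of the dual-embedding idea mentioned above (a Kim-2014-style two-channel textCNN: one embedding channel initialized from GloVe and frozen, one trainable). Everything here, including the class name, filter sizes, and filter count, is illustrative, not the actual code from part 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # Two-channel textCNN sketch: one embedding is initialized from
    # pretrained GloVe vectors and frozen, the other stays trainable.
    # glove_weights: FloatTensor of shape (vocab_size, embed_dim).
    def __init__(self, glove_weights, num_labels=5, n_filters=100, sizes=(3, 4, 5)):
        super().__init__()
        vocab_size, embed_dim = glove_weights.shape
        self.embed_static = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.embed_tuned = nn.Embedding.from_pretrained(glove_weights.clone(), freeze=False)
        # in_channels=2: the two embedding channels are stacked like image channels
        self.convs = nn.ModuleList(
            [nn.Conv2d(2, n_filters, (k, embed_dim)) for k in sizes])
        self.fc = nn.Linear(n_filters * len(sizes), num_labels)

    def forward(self, x):                        # x: (batch, seq_len) word indices,
        a = self.embed_static(x)                  # padded to at least max(sizes) tokens
        b = self.embed_tuned(x)
        z = torch.stack([a, b], dim=1)            # (batch, 2, seq_len, embed_dim)
        feats = [F.relu(conv(z)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]  # max over time
        return F.log_softmax(self.fc(torch.cat(pooled, dim=1)), dim=1)
```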


1. Installing PyTorch

I had CUDA 9.0 installed from earlier BERT experiments and didn't feel like upgrading, so I went with PyTorch 1.1.0. Just create a new Python 3.6 environment in Anaconda and run:

conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=9.0 -c pytorch
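Not from the original post, but a quick sanity check that the install and the CUDA toolkit were picked up correctly:

```python
import torch

print(torch.__version__)          # expect 1.1.0
print(torch.cuda.is_available())  # True if the CUDA 9.0 build sees a GPU
```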

 

2. Rewriting BOW in PyTorch

For the basics and for the code I referred to two official tutorials, respectively: the 60 Minute Blitz (only the first three chapters) and DEEP LEARNING WITH PYTORCH.

 

2.1. Code

```python
import torch
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split

read_data = pd.read_table('../train.tsv')
data = []
data_len = read_data.shape[0]
for i in range(data_len):
    data.append([read_data['Phrase'][i].split(' '), read_data['Sentiment'][i]])

word_to_ix = {}  # assign an index to every word
for sent, _ in data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

torch.manual_seed(6)  # seed torch; affects parameter init and random_split below
train_len = int(0.8 * data_len)
test_len = data_len - train_len
train_data, test_data = random_split(data, [train_len, test_len])  # split the dataset
# print(type(train_data))  # torch.utils.data.dataset.Subset
train_data = list(train_data)
test_data = list(test_data)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 5

class BoWClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size):
        super(BoWClassifier, self).__init__()
        self.linear = nn.Linear(vocab_size, num_labels)  # linear map
        # print(list(self.parameters()))  # convert to a list before printing, otherwise you just get a generator object

    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)

def make_bow_vector(phrase):  # build the BOW vector
    vec = torch.zeros(VOCAB_SIZE)
    for word in phrase:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)  # -1 is inferred from the other dim; reshapes to 1 x VOCAB_SIZE

model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=1)

for epoch in range(10):
    print('now in epoch %d...' % epoch)
    for instance, label in train_data:
        model.zero_grad()
        bow_vec = make_bow_vector(instance)
        target = torch.LongTensor([label])  # NLLLoss uses `label` as the index of the target class (out of 5); e.g. [0] picks out the log-prob at position 0
        log_probs = model(bow_vec)  # return value of forward
        # print(log_probs)  # tensor([[-1.6050, -1.6095, -1.6062, -1.6140, -1.6126]], grad_fn=<LogSoftmaxBackward>)
        loss = loss_function(log_probs, target)
        loss.backward()  # backprop
        optimizer.step()  # update parameters

acc = 0
with torch.no_grad():  # disable autograd here to avoid needless bookkeeping during evaluation
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance)
        log_probs = model(bow_vec)
        b = torch.argmax(log_probs, dim=1)  # dim 0 = rows, 1 = columns; argmax over dim=1 reduces n x m to a length-n vector (here the input is 1 x NUM_LABELS)
        if b[0] == label:
            acc += 1
print('acc = %.4lf%%' % (acc / test_len * 100))
```
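The training loop above is exactly the per-sample-update problem flagged in the note at the top: with ~125k training phrases, every epoch performs ~125k optimizer steps. A minimal sketch of the minibatched version (my own illustration on the same model, not the actual part 2 code) would batch the BOW vectors with a DataLoader and a custom collate function:

```python
from torch.utils.data import DataLoader

def collate(batch):
    # Stack per-phrase BOW vectors into one (batch, VOCAB_SIZE) tensor
    vecs = torch.cat([make_bow_vector(instance) for instance, _ in batch], dim=0)
    labels = torch.LongTensor([label for _, label in batch])
    return vecs, labels

loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=collate)

for epoch in range(10):
    for bow_vecs, targets in loader:
        model.zero_grad()
        loss = loss_function(model(bow_vecs), targets)  # loss averaged over the batch
        loss.backward()
        optimizer.step()  # one update per minibatch, not per sample
```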

 

2.2. Results

train_data contains ~125k examples; one SGD pass over it takes roughly 2 minutes.

epoch = 1, acc = 39.65%

epoch = 5, acc = 47.48%

epoch = 10, acc = 59.24%

epoch = 20, acc = 59.30%

It probably starts overfitting around there; accuracy tops out at roughly 60%.


Hmm... there's a problem: the accuracies above suddenly stopped being reproducible, and I get the following (reproducible) numbers instead. Puzzling. I also plotted the accuracy curve, with the code tweaked slightly:

```python
# ... data prep, model, loss_function and optimizer exactly as in 2.1

accs = []

def match():
    # evaluate on the test split and record accuracy for plotting
    acc = 0
    with torch.no_grad():
        for instance, label in test_data:
            bow_vec = make_bow_vector(instance)
            log_probs = model(bow_vec)
            b = torch.argmax(log_probs, dim=1)
            if b[0] == label:
                acc += 1
    print('acc = %.4lf%%' % (acc / test_len * 100))
    accs.append(acc / test_len * 100)

for epoch in range(40):
    print('now in epoch %d...' % epoch)
    for instance, label in train_data:
        model.zero_grad()
        bow_vec = make_bow_vector(instance)
        target = torch.LongTensor([label])
        log_probs = model(bow_vec)
        loss = loss_function(log_probs, target)
        loss.backward()  # backprop
        optimizer.step()  # update parameters
    match()  # evaluate once per epoch
```
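The plotting step itself could then look like this (my own sketch, assuming matplotlib and the accs list above; not necessarily the original plotting code):

```python
import matplotlib.pyplot as plt

plt.plot(range(1, len(accs) + 1), accs)
plt.xlabel('epoch')
plt.ylabel('test accuracy (%)')
plt.title('BOW classifier, per-sample SGD, lr=1')
plt.show()
```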