https://github.com/FudanNLP/nlp-beginner
Note: none of the code before Section 4 in this post handles batching properly; every epoch updates the parameters 124,848 times (once per training sample), so the results are very poor. Apart from the training loop, though, the rest has no major problems.
Due to time constraints I only rewrote textCNN + GloVe + Adam + dual embedding (3.3); the only real difference is the training code, whose loop and minibatch setup are the same as in Part 2 (a sketch of that kind of minibatch loop is below). Treat the other results as entertainment only.
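As a rough illustration of what the minibatch fix looks like (this is only a sketch with placeholder data and illustrative hyperparameters, not the actual Part 2 training code), batching the BoW vectors through a DataLoader turns the ~125k per-sample updates into a couple of thousand updates per epoch:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data just to make the sketch runnable;
# in the real script these would be the stacked BoW vectors and sentiment labels.
N, VOCAB_SIZE, NUM_LABELS = 1000, 5000, 5
bow_vecs = torch.rand(N, VOCAB_SIZE)
labels = torch.randint(0, NUM_LABELS, (N,))

loader = DataLoader(TensorDataset(bow_vecs, labels), batch_size=64, shuffle=True)

model = nn.Linear(VOCAB_SIZE, NUM_LABELS)   # same kind of linear classifier as below
loss_function = nn.CrossEntropyLoss()       # = log_softmax + NLLLoss in one step
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for batch_vecs, batch_labels in loader:
        optimizer.zero_grad()
        loss = loss_function(model(batch_vecs), batch_labels)
        loss.backward()
        optimizer.step()   # one update per batch (~N/64 per epoch), not one per sample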
I had CUDA 9.0 installed from an earlier attempt at BERT and didn't feel like reinstalling it, so I went with PyTorch 1.1.0. Just create a new Python 3.6 environment in Anaconda and run the following command:
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=9.0 -c pytorch
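A quick sanity check that the environment picked up the right build and can see the GPU:

import torch
print(torch.__version__)           # expect 1.1.0
print(torch.cuda.is_available())   # True if the cudatoolkit 9.0 build found the GPU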
For getting started and for the code I followed two official tutorials: the 60-minute blitz (only the first three chapters) and DEEP LEARNING WITH PYTORCH.
import torch
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split

read_data = pd.read_table('../train.tsv')
data = []
data_len = read_data.shape[0]
for i in range(data_len):
    data.append([read_data['Phrase'][i].split(' '), read_data['Sentiment'][i]])

word_to_ix = {}  # assign an index to every word
for sent, _ in data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

torch.manual_seed(6)  # set torch's seed; affects the parameter initialization and random_split below
train_len = int(0.8 * data_len)
test_len = data_len - train_len
train_data, test_data = random_split(data, [train_len, test_len])  # split the dataset
# print(type(train_data))  # torch.utils.data.dataset.Subset
train_data = list(train_data)
test_data = list(test_data)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 5


class BoWClassifier(nn.Module):

    def __init__(self, num_labels, vocab_size):
        super(BoWClassifier, self).__init__()
        self.linear = nn.Linear(vocab_size, num_labels)  # linear map: vocab_size -> num_labels
        # print(list(self.parameters()))  # wrap parameters() in list() to print it, otherwise you only get a generator object

    def forward(self, bow_vec):
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(phrase):  # build the bag-of-words vector
    vec = torch.zeros(VOCAB_SIZE)
    for word in phrase:
        vec[word_to_ix[word]] += 1

    return vec.view(1, -1)  # -1 means that dimension is inferred from the other; here it reshapes to 1 x VOCAB_SIZE


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=1)

for epoch in range(10):
    print('now in epoch %d...' % epoch)
    for instance, label in train_data:
        model.zero_grad()

        bow_vec = make_bow_vector(instance)
        target = torch.LongTensor([label])  # NLLLoss picks out the log-probability at index `label` (one of the 5 classes)

        log_probs = model(bow_vec)  # return value of forward
        # print(log_probs)  # tensor([[-1.6050, -1.6095, -1.6062, -1.6140, -1.6126]], grad_fn=<LogSoftmaxBackward>)

        loss = loss_function(log_probs, target)
        loss.backward()  # backprop
        optimizer.step()  # update parameters

acc = 0
with torch.no_grad():  # disable gradient tracking to avoid wasted computation during evaluation
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance)
        log_probs = model(bow_vec)
        b = torch.argmax(log_probs, dim=1)  # argmax along dim=1 (columns): the 1 x NUM_LABELS output collapses to one predicted class index
        if b[0] == label:
            acc += 1

print('acc = %.4lf%%' % (acc / test_len * 100))
train_data has about 125k samples; one full SGD pass over it takes roughly 2 minutes.
epoch = 1, acc = 39.65%
epoch = 5, acc = 47.48%
epoch = 10, acc = 59.24%
epoch = 20, acc = 59.30%
It presumably starts overfitting around here; accuracy tops out at roughly 60%.
Hmm... big problem: the accuracies above suddenly stopped being reproducible, and instead I get the numbers below (which are reproducible). Confusing. I plotted a curve and tweaked the code slightly:
...

accs = []
def match():
    acc = 0
    with torch.no_grad():
        for instance, label in test_data:
            bow_vec = make_bow_vector(instance)
            log_probs = model(bow_vec)
            b = torch.argmax(log_probs, dim=1)  # argmax along dim=1: 1 x NUM_LABELS collapses to the predicted class index
            if b[0] == label:
                acc += 1

    print('acc = %.4lf%%' % (acc / test_len * 100))
    accs.append(acc / test_len * 100)


for epoch in range(40):
    print('now in epoch %d...' % epoch)
    for instance, label in train_data:
        model.zero_grad()

        bow_vec = make_bow_vector(instance)
        target = torch.LongTensor([label])  # NLLLoss picks out the log-probability at index `label`

        log_probs = model(bow_vec)  # return value of forward
        # print(log_probs)  # tensor([[-1.6050, -1.6095, -1.6062, -1.6140, -1.6126]], grad_fn=<LogSoftmaxBackward>)

        loss = loss_function(log_probs, target)
        loss.backward()  # backprop
        optimizer.step()  # update parameters

    match()  # record test accuracy after each epoch
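The plot itself isn't reproduced here; a minimal sketch of how the accs list collected by match() could be plotted (matplotlib is my assumption, it is not imported in the original code):

import matplotlib.pyplot as plt

plt.plot(range(1, len(accs) + 1), accs)  # one point per epoch
plt.xlabel('epoch')
plt.ylabel('test accuracy (%)')
plt.show()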