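Task description: NLP-Beginner:自然语言处理入门练习 (NLP-Beginner: introductory NLP exercises), Task 2
Data download: Sentiment Analysis on Movie Reviews
References:
This task is done mainly with the torchtext package.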
import torch
from torch import nn, optim
import torch.nn.functional as F
from torchtext import data
from torchtext.data import Field, TabularDataset, BucketIterator
from torchtext.vocab import GloVe  # needed for the pretrained word vectors used later
import time
Four torchtext modules are mainly used here:
The overall workflow is shown in the figure below (taken from reference 4)
fix_length = 40
# Create the Field objects
TEXT = Field(sequential = True, lower=True, fix_length = fix_length)
LABEL = Field(sequential = False, use_vocab = False)
Field parameters (taken from reference 4):
Some important Field methods (taken from reference 4):
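As a rough illustration of what the `lower` and `fix_length` options do to a single phrase, here is a pure-Python sketch. The helper `preprocess` is hypothetical and is not torchtext's implementation; it assumes whitespace tokenization and a `<pad>` padding token.

```python
def preprocess(phrase, fix_length=40, pad_token="<pad>"):
    """Lowercase, split on whitespace, then truncate or pad to fix_length."""
    tokens = phrase.lower().split()
    tokens = tokens[:fix_length]  # truncate phrases that are too long
    # pad phrases that are too short
    return tokens + [pad_token] * (fix_length - len(tokens))


example = preprocess("A series of escapades", fix_length=6)
print(example)
```

Every phrase comes out with exactly `fix_length` tokens, which is what lets a whole batch be stacked into one tensor.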
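torchtext's TabularDataset can read data from CSV, TSV, or JSON files.
TabularDataset parameters
Let's first take a look at the data
In fact, we only need the text (the Phrase column) and the label (the Sentiment column), so the fields reader is set up as follows: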
fields = [('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)]
With the reader set up, TabularDataset can be used to read the data:
# Read the data from file
fields = [('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)]
dataset, test = TabularDataset.splits(path = './', format = 'tsv',
train = 'train.tsv', test = 'test.tsv',
skip_header = True, fields = fields)
train, vali = dataset.split(0.7)
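Note that the train.tsv file we read in actually supplies both the training set and the validation set: dataset.split(0.7) splits it into a training set and a validation set with a 7:3 ratio of samples.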
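The 7:3 split can be pictured with a small pure-Python sketch. The function `split_dataset` and its `seed` parameter are hypothetical; it only assumes that the split shuffles the examples and cuts them at the given ratio.

```python
import random


def split_dataset(examples, split_ratio=0.7, seed=0):
    """Shuffle a copy of the examples and cut it into (train, validation)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train_part, vali_part = split_dataset(range(10), split_ratio=0.7)
print(len(train_part), len(vali_part))
```

The two parts are disjoint and together cover the original data, so no validation example leaks into training.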
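The torchtext.data.Field object provides a build_vocab method to help build the vocabulary. For the arguments of build_vocab(), see Vocab.
The code below builds the vocabulary from the training set. The vocabulary for the text data holds at most 10000 words, and words that occur fewer than 10 times are dropped.
# Build the vocabulary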
TEXT.build_vocab(train, max_size=10000, min_freq=10)
LABEL.build_vocab(train)
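To make the effect of `max_size` and `min_freq` concrete, the following is a minimal pure-Python sketch of frequency-based vocabulary building. The function `build_vocab` here is a hypothetical stand-in, not torchtext's code; it assumes two special tokens, `<unk>` and `<pad>`, that sit in front and do not count towards `max_size`.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>")):
    """Return (itos, stoi) with the most frequent words first."""
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties are broken alphabetically
    candidates = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)
    for word, freq in candidates:
        if freq < min_freq:
            break  # every remaining word is rarer still
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # the vocabulary is full
        itos.append(word)
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi


texts = [["good", "movie"], ["good", "plot"], ["bad", "movie"], ["good"]]
itos, stoi = build_vocab(texts, max_size=10, min_freq=2)
print(itos)
```

Words that fall below `min_freq` (here "plot" and "bad") have no index of their own and would be mapped to `<unk>` when a phrase is numericalized.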
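torchtext provides BucketIterator, which batches all the text for us and replaces each word with its vocabulary index. If sequence lengths vary widely, padding wastes a lot of memory and time. BucketIterator groups sequences of similar length into the same batch to keep padding to a minimum.
# Generate batches of index tensors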
bs = 64
train_iter, vali_iter = BucketIterator.splits((train, vali), batch_size = bs,
device= torch.device('cpu'),
sort_key=lambda x: len(x.Phrase),
sort_within_batch=False,
shuffle = True,
repeat = False)
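The idea behind bucketing can be sketched in a few lines of pure Python. This is a simplified illustration rather than BucketIterator's actual algorithm; `bucket_batches` and `padding_needed` are hypothetical helpers.

```python
import random


def bucket_batches(examples, batch_size, sort_key=len, seed=0):
    """Sort by length, cut into batches, then shuffle the batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def padding_needed(batches):
    """Count the pad tokens needed to bring each batch to its longest item."""
    return sum(max(len(x) for x in batch) - len(x)
               for batch in batches for x in batch)


phrases = [["a"] * n for n in (1, 9, 2, 8, 1, 9, 2, 8)]
naive = [phrases[i:i + 2] for i in range(0, len(phrases), 2)]
bucketed = bucket_batches(phrases, batch_size=2)
print(padding_needed(naive), padding_needed(bucketed))
```

In this toy case the naive batches need 28 pad tokens, while the length-sorted batches need none.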
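class EmbNet(nn.Module):
    def __init__(self, emb_size, hidden_size1, hidden_size2=400):
        super().__init__()
        self.embedding = nn.Embedding(emb_size, hidden_size1)
        # The dataset has five sentiment classes (0-4)
        self.fc = nn.Linear(hidden_size2, 5)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # x has shape (fix_length, batch) because the Field is not batch_first,
        # so put the batch dimension first before flattening each phrase
        embeds = self.embedding(x).transpose(0, 1).reshape(x.size(1), -1)
        out = self.fc(embeds)
        out = self.log_softmax(out)
        return out

# The flattened input size is embedding dimension * fix_length
model = EmbNet(len(TEXT.vocab.stoi), 20, 20 * fix_length)
# The iterators above yield CPU tensors, so the model is kept on the CPU too
optimizer = optim.Adam(model.parameters(), lr=0.001)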
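The model has three layers. The first is an embedding layer, which takes two arguments: the vocabulary size and the dimension of the word embedding to create for each word. For a given sentence, the word embedding vectors of all its words are joined end to end (flattened) and passed through a linear layer and a log_softmax layer to produce the final classification.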
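The flattening step determines the input size of the linear layer: it must equal fix_length times the embedding dimension (with fix_length = 40 and 20-dimensional embeddings, that is 800 values per phrase). A toy pure-Python sketch follows, using a hypothetical `embedding_table` with made-up values and much smaller sizes.

```python
fix_length, emb_dim = 4, 3  # toy sizes; the article uses 40 and 20

# Hypothetical embedding table: one emb_dim-long vector per vocabulary index
embedding_table = [[float(10 * idx + d) for d in range(emb_dim)]
                   for idx in range(5)]


def embed_and_flatten(token_ids):
    """Look up each token's vector and join the vectors end to end."""
    flat = []
    for idx in token_ids:
        flat.extend(embedding_table[idx])
    return flat


features = embed_and_flatten([2, 0, 4, 1])
print(len(features))  # equals fix_length * emb_dim
```

Because the vectors are simply concatenated, the position of a word in the phrase decides which weights of the linear layer it meets.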
Adam is chosen as the optimizer here; in practice it tends to work much better than SGD.
# Train the model
def fit(epoch, model, data_loader, phase='training'):
    if phase == 'training':
        model.train()
    if phase == 'validation':
        model.eval()
    running_loss = 0.0
    running_correct = 0
    for batch_idx, batch in enumerate(data_loader):
        text, target = batch.Phrase, batch.Sentiment
        # if torch.cuda.is_available():
        #     text, target = text.cuda(), target.cuda()
        if phase == 'training':
            optimizer.zero_grad()
        output = model(text)
        # The model already returns log-probabilities, so use the NLL loss
        loss = F.nll_loss(output, target)
        running_loss += F.nll_loss(output, target, reduction='sum').item()
        preds = output.data.max(dim=1, keepdim=True)[1]
        running_correct += preds.eq(target.data.view_as(preds)).cpu().sum().item()
        if phase == 'training':
            loss.backward()
            optimizer.step()
    loss = running_loss / len(data_loader.dataset)
    accuracy = 100. * running_correct / len(data_loader.dataset)
    print(f'{phase} loss is {loss:5.4} and {phase} accuracy is {running_correct}/{len(data_loader.dataset)}{accuracy:10.4}')
    return loss, accuracy

train_losses, train_accuracy = [], []
val_losses, val_accuracy = [], []
t0 = time.time()
# range(1, 101) runs the 100 epochs
for epoch in range(1, 101):
    print('epoch no. {} :'.format(epoch) + '-' * 15)
    epoch_loss, epoch_accuracy = fit(epoch, model, train_iter, phase='training')
    val_epoch_loss, val_epoch_accuracy = fit(epoch, model, vali_iter, phase='validation')
    train_losses.append(epoch_loss)
    train_accuracy.append(epoch_accuracy)
    val_losses.append(val_epoch_loss)
    val_accuracy.append(val_epoch_accuracy)
tend = time.time()
print('Total time: {} s'.format(tend - t0))
The model was trained for 100 epochs in total; accuracy on the validation set is roughly 60%.
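For reference, the accuracy figure is the share of phrases whose highest-scoring class matches the label. A minimal pure-Python sketch of that computation, with a hypothetical `accuracy` helper and made-up log-probabilities:

```python
def accuracy(log_probs, targets):
    """Percentage of rows whose arg-max class equals the target label."""
    correct = 0
    for row, target in zip(log_probs, targets):
        predicted = max(range(len(row)), key=lambda c: row[c])
        correct += int(predicted == target)
    return 100.0 * correct / len(targets)


# Made-up model outputs for three phrases and three classes
outputs = [[-0.1, -2.5, -3.0], [-1.9, -0.3, -2.2], [-0.4, -1.5, -2.0]]
labels = [0, 1, 2]
print(accuracy(outputs, labels))
```

Here two of the three predictions are right, so the result is about 66.7.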
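Pretrained word vectors are often very useful when working on NLP tasks in a specific domain. Using them usually involves the following three steps.
Download the word vectors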
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300), max_size=10000, min_freq=20)
Load the word vectors into the model
model = EmbNet(len(TEXT.vocab.stoi),300, 300*fix_length)
# Use the pretrained word vectors
model.embedding.weight.data = TEXT.vocab.vectors
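Conceptually, `TEXT.vocab.vectors` is a matrix whose i-th row is the pretrained vector of the i-th vocabulary word. The pure-Python sketch below shows that alignment; `build_embedding_matrix` is a hypothetical helper, the vectors are made up, and it assumes that words without a pretrained vector receive zeros.

```python
def build_embedding_matrix(itos, pretrained, dim):
    """Row i holds the vector of itos[i]; unknown words get zeros (assumed)."""
    return [list(pretrained.get(word, [0.0] * dim)) for word in itos]


itos = ["<unk>", "<pad>", "good", "movie"]
# Made-up two-dimensional "pretrained" vectors
pretrained = {"good": [0.1, 0.2], "movie": [0.3, 0.4], "film": [0.5, 0.6]}
matrix = build_embedding_matrix(itos, pretrained, dim=2)
print(matrix)
```

The matrix has one row per vocabulary entry, so it only fits an embedding layer created with the same vocabulary size and the same vector dimension.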
Freeze the embedding layer
# Freeze the embedding layer's weights
model.embedding.weight.requires_grad = False
optimizer = optim.Adam([ param for param in model.parameters() if param.requires_grad == True],lr=0.001)
## Build the model
class LSTM(nn.Module):
    def __init__(self, vocab, hidden_size, n_cat, bs=1, nl=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.bs = bs
        self.nl = nl
        self.n_vocab = len(vocab)
        self.n_cat = n_cat
        self.e = nn.Embedding(self.n_vocab, self.hidden_size)
        self.rnn = nn.LSTM(self.hidden_size, self.hidden_size, self.nl)
        self.fc2 = nn.Linear(self.hidden_size, self.n_cat)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # x has shape (seq_len, batch); the last batch may be smaller than bs
        bs = x.size()[1]
        if bs != self.bs:
            self.bs = bs
        e_out = self.e(x)
        h0, c0 = self.init_paras()
        rnn_o, _ = self.rnn(e_out, (h0, c0))
        # Keep only the output of the last time step
        rnn_o = rnn_o[-1]
        fc = self.fc2(rnn_o)
        out = self.softmax(fc)
        return out

    def init_paras(self):
        # Plain tensors suffice; the deprecated Variable wrapper is not needed
        h0 = torch.zeros(self.nl, self.bs, self.hidden_size)
        c0 = torch.zeros(self.nl, self.bs, self.hidden_size)
        return h0, c0

model = LSTM(TEXT.vocab, hidden_size=300, n_cat=5, bs=bs)
# Use the pretrained word vectors
model.e.weight.data = TEXT.vocab.vectors
# Freeze the embedding layer's weights
model.e.weight.requires_grad = False
optimizer = optim.Adam([param for param in model.parameters() if param.requires_grad == True], lr=0.001)