The sequential nature of text and the co-occurrence of words provide natural language processing with a natural self-supervised learning signal, allowing a system to acquire knowledge from text without additional manual annotation.
From the language-modeling perspective, N-gram language models have clear drawbacks. First, they are susceptible to data sparsity and generally require smoothing; second, they cannot model contextual dependencies that span more than N words. Neural network language models (Neural Network Language Model) overcome these problems to a large extent.
Neural network language models have largely replaced N-gram language models and have become one of the most important foundational techniques in modern natural language processing.
Given a text w1 w2 ... wn, the basic task of a language model is to predict the word at the next time step from the preceding context, i.e., to compute the conditional probability P(wt | w1 w2 ... wt−1).
To build such a language model, the problem is cast as a classification task whose label set is the vocabulary: the input is the history word sequence w1 w2 ... wt−1 (also written w1:t−1) and the output is the target word wt. A training set is then constructed from unannotated text, and the model is trained by minimizing the classification loss on this data. Because the supervision signal comes from the data itself, this style of learning is also called self-supervised learning (Self-supervised Learning).
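Concretely, the chain-rule factorization of the sequence probability and the corresponding training objective (the negative log-likelihood over the text) can be written as:

P(w1 w2 ... wn) = ∏_{t=1..n} P(wt | w1:t−1)
L(θ) = −Σ_{t=1..n} log P(wt | w1:t−1; θ)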
Feed-Forward Neural Network Language Model
It adopts the Markov assumption (Markov Assumption) used in traditional N-gram language models: the prediction of the next word depends only on the most recent n−1 words of the history.
Model input: a fixed-length word sequence of length n−1, wt−n+1:t−1.
Model task: to estimate the conditional probability P(wt | wt−n+1:t−1).
The feed-forward neural network consists of an input layer, a word vector layer, a hidden layer, and an output layer.
Input layer
Consists of the history word sequence wt−n+1:t−1 at the current time step t, represented as discrete symbols.
Word vector layer
Maps each word in the input layer to a low-dimensional, dense, real-valued feature vector.
Hidden layer
Applies a linear transformation and a nonlinear activation to the (concatenated) word vector layer x. Common activation functions include Sigmoid, tanh, and ReLU.
Output layer
Applies a linear transformation to h and normalizes it with the Softmax function, yielding a probability distribution over the vocabulary V (see the equations below).
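Putting the four layers together, a standard formulation is the following (for illustration, E denotes the word embedding matrix, W and U the hidden- and output-layer weights, f the activation function):

x = [E_{w_{t−n+1}}; ...; E_{w_{t−1}}]
h = f(W x + b)
P(wt | wt−n+1:t−1) = softmax(U h + b')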
Recurrent Neural Network Language Model
In the feed-forward neural network language model, how far back the prediction of the next word can look is fixed by the hyperparameter n. However, the history length that different sentences actually need varies.
The recurrent neural network language model is designed precisely to handle such variable-length dependencies.
A recurrent neural network is a neural network for processing sequential data, and natural language has exactly this sequential structure. At every time step, the recurrent neural network language model maintains a hidden state that encodes the entire history of the current word; this state, together with the current word, serves as the input to the next time step. This hidden state, updated as time advances, is also called the memory (Memory).
Input layer
Can consist of the complete history word sequence, i.e., w1:t−1.
Word vector layer
The input word sequence is first mapped to the corresponding word vectors by the word vector layer.
Hidden layer
Consists of a linear transformation followed by an activation function.
Output layer
Computes the probability distribution over the vocabulary at time step t (see the recurrence below).
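For a simple (Elman-style) RNN language model, with x_t the word vector of w_t, the recurrence and the prediction are commonly written as:

h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
P(w_{t+1} | w_{1:t}) = softmax(W_hy h_t + b_y)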
When sequences are long, training runs the risk of vanishing gradients (Vanishing gradient) or exploding gradients (Exploding gradient).
The LSTM, an RNN with a gating mechanism, alleviates this problem effectively.
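For reference, the standard LSTM cell used later in the code (via nn.LSTM) updates its memory cell c_t and hidden state h_t with input, forget, and output gates:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t−1} + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

The additive update of c_t is what keeps gradients from vanishing as quickly as in a plain RNN.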
Data Preparation
We use the Reuters corpus provided by NLTK. It is widely used for text classification and contains 10,788 news documents, each labeled with one or more categories.
from nltk.corpus import reuters
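If the corpus has not been downloaded yet, a quick check such as the following fetches it and inspects it (a sketch; the counts are the ones stated above):

import nltk
nltk.download('reuters')   # one-time download of the corpus data
nltk.download('punkt')     # sentence tokenizer used by the corpus reader

from nltk.corpus import reuters
print(len(reuters.fileids()))      # number of documents: 10788
print(len(reuters.categories()))   # number of categories
print(reuters.sents()[0][:10])     # first tokens of the first sentence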
Feed-Forward Neural Network Language Model
Data: the NGramDataset class
Model: FeedForwardNNLM
Training
from collections import defaultdict
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

WEIGHT_INIT_RANGE = 0.1

class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()
        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx['<unk>']

    @classmethod
    def build(cls, text, min_freq=1, reserved_tokens=None):
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items()
                        if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, token):
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

def save_vocab(vocab, path):
    with open(path, 'w') as writer:
        writer.write("\n".join(vocab.idx_to_token))

def read_vocab(path):
    with open(path, 'r') as f:
        tokens = f.read().split('\n')
    return Vocab(tokens)

# Reserved token constants
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
BOW_TOKEN = "<bow>"
EOW_TOKEN = "<eow>"

def init_weights(model):
    for name, param in model.named_parameters():
        if "embedding" not in name:
            torch.nn.init.uniform_(
                param, a=-WEIGHT_INIT_RANGE, b=WEIGHT_INIT_RANGE
            )

def get_loader(dataset, batch_size, shuffle=True):
    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=dataset.collate_fn,
        shuffle=shuffle
    )
    return data_loader

def save_pretrained(vocab, embeds, save_path):
    """
    Save pretrained token vectors in a unified format, where the first line
    specifies the `number_of_tokens` and `embedding_dim` followed with all
    token vectors, one token per line.
    """
    with open(save_path, "w") as writer:
        writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
        for idx, token in enumerate(vocab.idx_to_token):
            vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
            writer.write(f"{token} {vec}\n")
    print(f"Pretrained embeddings saved to: {save_path}")

# Load the corpus
def load_reuters():
    from nltk.corpus import reuters
    text = reuters.sents()
    # lowercase (optional)
    text = [[word.lower() for word in sentence] for sentence in text]
    vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
    corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
    return corpus, vocab

# Data-processing class for the feed-forward NN language model
class NGramDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.bos = vocab[BOS_TOKEN]  # id of the beginning-of-sentence token
        self.eos = vocab[EOS_TOKEN]  # id of the end-of-sentence token
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            # insert beginning- and end-of-sentence tokens
            sentence = [self.bos] + sentence + [self.eos]
            # skip sentences shorter than the predefined context size
            if len(sentence) < context_size:
                continue
            for i in range(context_size, len(sentence)):
                # model input: the preceding context of length context_size
                context = sentence[i - context_size:i]
                # model output: the current word
                target = sentence[i]
                # each training example is a (context, target) pair
                self.data.append((context, target))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        # build batched inputs/outputs from individual examples and convert to tensors
        inputs = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
        targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
        return (inputs, targets)

# Feed-forward neural network language model
class FeedForwardNNLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
        super(FeedForwardNNLM, self).__init__()
        # word-embedding layer
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # linear transformation: word embeddings -> hidden layer
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        # linear transformation: hidden layer -> output layer
        self.linear2 = nn.Linear(hidden_dim, vocab_size)
        # ReLU activation function
        self.activate = torch.relu
        init_weights(self)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((inputs.shape[0], -1))
        hidden = self.activate(self.linear1(embeds))
        output = self.linear2(hidden)
        log_probs = torch.log_softmax(output, dim=1)
        return log_probs

# hyperparameters
embedding_dim = 64
context_size = 2
hidden_dim = 128
batch_size = 1024
num_epoch = 10

# read the text data and build the training set
corpus, vocab = load_reuters()
dataset = NGramDataset(corpus, vocab, context_size)
data_loader = get_loader(dataset, batch_size)

# negative log-likelihood loss
nll_loss = nn.NLLLoss()
# build the model
device = torch.device("cpu")
model = FeedForwardNNLM(len(vocab), embedding_dim, context_size, hidden_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
total_losses = []
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        log_probs = model(inputs)
        loss = nll_loss(log_probs, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {total_loss:.2f}")
    total_losses.append(total_loss)

# save the word vectors
save_pretrained(vocab, model.embeddings.weight.data, "ffnnlm.vec")
Recurrent Neural Network Language Model
Data
Create the data class RnnlmDataset for the recurrent neural network language model; it builds and stores the training data.
Model
Create the recurrent neural network language model class RNNLM. It consists mainly of a word vector layer, a recurrent neural network (an LSTM here), and an output layer.
Training
import torch
from torch import nn, optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset
from tqdm import tqdm

# The shared utilities (Vocab, load_reuters, save_pretrained, get_loader) and the
# reserved-token constants are the same as in the previous listing and are assumed
# to be collected in utils.py, as in the later listings.
from utils import BOS_TOKEN, EOS_TOKEN, PAD_TOKEN
from utils import load_reuters, save_pretrained, get_loader

class RnnlmDataset(Dataset):
    def __init__(self, corpus, vocab):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        self.pad = vocab[PAD_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            # model input:  BOS_TOKEN, w_1, w_2, ..., w_n
            input = [self.bos] + sentence
            # model output: w_1, w_2, ..., w_n, EOS_TOKEN
            target = sentence + [self.eos]
            self.data.append((input, target))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        # build batched inputs/outputs from individual examples
        inputs = [torch.tensor(ex[0]) for ex in examples]
        targets = [torch.tensor(ex[1]) for ex in examples]
        # pad the examples in the batch to the same length
        inputs = pad_sequence(inputs, batch_first=True, padding_value=self.pad)
        targets = pad_sequence(targets, batch_first=True, padding_value=self.pad)
        return (inputs, targets)

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(RNNLM, self).__init__()
        # word-embedding layer
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # recurrent neural network: an LSTM here
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # output layer
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # hidden-layer representations for every time step
        hidden, _ = self.rnn(embeds)
        output = self.output(hidden)
        log_probs = torch.log_softmax(output, dim=2)
        return log_probs

embedding_dim = 64
hidden_dim = 128
batch_size = 1024
num_epoch = 10

# read the text data and build the RNNLM training set
corpus, vocab = load_reuters()
dataset = RnnlmDataset(corpus, vocab)
data_loader = get_loader(dataset, batch_size)

# negative log-likelihood loss, ignoring positions filled with the pad token
nll_loss = nn.NLLLoss(ignore_index=dataset.pad)
# build the RNNLM and move it to the device
device = torch.device('cpu')
model = RNNLM(len(vocab), embedding_dim, hidden_dim)
model.to(device)
# Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        log_probs = model(inputs)
        loss = nll_loss(log_probs.view(-1, log_probs.shape[-1]), targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {total_loss:.2f}")

save_pretrained(vocab, model.embeddings.weight.data, "rnnlm.vec")
Do not casually run this on a CPU: it will be saturated. Switch the device to a GPU when one is available (e.g., torch.device('cuda')).
From the perspective of word-vector learning, pretraining with a neural network language model has an obvious drawback: when predicting the word at time step t, the model uses only the history word sequence as input and therefore loses the co-occurrence information with the "future" context.
CBOW model (Continuous Bag-of-Words)
Given a piece of text, the basic idea of the CBOW model is to predict the target word from its context.
The task of the CBOW model is to predict the word wt at time step t from the context Ct within a window of a given size (with a window size of 5, Ct = {wt−2, wt−1, wt+1, wt+2}).
The CBOW model does not consider the position or order of the words in the context, so its input is effectively a "bag of words" rather than a sequence.
Compared with a general feed-forward neural network, the hidden layer of the CBOW model merely averages the word vectors, with no linear transformation or nonlinear activation. One can therefore regard the CBOW model as having no hidden layer at all, which is the main reason for its high training efficiency.
Input layer
Taking a context window of size 5 as an example, two words on each side of the target word wt are used as the model input. The input layer thus consists of 4 one-hot vectors whose dimension is the vocabulary size |V|.
Word vector layer
Each one-hot vector in the input layer is mapped into the word vector space by a matrix.
Output layer
The output layer predicts (classifies) the target word from the context representation, essentially as in the feed-forward neural network language model; the only difference is that the bias term of the linear transformation is dropped (see the equations below).
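Written out, with E the (input) word embedding matrix and E′ the output embedding matrix, as in the parameter-estimation paragraph below, CBOW computes:

v_{Ct} = (1 / |Ct|) Σ_{w ∈ Ct} E_w
P(wt | Ct) = exp(E′_{wt} · v_{Ct}) / Σ_{w′ ∈ V} exp(E′_{w′} · v_{Ct})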
Skip-gram model
The CBOW model uses the set of words in the context window as the conditioning input to predict the target word, i.e., P(wt | Ct), where Ct = {wt−k, ..., wt−1, wt+1, ..., wt+k}.
The Skip-gram model simplifies this further: each word in Ct is used independently as the context for predicting the target word, which amounts to predicting each context word from the current word.
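Under this view, each prediction is again a softmax over the vocabulary; using E and E′ as above and writing c for a single context word of wt, this is commonly written as:

P(c | wt) = exp(E′_c · E_{wt}) / Σ_{w′ ∈ V} exp(E′_{w′} · E_{wt})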
Parameter Estimation
The CBOW and Skip-gram models are trained by optimizing the classification loss; the parameters to be estimated are θ = {E, E′}.
Given a word sequence w1 w2 ... wT of length T:
the negative log-likelihood loss of the CBOW model is L(θ) = −Σ_{t=1..T} log P(wt | Ct);
the negative log-likelihood loss of the Skip-gram model is L(θ) = −Σ_{t=1..T} Σ_{−k≤j≤k, j≠0} log P(w_{t+j} | wt).
Predicting the current word from the context (the CBOW model);
predicting the context from the current word (the Skip-gram model).
When the vocabulary is large and computing resources are limited, training such models is hampered by the cost of the probability normalization (Normalization) over the output layer.
Negative sampling offers a different view of the task: given the current word and its context, maximize the probability that the two co-occur.
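For a word w with an observed context word c and K negative context words c̃_1, ..., c̃_K drawn from a noise distribution, the quantity maximized per (w, c) pair in skip-gram with negative sampling is (σ is the sigmoid function, v and v′ the word and context embeddings):

log σ(v′_c · v_w) + Σ_{k=1..K} log σ(−v′_{c̃_k} · v_w)

This is exactly what context_loss and neg_context_loss compute in the SGNS listing below.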
CBOW model
Skip-gram model
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset
from tqdm.auto import tqdm

from utils import BOS_TOKEN, EOS_TOKEN
from utils import load_reuters, save_pretrained, get_loader, init_weights

class SkipGramDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence)-1):
                # model input: the current word
                w = sentence[i]
                # model output: the context words within the window
                left_context_index = max(0, i - context_size)
                right_context_index = min(len(sentence), i + context_size)
                context = sentence[left_context_index:i] + sentence[i+1:right_context_index+1]
                self.data.extend([(w, c) for c in context])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        inputs = torch.tensor([ex[0] for ex in examples])
        targets = torch.tensor([ex[1] for ex in examples])
        return (inputs, targets)

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)
        init_weights(self)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        output = self.output(embeds)
        log_probs = F.log_softmax(output, dim=1)
        return log_probs

embedding_dim = 64
context_size = 2
batch_size = 1024
num_epoch = 10

# read the text data and build the Skip-gram training set
corpus, vocab = load_reuters()
dataset = SkipGramDataset(corpus, vocab, context_size=context_size)
data_loader = get_loader(dataset, batch_size)

nll_loss = nn.NLLLoss()
# build the Skip-gram model and move it to the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SkipGramModel(len(vocab), embedding_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        log_probs = model(inputs)
        loss = nll_loss(log_probs, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {total_loss:.2f}")

# save the word vectors (model.embeddings)
save_pretrained(vocab, model.embeddings.weight.data, "skipgram.vec")
Skip-gram Model with Negative Sampling
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset
from tqdm.auto import tqdm

from utils import BOS_TOKEN, EOS_TOKEN, PAD_TOKEN
from utils import load_reuters, save_pretrained, get_loader

class SGNSDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2, n_negatives=5, ns_dist=None):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        self.pad = vocab[PAD_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence)-1):
                # model input: (w, context) pairs; the label (0/1) indicates
                # whether a context word is a negative sample
                w = sentence[i]
                left_context_index = max(0, i - context_size)
                right_context_index = min(len(sentence), i + context_size)
                context = sentence[left_context_index:i] + sentence[i+1:right_context_index+1]
                context += [self.pad] * (2 * context_size - len(context))
                self.data.append((w, context))
        # number of negative samples
        self.n_negatives = n_negatives
        # negative-sampling distribution: uniform if ns_dist is None
        self.ns_dist = ns_dist if ns_dist is not None else torch.ones(len(vocab))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        words = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
        contexts = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
        batch_size, context_size = contexts.shape
        neg_contexts = []
        # draw negative samples for each example in the batch
        for i in range(batch_size):
            # make sure the negatives exclude the example's own context words
            ns_dist = self.ns_dist.index_fill(0, contexts[i], .0)
            neg_contexts.append(torch.multinomial(ns_dist, self.n_negatives * context_size, replacement=True))
        neg_contexts = torch.stack(neg_contexts, dim=0)
        return words, contexts, neg_contexts

class SGNSModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SGNSModel, self).__init__()
        # word embeddings
        self.w_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # context embeddings
        self.c_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward_w(self, words):
        w_embeds = self.w_embeddings(words)
        return w_embeds

    def forward_c(self, contexts):
        c_embeds = self.c_embeddings(contexts)
        return c_embeds

def get_unigram_distribution(corpus, vocab_size):
    # estimate the unigram distribution from the given corpus
    token_counts = torch.tensor([0] * vocab_size)
    total_count = 0
    for sentence in corpus:
        total_count += len(sentence)
        for token in sentence:
            token_counts[token] += 1
    unigram_dist = torch.div(token_counts.float(), total_count)
    return unigram_dist

embedding_dim = 64
context_size = 2
batch_size = 1024
num_epoch = 10
n_negatives = 10

# read the text data
corpus, vocab = load_reuters()
# estimate the unigram distribution
unigram_dist = get_unigram_distribution(corpus, len(vocab))
# negative-sampling distribution derived from the unigram distribution: p(w) ** 0.75
negative_sampling_dist = unigram_dist ** 0.75
negative_sampling_dist /= negative_sampling_dist.sum()
# build the SGNS training set
dataset = SGNSDataset(
    corpus,
    vocab,
    context_size=context_size,
    n_negatives=n_negatives,
    ns_dist=negative_sampling_dist
)
data_loader = get_loader(dataset, batch_size)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SGNSModel(len(vocab), embedding_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        words, contexts, neg_contexts = [x.to(device) for x in batch]
        optimizer.zero_grad()
        batch_size = words.shape[0]
        # word, context, and negative-sample embeddings for the batch
        word_embeds = model.forward_w(words).unsqueeze(dim=2)
        context_embeds = model.forward_c(contexts)
        neg_context_embeds = model.forward_c(neg_contexts)
        # classification log-likelihood of the positive samples
        context_loss = F.logsigmoid(torch.bmm(context_embeds, word_embeds).squeeze(dim=2))
        context_loss = context_loss.mean(dim=1)
        # classification log-likelihood of the negative samples
        neg_context_loss = F.logsigmoid(torch.bmm(neg_context_embeds, word_embeds).squeeze(dim=2).neg())
        neg_context_loss = neg_context_loss.view(batch_size, -1, n_negatives).sum(dim=2)
        neg_context_loss = neg_context_loss.mean(dim=1)
        # loss: negative log-likelihood
        loss = -(context_loss + neg_context_loss).mean()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {total_loss:.2f}")

# combine the word and context embedding matrices as the final pretrained vectors
combined_embeds = model.w_embeddings.weight + model.c_embeddings.weight
save_pretrained(vocab, combined_embeds.data, "sgns.vec")
Whether based on a neural network language model or on Word2vec, word-vector pretraining essentially uses the co-occurrence of words within local contexts of the text as the self-supervised learning signal.
Combining the idea of word vectors with that of matrix factorization gave rise to the GloVe (Global Vectors for Word Representation) model.
The basic idea of GloVe is to use word vectors to predict (or regress on) the entries of the word–context co-occurrence matrix, thereby performing an implicit matrix factorization.
GloVe is learned by minimizing the following weighted regression loss, where M_{w,c} is the (distance-weighted) co-occurrence count of word w and context c, v_w and v_c are their embeddings, b_w and b_c their biases, and f is a weighting function with f(x) = (x / m_max)^α for x < m_max and f(x) = 1 otherwise (m_max = 100 and α = 0.75 in the code below):
L(θ) = Σ_{(w,c)} f(M_{w,c}) (v_w · v_c + b_w + b_c − log M_{w,c})²
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from tqdm.auto import tqdm
from collections import defaultdict

from utils import BOS_TOKEN, EOS_TOKEN
from utils import load_reuters, save_pretrained, get_loader

class GloveDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        # record the co-occurrence counts of words and contexts in the corpus
        self.cooccur_counts = defaultdict(float)
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence)-1):
                w = sentence[i]
                left_contexts = sentence[max(0, i - context_size):i]
                right_contexts = sentence[i+1:min(len(sentence), i + context_size)+1]
                # co-occurrence counts decay with distance: 1 / d(w, c)
                for k, c in enumerate(left_contexts[::-1]):
                    self.cooccur_counts[(w, c)] += 1 / (k + 1)
                for k, c in enumerate(right_contexts):
                    self.cooccur_counts[(w, c)] += 1 / (k + 1)
        self.data = [(w, c, count) for (w, c), count in self.cooccur_counts.items()]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        words = torch.tensor([ex[0] for ex in examples])
        contexts = torch.tensor([ex[1] for ex in examples])
        counts = torch.tensor([ex[2] for ex in examples])
        return (words, contexts, counts)

class GloveModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(GloveModel, self).__init__()
        # word embeddings and biases
        self.w_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.w_biases = nn.Embedding(vocab_size, 1)
        # context embeddings and biases
        self.c_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.c_biases = nn.Embedding(vocab_size, 1)

    def forward_w(self, words):
        w_embeds = self.w_embeddings(words)
        w_biases = self.w_biases(words)
        return w_embeds, w_biases

    def forward_c(self, contexts):
        c_embeds = self.c_embeddings(contexts)
        c_biases = self.c_biases(contexts)
        return c_embeds, c_biases

embedding_dim = 64
context_size = 2
batch_size = 1024
num_epoch = 10
# hyperparameters controlling the sample weights
m_max = 100
alpha = 0.75

# build the GloVe training set from the text data
corpus, vocab = load_reuters()
dataset = GloveDataset(
    corpus,
    vocab,
    context_size=context_size
)
data_loader = get_loader(dataset, batch_size)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GloveModel(len(vocab), embedding_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        words, contexts, counts = [x.to(device) for x in batch]
        # word/context embeddings and biases for the batch
        word_embeds, word_biases = model.forward_w(words)
        context_embeds, context_biases = model.forward_c(contexts)
        # regression target; log(counts + 1) can be used for smoothing if necessary
        log_counts = torch.log(counts)
        # sample weights
        weight_factor = torch.clamp(torch.pow(counts / m_max, alpha), max=1.0)
        optimizer.zero_grad()
        # L2 loss for each sample in the batch; the (batch, 1) biases are squeezed
        # to (batch,) so the loss is computed per sample rather than broadcast
        loss = (torch.sum(word_embeds * context_embeds, dim=1)
                + word_biases.squeeze(-1) + context_biases.squeeze(-1)
                - log_counts) ** 2
        # weighted average of the sample losses
        wavg_loss = (weight_factor * loss).mean()
        wavg_loss.backward()
        optimizer.step()
        total_loss += wavg_loss.item()
    print(f"Loss: {total_loss:.2f}")

# combine the word and context embedding matrices as the final pretrained vectors
combined_embeds = model.w_embeddings.weight + model.c_embeddings.weight
save_pretrained(vocab, combined_embeds.data, "glove.vec")
Intrinsic evaluation: evaluates how well the word vectors express word relatedness or analogical reasoning.
Extrinsic evaluation: judges the vectors by the performance metrics of downstream tasks.
Word relatedness measures the quality of word vectors by how well they express semantic relatedness between words.
Word analogy is another commonly used intrinsic evaluation method. Analyzing how word vectors are distributed in the vector space shows that, for two word pairs (wa, wb) and (wc, wd) that share the same syntactic or semantic relation, the vectors approximately satisfy the geometric property v_wb − v_wa ≈ v_wd − v_wc.
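As a quick illustration, the following sketch reads back one of the .vec files saved above (sgns.vec is used here as an example) and answers an analogy query; the helper names are made up for this example:

import torch

def load_pretrained(path):
    # read the .vec format written by save_pretrained: a header line, then one token per line
    with open(path, "r") as f:
        num_tokens, dim = map(int, f.readline().split())
        tokens, vectors = [], []
        for line in f:
            parts = line.rstrip().split(" ")
            tokens.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
    return tokens, torch.tensor(vectors)

def analogy(tokens, embeds, wa, wb, wc, topk=5):
    # solve wa : wb :: wc : ? via v_wb - v_wa + v_wc, ranked by cosine similarity
    idx = {t: i for i, t in enumerate(tokens)}
    query = embeds[idx[wb]] - embeds[idx[wa]] + embeds[idx[wc]]
    sims = torch.cosine_similarity(query.unsqueeze(0), embeds, dim=1)
    best = sims.topk(topk + 3).indices.tolist()
    # skip the query words themselves
    return [tokens[i] for i in best if tokens[i] not in {wa, wb, wc}][:topk]

tokens, embeds = load_pretrained("sgns.vec")
print(analogy(tokens, embeds, "man", "king", "woman"))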
Pretrained word vectors can be used directly as word features in downstream tasks, or as model parameters that are fine-tuned (Fine-tuning) during downstream training. In most cases, both approaches effectively improve the model's generalization ability.