
PyTorch NLP Tutorial for Beginners (2): CBOW

In the previous tutorial we covered the NNLM. Although the NNLM accounts for the influence of the words preceding a target word, it cannot take the following words into account, and it is computationally expensive. Both problems are addressed by CBOW, one of the two Word2vec models.
[Figure: CBOW model architecture]
Goal: predict the center word $w(t)$ from the surrounding context words.

Objective function: $J = \sum_{w \in \text{corpus}} \log P(w \mid \text{context}(w))$, i.e. the total log-likelihood of every word in the corpus given its context.

Input: the one-hot vectors of the context words. Assuming the vocabulary (one-hot dimension) has size $V$ and there are $C$ context words, the input matrix has dimension $C \times V$.

PROJECTION: each context word's one-hot vector is multiplied by the input weight matrix $W$ ($V \times N$); the $C$ resulting vectors are summed and averaged, giving a hidden-layer vector of dimension $1 \times N$.

Output: the hidden vector is multiplied by the output weight matrix $W'$ ($N \times V$), producing a $1 \times V$ output vector; applying softmax to it gives the probability of each vocabulary word being the center word.
An example:
[Figures: worked CBOW example]
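To make the projection step concrete, here is a minimal sketch with assumed toy sizes ($V=5$ vocabulary words, $C=4$ context words, $N=3$ embedding dimensions); it averages the one-hot-selected rows of the input weight matrix exactly as described above:

import torch

V, C, N = 5, 4, 3                      # toy sizes: vocab, context words, embedding dim
W = torch.randn(V, N)                  # input weight matrix (the embedding table)
onehots = torch.eye(V)[[0, 2, 3, 4]]   # C one-hot context vectors, shape (C, V)
hidden = (onehots @ W).mean(dim=0, keepdim=True)  # projection output, shape (1, N)
print(hidden.shape)                    # torch.Size([1, 3])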

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

# Hyperparameters
window_size = 2
embedding_dim = 100
hidden_dim = 128

# Data preprocessing
sentences = corpus.split()  # naive whitespace tokenization
words = list(set(sentences))  # vocabulary
word_dict = {word: i for i, word in enumerate(words)}  # word -> index
data = []  # (context, target) training pairs
for i in range(window_size, len(sentences) - window_size):
    content = ([sentences[i - j] for j in range(window_size, 0, -1)]
               + [sentences[i + j] for j in range(1, window_size + 1)])
    target = sentences[i]
    data.append((content, target))
print(data[:5])
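# The first pair should look like:
#   (['We', 'are', 'to', 'study'], 'about')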

# Convert a list of context words into a tensor of their indices
def make_content_vector(content, word_to_ix):
    idx = [word_to_ix[w] for w in content]
    return torch.LongTensor(idx)
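# e.g. make_content_vector(['We', 'are', 'to', 'study'], word_dict) returns a
# LongTensor of 4 indices (the exact values depend on the set() ordering above)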

# CBOW model: embeds the context words, concatenates the embeddings, and feeds
# them through a two-layer MLP to score every vocabulary word
class CBOW(nn.Module):
    def __init__(self, vocab_size, n_dim, window_size, hidden_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, n_dim)
        # 2 * window_size context words, each an n_dim embedding, concatenated
        self.linear1 = nn.Linear(2 * n_dim * window_size, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X):
        # X: (2 * window_size,) word indices -> (1, 2 * window_size * n_dim)
        embeds = self.embedding(X).view(1, -1)
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
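# Note: this implementation concatenates the 2*window_size context embeddings.
# The original CBOW (Mikolov et al., 2013) instead averages them; a minimal
# sketch of that variant (it would need linear1 = nn.Linear(n_dim, hidden_dim)):
#     embeds = self.embedding(X).mean(dim=0, keepdim=True)  # shape (1, n_dim)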

# Training
model = CBOW(len(word_dict), embedding_dim, window_size, hidden_dim)
if torch.cuda.is_available():
    model = model.cuda()
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(500):
    total_loss = 0
    for content, target in data:
        content_vector = make_content_vector(content, word_dict)
        target = torch.tensor([word_dict[target]], dtype=torch.long)
        if torch.cuda.is_available():
            content_vector = content_vector.cuda()
            target = target.cuda()
        
        optimizer.zero_grad()
        
        log_probs = model(content_vector)
        loss = criterion(log_probs, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%03d' % (epoch + 1), 'cost =', '{:.6f}'.format(total_loss))
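After training, the learned word vectors live in model.embedding.weight and the model can be queried directly. A minimal usage sketch (reusing the names defined above; the predicted word depends on how training converged):

# Predict the center word for one of the training contexts
content, target = data[0]
content_vector = make_content_vector(content, word_dict)
if torch.cuda.is_available():
    content_vector = content_vector.cuda()
pred_idx = model(content_vector).argmax(dim=1).item()
print('context:', content, '-> predicted:', words[pred_idx], '| actual:', target)

# Look up the trained embedding of a single word
vec = model.embedding.weight[word_dict['computer']]  # shape: (embedding_dim,)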