In the previous tutorial we covered NNLM. Although NNLM takes into account the words that precede a word, it has no way to account for the words that follow it, and it is computationally expensive. We can instead use CBOW, one of the two models in Word2vec.
Goal: predict the center word $w(t)$ from the words around it.
Objective function: $J = \sum_{w \in \text{corpus}} P(w \mid \text{context}(w))$
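In practice this objective is maximized in log space; a common restatement (the log form and the softmax expression below are standard Word2vec notation, not taken from the original) is:

```latex
% Maximize the log-probability of every center word given its context;
% the probability is a softmax over the whole vocabulary.
J = \sum_{w \in \text{corpus}} \log P\bigl(w \mid \text{context}(w)\bigr),
\qquad
P\bigl(w \mid \text{context}(w)\bigr)
  = \frac{\exp\left(u_w^{\top} h\right)}{\sum_{v=1}^{V} \exp\left(u_v^{\top} h\right)}
```

where $h$ is the averaged hidden vector produced by the projection step below and $u_w$ is the output-weight column belonging to word $w$.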
Input: the one-hot vectors of the context words. Assume the word vector space has dimension $V$ (the vocabulary size) and there are $C$ context words, so the input matrix has dimension $C \times V$.
PROJECTION: each context word's one-hot vector is multiplied by the input weight matrix $W$ ($V \times N$); the results are summed and averaged to give the hidden-layer vector of dimension $1 \times N$.
Output: multiply the hidden vector by the output weight matrix $W'$ ($N \times V$) to obtain an output vector of dimension $1 \times V$, i.e. the probability vector over the vocabulary.
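To make the three steps concrete, here is a minimal NumPy sketch of one CBOW forward pass (the names `W` and `W_prime` and the toy sizes are illustrative assumptions, not from the original):

```python
import numpy as np

V, N, C = 5, 3, 4                 # vocab size, hidden dim, number of context words (toy values)
rng = np.random.default_rng(0)

X = np.zeros((C, V))              # input: C x V matrix of one-hot context vectors
for row, word_idx in enumerate([0, 2, 3, 4]):
    X[row, word_idx] = 1.0

W = rng.normal(size=(V, N))       # input weight matrix  (V x N)
W_prime = rng.normal(size=(N, V)) # output weight matrix (N x V)

h = (X @ W).mean(axis=0, keepdims=True)        # PROJECTION: average of the C embeddings -> 1 x N
scores = h @ W_prime                           # output scores -> 1 x V
probs = np.exp(scores) / np.exp(scores).sum()  # softmax -> probability vector over the vocab
print(probs.shape)                             # (1, 5)
```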
Here is a complete PyTorch example:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

# Model hyperparameters
window_size = 2
embedding_dim = 100
hidden_dim = 128

# Data preprocessing
sentences = corpus.split()                             # tokenize on whitespace
words = list(set(sentences))                           # vocabulary
word_dict = {word: i for i, word in enumerate(words)}  # word -> index

# Build (context, target) training pairs
data = []
for i in range(window_size, len(sentences) - window_size):
    context = [sentences[i - 2], sentences[i - 1],
               sentences[i + 1], sentences[i + 2]]
    target = sentences[i]
    data.append((context, target))
print(data[:5])

# Turn a list of context words into a tensor of word indices
def make_content_vector(content, word_to_ix):
    idx = [word_to_ix[w] for w in content]
    return torch.LongTensor(idx)

# CBOW model: embed the context words, concatenate their embeddings,
# and map them through two linear layers to a distribution over the vocab
class CBOW(nn.Module):
    def __init__(self, vocab_size, n_dim, window_size, hidden_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, n_dim)
        self.linear1 = nn.Linear(2 * n_dim * window_size, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X):
        embeds = self.embedding(X).view(1, -1)  # concatenate the 2*window_size embeddings
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

# Training
model = CBOW(len(word_dict), embedding_dim, window_size, hidden_dim)
if torch.cuda.is_available():
    model = model.cuda()
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(500):
    total_loss = 0
    for context, target in data:
        content_vector = make_content_vector(context, word_dict)
        target = torch.tensor([word_dict[target]], dtype=torch.long)
        if torch.cuda.is_available():
            content_vector = content_vector.cuda()
            target = target.cuda()
        optimizer.zero_grad()
        log_probs = model(content_vector)
        loss = criterion(log_probs, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%03d' % (epoch + 1), 'cost =', '{:.6f}'.format(total_loss))
```
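Once training finishes, the learned word vectors are the rows of `model.embedding.weight`. Here is a small sketch of how they might be queried and how the model might be used to fill in a center word (the helpers `get_vector` and `predict_center` are my own additions, not part of the original):

```python
# Look up the trained embedding of a word.
def get_vector(word):
    idx = torch.LongTensor([word_dict[word]])
    if torch.cuda.is_available():
        idx = idx.cuda()
    return model.embedding(idx).detach()

# Predict the center word for a context of 2*window_size words,
# given in the same order used during training: [i-2, i-1, i+1, i+2].
def predict_center(context_words):
    vec = make_content_vector(context_words, word_dict)
    if torch.cuda.is_available():
        vec = vec.cuda()
    log_probs = model(vec)
    return words[log_probs.argmax(dim=1).item()]

print(get_vector("process").shape)                        # torch.Size([1, 100])
print(predict_center(["are", "about", "study", "the"]))   # expected: "to"
```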