
Introduction to the CBOW and Skip-Gram Models and Their Python Implementations

Table of Contents

Preface

I. The CBOW Model

1. Introduction to the CBOW Model

2. Implementing the CBOW Model

II. The Skip-Gram Model

1. Introduction to the Skip-Gram Model

2. Implementing the Skip-Gram Model

Summary


Preface

This article implements word prediction with both the CBOW and Skip-Gram models. [Figure: structure diagrams of the two models]

I. The CBOW Model

1. Introduction to the CBOW Model

What the CBOW model does: given the x words on either side of a target position, it predicts the word in the middle (x, the number of words on each side, can be changed; the code below uses two words on each side to predict the middle word).

CBOW stands for Continuous Bag-of-Words Model; it takes the surrounding context (..., t - 1, t + 1, ...) into account and predicts the probability of the target (center) word given those context words. In the network, the Input layer holds the given context words, ${h_1,...,h_N}$ is the hidden-layer vector computed from their word vectors (also called the input word vectors), and the Output layer is the network's output; a softmax is applied over the Output layer to turn the outputs into the probability of each candidate word given the input.
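To make this concrete, here is a minimal NumPy sketch of a single CBOW forward pass (the vocabulary size, dimensions, and weights are illustrative assumptions, not the trained model built later in this article): the context word vectors are averaged into a hidden vector, projected to one score per vocabulary word, and normalized with a softmax.

import numpy as np

V, N = 6, 4                                 # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))              # input word vectors, one row per word
W_out = rng.normal(size=(N, V))             # output projection

context_ids = [1, 2, 4, 5]                  # indices of the surrounding words
h = W_in[context_ids].mean(axis=0)          # hidden layer: average of the context vectors
scores = h @ W_out                          # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # softmax over the output layer
print(probs.argmax())                       # index of the predicted center word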

2. CBOW模型实现

Step 1: Take an arbitrary passage of English text, tokenize it, collect the tokens into the set word, and build the index dictionaries word_to_ix and ix_to_word.

import torch
import torch.nn as nn

text = """People who truly loved once are far more likely to love again.
Difficult circumstances serve as a textbook of life for people.
The best preparation for tomorrow is doing your best today.
The reason why a great man is great is that he resolves to be a great man.
The shortest way to do many things is to only one thing at a time.
Only they who fulfill their duties in everyday matters will fulfill them on great occasions.
I go all out to deal with the ordinary life.
I can stand up once again on my own.
Never underestimate your power to change yourself.""".split()

word = set(text)                  # vocabulary: the set of distinct tokens
word_size = len(word)
word_to_ix = {word: ix for ix, word in enumerate(word)}
ix_to_word = {ix: word for ix, word in enumerate(word)}

   

Note: enumerate() is a Python built-in function.
"Enumerate" means to list items one by one.
Its argument is any iterable object, such as a list or a string.
It is mostly used in for loops to get a running counter: it yields the index and the value together, so use it whenever you need both, as in the example below.
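For example:

for ix, w in enumerate(['People', 'who', 'truly']):
    print(ix, w)
# 0 People
# 1 who
# 2 truly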

Step 2: Define the helper functions: make_context_vector turns a context into an input tensor, and the CBOW class defines the model;

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

EMDEDDING_DIM = 100  # word-vector (embedding) dimension

data = []
for i in range(2, len(text) - 2):
    context = [text[i - 2], text[i - 1],
               text[i + 1], text[i + 2]]
    target = text[i]
    data.append((context, target))

class CBOW(torch.nn.Module):
    def __init__(self, word_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(word_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, word_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        embeds = sum(self.embeddings(inputs)).view(1, -1)  # sum the context-word embeddings
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_emdedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)
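As a quick sanity check (the exact index values depend on the order in which the vocabulary set happens to be enumerated), the first (context, target) pair built from the text above is:

print(data[0])
# (['People', 'who', 'loved', 'once'], 'truly')
print(make_context_vector(data[0][0], word_to_ix))
# a length-4 tensor holding the indices of the four context words, dtype=torch.long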

Step 3: Build the model and define the loss function and optimizer;

model = CBOW(word_size, EMDEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
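Note that nn.NLLLoss expects log-probabilities as input, which is why the CBOW model ends in nn.LogSoftmax; the pair is equivalent to applying nn.CrossEntropyLoss to the raw scores. A tiny sketch with hypothetical tensors:

scores = torch.randn(1, 10)          # raw scores over a 10-word vocabulary
target = torch.tensor([3])
a = nn.CrossEntropyLoss()(scores, target)
b = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(scores), target)
print(torch.allclose(a, b))          # True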

Step 4: Start training;

for epoch in range(100):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()   # accumulate a plain float for monitoring, not a graph

Step 5: Make predictions: given the two preceding and the two following words, the model predicts the word in the middle;

# Prediction
context1 = ['preparation', 'for', 'is', 'doing']
context_vector1 = make_context_vector(context1, word_to_ix)
a = model(context_vector1)

context2 = ['People', 'who', 'loved', 'once']
context_vector2 = make_context_vector(context2, word_to_ix)
b = model(context_vector2)

print(f'Text data: {" ".join(text)}\n')
print(f'Context 1: {context1}\n')
print(f'Predicted word: {ix_to_word[torch.argmax(a[0]).item()]}')
print('\n')
print(f'Context 2: {context2}\n')
print(f'Predicted word: {ix_to_word[torch.argmax(b[0]).item()]}')


 

 

II. The Skip-Gram Model

1. Introduction to the Skip-Gram Model

What the Skip-Gram model does: given a single word, it returns the x words most likely to appear in that word's context (x, the number of returned words, can be changed; the code below returns the 4 most likely context words).

Like the continuous bag-of-words model (CBOW), the Skip-Gram model has three layers: an input layer, a projection layer, and an output layer.

In the Skip-Gram model, $w(t)$ is the input word; given $w(t)$, the model predicts its context words $w(t-n), \dots, w(t-2), w(t-1), w(t+1), w(t+2), \dots, w(t+n)$.
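As a rough illustration of what the model trains on, the sketch below enumerates the (center word, context word) pairs within a window of 2 for a toy tokenized sentence (the sentence is made up purely for illustration):

sentence = ["jack", "bought", "dictionary", "birthday", "present"]
window_size = 2
pairs = []
for i, center in enumerate(sentence):
    for j in range(i - window_size, i + window_size + 1):
        if j != i and 0 <= j < len(sentence):
            pairs.append((center, sentence[j]))
print(pairs[:4])
# [('jack', 'bought'), ('jack', 'dictionary'), ('bought', 'jack'), ('bought', 'dictionary')]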
 

2. Implementing the Skip-Gram Model

Step 1: Import the packages and define a set of helper functions (adapted from code found online);

These helpers are:

(1) A custom softmax function, also called the normalized exponential function, a generalization of the logistic function. It maps a K-dimensional vector of arbitrary real numbers to another K-dimensional real vector in which every element lies in (0, 1) and all elements sum to 1 (this description of softmax comes from littlemichelle's article at https://blog.csdn.net/weixin_31866177/article/details/82464617). A small example follows after this list.

(2) A custom word2vec class that implements the Skip-Gram model itself.

(3) A custom preprocessing function that splits the text into sentences and words and performs other preprocessing.

(4) A custom prepare_data_for_training function which, as its name suggests, converts the data into the format needed for training.

These are only brief explanations; if you want to dig deeper, just search for the code online.
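Here is the promised small example of the softmax behaviour described in (1), with toy numbers:

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e_x / e_x.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)          # every element lies in (0, 1)
print(p.sum())    # 1.0 (up to floating-point rounding)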

import numpy as np
import string
from nltk.corpus import stopwords

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

class word2vec(object):
    def __init__(self):
        self.N = 10              # embedding (hidden-layer) size
        self.X_train = []
        self.y_train = []
        self.window_size = 2
        self.alpha = 0.001       # learning rate
        self.words = []
        self.word_index = {}

    def initialize(self, V, data):
        self.V = V
        self.W = np.random.uniform(-0.8, 0.8, (self.V, self.N))
        self.W1 = np.random.uniform(-0.8, 0.8, (self.N, self.V))
        self.words = data
        for i in range(len(data)):
            self.word_index[data[i]] = i

    def feed_forward(self, X):
        self.h = np.dot(self.W.T, X).reshape(self.N, 1)
        self.u = np.dot(self.W1.T, self.h)
        self.y = softmax(self.u)
        return self.y

    def backpropagate(self, x, t):
        e = self.y - np.asarray(t).reshape(self.V, 1)   # e.shape is V x 1
        dLdW1 = np.dot(self.h, e.T)
        X = np.array(x).reshape(self.V, 1)
        dLdW = np.dot(X, np.dot(self.W1, e).T)
        self.W1 = self.W1 - self.alpha * dLdW1
        self.W = self.W - self.alpha * dLdW

    def train(self, epochs):
        for x in range(1, epochs + 1):
            self.loss = 0
            for j in range(len(self.X_train)):
                self.feed_forward(self.X_train[j])
                self.backpropagate(self.X_train[j], self.y_train[j])
                C = 0
                for m in range(self.V):
                    if self.y_train[j][m]:
                        self.loss += -1 * self.u[m][0]
                        C += 1
                self.loss += C * np.log(np.sum(np.exp(self.u)))
            print("epoch ", x, " loss = ", self.loss)
            self.alpha *= 1 / (1 + self.alpha * x)       # learning-rate decay

    def predict(self, word, number_of_predictions):
        if word in self.words:
            index = self.word_index[word]
            X = [0 for i in range(self.V)]
            X[index] = 1
            prediction = self.feed_forward(X)
            output = {}
            for i in range(self.V):
                output[prediction[i][0]] = i
            top_context_words = []
            for k in sorted(output, reverse=True):
                top_context_words.append(self.words[output[k]])
                if len(top_context_words) >= number_of_predictions:
                    break
            return top_context_words
        else:
            print("Word not found in dictionary")

def preprocessing(corpus):
    stop_words = set(stopwords.words('english'))
    training_data = []
    sentences = corpus.split(".")
    for i in range(len(sentences)):
        sentences[i] = sentences[i].strip()
        sentence = sentences[i].split()
        x = [word.strip(string.punctuation) for word in sentence
             if word not in stop_words]
        x = [word.lower() for word in x]
        training_data.append(x)
    return training_data

def prepare_data_for_training(sentences, w2v):
    data = {}
    for sentence in sentences:
        for word in sentence:
            if word not in data:
                data[word] = 1
            else:
                data[word] += 1
    V = len(data)
    data = sorted(list(data.keys()))
    vocab = {}
    for i in range(len(data)):
        vocab[data[i]] = i
    for sentence in sentences:
        for i in range(len(sentence)):
            center_word = [0 for x in range(V)]
            center_word[vocab[sentence[i]]] = 1
            context = [0 for x in range(V)]
            # +1 so that both sides of the window are covered
            for j in range(i - w2v.window_size, i + w2v.window_size + 1):
                if i != j and j >= 0 and j < len(sentence):
                    context[vocab[sentence[j]]] += 1
            w2v.X_train.append(center_word)
            w2v.y_train.append(context)
    w2v.initialize(V, data)
    return w2v.X_train, w2v.y_train
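For reference, prepare_data_for_training encodes each training input as a one-hot vector for the center word and each label as a multi-hot vector marking its context words. A hypothetical example with a 5-word vocabulary:

center_word = [0, 0, 1, 0, 0]   # one-hot input: the center word has index 2
context     = [0, 1, 0, 1, 0]   # multi-hot target: the context words have indices 1 and 3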

Step 2: Put together some text to form a tiny corpus, then train for 2000 epochs, printing the loss after every epoch; the loss keeps decreasing as the number of epochs grows;

corpus = ""
corpus += "Jack bought me a dictionary as a birthday present. Her father bought a book for her as a birthday present. "
corpus += "His teacher bought a car"
epochs = 2000

training_data = preprocessing(corpus)
w2v = word2vec()
prepare_data_for_training(training_data, w2v)
w2v.train(epochs)

 

Step 3: Predict a word's context. The predicted context of "bought" turns out to be car, book, father, and dictionary, which is essentially consistent with the corpus I wrote, so the prediction works reasonably well;

print(w2v.predict("bought",4))

 


Summary

The complete code for both models is given below.

Full Python implementation of the CBOW model with word prediction on the chosen text:

import torch
import torch.nn as nn

text = """People who truly loved once are far more likely to love again.
Difficult circumstances serve as a textbook of life for people.
The best preparation for tomorrow is doing your best today.
The reason why a great man is great is that he resolves to be a great man.
The shortest way to do many things is to only one thing at a time.
Only they who fulfill their duties in everyday matters will fulfill them on great occasions.
I go all out to deal with the ordinary life.
I can stand up once again on my own.
Never underestimate your power to change yourself.""".split()

word = set(text)
word_size = len(word)
word_to_ix = {word: ix for ix, word in enumerate(word)}
ix_to_word = {ix: word for ix, word in enumerate(word)}

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

EMDEDDING_DIM = 100

data = []
for i in range(2, len(text) - 2):
    context = [text[i - 2], text[i - 1],
               text[i + 1], text[i + 2]]
    target = text[i]
    data.append((context, target))

class CBOW(torch.nn.Module):
    def __init__(self, word_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(word_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        self.linear2 = nn.Linear(128, word_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_emdedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)

model = CBOW(word_size, EMDEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Training
for epoch in range(100):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

# Prediction
context1 = ['preparation', 'for', 'is', 'doing']
context_vector1 = make_context_vector(context1, word_to_ix)
a = model(context_vector1)

context2 = ['People', 'who', 'loved', 'once']
context_vector2 = make_context_vector(context2, word_to_ix)
b = model(context_vector2)

print(f'Text data: {" ".join(text)}\n')
print(f'Context 1: {context1}\n')
print(f'Predicted word: {ix_to_word[torch.argmax(a[0]).item()]}')
print('\n')
print(f'Context 2: {context2}\n')
print(f'Predicted word: {ix_to_word[torch.argmax(b[0]).item()]}')

Full Python implementation of the Skip-Gram model with word prediction on the chosen text:

import numpy as np
import string
from nltk.corpus import stopwords

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

class word2vec(object):
    def __init__(self):
        self.N = 10              # embedding (hidden-layer) size
        self.X_train = []
        self.y_train = []
        self.window_size = 2
        self.alpha = 0.001       # learning rate
        self.words = []
        self.word_index = {}

    def initialize(self, V, data):
        self.V = V
        self.W = np.random.uniform(-0.8, 0.8, (self.V, self.N))
        self.W1 = np.random.uniform(-0.8, 0.8, (self.N, self.V))
        self.words = data
        for i in range(len(data)):
            self.word_index[data[i]] = i

    def feed_forward(self, X):
        self.h = np.dot(self.W.T, X).reshape(self.N, 1)
        self.u = np.dot(self.W1.T, self.h)
        self.y = softmax(self.u)
        return self.y

    def backpropagate(self, x, t):
        e = self.y - np.asarray(t).reshape(self.V, 1)   # e.shape is V x 1
        dLdW1 = np.dot(self.h, e.T)
        X = np.array(x).reshape(self.V, 1)
        dLdW = np.dot(X, np.dot(self.W1, e).T)
        self.W1 = self.W1 - self.alpha * dLdW1
        self.W = self.W - self.alpha * dLdW

    def train(self, epochs):
        for x in range(1, epochs + 1):
            self.loss = 0
            for j in range(len(self.X_train)):
                self.feed_forward(self.X_train[j])
                self.backpropagate(self.X_train[j], self.y_train[j])
                C = 0
                for m in range(self.V):
                    if self.y_train[j][m]:
                        self.loss += -1 * self.u[m][0]
                        C += 1
                self.loss += C * np.log(np.sum(np.exp(self.u)))
            print("epoch ", x, " loss = ", self.loss)
            self.alpha *= 1 / (1 + self.alpha * x)       # learning-rate decay

    def predict(self, word, number_of_predictions):
        if word in self.words:
            index = self.word_index[word]
            X = [0 for i in range(self.V)]
            X[index] = 1
            prediction = self.feed_forward(X)
            output = {}
            for i in range(self.V):
                output[prediction[i][0]] = i
            top_context_words = []
            for k in sorted(output, reverse=True):
                top_context_words.append(self.words[output[k]])
                if len(top_context_words) >= number_of_predictions:
                    break
            return top_context_words
        else:
            print("Word not found in dictionary")

def preprocessing(corpus):
    stop_words = set(stopwords.words('english'))
    training_data = []
    sentences = corpus.split(".")
    for i in range(len(sentences)):
        sentences[i] = sentences[i].strip()
        sentence = sentences[i].split()
        x = [word.strip(string.punctuation) for word in sentence
             if word not in stop_words]
        x = [word.lower() for word in x]
        training_data.append(x)
    return training_data

def prepare_data_for_training(sentences, w2v):
    data = {}
    for sentence in sentences:
        for word in sentence:
            if word not in data:
                data[word] = 1
            else:
                data[word] += 1
    V = len(data)
    data = sorted(list(data.keys()))
    vocab = {}
    for i in range(len(data)):
        vocab[data[i]] = i
    for sentence in sentences:
        for i in range(len(sentence)):
            center_word = [0 for x in range(V)]
            center_word[vocab[sentence[i]]] = 1
            context = [0 for x in range(V)]
            # +1 so that both sides of the window are covered
            for j in range(i - w2v.window_size, i + w2v.window_size + 1):
                if i != j and j >= 0 and j < len(sentence):
                    context[vocab[sentence[j]]] += 1
            w2v.X_train.append(center_word)
            w2v.y_train.append(context)
    w2v.initialize(V, data)
    return w2v.X_train, w2v.y_train

corpus = ""
corpus += "Jack bought me a dictionary as a birthday present. Her father bought a book for her as a birthday present. "
corpus += "His teacher bought a car"
epochs = 2000

training_data = preprocessing(corpus)
w2v = word2vec()
prepare_data_for_training(training_data, w2v)
w2v.train(epochs)
print(w2v.predict("bought", 4))