
[NLP Fundamentals] 4. Classic NLP Models You Must Master


Table of Contents

I. Word2vec: Principles and Reimplementation

1. Background Review

2. The C&W Model

3. Directly Observable Features

4. The Skip-gram Model

5. Implementing Skip-gram in Code (Basic Version)

6. Word2vec Project Demo

II. Hands-on with BERT

III. The MLP Model in Practice

IV. The Vanilla RNN Model in Practice

V. Gated RNN Models in Practice


I. Word2vec: Principles and Reimplementation

Word2vec is widely regarded as a foundational work in NLP.

1. Background Review

2. The C&W Model

3. Directly Observable Features

Paper link: https://arxiv.org/pdf/1309.4168v1.pdf

4. The Skip-gram Model
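In skip-gram, each center word is used to predict the words within a window around it. Given a token sequence w_1, ..., w_T and a window size c, training maximizes the average log probability

    (1/T) * sum_{t=1..T} sum_{-c <= j <= c, j != 0} log p(w_{t+j} | w_t)

where p(w_O | w_I) is a softmax over the vocabulary computed from an input embedding matrix and an output embedding matrix (W and WT in the code below). The basic implementation in the next subsection uses a window of one word on each side and the full softmax; the original papers replace the full softmax with hierarchical softmax or negative sampling for efficiency.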

5. Implementing Skip-gram in Code (Basic Version)

Code:

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import matplotlib.pyplot as plt

    dtype = torch.FloatTensor

    # Toy corpus
    sentences = ["i like dog", "i like cat", "i like animal",
                 "dog cat animal", "apple cat dog like", "cat like fish",
                 "dog like meat", "i like apple", "i hate apple",
                 "i like movie book music apple", "dog like bark", "dog friend cat"]

    word_sequence = ' '.join(sentences).split()  # join all sentences with spaces, then split back into tokens
    word_list = list(set(word_sequence))         # deduplicate via a set
    word_dict = {w: i for i, w in enumerate(word_list)}  # word -> id

    # Build training pairs: (center word, context word), window of one word on each side
    skip_grams = []
    for i in range(1, len(word_sequence) - 1):
        target = word_dict[word_sequence[i]]  # id of the current (center) word
        context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]  # ids of the two context words
        for w in context:
            skip_grams.append([target, w])

    embedding_size = 2
    voc_size = len(word_list)
    batch_size = 2

    class Word2Vec(nn.Module):
        def __init__(self):
            super(Word2Vec, self).__init__()
            self.W = nn.Parameter(torch.rand(voc_size, embedding_size).type(dtype))   # input (one-hot) -> embedding
            self.WT = nn.Parameter(torch.rand(embedding_size, voc_size).type(dtype))  # embedding -> vocabulary logits

        def forward(self, x):
            hidden_layer = torch.matmul(x, self.W)
            output_layer = torch.matmul(hidden_layer, self.WT)
            return output_layer

    model = Word2Vec()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.0003)

    def random_batch(data, size):
        random_inputs = []
        random_labels = []
        random_index = np.random.choice(range(len(data)), size, replace=False)
        for i in random_index:
            random_inputs.append(np.eye(voc_size)[data[i][0]])  # one-hot vector for the center word
            random_labels.append(data[i][1])                    # id of the context word to predict
        return random_inputs, random_labels

    # Training loop
    for epoch in range(10000000):
        input_batch, target_batch = random_batch(skip_grams, batch_size)
        input_batch = torch.Tensor(input_batch)
        target_batch = torch.LongTensor(target_batch)

        optimizer.zero_grad()
        output = model(input_batch)
        loss = criterion(output, target_batch)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
        loss.backward()
        optimizer.step()

    # Plot the learned 2-D embeddings (rows of W), one point per word
    for i, label in enumerate(word_list):
        W, WT = model.parameters()
        x, y = float(W[i][0]), float(W[i][1])
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
    plt.show()

6. Word2vec Project Demo
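A small sketch of what such a demo can look like, reusing the model, word_dict, and word_list trained above (the helper names word_vector and most_similar_to are my own, not from the original): rank the vocabulary by cosine similarity to a query word.

    import torch.nn.functional as F

    # Assumes `model`, `word_dict`, and `word_list` from the training script above are in scope.
    def word_vector(word):
        """Look up the learned embedding (a row of W) for a word."""
        W, WT = model.parameters()
        return W[word_dict[word]]

    def most_similar_to(word, topn=3):
        """Rank all other words by cosine similarity to `word`."""
        query = word_vector(word)
        scores = []
        for other in word_list:
            if other == word:
                continue
            sim = F.cosine_similarity(query, word_vector(other), dim=0).item()
            scores.append((other, round(sim, 3)))
        return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

    print(most_similar_to('dog'))
    print(most_similar_to('apple'))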

 

II. Hands-on with BERT

Code:

    from transformers import BertModel, BertTokenizer
    import torch
    import torch.nn as nn

    sentence = 'i like eating apples very much'

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            # Pretrained BERT encoder and its matching tokenizer
            self.embedder = BertModel.from_pretrained('bert-base-cased', output_hidden_states=True)
            self.tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

        def forward(self, inputs):
            tokens = self.tokenizer.tokenize(inputs)                  # split the sentence into WordPiece tokens
            print(tokens)
            tokens_id = self.tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary ids
            print(tokens_id)
            tokens_id_tensor = torch.tensor(tokens_id).unsqueeze(0)   # add a batch dimension: (1, seq_len)
            outputs = self.embedder(tokens_id_tensor)
            print(outputs[0])                                         # last hidden state: (1, seq_len, 768)
            return outputs

    model = Model()
    results = model(sentence)
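Note that tokenize followed by convert_tokens_to_ids, as above, does not add the [CLS] and [SEP] special tokens that BERT was trained with. A minimal alternative sketch (mine, not from the original post) uses the tokenizer's call interface, which handles special tokens and the attention mask automatically:

    from transformers import BertModel, BertTokenizer
    import torch

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    bert = BertModel.from_pretrained('bert-base-cased')

    sentence = 'i like eating apples very much'
    encoded = tokenizer(sentence, return_tensors='pt')  # adds [CLS]/[SEP] and builds the attention mask

    with torch.no_grad():
        outputs = bert(**encoded)

    token_vectors = outputs.last_hidden_state  # (1, seq_len, 768): one contextual vector per token
    sentence_vector = token_vectors[:, 0]      # vector at the [CLS] position, a common sentence representation
    print(token_vectors.shape, sentence_vector.shape)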

III. The MLP Model in Practice
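A minimal MLP sketch in PyTorch for this section, under illustrative assumptions (the layer sizes, the two-class task, and the random batch are mine, not tied to a specific dataset): an MLP is just stacked linear layers with a nonlinearity between them, trained with the same loss/optimizer loop used above.

    import torch
    import torch.nn as nn
    import torch.optim as optim

    class MLP(nn.Module):
        """Input features -> hidden layer with ReLU -> class logits."""
        def __init__(self, input_dim=100, hidden_dim=32, num_classes=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, x):
            return self.net(x)

    model = MLP()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    # One training step on random data, just to show the shape of the loop
    x = torch.randn(8, 100)        # a batch of 8 feature vectors (e.g. bag-of-words or averaged embeddings)
    y = torch.randint(0, 2, (8,))  # random binary labels
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(loss.item())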

IV. The Vanilla RNN Model in Practice
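A minimal vanilla-RNN text-classifier sketch under similar illustrative assumptions (vocabulary size, dimensions, and the random token batch are mine): embed the token ids, feed them through nn.RNN, and classify from the final hidden state.

    import torch
    import torch.nn as nn

    class RNNClassifier(nn.Module):
        """Embed token ids, run a single-layer vanilla RNN, classify from the last hidden state."""
        def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
            outputs, hidden = self.rnn(embedded)  # hidden: (num_layers, batch, hidden_dim)
            return self.fc(hidden.squeeze(0))     # (batch, num_classes)

    model = RNNClassifier()
    batch = torch.randint(0, 1000, (4, 12))       # 4 sequences of 12 token ids each
    print(model(batch).shape)                     # torch.Size([4, 2])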

V. Gated RNN Models in Practice
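For the gated variants (GRU/LSTM), a sketch under the same illustrative assumptions: structurally, the only change from the vanilla RNN above is the recurrent cell, with nn.LSTM additionally returning a cell state; the gates help gradients flow over longer sequences.

    import torch
    import torch.nn as nn

    class GatedRNNClassifier(nn.Module):
        """Same shape as the vanilla RNN classifier, but with a gated recurrent cell (GRU or LSTM)."""
        def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=2, cell='lstm'):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            rnn_cls = nn.LSTM if cell == 'lstm' else nn.GRU
            self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)
            self.cell = cell

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)
            if self.cell == 'lstm':
                outputs, (hidden, cell_state) = self.rnn(embedded)  # LSTM also returns a cell state
            else:
                outputs, hidden = self.rnn(embedded)
            return self.fc(hidden.squeeze(0))

    model = GatedRNNClassifier(cell='lstm')
    batch = torch.randint(0, 1000, (4, 12))
    print(model(batch).shape)  # torch.Size([4, 2])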
