1 NLP Classification: FastText

0 Data

https://download.csdn.net/download/qq_28611929/88580520?spm=1001.2014.3001.5503

Dataset: see "0 NLP: Data Acquisition and EDA" (CSDN blog)

Word-embedding file: embedding_SougouNews.npz

Vocabulary file: vocab.pkl

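1 Model

The model embeds each token with pretrained word vectors in the style of fastText, augments the input with hashed 2-gram and 3-gram features, and feeds the averaged result into a small MLP.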
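As a rough illustration of the 2-gram and 3-gram expansion, the sketch below hashes the token ids preceding each position into a fixed number of buckets, so that every n-gram feature maps to one row of its own embedding table. The multipliers and the bucket count are the ones used in the code in section 2; the helper name `ngram_bucket_ids` and the toy token ids are made up for this example.

```python
BUCKETS = 250499  # n-gram vocabulary size, same value as in the code in section 2


def bigram_hash(seq, t, buckets=BUCKETS):
    # Uses the token id just before position t (0 at the start of the sequence).
    t1 = seq[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets


def trigram_hash(seq, t, buckets=BUCKETS):
    # Uses the two token ids before position t (0 where they do not exist).
    t1 = seq[t - 1] if t - 1 >= 0 else 0
    t2 = seq[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets


def ngram_bucket_ids(token_ids):
    # Hypothetical helper: one bigram id and one trigram id per position.
    bigram = [bigram_hash(token_ids, i) for i in range(len(token_ids))]
    trigram = [trigram_hash(token_ids, i) for i in range(len(token_ids))]
    return bigram, trigram


token_ids = [12, 7, 305, 7, 9]  # made-up vocabulary indices
bigram, trigram = ngram_bucket_ids(token_ids)
print(bigram)
print(trigram)
```

Positions that share the same preceding token ids land in the same bucket, and unrelated n-grams can also collide; that is the price of keeping the n-gram tables at a fixed size.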

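fastText is an open-source library from Facebook AI Research for text classification and word-vector learning. It combines the strengths of the traditional bag-of-words model with those of neural networks and trains quickly on large text corpora.

Its main features are:

1. Fast training: fastText uses techniques such as hierarchical softmax and negative sampling, which greatly speed up training.

2. Subword embeddings: fastText represents a word by its character-level n-grams and treats them as subwords of that word, which helps with out-of-vocabulary and rare words.

3. Text classification: fastText provides a simple and efficient text-classification interface for training and predicting multi-class tasks.

4. Multilingual support: fastText supports many languages and can improve performance on cross-lingual tasks by learning shared word vectors.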
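To make the subword idea in item 2 concrete, here is a simplified sketch of character n-gram extraction. It wraps the word in `<` and `>` boundary markers and slides a window over it. The function name `char_ngrams` and its parameters are assumptions for this illustration, not the library's API, and the real library additionally hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word that never appeared in training still shares many of these n-grams with known words, which is why subword vectors help with rare and unseen words. The code in section 2 works at a different level: it hashes n-grams of token ids rather than characters inside a word.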

Note that fastText is mainly suited to text classification; other NLP tasks (such as named entity recognition or machine translation) may call for other models or methods.

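2 Code

nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)

`nn.Embedding.from_pretrained` is a PyTorch class method that builds an Embedding layer whose weights are loaded from pretrained word vectors.

It takes the pretrained word-vector matrix as its argument, with one row per vocabulary entry and one column per embedding dimension. The vectors can come from another model, such as Word2Vec or GloVe.

freeze parameter: whether the layer's weights are frozen. With freeze=True (the default) the vectors stay fixed during training; with freeze=False, as used here, they are fine-tuned together with the rest of the model.

emb = nn.Embedding.from_pretrained(weights)
y = emb(x)

x (input): word indices

y (output): the corresponding word vectors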
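A small runnable sketch of this lookup, assuming PyTorch is installed. The 4 x 3 weight matrix and the indices are toy values invented for the example; in this post the matrix comes from embedding_SougouNews.npz.

```python
import torch
import torch.nn as nn

# Toy "pretrained" matrix: 4 vocabulary entries, 3 dimensions (made-up values).
weights = torch.tensor([[0.0, 0.0, 0.0],
                        [0.1, 0.2, 0.3],
                        [0.4, 0.5, 0.6],
                        [0.7, 0.8, 0.9]])

frozen = nn.Embedding.from_pretrained(weights)                 # freeze=True by default
tunable = nn.Embedding.from_pretrained(weights, freeze=False)  # weights receive gradients

ids = torch.tensor([[1, 3, 0]])  # one sequence of three word indices
vectors = tunable(ids)           # one vector per index

print(vectors.shape)                                              # torch.Size([1, 3, 3])
print(frozen.weight.requires_grad, tunable.weight.requires_grad)  # False True
```

Each index simply selects the matching row of the matrix, so the output has one extra trailing dimension of size equal to the embedding width.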

import pickle as pkl
from collections import defaultdict

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

UNK, PAD = '<UNK>', '<PAD>'  # unknown-token and padding symbols
RANDOM_SEED = 2023
file_path = "./data/online_shopping_10_cats.csv"
vocab_file = "./data/vocab.pkl"
embedding_file = "./data/embedding_SougouNews.npz"

# Load the token-to-index vocabulary
with open(vocab_file, 'rb') as f:
    vocab = pkl.load(f)

class MyDataSet(Dataset):
    def __init__(self, df, vocab, pad_size=None):
        # Work on a copy so the caller's DataFrame is not modified
        df = df.copy()
        df['review'] = df['review'].apply(lambda x: str(x).strip())
        self.data_info = df[['review', 'label']].values
        self.vocab = vocab
        self.pad_size = pad_size
        self.buckets = 250499  # number of n-gram hash buckets

    def biGramHash(self, sequence, t):
        # Hash of the token id preceding position t
        t1 = sequence[t - 1] if t - 1 >= 0 else 0
        return (t1 * 14918087) % self.buckets

    def triGramHash(self, sequence, t):
        # Hash of the two token ids preceding position t
        t1 = sequence[t - 1] if t - 1 >= 0 else 0
        t2 = sequence[t - 2] if t - 2 >= 0 else 0
        return (t2 * 14918087 * 18408749 + t1 * 14918087) % self.buckets

    def __getitem__(self, item):
        result = {}
        view, label = self.data_info[item]
        result['view'] = view.strip()
        result['label'] = torch.tensor(label, dtype=torch.long)
        token = list(view.strip())  # character-level tokens
        seq_len = len(token)
        # Pad or truncate to pad_size
        if self.pad_size:
            if len(token) < self.pad_size:
                token.extend([PAD] * (self.pad_size - len(token)))
            else:
                token = token[:self.pad_size]
                seq_len = self.pad_size
        result['seq_len'] = seq_len
        # Map tokens to vocabulary indices
        words_line = [self.vocab.get(word, self.vocab.get(UNK)) for word in token]
        result['input_ids'] = torch.tensor(words_line, dtype=torch.long)
        # Hashed bigram and trigram features, one per position
        # (iterating over len(words_line) also works when pad_size is None)
        bigram = []
        trigram = []
        for i in range(len(words_line)):
            bigram.append(self.biGramHash(words_line, i))
            trigram.append(self.triGramHash(words_line, i))
        result['bigram'] = torch.tensor(bigram, dtype=torch.long)
        result['trigram'] = torch.tensor(trigram, dtype=torch.long)
        return result

    def __len__(self):
        return len(self.data_info)

df = pd.read_csv(file_path)

# 90% train, 5% validation, 5% test
df_train, df_test = train_test_split(df, test_size=0.1, random_state=RANDOM_SEED)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=RANDOM_SEED)
print(df_train.shape, df_val.shape, df_test.shape)

def create_data_loader(df, vocab, pad_size, batch_size=4):
    ds = MyDataSet(df, vocab, pad_size=pad_size)
    return DataLoader(ds, batch_size=batch_size)

MAX_LEN = 256
BATCH_SIZE = 4
train_data_loader = create_data_loader(df_train, vocab, pad_size=MAX_LEN, batch_size=BATCH_SIZE)
val_data_loader = create_data_loader(df_val, vocab, pad_size=MAX_LEN, batch_size=BATCH_SIZE)
test_data_loader = create_data_loader(df_test, vocab, pad_size=MAX_LEN, batch_size=BATCH_SIZE)

class Config(object):
    """Configuration parameters"""
    def __init__(self):
        self.model_name = 'FastText'
        self.embedding_pretrained = torch.tensor(
            np.load("./data/embedding_SougouNews.npz")["embeddings"].astype('float32'))  # pretrained word vectors
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # device
        self.dropout = 0.5  # dropout rate
        self.require_improvement = 1000  # stop early if there is no improvement after 1000 batches
        self.num_classes = 2  # number of classes
        self.n_vocab = 0  # vocabulary size, assigned at runtime
        self.num_epochs = 20  # number of epochs
        self.batch_size = 128  # mini-batch size
        self.learning_rate = 1e-4  # learning rate
        self.embed = self.embedding_pretrained.size(1) \
            if self.embedding_pretrained is not None else 300  # embedding dimension
        self.hidden_size = 256  # hidden-layer size
        self.n_gram_vocab = 250499  # n-gram vocabulary size

class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        self.embedding_ngram2 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.embedding_ngram3 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.dropout = nn.Dropout(config.dropout)
        self.fc1 = nn.Linear(config.embed * 3, config.hidden_size)
        # self.dropout2 = nn.Dropout(config.dropout)
        self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        out_word = self.embedding(x['input_ids'])
        out_bigram = self.embedding_ngram2(x['bigram'])
        out_trigram = self.embedding_ngram3(x['trigram'])
        # Concatenate along the feature axis, then average over the sequence
        out = torch.cat((out_word, out_bigram, out_trigram), -1)
        out = out.mean(dim=1)
        out = self.dropout(out)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out

config = Config()
model = Model(config)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

EPOCHS = 5  # number of training epochs
optimizer = AdamW(model.parameters(), lr=2e-4)
total_steps = len(train_data_loader) * EPOCHS
# schedule = get_linear_schedule_with_warmup(optimizer,num_warmup_steps=0,
#                                            num_training_steps=total_steps)
loss_fn = nn.CrossEntropyLoss().to(device)

def train_epoch(model, data_loader, loss_fn, optimizer, device, n_examples, scheduler=None):
    model = model.train()
    losses = []
    correct_predictions = 0
    for d in tqdm(data_loader):
        # Move every tensor in the batch to the model's device
        batch = {k: v.to(device) for k, v in d.items() if torch.is_tensor(v)}
        targets = batch['label']
        outputs = model(batch)
        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        losses.append(loss.item())
        correct_predictions += torch.sum(preds == targets).item()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
        optimizer.zero_grad()
    return correct_predictions / n_examples, np.mean(losses)

def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()  # evaluation mode (disables dropout)
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            # Move every tensor in the batch to the model's device
            batch = {k: v.to(device) for k, v in d.items() if torch.is_tensor(v)}
            targets = batch['label']
            outputs = model(batch)
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets).item()
            losses.append(loss.item())
    return correct_predictions / n_examples, np.mean(losses)

# train model
history = defaultdict(list)  # per-epoch loss and accuracy
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        loss_fn,
        optimizer,
        device,
        len(df_train)
    )
    print(f'Train loss {train_loss} accuracy {train_acc}')
    val_acc, val_loss = eval_model(
        model,
        val_data_loader,
        loss_fn,
        device,
        len(df_val)
    )
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)
    # Keep the checkpoint with the best validation accuracy
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc
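
One detail of `Model.forward` above: `out.mean(dim=1)` averages over all `pad_size` positions, padding included, so short reviews are diluted by `<PAD>` vectors. A possible alternative, not part of the original model, is to average only over the real tokens using the `seq_len` value the dataset already returns. The sketch below assumes PyTorch; `masked_mean` is a hypothetical helper name and the tensors are toy values.

```python
import torch


def masked_mean(embedded, seq_len):
    """Average over the first seq_len[i] positions of each sequence.

    embedded: (batch, max_len, dim) float tensor
    seq_len:  (batch,) integer tensor of true lengths
    """
    max_len = embedded.size(1)
    positions = torch.arange(max_len, device=embedded.device).unsqueeze(0)  # (1, max_len)
    mask = (positions < seq_len.unsqueeze(1)).unsqueeze(-1).to(embedded.dtype)  # (batch, max_len, 1)
    summed = (embedded * mask).sum(dim=1)    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)  # (batch, 1), avoids division by zero
    return summed / counts


# One sequence with two real tokens followed by one padding position.
embedded = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
seq_len = torch.tensor([2])
pooled = masked_mean(embedded, seq_len)
print(pooled)  # tensor([[2., 3.]])
```

Whether this helps on this dataset is untested here; it only shows how the padding positions could be excluded from the average.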

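Note: training on a CPU is very slow! Use a GPU if you have one. Feel free to message me privately if you have questions.

Training averaged roughly one epoch per hour: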

Epoch 1/10
----------
100%|██████████████████████████████████| 14124/14124 [10:25:00<00:00,  2.66s/it]
Train loss 0.30206009501767567 accuracy 0.9164365618804872
Val   loss 0.335533762476819 accuracy 0.9111181905065308

Epoch 2/10
----------
100%|███████████████████████████████████| 14124/14124 [1:40:00<00:00,  2.35it/s]
Train loss 0.2812397742334814 accuracy 0.924667233078448
Val   loss 0.33604823821747 accuracy 0.9114367633004141

Epoch 3/10
----------
100%|███████████████████████████████████| 14124/14124 [1:26:10<00:00,  2.73it/s]
Train loss 0.26351333512826924 accuracy 0.9319420843953554
Val   loss 0.3722937448388443 accuracy 0.9082510353615801
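
In the log above, validation loss rises from epoch 1 to epoch 3 while training loss keeps falling, and `Config.require_improvement` is defined but never used by the training loop. A minimal sketch of patience-based early stopping follows. The `EarlyStopper` class and the accuracy values are hypothetical, made up for this example, and it counts evaluation rounds rather than batches.

```python
class EarlyStopper:
    """Signals a stop once the metric has failed to improve `patience` times in a row."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.since_best = 0

    def step(self, metric):
        # Returns True when training should stop.
        if metric > self.best:
            self.best = metric
            self.since_best = 0
        else:
            self.since_best += 1
        return self.since_best >= self.patience


stopper = EarlyStopper(patience=2)
val_accs = [0.90, 0.91, 0.905, 0.908, 0.92]  # made-up validation accuracies
stopped_at = None
for epoch, acc in enumerate(val_accs, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 4 0.91
```

In the training loop, `stopper.step(val_acc)` could be called right after `eval_model`, breaking out of the epoch loop when it returns True.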