
Word Embeddings (Embeddings)

Overview

While one-hot encoding allows us to preserve structural information, it does have two major disadvantages:

  • It scales linearly with the number of unique tokens in our vocabulary, which is a problem when we're dealing with a large corpus.
  • Each token's representation doesn't preserve any relationship with other tokens.

In this notebook, we'll motivate the need for embeddings and show how they address the shortcomings of one-hot encoding. The main idea of embeddings is to give the tokens in a text a fixed-length representation regardless of the number of tokens in the vocabulary. With one-hot encoding, each token is represented by an array of size vocab_size, but with embeddings each token has shape embed_dim. The values in the representation are no longer fixed binary values but changing floating points, allowing for fine-grained learned representations.
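As a quick illustration of the shape difference (with made-up sizes; this is a sketch, not part of the original notebook):

import torch
import torch.nn as nn

# Hypothetical sizes for illustration
vocab_size, embed_dim = 5000, 100

# One-hot: every token is a sparse vector of length vocab_size
one_hot = torch.zeros(vocab_size)
one_hot[42] = 1  # token with index 42

# Embedding: every token is a dense, learnable vector of length embed_dim
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
dense = embedding(torch.tensor([42]))
print (one_hot.shape, dense.shape)  # torch.Size([5000]) torch.Size([1, 100])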

  • Objective
    • Represent tokens in text in a way that captures the intrinsic semantic relationships.
  • Advantages
    • Low-dimensionality while capturing relationships.
    • Interpretable token representations.
  • Disadvantages
    • Can be computationally intensive to precompute.
  • Miscellaneous
    • There are lots of pretrained embeddings to choose from, but you can also train your own from scratch.

Learning Embeddings

We can learn embeddings by creating our own models in PyTorch, but first we'll use a library that specializes in embeddings and topic modeling called Gensim.

import nltk
nltk.download("punkt");
import numpy as np
import re
import urllib

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
SEED = 1234
# Set seed for reproducibility
np.random.seed(SEED)

# Split text into sentences
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
book = urllib.request.urlopen(url="https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/harrypotter.txt")
sentences = tokenizer.tokenize(str(book.read()))
print (f"{len(sentences)} sentences")
def preprocess(text):
    """Conditional preprocessing on our text."""
    # Lower
    text = text.lower()
    # Spacing and filters
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub("[^A-Za-z0-9]+", " ", text) # remove non alphanumeric chars
    text = re.sub(" +", " ", text) # remove multiple spaces
    text = text.strip()
    # Separate into word tokens
    text = text.split(" ")
    return text
# Preprocess sentences
print (sentences[11])
sentences = [preprocess(sentence) for sentence in sentences]
print (sentences[11])

Snape nodded, but did not elaborate.
['snape', 'nodded', 'but', 'did', 'not', 'elaborate']

But how do we learn embeddings in the first place? The intuition behind embeddings is that the definition of a token doesn't depend on the token itself but on its context. There are several different ways of doing this:

  1. Given the word in the context, predict the target word (CBOW - continuous bag of words).
  2. Given the target word, predict the context words (skip-gram).
  3. Given a sequence of words, predict the next word (LM - language modeling).

All of these approaches involve creating data to train our model on. Every word in a sentence becomes a target word, and the context words are determined by a window. With skip-gram and a window size of 2 (two words to the left and right of the target word), we repeat this process for every sentence in our corpus, which produces the training data for our unsupervised task. It's an unsupervised learning technique since we don't have official labels for contexts; the idea is that similar target words will appear in similar contexts, and we can learn this relationship by repeatedly training our model with (context, target) pairs. A small sketch of this pair-generation step follows below.
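A minimal sketch of generating (target, context) pairs with a window size of 2; the sentence is taken from the preprocessed sample above, and the helper logic is illustrative rather than gensim's internal implementation:

# Generate (target, context) training pairs with a window of 2
sentence = ["snape", "nodded", "but", "did", "not", "elaborate"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))  # (target, context)

print (pairs[:4])
# [('snape', 'nodded'), ('snape', 'but'), ('nodded', 'snape'), ('nodded', 'but')]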

We can learn embeddings with any of these approaches, and some work better than others. You can inspect the learned embeddings, but the best way to choose an approach is to empirically validate performance on a supervised task.

Word2Vec

When we have a large vocabulary to learn embeddings for, things can get complex very quickly. Recall that backpropagation through softmax updates both the correct and the incorrect class weights, which becomes a massive computation for every backward pass. A workaround is negative sampling, which only updates the correct class and a few arbitrary incorrect classes (NEGATIVE_SAMPLING=20). We're able to do this because of the large amount of training data, where we'll see the same word as the target class multiple times.
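For intuition, here is a rough sketch of the skip-gram with negative-sampling objective for a single (target, context) pair; the vectors and sampled negatives below are made-up placeholders, not gensim's internals:

import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_negatives = 100, 20

target_vec = rng.normal(size=embed_dim)                       # embedding of the target word
context_vec = rng.normal(size=embed_dim)                      # embedding of the true context word
negative_vecs = rng.normal(size=(num_negatives, embed_dim))   # sampled "noise" words

sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Pull the true (target, context) pair together and push the sampled negatives apart;
# only these 1 + num_negatives rows receive gradient updates, not the full softmax.
loss = -np.log(sigmoid(target_vec @ context_vec)) \
       - np.sum(np.log(sigmoid(-negative_vecs @ target_vec)))
print (loss)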

 

import gensim
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

EMBEDDING_DIM = 100
WINDOW = 5
MIN_COUNT = 3 # Ignores all words with total frequency lower than this
SKIP_GRAM = 1 # 0 = CBOW
NEGATIVE_SAMPLING = 20

# Super fast because of optimized C code under the hood
w2v = Word2Vec(
    sentences=sentences, size=EMBEDDING_DIM,
    window=WINDOW, min_count=MIN_COUNT,
    sg=SKIP_GRAM, negative=NEGATIVE_SAMPLING)
print (w2v)
Word2Vec(vocab=4937, size=100, alpha=0.025)
# Vector for each word
w2v.wv.get_vector("potter")

array([-0.11787166, -0.2702948 ,  0.24332453,  0.07497228, -0.5299148 ,
        0.17751476, -0.30183575,  0.17060578, -0.0342238 , -0.331856  ,
       -0.06467848,  0.02454215,  0.4524056 , -0.18918884, -0.22446074,
        0.04246538,  0.5784022 ,  0.12316586,  0.03419832,  0.12895502,
       -0.36260423,  0.06671549, -0.28563526, -0.06784113, -0.0838319 ,
        0.16225453,  0.24313857,  0.04139925,  0.06982274,  0.59947336,
        0.14201492, -0.00841052, -0.14700615, -0.51149386, -0.20590985,
        0.00435914,  0.04931103,  0.3382509 , -0.06798466,  0.23954925,
       -0.07505646, -0.50945646, -0.44729665,  0.16253233,  0.11114362,
        0.05604156,  0.26727834,  0.43738437, -0.2606872 ,  0.16259147,
       -0.28841105, -0.02349186,  0.00743417,  0.08558545, -0.0844396 ,
       -0.44747537, -0.30635086, -0.04186366,  0.11142804,  0.03187608,
        0.38674814, -0.2663519 ,  0.35415238,  0.094676  , -0.13586426,
       -0.35296437, -0.31428036, -0.02917303,  0.02518964, -0.59744245,
       -0.11500382,  0.15761602,  0.30535367, -0.06207089,  0.21460988,
        0.17566076,  0.46426776,  0.15573359,  0.3675553 , -0.09043553,
        0.2774392 ,  0.16967005,  0.32909656,  0.01422888,  0.4131812 ,
        0.20034142,  0.13722987,  0.10324971,  0.14308734,  0.23772323,
        0.2513108 ,  0.23396717, -0.10305202, -0.03343603,  0.14360961,
       -0.01891198,  0.11430877,  0.30017182, -0.09570111, -0.10692801],
      dtype=float32)

# Get nearest neighbors (excluding itself)
w2v.wv.most_similar(positive="scar", topn=5)

[('pain', 0.9274871349334717),
 ('forehead', 0.9020695686340332),
 ('heart', 0.8953317999839783),
 ('mouth', 0.8939940929412842),
 ('throat', 0.8922691345214844)]

# Saving and loading
w2v.wv.save_word2vec_format("model.bin", binary=True)
w2v = KeyedVectors.load_word2vec_format("model.bin", binary=True)

FastText

What happens when a word doesn't exist in our vocabulary? We can assign a UNK token that is used for all OOV (out of vocabulary) words, or we can use FastText, which uses character-level n-grams to embed a word. This helps embed rare words, misspelled words, and words that don't exist in our corpus but are similar to words that do.
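For intuition, a tiny sketch of the character n-grams that FastText-style models operate on; the boundary markers and the helper function below are illustrative, not gensim's internal API:

# Character n-grams with word-boundary symbols "<" and ">"
def char_ngrams(word, n_min=3, n_max=6):
    word = f"<{word}>"
    return [word[i:i+n] for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

print (char_ngrams("scarring")[:8])
# ['<sc', 'sca', 'car', 'arr', 'rri', 'rin', 'ing', 'ng>']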

from gensim.models import FastText
# Super fast because of optimized C code under the hood
ft = FastText(sentences=sentences, size=EMBEDDING_DIM,
    window=WINDOW, min_count=MIN_COUNT,
    sg=SKIP_GRAM, negative=NEGATIVE_SAMPLING)
print (ft)

FastText(vocab=4937, size=100, alpha=0.025)

# This word doesn't exist so the word2vec model will error out
w2v.wv.most_similar(positive="scarring", topn=5)

# FastText will use n-grams to embed an OOV word
ft.wv.most_similar(positive="scarring", topn=5)

[('sparkling', 0.9785991907119751),
 ('coiling', 0.9770463705062866),
 ('watering', 0.9759057760238647),
 ('glittering', 0.9756022095680237),
 ('dazzling', 0.9755154848098755)]

# Save and loading
ft.wv.save("model.bin")
ft = KeyedVectors.load("model.bin")

Pretrained Embeddings

We can learn embeddings from scratch using one of the approaches above, but we can also leverage pretrained embeddings that have been trained on millions of documents. Popular ones include Word2Vec (skip-gram) and GloVe (global word-word co-occurrence). We can validate that these embeddings captured meaningful semantic relationships by inspecting them below.

from gensim.scripts.glove2word2vec import glove2word2vec
from io import BytesIO
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from urllib.request import urlopen
from zipfile import ZipFile

# Arguments
EMBEDDING_DIM = 100

def plot_embeddings(words, embeddings, pca_results):
    for word in words:
        index = embeddings.index2word.index(word)
        plt.scatter(pca_results[index, 0], pca_results[index, 1])
        plt.annotate(word, xy=(pca_results[index, 0], pca_results[index, 1]))
    plt.show()

# Unzip the file (may take ~3-5 minutes)
resp = urlopen("http://nlp.stanford.edu/data/glove.6B.zip")
zipfile = ZipFile(BytesIO(resp.read()))
zipfile.namelist()

['glove.6B.50d.txt',
 'glove.6B.100d.txt',
 'glove.6B.200d.txt',
 'glove.6B.300d.txt']

# Write embeddings to file
embeddings_file = "glove.6B.{0}d.txt".format(EMBEDDING_DIM)
zipfile.extract(embeddings_file)
/content/glove.6B.100d.txt
# Preview of the GloVe embeddings file
with open(embeddings_file, "r") as fp:
    line = next(fp)
    values = line.split()
    word = values[0]
    embedding = np.asarray(values[1:], dtype='float32')
    print (f"word: {word}")
    print (f"embedding:\n{embedding}")
    print (f"embedding dim: {len(embedding)}")

word: the
embedding:
[-0.038194 -0.24487   0.72812  -0.39961   0.083172  0.043953 -0.39141   0.3344
 -0.57545   0.087459  0.28787  -0.06731   0.30906  -0.26384  -0.13231  -0.20757
  0.33395  -0.33848  -0.31743  -0.48336   0.1464   -0.37304   0.34577   0.052041
  0.44946  -0.46971   0.02628  -0.54155  -0.15518  -0.14107  -0.039722  0.28277
  0.14393   0.23464  -0.31021   0.086173  0.20397   0.52624   0.17164  -0.082378
 -0.71787  -0.41531   0.20335  -0.12763   0.41367   0.55187   0.57908  -0.33477
 -0.36559  -0.54857  -0.062892  0.26584   0.30205   0.99775  -0.80481  -3.0243
  0.01254  -0.36942   2.2167    0.72201  -0.24978   0.92136   0.034514  0.46745
  1.1079   -0.19358  -0.074575  0.23353  -0.052062 -0.22044   0.057162 -0.15806
 -0.30798  -0.41625   0.37972   0.15006  -0.53212  -0.2055   -1.2526    0.071624
  0.70565   0.49744  -0.42063   0.26148  -1.538    -0.30223  -0.073438 -0.28312
  0.37104  -0.25217   0.016215 -0.017099 -0.38984   0.87424  -0.72569  -0.51058
 -0.52028  -0.1459    0.8278    0.27062 ]
embedding dim: 100
# Save GloVe embeddings to local directory in word2vec format
word2vec_output_file = "{0}.word2vec".format(embeddings_file)
glove2word2vec(embeddings_file, word2vec_output_file)
(400000, 100)
# Load embeddings (may take a minute)
glove = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# (king - man) + woman = ?
# king - man = ? - woman
glove.most_similar(positive=["woman", "king"], negative=["man"], topn=5)

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755735874176025),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6532054)]

# Get nearest neighbors (excluding itself)
glove.wv.most_similar(positive="goku", topn=5)

[('gohan', 0.7246542572975159),
 ('bulma', 0.6497020125389099),
 ('raistlin', 0.6443604230880737),
 ('skaar', 0.6316742897033691),
 ('guybrush', 0.6239)]
# Reduce dimensionality for plotting
X = glove[glove.wv.vocab]
pca = PCA(n_components=2)
pca_results = pca.fit_transform(X)

# Visualize
plot_embeddings(
    words=["king", "queen", "man", "woman"], embeddings=glove,
    pca_results=pca_results)

# Bias in embeddings
glove.most_similar(positive=["woman", "doctor"], negative=["man"], topn=5)

 

[('nurse', 0.7735227346420288),
 ('physician', 0.7189429998397827),
 ('doctor', 0.6824328303337097),
 ('patient', 0.6750682592391968),
 ('dentist', 0.65520)]

Setup

Let's set our seed and device for our main task.

import numpy as np
import pandas as pd
import random
import torch
import torch.nn as nn
SEED = 1234

def set_seeds(seed=1234):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # multi-GPU

# Set seeds for reproducibility
set_seeds(seed=SEED)
# Set device
cuda = True
device = torch.device("cuda" if (
    torch.cuda.is_available() and cuda) else "cpu")
torch.set_default_tensor_type("torch.FloatTensor")
if device.type == "cuda":
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
print (device)

cuda

Load Data

We'll download the AG News dataset, which consists of 120K text samples from 4 unique classes (Business, Sci/Tech, Sports, World).

# Load data
url = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/news.csv"
df = pd.read_csv(url, header=0) # load
df = df.sample(frac=1).reset_index(drop=True) # shuffle
df.head()

    title                                                  category
0   Sharon Accepts Plan to Reduce Gaza Army Operation...  World
1   Internet Key Battleground in Wildlife Crime Fight     Sci/Tech
2   July Durable Good Orders Rise 1.7 Percent             Business
3   Growing Signs of a Slowing on Wall Street             Business
4   The New Faces of Reality TV                           World

Preprocessing

We'll first clean up our input data by performing operations such as lowercasing the text, removing stop (filler) words, applying filters using regular expressions, etc.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

nltk.download("stopwords")
STOPWORDS = stopwords.words("english")
print (STOPWORDS[:5])
porter = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we']

def preprocess(text, stopwords=STOPWORDS):
    """Conditional preprocessing on our text unique to our task."""
    # Lower
    text = text.lower()
    # Remove stopwords
    pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
    text = pattern.sub("", text)
    # Remove words in parenthesis
    text = re.sub(r"\([^)]*\)", "", text)
    # Spacing and filters
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub("[^A-Za-z0-9]+", " ", text) # remove non alphanumeric chars
    text = re.sub(" +", " ", text) # remove multiple spaces
    text = text.strip()
    return text
# Sample
text = "Great week for the NYSE!"
preprocess(text=text)

'great week nyse'
# Apply to dataframe
preprocessed_df = df.copy()
preprocessed_df.title = preprocessed_df.title.apply(preprocess)
print (f"{df.title.values[0]}\n\n{preprocessed_df.title.values[0]}")

Sharon Accepts Plan to Reduce Gaza Army Operation, Haaretz Says

sharon accepts plan reduce gaza army operation haaretz says

Warning

If you have preprocessing steps that involve learned statistics, such as computing standardization, you need to separate the training and test sets before applying those operations. This is because we cannot accidentally apply any knowledge gained from the test set during preprocessing/training (data leakage). However, for global preprocessing steps like the function above, where we don't learn anything from the data itself, we can perform them before splitting the data.

Split Data

import collections
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.7
VAL_SIZE = 0.15
TEST_SIZE = 0.15

def train_val_test_split(X, y, train_size):
    """Split dataset into data splits."""
    X_train, X_, y_train, y_ = train_test_split(X, y, train_size=TRAIN_SIZE, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, stratify=y_)
    return X_train, X_val, X_test, y_train, y_val, y_test
# Data
X = preprocessed_df["title"].values
y = preprocessed_df["category"].values

# Create data splits
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X=X, y=y, train_size=TRAIN_SIZE)
print (f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print (f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print (f"Sample point: {X_train[0]} → {y_train[0]}")

X_train: (84000,), y_train: (84000,)
X_val: (18000,), y_val: (18000,)
X_test: (18000,), y_test: (18000,)
Sample point: china north korea nuclear talks → World

Label Encoding

Next we'll define a LabelEncoder to encode our text labels into unique indices.

import itertools

class LabelEncoder(object):
    """Label encoder for tag labels."""
    def __init__(self, class_to_index={}):
        self.class_to_index = class_to_index or {}  # mutable defaults ;)
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())

    def __len__(self):
        return len(self.class_to_index)

    def __str__(self):
        return f"<LabelEncoder(num_classes={len(self)})>"

    def fit(self, y):
        classes = np.unique(y)
        for i, class_ in enumerate(classes):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self

    def encode(self, y):
        encoded = np.zeros((len(y)), dtype=int)
        for i, item in enumerate(y):
            encoded[i] = self.class_to_index[item]
        return encoded

    def decode(self, y):
        classes = []
        for i, item in enumerate(y):
            classes.append(self.index_to_class[item])
        return classes

    def save(self, fp):
        with open(fp, "w") as fp:
            contents = {'class_to_index': self.class_to_index}
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, "r") as fp:
            kwargs = json.load(fp=fp)
        return cls(**kwargs)
# Encode
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
NUM_CLASSES = len(label_encoder)
label_encoder.class_to_index

{'Business': 0, 'Sci/Tech': 1, 'Sports': 2, 'World': 3}
# Convert labels to tokens
print (f"y_train[0]: {y_train[0]}")
y_train = label_encoder.encode(y_train)
y_val = label_encoder.encode(y_val)
y_test = label_encoder.encode(y_test)
print (f"y_train[0]: {y_train[0]}")

y_train[0]: World
y_train[0]: 3
# Class weights
counts = np.bincount(y_train)
class_weights = {i: 1.0/count for i, count in enumerate(counts)}
print (f"counts: {counts}\nweights: {class_weights}")

counts: [21000 21000 21000 21000]
weights: {0: 4.761904761904762e-05, 1: 4.761904761904762e-05, 2: 4.761904761904762e-05, 3: 4.761904761904762e-05}

Tokenizer

We'll define a Tokenizer to convert our text input data into token indices.

import json
from collections import Counter
from more_itertools import take
class Tokenizer(object):
    def __init__(self, char_level, num_tokens=None,
                 pad_token="<PAD>", oov_token="<UNK>",
                 token_to_index=None):
        self.char_level = char_level
        self.separator = "" if self.char_level else " "
        if num_tokens: num_tokens -= 2 # pad + unk tokens
        self.num_tokens = num_tokens
        self.pad_token = pad_token
        self.oov_token = oov_token
        if not token_to_index:
            token_to_index = {pad_token: 0, oov_token: 1}
        self.token_to_index = token_to_index
        self.index_to_token = {v: k for k, v in self.token_to_index.items()}

    def __len__(self):
        return len(self.token_to_index)

    def __str__(self):
        return f"<Tokenizer(num_tokens={len(self)})>"

    def fit_on_texts(self, texts):
        if not self.char_level:
            texts = [text.split(" ") for text in texts]
        all_tokens = [token for text in texts for token in text]
        counts = Counter(all_tokens).most_common(self.num_tokens)
        self.min_token_freq = counts[-1][1]
        for token, count in counts:
            index = len(self)
            self.token_to_index[token] = index
            self.index_to_token[index] = token
        return self

    def texts_to_sequences(self, texts):
        sequences = []
        for text in texts:
            if not self.char_level:
                text = text.split(" ")
            sequence = []
            for token in text:
                sequence.append(self.token_to_index.get(
                    token, self.token_to_index[self.oov_token]))
            sequences.append(np.asarray(sequence))
        return sequences

    def sequences_to_texts(self, sequences):
        texts = []
        for sequence in sequences:
            text = []
            for index in sequence:
                text.append(self.index_to_token.get(index, self.oov_token))
            texts.append(self.separator.join([token for token in text]))
        return texts

    def save(self, fp):
        with open(fp, "w") as fp:
            contents = {
                "char_level": self.char_level,
                "oov_token": self.oov_token,
                "token_to_index": self.token_to_index
            }
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, "r") as fp:
            kwargs = json.load(fp=fp)
        return cls(**kwargs)

Warning

It's important that we fit the tokenizer using only our training data split, because during inference our model won't always know every token, and it's important to replicate that scenario with our validation and test splits as well.

# Tokenize
tokenizer = Tokenizer(char_level=False, num_tokens=5000)
tokenizer.fit_on_texts(texts=X_train)
VOCAB_SIZE = len(tokenizer)
print (tokenizer)

<Tokenizer(num_tokens=5000)>
# Sample of tokens
print (take(5, tokenizer.token_to_index.items()))
print (f"least freq token's freq: {tokenizer.min_token_freq}") # use this to adjust num_tokens

[('<PAD>', 0), ('<UNK>', 1), ('39', 2), ('b', 3), ('gt', 4)]
least freq token's freq: 14
# Convert texts to sequences of indices
X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)
X_test = tokenizer.texts_to_sequences(X_test)
preprocessed_text = tokenizer.sequences_to_texts([X_train[0]])[0]
print ("Text to indices:\n"
    f"  (preprocessed) → {preprocessed_text}\n"
    f"  (tokenized) → {X_train[0]}")

Text to indices:
  (preprocessed) → nba wrap neal <UNK> 40 heat <UNK> wizards
  (tokenized) → [ 299  359 3869    1 1648  734    1 2021]

Embedding Layer

We can embed our inputs using PyTorch's embedding layer.

# Input
vocab_size = 10
x = torch.randint(high=vocab_size, size=(1,5))
print (x)
print (x.shape)

tensor([[2, 6, 5, 2, 6]])
torch.Size([1, 5])
# Embedding layer
embeddings = nn.Embedding(embedding_dim=100, num_embeddings=vocab_size)
print (embeddings.weight.shape)

torch.Size([10, 100])

# Embed the input
embeddings(x).shape

torch.Size([1, 5, 100])

Each token in the input is represented via an embedding (all out-of-vocabulary (OOV) tokens are given the embedding of the UNK token). In the model below, we'll see how to set these embeddings to pretrained GloVe embeddings and how to choose whether to freeze them (fix the embedding weights) during training or not.
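As a small aside, PyTorch also offers nn.Embedding.from_pretrained for this; the sketch below uses a random matrix in place of GloVe just to show the freeze flag (the CNN model later in this notebook initializes the weights via _weight instead):

import numpy as np
import torch
import torch.nn as nn

# Random stand-in for a (vocab_size, embed_dim) pretrained matrix
pretrained = np.random.randn(5000, 100).astype(np.float32)
weights = torch.from_numpy(pretrained)

frozen = nn.Embedding.from_pretrained(weights, freeze=True, padding_idx=0)
fine_tuned = nn.Embedding.from_pretrained(weights.clone(), freeze=False, padding_idx=0)
print (frozen.weight.requires_grad, fine_tuned.weight.requires_grad)  # False True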

Padding

Our inputs all have different lengths, but we need each batch to be uniformly shaped. Therefore, we'll use padding to make all the inputs in a batch the same length. Our padding index will be 0 (note that this is consistent with the <PAD> token defined in our Tokenizer).

While embedding our input tokens will create a batch of shape (N, max_seq_len, embed_dim), we only need to provide a 2D matrix (N, max_seq_len) to use embeddings in PyTorch.

def pad_sequences(sequences, max_seq_len=0):
    """Pad sequences to max length in sequence."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    padded_sequences = np.zeros((len(sequences), max_seq_len))
    for i, sequence in enumerate(sequences):
        padded_sequences[i][:len(sequence)] = sequence
    return padded_sequences
# 2D sequences
padded = pad_sequences(X_train[0:3])
print (padded.shape)
print (padded)

(3, 8)
[[2.990e+02 3.590e+02 3.869e+03 1.000e+00 1.648e+03 7.340e+02 1.000e+00
  2.021e+03]
 [4.977e+03 1.000e+00 8.070e+02 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00]
 [5.900e+01 1.213e+03 1.160e+02 4.042e+03 2.040e+02 4.190e+02 1.000e+00
  0.000e+00]]

Dataset

We'll create Datasets and DataLoaders to be able to efficiently create batches with our data splits.

FILTER_SIZES = list(range(1, 4)) # uni, bi and tri grams

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X, y, max_filter_size):
        self.X = X
        self.y = y
        self.max_filter_size = max_filter_size

    def __len__(self):
        return len(self.y)

    def __str__(self):
        return f"<Dataset(N={len(self)})>"

    def __getitem__(self, index):
        X = self.X[index]
        y = self.y[index]
        return [X, y]

    def collate_fn(self, batch):
        """Processing on a batch."""
        # Get inputs
        batch = np.array(batch)
        X = batch[:, 0]
        y = batch[:, 1]
        # Pad sequences
        X = pad_sequences(X)
        # Cast
        X = torch.LongTensor(X.astype(np.int32))
        y = torch.LongTensor(y.astype(np.int32))
        return X, y

    def create_dataloader(self, batch_size, shuffle=False, drop_last=False):
        return torch.utils.data.DataLoader(
            dataset=self, batch_size=batch_size, collate_fn=self.collate_fn,
            shuffle=shuffle, drop_last=drop_last, pin_memory=True)
# Create datasets
max_filter_size = max(FILTER_SIZES)
train_dataset = Dataset(X=X_train, y=y_train, max_filter_size=max_filter_size)
val_dataset = Dataset(X=X_val, y=y_val, max_filter_size=max_filter_size)
test_dataset = Dataset(X=X_test, y=y_test, max_filter_size=max_filter_size)
print ("Datasets:\n"
    f"  Train dataset:{train_dataset.__str__()}\n"
    f"  Val dataset: {val_dataset.__str__()}\n"
    f"  Test dataset: {test_dataset.__str__()}\n"
    "Sample point:\n"
    f"  X: {train_dataset[0][0]}\n"
    f"  y: {train_dataset[0][1]}")

Datasets:
  Train dataset:<Dataset(N=84000)>
  Val dataset: <Dataset(N=18000)>
  Test dataset: <Dataset(N=18000)>
Sample point:
  X: [ 299  359 3869    1 1648  734    1 2021]
  y: 2
# Create dataloaders
batch_size = 64
train_dataloader = train_dataset.create_dataloader(batch_size=batch_size)
val_dataloader = val_dataset.create_dataloader(batch_size=batch_size)
test_dataloader = test_dataset.create_dataloader(batch_size=batch_size)
batch_X, batch_y = next(iter(train_dataloader))
print ("Sample batch:\n"
    f"  X: {list(batch_X.size())}\n"
    f"  y: {list(batch_y.size())}\n"
    "Sample point:\n"
    f"  X: {batch_X[0]}\n"
    f"  y: {batch_y[0]}")

Sample batch:
  X: [64, 9]
  y: [64]
Sample point:
  X: tensor([ 299,  359, 3869,    1, 1648,  734,    1, 2021,    0], device="cpu")
  y: 2

Model

We'll use a convolutional neural network on top of our embedded tokens to extract meaningful spatial signal. This time we'll use several filter widths to act as n-gram feature extractors.

Let's walk through the model's forward pass.

  1. We'll first tokenize our inputs (batch_size, max_seq_len).
  2. Then we'll embed our tokenized inputs (batch_size, max_seq_len, embedding_dim).
  3. We'll apply convolution via filters (filter_size, embedding_dim, num_filters) on our embedded inputs. Our filters act as n-gram detectors: we have three different filter sizes (1, 2 and 3), which act as uni-gram, bi-gram and tri-gram feature extractors, respectively.
  4. We'll apply 1D global max pooling, which extracts the most relevant information from the feature maps for making the decision.
  5. We'll feed the pooled outputs to a fully-connected (FC) layer (with dropout).
  6. We'll use one more FC layer with softmax to derive class probabilities.

import math
import torch.nn.functional as F

EMBEDDING_DIM = 100
HIDDEN_DIM = 100
DROPOUT_P = 0.1
class CNN(nn.Module):
    def __init__(self, embedding_dim, vocab_size, num_filters,
                 filter_sizes, hidden_dim, dropout_p, num_classes,
                 pretrained_embeddings=None, freeze_embeddings=False,
                 padding_idx=0):
        super(CNN, self).__init__()

        # Filter sizes
        self.filter_sizes = filter_sizes

        # Initialize embeddings
        if pretrained_embeddings is None:
            self.embeddings = nn.Embedding(
                embedding_dim=embedding_dim, num_embeddings=vocab_size,
                padding_idx=padding_idx)
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.embeddings = nn.Embedding(
                embedding_dim=embedding_dim, num_embeddings=vocab_size,
                padding_idx=padding_idx, _weight=pretrained_embeddings)

        # Freeze embeddings or not
        if freeze_embeddings:
            self.embeddings.weight.requires_grad = False

        # Conv weights
        self.conv = nn.ModuleList(
            [nn.Conv1d(in_channels=embedding_dim,
                       out_channels=num_filters,
                       kernel_size=f) for f in filter_sizes])

        # FC weights
        self.dropout = nn.Dropout(dropout_p)
        self.fc1 = nn.Linear(num_filters*len(filter_sizes), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, inputs, channel_first=False):
        # Embed
        x_in, = inputs
        x_in = self.embeddings(x_in)

        # Rearrange input so num_channels is in dim 1 (N, C, L)
        if not channel_first:
            x_in = x_in.transpose(1, 2)

        # Conv outputs
        z = []
        max_seq_len = x_in.shape[2]
        for i, f in enumerate(self.filter_sizes):
            # `SAME` padding
            padding_left = int((self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2)
            padding_right = int(math.ceil((self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2))
            # Conv + pool
            _z = self.conv[i](F.pad(x_in, (padding_left, padding_right)))
            _z = F.max_pool1d(_z, _z.size(2)).squeeze(2)
            z.append(_z)

        # Concat conv outputs
        z = torch.cat(z, 1)

        # FC layers
        z = self.fc1(z)
        z = self.dropout(z)
        z = self.fc2(z)
        return z

Using GloVe

We'll create some utility functions to load the pretrained GloVe embeddings into our Embedding layer.

def load_glove_embeddings(embeddings_file):
    """Load embeddings from a file."""
    embeddings = {}
    with open(embeddings_file, "r") as fp:
        for index, line in enumerate(fp):
            values = line.split()
            word = values[0]
            embedding = np.asarray(values[1:], dtype='float32')
            embeddings[word] = embedding
    return embeddings

def make_embeddings_matrix(embeddings, word_index, embedding_dim):
    """Create embeddings matrix to use in Embedding layer."""
    embedding_matrix = np.zeros((len(word_index), embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

# Create embeddings
embeddings_file = 'glove.6B.{0}d.txt'.format(EMBEDDING_DIM)
glove_embeddings = load_glove_embeddings(embeddings_file=embeddings_file)
embedding_matrix = make_embeddings_matrix(
    embeddings=glove_embeddings, word_index=tokenizer.token_to_index,
    embedding_dim=EMBEDDING_DIM)
print (f"<Embeddings(words={embedding_matrix.shape[0]}, dim={embedding_matrix.shape[1]})>")

<Embeddings(words=5000, dim=100)>

Experiments

We first have to decide whether to use pretrained embeddings or randomly initialized ones. Then we can choose to freeze our embeddings or continue to train them on the supervised data (which could lead to overfitting). Here are the three experiments we're going to conduct:

  • Randomly initialized embeddings (fine-tuned)
  • GloVe embeddings (frozen)
  • GloVe embeddings (fine-tuned)

import json
from sklearn.metrics import precision_recall_fscore_support
from torch.optim import Adam

NUM_FILTERS = 50
LEARNING_RATE = 1e-3
PATIENCE = 5
NUM_EPOCHS = 10
class Trainer(object):
    def __init__(self, model, device, loss_fn=None, optimizer=None, scheduler=None):
        # Set params
        self.model = model
        self.device = device
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        self.scheduler = scheduler

    def train_step(self, dataloader):
        """Train step."""
        # Set model to train mode
        self.model.train()
        loss = 0.0

        # Iterate over train batches
        for i, batch in enumerate(dataloader):
            # Step
            batch = [item.to(self.device) for item in batch]  # Set device
            inputs, targets = batch[:-1], batch[-1]
            self.optimizer.zero_grad()  # Reset gradients
            z = self.model(inputs)  # Forward pass
            J = self.loss_fn(z, targets)  # Define loss
            J.backward()  # Backward pass
            self.optimizer.step()  # Update weights

            # Cumulative Metrics
            loss += (J.detach().item() - loss) / (i + 1)

        return loss

    def eval_step(self, dataloader):
        """Validation or test step."""
        # Set model to eval mode
        self.model.eval()
        loss = 0.0
        y_trues, y_probs = [], []

        # Iterate over val batches
        with torch.inference_mode():
            for i, batch in enumerate(dataloader):
                # Step
                batch = [item.to(self.device) for item in batch]  # Set device
                inputs, y_true = batch[:-1], batch[-1]
                z = self.model(inputs)  # Forward pass
                J = self.loss_fn(z, y_true).item()

                # Cumulative Metrics
                loss += (J - loss) / (i + 1)

                # Store outputs
                y_prob = F.softmax(z).cpu().numpy()
                y_probs.extend(y_prob)
                y_trues.extend(y_true.cpu().numpy())

        return loss, np.vstack(y_trues), np.vstack(y_probs)

    def predict_step(self, dataloader):
        """Prediction step."""
        # Set model to eval mode
        self.model.eval()
        y_probs = []

        # Iterate over batches
        with torch.inference_mode():
            for i, batch in enumerate(dataloader):
                # Forward pass w/ inputs
                inputs, targets = batch[:-1], batch[-1]
                z = self.model(inputs)

                # Store outputs
                y_prob = F.softmax(z).cpu().numpy()
                y_probs.extend(y_prob)

        return np.vstack(y_probs)

    def train(self, num_epochs, patience, train_dataloader, val_dataloader):
        best_val_loss = np.inf
        for epoch in range(num_epochs):
            # Steps
            train_loss = self.train_step(dataloader=train_dataloader)
            val_loss, _, _ = self.eval_step(dataloader=val_dataloader)
            self.scheduler.step(val_loss)

            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model = self.model
                _patience = patience  # reset _patience
            else:
                _patience -= 1
            if not _patience:  # 0
                print("Stopping early!")
                break

            # Logging
            print(
                f"Epoch: {epoch+1} | "
                f"train_loss: {train_loss:.5f}, "
                f"val_loss: {val_loss:.5f}, "
                f"lr: {self.optimizer.param_groups[0]['lr']:.2E}, "
                f"_patience: {_patience}"
            )
        return best_model
def get_metrics(y_true, y_pred, classes):
    """Per-class performance metrics."""
    # Performance
    performance = {"overall": {}, "class": {}}

    # Overall performance
    metrics = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    performance["overall"]["precision"] = metrics[0]
    performance["overall"]["recall"] = metrics[1]
    performance["overall"]["f1"] = metrics[2]
    performance["overall"]["num_samples"] = np.float64(len(y_true))

    # Per-class performance
    metrics = precision_recall_fscore_support(y_true, y_pred, average=None)
    for i in range(len(classes)):
        performance["class"][classes[i]] = {
            "precision": metrics[0][i],
            "recall": metrics[1][i],
            "f1": metrics[2][i],
            "num_samples": np.float64(metrics[3][i]),
        }

    return performance

Random Initialization

PRETRAINED_EMBEDDINGS = None
FREEZE_EMBEDDINGS = False

# Initialize model
model = CNN(
    embedding_dim=EMBEDDING_DIM, vocab_size=VOCAB_SIZE,
    num_filters=NUM_FILTERS, filter_sizes=FILTER_SIZES,
    hidden_dim=HIDDEN_DIM, dropout_p=DROPOUT_P, num_classes=NUM_CLASSES,
    pretrained_embeddings=PRETRAINED_EMBEDDINGS, freeze_embeddings=FREEZE_EMBEDDINGS)
model = model.to(device) # set device
print (model.named_parameters)

<bound method Module.named_parameters of CNN(
  (embeddings): Embedding(5000, 100, padding_idx=0)
  (conv): ModuleList(
    (0): Conv1d(100, 50, kernel_size=(1,), stride=(1,))
    (1): Conv1d(100, 50, kernel_size=(2,), stride=(1,))
    (2): Conv1d(100, 50, kernel_size=(3,), stride=(1,))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=150, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)>

# Define Loss
class_weights_tensor = torch.Tensor(list(class_weights.values())).to(device)
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)

# Define optimizer & scheduler
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

# Trainer module
trainer = Trainer(
    model=model, device=device, loss_fn=loss_fn,
    optimizer=optimizer, scheduler=scheduler)

# Train
best_model = trainer.train(
    NUM_EPOCHS, PATIENCE, train_dataloader, val_dataloader)

Epoch: 1 | train_loss: 0.77038, val_loss: 0.59683, lr: 1.00E-03, _patience: 3
Epoch: 2 | train_loss: 0.49571, val_loss: 0.54363, lr: 1.00E-03, _patience: 3
Epoch: 3 | train_loss: 0.40796, val_loss: 0.54551, lr: 1.00E-03, _patience: 2
Epoch: 4 | train_loss: 0.34797, val_loss: 0.57950, lr: 1.00E-03, _patience: 1
Stopping early!

# Get predictions
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.argmax(y_prob, axis=1)

# Determine performance
performance = get_metrics(
    y_true=y_test, y_pred=y_pred, classes=label_encoder.classes)
print (json.dumps(performance["overall"], indent=2))

{
  "precision": 0.8070310520771562,
  "recall": 0.7999444444444445,
  "f1": 0.8012357147662316,
  "num_samples": 18000.0
}

GloVe (frozen)

PRETRAINED_EMBEDDINGS = embedding_matrix
FREEZE_EMBEDDINGS = True

# Initialize model
model = CNN(
    embedding_dim=EMBEDDING_DIM, vocab_size=VOCAB_SIZE,
    num_filters=NUM_FILTERS, filter_sizes=FILTER_SIZES,
    hidden_dim=HIDDEN_DIM, dropout_p=DROPOUT_P, num_classes=NUM_CLASSES,
    pretrained_embeddings=PRETRAINED_EMBEDDINGS, freeze_embeddings=FREEZE_EMBEDDINGS)
model = model.to(device) # set device
print (model.named_parameters)

<bound method Module.named_parameters of CNN(
  (embeddings): Embedding(5000, 100, padding_idx=0)
  (conv): ModuleList(
    (0): Conv1d(100, 50, kernel_size=(1,), stride=(1,))
    (1): Conv1d(100, 50, kernel_size=(2,), stride=(1,))
    (2): Conv1d(100, 50, kernel_size=(3,), stride=(1,))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=150, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)>

# Define Loss
class_weights_tensor = torch.Tensor(list(class_weights.values())).to(device)
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)

# Define optimizer & scheduler
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

# Trainer module
trainer = Trainer(
    model=model, device=device, loss_fn=loss_fn,
    optimizer=optimizer, scheduler=scheduler)

# Train
best_model = trainer.train(
    NUM_EPOCHS, PATIENCE, train_dataloader, val_dataloader)

Epoch: 1 | train_loss: 0.51510, val_loss: 0.47643, lr: 1.00E-03, _patience: 3
Epoch: 2 | train_loss: 0.44220, val_loss: 0.46124, lr: 1.00E-03, _patience: 3
Epoch: 3 | train_loss: 0.41204, val_loss: 0.46231, lr: 1.00E-03, _patience: 2
Epoch: 4 | train_loss: 0.38733, val_loss: 0.46606, lr: 1.00E-03, _patience: 1
Stopping early!

# Get predictions
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.argmax(y_prob, axis=1)

# Determine performance
performance = get_metrics(
    y_true=y_test, y_pred=y_pred, classes=label_encoder.classes)
print (json.dumps(performance["overall"], indent=2))

{
  "precision": 0.8304874226557859,
  "recall": 0.8281111111111111,
  "f1": 0.828556487688813,
  "num_samples": 18000.0
}

GloVe (fine-tuned)

PRETRAINED_EMBEDDINGS = embedding_matrix
FREEZE_EMBEDDINGS = False

# Initialize model
model = CNN(
    embedding_dim=EMBEDDING_DIM, vocab_size=VOCAB_SIZE,
    num_filters=NUM_FILTERS, filter_sizes=FILTER_SIZES,
    hidden_dim=HIDDEN_DIM, dropout_p=DROPOUT_P, num_classes=NUM_CLASSES,
    pretrained_embeddings=PRETRAINED_EMBEDDINGS, freeze_embeddings=FREEZE_EMBEDDINGS)
model = model.to(device) # set device
print (model.named_parameters)

<bound method Module.named_parameters of CNN(
  (embeddings): Embedding(5000, 100, padding_idx=0)
  (conv): ModuleList(
    (0): Conv1d(100, 50, kernel_size=(1,), stride=(1,))
    (1): Conv1d(100, 50, kernel_size=(2,), stride=(1,))
    (2): Conv1d(100, 50, kernel_size=(3,), stride=(1,))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=150, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)>

# Define Loss
class_weights_tensor = torch.Tensor(list(class_weights.values())).to(device)
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)

# Define optimizer & scheduler
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

# Trainer module
trainer = Trainer(
    model=model, device=device, loss_fn=loss_fn,
    optimizer=optimizer, scheduler=scheduler)

# Train
best_model = trainer.train(
    NUM_EPOCHS, PATIENCE, train_dataloader, val_dataloader)

Epoch: 1 | train_loss: 0.48908, val_loss: 0.44320, lr: 1.00E-03, _patience: 3
Epoch: 2 | train_loss: 0.38986, val_loss: 0.43616, lr: 1.00E-03, _patience: 3
Epoch: 3 | train_loss: 0.34403, val_loss: 0.45240, lr: 1.00E-03, _patience: 2
Epoch: 4 | train_loss: 0.30224, val_loss: 0.49063, lr: 1.00E-03, _patience: 1
Stopping early!

# Get predictions
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.argmax(y_prob, axis=1)

# Determine performance
performance = get_metrics(
    y_true=y_test, y_pred=y_pred, classes=label_encoder.classes)
print (json.dumps(performance["overall"], indent=2))

{
  "precision": 0.8297157849772082,
  "recall": 0.8263333333333334,
  "f1": 0.8266579939871359,
  "num_samples": 18000.0
}
# Save artifacts
from pathlib import Path
dir = Path("cnn")
dir.mkdir(parents=True, exist_ok=True)
label_encoder.save(fp=Path(dir, "label_encoder.json"))
tokenizer.save(fp=Path(dir, "tokenizer.json"))
torch.save(best_model.state_dict(), Path(dir, "model.pt"))
with open(Path(dir, "performance.json"), "w") as fp:
    json.dump(performance, indent=2, sort_keys=False, fp=fp)

Inference

def get_probability_distribution(y_prob, classes):
    """Create a dict of class probabilities from an array."""
    results = {}
    for i, class_ in enumerate(classes):
        results[class_] = np.float64(y_prob[i])
    sorted_results = {k: v for k, v in sorted(
        results.items(), key=lambda item: item[1], reverse=True)}
    return sorted_results
# Load artifacts
device = torch.device("cpu")
label_encoder = LabelEncoder.load(fp=Path(dir, "label_encoder.json"))
tokenizer = Tokenizer.load(fp=Path(dir, "tokenizer.json"))
model = CNN(
    embedding_dim=EMBEDDING_DIM, vocab_size=VOCAB_SIZE,
    num_filters=NUM_FILTERS, filter_sizes=FILTER_SIZES,
    hidden_dim=HIDDEN_DIM, dropout_p=DROPOUT_P, num_classes=NUM_CLASSES,
    pretrained_embeddings=PRETRAINED_EMBEDDINGS, freeze_embeddings=FREEZE_EMBEDDINGS)
model.load_state_dict(torch.load(Path(dir, "model.pt"), map_location=device))
model.to(device)

CNN(
  (embeddings): Embedding(5000, 100, padding_idx=0)
  (conv): ModuleList(
    (0): Conv1d(100, 50, kernel_size=(1,), stride=(1,))
    (1): Conv1d(100, 50, kernel_size=(2,), stride=(1,))
    (2): Conv1d(100, 50, kernel_size=(3,), stride=(1,))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=150, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)

# Initialize trainer
trainer = Trainer(model=model, device=device)

# Dataloader
text = "The final tennis tournament starts next week."
X = tokenizer.texts_to_sequences([preprocess(text)])
print (tokenizer.sequences_to_texts(X))
y_filler = label_encoder.encode([label_encoder.classes[0]]*len(X))
dataset = Dataset(X=X, y=y_filler, max_filter_size=max_filter_size)
dataloader = dataset.create_dataloader(batch_size=batch_size)

['final tennis tournament starts next week']
# Inference
y_prob = trainer.predict_step(dataloader)
y_pred = np.argmax(y_prob, axis=1)
label_encoder.decode(y_pred)

['Sports']
# Class distributions
prob_dist = get_probability_distribution(y_prob=y_prob[0], classes=label_encoder.classes)
print (json.dumps(prob_dist, indent=2))

{
  "Sports": 0.9999998807907104,
  "World": 6.336378532978415e-08,
  "Sci/Tech": 2.107449992294619e-09,
  "Business": 3.706519813295728e-10
}

Interpretability

We went through all the trouble of padding our inputs before convolution so that the outputs have the same shape as our inputs, and we can try to get some interpretability out of this. Since every token is mapped to a convolutional output on which we apply max pooling, we can see which token's output contributed the most towards the prediction. We first need to get the conv outputs from our model:

import collections
import seaborn as sns

class InterpretableCNN(nn.Module):
    def __init__(self, embedding_dim, vocab_size, num_filters,
                 filter_sizes, hidden_dim, dropout_p, num_classes,
                 pretrained_embeddings=None, freeze_embeddings=False,
                 padding_idx=0):
        super(InterpretableCNN, self).__init__()

        # Filter sizes
        self.filter_sizes = filter_sizes

        # Initialize embeddings
        if pretrained_embeddings is None:
            self.embeddings = nn.Embedding(
                embedding_dim=embedding_dim, num_embeddings=vocab_size,
                padding_idx=padding_idx)
        else:
            pretrained_embeddings = torch.from_numpy(pretrained_embeddings).float()
            self.embeddings = nn.Embedding(
                embedding_dim=embedding_dim, num_embeddings=vocab_size,
                padding_idx=padding_idx, _weight=pretrained_embeddings)

        # Freeze embeddings or not
        if freeze_embeddings:
            self.embeddings.weight.requires_grad = False

        # Conv weights
        self.conv = nn.ModuleList(
            [nn.Conv1d(in_channels=embedding_dim,
                       out_channels=num_filters,
                       kernel_size=f) for f in filter_sizes])

        # FC weights
        self.dropout = nn.Dropout(dropout_p)
        self.fc1 = nn.Linear(num_filters*len(filter_sizes), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, inputs, channel_first=False):
        # Embed
        x_in, = inputs
        x_in = self.embeddings(x_in)

        # Rearrange input so num_channels is in dim 1 (N, C, L)
        if not channel_first:
            x_in = x_in.transpose(1, 2)

        # Conv outputs (keep the full feature maps; no pooling here)
        z = []
        max_seq_len = x_in.shape[2]
        for i, f in enumerate(self.filter_sizes):
            # `SAME` padding
            padding_left = int((self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2)
            padding_right = int(math.ceil((self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2))
            # Conv
            _z = self.conv[i](F.pad(x_in, (padding_left, padding_right)))
            z.append(_z.cpu().numpy())
        return z
PRETRAINED_EMBEDDINGS = embedding_matrix
FREEZE_EMBEDDINGS = False

# Initialize model
interpretable_model = InterpretableCNN(
    embedding_dim=EMBEDDING_DIM, vocab_size=VOCAB_SIZE,
    num_filters=NUM_FILTERS, filter_sizes=FILTER_SIZES,
    hidden_dim=HIDDEN_DIM, dropout_p=DROPOUT_P, num_classes=NUM_CLASSES,
    pretrained_embeddings=PRETRAINED_EMBEDDINGS, freeze_embeddings=FREEZE_EMBEDDINGS)
interpretable_model.load_state_dict(torch.load(Path(dir, "model.pt"), map_location=device))
interpretable_model.to(device)

InterpretableCNN(
  (embeddings): Embedding(5000, 100, padding_idx=0)
  (conv): ModuleList(
    (0): Conv1d(100, 50, kernel_size=(1,), stride=(1,))
    (1): Conv1d(100, 50, kernel_size=(2,), stride=(1,))
    (2): Conv1d(100, 50, kernel_size=(3,), stride=(1,))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=150, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=4, bias=True)
)
# Get conv outputs
interpretable_model.eval()
conv_outputs = []
with torch.inference_mode():
    for i, batch in enumerate(dataloader):
        # Forward pass w/ inputs
        inputs, targets = batch[:-1], batch[-1]
        z = interpretable_model(inputs)
        # Store conv outputs
        conv_outputs.extend(z)
conv_outputs = np.vstack(conv_outputs)
print (conv_outputs.shape) # (len(filter_sizes), num_filters, max_seq_len)

(3, 50, 6)
# Visualize a bi-gram filter's outputs
tokens = tokenizer.sequences_to_texts(X)[0].split(" ")
filter_size = 2
sns.heatmap(conv_outputs[filter_size-1][:, :len(tokens)], xticklabels=tokens)

1D global max pooling extracts the highest value from each of our num_filters for each filter_size. We could also follow this same approach to determine which n-gram is most relevant, but notice in the heatmap above that many filters don't show much variance. To mitigate this, the referenced paper uses threshold values to determine which filters to use for interpretability. To keep things simple here, let's instead extract which tokens' filter outputs were most frequently selected by max pooling.

sample_index = 0
print (f"Original text:\n{text}")
print (f"\nPreprocessed text:\n{tokenizer.sequences_to_texts(X)[0]}")
print ("\nMost important n-grams:")
# Process conv outputs for each unique filter size
for i, filter_size in enumerate(FILTER_SIZES):
    # Identify most important n-gram (excluding last token)
    popular_indices = collections.Counter([np.argmax(conv_output) \
        for conv_output in conv_outputs[i]])
    # Get corresponding text
    start = popular_indices.most_common(1)[-1][0]
    n_gram = " ".join([token for token in tokens[start:start+filter_size]])
    print (f"[{filter_size}-gram]: {n_gram}")
Original text:
The final tennis tournament starts next week.

Preprocessed text:
final tennis tournament starts next week

Most important n-grams:
[1-gram]: tennis
[2-gram]: tennis tournament
[3-gram]: final tennis tournament