
Natural Language Processing Lab

Lab 1:

        Surname Classification with Feedforward Neural Networks

1. Introduction

Overview:

        We introduce the basics of neural networks by looking at the perceptron, the simplest neural network in existence. A historic shortcoming of the perceptron is that it cannot learn certain important patterns present in data. For example, look at the data points plotted in the figure below. This is the exclusive-or (XOR) situation, in which the decision boundary cannot be a single straight line (that is, the data is not linearly separable). In this case, the perceptron fails.

        In this lab, we explore the family of neural network models traditionally called feedforward networks, focusing on two kinds: the multilayer perceptron and the convolutional neural network. The multilayer perceptron structurally extends the simple perceptron by grouping many perceptrons in a single layer and stacking multiple layers together. The convolutional neural network is deeply inspired by windowed filters from digital signal processing. Through this windowing property, convolutional neural networks can learn localized patterns in their inputs, which has made them not only the workhorse of computer vision but also an ideal candidate for detecting substructures in sequential data such as words and sentences. In this lab, multilayer perceptrons and convolutional neural networks are grouped together because they are both feedforward networks, in contrast with a different family of neural networks, recurrent neural networks (RNNs), which allow feedback (or cycles) so that each computation can be informed by previous computations.

Environment:

        Python 3.6.7

2. The Multilayer Perceptron (MLP)

        The multilayer perceptron (MLP) is considered one of the most basic building blocks of neural networks. A perceptron takes a data vector as input and computes a single output value. In an MLP, many perceptrons are grouped so that the output of a single layer is a new vector rather than a single output value. In PyTorch, as you will see later, this is done simply by setting the number of output features in the Linear layer. An additional aspect of the MLP is that it combines multiple layers with a nonlinearity between each pair of layers. The simplest MLP, shown in the figure, is composed of three stages of representation and two Linear layers. The first stage is the input vector, the vector that is given to the model. Given the input vector, the first Linear layer computes a hidden vector, the second stage of representation. The hidden vector is called that because it is the output of a layer that sits between the input and the output. What do we mean by "the output of a layer"? One way to understand this is that the values in the hidden vector are the outputs of the different perceptrons that make up that layer. Using this hidden vector, the second Linear layer computes an output vector. In a binary task such as classifying the sentiment of Yelp reviews, the output vector could still be of size 1. The power of MLPs comes from adding the second Linear layer and allowing the model to learn an intermediate representation that is linearly separable, i.e. one in which a single straight line (or, more generally, a hyperplane) can be used to distinguish which side of the line (or hyperplane) each data point falls on. Learning intermediate representations with specific properties, such as being linearly separable for a classification task, is one of the most profound consequences of using neural networks and is at the core of their modeling capabilities.
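        Written as equations (notation mine, matching the two Linear layers just described), the simplest MLP computes

        h = f(W_1 x + b_1)
        y = W_2 h + b_2

        where x is the input vector, h the hidden vector, y the output vector, W_1, b_1 and W_2, b_2 are the weights and biases of the two Linear layers, and f is a nonlinearity such as ReLU.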

2.1.1 A Simple Example: XOR

        In this example, we train both a perceptron and an MLP on a binary classification task: telling stars from circles. Each data point is a 2D coordinate. Without diving into the implementation details, the final model predictions are shown in the figure below. In that figure, incorrectly classified data points are filled in with black, whereas correctly classified data points are left unfilled. In the left panel, you can see from the filled shapes that the perceptron has difficulty learning a decision boundary that separates the stars from the circles. The MLP (right panel), however, learns a decision boundary that classifies the stars and circles far more accurately.

        In the figure, the true class of each data point is its shape: star or circle. Incorrect classifications are filled in and correct classifications are left unfilled; the lines are the decision boundaries of each model. In the left panel, the perceptron learns a decision boundary that cannot correctly separate the circles from the stars; in fact, no single line can. In the right panel, the MLP has learned to separate the stars from the circles.

        Although the figure makes it look as if the MLP has two decision boundaries, and that is indeed its advantage, it actually has only one decision boundary! It merely appears that way because the intermediate representation has morphed the space so that a single hyperplane shows up in both of those positions. In the next figure we can see the intermediate values computed by the MLP. The shape of each point indicates its class (star or circle). What we see is that the neural network (an MLP in this case) has learned to "warp" the space in which the data lives, so that by the time the data passes through the final layer, a single line can separate the two classes.

        In contrast, as the following figure shows, the perceptron does not have an extra layer that can massage the shape of the data until it becomes linearly separable.
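        The figures themselves are not reproduced in this post, but the experiment is easy to recreate. Below is a minimal sketch (my own illustration, not the course's plotting code) that builds a toy XOR-style dataset and trains a two-layer network with the same Linear-ReLU-Linear structure as Example 2-1; the data generation and hyperparameters are assumptions made purely for demonstration.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Toy XOR-style data: the label is 1 when the two coordinates share the same sign,
# so no single straight line can separate the two classes.
x = torch.randn(1000, 2)
y = (x[:, 0] * x[:, 1] > 0).long()

# A two-layer MLP equivalent in structure to Example 2-1 (Linear -> ReLU -> Linear)
mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = optim.Adam(mlp.parameters(), lr=0.01)
for step in range(500):
    optimizer.zero_grad()
    loss = F.cross_entropy(mlp(x), y)   # logits go straight into the loss
    loss.backward()
    optimizer.step()
print("MLP training accuracy: {:.2f}".format(
    (mlp(x).argmax(dim=1) == y).float().mean().item()))      # close to 1.0

# A single Linear layer (a perceptron-style linear classifier) on the same data
# stays near chance level, illustrating the limitation described above.
linear = nn.Linear(2, 2)
optimizer = optim.Adam(linear.parameters(), lr=0.01)
for step in range(500):
    optimizer.zero_grad()
    loss = F.cross_entropy(linear(x), y)
    loss.backward()
    optimizer.step()
print("Linear training accuracy: {:.2f}".format(
    (linear(x).argmax(dim=1) == y).float().mean().item()))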

2.2 Implementing MLPs in PyTorch

        As described earlier, an MLP has an extra layer of computation compared with the simple perceptron. In Example 2-1 below, the Linear objects are named fc1 and fc2, following the common convention of calling a Linear module a "fully connected layer", or "fc layer" for short. In addition to these two Linear layers, there is a rectified linear unit (ReLU) nonlinearity, which is applied to the output of the first Linear layer before that output is fed into the second Linear layer. Because the layers are applied in sequence, the number of outputs of one layer must match the number of inputs of the next. Using a nonlinearity between the two Linear layers is essential: without it, two Linear layers are mathematically equivalent to a single Linear layer and therefore unable to model complex patterns. Our MLP implementation only implements the forward pass of backpropagation; PyTorch automatically figures out how to do the backward pass and the gradient updates from the definition of the model and the implementation of the forward pass.

Example 2-1: Multilayer Perceptron
import torch.nn as nn
import torch.nn.functional as F

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(MultilayerPerceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the MLP
        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        # Pass through the first fully connected layer and apply the ReLU activation
        intermediate = F.relu(self.fc1(x_in))
        # Pass through the second fully connected layer to get the model output
        output = self.fc2(intermediate)
        # If apply_softmax is True, apply the softmax function to the output
        if apply_softmax:
            output = F.softmax(output, dim=1)  # softmax over dim 1 (the class dimension)
        return output

        In Example 2-2, we instantiate the MLP. Because of the generality of the MLP implementation, it can model inputs of any size. To demonstrate, we use an input dimension of size 3, an output dimension of size 4, and a hidden dimension of size 100. Notice how, in the output of the print statement, the numbers of units in each layer line up nicely so that an input of dimension 3 produces an output of dimension 4.

Example 2-2: An example instantiation of an MLP
batch_size = 2  # number of samples input at once
input_dim = 3
hidden_dim = 100
output_dim = 4

# Initialize model
mlp = MultilayerPerceptron(input_dim, hidden_dim, output_dim)
print(mlp)

        We can quickly test the "wiring" of the model by passing some random inputs through it, as shown in Example 2-3. Because the model is not yet trained, the outputs are random. Doing this is a useful sanity check before spending time training a model. Notice how PyTorch's interactivity lets us do all of this on the fly during development, not so differently from working with NumPy or pandas:

Example 2-3: Testing the MLP with random inputs
import torch

def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

x_input = torch.rand(batch_size, input_dim)
describe(x_input)

Output:

        It is important to learn how to read the inputs and outputs of PyTorch models. In the preceding example, the output of the MLP model is a tensor with two rows and four columns. The rows of this tensor correspond to the batch dimension, i.e. the number of data points in the minibatch, and the columns are the final feature vector for each data point. In some settings, such as classification, the feature vector is a prediction vector. The name "prediction vector" means that it corresponds to a probability distribution. What happens with the prediction vector depends on whether we are currently doing training or performing inference. During training, the outputs are used as is, together with a loss function and a representation of the target class labels. We cover this in depth in "Example: Surname Classification with a Multilayer Perceptron".

        However, if you want to turn the prediction vector into probabilities, an extra step is required. Specifically, you need the softmax function, which is used to transform a vector of values into a probability distribution. The softmax function has many roots: in physics it is known as the Boltzmann or Gibbs distribution, in statistics it is multinomial logistic regression, and in the natural language processing (NLP) community it is known as the maximum-entropy (MaxEnt) classifier. Whatever the name, the intuition behind the function is that large positive values lead to higher probabilities and small negative values lead to lower probabilities. The apply_softmax argument of the forward pass applies this extra step. In Example 2-4, you can see the same output, but this time with the apply_softmax flag set to True:
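        For reference, the softmax of a vector z = (z_1, \dots, z_K) is defined as

        \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

        so each output lies in (0, 1) and the outputs sum to 1, which is exactly what lets us read the prediction vector as a probability distribution.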

Example 2-4: MLP with apply_softmax=True
y_output = mlp(x_input, apply_softmax=True)
describe(y_output)

Output:

        To summarize, MLPs are stacked Linear layers that map tensors to other tensors. Nonlinearities are placed between pairs of Linear layers to break up the linearity and allow the model to warp the vector space; in a classification setting, this warping should result in linear separability between classes. Additionally, you can use the softmax function to interpret MLP outputs as probabilities, but softmax should not be combined with the cross-entropy loss function, because the underlying implementation can exploit superior mathematical/computational shortcuts.
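        As a small illustration of that last point (my own sketch, not one of the course examples), nn.CrossEntropyLoss combines log-softmax and the negative log-likelihood internally, so it should be fed the raw model outputs (logits) rather than softmax-ed probabilities:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(2, 4)             # (batch, num_classes), raw MLP outputs
targets = torch.tensor([1, 3])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)        # correct: pass logits directly

# Equivalent by hand: log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss, manual))    # True

# Passing softmax-ed outputs instead would silently compute the wrong loss.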

3. Convolutional Neural Networks (CNN)

        Convolutional neural networks (CNNs) are particularly well suited to data with a grid-like structure, such as images. The design of CNNs is inspired by our understanding of biological vision, in particular how cells in the visual cortex of cats and monkeys respond to specific kinds of stimuli.

The main components of a CNN are:

3.1 Convolutional layers

        This is the core of a CNN. It uses a set of learnable filters (also called kernels) to detect local features in the input data. Each filter slides across the input and computes a dot product with it, producing a two-dimensional activation map, i.e. a feature map. By learning the weights of these filters, the CNN can recognize basic image features such as edges and textures.
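        As a quick illustration of the sliding-filter idea (a toy example of my own, not part of the lab code), a Conv2d layer in PyTorch slides each of its kernels over the input and produces one feature map per kernel:

import torch
import torch.nn as nn

# A batch of 1 grayscale "image": 1 input channel, 8x8 grid
image = torch.randn(1, 1, 8, 8)

# 4 learnable 3x3 filters -> 4 feature maps
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
feature_maps = conv(image)
print(feature_maps.shape)   # torch.Size([1, 4, 6, 6]): a 3x3 filter fits 6x6 positions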

3.2 Pooling layers

        Usually placed after a convolutional layer, a pooling layer reduces the spatial dimensions of the feature maps and the amount of computation while preserving the most important information. The most common method is max pooling, which outputs the maximum value within each region.

3.3 Fully connected layers

        At the end of the network, fully connected layers integrate the features extracted earlier to perform classification, regression, or other tasks. Each neuron in these layers is connected to every neuron in the previous layer, much like a traditional multilayer perceptron.

3.4 Activation functions

        Such as ReLU (Rectified Linear Unit), which introduce nonlinearity and help the network learn more complex patterns.

3.5 Dropout layers

        These randomly drop the outputs of a fraction of neurons to prevent overfitting.

A CNN automatically adjusts its parameters (weights and biases) with the backpropagation algorithm so as to minimize the loss function on the training set. Thanks to the strong performance it has shown in image recognition, natural language processing, and other fields, the CNN has become one of the most important and widely used models in deep learning.
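        Putting these components together, the following is a minimal sketch of how the pieces connect; the layer sizes and input shape are my own placeholder choices, not part of this lab:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(TinyCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),    # convolutional layer
            nn.ReLU(),                                    # activation function
            nn.MaxPool2d(kernel_size=2),                  # pooling: 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                  # 14x14 -> 7x7
        )
        self.dropout = nn.Dropout(p=0.5)                  # dropout against overfitting
        self.fc = nn.Linear(16 * 7 * 7, num_classes)      # fully connected classifier

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)        # flatten the feature maps
        x = self.dropout(x)
        return self.fc(x)                # logits; pair with CrossEntropyLoss

print(TinyCNN()(torch.randn(2, 1, 28, 28)).shape)   # torch.Size([2, 10])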

4. Experimental Procedure

4.1 MLP
4.1.1 The Surname Dataset

        The surname dataset collects 10,000 surnames from 18 different nationalities, which the authors gathered from various name sources on the internet. This dataset is reused in several examples in this course and has some properties that make it interesting. The first property is that it is quite imbalanced: the top three classes account for more than 60% of the data, with 27% English, 21% Russian, and 14% Arabic, and the frequencies of the remaining 15 nationalities fall off from there, which is itself a property characteristic of language. The second property is that there is a valid and intuitive relationship between nationality and surname orthography (spelling): some spelling variations are tied very closely to the country of origin (for example, "O'Neill", "Antonopoulos", "Nagasawa", or "Zhu").

        To create the final dataset, we start from a less-processed version than the one included in the course's supplementary material and perform several modifications. The first is aimed at reducing the imbalance: the original dataset is more than 70% Russian, possibly because of sampling bias or a greater variety of Russian surnames. To do this, we subsample the over-represented class by selecting a random subset of the surnames labeled Russian. Next, we group the dataset by nationality and split it into three parts, 70% into a training set, 15% into a validation set, and the final 15% into a test set, so that the class label distributions are comparable across the splits. A sketch of this preprocessing is given below.
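        The course ships the already-split CSV used later in Section 4.3, so the following is only a hypothetical sketch of what such preprocessing could look like; the raw file name, column names, and sampling fraction are assumptions:

import pandas as pd

raw_df = pd.read_csv("surnames.csv")    # assumed raw file with 'surname' and 'nationality' columns

# Subsample the over-represented Russian class (the kept fraction is an assumption)
russian = raw_df[raw_df.nationality == "Russian"].sample(frac=0.3, random_state=0)
others = raw_df[raw_df.nationality != "Russian"]
df = pd.concat([russian, others])

# 70/15/15 split within each nationality so label distributions stay comparable
final_rows = []
for _, group in df.groupby("nationality"):
    group = group.sample(frac=1.0, random_state=0)            # shuffle within the class
    n = len(group)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    splits = ["train"] * n_train + ["val"] * n_val + ["test"] * (n - n_train - n_val)
    final_rows.append(group.assign(split=splits))

final_df = pd.concat(final_rows)
final_df.to_csv("surnames_with_splits.csv", index=False)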

Example 2-5: Implementing SurnameDataset.__getitem__()
class SurnameDataset(Dataset):
    # Implementation is nearly identical to Section 3.5

    def __getitem__(self, index):
        # Get the row of the target dataframe at the given index
        row = self._target_df.iloc[index]
        # Vectorize the surname with the vectorizer's vectorize method
        surname_vector = self._vectorizer.vectorize(row.surname)
        # Look up the index (label) of the nationality in the nationality vocabulary
        nationality_index = self._vectorizer.nationality_vocab.lookup_token(row.nationality)
        # Return a dictionary holding the features and the label
        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}
4.1.2 The Vocabulary and the Vectorizer

        To classify surnames by their characters, we use a Vocabulary, a Vectorizer, and a DataLoader to transform surname strings into vectorized minibatches. These are the same data structures used in "Example: Classifying Sentiment of Restaurant Reviews"; they exemplify a polymorphism that treats the character tokens of surnames in the same way as the word tokens of Yelp reviews. Instead of vectorizing the data by mapping word tokens to integers, we vectorize it by mapping characters to integers.

Example 2-6: Implementing SurnameVectorizer and Vocabulary
import numpy as np


class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, surname_vocab, nationality_vocab):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot (np.ndarray): a collapsed one-hot encoding
        """
        vocab = self.surname_vocab
        one_hot = np.zeros(len(vocab), dtype=np.float32)
        for token in surname:
            one_hot[vocab.lookup_token(token)] = 1
        return one_hot

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe
        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        for index, row in surname_df.iterrows():
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)
        return cls(surname_vocab, nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab)

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}


class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""
    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}
        self._add_unk = add_unk
        self._unk_token = unk_token
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """Update mapping dicts based on the token.
        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token
        or the UNK index if token isn't present.
        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary)
            for the UNK functionality
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index
        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)
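        A quick, hypothetical usage check of the two classes above (the surname/nationality pairs are made up purely for illustration):

import pandas as pd

toy_df = pd.DataFrame({"surname": ["Zhu", "Nagasawa", "O'Neill"],
                       "nationality": ["Chinese", "Japanese", "Irish"]})

vectorizer = SurnameVectorizer.from_dataframe(toy_df)
one_hot = vectorizer.vectorize("Zhu")
print(one_hot.shape)    # (len(surname_vocab),): a collapsed one-hot vector
print(one_hot.sum())    # number of distinct characters in "Zhu" -> 3.0
print(vectorizer.nationality_vocab.lookup_token("Chinese"))   # an integer class label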
4.1.3 The Dataset
import json

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer

        self.train_df = self.surname_df[self.surname_df.split == 'train']
        self.train_size = len(self.train_df)
        self.val_df = self.surname_df[self.surname_df.split == 'val']
        self.validation_size = len(self.val_df)
        self.test_df = self.surname_df[self.surname_df.split == 'test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}
        self.set_split('train')

        # Class weights
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split == 'train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer.
        Used in the case the vectorizer has been cached for re-use
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point's:
                features (x_surname)
                label (y_nationality)
        """
        row = self._target_df.iloc[index]
        surname_vector = self._vectorizer.vectorize(row.surname)
        nationality_index = self._vectorizer.nationality_vocab.lookup_token(row.nationality)
        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size


def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It will
    ensure each tensor is on the right device location.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
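        With the dataset class in place, the batching pipeline can be sanity-checked with a few lines; the CSV path matches the args defined later in Section 4.3 and is otherwise an assumption about where the data lives:

dataset = SurnameDataset.load_dataset_and_make_vectorizer(
    "data/surnames/surnames_with_splits.csv")
dataset.set_split('train')

batch_generator = generate_batches(dataset, batch_size=64, device="cpu")
batch = next(batch_generator)
print(batch['x_surname'].shape)      # torch.Size([64, len(surname_vocab)])
print(batch['y_nationality'].shape)  # torch.Size([64])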
4.1.4 The SurnameClassifier Model

        The SurnameClassifier is an implementation of the MLP introduced earlier in this lab. The first Linear layer maps the input vectors to an intermediate vector, and a nonlinearity is applied to that vector. A second Linear layer maps the intermediate vector to the prediction vector. In the final step, the softmax operation is optionally applied to make sure the outputs sum to 1, i.e. can be read as "probabilities". The reason it is optional has to do with the mathematical formulation of the loss function we use, the cross-entropy loss, which we studied under "loss functions". Recall that cross-entropy loss is most desirable for multiclass classification, but computing the softmax during training is not only wasteful but also numerically unstable in many situations.

Example 2-7: The SurnameClassifier as an MLP
import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    """ A 2-layer Multilayer Perceptron for classifying surnames """
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier
        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        # Pass through the first fully connected layer and apply the ReLU activation
        intermediate_vector = F.relu(self.fc1(x_in))
        # Pass through the second fully connected layer to get the prediction vector
        prediction_vector = self.fc2(intermediate_vector)
        # If apply_softmax is True, apply the softmax function to the prediction vector
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        # Return the resulting prediction vector
        return prediction_vector
4.2 CNN

        In general, the goal of neural network design is to find a configuration of hyperparameters that can accomplish the task. We consider once more the now-familiar surname classification task introduced in "Example: Surname Classification with a Multilayer Perceptron", but we use CNNs instead of an MLP. We still need to apply a final Linear layer, which learns to create a prediction vector from the feature vector produced by a series of convolutional layers. This means that the goal is to determine a configuration of convolutional layers that results in the desired feature vector. All CNN applications are like this: there is an initial set of convolutional layers that extract a feature map, which then becomes input to downstream processing. In classification, the downstream processing is almost always the application of a Linear (fc) layer.

        The implementation in this course walks through the design decisions needed to construct a feature vector. We begin by constructing an artificial data tensor that mirrors the shape of the actual data. The data tensor is three-dimensional, which is the size of a minibatch of vectorized text data: if you use a one-hot vector for each character in a sequence of characters, the sequence of one-hot vectors is a matrix, and a minibatch of one-hot matrices is a three-dimensional tensor. In the terminology of convolutions, the size of each one-hot vector (usually the size of the vocabulary) is the number of "input channels" and the length of the character sequence is the "width". A shape walkthrough along these lines is sketched below.
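        That walkthrough can be reproduced in a few lines using the same stack of Conv1d layers that appears in Example 2-8 below; the concrete sizes here (batch of 2, a 77-character vocabulary, width 17, 256 channels) are placeholder assumptions chosen so that the final width collapses to 1:

import torch
import torch.nn as nn

batch_size = 2
vocab_size = 77           # number of input channels = size of each one-hot vector (assumed)
max_surname_length = 17   # "width" of the sequence (assumed; the real value comes from the data)

# Artificial minibatch of one-hot matrices: (batch, channels, width)
x = torch.randn(batch_size, vocab_size, max_surname_length)

convnet = nn.Sequential(
    nn.Conv1d(vocab_size, 256, kernel_size=3), nn.ELU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2), nn.ELU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2), nn.ELU(),
    nn.Conv1d(256, 256, kernel_size=3), nn.ELU(),
)

out = convnet(x)
print(out.shape)                 # torch.Size([2, 256, 1]) for a width of 17
print(out.squeeze(dim=2).shape)  # torch.Size([2, 256]): the feature vector fed to the fc layer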

4.2.1 The Vocabulary and the Vectorizer
import numpy as np


class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""
    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}
        self._add_unk = add_unk
        self._unk_token = unk_token
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """Update mapping dicts based on the token.
        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token
        or the UNK index if token isn't present.
        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary)
            for the UNK functionality
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index
        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)


class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, surname_vocab, nationality_vocab, max_surname_length):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
            max_surname_length (int): the length of the longest surname
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab
        self._max_surname_length = max_surname_length

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot_matrix (np.ndarray): a matrix of one-hot vectors
        """
        one_hot_matrix_size = (len(self.surname_vocab), self._max_surname_length)
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)
        for position_index, character in enumerate(surname):
            character_index = self.surname_vocab.lookup_token(character)
            one_hot_matrix[character_index][position_index] = 1
        return one_hot_matrix

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe
        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        max_surname_length = 0
        for index, row in surname_df.iterrows():
            max_surname_length = max(max_surname_length, len(row.surname))
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)
        return cls(surname_vocab, nationality_vocab, max_surname_length)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab,
                   max_surname_length=contents['max_surname_length'])

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable(),
                'max_surname_length': self._max_surname_length}
4.2.2 The Dataset
import json

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer

        self.train_df = self.surname_df[self.surname_df.split == 'train']
        self.train_size = len(self.train_df)
        self.val_df = self.surname_df[self.surname_df.split == 'val']
        self.validation_size = len(self.val_df)
        self.test_df = self.surname_df[self.surname_df.split == 'test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}
        self.set_split('train')

        # Class weights
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split == 'train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer.
        Used in the case the vectorizer has been cached for re-use
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point's features (x_surname) and label (y_nationality)
        """
        row = self._target_df.iloc[index]
        surname_matrix = self._vectorizer.vectorize(row.surname)
        nationality_index = self._vectorizer.nationality_vocab.lookup_token(row.nationality)
        return {'x_surname': surname_matrix,
                'y_nationality': nationality_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size


def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It will
    ensure each tensor is on the right device location.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
Example 2-8: The SurnameClassifier as a CNN
import torch.nn as nn
import torch.nn.functional as F

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
        """
        super(SurnameClassifier, self).__init__()
        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels,
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3),
            nn.ELU()
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier
        Args:
            x_surname (torch.Tensor): an input data tensor.
                x_surname.shape should be (batch, initial_num_channels, max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
        prediction_vector = self.fc(features)
        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)
        return prediction_vector
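        As with the MLP, the wiring of the CNN classifier can be sanity-checked with a random input before training; the sizes here are placeholder assumptions (18 matches the number of nationalities in the dataset):

import torch

vocab_size, max_surname_length, num_nationalities = 77, 17, 18

classifier = SurnameClassifier(initial_num_channels=vocab_size,
                               num_classes=num_nationalities,
                               num_channels=256)

x_surname = torch.rand(4, vocab_size, max_surname_length)   # a fake minibatch of one-hot matrices
y_pred = classifier(x_surname, apply_softmax=True)
print(y_pred.shape)        # torch.Size([4, 18])
print(y_pred.sum(dim=1))   # each row sums to 1 because apply_softmax=True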
4.3 Model Training
import os
from argparse import Namespace

import numpy as np
import torch


def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}

def update_train_state(args, model, train_state):
    """Handle the training state updates.
    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better

    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """
    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False
    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]
        # If loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # Loss decreased
        else:
            # Save the best model
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
                # Track the new best validation loss
                train_state['early_stopping_best_val'] = loss_t
            # Reset early stopping step
            train_state['early_stopping_step'] = 0
        # Stop early ?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria
    return train_state

def compute_accuracy(y_pred, y_target):
    y_pred_indices = y_pred.max(dim=1)[1]
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

args = Namespace(
    # Data and Path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/cnn",
    # Model hyper parameters
    hidden_dim=100,
    num_channels=256,
    # Training hyper parameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    dropout_p=0.1,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
    catch_keyboard_interrupt=True
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)
    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))

# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# handle dirs
handle_dirs(args.save_dir)

Output:

import torch.optim as optim
from tqdm import tqdm_notebook

if args.reload_from_files:
    # If reload_from_files is True, load the dataset and vectorizer from existing files,
    # i.e. resume from a previously saved checkpoint
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # Otherwise create a fresh dataset and vectorizer
    # (the usual path when training the model for the first time)
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    # Save the newly created vectorizer so it can be reused later
    dataset.save_vectorizer(args.vectorizer_file)

# Get the dataset's vectorizer, which turns text into numeric tensors
vectorizer = dataset.get_vectorizer()

# Initialize the classifier, sizing its input and output from the vectorizer
classifier = SurnameClassifier(initial_num_channels=len(vectorizer.surname_vocab),
                               num_classes=len(vectorizer.nationality_vocab),
                               num_channels=args.num_channels)

# Move the model to the target device (CPU or GPU)
classifier = classifier.to(args.device)
# Move the dataset's class weights to the same device
dataset.class_weights = dataset.class_weights.to(args.device)

# Loss function: weighted cross-entropy, with weights taken from the dataset
loss_func = nn.CrossEntropyLoss(weight=dataset.class_weights)
# Optimizer: Adam with the learning rate given by args.learning_rate
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
# Scheduler: reduce the learning rate when the validation loss stops improving
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

# Training state dictionary that tracks key information such as epoch and losses
train_state = make_train_state(args)

# Progress bar for the overall training routine (one tick per epoch)
epoch_bar = tqdm_notebook(desc='training routine',
                          total=args.num_epochs,
                          position=0)

# Progress bar for the training split
dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',
                          total=dataset.get_num_batches(args.batch_size),
                          position=1,
                          leave=True)

# Progress bar for the validation split
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',
                        total=dataset.get_num_batches(args.batch_size),
                        position=1,
                        leave=True)

try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over the training dataset
        # Setup: generate the batches, set loss and accuracy to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # The training routine consists of these 5 steps:
            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()
            # step 2. compute the output
            y_pred = classifier(batch_dict['x_surname'])
            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            # update the running average of the loss
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # step 4. use the loss to produce gradients
            loss.backward()
            # step 5. use the optimizer to take a gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            # update the running average of the accuracy
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            # update the progress bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()

        # Record this epoch's training loss and accuracy
        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over the validation dataset
        # Setup: generate the batches, set loss and accuracy to 0, set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):
            # compute the output
            y_pred = classifier(batch_dict['x_surname'])
            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            # update the running average of the loss
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            # update the running average of the accuracy
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            # update the progress bar
            val_bar.set_postfix(loss=running_loss, acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()

        # Record this epoch's validation loss and accuracy
        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        # Update the training state and check whether training should stop early
        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        # Adjust the learning rate based on the latest validation loss
        scheduler.step(train_state['val_loss'][-1])

        # Break out of the loop if early stopping was triggered
        if train_state['stop_early']:
            break

        # Reset the batch progress bars
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")

Output:

4.4 Model Evaluation
# Load the best model parameters saved during training
classifier.load_state_dict(torch.load(train_state['model_filename']))
# Move the model to the target device (e.g. GPU)
classifier = classifier.to(args.device)
# Move the class weights to the same device (used to handle class imbalance)
dataset.class_weights = dataset.class_weights.to(args.device)
# Define the cross-entropy loss function with those weights
loss_func = nn.CrossEntropyLoss(dataset.class_weights)

# Switch the dataset to the test split
dataset.set_split('test')
# Generator that yields batches of the test set
batch_generator = generate_batches(dataset,
                                   batch_size=args.batch_size,
                                   device=args.device)
# Initialize the test loss and accuracy
running_loss = 0.
running_acc = 0.
# Put the model in eval mode
classifier.eval()

# Iterate over every batch of the test set
for batch_index, batch_dict in enumerate(batch_generator):
    # compute the output for the current batch
    y_pred = classifier(batch_dict['x_surname'])
    # compute the loss
    loss = loss_func(y_pred, batch_dict['y_nationality'])
    loss_t = loss.item()
    # update the running average of the loss
    running_loss += (loss_t - running_loss) / (batch_index + 1)
    # compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
    # update the running average of the accuracy
    running_acc += (acc_t - running_acc) / (batch_index + 1)

# Record the final test loss and accuracy in the training state
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

print("Test loss: {};".format(train_state['test_loss']))
print("Test Accuracy: {}".format(train_state['test_acc']))

Output:

4.5 Model Inference (Predicting Nationality)
def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality from a new surname
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(0)
    result = classifier(vectorized_surname, apply_softmax=True)

    probability_values, indices = result.max(dim=1)
    index = indices.item()

    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    return {'nationality': predicted_nationality, 'probability': probability_value}

new_surname = input("Enter a surname to classify: ")
classifier = classifier.cpu()
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))

Output:

def predict_topk_nationality(surname, classifier, vectorizer, k=5):
    """Predict the top K nationalities from a new surname
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        k (int): the number of top nationalities to return
    Returns:
        list of dictionaries, each dictionary is a nationality and a probability
    """
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(dim=0)
    prediction_vector = classifier(vectorized_surname, apply_softmax=True)
    probability_values, indices = torch.topk(prediction_vector, k=k)

    # returned size is 1,k
    probability_values = probability_values[0].detach().numpy()
    indices = indices[0].detach().numpy()

    results = []
    for kth_index in range(k):
        nationality = vectorizer.nationality_vocab.lookup_index(indices[kth_index])
        probability_value = probability_values[kth_index]
        results.append({'nationality': nationality,
                        'probability': probability_value})
    return results

new_surname = input("Enter a surname to classify: ")
k = int(input("How many of the top predictions to see? "))
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)

predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)

print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))
