
NLP: Surname Classification with an MLP and a CNN


1. Surname Classification with a Multilayer Perceptron (MLP)

1.1 How the Multilayer Perceptron (MLP) Works

1.1.1 Overview

The multilayer perceptron (MLP) is a type of artificial neural network (ANN). It is a fully connected, feed-forward model: neurons are organized into layers, and every neuron is connected to all of the neurons in the previous layer.

Input layer:

Receives the input data, usually a one-dimensional vector in which each input node corresponds to one feature.

Hidden layers:

One or more layers of neurons. Each neuron is connected to every neuron in the previous layer and computes a weighted sum of its inputs.
Each hidden neuron applies a nonlinear activation function (such as ReLU, sigmoid, or tanh) to introduce nonlinearity.

Output layer:

Produces the final prediction. The number of output neurons depends on the task, for example the number of classes in a classification problem.

Weights and biases:

Every connection carries a weight and every neuron has a bias term; these parameters are optimized during training by backpropagation.

Activation function:

Introduces nonlinearity; common choices are ReLU, sigmoid, and tanh.

Loss function:

Measures the gap between the model's predictions and the true labels. Common losses include mean squared error (MSE) and cross-entropy.

Backpropagation:

An algorithm that optimizes the weights and biases by gradient descent: it uses the gradient of the loss function to update the parameters so that the model's predictions on the training data become more accurate. (A minimal PyTorch sketch assembling these components follows this list.)
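These components fit together in just a few lines of PyTorch. The sketch below is illustrative only (the layer sizes are arbitrary and unrelated to the surname model later in this post): it wires an input layer, one hidden layer with a ReLU activation, and an output layer.

import torch
import torch.nn as nn

# A tiny MLP: input layer -> hidden layer (ReLU) -> output layer.
mlp = nn.Sequential(
    nn.Linear(in_features=4, out_features=8),   # weights + biases of the hidden layer
    nn.ReLU(),                                  # nonlinear activation
    nn.Linear(in_features=8, out_features=3),   # output layer, e.g. 3 classes
)

x = torch.rand(2, 4)        # a batch of 2 examples with 4 features each
logits = mlp(x)             # forward pass
print(logits.shape)         # torch.Size([2, 3])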

1.1.2 Workflow

Forward propagation:

The input passes layer by layer through weighted sums and activation functions until the output layer produces a prediction.

Compute the loss:

The loss value is computed from the prediction and the true label.

Backpropagation and parameter update:

Backpropagation computes the gradient of the loss with respect to the weights and biases, and an optimizer (such as gradient descent) uses these gradients to update the parameters. (A sketch of one full training step follows this list.)
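One training step chains these three stages together. The sketch below is illustrative only; the loss and optimizer are placeholders, not necessarily the choices used later in this post.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(16, 4)              # a batch of inputs
y = torch.randint(0, 3, (16,))     # integer class labels

optimizer.zero_grad()              # clear old gradients
logits = model(x)                  # 1. forward propagation
loss = loss_fn(logits, y)          # 2. compute the loss
loss.backward()                    # 3. backpropagation: compute gradients
optimizer.step()                   # 4. update weights and biases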

1.1.3 Strengths

Activation functions such as ReLU, sigmoid, and tanh introduce nonlinearity, so an MLP can capture and represent complex nonlinear relationships in the input. With multiple hidden layers, the network extracts and combines features layer by layer, each layer building a higher-level representation on top of the previous one. Together, the deep structure and the nonlinear activations give the MLP strong representational power: it can model complex decision boundaries and data patterns, including the kind of nonlinear, high-dimensional decision boundary that a single linear layer cannot separate.

1.2 Code Walkthrough

1.2.1 The Dataset

The surname dataset collects 10,000 surnames from 18 different nationalities, gathered by the authors from several name sources on the internet. The dataset has a couple of properties that make it interesting. The first is that it is quite imbalanced: the top three classes account for more than 60% of the data (27% English, 21% Russian, 14% Arabic), and the frequencies of the remaining 15 nationalities keep decreasing, which is itself a language-specific property. The second is that there is a valid and intuitive relationship between nationality and surname orthography (spelling): some spelling variants are very strongly tied to the country of origin (for example "O'Neill", "Antonopoulos", "Nagasawa", or "Zhu").

To create the final dataset, we started from a less-processed version than the one included in the supplementary material and performed several modifications. The first goal was to reduce the imbalance: the original data was more than 70% Russian, possibly due to sampling bias or simply to the large number of Russian surnames, so we subsampled this over-represented class by keeping a random subset of the surnames labeled Russian. Next, we grouped the dataset by nationality and split it into three parts, 70% for training, 15% for validation, and 15% for testing, so that the class-label distributions are comparable across the splits.

Dataset preprocessing:
import collections
import numpy as np
import pandas as pd
import re
from argparse import Namespace

args = Namespace(
    raw_dataset_csv="data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/surnames/surnames_with_splits.csv",
    seed=1337
)

# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0)
surnames.head()

# Unique classes
set(surnames.nationality)

# Splitting train by nationality
# Create dict
by_nationality = collections.defaultdict(list)
for _, row in surnames.iterrows():
    by_nationality[row.nationality].append(row.to_dict())

# Create split data
final_list = []
np.random.seed(args.seed)

# Iterate over the surnames grouped by nationality
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)                 # shuffle within each nationality
    n = len(item_list)                           # total number of items
    n_train = int(args.train_proportion * n)     # training set size
    n_val = int(args.val_proportion * n)         # validation set size
    n_test = int(args.test_proportion * n)       # test set size

    # Give each data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train + n_val]:
        item['split'] = 'val'
    for item in item_list[n_train + n_val:]:
        item['split'] = 'test'

    # Add to final list
    final_list.extend(item_list)

# Write split data to file
final_surnames = pd.DataFrame(final_list)
final_surnames.split.value_counts()
final_surnames.head()

# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)

(Output: the first five rows of the split data and the sizes of the train, validation, and test splits; not reproduced here.)

1.2.2 The Model

Importing the required libraries:
from argparse import Namespace
from collections import Counter
import json
import os
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook
Data vectorization classes (Vocabulary, Vectorizer, and Dataset):
class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}

        self._add_unk = add_unk
        self._unk_token = unk_token

        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary

        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token
          or the UNK index if token isn't present.

        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary)
              for the UNK functionality
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index

        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""

    def __init__(self, surname_vocab, nationality_vocab):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot (np.ndarray): a collapsed one-hot encoding
        """
        vocab = self.surname_vocab
        one_hot = np.zeros(len(vocab), dtype=np.float32)
        for token in surname:
            one_hot[vocab.lookup_token(token)] = 1

        return one_hot

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)

        for index, row in surname_df.iterrows():
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)

        return cls(surname_vocab, nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab)

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}
class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        self.surname_df = surname_df
        self._vectorizer = vectorizer

        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

        # Class weights (inverse class frequency, used to counter the imbalance)
        class_counts = surname_df.nationality.value_counts().to_dict()
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer.
        Used when the vectorizer has been cached for re-use.

        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """A static method for loading the vectorizer from file

        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """Saves the vectorizer to disk using json

        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """The primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point's features (x_surname) and label (y_nationality)
        """
        row = self._target_df.iloc[index]

        surname_vector = self._vectorizer.vectorize(row.surname)
        nationality_index = self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset

        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size


def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It will
    ensure each tensor is on the right device location.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict

Model definition:
class SurnameClassifier(nn.Module):
    """ A 2-layer Multilayer Perceptron for classifying surnames """

    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))
        prediction_vector = self.fc2(intermediate_vector)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
Helper functions:
def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}

def update_train_state(args, model, train_state):
    """Handle the training state updates.

    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better

    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """

    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # If loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # Loss decreased
        else:
            # Save the best model and remember the best validation loss
            # (without updating it, early stopping could never trigger)
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
                train_state['early_stopping_best_val'] = loss_t

            # Reset early stopping step
            train_state['early_stopping_step'] = 0

        # Stop early ?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria

    return train_state

def compute_accuracy(y_pred, y_target):
    _, y_pred_indices = y_pred.max(dim=1)
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

args = Namespace(
    # Data and path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/surname_mlp",
    # Model hyper parameters
    hidden_dim=300,
    # Training hyper parameters
    seed=1337,
    num_epochs=100,
    early_stopping_criteria=5,
    learning_rate=0.001,
    batch_size=64,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)
    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))

# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# handle dirs
handle_dirs(args.save_dir)
if args.reload_from_files:
    # training from a checkpoint
    print("Reloading!")
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    print("Creating fresh!")
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    dataset.save_vectorizer(args.vectorizer_file)

vectorizer = dataset.get_vectorizer()
classifier = SurnameClassifier(input_dim=len(vectorizer.surname_vocab),
                               hidden_dim=args.hidden_dim,
                               output_dim=len(vectorizer.nationality_vocab))

Training:
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)

loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

epoch_bar = tqdm_notebook(desc='training routine',
                          total=args.num_epochs,
                          position=0)

dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',
                          total=dataset.get_num_batches(args.batch_size),
                          position=1,
                          leave=True)
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',
                        total=dataset.get_num_batches(args.batch_size),
                        position=1,
                        leave=True)

try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over training dataset
        # setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:
            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()

            # step 2. compute the output
            y_pred = classifier(batch_dict['x_surname'])

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # step 4. use loss to produce gradients
            loss.backward()

            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # update bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over val dataset
        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):
            # compute the output
            y_pred = classifier(batch_dict['x_surname'])

            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.to("cpu").item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss, acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        if train_state['stop_early']:
            break

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

except KeyboardInterrupt:
    print("Exiting loop")

Evaluating loss and accuracy on the test set:

# compute the loss & accuracy on the test set using the best available model
classifier.load_state_dict(torch.load(train_state['model_filename']))
classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
loss_func = nn.CrossEntropyLoss(dataset.class_weights)

dataset.set_split('test')
batch_generator = generate_batches(dataset,
                                   batch_size=args.batch_size,
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()

for batch_index, batch_dict in enumerate(batch_generator):
    # compute the output
    y_pred = classifier(batch_dict['x_surname'])

    # compute the loss
    loss = loss_func(y_pred, batch_dict['y_nationality'])
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (batch_index + 1)

    # compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
    running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

print("Test loss: {};".format(train_state['test_loss']))
print("Test Accuracy: {}".format(train_state['test_acc']))

(Output: the model's test results, plus the training loss and accuracy; not reproduced here.)

Model inference:
def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality from a new surname

    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).view(1, -1)
    result = classifier(vectorized_surname, apply_softmax=True)

    probability_values, indices = result.max(dim=1)
    index = indices.item()

    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    return {'nationality': predicted_nationality, 'probability': probability_value}

new_surname = input("Enter a surname to classify: ")
classifier = classifier.to("cpu")
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))

(Example output not reproduced here.)

In NLP and related fields we often care not only about the single best prediction but also about the next-best candidates. A common practice is k-best prediction: take the top k predictions and re-rank them with another model. PyTorch provides a convenient function, torch.topk, for retrieving the k best predictions.
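As a quick illustration of what torch.topk returns (the probability values below are made up), it gives both the top-k values and their indices along the last dimension:

import torch

probs = torch.tensor([[0.05, 0.60, 0.10, 0.25]])   # shape (1, num_classes)
values, indices = torch.topk(probs, k=2)
print(values)    # tensor([[0.6000, 0.2500]])
print(indices)   # tensor([[1, 3]])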

vectorizer.nationality_vocab.lookup_index(8)

def predict_topk_nationality(name, classifier, vectorizer, k=5):
    vectorized_name = vectorizer.vectorize(name)
    vectorized_name = torch.tensor(vectorized_name).view(1, -1)
    prediction_vector = classifier(vectorized_name, apply_softmax=True)
    probability_values, indices = torch.topk(prediction_vector, k=k)

    # returned size is 1,k
    probability_values = probability_values.detach().numpy()[0]
    indices = indices.detach().numpy()[0]

    results = []
    for prob_value, index in zip(probability_values, indices):
        nationality = vectorizer.nationality_vocab.lookup_index(index)
        results.append({'nationality': nationality,
                        'probability': prob_value})

    return results

new_surname = input("Enter a surname to classify: ")
classifier = classifier.to("cpu")

k = int(input("How many of the top predictions to see? "))
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)

predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)

print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))

2. Surname Classification with a Convolutional Neural Network (CNN)

2.1 How Convolutional Neural Networks (CNNs) Work

A convolutional neural network (CNN) is a deep learning model widely used in image and video recognition, natural language processing, and other areas. Its defining feature is the use of convolutional layers and pooling layers to extract features, which reduces both the number of parameters and the computational cost.

2.1.1 Main Components and How They Work

Convolutional layer:

The input is convolved with a set of kernels (filters) to produce feature maps.
Each kernel slides over the input and extracts information from local regions; because the same kernel (with the same parameters) is reused at every position, the number of parameters is greatly reduced.

Activation function:

The most common activation is ReLU (Rectified Linear Unit), which sets negative values to zero and keeps positive values unchanged.
The activation introduces nonlinearity, allowing the network to learn more complex features.

Pooling layer:

Downsampling (max pooling or average pooling) reduces the size of the feature maps, which lowers the computational cost and the risk of overfitting (see the shape sketch after the kernel-size note below).
Max pooling takes the maximum value in each local region; average pooling takes the mean.

Fully connected layer:

The feature maps produced by the convolutional and pooling layers are flattened and fed into fully connected layers for classification or regression.
Fully connected layers work like a traditional neural network: every neuron is connected to all neurons in the previous layer.

Loss function and optimizer:

The loss function measures the gap between predictions and true values; common choices are cross-entropy loss and mean squared error.
The optimizer updates the model parameters; common choices are stochastic gradient descent (SGD) and Adam.

Regularization:

To prevent overfitting, common techniques include dropout and L2 regularization.

Channels

Informally, a channel is the feature dimension at each point of the input. In an image, every pixel has three channels, one per RGB component. A similar idea applies to text when using convolutions: conceptually, if the "pixels" of a text document are words, the number of channels is the size of the vocabulary; if we work at the finer granularity of characters, the number of channels is the size of the character set (which, in this example, is exactly the vocabulary). In PyTorch's convolution implementation, the number of input channels is the in_channels parameter. A convolution can also produce multiple output channels (out_channels); you can think of the convolution operator as "mapping" the input feature dimension to an output feature dimension.

Kernel size

The width of the kernel matrix is called the kernel size (kernel_size in PyTorch).
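The short sketch below (all sizes are made up for illustration) shows how in_channels, out_channels, and kernel_size determine tensor shapes for character-level text, and how a pooling layer shrinks the sequence dimension:

import torch
import torch.nn as nn

batch_size, vocab_size, seq_len = 2, 30, 10       # illustrative sizes
x = torch.rand(batch_size, vocab_size, seq_len)   # one-hot-like input: (batch, channels, length)

conv = nn.Conv1d(in_channels=vocab_size, out_channels=16, kernel_size=3)
pool = nn.MaxPool1d(kernel_size=2)

h = conv(x)     # (2, 16, 8): 30 input channels mapped to 16; length 10 -> 10 - 3 + 1 = 8
h = pool(h)     # (2, 16, 4): max pooling halves the sequence length
print(h.shape)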


2.1.2 Advantages of CNNs


Local receptive fields: convolution automatically extracts local features, which suits image-like data.
Parameter sharing: reusing the same kernel at every position greatly reduces the number of parameters and improves computational efficiency.
Robustness to translation: convolution is translation-equivariant (a shifted input produces a correspondingly shifted feature map), and combined with pooling the resulting features become largely insensitive to small translations of the input.


2.1.3 Typical Applications


Image classification: e.g. handwritten digit recognition (MNIST), object recognition (ImageNet).
Object detection: e.g. face detection, pedestrian detection.
Semantic segmentation: e.g. medical image segmentation.
Video analysis: e.g. action recognition.
Natural language processing: e.g. text classification, sentiment analysis.

2.1.4 How CNNs Differ from MLPs

A multilayer perceptron suits structured, low-dimensional data and passes information through fully connected layers, but it scales poorly to high-dimensional inputs. A convolutional neural network is designed for high-dimensional data such as images and video: its convolutional and pooling layers extract local features with far fewer parameters and usually better performance. Put simply, an MLP fits flat, tabular data, while a CNN excels on data with spatial or sequential structure.
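For the two surname models in this post, that difference shows up directly in the input representation: the MLP consumes a collapsed one-hot vector of length vocab_size, while the CNN consumes a one-hot matrix with one column per character position. A rough sketch with illustrative sizes:

import numpy as np

vocab_size, max_len = 77, 17   # illustrative; the real values come from the vectorizer

# MLP input: which characters occur; order and repetition are lost
mlp_input = np.zeros(vocab_size, dtype=np.float32)              # shape (77,)

# CNN input: one column per character position; order is preserved
cnn_input = np.zeros((vocab_size, max_len), dtype=np.float32)   # shape (77, 17)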

2.2 Code Walkthrough

2.2.1 The Model

Importing the required libraries:
from argparse import Namespace
from collections import Counter
import json
import os
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook
Vectorization classes:
The Vocabulary class is identical to the one defined in the MLP section above, so it is not repeated here.
class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""

    def __init__(self, surname_vocab, nationality_vocab, max_surname_length):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
            max_surname_length (int): the length of the longest surname
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab
        self._max_surname_length = max_surname_length

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname
        Returns:
            one_hot_matrix (np.ndarray): a matrix of one-hot vectors
        """
        one_hot_matrix_size = (len(self.surname_vocab), self._max_surname_length)
        one_hot_matrix = np.zeros(one_hot_matrix_size, dtype=np.float32)

        for position_index, character in enumerate(surname):
            character_index = self.surname_vocab.lookup_token(character)
            one_hot_matrix[character_index][position_index] = 1

        return one_hot_matrix

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        max_surname_length = 0

        for index, row in surname_df.iterrows():
            max_surname_length = max(max_surname_length, len(row.surname))
            for letter in row.surname:
                surname_vocab.add_token(letter)
            nationality_vocab.add_token(row.nationality)

        return cls(surname_vocab, nationality_vocab, max_surname_length)

    @classmethod
    def from_serializable(cls, contents):
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab = Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab,
                   max_surname_length=contents['max_surname_length'])

    def to_serializable(self):
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable(),
                'max_surname_length': self._max_surname_length}
The SurnameDataset class and the generate_batches helper are also unchanged from the MLP section; the only practical difference is that __getitem__ now returns the one-hot matrix produced by this vectorizer as x_surname.
Model definition:
class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
        """
        super(SurnameClassifier, self).__init__()

        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels,
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels,
                      kernel_size=3),
            nn.ELU()
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_surname (torch.Tensor): an input data tensor.
                x_surname.shape should be (batch, initial_num_channels, max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
        prediction_vector = self.fc(features)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
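As a quick sanity check of the shapes (the sizes below are illustrative: a vocabulary of 77 characters, 18 nationalities, num_channels=256, and a maximum surname length of 17), the convolutional stack shrinks the sequence dimension down to 1, which is what the squeeze(dim=2) in forward relies on. With a different maximum length the final sequence dimension may not be 1, and the squeeze would then leave a 3-D tensor.

x = torch.rand(4, 77, 17)     # (batch, initial_num_channels, max_surname_length)
model = SurnameClassifier(initial_num_channels=77, num_classes=18, num_channels=256)
print(model.convnet(x).shape)  # torch.Size([4, 256, 1]): length 17 -> 15 -> 7 -> 3 -> 1
print(model(x).shape)          # torch.Size([4, 18])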
Training setup:
The make_train_state and update_train_state helpers are the same as in the MLP section.
def compute_accuracy(y_pred, y_target):
    y_pred_indices = y_pred.max(dim=1)[1]
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

args = Namespace(
    # Data and Path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/cnn",
    # Model hyper parameters
    hidden_dim=100,
    num_channels=256,
    # Training hyper parameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    dropout_p=0.1,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
    catch_keyboard_interrupt=True
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)
    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))

# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# handle dirs
handle_dirs(args.save_dir)

if args.reload_from_files:
    # training from a checkpoint
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    dataset.save_vectorizer(args.vectorizer_file)

vectorizer = dataset.get_vectorizer()

classifier = SurnameClassifier(initial_num_channels=len(vectorizer.surname_vocab),
                               num_classes=len(vectorizer.nationality_vocab),
                               num_channels=args.num_channels)

classifier = classifier.to(args.device)   # move the classifier to the target device
dataset.class_weights = dataset.class_weights.to(args.device)

loss_func = nn.CrossEntropyLoss(weight=dataset.class_weights)
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

The training loop is the same as in the MLP section: for each epoch, iterate over the training split and then the validation split, track the running loss and accuracy, update the train state (checkpointing and early stopping), and step the learning-rate scheduler.

Test-set evaluation also mirrors the MLP section: reload the best checkpoint, iterate over the test split, and store test_loss and test_acc in train_state before printing them.

Model inference:
def predict_nationality(surname, classifier, vectorizer):
    """Predict the nationality from a new surname

    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(0)
    result = classifier(vectorized_surname, apply_softmax=True)

    probability_values, indices = result.max(dim=1)
    index = indices.item()

    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    return {'nationality': predicted_nationality, 'probability': probability_value}

new_surname = input("Enter a surname to classify: ")
classifier = classifier.cpu()
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))

def predict_topk_nationality(surname, classifier, vectorizer, k=5):
    """Predict the top K nationalities from a new surname

    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        k (int): the number of top nationalities to return
    Returns:
        list of dictionaries, each dictionary is a nationality and a probability
    """
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(dim=0)
    prediction_vector = classifier(vectorized_surname, apply_softmax=True)
    probability_values, indices = torch.topk(prediction_vector, k=k)

    # returned size is 1,k
    probability_values = probability_values[0].detach().numpy()
    indices = indices[0].detach().numpy()

    results = []
    for kth_index in range(k):
        nationality = vectorizer.nationality_vocab.lookup_index(indices[kth_index])
        probability_value = probability_values[kth_index]
        results.append({'nationality': nationality,
                        'probability': probability_value})

    return results

new_surname = input("Enter a surname to classify: ")

k = int(input("How many of the top predictions to see? "))
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)

predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)

print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))
