
[Natural Language Processing] A Sentiment Classification Experiment Based on MindSpore (IMDB)

Experiment Name

Applications of Natural Language Processing: A Sentiment Classification Experiment Based on MindSpore

  • Experiment Objective

Experience MindSpore's application to natural language processing through sentiment classification of IMDB movie reviews.

  • Experiment Content:

1. Preparation.

2. Load the dataset and perform data processing.

3. Define the network.

4. Define the optimizer and loss function.

5. Train the network on the data to produce a model.

6. After obtaining the model, evaluate its accuracy on the validation dataset.

  • Experiment Procedure:

    1. Downloading the required software and setting up the environment

(1) Installing Anaconda and configuring the MindSpore framework

Anaconda was already installed earlier while learning Python, so only the environment configuration remains.

Create a conda virtual environment.

Go to the Huawei MindSpore official website, select MindSpore, and download an installation package from a suitable mirror:

The installation succeeded. Several candidate packages were tried before finding one that installed without errors.

Then check that the installation succeeded:
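A quick way to check the installation from Python is sketched below (mindspore.run_check() is available in recent MindSpore releases; if your version lacks it, printing the version string is enough):

import mindspore

print(mindspore.__version__)   # prints the installed version string
mindspore.run_check()          # built-in sanity check in recent releases; prints a success message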

  

(2) Configuring Jupyter kernels for the environment:

      Install ipykernel and nb_conda.

Use the newly available kernel, then download and open JupyterLab.

 

2. Preparation

(1) Extract the dataset into the current working directory, so that the structure matches the layout shown below:
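The original screenshot is not reproduced here; based on the default paths in the reference code (./datasets/aclImdb for the IMDB data and ./datasets/glove for the GloVe vectors), the expected layout is roughly:

datasets/
├── aclImdb/
│   ├── train/              # pos/ and neg/ subdirectories, one review per .txt file
│   └── test/               # same structure as train/
└── glove/
    └── glove.6B.300d.txt   # 300-dimensional GloVe vectors (matches embed_size=300); gensim's
                            # load_word2vec_format typically needs a word2vec-style header line
                            # (vocabulary size and dimension) prepended to this file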

      

    (2) Choosing the network

Use the SentimentNet network, which is built on LSTM, for this natural language processing task.

LSTM (Long Short-Term Memory) is a recurrent neural network suited to processing and predicting important events separated by long intervals and delays in a time series. This experiment targets the GPU or CPU hardware platform.
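As a minimal sketch of how such an LSTM encoder behaves in MindSpore (shapes follow this experiment's configuration of embed_size=300, num_hiddens=100, num_layers=2, bidirectional=True; the zero tensors are purely illustrative):

import numpy as np
from mindspore import Tensor, nn

# 2-layer bidirectional LSTM with the same dimensions as the experiment's encoder.
lstm = nn.LSTM(input_size=300, hidden_size=100, num_layers=2,
               has_bias=True, batch_first=False, bidirectional=True)

seq_len, batch_size = 500, 64
x = Tensor(np.zeros((seq_len, batch_size, 300), dtype=np.float32))  # (seq_len, batch, embed_size)
h0 = Tensor(np.zeros((2 * 2, batch_size, 100), dtype=np.float32))   # (num_layers * num_directions, batch, hidden)
c0 = Tensor(np.zeros((2 * 2, batch_size, 100), dtype=np.float32))

output, (hn, cn) = lstm(x, (h0, c0))
print(output.shape)  # (500, 64, 200): forward and backward hidden states concatenated per time step

SentimentNet later concatenates the first and last time steps of this output into a 400-dimensional encoding, which is why its decoder is nn.Dense(num_hiddens * 4, num_classes) in the bidirectional case.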

    (3) Configuring runtime information and the SentimentNet network parameters

      Install the easydict and gensim dependency packages.
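Assuming pip is available inside the conda environment, both packages can typically be installed with:

pip install easydict gensim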

Then run the configuration code (the first block of the reference code at the end of this report) to set the parameters required for training.

3. Data Processing

    Perform dataset preprocessing:

· Define the ImdbParser class to parse the text dataset, including encoding, tokenization, alignment, and handling of the raw GloVe data, so that it fits the network structure.

· Define the convert_to_mindrecord function to convert the dataset into MindRecord format, which MindSpore can read efficiently. In the helper function _convert_to_mindrecord, weight.txt is the weight-parameter file generated automatically during preprocessing.

· Call the convert_to_mindrecord function to perform the dataset preprocessing.

After it runs successfully, MindRecord files are generated in the preprocess directory. As long as the dataset does not change, this step does not need to be repeated before every training run. Inspecting the preprocess directory shows the series of generated files.
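Since _convert_to_mindrecord writes with shard_num=4, the generated files should look roughly like the listing below (names inferred from the FileWriter settings rather than copied from the original screenshot):

preprocess/
├── aclImdb_train.mindrecord0 … aclImdb_train.mindrecord3          # four training shards
├── aclImdb_train.mindrecord0.db … aclImdb_train.mindrecord3.db    # index files for the shards
├── aclImdb_test.mindrecord0 … aclImdb_test.mindrecord3            # four test shards
├── aclImdb_test.mindrecord0.db … aclImdb_test.mindrecord3.db
└── weight.txt                                                     # embedding weights saved during preprocessing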

Create the training set:

Define the dataset-creation function lstm_create_dataset and use it to create the training set ds_train. Create a dictionary iterator with the create_dict_iterator method and read data from ds_train through it. Running the code creates the dataset and reads the label list of the first batch as well as the feature data of the first element in that batch.

4. Defining the Network

    (1) Import the modules needed to initialize the network.

(2) Define the device types for which single-layer LSTM operators need to be stacked.

(3) Define the lstm_default_state function to initialize the network parameters and state.

(4) Define the stack_lstm_default_state function to initialize the parameters and state needed by the stacked operators.

(5) For the CPU scenario, stack custom single-layer LSTM operators to implement the functionality of the multi-layer LSTM operator.

(6) Define the network structure (the SentimentNet network) as an nn.Cell subclass.

(7) Instantiate SentimentNet to create the network, and finally print the parameters loaded into it (a quick shape sanity check is sketched below).
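As an optional, assumed sanity check (not part of the original experiment), once the reference code at the end of this report has been run, one dummy batch can be pushed through the instantiated network to confirm that the logits have shape (batch_size, num_classes):

import numpy as np
from mindspore import Tensor

# One dummy batch of padded reviews: 64 samples, each 500 word indices long (index 0 = '<unk>').
dummy_ids = Tensor(np.zeros((cfg.batch_size, 500), dtype=np.int32))
logits = network(dummy_ids)
print(logits.shape)  # expected: (64, 2)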

  


5. Training and Saving the Model

    Create the optimizer and the loss function, load the training dataset (ds_train) and configure the CheckPoint generation settings, then train the model through the model.train interface.

    

6. Model Validation

Create and load the validation dataset (ds_eval), load the CheckPoint file saved during training, and run the evaluation to check the quality of the model.

  • Results:

Based on the preceding operations and training, the model-validation output shows that after 10 epochs the sentiment classification accuracy on the validation dataset is about 84.5%, which is a reasonably satisfactory result.

  • Reflections:

The main problem in this experiment was the MindSpore version. On macOS only versions 1.6 and later are supported, but the nn.LSTMCell usage in the reference code requires a version below 1.6; after asking a senior student, I replaced nn.LSTMCell with nn.LSTM and the code ran correctly. This made me realize that the environment setup and the model code have to be kept consistent. Even after that fix, the subsequent model training still had problems, so I switched to Windows and retrained from scratch. This was my first exposure to text classification in natural language processing; by setting up the environment, implementing the code, and training on the text step by step, I completed the sentiment classification task. When defining the network, I also noticed that the network structure for text topic classification is broadly similar to that for sentiment classification: once you know how to build the sentiment classification network, it is easy to build a similar one and, with minor parameter tuning, apply it to a topic classification task. Sentiment classification is a relatively simple yet highly practical classification problem.

Reference code:
 

import argparse
from mindspore import context
from easydict import EasyDict as edict

# LSTM CONFIG
lstm_cfg = edict({
    'num_classes': 2,
    'learning_rate': 0.1,
    'momentum': 0.9,
    'num_epochs': 10,
    'batch_size': 64,
    'embed_size': 300,
    'num_hiddens': 100,
    'num_layers': 2,
    'bidirectional': True,
    'save_checkpoint_steps': 390,
    'keep_checkpoint_max': 10
})

cfg = lstm_cfg

parser = argparse.ArgumentParser(description='MindSpore LSTM Example')
parser.add_argument('--preprocess', type=str, default='false', choices=['true', 'false'],
                    help='whether to preprocess data.')
parser.add_argument('--aclimdb_path', type=str, default="./datasets/aclImdb",
                    help='path where the dataset is stored.')
parser.add_argument('--glove_path', type=str, default="./datasets/glove",
                    help='path where the GloVe is stored.')
parser.add_argument('--preprocess_path', type=str, default="./preprocess",
                    help='path where the pre-process data is stored.')
parser.add_argument('--ckpt_path', type=str, default="./models/ckpt/nlp_application",
                    help='the path to save the checkpoint file.')
parser.add_argument('--pre_trained', type=str, default=None,
                    help='the pretrained checkpoint file path.')
parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU', 'CPU'],
                    help='the target device to run, support "GPU", "CPU". Default: "GPU".')

args = parser.parse_args(['--device_target', 'CPU', '--preprocess', 'true'])

context.set_context(
    mode=context.GRAPH_MODE,
    save_graphs=False,
    device_target=args.device_target)

print("Current context loaded:\n mode: {}\n device_target: {}".format(
    context.get_context("mode"), context.get_context("device_target")))

import os
from itertools import chain
import numpy as np
import gensim
from mindspore.mindrecord import FileWriter


class ImdbParser():
    """
    parse aclImdb data to features and labels.
    sentence->tokenized->encoded->padding->features
    """

    def __init__(self, imdb_path, glove_path, embed_size=300):
        self.__segs = ['train', 'test']
        self.__label_dic = {'pos': 1, 'neg': 0}
        self.__imdb_path = imdb_path
        self.__glove_dim = embed_size
        self.__glove_file = os.path.join(glove_path, 'glove.6B.' + str(self.__glove_dim) + 'd.txt')

        # properties
        self.__imdb_datas = {}
        self.__features = {}
        self.__labels = {}
        self.__vacab = {}
        self.__word2idx = {}
        self.__weight_np = {}
        self.__wvmodel = None

    def parse(self):
        """
        parse imdb data to memory
        """
        self.__wvmodel = gensim.models.KeyedVectors.load_word2vec_format(self.__glove_file)
        for seg in self.__segs:
            self.__parse_imdb_datas(seg)
            self.__parse_features_and_labels(seg)
            self.__gen_weight_np(seg)

    def __parse_imdb_datas(self, seg):
        """
        load data from txt
        """
        data_lists = []
        for label_name, label_id in self.__label_dic.items():
            sentence_dir = os.path.join(self.__imdb_path, seg, label_name)
            for file in os.listdir(sentence_dir):
                with open(os.path.join(sentence_dir, file), mode='r', encoding='utf8') as f:
                    sentence = f.read().replace('\n', '')
                    data_lists.append([sentence, label_id])
        self.__imdb_datas[seg] = data_lists

    def __parse_features_and_labels(self, seg):
        """
        parse features and labels
        """
        features = []
        labels = []
        for sentence, label in self.__imdb_datas[seg]:
            features.append(sentence)
            labels.append(label)

        self.__features[seg] = features
        self.__labels[seg] = labels

        # update feature to tokenized
        self.__updata_features_to_tokenized(seg)
        # parse vacab
        self.__parse_vacab(seg)
        # encode feature
        self.__encode_features(seg)
        # padding feature
        self.__padding_features(seg)

    def __updata_features_to_tokenized(self, seg):
        tokenized_features = []
        for sentence in self.__features[seg]:
            tokenized_sentence = [word.lower() for word in sentence.split(" ")]
            tokenized_features.append(tokenized_sentence)
        self.__features[seg] = tokenized_features

    def __parse_vacab(self, seg):
        # vocab
        tokenized_features = self.__features[seg]
        vocab = set(chain(*tokenized_features))
        self.__vacab[seg] = vocab

        # word_to_idx: {'hello': 1, 'world': 111, ... '<unk>': 0}
        word_to_idx = {word: i + 1 for i, word in enumerate(vocab)}
        word_to_idx['<unk>'] = 0
        self.__word2idx[seg] = word_to_idx

    def __encode_features(self, seg):
        """ encode word to index """
        word_to_idx = self.__word2idx['train']
        encoded_features = []
        for tokenized_sentence in self.__features[seg]:
            encoded_sentence = []
            for word in tokenized_sentence:
                encoded_sentence.append(word_to_idx.get(word, 0))
            encoded_features.append(encoded_sentence)
        self.__features[seg] = encoded_features

    def __padding_features(self, seg, maxlen=500, pad=0):
        """ pad all features to the same length """
        padded_features = []
        for feature in self.__features[seg]:
            if len(feature) >= maxlen:
                padded_feature = feature[:maxlen]
            else:
                padded_feature = feature
                while len(padded_feature) < maxlen:
                    padded_feature.append(pad)
            padded_features.append(padded_feature)
        self.__features[seg] = padded_features

    def __gen_weight_np(self, seg):
        """
        generate weight by gensim
        """
        weight_np = np.zeros((len(self.__word2idx[seg]), self.__glove_dim), dtype=np.float32)
        for word, idx in self.__word2idx[seg].items():
            if word not in self.__wvmodel:
                continue
            word_vector = self.__wvmodel.get_vector(word)
            weight_np[idx, :] = word_vector

        self.__weight_np[seg] = weight_np

    def get_datas(self, seg):
        """
        return features, labels, and weight
        """
        features = np.array(self.__features[seg]).astype(np.int32)
        labels = np.array(self.__labels[seg]).astype(np.int32)
        weight = np.array(self.__weight_np[seg])
        return features, labels, weight


def _convert_to_mindrecord(data_home, features, labels, weight_np=None, training=True):
    """
    convert imdb dataset to mindrecord dataset
    """
    if weight_np is not None:
        np.savetxt(os.path.join(data_home, 'weight.txt'), weight_np)

    # write mindrecord
    schema_json = {"id": {"type": "int32"},
                   "label": {"type": "int32"},
                   "feature": {"type": "int32", "shape": [-1]}}

    data_dir = os.path.join(data_home, "aclImdb_train.mindrecord")
    if not training:
        data_dir = os.path.join(data_home, "aclImdb_test.mindrecord")

    def get_imdb_data(features, labels):
        data_list = []
        for i, (label, feature) in enumerate(zip(labels, features)):
            data_json = {"id": i,
                         "label": int(label),
                         "feature": feature.reshape(-1)}
            data_list.append(data_json)
        return data_list

    writer = FileWriter(data_dir, shard_num=4)
    data = get_imdb_data(features, labels)
    writer.add_schema(schema_json, "nlp_schema")
    writer.add_index(["id", "label"])
    writer.write_raw_data(data)
    writer.commit()


def convert_to_mindrecord(embed_size, aclimdb_path, preprocess_path, glove_path):
    """
    convert imdb dataset to mindrecord dataset
    """
    parser = ImdbParser(aclimdb_path, glove_path, embed_size)
    parser.parse()

    if not os.path.exists(preprocess_path):
        print(f"preprocess path {preprocess_path} does not exist")
        os.makedirs(preprocess_path)

    train_features, train_labels, train_weight_np = parser.get_datas('train')
    _convert_to_mindrecord(preprocess_path, train_features, train_labels, train_weight_np)

    test_features, test_labels, _ = parser.get_datas('test')
    _convert_to_mindrecord(preprocess_path, test_features, test_labels, training=False)


if args.preprocess == "true":
    os.system("rm -f ./preprocess/aclImdb* weight*")
    print("============== Starting Data Pre-processing ==============")
    convert_to_mindrecord(cfg.embed_size, args.aclimdb_path, args.preprocess_path, args.glove_path)
    print("======================= Successful =======================")

import os
import mindspore.dataset as ds


def lstm_create_dataset(data_home, batch_size, repeat_num=1, training=True):
    """Data operations."""
    ds.config.set_seed(1)
    data_dir = os.path.join(data_home, "aclImdb_train.mindrecord0")
    if not training:
        data_dir = os.path.join(data_home, "aclImdb_test.mindrecord0")

    data_set = ds.MindDataset(data_dir, columns_list=["feature", "label"], num_parallel_workers=4)

    # shuffle, batch and repeat the dataset
    data_set = data_set.shuffle(buffer_size=data_set.get_dataset_size())
    data_set = data_set.batch(batch_size=batch_size, drop_remainder=True)
    data_set = data_set.repeat(count=repeat_num)

    return data_set


ds_train = lstm_create_dataset(args.preprocess_path, cfg.batch_size)

iterator = next(ds_train.create_dict_iterator())
first_batch_label = iterator["label"].asnumpy()
first_batch_first_feature = iterator["feature"].asnumpy()[0]
print(f"The first batch contains label below:\n{first_batch_label}\n")
print(f"The feature of the first item in the first batch is below:\n{first_batch_first_feature}")

import math
import numpy as np
from mindspore import Tensor, nn, context, Parameter, ParameterTuple
from mindspore.common.initializer import initializer
import mindspore.ops as ops

STACK_LSTM_DEVICE = ["CPU"]


# Initialize short-term memory (h) and long-term memory (c) to 0
def lstm_default_state(batch_size, hidden_size, num_layers, bidirectional):
    """init default input."""
    num_directions = 2 if bidirectional else 1
    h = Tensor(np.zeros((num_layers * num_directions, batch_size, hidden_size)).astype(np.float32))
    c = Tensor(np.zeros((num_layers * num_directions, batch_size, hidden_size)).astype(np.float32))
    return h, c


def stack_lstm_default_state(batch_size, hidden_size, num_layers, bidirectional):
    """init default input."""
    num_directions = 2 if bidirectional else 1

    h_list = []
    c_list = []
    for _ in range(num_layers):
        h_list.append(Tensor(np.zeros((num_directions, batch_size, hidden_size)).astype(np.float32)))
        c_list.append(Tensor(np.zeros((num_directions, batch_size, hidden_size)).astype(np.float32)))
    h, c = tuple(h_list), tuple(c_list)
    return h, c


class StackLSTM(nn.Cell):
    """
    Stack multi-layers LSTM together.
    """

    def __init__(self,
                 input_size,
                 hidden_size,
                 num_layers=1,
                 has_bias=True,
                 batch_first=False,
                 dropout=0.0,
                 bidirectional=False):
        super(StackLSTM, self).__init__()
        self.num_layers = num_layers
        self.batch_first = batch_first
        self.transpose = ops.Transpose()

        # direction number
        num_directions = 2 if bidirectional else 1

        # input_size list
        input_size_list = [input_size]
        for i in range(num_layers - 1):
            input_size_list.append(hidden_size * num_directions)

        # layers
        layers = []
        for i in range(num_layers):
            layers.append(nn.LSTMCell(input_size=input_size_list[i],
                                      hidden_size=hidden_size,
                                      has_bias=has_bias,
                                      batch_first=batch_first,
                                      bidirectional=bidirectional,
                                      dropout=dropout))

        # weights
        weights = []
        for i in range(num_layers):
            # weight size
            weight_size = (input_size_list[i] + hidden_size) * num_directions * hidden_size * 4
            if has_bias:
                bias_size = num_directions * hidden_size * 4
                weight_size = weight_size + bias_size

            # numpy weight
            stdv = 1 / math.sqrt(hidden_size)
            w_np = np.random.uniform(-stdv, stdv, (weight_size, 1, 1)).astype(np.float32)

            # lstm weight
            weights.append(Parameter(initializer(Tensor(w_np), w_np.shape), name="weight" + str(i)))

        self.lstms = layers
        self.weight = ParameterTuple(tuple(weights))

    def construct(self, x, hx):
        """construct"""
        if self.batch_first:
            x = self.transpose(x, (1, 0, 2))
        # stack lstm
        h, c = hx
        hn = cn = None
        for i in range(self.num_layers):
            x, hn, cn, _, _ = self.lstms[i](x, h[i], c[i], self.weight[i])
        if self.batch_first:
            x = self.transpose(x, (1, 0, 2))
        return x, (hn, cn)


class SentimentNet(nn.Cell):
    """Sentiment network structure."""

    def __init__(self,
                 vocab_size,
                 embed_size,
                 num_hiddens,
                 num_layers,
                 bidirectional,
                 num_classes,
                 weight,
                 batch_size):
        super(SentimentNet, self).__init__()
        # Map words to vectors
        self.embedding = nn.Embedding(vocab_size,
                                      embed_size,
                                      embedding_table=weight)
        self.embedding.embedding_table.requires_grad = False
        self.trans = ops.Transpose()
        self.perm = (1, 0, 2)

        if context.get_context("device_target") in STACK_LSTM_DEVICE:
            # stack lstm by user
            self.encoder = StackLSTM(input_size=embed_size,
                                     hidden_size=num_hiddens,
                                     num_layers=num_layers,
                                     has_bias=True,
                                     bidirectional=bidirectional,
                                     dropout=0.0)
            self.h, self.c = stack_lstm_default_state(batch_size, num_hiddens, num_layers, bidirectional)
        else:
            # standard lstm
            self.encoder = nn.LSTM(input_size=embed_size,
                                   hidden_size=num_hiddens,
                                   num_layers=num_layers,
                                   has_bias=True,
                                   bidirectional=bidirectional,
                                   dropout=0.0)
            self.h, self.c = lstm_default_state(batch_size, num_hiddens, num_layers, bidirectional)

        self.concat = ops.Concat(1)
        if bidirectional:
            self.decoder = nn.Dense(num_hiddens * 4, num_classes)
        else:
            self.decoder = nn.Dense(num_hiddens * 2, num_classes)

    def construct(self, inputs):
        # input: (64, 500, 300)
        embeddings = self.embedding(inputs)
        embeddings = self.trans(embeddings, self.perm)
        output, _ = self.encoder(embeddings, (self.h, self.c))
        # states[i] size (64, 200) -> encoding size (64, 400)
        encoding = self.concat((output[0], output[499]))
        outputs = self.decoder(encoding)
        return outputs


embedding_table = np.loadtxt(os.path.join(args.preprocess_path, "weight.txt")).astype(np.float32)

network = SentimentNet(vocab_size=embedding_table.shape[0],
                       embed_size=cfg.embed_size,
                       num_hiddens=cfg.num_hiddens,
                       num_layers=cfg.num_layers,
                       bidirectional=cfg.bidirectional,
                       num_classes=cfg.num_classes,
                       weight=Tensor(embedding_table),
                       batch_size=cfg.batch_size)

print(network.parameters_dict(recurse=True))

from mindspore import Model
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, TimeMonitor, LossMonitor
from mindspore.nn import Accuracy
from mindspore import nn

os.system("rm -f {0}/*.ckpt {0}/*.meta".format(args.ckpt_path))

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
opt = nn.Momentum(network.trainable_params(), cfg.learning_rate, cfg.momentum)
model = Model(network, loss, opt, {'acc': Accuracy()})
loss_cb = LossMonitor(per_print_times=78)

print("============== Starting Training ==============")
config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix="lstm", directory=args.ckpt_path, config=config_ck)
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
if args.device_target == "CPU":
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb], dataset_sink_mode=False)
else:
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("============== Training Success ==============")

from mindspore import load_checkpoint, load_param_into_net

args.ckpt_path_saved = f'{args.ckpt_path}/lstm-{cfg.num_epochs}_390.ckpt'
print("============== Starting Testing ==============")
ds_eval = lstm_create_dataset(args.preprocess_path, cfg.batch_size, training=False)
param_dict = load_checkpoint(args.ckpt_path_saved)
load_param_into_net(network, param_dict)
if args.device_target == "CPU":
    acc = model.eval(ds_eval, dataset_sink_mode=False)
else:
    acc = model.eval(ds_eval)
print("============== {} ==============".format(acc))
