
Deep Learning (8): Sentiment Analysis of Text Content Based on BERT

Goal: perform sentiment analysis on text with a BERT network. Product reviews collected from the web are preprocessed, fed into a BERT model for training and inference, and the predicted sentiment is output.

I. Principles

1. Understand the basic principles of the BERT algorithm

About BERT

The BERT model released by the Google AI team achieved top results on 11 different natural language processing tasks, a milestone for NLP and one of the most important recent advances in the field.

BERT is a method for pre-training language representations: a general-purpose "language understanding" model is first trained on a large text corpus (such as Wikipedia), and that model can then be used for downstream NLP tasks (such as question answering). BERT outperforms earlier approaches because it is the first unsupervised, deeply bidirectional system for NLP pre-training. Its strength is that it can be adapted to many kinds of NLP tasks with little effort.

BERT rests on two key ideas: first, it uses the Transformer as its feature extractor; second, it is pre-trained with a bidirectional (masked) language model.
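As a toy illustration of the second point (this is not BERT's actual implementation, just a sketch of the masked-language-model idea), the snippet below masks one token of a review and shows that the prediction target can use context on both sides, which is what distinguishes a bidirectional masked language model from a left-to-right one:

    # Toy sketch of the masked-LM objective (illustrative only, not BERT's real code)
    import random

    tokens = ["[CLS]", "这", "款", "手", "机", "很", "好", "用", "[SEP]"]

    # Pick one ordinary token position to mask (never [CLS]/[SEP])
    maskable = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    pos = random.choice(maskable)
    target = tokens[pos]

    masked = list(tokens)
    masked[pos] = "[MASK]"

    # A bidirectional model predicts `target` from BOTH sides of the mask;
    # a traditional left-to-right language model would only see the left part.
    left_context = masked[:pos]
    right_context = masked[pos + 1:]
    print("input:", " ".join(masked))
    print("predict", repr(target), "from left:", left_context, "and right:", right_context)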
 

Reference blog post: Bert文本分类实战(附代码讲解)_Dr.sky_的博客-CSDN博客_bert实战

2. Get familiar with conventional text classification methods

A typical text classification pipeline: 1. preprocess the input text; 2. represent the text and extract features; 3. build a classifier model; 4. classify the text.
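As a point of comparison with BERT, here is a minimal sketch of such a conventional pipeline, assuming scikit-learn is available; the example reviews and labels are made up for illustration:

    # Minimal sketch of a conventional text-classification pipeline (illustrative data)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["质量很好,非常满意", "物流太慢,体验很差", "性价比高,值得购买", "用了一次就坏了"]  # 1. preprocessed input texts
    labels = [1, 0, 1, 0]                                                                     # positive / negative

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),  # 2. text representation / feature extraction
        LogisticRegression())                                  # 3. classifier model

    clf.fit(texts, labels)                   # train the classifier
    print(clf.predict(["很满意,还会再买"]))   # 4. classify new text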

Reference blog post on text classification techniques: 一文读懂文本分类技术路线_Yunlord的博客-CSDN博客_文本分类技术

II. Process

1. Prepare the data

    # Prepare the data: fetch it from OSS and extract it into the current directory
    import os
    import oss2

    access_key_id = os.getenv('OSS_TEST_ACCESS_KEY_ID', 'LTAI4G1MuHTUeNrKdQEPnbph')
    access_key_secret = os.getenv('OSS_TEST_ACCESS_KEY_SECRET', 'm1ILSoVqcPUxFFDqer4tKDxDkoP1ji')
    bucket_name = os.getenv('OSS_TEST_BUCKET', 'mldemo')
    endpoint = os.getenv('OSS_TEST_ENDPOINT', 'https://oss-cn-shanghai.aliyuncs.com')

    # Create a Bucket object; all object-related operations go through it
    bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)
    # Download the archive to a local file
    bucket.get_object_to_file('data/c12/bert_data.zip', 'bert_data.zip')

    # Extract the data
    !unzip -o -q bert_data.zip
    !rm -rf __MACOSX

    !ls bert_input_data -ilht

2. Import libraries

    import collections
    import csv
    import errno
    import tensorflow as tf
    import logging
    import logging as log
    import sys, os
    import traceback
    from sklearn.utils import shuffle
    import pandas as pd
    import numpy as np
    # modeling.py, optimization.py and tokenization.py come from the google-research/bert repository
    import modeling
    import optimization
    import tokenization
    %matplotlib inline

 

3. Read the data

    # Read the data and sample 100 rows to keep the demo small
    df = pd.read_csv("./bert_input_data/train_data.tsv", header=0, sep='\t').sample(n=100, random_state=1)
    df.columns = ['baseid', 'xtext', 'category']
    df.head(5)

    df.info()

    # Number of distinct classes
    print('\nnumber of different class: ', len(list(set(df.category))))
    print(list(set(df.category)))

 

Compare the number of samples per class:

    df.category.value_counts().plot(kind='bar')

 

4. Build the training set

To build the training set, first shuffle the data:

    from sklearn.utils import shuffle
    df = shuffle(df, random_state=0)
    # Inspect the data after shuffling
    df.head()

 

Split the data into training, validation, and test sets at a roughly 8:1:1 ratio:

    msk = np.random.rand(len(df)) < 0.8
    train = df[msk]
    dev_test = df[~msk]
    msk = np.random.rand(len(dev_test)) < 0.5
    dev = dev_test[msk]
    test = dev_test[~msk]

Save the three splits in TSV format as input for the BERT model:

    export_csv_train = train.to_csv('./bert_input_data/level1_train.tsv', sep='\t', index=None, header=None)
    export_csv_dev = dev.to_csv('./bert_input_data/level1_dev.tsv', sep='\t', index=None, header=None)
    export_csv_test = test.to_csv('./bert_input_data/level1_test.tsv', sep='\t', index=None, header=None)
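The training code further below reads a label file at ./bert_input_data/label.txt (one label per line). If that file is not already included in the downloaded archive, a minimal sketch like the following could generate it from the category column:

    # Hedged sketch: write one label per line, only if the file is not already provided with the data
    import os
    label_path = './bert_input_data/label.txt'
    if not os.path.exists(label_path):
        with open(label_path, 'w') as f:
            for label in sorted(set(df.category)):
                f.write(str(label) + '\n')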

5. Model training

Define the model hyperparameters:

    MODEL_OUTPUT_DIR = "./bert_output/"
    init_checkpoint = "./chinese_wwm_ext_L-12_H-768_A-12/bert_model.ckpt"
    bert_config_file = "./chinese_wwm_ext_L-12_H-768_A-12/bert_config.json"
    vocab_file = "./chinese_wwm_ext_L-12_H-768_A-12/vocab.txt"
    save_checkpoints_steps = 200
    iterations_per_loop = 100
    num_tpu_cores = 4
    warmup_proportion = 0.1  # fraction of training steps used for learning-rate warmup (BERT's default is 0.1)
    train_batch_size = 1
    learning_rate = 5e-5
    eval_batch_size = 1
    predict_batch_size = 2
    max_seq_length = 16
    data_dir = "./bert_input_data/"

    # Clean the model output directory to save disk space
    !rm bert_output -rf

BERT data preparation and training code:

    class InputExample(object):
        """A single training/test example for simple sequence classification."""

        def __init__(self, guid, text_a, text_b=None, label=None):
            self.guid = guid
            self.text_a = text_a
            self.text_b = text_b
            self.label = label


    class PaddingInputExample(object):
        pass


    class InputFeatures(object):
        """A single set of features of data."""

        def __init__(self,
                     input_ids,
                     input_mask,
                     segment_ids,
                     label_id,
                     is_real_example=True):
            self.input_ids = input_ids
            self.input_mask = input_mask
            self.segment_ids = segment_ids
            self.label_id = label_id
            self.is_real_example = is_real_example


    class DataProcessor(object):
        """Base class for data converters for sequence classification data sets."""

        def get_train_examples(self, data_dir):
            """Gets a collection of `InputExample`s for the train set."""
            raise NotImplementedError()

        def get_dev_examples(self, data_dir):
            """Gets a collection of `InputExample`s for the dev set."""
            raise NotImplementedError()

        def get_test_examples(self, data_dir):
            """Gets a collection of `InputExample`s for prediction."""
            raise NotImplementedError()

        def get_labels(self):
            """Gets the list of labels for this data set."""
            raise NotImplementedError()

        @classmethod
        def _read_tsv(cls, input_file, quotechar=None):
            """Reads a tab separated value file."""
            with tf.gfile.Open(input_file, "r") as f:
                reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
                lines = []
                for line in reader:
                    lines.append(line)
                return lines


    class SHLibProcessor(DataProcessor):
        """Processor for the SHlib data set."""

        def __init__(self, label_list):
            self.static_label_list = load_labels(label_list)

        def get_train_examples(self, train_lines):
            """See base class."""
            return self._create_examples(train_lines, "train")

        def get_dev_examples(self, eval_lines):
            """See base class."""
            return self._create_examples(eval_lines, "dev")

        def get_test_examples(self, predict_lines):
            """See base class."""
            return self._create_examples(predict_lines, "test")

        def get_labels(self):
            """See base class."""
            return self.static_label_list

        def _create_examples(self, lines, set_type):
            """Creates examples for the training and dev sets."""
            examples = []
            for (i, line) in enumerate(lines):
                guid = line[0]  # "%s-%s" % (set_type, i)
                if set_type == "test":
                    text_a = tokenization.convert_to_unicode(line[1])
                    label = tokenization.convert_to_unicode(line[2])
                else:
                    text_a = tokenization.convert_to_unicode(line[1])
                    label = tokenization.convert_to_unicode(line[2])
                examples.append(
                    InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
            return examples

        def load_labels(self, label_file_path):
            with open(label_file_path, 'r') as label_file:
                static_label_list = list(label_file.read().splitlines())
                print(static_label_list)
            return static_label_list


    def convert_single_example(ex_index, example, label_list, max_seq_length,
                               tokenizer):
        """Converts a single `InputExample` into a single `InputFeatures`."""
        if isinstance(example, PaddingInputExample):
            return InputFeatures(
                input_ids=[0] * max_seq_length,
                input_mask=[0] * max_seq_length,
                segment_ids=[0] * max_seq_length,
                label_id=0,
                is_real_example=False)

        label_map = {}
        for (i, label) in enumerate(label_list):
            label_map[label] = i

        tokens_a = tokenizer.tokenize(example.text_a)
        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label_map[example.label]
        if ex_index < 3:
            print("*** Example ***")
            print("guid: %s" % (example.guid))
            print("tokens: %s" % " ".join(
                [tokenization.printable_text(x) for x in tokens]))
            print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            print("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            print("label: %s (id = %d)" % (example.label, label_id))

        feature = InputFeatures(
            input_ids=input_ids,
            input_mask=input_mask,
            segment_ids=segment_ids,
            label_id=label_id,
            is_real_example=True)
        return feature


    def file_based_convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_file):
        """Convert a set of `InputExample`s to a TFRecord file."""
        writer = tf.python_io.TFRecordWriter(output_file)
        for (ex_index, example) in enumerate(examples):
            if ex_index % 10000 == 0:
                tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
            feature = convert_single_example(ex_index, example, label_list,
                                             max_seq_length, tokenizer)

            def create_int_feature(values):
                f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
                return f

            features = collections.OrderedDict()
            features["input_ids"] = create_int_feature(feature.input_ids)
            features["input_mask"] = create_int_feature(feature.input_mask)
            features["segment_ids"] = create_int_feature(feature.segment_ids)
            features["label_ids"] = create_int_feature([feature.label_id])
            features["is_real_example"] = create_int_feature(
                [int(feature.is_real_example)])

            tf_example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(tf_example.SerializeToString())
        writer.close()


    def file_based_input_fn_builder(input_file, seq_length, is_training,
                                    drop_remainder):
        """Creates an `input_fn` closure to be passed to TPUEstimator."""
        name_to_features = {
            "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
            "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
            "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
            "label_ids": tf.FixedLenFeature([], tf.int64),
            "is_real_example": tf.FixedLenFeature([], tf.int64),
        }

        def _decode_record(record, name_to_features):
            """Decodes a record to a TensorFlow example."""
            example = tf.parse_single_example(record, name_to_features)
            # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
            # So cast all int64 to int32.
            for name in list(example.keys()):
                t = example[name]
                if t.dtype == tf.int64:
                    t = tf.to_int32(t)
                example[name] = t
            return example

        def input_fn(params):
            """The actual input function."""
            batch_size = params["batch_size"]
            # For training, we want a lot of parallel reading and shuffling.
            # For eval, we want no shuffling and parallel reading doesn't matter.
            d = tf.data.TFRecordDataset(input_file)
            if is_training:
                d = d.repeat()
                d = d.shuffle(buffer_size=100)
            d = d.apply(
                tf.contrib.data.map_and_batch(
                    lambda record: _decode_record(record, name_to_features),
                    batch_size=batch_size,
                    drop_remainder=drop_remainder))
            return d

        return input_fn


    def _truncate_seq_pair(tokens_a, tokens_b, max_length):
        """Truncates a sequence pair in place to the maximum length."""
        # This is a simple heuristic which will always truncate the longer sequence
        # one token at a time. This makes more sense than truncating an equal percent
        # of tokens from each, since if one sequence is very short then each token
        # that's truncated likely contains more information than a longer sequence.
        while True:
            total_length = len(tokens_a) + len(tokens_b)
            if total_length <= max_length:
                break
            if len(tokens_a) > len(tokens_b):
                tokens_a.pop()
            else:
                tokens_b.pop()


    def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                     labels, num_labels, use_one_hot_embeddings):
        """Creates a classification model."""
        model = modeling.BertModel(
            config=bert_config,
            is_training=is_training,
            input_ids=input_ids,
            input_mask=input_mask,
            token_type_ids=segment_ids,
            use_one_hot_embeddings=use_one_hot_embeddings)

        # In the demo, we are doing a simple classification task on the entire
        # segment.
        #
        # If you want to use the token-level output, use model.get_sequence_output()
        # instead.
        output_layer = model.get_pooled_output()
        hidden_size = output_layer.shape[-1].value

        output_weights = tf.get_variable(
            "output_weights", [num_labels, hidden_size],
            initializer=tf.truncated_normal_initializer(stddev=0.02))
        output_bias = tf.get_variable(
            "output_bias", [num_labels], initializer=tf.zeros_initializer())

        with tf.variable_scope("loss"):
            if is_training:
                # I.e., 0.1 dropout
                output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

            logits = tf.matmul(output_layer, output_weights, transpose_b=True)
            logits = tf.nn.bias_add(logits, output_bias)
            probabilities = tf.nn.softmax(logits, axis=-1)
            log_probs = tf.nn.log_softmax(logits, axis=-1)

            one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
            per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
            loss = tf.reduce_mean(per_example_loss)

            return (loss, per_example_loss, logits, probabilities)


    def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                         num_train_steps, num_warmup_steps, use_tpu,
                         use_one_hot_embeddings):
        """Returns `model_fn` closure for TPUEstimator."""

        def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
            """The `model_fn` for TPUEstimator."""
            tf.logging.info("*** Features ***")
            for name in sorted(features.keys()):
                print("  name = %s, shape = %s" % (name, features[name].shape))

            input_ids = features["input_ids"]
            input_mask = features["input_mask"]
            segment_ids = features["segment_ids"]
            label_ids = features["label_ids"]
            is_real_example = None
            if "is_real_example" in features:
                is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
            else:
                is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)

            is_training = (mode == tf.estimator.ModeKeys.TRAIN)

            (total_loss, per_example_loss, logits, probabilities) = create_model(
                bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
                num_labels, use_one_hot_embeddings)

            tvars = tf.trainable_variables()
            initialized_variable_names = {}
            scaffold_fn = None
            if init_checkpoint:
                (assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
                if use_tpu:
                    def tpu_scaffold():
                        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
                        return tf.train.Scaffold()

                    scaffold_fn = tpu_scaffold
                else:
                    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

            tf.logging.info("**** Trainable Variables ****")
            for var in tvars:
                init_string = ""
                if var.name in initialized_variable_names:
                    init_string = ", *INIT_FROM_CKPT*"
                tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                                init_string)

            output_spec = None
            if mode == tf.estimator.ModeKeys.TRAIN:
                train_op = optimization.create_optimizer(
                    total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
                output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                    mode=mode,
                    loss=total_loss,
                    train_op=train_op,
                    scaffold_fn=scaffold_fn)
            elif mode == tf.estimator.ModeKeys.EVAL:
                def metric_fn(per_example_loss, label_ids, logits, is_real_example):
                    predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
                    accuracy = tf.metrics.accuracy(
                        labels=label_ids, predictions=predictions, weights=is_real_example)
                    loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
                    return {
                        "eval_accuracy": accuracy,
                        "eval_loss": loss,
                    }

                eval_metrics = (metric_fn,
                                [per_example_loss, label_ids, logits, is_real_example])
                output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                    mode=mode,
                    loss=total_loss,
                    eval_metrics=eval_metrics,
                    scaffold_fn=scaffold_fn)
            else:
                # predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
                # is_predicting = True
                # (predicted_labels, log_probs) = create_model(
                #     is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
                log_probs = tf.nn.log_softmax(logits, axis=-1)
                probabilities = tf.nn.softmax(logits, axis=-1)
                predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
                predictions = {
                    'probabilities': log_probs,
                    'labels': predicted_labels
                }
                output_spec = tf.contrib.tpu.TPUEstimatorSpec(
                    mode=mode,
                    predictions=predictions,  # {"probabilities": probabilities},
                    scaffold_fn=scaffold_fn)
            return output_spec

        return model_fn


    # This function is not used by this file but is still used by the Colab and
    # people who depend on it.
    def input_fn_builder(features, seq_length, is_training, drop_remainder):
        """Creates an `input_fn` closure to be passed to TPUEstimator."""
        all_input_ids = []
        all_input_mask = []
        all_segment_ids = []
        all_label_ids = []

        for feature in features:
            all_input_ids.append(feature.input_ids)
            all_input_mask.append(feature.input_mask)
            all_segment_ids.append(feature.segment_ids)
            all_label_ids.append(feature.label_id)

        def input_fn(params):
            """The actual input function."""
            batch_size = params["batch_size"]
            num_examples = len(features)

            # This is for demo purposes and does NOT scale to large data sets. We do
            # not use Dataset.from_generator() because that uses tf.py_func which is
            # not TPU compatible. The right way to load data is with TFRecordReader.
            d = tf.data.Dataset.from_tensor_slices({
                "input_ids":
                    tf.constant(
                        all_input_ids, shape=[num_examples, seq_length],
                        dtype=tf.int32),
                "input_mask":
                    tf.constant(
                        all_input_mask,
                        shape=[num_examples, seq_length],
                        dtype=tf.int32),
                "segment_ids":
                    tf.constant(
                        all_segment_ids,
                        shape=[num_examples, seq_length],
                        dtype=tf.int32),
                "label_ids":
                    tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
            })

            if is_training:
                d = d.repeat()
                d = d.shuffle(buffer_size=100)

            d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
            return d

        return input_fn


    # This function is not used by this file but is still used by the Colab and
    # people who depend on it.
    def convert_examples_to_features(examples, label_list, max_seq_length,
                                     tokenizer):
        """Convert a set of `InputExample`s to a list of `InputFeatures`."""
        features = []
        for (ex_index, example) in enumerate(examples):
            if ex_index % 1000 == 0:
                tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
            feature = convert_single_example(ex_index, example, label_list,
                                             max_seq_length, tokenizer)
            features.append(feature)
        return features

    class BTrainer(object):
        def __init__(self, train_list, predict_list, lable_list, output_dir, num_train_epochs):
            self.output_dir = output_dir
            tokenization.validate_case_matches_checkpoint(True, init_checkpoint)
            self.bert_config = modeling.BertConfig.from_json_file(bert_config_file)
            tf.gfile.MakeDirs(self.output_dir)
            self.processor = SHLibProcessor(lable_list)
            self.label_list = self.processor.get_labels()
            self.tokenizer = tokenization.FullTokenizer(
                vocab_file=vocab_file, do_lower_case=True)

            tpu_cluster_resolver = None
            is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
            self.run_config = tf.contrib.tpu.RunConfig(
                cluster=tpu_cluster_resolver,
                keep_checkpoint_max=1,
                master=None,
                model_dir=self.output_dir,
                save_checkpoints_steps=save_checkpoints_steps,
                tpu_config=tf.contrib.tpu.TPUConfig(
                    iterations_per_loop=iterations_per_loop,
                    num_shards=num_tpu_cores,
                    per_host_input_for_training=is_per_host))

            num_train_steps = None
            num_warmup_steps = None
            self.train_examples = self.processor.get_train_examples(train_list)
            self.num_train_steps = int(len(self.train_examples) / train_batch_size * num_train_epochs)
            num_warmup_steps = int(self.num_train_steps * warmup_proportion)
            self.predict_examples = self.processor.get_test_examples(predict_list)

            model_fn = model_fn_builder(
                bert_config=self.bert_config,
                num_labels=len(self.label_list),
                init_checkpoint=init_checkpoint,
                learning_rate=learning_rate,
                num_train_steps=self.num_train_steps,
                num_warmup_steps=num_warmup_steps,
                use_tpu=False,
                use_one_hot_embeddings=False)

            # If TPU is not available, this will fall back to normal Estimator on CPU
            # or GPU.
            self.estimator = tf.contrib.tpu.TPUEstimator(
                use_tpu=False,
                model_fn=model_fn,
                config=self.run_config,
                train_batch_size=train_batch_size,
                eval_batch_size=eval_batch_size,
                predict_batch_size=predict_batch_size)

        def do_train(self):
            try:
                train_file = os.path.join(self.output_dir, "train.tf_record")
                file_based_convert_examples_to_features(
                    self.train_examples, self.label_list, max_seq_length, self.tokenizer, train_file)
                print("***** Running training *****")
                train_input_fn = file_based_input_fn_builder(
                    input_file=train_file,
                    seq_length=max_seq_length,
                    is_training=True,
                    drop_remainder=True)
                self.estimator.train(input_fn=train_input_fn, max_steps=self.num_train_steps)
                print("train complete")
            except Exception:
                traceback.print_exc()
                return -4
            return 1

        def do_predict(self):
            num_actual_predict_examples = len(self.predict_examples)
            predict_file = os.path.join(self.output_dir, "predict.tf_record")
            file_based_convert_examples_to_features(self.predict_examples, self.label_list,
                                                    max_seq_length, self.tokenizer,
                                                    predict_file)
            predict_drop_remainder = True
            predict_input_fn = file_based_input_fn_builder(
                input_file=predict_file,
                seq_length=max_seq_length,
                is_training=False,
                drop_remainder=predict_drop_remainder)
            result = self.estimator.predict(input_fn=predict_input_fn)

            acc = 0
            output_predict_file = os.path.join(self.output_dir, "test_results.tsv")
            with tf.gfile.GFile(output_predict_file, "w") as writer:
                num_written_lines = 0
                print("***** Predict results *****")
                correct_count = 0
                for (i, prediction) in enumerate(result):
                    if i >= num_actual_predict_examples:
                        break
                    if self.predict_examples[i].label == self.label_list[prediction['labels']]:
                        correct_count += 1
                    num_written_lines += 1
                    writer.write(str(self.predict_examples[i].guid) + "\t" + str(self.predict_examples[i].text_a)
                                 + "\t" + str(self.predict_examples[i].label)
                                 + "\t" + str(self.label_list[prediction['labels']]) + "\n")
                acc = correct_count / num_written_lines
                print("total count:", num_written_lines, " correct:", correct_count, " accuracy:", acc)
            return acc

Define helper functions for reading TSV files and loading the label list:

    def _read_tsv(input_file, quotechar=None):
        """Reads a tab separated value file."""
        with tf.gfile.Open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines


    def load_labels(label_file_path):
        with open(label_file_path, 'r') as label_file:
            static_label_list = list(label_file.read().splitlines())
        return static_label_list

Prepare the training data paths:

    train_file_path = "./bert_input_data/level1_train.tsv"
    dev_file_path = "./bert_input_data/level1_dev.tsv"
    test_file_path = "./bert_input_data/level1_test.tsv"
    label_file_path = "./bert_input_data/label.txt"
    train_epoch = 1

Define the model training function:

    def f_train_model(train_file_path, dev_file_path, lable_file_path, train_epoch):
        if not os.path.exists(train_file_path) or not os.path.exists(dev_file_path):
            ret_value = 3
            return ret_value
        label_list = load_labels(lable_file_path)
        if len(label_list) <= 1:
            ret_value = 4
            return ret_value
        train = _read_tsv(train_file_path)
        if len(train) <= 20:
            ret_value = 5
            return ret_value
        dev = _read_tsv(dev_file_path)
        print("train length:", len(train))
        print("val length:", len(dev))
        trainer = BTrainer(train, dev, lable_file_path, MODEL_OUTPUT_DIR, train_epoch)
        return trainer.do_train()

    # Train the model
    f_train_model(train_file_path, dev_file_path, label_file_path, train_epoch)

I don't know how to fix this!

6. Applying the model

Define the model evaluation function:

    def f_test_model(train_file_path, test_file_path, lable_file_path):
        if not os.path.exists(train_file_path) or not os.path.exists(test_file_path):
            ret_value = 3
            return ret_value
        label_list = load_labels(lable_file_path)
        if len(label_list) <= 1:
            ret_value = 4
            return ret_value
        train = _read_tsv(train_file_path)
        if len(train) <= 10:
            ret_value = 5
            return ret_value
        test = _read_tsv(test_file_path)
        print("test length:", len(test))
        trainer = BTrainer(train, test, lable_file_path, MODEL_OUTPUT_DIR, 0)
        return trainer.do_predict()

    # Test the model
    f_test_model(train_file_path, test_file_path, label_file_path)

 

This is a real struggle; I'll set it aside for now.
