
Source Code Walkthrough: Fine-tuning

This is the last part of our source-code walkthrough. Once fine-tuning is understood there is no real need to analyze inference separately, since it takes the same form. What matters is understanding how the input format is adapted to different tasks and how the loss is constructed; once you have these two points down, you can follow the same pattern and build tasks of your own.

The official BERT repo provides fine-tuning code for two tasks:

1. run_classifier.py

2. run_squad.py

These are exactly the demos we ran in Bert Series (1): Running the Demo. Below I walk through both scripts.

I. run_classifier.py

1. Parameters

    ## Required parameters
    flags.DEFINE_string(
        "data_dir", None,
        "The input data dir. Should contain the .tsv files (or other data files) "
        "for the task.")

    flags.DEFINE_string(
        "bert_config_file", None,
        "The config json file corresponding to the pre-trained BERT model. "
        "This specifies the model architecture.")

    flags.DEFINE_string("task_name", None, "The name of the task to train.")

    flags.DEFINE_string("vocab_file", None,
                        "The vocabulary file that the BERT model was trained on.")

    flags.DEFINE_string(
        "output_dir", None,
        "The output directory where the model checkpoints will be written.")

    ## Other parameters
    flags.DEFINE_string(
        "init_checkpoint", None,
        "Initial checkpoint (usually from a pre-trained BERT model).")

    flags.DEFINE_bool(
        "do_lower_case", True,
        "Whether to lower case the input text. Should be True for uncased "
        "models and False for cased models.")

    flags.DEFINE_integer(
        "max_seq_length", 128,
        "The maximum total input sequence length after WordPiece tokenization. "
        "Sequences longer than this will be truncated, and sequences shorter "
        "than this will be padded.")

    flags.DEFINE_bool("do_train", False, "Whether to run training.")

    flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")

    flags.DEFINE_bool(
        "do_predict", False,
        "Whether to run the model in inference mode on the test set.")

    flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")

    flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")

    flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")

    flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")

    flags.DEFINE_float("num_train_epochs", 3.0,
                       "Total number of training epochs to perform.")

    flags.DEFINE_float(
        "warmup_proportion", 0.1,
        "Proportion of training to perform linear learning rate warmup for. "
        "E.g., 0.1 = 10% of training.")

    flags.DEFINE_integer("save_checkpoints_steps", 1000,
                         "How often to save the model checkpoint.")

    flags.DEFINE_integer("iterations_per_loop", 1000,
                         "How many steps to make in each estimator call.")

    flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")

    tf.flags.DEFINE_string(
        "tpu_name", None,
        "The Cloud TPU to use for training. This should be either the name "
        "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
        "url.")

    tf.flags.DEFINE_string(
        "tpu_zone", None,
        "[Optional] GCE zone where the Cloud TPU is located in. If not "
        "specified, we will attempt to automatically detect the GCE project from "
        "metadata.")

    tf.flags.DEFINE_string(
        "gcp_project", None,
        "[Optional] Project name for the Cloud TPU-enabled project. If not "
        "specified, we will attempt to automatically detect the GCE project from "
        "metadata.")

    tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")

    flags.DEFINE_integer(
        "num_tpu_cores", 8,
        "Only used if `use_tpu` is True. Total number of TPU cores to use.")

If you have run the demo, most of these flags should already look familiar, and the English help strings explain the rest. Two of them deserve a short note:

max_seq_length: the maximum total sequence length after WordPiece tokenization; it must be no larger than the maximum sequence length the pre-trained model supports. Inputs shorter than max_seq_length are padded with zeros, and inputs longer than it are truncated.

warmup_proportion: the fraction of training used for learning-rate warm-up. For example, with 100 training steps in total, warmup_proportion=0.1 means the first 10 steps are warm-up steps; during warm-up the learning rate is scaled down (lr = global_step / num_warmup_steps * init_lr), and after step 10 the normal (or decayed) learning rate is used. I am not entirely sure of the purpose of this, so if you know, please leave a comment, much appreciated. (A commonly given reason is that warm-up keeps the large, noisy updates at the very start of fine-tuning, while the freshly initialized classification layer and the Adam statistics are still unreliable, from disturbing the pre-trained weights.)
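
To make the schedule concrete, here is a minimal plain-Python sketch of linear warm-up followed by linear decay, which is what optimization.py builds with TensorFlow ops; this is a simplified illustration, not a copy of that file, and the numbers are only for demonstration.

    def lr_at_step(global_step, num_train_steps, num_warmup_steps, init_lr=5e-5):
        """Learning rate used at a given training step (simplified sketch)."""
        if global_step < num_warmup_steps:
            # Warm-up: ramp linearly from 0 up to init_lr.
            return init_lr * float(global_step) / float(num_warmup_steps)
        # Afterwards: decay linearly from init_lr down to 0.
        return init_lr * (1.0 - float(global_step) / float(num_train_steps))

    num_train_steps = 100
    num_warmup_steps = int(num_train_steps * 0.1)   # warmup_proportion = 0.1
    for step in (1, 5, 10, 50, 100):
        print(step, lr_at_step(step, num_train_steps, num_warmup_steps))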

2. Data preprocessing (using MRPC as an example)

    class InputExample(object):
      """A single training/test example for simple sequence classification."""

      def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

This is the data structure for a single raw input example.

guid is a unique ID for the example; text_a and text_b are the sentence pair, and label is the relation between the two sentences (for the test set the label is uniformly set to "0").
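
For instance, a hypothetical MRPC training pair would end up wrapped like this (the sentences are invented for illustration):

    example = InputExample(
        guid="train-1",
        text_a="He said the food was good.",
        text_b="The man praised the food.",
        label="1")   # in MRPC, "1" = paraphrase, "0" = not a paraphrase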

    class InputFeatures(object):
      """A single set of features of data."""

      def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id

This is the data structure of an example after tokenization. input_ids are simply the vocabulary indices of the tokens; input_mask marks which positions hold real tokens (1) and which are padding (0); segment_ids correspond to the model's token_type_ids. These three together form the model input X, and label_id is the label, i.e. Y.

    class MrpcProcessor(DataProcessor):
      """Processor for the MRPC data set (GLUE version)."""

      def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

      def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

      def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

      def get_labels(self):
        """See base class."""
        return ["0", "1"]

      def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
          if i == 0:
            continue
          guid = "%s-%s" % (set_type, i)
          text_a = tokenization.convert_to_unicode(line[3])
          text_b = tokenization.convert_to_unicode(line[4])
          if set_type == "test":
            label = "0"
          else:
            label = tokenization.convert_to_unicode(line[0])
          examples.append(
              InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

This is the data processor for MRPC. Each input line has the format:

label <TAB> sentence1_id <TAB> sentence2_id <TAB> sentence1 <TAB> sentence2

and the output is a list of InputExample objects.
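
To make the column mapping explicit, here is a made-up parsed row and how _create_examples reads it (the ids and sentences are invented for illustration):

    # line = [label, sentence1_id, sentence2_id, sentence1, sentence2]
    line = ["1", "702876", "702977",
            "He said the food was good.", "The man praised the food."]

    # _create_examples picks out:
    #   label  = line[0]   -> "1"
    #   text_a = line[3]   -> "He said the food was good."
    #   text_b = line[4]   -> "The man praised the food."
    # (the header row, i == 0, is skipped)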

    def file_based_convert_examples_to_features(
        examples, label_list, max_seq_length, tokenizer, output_file):
      """Convert a set of `InputExample`s to a TFRecord file."""

      writer = tf.python_io.TFRecordWriter(output_file)

      for (ex_index, example) in enumerate(examples):
        if ex_index % 10000 == 0:
          tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

        feature = convert_single_example(ex_index, example, label_list,
                                         max_seq_length, tokenizer)

        def create_int_feature(values):
          f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
          return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["label_ids"] = create_int_feature([feature.label_id])

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())

This converts the examples into features and writes them to a TFRecord file; the per-example work is done by convert_single_example:

    def convert_single_example(ex_index, example, label_list, max_seq_length,
                               tokenizer):
      """Converts a single `InputExample` into a single `InputFeatures`."""
      label_map = {}
      for (i, label) in enumerate(label_list):
        label_map[label] = i

      tokens_a = tokenizer.tokenize(example.text_a)
      tokens_b = None
      if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)

      if tokens_b:
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
      else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
          tokens_a = tokens_a[0:(max_seq_length - 2)]

      tokens = []
      segment_ids = []
      tokens.append("[CLS]")
      segment_ids.append(0)
      for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
      tokens.append("[SEP]")
      segment_ids.append(0)

      if tokens_b:
        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

      input_ids = tokenizer.convert_tokens_to_ids(tokens)
      input_mask = [1] * len(input_ids)

      # Zero-pad up to the sequence length.
      while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

      assert len(input_ids) == max_seq_length
      assert len(input_mask) == max_seq_length
      assert len(segment_ids) == max_seq_length

      label_id = label_map[example.label]
      if ex_index < 5:
        tf.logging.info("*** Example ***")
        tf.logging.info("guid: %s" % (example.guid))
        tf.logging.info("tokens: %s" % " ".join(
            [tokenization.printable_text(x) for x in tokens]))
        tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
        tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
        tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
        tf.logging.info("label: %s (id = %d)" % (example.label, label_id))

      feature = InputFeatures(
          input_ids=input_ids,
          input_mask=input_mask,
          segment_ids=segment_ids,
          label_id=label_id)
      return feature

This converts a single InputExample into an InputFeatures object:

(1) Build label_map. Since label_list is ["0", "1"], label_map = {"0": 0, "1": 1}.

(2) Tokenize text_a and text_b into tokens_a and tokens_b, and truncate them so that their combined length is at most max_seq_length - 3 (leaving room for [CLS] and two [SEP] tokens); if there is only tokens_a, truncate it to max_seq_length - 2.

(3) Build tokens and segment_ids, zero-pad up to max_seq_length, and build input_mask (a hand-made illustration of the resulting layout follows).
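
As an invented example (not from the repo), take max_seq_length = 10 and the pair "he likes dogs" / "she does too"; tokenizer below stands for the FullTokenizer instance passed into convert_single_example:

    tokens      = ["[CLS]", "he", "likes", "dogs", "[SEP]", "she", "does", "too", "[SEP]"]
    segment_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]               # 0 = first segment (incl. [CLS] and its [SEP]), 1 = second
    input_ids   = tokenizer.convert_tokens_to_ids(tokens)   # 9 ids so far
    input_mask  = [1] * len(input_ids)

    # After zero-padding to max_seq_length = 10:
    #   input_ids   -> [..., 0]                       (one padding id appended)
    #   input_mask  -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    #   segment_ids -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]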

3. Building the model

    def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                     labels, num_labels, use_one_hot_embeddings):
      """Creates a classification model."""
      model = modeling.BertModel(
          config=bert_config,
          is_training=is_training,
          input_ids=input_ids,
          input_mask=input_mask,
          token_type_ids=segment_ids,
          use_one_hot_embeddings=use_one_hot_embeddings)

      output_layer = model.get_pooled_output()

      hidden_size = output_layer.shape[-1].value

      output_weights = tf.get_variable(
          "output_weights", [num_labels, hidden_size],
          initializer=tf.truncated_normal_initializer(stddev=0.02))

      output_bias = tf.get_variable(
          "output_bias", [num_labels], initializer=tf.zeros_initializer())

      with tf.variable_scope("loss"):
        if is_training:
          # I.e., 0.1 dropout
          output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)

        return (loss, per_example_loss, logits, probabilities)

With X and Y both constructed, X goes into the model and the remaining work is to combine the model output with Y to compute a loss.

The model output used here is pooled_output; as discussed earlier, pooled_output is derived from the final-layer hidden state of the first token ([CLS]), passed through an extra dense + tanh layer. A fully connected layer plus softmax is then applied, and the loss is the cross-entropy against the one-hot labels.
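
A side note for readers adapting this to their own tasks: the hand-rolled one-hot × log-softmax above is ordinary softmax cross-entropy, so (as a sketch of an equivalent formulation, not what the repo itself writes) the same loss could be expressed as:

    # Equivalent cross-entropy loss; `labels` are integer class ids, `logits` as above.
    per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    loss = tf.reduce_mean(per_example_loss)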

II. run_squad.py

run_squad fine-tunes on the SQuAD reading comprehension dataset. Apart from the X/Y data conversion and the loss construction, everything else is the same as run_classifier, so below we focus on those two parts.

1. Converting the data into X and Y

    class SquadExample(object):

      def __init__(self,
                   qas_id,
                   question_text,
                   doc_tokens,
                   orig_answer_text=None,
                   start_position=None,
                   end_position=None,
                   is_impossible=False):
        self.qas_id = qas_id
        self.question_text = question_text
        self.doc_tokens = doc_tokens
        self.orig_answer_text = orig_answer_text
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

qas_id is the example ID; question_text is the question text; doc_tokens is the passage in the form [word0, word1, ...]; orig_answer_text is the original answer text; start_position and end_position are where the answer starts and ends in the passage; is_impossible is only used for SQuAD 2.0 and can be ignored here.
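
A small made-up instance (reusing the sentence from the max-context example further below) shows what these fields hold; the qas_id is hypothetical:

    example = SquadExample(
        qas_id="squad-demo-0001",   # hypothetical id
        question_text="What did the man buy?",
        doc_tokens=["the", "man", "went", "to", "the", "store",
                    "and", "bought", "a", "gallon", "of", "milk"],
        orig_answer_text="a gallon of milk",
        start_position=8,    # index of "a" in doc_tokens
        end_position=11)     # index of "milk" in doc_tokens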

    class InputFeatures(object):
      """A single set of features of data."""

      def __init__(self,
                   unique_id,
                   example_index,
                   doc_span_index,
                   tokens,
                   token_to_orig_map,
                   token_is_max_context,
                   input_ids,
                   input_mask,
                   segment_ids,
                   start_position=None,
                   end_position=None,
                   is_impossible=None):
        self.unique_id = unique_id
        self.example_index = example_index
        self.doc_span_index = doc_span_index
        self.tokens = tokens
        self.token_to_orig_map = token_to_orig_map
        self.token_is_max_context = token_is_max_context
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.start_position = start_position
        self.end_position = end_position
        self.is_impossible = is_impossible

unique_id is the unique id of the feature; example_index is the index of the example it came from, used to link each feature back to its example;

doc_span_index is the index of this feature within doc_spans: if a passage is very long it has to be cut into several spans, the spans are collected in doc_spans, each span becomes its own feature, and so each feature records which span it corresponds to;

tokens is the token sequence of this feature; token_to_orig_map maps each token in tokens back to its index in the original doc_tokens;

token_is_max_context is a sequence of booleans indicating, for each token position, whether the current span gives that token its fullest context.

Take the word "bought", for example:

Doc: the man went to the store and bought a gallon of milk
Span A: the man went to the
Span B: to the store and bought
Span C: and bought a gallon of

"bought" appears in both span B and span C, but span C clearly gives it the fullest context, with words on both the left and the right. (A sketch of the scoring rule used to decide this is shown below.)
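
The decision is made per token and per span with a simple scoring rule; the sketch below paraphrases the repo's _check_is_max_context helper, so treat the exact constant as an approximation. The idea: a token "belongs" to the span where min(left context, right context) is largest, with a small bonus for longer spans.

    def check_is_max_context(doc_spans, cur_span_index, position):
        """Does the span at `cur_span_index` give token `position` its fullest context?"""
        best_score, best_span_index = None, None
        for (span_index, doc_span) in enumerate(doc_spans):
            end = doc_span.start + doc_span.length - 1
            if position < doc_span.start or position > end:
                continue
            num_left_context = position - doc_span.start
            num_right_context = end - position
            score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
            if best_score is None or score > best_score:
                best_score, best_span_index = score, span_index
        return cur_span_index == best_span_index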

input_ids are the tokens converted to token ids, which are fed to the model; input_mask, segment_ids and is_impossible need no further explanation;

start_position and end_position are the positions of the answer within this feature's token sequence (note: unlike the fields in SquadExample, these are not positions within the whole context); if the answer does not fall inside the current span, both are set to 0.

The conversion from SquadExample to InputFeatures follows the same pattern as before, so we won't step through it. The one structural difference from run_classifier is the input layout: the classifier uses [CLS] sentence A [SEP] sentence B [SEP], while SQuAD uses [CLS] question [SEP] passage span [SEP].

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]

Together, these three tensors form the model input X, while start_position and end_position form Y. Knowing Y amounts to knowing where the answer sits; you then go back into the context and pull the answer text out, and that is roughly the logic (a simplified sketch of this reverse lookup follows).
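
Here is a much-simplified sketch of that reverse lookup; the real write_predictions logic in run_squad.py also handles n-best lists, a maximum answer length, invalid spans, and WordPiece-to-text cleanup, none of which is shown here:

    import numpy as np

    def extract_answer(example, feature, start_logits, end_logits):
        """Pick the highest-scoring start/end tokens and map them back to words."""
        start_index = int(np.argmax(start_logits))
        end_index = int(np.argmax(end_logits))
        # token_to_orig_map goes from positions in this feature's token sequence
        # back to positions in example.doc_tokens (only defined for passage tokens).
        orig_start = feature.token_to_orig_map[start_index]
        orig_end = feature.token_to_orig_map[end_index]
        return " ".join(example.doc_tokens[orig_start:orig_end + 1])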

2. Building the loss

    def model_fn_builder(bert_config, init_checkpoint, learning_rate,
                         num_train_steps, num_warmup_steps, use_tpu,
                         use_one_hot_embeddings):
      """Returns `model_fn` closure for TPUEstimator."""

      def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
        """The `model_fn` for TPUEstimator."""

        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
          tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

        unique_ids = features["unique_ids"]
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]

        is_training = (mode == tf.estimator.ModeKeys.TRAIN)

        (start_logits, end_logits) = create_model(
            bert_config=bert_config,
            is_training=is_training,
            input_ids=input_ids,
            input_mask=input_mask,
            segment_ids=segment_ids,
            use_one_hot_embeddings=use_one_hot_embeddings)

        tvars = tf.trainable_variables()

        initialized_variable_names = {}
        scaffold_fn = None
        if init_checkpoint:
          (assignment_map, initialized_variable_names
          ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
          if use_tpu:

            def tpu_scaffold():
              tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
              return tf.train.Scaffold()

            scaffold_fn = tpu_scaffold
          else:
            tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
          init_string = ""
          if var.name in initialized_variable_names:
            init_string = ", *INIT_FROM_CKPT*"
          tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                          init_string)

        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
          seq_length = modeling.get_shape_list(input_ids)[1]

          def compute_loss(logits, positions):
            one_hot_positions = tf.one_hot(
                positions, depth=seq_length, dtype=tf.float32)
            log_probs = tf.nn.log_softmax(logits, axis=-1)
            loss = -tf.reduce_mean(
                tf.reduce_sum(one_hot_positions * log_probs, axis=-1))
            return loss

          start_positions = features["start_positions"]
          end_positions = features["end_positions"]

          start_loss = compute_loss(start_logits, start_positions)
          end_loss = compute_loss(end_logits, end_positions)

          total_loss = (start_loss + end_loss) / 2.0

          train_op = optimization.create_optimizer(
              total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

          output_spec = tf.contrib.tpu.TPUEstimatorSpec(
              mode=mode,
              loss=total_loss,
              train_op=train_op,
              scaffold_fn=scaffold_fn)
        elif mode == tf.estimator.ModeKeys.PREDICT:
          predictions = {
              "unique_ids": unique_ids,
              "start_logits": start_logits,
              "end_logits": end_logits,
          }
          output_spec = tf.contrib.tpu.TPUEstimatorSpec(
              mode=mode, predictions=predictions, scaffold_fn=scaffold_fn)
        else:
          raise ValueError(
              "Only TRAIN and PREDICT modes are supported: %s" % (mode))

        return output_spec

      return model_fn

From the code above we can see that the loss is made up of two parts, the prediction of the answer's start_positions and the prediction of its end_positions, and the total loss is the average of the two.

    def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                     use_one_hot_embeddings):
      """Creates a classification model."""
      model = modeling.BertModel(
          config=bert_config,
          is_training=is_training,
          input_ids=input_ids,
          input_mask=input_mask,
          token_type_ids=segment_ids,
          use_one_hot_embeddings=use_one_hot_embeddings)

      final_hidden = model.get_sequence_output()

      final_hidden_shape = modeling.get_shape_list(final_hidden, expected_rank=3)
      batch_size = final_hidden_shape[0]
      seq_length = final_hidden_shape[1]
      hidden_size = final_hidden_shape[2]

      output_weights = tf.get_variable(
          "cls/squad/output_weights", [2, hidden_size],
          initializer=tf.truncated_normal_initializer(stddev=0.02))

      output_bias = tf.get_variable(
          "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

      final_hidden_matrix = tf.reshape(final_hidden,
                                       [batch_size * seq_length, hidden_size])
      logits = tf.matmul(final_hidden_matrix, output_weights, transpose_b=True)
      logits = tf.nn.bias_add(logits, output_bias)

      logits = tf.reshape(logits, [batch_size, seq_length, 2])
      logits = tf.transpose(logits, [2, 0, 1])

      unstacked_logits = tf.unstack(logits, axis=0)

      (start_logits, end_logits) = (unstacked_logits[0], unstacked_logits[1])

      return (start_logits, end_logits)

The model output here comes from sequence_output, i.e. the output of the model's final layer, with shape [batch_size, seq_length, hidden_size]. A fully connected layer then maps each position to two logits, and the result is unstacked into two parts, start_logits and end_logits, corresponding to the two answer positions. (A small shape walkthrough follows.)
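
To check the shapes, here is a tiny NumPy sketch with dummy sizes (batch_size=2, seq_length=4, hidden_size=8; not part of the repo):

    import numpy as np

    batch_size, seq_length, hidden_size = 2, 4, 8
    final_hidden = np.random.randn(batch_size, seq_length, hidden_size)
    output_weights = np.random.randn(2, hidden_size)

    # [batch*seq, hidden] x [hidden, 2] -> [batch*seq, 2]
    logits = final_hidden.reshape(-1, hidden_size) @ output_weights.T
    logits = logits.reshape(batch_size, seq_length, 2)   # [batch, seq, 2]
    logits = np.transpose(logits, (2, 0, 1))             # [2, batch, seq]
    start_logits, end_logits = logits[0], logits[1]      # each [batch, seq]
    print(start_logits.shape, end_logits.shape)          # (2, 4) (2, 4)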

Summary:

That covers the two demos in full. run_squad in particular has many details, especially the example-to-feature conversion, which is fairly involved; for lack of time we don't go through them here, but interested readers are encouraged to dig into that part of the code themselves.

[Figure: schematic of the inputs and outputs for the different tasks]

These two tasks can be viewed alongside the figures in the paper: the sentence-pair classification task corresponds to figure (a), and the reading comprehension task corresponds to figure (c).

This series:
Bert Series (1): Running the Demo
Bert Series (2): Source Code Walkthrough of the Model Body
Bert Series (3): Source Code Walkthrough of Pre-training
Bert Series (5): Chinese Word Segmentation in Practice, F1 97.8% (code included)

References
1. https://github.com/google-research/bert
2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding



Author: 西溪雷神
Link: https://www.jianshu.com/p/116bfdb9119a
Source: 简书 (Jianshu)
Copyright of the original belongs to the author; please contact the author for permission and credit the source before reproducing it in any form.
