Named Entity Recognition on the CCKS2019 Yidu Cloud 4k EMR Dataset
Table of Contents
Named Entity Recognition on the CCKS2019 Yidu Cloud 4k EMR Dataset
Dataset
Project Structure
Requirements
Steps
Model
Upstream
Downstream
Config
Train
Strategy
Logs
Evaluate
Strategy
Evaluating a single model
Performance
Test set performance
Best validation F1
Performance on the 379 official test samples
Per-category F1 results on the 379 official test samples
Predict
Model sizes
Small: two RTX 3090s (24 GB). First trained with unsupervised MLM for 1,000,000 steps (maxlen 512), then with supervised multi-task learning for 750,000 steps (maxlen between 64 and 512, depending on the task); batch_size 512, LAMB optimizer.
Base: four RTX 3090s (24 GB). First trained with unsupervised MLM for 1,000,000 steps (maxlen 512), then with supervised multi-task learning for 750,000 steps (maxlen between 64 and 512, depending on the task); batch_size 512, LAMB optimizer.
Large: two A100s (80 GB). First trained with unsupervised MLM for 1,000,000 steps (maxlen 512), then with supervised multi-task learning for 500,000 steps (maxlen between 64 and 512, depending on the task); batch_size 512, LAMB optimizer.
Config
maxlen maximum sentence length within a training batch; shorter sentences are padded, longer ones truncated
epochs maximum number of training epochs
batch_size batch size
bert_layers number of BERT layers; small ≤ 4, base ≤ 12
crf_lr_multiplier learning-rate multiplier for the CRF layer; increase it when necessary
model_type model type, 'roformer_v2'
dropout_rate dropout rate
max_lr maximum learning rate; the more bert_layers, the smaller it should be (5e-5~1e-4 recommended for small, 1e-5~5e-5 for base)
lstm_hidden_units number of LSTM hidden units
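Put together, a minimal config.py might look like the sketch below. The concrete values are illustrative defaults chosen to match the constraints above, not the project's actual settings.

```python
# config.py — illustrative sketch; tune these for your own runs.
maxlen = 256                # max tokens per sentence within a batch
epochs = 50                 # upper bound on training epochs
batch_size = 32
bert_layers = 12            # <= 4 for small, <= 12 for base
crf_lr_multiplier = 1000    # enlarged learning rate for the CRF layer
model_type = 'roformer_v2'
dropout_rate = 0.3
max_lr = 1e-5               # smaller as bert_layers grows
lstm_hidden_units = 128
```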
ATTENTION: sentences do not all need to be padded to one global length; only the samples within a batch must share the same length. If the longest sentence in a batch is ≤ maxlen, the batch is padded/truncated to that longest length; if it is > maxlen, the batch is padded/truncated to the maxlen set in config.py.
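The batching rule above can be sketched with a hypothetical `pad_batch` helper (this is not the project's `data_generator`, just an illustration of the rule):

```python
def pad_batch(batch, maxlen, pad_id=0):
    """Pad every sequence in the batch to min(longest sequence, maxlen),
    truncating anything longer than maxlen."""
    target = min(max(len(seq) for seq in batch), maxlen)
    return [seq[:target] + [pad_id] * (target - len(seq)) for seq in batch]

short_batch = pad_batch([[1, 2], [3, 4, 5]], maxlen=8)  # padded to length 3, not 8
long_batch = pad_batch([[1] * 12, [2, 3]], maxlen=8)    # capped at maxlen 8
```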
Train
Strategy
Split strategy
The 1,000 training samples are shuffled and split 8:2 into a training set and a validation set.
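A minimal sketch of this shuffle-and-split step (the `split_train_val` helper and fixed seed are illustrative; the project's own loader may differ):

```python
import random

def split_train_val(samples, ratio=0.8, seed=42):
    """Shuffle the samples with a fixed seed, then split at the given ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * ratio)
    return samples[:cut], samples[cut:]

train, val = split_train_val(range(1000))  # len(train) == 800, len(val) == 200
```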
Optimization strategy
EMA (exponential moving average) is used together with Adam as the optimization strategy. A moving average estimates the local mean of a variable, so each update also depends on the variable's history over a recent window; evaluating with the averaged parameters improves the model's robustness on test data. For every trainable variable, EMA maintains a shadow variable, initialized to that variable's initial value.
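The shadow-variable mechanism can be sketched as follows (a toy `EMA` class over plain floats, not the project's optimizer wrapper): each tracked variable gets a shadow copy, initialized to the variable's initial value and updated as shadow = decay * shadow + (1 - decay) * value.

```python
class EMA:
    """Toy exponential moving average over a dict of named float variables."""

    def __init__(self, values, decay=0.999):
        self.decay = decay
        self.shadow = dict(values)  # shadow variables start at the initial values

    def update(self, values):
        for name, value in values.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * value

weights = {'w': 0.0}
ema = EMA(weights, decay=0.9)
ema.update({'w': 1.0})  # shadow['w'] ≈ 0.1, lagging behind the raw value 1.0
```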
Because BERT comes with pretrained weights, fine-tuning them requires only a small learning rate, while the LSTM and Dense layers are freshly initialized with he_normal and need a larger one; the model therefore uses layer-wise learning rates.
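One way to sketch layer-wise learning rates is per-group multipliers on a base rate; the multiplier values below are illustrative assumptions, not the project's actual settings (the source only fixes the idea of a small rate for BERT, a larger one elsewhere, and an enlarged CRF rate via crf_lr_multiplier).

```python
def layered_lr(param_group, base_lr):
    """Return the effective learning rate for a parameter group:
    pretrained BERT weights keep the small base rate, freshly initialized
    LSTM/Dense layers and the CRF get enlarged rates (illustrative values)."""
    multipliers = {'bert': 1.0, 'lstm': 100.0, 'dense': 100.0, 'crf': 1000.0}
    return base_lr * multipliers[param_group]

layered_lr('bert', 1e-5)  # small rate for pretrained weights
layered_lr('crf', 1e-5)   # enlarged rate for the CRF layer
```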
Perturbations are injected into the Embedding layer (adversarial training) to make the model more robust.
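An FGM-style embedding perturbation, as used by adversarial_training(model, 'Embedding-Token', 0.5) in train.py, moves the embedding along the normalized gradient, r = epsilon * g / ||g||. A minimal sketch over plain lists (the real implementation operates on the embedding tensor inside the training step):

```python
import math

def fgm_perturb(embedding, grad, epsilon=0.5):
    """Add the FGM perturbation r = epsilon * g / ||g|| to an embedding vector."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return list(embedding)  # zero gradient: nothing to perturb
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

fgm_perturb([1.0, 2.0], [3.0, 4.0], epsilon=0.5)  # ≈ [1.3, 2.4]
```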
Stopping strategy
A callback computes the entity-level F1 on the validation set after each epoch and monitors it; training stops after 5 epochs without improvement.
```python
# -*- coding:utf-8 -*-
import os
import pickle

from config import batch_size, maxlen, epochs
from evaluate import get_score
from path import BASE_CONFIG_NAME, BASE_CKPT_NAME, BASE_MODEL_DIR, train_file_path, test_file_path, val_file_path, \
    weights_path, event_type, MODEL_TYPE, label_dict_path
from plot import train_plot, f1_plot

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from model import BERT
from preprocess import load_data, data_generator, NamedEntityRecognizer
from utils.backend import keras, K
from utils.adversarial import adversarial_training
from utils.tokenizers import Tokenizer

# BERT configuration
config_path = BASE_CONFIG_NAME
checkpoint_path = BASE_CKPT_NAME
dict_path = '{}/vocab.txt'.format(BASE_MODEL_DIR)

# Labeled data
categories = set()
train_data = load_data(train_file_path, categories)
val_data = load_data(val_file_path, categories)
categories = list(sorted(categories))
with open(label_dict_path, 'wb') as f:
    pickle.dump(categories, f)

# Build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)

bert = BERT(config_path, checkpoint_path, categories)
model = bert.get_model()
optimizer = bert.get_optimizer()
CRF = bert.get_CRF()
NER = NamedEntityRecognizer(tokenizer, model, categories, trans=K.eval(CRF.trans), starts=[0], ends=[0])

# Inject perturbations into the token embeddings (adversarial training)
adversarial_training(model, 'Embedding-Token', 0.5)

f1_list = []
recall_list = []
precision_list = []
count_model_did_not_improve = 0


class Evaluator(keras.callbacks.Callback):
    """Evaluate and save."""

    def __init__(self, patience=5):
        super().__init__()
        self.best_val_f1 = 0
        self.patience = patience

    def on_epoch_end(self, epoch, logs=None):
        global count_model_did_not_improve
        save_file_path = ("{}/{}_{}_base".format(weights_path, event_type, MODEL_TYPE)) + ".h5"
        trans = K.eval(CRF.trans)
        NER.trans = trans
        optimizer.apply_ema_weights()
        f1, precision, recall = get_score(val_data, NER)
        f1_list.append(f1)
        recall_list.append(recall)
        precision_list.append(precision)
        # Save the best weights
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            model.save_weights(save_file_path)
            pickle.dump(K.eval(CRF.trans),
                        open(("{}/{}_{}_crf_trans.pkl".format(weights_path, event_type, MODEL_TYPE)), 'wb'))
            count_model_did_not_improve = 0
        else:
            count_model_did_not_improve += 1
            print("Early stop count " + str(count_model_did_not_improve) + "/" + str(self.patience))
            if count_model_did_not_improve >= self.patience:
                self.model.stop_training = True
                print("Epoch %05d: early stopping THR" % epoch)
        optimizer.reset_old_weights()
        print(
            'valid: f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )


train_generator = data_generator(train_data, batch_size, tokenizer, categories, maxlen)
valid_generator = data_generator(val_data, batch_size, tokenizer, categories, maxlen)
# test_generator = data_generator(test_data, batch_size, tokenizer, categories, maxlen)

for i, item in enumerate(train_generator):
    print("\nbatch_token_ids shape:", item[0][0].shape)
    print("batch_segment_ids shape:", item[0][1].shape)
    print("batch_labels shape:", item[1].shape)
    if i == 4:
        break
# batch_token_ids: (32, maxlen) or (32, n), n <= maxlen
# batch_segment_ids: (32, maxlen) or (32, n), n <= maxlen
# batch_labels: (32, maxlen) or (32, n), n <= maxlen

evaluator = Evaluator(patience=5)

print('\n\t\tTrain start!\t\t\n')

history = model.fit(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=epochs,
    verbose=1,
    callbacks=[evaluator]
)

print('\n\tTrain end!\t\n')

train_plot(history.history, history.epoch)
data = {
    'epoch': range(1, len(f1_list) + 1),
    'f1': f1_list,
    'recall': recall_list,
    'precision': precision_list
}
f1_plot(data)
```