This article shows how to use keras-bert to implement multi-label text classification by fine-tuning BERT.
The project structure is as follows:
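(The layout below is a plausible reconstruction, inferred from the file paths referenced in the two scripts in this article.)

keras_bert_multi_label_cls/
├── chinese_L-12_H-768_A-12/    # pre-trained Chinese BERT (config, checkpoint, vocab)
├── data/
│   ├── train.csv
│   └── test.csv
├── label.json                  # label id -> event type mapping, written by model_train.py
├── model_train.py
├── model_evaluate.py
├── multi-label-ee.h5           # saved fine-tuned model
└── pred_result.csv             # predictions written by model_evaluate.py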
The required Python third-party packages are:
pandas==0.23.4
Keras==2.3.1
keras_bert==0.83.0
numpy==1.16.4
The dataset is the same as in the article NLP(二十八)多标签文本分类: based on the data from an event-extraction competition, each text is paired with its event types to form a multi-label dataset with 65 event types in total. Sample data (CSV format):
label,content
司法行为-起诉|组织关系-裁员,最近,一位前便利蜂员工就因公司违规裁员,将便利蜂所在的公司虫极科技(北京)有限公司告上法庭。
组织关系-裁员,思科上海大规模裁员人均可获赔100万官方澄清事实
组织关系-裁员,日本巨头面临危机,已裁员1000多人,苹果也救不了它!
组织关系-裁员|组织关系-解散,在硅谷镀金失败的造车新势力们:蔚来裁员、奇点被偷窃、拜腾解散
In the label column, multiple event types are separated by |.
The dataset contains 11,958 training samples and 1,498 test samples.
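To see how a | separated label string becomes the training target, here is a minimal sketch of the multi-hot encoding, assuming an illustrative three-label subset (the real label list has 65 entries and is built by model_train.py below):

labels = ["司法行为-起诉", "组织关系-裁员", "组织关系-解散"]   # illustrative subset of the 65 event types

def to_multi_hot(label_str, labels):
    # one 0/1 slot per event type; a sample may switch on several slots
    vec = [0] * len(labels)
    for part in label_str.split("|"):
        if part in labels:
            vec[labels.index(part)] = 1
    return vec

print(to_multi_hot("司法行为-起诉|组织关系-裁员", labels))   # [1, 1, 0]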
The complete code of the training script model_train.py is as follows:
# -*- coding: utf-8 -*-
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam

# recommended max length <= 510
maxlen = 256
BATCH_SIZE = 8
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'

token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)


class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')   # map any remaining character to [UNK]
        return R


tokenizer = OurTokenizer(token_dict)


def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])


class DataGenerator:

    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []


# Build the model
def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))
    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)                 # take the vector at [CLS] for classification
    p = Dense(num_labels, activation='sigmoid')(cls_layer)   # one sigmoid per label (multi-label)

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='binary_crossentropy',
        optimizer=Adam(1e-5),   # use a sufficiently small learning rate
        metrics=['accuracy']
    )
    model.summary()

    return model


if __name__ == '__main__':

    # Data processing: read the training and test sets
    print("begin data processing...")
    train_df = pd.read_csv("data/train.csv").fillna(value="")
    test_df = pd.read_csv("data/test.csv").fillna(value="")

    select_labels = train_df["label"].unique()
    labels = []
    for label in select_labels:
        if "|" not in label:
            if label not in labels:
                labels.append(label)
        else:
            for _ in label.split("|"):
                if _ not in labels:
                    labels.append(_)
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            for separate_label in label.split("|"):
                if _ == separate_label:
                    label_id[j] = 1
        train_data.append((content, label_id))

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            for separate_label in label.split("|"):
                if _ == separate_label:
                    label_id[j] = 1
        test_data.append((content, label_id))

    # print(train_data[:10])
    print("finish data processing!")

    # Model training
    model = create_cls_model(len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)

    print("begin model training...")
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=10,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )

    print("finish model training!")

    # Save the model
    model.save('multi-label-ee.h5')
    print("Model saved!")

    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("模型评估结果:", result)
The model structure is as follows:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
model_2 (Model)                 (None, None, 768)    101677056   input_1[0][0]
                                                                 input_2[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 768)          0           model_2[1][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 65)           49985       lambda_1[0][0]
==================================================================================================
Total params: 101,727,041
Trainable params: 101,727,041
Non-trainable params: 0
__________________________________________________________________________________________________
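The 49,985 parameters of dense_1 are just the classification head: 768 × 65 weights plus 65 biases. The remaining 101,677,056 parameters belong to the BERT encoder, and all of them are trainable because every BERT layer is unfrozen for fine-tuning.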
We can see that this model is largely the same as the multi-class model in the article NLP(三十五)使用keras-bert实现文本多分类任务. The only change is the head attached after BERT: it is still a single Dense layer, but the activation is sigmoid instead of softmax and the loss is binary_crossentropy. In essence, the model makes an independent 0/1 decision for each of the 65 outputs, which is why sigmoid is used. This is the simplest way to turn a multi-class text classifier into a multi-label one; its drawback is that it does not model dependencies between labels.
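To make the difference concrete, here is a minimal sketch of the two heads side by side, assuming a hypothetical 768-dimensional cls_vector input standing in for the [CLS] vector produced by BERT:

from keras.layers import Input, Dense
from keras.models import Model

cls_vector = Input(shape=(768,))   # placeholder for the [CLS] vector

# Multi-class head (one label per text): 65 probabilities that sum to 1.
multi_class_out = Dense(65, activation='softmax')(cls_vector)

# Multi-label head (this article): 65 independent sigmoids, so several labels can fire at once.
multi_label_out = Dense(65, activation='sigmoid')(cls_vector)

multi_label_model = Model(cls_vector, multi_label_out)
multi_label_model.compile(loss='binary_crossentropy', optimizer='adam')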
The complete code of the evaluation script model_evaluate.py is as follows:
# -*- coding: utf-8 -*-
# @Time : 2020/12/23 15:28
# @Author : Jclian91
# @File : model_evaluate.py
# @Place : Yangpu, Shanghai
# Model evaluation script: uses hamming_loss as the multi-label metric; the smaller the value, the better the model.
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import hamming_loss, classification_report

from model_train import token_dict, OurTokenizer

maxlen = 256

# Load the trained model
model = load_model("multi-label-ee.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)

with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())


# Predict a single sentence
def predict_single_text(text):
    # Tokenize with BERT
    text = text[:maxlen]
    x1, x2 = tokenizer.encode(first=text)
    X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2

    # Run the model and decode the prediction
    prediction = model.predict([[X1], [X2]])
    one_hot = np.where(prediction > 0.5, 1, 0)[0]
    return one_hot, "|".join([label_dict[str(i)] for i in range(len(one_hot)) if one_hot[i]])


# Evaluate the model on the test set
def evaluate():
    test_df = pd.read_csv("data/test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    true_label_list, pred_label_list = [], []
    common_cnt = 0
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i+1))
        true_label, content = test_df.iloc[i, :]
        true_y = [0] * len(label_dict.keys())
        for key, value in label_dict.items():
            if value in true_label:
                true_y[int(key)] = 1

        pred_y, pred_label = predict_single_text(content)
        if true_label == pred_label:
            common_cnt += 1
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)
        true_label_list.append(true_label)
        pred_label_list.append(pred_label)

    # F1 scores
    print(classification_report(true_y_list, pred_y_list, digits=4))
    return true_label_list, pred_label_list, hamming_loss(true_y_list, pred_y_list), common_cnt/len(true_y_list)


true_labels, pred_labels, h_loss, accuracy = evaluate()
df = pd.DataFrame({"y_true": true_labels, "y_pred": pred_labels})
df.to_csv("pred_result.csv")
print("accuracy: ", accuracy)
print("hamming loss: ", h_loss)
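predict_single_text can also be called on its own for ad-hoc prediction; for example, appending something like the following to the end of model_evaluate.py (the input sentence is one of the sample rows shown earlier):

one_hot, pred_label = predict_single_text("日本巨头面临危机,已裁员1000多人,苹果也救不了它!")
print(pred_label)   # expected to print something like 组织关系-裁员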
Hamming Loss is an evaluation metric specific to multi-label classification; the smaller its value, the better the model. Running the evaluation script produces the following output:
              precision    recall  f1-score   support

   micro avg     0.9341    0.9578    0.9458      1657
   macro avg     0.9336    0.9462    0.9370      1657
weighted avg     0.9367    0.9578    0.9456      1657
 samples avg     0.9520    0.9672    0.9531      1657
accuracy: 0.8985313751668892
hamming loss: 0.001869158878504673
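As a quick illustration of the metric, Hamming Loss is the fraction of individual label slots that are predicted incorrectly, averaged over all samples and all 65 labels; a toy example with three labels:

from sklearn.metrics import hamming_loss

y_true = [[1, 0, 1],
          [0, 1, 0]]
y_pred = [[1, 0, 0],   # one slot wrong
          [0, 1, 0]]   # all three correct

print(hamming_loss(y_true, y_pred))   # 1 wrong slot out of 6 -> 0.1666...

A Hamming Loss of 0.00187 therefore means that, on average, fewer than 2 out of every 1,000 individual label decisions on the test set are wrong.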
Here I would like to compare this with the model from the earlier article NLP(二十八)多标签文本分类, which used ALBERT to extract feature vectors followed by a Bi-GRU + Attention + FCN network for classification.
Evaluating that model in the same way gives the following results:
              precision    recall  f1-score   support

   micro avg     0.9424    0.8292    0.8822      1657
   macro avg     0.8983    0.7218    0.7791      1657
weighted avg     0.9308    0.8292    0.8669      1657
 samples avg     0.8675    0.8496    0.8517      1657
accuracy: 0.7983978638184246
hamming loss: 0.0037691280681934887
We can see that the fine-tuned BERT model is about 10 percentage points higher in accuracy, its F1 scores are about 5-10 points higher, and its Hamming Loss is much smaller. The fine-tuned BERT model therefore clearly outperforms the earlier one.
This project is open source; the GitHub repository is https://github.com/percent4/keras_bert_multi_label_cls .
Pudong, Shanghai, December 27, 2020