
NLP (36) Implementing multi-label text classification with keras-bert

  This article describes how to implement a multi-label text classification task with keras-bert, fine-tuning BERT in the process.

Project structure

  The directory layout of this project is as follows:
(Figure: project structure)
The third-party Python packages it depends on are:

pandas==0.23.4
Keras==2.3.1
keras_bert==0.83.0
numpy==1.16.4

Dataset

  The dataset is the same one used in the earlier article NLP (28) Multi-label text classification: starting from an event-extraction competition dataset, each text is paired with its event types to form a multi-label dataset covering 65 event types. Sample rows (CSV format) look like this:

label,content
司法行为-起诉|组织关系-裁员,最近,一位前便利蜂员工就因公司违规裁员,将便利蜂所在的公司虫极科技(北京)有限公司告上法庭。
组织关系-裁员,思科上海大规模裁员人均可获赔100万官方澄清事实
组织关系-裁员,日本巨头面临危机,已裁员1000多人,苹果也救不了它!
组织关系-裁员|组织关系-解散,在硅谷镀金失败的造车新势力们:蔚来裁员、奇点被偷窃、拜腾解散

In the label column, multiple event types are separated by |.
  In this dataset, the training set contains 11,958 samples and the test set contains 1,498 samples.
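
To make the label encoding concrete, here is a minimal sketch (not part of the project code) of how one |-separated label string maps to the multi-hot vector used later for training; the label list is truncated to three entries for illustration:

labels = ["司法行为-起诉", "组织关系-裁员", "组织关系-解散"]  # the real list has 65 entries
label_str = "组织关系-裁员|组织关系-解散"

# One slot per known label: 1 if the label appears in the string, else 0.
label_id = [1 if label in label_str.split("|") else 0 for label in labels]
print(label_id)  # [0, 1, 1]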

Model training

  The complete training script, model_train.py, is as follows:

# -*- coding: utf-8 -*-
import json
import codecs
import pandas as pd
import numpy as np
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam

# recommended maximum text length is <= 510 (BERT's 512-position limit minus [CLS] and [SEP])
maxlen = 256
BATCH_SIZE = 8
config_path = './chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = './chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'


token_dict = {}
with codecs.open(dict_path, 'r', 'utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)


class OurTokenizer(Tokenizer):
    def _tokenize(self, text):
        R = []
        for c in text:
            if c in self._token_dict:
                R.append(c)
            else:
                R.append('[UNK]')   # any remaining character is mapped to [UNK]
        return R


tokenizer = OurTokenizer(token_dict)


def seq_padding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([
        np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X
    ])


class DataGenerator:

    def __init__(self, data, batch_size=BATCH_SIZE):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))
            np.random.shuffle(idxs)
            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append(y)
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seq_padding(X1)
                    X2 = seq_padding(X2)
                    Y = seq_padding(Y)
                    yield [X1, X2], Y
                    X1, X2, Y = [], [], []


# Build the model
def create_cls_model(num_labels):
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)

    for layer in bert_model.layers:
        layer.trainable = True

    x1_in = Input(shape=(None,))
    x2_in = Input(shape=(None,))

    x = bert_model([x1_in, x2_in])
    cls_layer = Lambda(lambda x: x[:, 0])(x)    # take the vector at the [CLS] position for classification
    p = Dense(num_labels, activation='sigmoid')(cls_layer)     # one sigmoid output per label (multi-label classification)

    model = Model([x1_in, x2_in], p)
    model.compile(
        loss='binary_crossentropy',
        optimizer=Adam(1e-5), # use a sufficiently small learning rate for fine-tuning
        metrics=['accuracy']
    )
    model.summary()

    return model


if __name__ == '__main__':

    # Data processing: read the training and test sets
    print("begin data processing...")
    train_df = pd.read_csv("data/train.csv").fillna(value="")
    test_df = pd.read_csv("data/test.csv").fillna(value="")

    select_labels = train_df["label"].unique()
    labels = []
    for label in select_labels:
        if "|" not in label:
            if label not in labels:
                labels.append(label)
        else:
            for _ in label.split("|"):
                if _ not in labels:
                    labels.append(_)
    with open("label.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))

    train_data = []
    test_data = []
    for i in range(train_df.shape[0]):
        label, content = train_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            for separate_label in label.split("|"):
                if _ == separate_label:
                    label_id[j] = 1
        train_data.append((content, label_id))

    for i in range(test_df.shape[0]):
        label, content = test_df.iloc[i, :]
        label_id = [0] * len(labels)
        for j, _ in enumerate(labels):
            for separate_label in label.split("|"):
                if _ == separate_label:
                    label_id[j] = 1
        test_data.append((content, label_id))

    # print(train_data[:10])
    print("finish data processing!")

    # Model training
    model = create_cls_model(len(labels))
    train_D = DataGenerator(train_data)
    test_D = DataGenerator(test_data)

    print("begin model training...")
    model.fit_generator(
        train_D.__iter__(),
        steps_per_epoch=len(train_D),
        epochs=10,
        validation_data=test_D.__iter__(),
        validation_steps=len(test_D)
    )

    print("finish model training!")

    # Save the model
    model.save('multi-label-ee.h5')
    print("Model saved!")

    result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))
    print("模型评估结果:", result)

  The model structure printed by model.summary() is:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
model_2 (Model)                 (None, None, 768)    101677056   input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 768)          0           model_2[1][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 65)           49985       lambda_1[0][0]                   
==================================================================================================
Total params: 101,727,041
Trainable params: 101,727,041
Non-trainable params: 0
__________________________________________________________________________________________________

  We can see that this structure is largely the same as the multi-class text classification model in NLP (35) Implementing multi-class text classification with keras-bert. The change is in the network attached after BERT: it is still a dense layer, but the activation function is sigmoid and the loss function is binary_crossentropy. In essence, the model performs an independent 0/1 classification for each of the 65 outputs, which is why sigmoid is used. This is the simplest way to turn a multi-class text classification model into a multi-label one, but its drawback is that it does not model dependencies between labels.
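
To spell this out with a minimal sketch (not from the project code): because each output unit is an independent sigmoid, the 65 predicted probabilities do not have to sum to 1, and each one is thresholded at 0.5 on its own to decide whether that label is present, so a sample can receive zero, one, or several labels:

import numpy as np

# Hypothetical sigmoid outputs for one sample over 5 of the 65 labels
probs = np.array([0.91, 0.08, 0.67, 0.02, 0.40])

# Independent 0/1 decision per label; more than one label can fire
pred = np.where(probs > 0.5, 1, 0)
print(pred)  # [1 0 1 0 0]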

Model evaluation

  The complete evaluation script, model_evaluate.py, is as follows:

# -*- coding: utf-8 -*-
# @Time : 2020/12/23 15:28
# @Author : Jclian91
# @File : model_evaluate.py
# @Place : Yangpu, Shanghai
# Model evaluation script. Uses hamming_loss as the multi-label metric; smaller is better.
import json
import numpy as np
import pandas as pd
from keras.models import load_model
from keras_bert import get_custom_objects
from sklearn.metrics import hamming_loss, classification_report

from model_train import token_dict, OurTokenizer

maxlen = 256

# Load the trained model
model = load_model("multi-label-ee.h5", custom_objects=get_custom_objects())
tokenizer = OurTokenizer(token_dict)
with open("label.json", "r", encoding="utf-8") as f:
    label_dict = json.loads(f.read())


# Predict labels for a single sentence
def predict_single_text(text):
    # tokenize with the BERT tokenizer
    text = text[:maxlen]
    x1, x2 = tokenizer.encode(first=text)
    X1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1
    X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2

    # run the model and convert probabilities to a 0/1 vector
    prediction = model.predict([[X1], [X2]])
    one_hot = np.where(prediction > 0.5, 1, 0)[0]
    return one_hot, "|".join([label_dict[str(i)] for i in range(len(one_hot)) if one_hot[i]])


# Evaluate on the test set
def evaluate():
    test_df = pd.read_csv("data/test.csv").fillna(value="")
    true_y_list, pred_y_list = [], []
    true_label_list, pred_label_list = [], []
    common_cnt = 0
    for i in range(test_df.shape[0]):
        print("predict %d samples" % (i+1))
        true_label, content = test_df.iloc[i, :]
        true_y = [0] * len(label_dict.keys())
        for key, value in label_dict.items():
            if value in true_label:
                true_y[int(key)] = 1

        pred_y, pred_label = predict_single_text(content)
        if true_label == pred_label:
            common_cnt += 1
        true_y_list.append(true_y)
        pred_y_list.append(pred_y)
        true_label_list.append(true_label)
        pred_label_list.append(pred_label)

    # F1 scores
    print(classification_report(true_y_list, pred_y_list, digits=4))
    return true_label_list, pred_label_list, hamming_loss(true_y_list, pred_y_list), common_cnt/len(true_y_list)


true_labels, pred_labels, h_loss, accuracy = evaluate()
df = pd.DataFrame({"y_true": true_labels, "y_pred": pred_labels})
df.to_csv("pred_result.csv")

print("accuracy: ", accuracy)
print("hamming loss: ", h_loss)

Hamming Loss is an evaluation metric commonly used for multi-label classification; the smaller its value, the better the model. Running the evaluation script above produces the following output:

   micro avg     0.9341    0.9578    0.9458      1657
   macro avg     0.9336    0.9462    0.9370      1657
weighted avg     0.9367    0.9578    0.9456      1657
 samples avg     0.9520    0.9672    0.9531      1657

accuracy:  0.8985313751668892
hamming loss:  0.001869158878504673
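
As a quick sanity check on what that number means (a sketch, not part of the project code): Hamming Loss is the fraction of individual label positions that are predicted wrongly, averaged over all samples and all 65 labels, which is why it can be tiny even when the exact-match accuracy is only about 0.9:

import numpy as np
from sklearn.metrics import hamming_loss

# Two hypothetical samples, three labels each
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

# 1 wrong position out of 6 label slots -> 1/6
print(hamming_loss(y_true, y_pred))  # 0.1666...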

  Here it is worth comparing this model against the one from the earlier article NLP (28) Multi-label text classification, which used ALBERT to extract feature vectors followed by a Bi-GRU + Attention + FCN network for classification. Its structure is shown below:
(Figure: Bi-GRU+Attention+FCN model structure)
  Evaluating that model with the same procedure produces the following results:

   micro avg     0.9424    0.8292    0.8822      1657
   macro avg     0.8983    0.7218    0.7791      1657
weighted avg     0.9308    0.8292    0.8669      1657
 samples avg     0.8675    0.8496    0.8517      1657
accuracy:  0.7983978638184246
hamming loss:  0.0037691280681934887

The fine-tuned BERT model is about 10 percentage points higher in accuracy, roughly 5-10 points higher across the F1 averages, and has a much smaller Hamming Loss. The fine-tuned BERT model therefore clearly outperforms the earlier one.

Summary

  This project is open source on Github: https://github.com/percent4/keras_bert_multi_label_cls
  December 27, 2020, Pudong, Shanghai

References

  1. NLP (28) Multi-label text classification: https://blog.csdn.net/jclian91/article/details/105386190
  2. NLP (35) Implementing multi-class text classification with keras-bert: https://blog.csdn.net/jclian91/article/details/111742576