BERT 简介
BERT是2018年google 提出来的预训练的语言模型,并且它打破很多NLP领域的任务记录,其提出在nlp的领域具有重要意义。预训练的(pre-train)的语言模型通过无监督的学习掌握了很多自然语言的一些语法或者语义知识,之后在做下游的nlp任务时就会显得比较容易。BERT在做下游的有监督nlp任务时就像一个做了充足预习的学生去上课,那效果肯定事半功倍。之前的word2vec,glove等Word Embedding技术也是通过无监督的训练让模型预先掌握了一些基础的语言知识,但是word embeding技术无论从预训练的模型复杂度(可以理解成学习的能力),以及无监督学习的任务难度都无法和BERT相比。
模型部分
首先BERT模型采用的是12层或者24层的双向的Transformer的Encoder作为特征提取器,如下图所示。要知道在nlp领域,特征提取能力方面的排序大致是Transformer>RNN>CNN。对Transformer不了解的同学可以看看笔者之前的这篇文章,而且一用就是12层,将nlp真正的往深度的方向推进了一大步。
预训练任务方面
BERT 为了让模型能够比较好的掌握自然语言方面的知识,提出了下面两种预训练的任务:
1.遮盖词的预测任务(mask word prediction),如下图所示:
将输入文本中15%的token随机遮盖,然后输入给模型,最终希望模型能够输出遮盖的词是什么,这就是让模型在做完形填空啊,而且还是选项的可能是所有词的完形填空,想想我们经常考试时做完形填空,给你四个选项你都不一定能够做对,这个任务可以让模型学到很多语法,甚至语义方面的知识。
2.下一个句子预测任务(next sentence prediction)
下一个句子预测任务如下图所示:给模型输入A,B两个句子,让模型判断B句子是否是A句子的下一句。这个任务是希望模型能够学到句子间的关系,更近一步的加强模型对自然语言的理解。
这两个颇具难度的预训练任务,让模型在预训练阶段就对自然语言有了比较深入的学习和认知,而这些知识对下游的nlp任务有着巨大的帮助。当然,想要模型通过预训练掌握知识,我们需要花费大量的语料,大量的计算资源和大量的时间。但是训练一遍就可以一直使用,这种一劳永逸的工作,依然很值得去做一做。
BERT的NER实战
这里笔者先介绍一下kashgari这个框架,此框架的github链接在这,封装这个框架的作者希望大家能够很方便的调用一些NLP领域高大上的技术,快速的进行一些实验。kashgari封装了BERT embedingg模型,LSTM-CRF实体识别模型,还有一些经典的文本分类的网络模型。这里笔者就是利用这个框架五分钟在自己的数据集上完成了基于BERT的NER实战。
数据读入
with open("train_data","rb") as f: data = f.read().decode("utf-8") train_data = data.split("\n\n") train_data = [token.split("\n") for token in train_data] train_data = [[j.split() for j in i ] for i in train_data] train_data.pop()
数据预处理
- train_x = [[token[0] for token in sen] for sen in train_data]
- train_y = [[token[1] for token in sen] for sen in train_data]
这里 train_x和 train_y都是一个list,
train_x: [[char_seq1],[char_seq2],[char_seq3],..... ]
train_y:[[label_seq1],[label_seq2],[label_seq3],..... ]
其中 char_seq1:["我","爱","荆","州"]
对应的的label_seq1:["O","O","B_LOC","I_LOC"]
数据预处理成一个字对应一个label就可以了,是不是很方便。kashgari已经封装了数据数值化,向量化的模块了,所以你已经不用操心文本和label的数值化问题了。这里要强调一下,由于google开源的BERT中文预训练模型采用的是字符级别的输入,所以数据预处理部分只能将文本处理成字符。
载入BERT
只需通过下面三行就可以轻易的加载BERT模型。
- from kashgari.embeddings import BERTEmbedding
- from kashgari.tasks.seq_labeling import BLSTMCRFModel embedding = BERTEmbedding("bert-base-chinese", 200)
运行后,代码会自动到BERT模型储存的地方下载预训练模型的参数,这里谷歌已经帮我们预训练好BERT模型了,所以我们只需要用它做下游任务即可。
搭建模型并训练
使用kashgari封装的LSTM+CRF模型,将数据喂给模型,同时设置好batch-size,就可以训练起来了。感觉整个过程是不是不需要五分钟(当然如果网速慢,下载预训练的BERT模型可能就超过五分钟了)。
- model = BLSTMCRFModel(embedding)
- model.fit(train_x,train_y,epochs=1,batch_size=100)
- __________________________________________________________________________________________________
- Layer (type) Output Shape Param # Connected to
- ==================================================================================================
- Input-Token (InputLayer) (None, 200) 0 __________________________________________________________________________________________________ Input-Segment (InputLayer) (None, 200) 0 __________________________________________________________________________________________________ Embedding-Token (TokenEmbedding [(None, 200, 768), ( 16226304 Input-Token[0][0] __________________________________________________________________________________________________ Embedding-Segment (Embedding) (None, 200, 768) 1536 Input-Segment[0][0] __________________________________________________________________________________________________ Embedding-Token-Segment (Add) (None, 200, 768) 0 Embedding-Token[0][0] Embedding-Segment[0][0] __________________________________________________________________________________________________ Embedding-Position (PositionEmb (None, 200, 768) 153600 Embedding-Token-Segment[0][0] __________________________________________________________________________________________________ Embedding-Dropout (Dropout) (None, 200, 768) 0 Embedding-Position[0][0] __________________________________________________________________________________________________ Embedding-Norm (LayerNormalizat (None, 200, 768) 1536 Embedding-Dropout[0][0] __________________________________________________________________________________________________ Encoder-1-MultiHeadSelfAttentio (None, 200, 768) 2362368 Embedding-Norm[0][0] __________________________________________________________________________________________________ Encoder-1-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-1-MultiHeadSelfAttention[ __________________________________________________________________________________________________ Encoder-1-MultiHeadSelfAttentio (None, 200, 768) 0 Embedding-Norm[0][0] Encoder-1-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-1-MultiHeadSelfAttentio (None, 200, 768) 1536 Encoder-1-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-1-FeedForward (FeedForw (None, 200, 768) 4722432 Encoder-1-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-1-FeedForward-Dropout ( (None, 200, 768) 0 Encoder-1-FeedForward[0][0] __________________________________________________________________________________________________ Encoder-1-FeedForward-Add (Add) (None, 200, 768) 0 Encoder-1-MultiHeadSelfAttention- Encoder-1-FeedForward-Dropout[0][ __________________________________________________________________________________________________ Encoder-1-FeedForward-Norm (Lay (None, 200, 768) 1536 Encoder-1-FeedForward-Add[0][0] __________________________________________________________________________________________________ Encoder-2-MultiHeadSelfAttentio (None, 200, 768) 2362368 Encoder-1-FeedForward-Norm[0][0] __________________________________________________________________________________________________ Encoder-2-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-2-MultiHeadSelfAttention[ __________________________________________________________________________________________________ Encoder-2-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-1-FeedForward-Norm[0][0] Encoder-2-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-2-MultiHeadSelfAttentio (None, 200, 768) 1536 Encoder-2-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-2-FeedForward (FeedForw (None, 200, 768) 4722432 Encoder-2-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-2-FeedForward-Dropout ( (None, 200, 768) 0 Encoder-2-FeedForward[0][0] __________________________________________________________________________________________________ Encoder-2-FeedForward-Add (Add) (None, 200, 768) 0 Encoder-2-MultiHeadSelfAttention- Encoder-2-FeedForward-Dropout[0][ __________________________________________________________________________________________________ Encoder-2-FeedForward-Norm (Lay (None, 200, 768) 1536 Encoder-2-FeedForward-Add[0][0] __________________________________________________________________________________________________ Encoder-3-MultiHeadSelfAttentio (None, 200, 768) 2362368 Encoder-2-FeedForward-Norm[0][0] __________________________________________________________________________________________________ Encoder-3-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-3-MultiHeadSelfAttention[ __________________________________________________________________________________________________ Encoder-3-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-2-FeedForward-Norm[0][0] Encoder-3-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-3-MultiHeadSelfAttentio (None, 200, 768) 1536 Encoder-3-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-3-FeedForward (FeedForw (None, 200, 768) 4722432 Encoder-3-MultiHeadSelfAttention- __________________________________________________________________________________________________ Encoder-3-FeedForward-Dropout ( (None, 200, 768) 0 Encoder-3-FeedForward[0][0] __________________________________________________________________________________________________ Encoder-3-FeedForward-Add (Add) (None, 200, 768) 0 Encoder-3-MultiHeadSelfAttention- Encoder-3-FeedForward-Dropout[0][ __________________________________________________________________________________________________ Encoder-3-FeedForward-Norm (Lay (None, 200, 768) 1536 Encoder-3-FeedForward-Add[0][0] __________________________________________________________________________________________________ Encoder-4-MultiHeadSelfAttentio (None, 200, 768) 2362368 Encoder-3-FeedForward-Norm[0][0] __________________________________________________________________________________________________ Encoder-4-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-4-MultiHeadSelfAttention[ __________________________________________________________________________________________________ Encoder-4-MultiHeadSelfAttentio (None, 200, 768) 0 Encoder-3-FeedForward-Norm[0][0] Encoder