The evolution of NER: from handwritten rules to feature templates, and from machine learning methods to deep learning methods, a tour of how the technology has advanced in NLP.
Named Entity Recognition (NER), also known as "proper name recognition", is the task of identifying entities with specific meaning in text, mainly person names, place names, organization names and other proper nouns.
NER is a fundamental building block for information extraction, question answering, syntactic parsing, machine translation and other applications, and plays an important role in moving NLP technology toward practical use.
Generally speaking, the NER task is to recognize three major categories of named entities (entities, temporal expressions and numeric expressions) and seven subcategories (person names, organization names, place names, times, dates, currencies and percentages) in the text to be processed.
Rule-based method:
Handwritten rules are matched against the text to identify named entities.
For example:
Rule 1: 《surname》 [\u4e00-\u9fa5]{1,4} "老师|同学"
Rule 2: 《place name》 "大学"
The rules above are written with parts of speech, literal strings, regular expressions and so on.
Rule 1 recognizes person names that start with a surname and end with "老师" (teacher) or "同学" (classmate), such as "张晓明老师" or "王小二同学".
Rule 2 recognizes university names such as 北京大学, 深圳大学, 南昌大学 and 武汉大学.
In general, rules can be built from parts of speech, words, literal strings, regular expressions, syntactic information and more; a small sketch of this kind of matching follows.
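As an illustration, a minimal sketch of rule matching with Python regular expressions might look like the following; the surname and place-name lists are toy assumptions for illustration, not part of any real rule base:
```
import re

# Toy lexicons (assumptions for illustration only)
SURNAMES = "张|王|李|赵|刘"
PLACES = "北京|深圳|南昌|武汉"

# Rule 1: surname + 1-4 Chinese characters + 老师/同学  ->  person name
person_rule = re.compile(r"(?:{})[\u4e00-\u9fa5]{{1,4}}(?:老师|同学)".format(SURNAMES))
# Rule 2: place name + 大学  ->  organization name
org_rule = re.compile(r"(?:{})大学".format(PLACES))

text = "张晓明老师毕业于北京大学,现在在深圳大学教书。"
print(person_rule.findall(text))   # ['张晓明老师']
print(org_rule.findall(text))      # ['北京大学', '深圳大学']
```
A real rule base would also draw on part-of-speech tags and syntactic cues, which is exactly where the maintenance burden listed below comes from.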
Advantages:
1. Rules can be built quickly
2. No specially annotated data is needed
3. Easy to instantiate and easy to understand
Disadvantages:
1. Writing rules requires a great deal of linguistic knowledge
2. Conflicts between rules are hard to resolve
3. Building rules is time-consuming and labor-intensive, and the rules port poorly to new domains
HMM model:
For doing NER with a Hidden Markov Model (HMM), see the following post for a concrete approach:
https://blog.csdn.net/omnispace/article/details/89953185
Advantages:
1. No handwritten rules are needed, so the method generalizes better
Disadvantages:
1. HMM relies on two assumptions: the observations are strictly independent of each other, and the current state depends only on the previous state. It therefore cannot make full use of contextual information (a toy sketch of the model follows this list).
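To make the two assumptions concrete, here is a toy character-level HMM tagger (an illustrative sketch only, not the implementation from the linked post): parameters are estimated by simple counting, and decoding uses the Viterbi algorithm. Note how the emission probability of each character depends only on its own tag, and each tag only on the previous one.
```
from collections import defaultdict

def train_hmm(sentences, tag_lists):
    """Estimate initial, transition and emission probabilities by counting."""
    init, trans, emit = defaultdict(float), defaultdict(float), defaultdict(float)
    tag_count = defaultdict(float)
    for chars, tags in zip(sentences, tag_lists):
        init[tags[0]] += 1
        for i, (ch, tag) in enumerate(zip(chars, tags)):
            tag_count[tag] += 1
            emit[(tag, ch)] += 1
            if i > 0:
                trans[(tags[i - 1], tag)] += 1
    all_tags = list(tag_count)
    init = {t: init[t] / len(sentences) for t in all_tags}
    trans = {(a, b): trans[(a, b)] / tag_count[a] for a in all_tags for b in all_tags}
    emit = {k: v / tag_count[k[0]] for k, v in emit.items()}
    return all_tags, init, trans, emit

def viterbi(chars, all_tags, init, trans, emit, eps=1e-8):
    """Decode the most likely tag sequence for one sentence."""
    V = [{t: init.get(t, eps) * emit.get((t, chars[0]), eps) for t in all_tags}]
    back = [{}]
    for ch in chars[1:]:
        V.append({})
        back.append({})
        for t in all_tags:
            prev, score = max(((p, V[-2][p] * trans.get((p, t), eps)) for p in all_tags),
                              key=lambda x: x[1])
            V[-1][t] = score * emit.get((t, ch), eps)
            back[-1][t] = prev
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for b in reversed(back[1:]):
        path.append(b[path[-1]])
    return list(reversed(path))
```
A production implementation would add smoothing for unseen characters and work in log space to avoid underflow.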
CRF model:
Advantages:
1. Drops the two unrealistic assumptions of the HMM and makes better use of contextual features
2. Solves the label bias problem of the MEMM (Maximum Entropy Markov Model)
Disadvantages:
1. The model is more complex
2. It makes better use of contextual features, but still not deeply enough
Hands-on practice
Data annotation
The data format is shown below: each line consists of one character and its tag, the tag set is BIOES, and sentences are separated by a blank line.
```
白 B-ORG
宫 E-ORG
宣 O
布 O
特 B-PER
朗 I-PER
普 E-PER
与 O
普 B-PER
京 E-PER
将 O
在 O
G B-ORG
2 I-ORG
0 E-ORG
峰 O
会 O
期 O
间 O
举 O
行 O
会 O
晤 O
```
Notes:
PER: person name
ORG: organization name
For more details on NER annotation schemes, see:
https://max.book118.com/html/2018/0210/152660121.shtm
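To make the BIOES scheme concrete, the hypothetical helper below (not part of the project code that follows) converts a tagged character sequence back into entity spans: B/I/E mark the beginning, inside and end of a multi-character entity, S marks a single-character entity, and O marks everything else.
```
def bioes_to_entities(chars, tags):
    """Collect (entity_text, entity_type) pairs from a BIOES-tagged sequence."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag == 'O':
            buf, etype = [], None
        elif tag.startswith('S-'):                    # single-character entity
            entities.append((ch, tag[2:]))
        elif tag.startswith('B-'):                    # entity begins
            buf, etype = [ch], tag[2:]
        elif tag.startswith('I-') and etype == tag[2:]:
            buf.append(ch)                            # entity continues
        elif tag.startswith('E-') and etype == tag[2:]:
            buf.append(ch)                            # entity ends
            entities.append((''.join(buf), etype))
            buf, etype = [], None
    return entities

chars = list("白宫宣布特朗普与普京将在G20峰会期间举行会晤")
tags = ['B-ORG', 'E-ORG', 'O', 'O', 'B-PER', 'I-PER', 'E-PER', 'O',
        'B-PER', 'E-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'E-ORG',
        'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
print(bioes_to_entities(chars, tags))
# [('白宫', 'ORG'), ('特朗普', 'PER'), ('普京', 'PER'), ('G20', 'ORG')]
```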
Code implementation
CRF model: crf.py
```
from sklearn_crfsuite import CRF


class CRFModel(object):
    def __init__(self, algorithm='lbfgs',
                 c1=0.1,
                 c2=0.1,
                 max_iterations=100,
                 all_possible_transitions=False):
        self.model = CRF(algorithm=algorithm,
                         c1=c1,
                         c2=c2,
                         max_iterations=max_iterations,
                         all_possible_transitions=all_possible_transitions)

    def train(self, sentences, tag_lists):
        features = [self.sent2features(s) for s in sentences]
        self.model.fit(features, tag_lists)

    def test(self, sentences):
        features = [self.sent2features(s) for s in sentences]
        pred_tag_lists = self.model.predict(features)
        return pred_tag_lists

    def word2features(self, sent, i):
        """Extract the features of a single character."""
        word = sent[i]
        prev_word = "<s>" if i == 0 else sent[i-1]
        next_word = "</s>" if i == (len(sent)-1) else sent[i+1]
        # Features used:
        # previous char, current char, next char,
        # previous char + current char, current char + next char
        features = {
            'w': word,
            'w-1': prev_word,
            'w+1': next_word,
            'w-1:w': prev_word + word,
            'w:w+1': word + next_word,
            'bias': 1
        }
        return features

    def sent2features(self, sent):
        """Extract the features of a whole sequence."""
        return [self.word2features(sent, i) for i in range(len(sent))]
```
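To see concretely what this feature template produces, sent2features can be called on a two-character sentence (assuming crf.py above is on the import path):
```
from crf import CRFModel

model = CRFModel()
print(model.sent2features(list("白宫")))
# [{'w': '白', 'w-1': '<s>', 'w+1': '宫', 'w-1:w': '<s>白', 'w:w+1': '白宫', 'bias': 1},
#  {'w': '宫', 'w-1': '白', 'w+1': '</s>', 'w-1:w': '白宫', 'w:w+1': '宫</s>', 'bias': 1}]
```
Each character is thus represented by itself, its neighbours and two character bigrams; sklearn_crfsuite consumes these feature dictionaries directly.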
Model training: train.py
```
from crf import CRFModel
from evaluating import Metrics
import os
import pickle

CRF_MODEL_PATH = './model/crf.pkl'


def build_corpus(split, make_vocab=True, data_dir="./data"):
    """Read the data of one split (train/dev/test)."""
    assert split in ['train', 'dev', 'test']

    word_lists = []
    tag_lists = []
    with open(os.path.join(data_dir, split + ".char.txt"), 'r', encoding='utf-8') as f:
        word_list = []
        tag_list = []
        for line in f:
            if line != '\n':
                word, tag = line.strip('\n').split()
                word_list.append(word)
                tag_list.append(tag)
            else:
                word_lists.append(word_list)
                tag_lists.append(tag_list)
                word_list = []
                tag_list = []

    # If make_vocab is True, also return word2id and tag2id
    if make_vocab:
        word2id = build_map(word_lists)
        tag2id = build_map(tag_lists)
        return word_lists, tag_lists, word2id, tag2id
    else:
        return word_lists, tag_lists


def build_map(lists):
    maps = {}
    for list_ in lists:
        for e in list_:
            if e not in maps:
                maps[e] = len(maps)
    return maps


def crf_train_eval(train_data, test_data, remove_O=False):
    # Train the CRF model
    train_word_lists, train_tag_lists = train_data
    test_word_lists, test_tag_lists = test_data

    crf_model = CRFModel()
    crf_model.train(train_word_lists, train_tag_lists)
    save_model(crf_model, CRF_MODEL_PATH)

    pred_tag_lists = crf_model.test(test_word_lists)

    metrics = Metrics(test_tag_lists, pred_tag_lists, remove_O=remove_O)
    metrics.report_scores()
    metrics.report_confusion_matrix()

    return pred_tag_lists


def save_model(model, file_name):
    """Save a trained model to disk."""
    os.makedirs(os.path.dirname(file_name) or ".", exist_ok=True)  # make sure ./model exists
    with open(file_name, "wb") as f:
        pickle.dump(model, f)


def load_model(file_name):
    """Load a trained model from disk."""
    with open(file_name, "rb") as f:
        model = pickle.load(f)
    return model


def main():
    """Train the model and evaluate the results."""
    # Read the data
    print('loading...')
    train_word_lists, train_tag_lists, word2id, tag2id = build_corpus("train")
    dev_word_lists, dev_tag_lists = build_corpus("dev", make_vocab=False)
    test_word_lists, test_tag_lists = build_corpus("test", make_vocab=False)

    # Train the CRF model and evaluate it on the test set
    print('training...')
    crf_pred = crf_train_eval((train_word_lists, train_tag_lists),
                              (test_word_lists, test_tag_lists))

    # Reload the saved CRF model and evaluate it on the dev set
    print('evaluating...')
    crf_model = load_model(CRF_MODEL_PATH)
    crf_pred = crf_model.test(dev_word_lists)
    metrics = Metrics(dev_tag_lists, crf_pred)
    metrics.report_scores()
    metrics.report_confusion_matrix()


if __name__ == "__main__":
    main()
```
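Once training has produced ./model/crf.pkl, the saved model can be reused to tag new text. A hypothetical predict.py might look like this (the file name and the example sentence are illustrative, not part of the project):
```
# Hypothetical prediction script: load the trained CRF model and tag one sentence.
from train import load_model, CRF_MODEL_PATH

crf_model = load_model(CRF_MODEL_PATH)
sentence = list("王小二同学就读于南昌大学")
tags = crf_model.test([sentence])[0]   # test() expects a list of sentences
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```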
Dependency: evaluating.py
```
from collections import Counter


class Metrics(object):
    """Evaluate the model: compute precision, recall and F1 for every tag."""

    def __init__(self, golden_tags, predict_tags, remove_O=False):
        # [[t1, t2], [t3, t4]...] --> [t1, t2, t3, t4...]
        self.golden_tags = flatten_lists(golden_tags)
        self.predict_tags = flatten_lists(predict_tags)

        if remove_O:  # drop the O tags and only look at entity tags
            self._remove_Otags()

        # helper variables
        self.tagset = set(self.golden_tags)
        self.correct_tags_number = self.count_correct_tags()
        self.predict_tags_counter = Counter(self.predict_tags)
        self.golden_tags_counter = Counter(self.golden_tags)

        # precision
        self.precision_scores = self.cal_precision()
        # recall
        self.recall_scores = self.cal_recall()
        # F1
        self.f1_scores = self.cal_f1()

    def cal_precision(self):
        precision_scores = {}
        for tag in self.tagset:
            precision_scores[tag] = self.correct_tags_number.get(tag, 0) / \
                max(1e-10, self.predict_tags_counter[tag])
        return precision_scores

    def cal_recall(self):
        recall_scores = {}
        for tag in self.tagset:
            recall_scores[tag] = self.correct_tags_number.get(tag, 0) / \
                max(1e-10, self.golden_tags_counter[tag])
        return recall_scores

    def cal_f1(self):
        f1_scores = {}
        for tag in self.tagset:
            p, r = self.precision_scores[tag], self.recall_scores[tag]
            f1_scores[tag] = 2 * p * r / (p + r + 1e-10)  # tiny epsilon so the denominator is never zero
        return f1_scores

    def report_scores(self):
        """Print the results as a table, like this:

                 precision    recall  f1-score   support
          B-LOC      0.775     0.757     0.766      1084
          I-LOC      0.601     0.631     0.616       325
         B-MISC      0.698     0.499     0.582       339
         I-MISC      0.644     0.567     0.603       557
          B-ORG      0.795     0.801     0.798      1400
          I-ORG      0.831     0.773     0.801      1104
          B-PER      0.812     0.876     0.843       735
          I-PER      0.873     0.931     0.901       634
      avg/total      0.779     0.764     0.770      6178
        """
        # header
        header_format = '{:>9s} {:>9} {:>9} {:>9} {:>9}'
        header = ['precision', 'recall', 'f1-score', 'support']
        print(header_format.format('', *header))

        row_format = '{:>9s} {:>9.4f} {:>9.4f} {:>9.4f} {:>9}'
        # per-tag precision, recall and F1
        for tag in self.tagset:
            print(row_format.format(
                tag,
                self.precision_scores[tag],
                self.recall_scores[tag],
                self.f1_scores[tag],
                self.golden_tags_counter[tag]
            ))

        # weighted averages
        avg_metrics = self._cal_weighted_average()
        print(row_format.format(
            'avg/total',
            avg_metrics['precision'],
            avg_metrics['recall'],
            avg_metrics['f1_score'],
            len(self.golden_tags)
        ))

    def count_correct_tags(self):
        """Count how often each tag was predicted correctly (the tp term used for precision and recall)."""
        correct_dict = {}
        for gold_tag, predict_tag in zip(self.golden_tags, self.predict_tags):
            if gold_tag == predict_tag:
                if gold_tag not in correct_dict:
                    correct_dict[gold_tag] = 1
                else:
                    correct_dict[gold_tag] += 1
        return correct_dict

    def _cal_weighted_average(self):
        weighted_average = {}
        total = len(self.golden_tags)

        # weighted precision, recall and F1
        weighted_average['precision'] = 0.
        weighted_average['recall'] = 0.
        weighted_average['f1_score'] = 0.
        for tag in self.tagset:
            size = self.golden_tags_counter[tag]
            weighted_average['precision'] += self.precision_scores[tag] * size
            weighted_average['recall'] += self.recall_scores[tag] * size
            weighted_average['f1_score'] += self.f1_scores[tag] * size
        for metric in weighted_average.keys():
            weighted_average[metric] /= total
        return weighted_average

    def _remove_Otags(self):
        length = len(self.golden_tags)
        O_tag_indices = [i for i in range(length)
                         if self.golden_tags[i] == 'O']
        self.golden_tags = [tag for i, tag in enumerate(self.golden_tags)
                            if i not in O_tag_indices]
        self.predict_tags = [tag for i, tag in enumerate(self.predict_tags)
                             if i not in O_tag_indices]
        print("Total tags: {}, removed {} O tags ({:.2f}% of all tags)".format(
            length,
            len(O_tag_indices),
            len(O_tag_indices) / length * 100
        ))

    def report_confusion_matrix(self):
        """Compute and print the confusion matrix."""
        print("\nConfusion Matrix:")
        tag_list = list(self.tagset)
        # matrix[i][j] is the number of times tag i was predicted as tag j
        tags_size = len(tag_list)
        matrix = []
        for i in range(tags_size):
            matrix.append([0] * tags_size)

        # walk over the tag sequences and fill the matrix
        for golden_tag, predict_tag in zip(self.golden_tags, self.predict_tags):
            try:
                row = tag_list.index(golden_tag)
                col = tag_list.index(predict_tag)
                matrix[row][col] += 1
            except ValueError:
                # a few tags appear in predict_tags but not in golden_tags; skip them
                continue

        # print the matrix
        row_format_ = '{:>7} ' * (tags_size + 1)
        print(row_format_.format("", *tag_list))
        for i, row in enumerate(matrix):
            print(row_format_.format(tag_list[i], *row))


def flatten_lists(lists):
    flatten_list = []
    for l in lists:
        if type(l) == list:
            flatten_list += l
        else:
            flatten_list.append(l)
    return flatten_list
```
Model training depends on the following data files (main() in train.py also reads the dev split, so dev.char.txt is needed as well):
./data/train.char.txt
./data/dev.char.txt
./data/test.char.txt
Model training results:

| precision | recall | f1-score |
| --- | --- | --- |
| 0.9543 | 0.9543 | 0.9542 |
The complete model code (including the training data) can be found in the post "基于crf的中文命名实体识别完整代码(含训练数据)".
Character-based features
Network structure: a BiLSTM-CRF over character embeddings, the architecture used by the repositories recommended below (a minimal sketch follows).
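Since the network-structure figure is not reproduced here, the following minimal sketch shows the usual character-level BiLSTM-CRF layout; it assumes PyTorch plus the third-party pytorch-crf package, and the hyperparameters are arbitrary:
```
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)    # per-character tag scores (emissions)
        self.crf = CRF(num_tags, batch_first=True)   # transition scores + Viterbi decoding

    def forward(self, char_ids, tags, mask):
        emissions = self.fc(self.lstm(self.embedding(char_ids))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood loss

    def predict(self, char_ids, mask):
        emissions = self.fc(self.lstm(self.embedding(char_ids))[0])
        return self.crf.decode(emissions, mask=mask)  # best tag-id sequence per sentence
```
The embedding layer maps characters to vectors, the bidirectional LSTM encodes context in both directions, and the CRF layer scores whole tag sequences so that invalid transitions (e.g. I-PER right after B-ORG) are penalized.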
Hands-on practice
Data annotation
Same as for the CRF model.
Code
Here I recommend several particularly good projects on GitHub:
1. A very simple BiLSTM-CRF model for Chinese Named Entity Recognition (TensorFlow): https://github.com/Determined22/zh-NER-TF
2. Chinese NER / entity extraction with BiLSTM+CRF, in TensorFlow and PyTorch: https://github.com/buppt/ChineseNER
3. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, implemented in PyTorch: https://github.com/bamtercelboo/pytorch_NER_BiLSTM_CNN_CRF
4. NER: Chinese Named Entity Recognition in Keras: https://github.com/littledeepthink/NER-in-Chinese-Text
Word-based features
The network structure is the same as above.
Data annotation
The annotation method is the same as above, and the tag set is also BIOES. First segment the text at the finest granularity, for example (a segmentation sketch follows this example):
中国 B-ORG
烟草 I-ORG
总公司 E-ORG
最近 O
火 O
了 O
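A minimal sketch of producing word-level tokens ready for annotation, assuming the jieba segmenter (any Chinese word segmenter would do; the printed 'O' tags are placeholders to be corrected by hand):
```
import jieba

sentence = "中国烟草总公司最近火了"
for word in jieba.lcut(sentence):
    print(word, "O")   # placeholder tag; e.g. 中国 would then be relabeled B-ORG
```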
Comparison of word-based and character-based features
1. Character-based features need no word segmentation, which avoids errors introduced by wrong segmentation
2. Both can exploit pre-trained embeddings: with pre-trained character vectors, '总' and '分' are close in the vector space, so the model can also recognize "中国烟草分公司" as an ORG; likewise, with pre-trained word vectors, '总公司' and '分公司' are close, so the model can also recognize "中国烟草分公司" as an ORG
3. Word-based features are richer: word vectors easily capture that '南京', '南昌' and '武汉' belong to the same class (place names), whereas character-based features do not carry this information, so a word-based model is better at recognizing entities such as '南京大学', '南昌大学' and '武汉大学'
BERT-based approach
For background on BERT, see my other post in this series: NLP进化史系列之语言模型.
There is a very good project on GitHub, BERT-BiLSTM-CRF-NER, so I will not go into the details here.