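NER (named entity recognition) is a foundational natural language processing technique. It extracts entities from a given text and classifies them into predefined categories such as company names, person names, and place names. The spaCy library lets us train NER either by updating an existing model for a specific context or by training a new NER model from scratch. In this article, we look at how to build a custom NER model.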
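To make the target concrete, here is a minimal sketch of how one annotated sentence can be represented for spaCy NER training: the raw text plus a list of (start, end, label) character offsets. The sentence and the labels ORG and GPE are invented for illustration, and `spans_to_strings` is a hypothetical helper name.

```python
# Illustrative sample only: the sentence and labels are invented for this sketch.
text = "Apple opened a new office in Shanghai."

# Each entity is (start_char, end_char, label); end_char is exclusive.
annotation = {"entities": [(0, 5, "ORG"), (29, 37, "GPE")]}


def spans_to_strings(text, annotation):
    """Return (entity_text, label) pairs by slicing the text with each offset."""
    return [(text[start:end], label) for start, end, label in annotation["entities"]]


print(spans_to_strings(text, annotation))
```

Slicing the text with the offsets is a quick way to confirm that an annotation points at the intended words.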
First, load the required dependencies:
- import os
- import pandas as pd
- import spacy
- import random
- import joblib
- from spacy.training.example import Example
- import time
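We assume here that all NER training data has already been annotated in Label Studio and exported as JSON files, so the first step is to merge the exported files and convert them to Excel: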
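- # After exporting from Label Studio, read each JSON file with pandas
- path = 'C:/Users/xxx/downloads/'
- dfs = []
- for file in os.listdir(path):
-     # Skip anything that is not a JSON export (e.g. the Excel output written below)
-     if not file.endswith('.json'):
-         continue
-     data = pd.read_json(os.path.join(path, file), encoding='utf-8')
-     # Excel cannot store timezone-aware datetimes, so drop the timezone info
-     data['created_at'] = data['created_at'].apply(lambda x: x.replace(tzinfo=None) if pd.notna(x) else x)
-     data['updated_at'] = data['updated_at'].apply(lambda x: x.replace(tzinfo=None) if pd.notna(x) else x)
-     dfs.append(data)
- # Merge all DataFrames
- merged_df = pd.concat(dfs, ignore_index=True)
- # Write the merged DataFrame to an Excel file
- merged_df.to_excel(path + 'labeling output.xlsx', index=False)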
Next, convert the Excel output above into a training dataset:
- import ast  # literal_eval parses the stored strings without executing arbitrary code
- df = pd.read_excel("C:/Users/xxx/downloads/labeling output.xlsx")
- df = df.dropna()
- label = df['label'].values.tolist()
- content = df['Content'].values.tolist()
- all_label, all_values = [], []
- # Process the label column: each cell is the string form of a list of annotation dicts
- for raw in label:
-     ner_label, values = [], []
-     for item in ast.literal_eval(raw):
-         start = item['start']
-         end = item['end']
-         entity_label = item['labels'][0]  # use the first label of each annotation
-         value = item['text']
-         ner_label.append((start, end, entity_label))
-         values.append(value)
-     all_label.append(ner_label)
-     all_values.append(values)
- df['ner-label'] = all_label
- df['values'] = all_values
- # to_excel no longer accepts an encoding argument in pandas 2.0+, so it is omitted
- df.to_excel("C:/Users/xxx/downloads/ner.xlsx", index=False)
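The conversion step above can be isolated into a small helper that is easy to test on a single cell. This sketch assumes each cell of the label column is the string form of a list of dicts with `start`, `end`, `text`, and `labels` keys, as the loop above implies; the function name `parse_label_cell` and the sample cell are hypothetical.

```python
import ast


def parse_label_cell(cell):
    """Turn one stringified label cell into (spans, values).

    Assumes the cell looks like
    "[{'start': 0, 'end': 5, 'text': 'Apple', 'labels': ['ORG']}, ...]".
    """
    spans, values = [], []
    # literal_eval only accepts Python literals, so it cannot run arbitrary code
    for item in ast.literal_eval(cell):
        spans.append((item["start"], item["end"], item["labels"][0]))
        values.append(item["text"])
    return spans, values


# Hypothetical cell content, for illustration only
cell = "[{'start': 0, 'end': 5, 'text': 'Apple', 'labels': ['ORG']}]"
print(parse_label_cell(cell))
```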
The next step is to check the data quality of ner.xlsx and confirm that it meets the training requirements. The data should contain the following columns:
content | ner-label | values |
Text 1 | (start_id1, end_id1, Entity1), (start_id2, end_id2, Entity2), ... | ['Entity content1', 'Entity content2', ...] |
Text 2 | (start_id3, end_id3, Entity3), (start_id4, end_id4, Entity2), ... | ['Entity content3', 'Entity content4', ...] |
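One way to run this quality check is to verify each row programmatically. The sketch below is an assumption about what "meets the training requirements" means here: offsets are in range, the sliced text matches the stored value, and spans do not overlap (spaCy's NER does not accept overlapping entities). The function name `find_span_problems` is hypothetical.

```python
def find_span_problems(text, spans, values):
    """Return a list of human-readable problems for one row; empty means OK."""
    problems = []
    for (start, end, label), value in zip(spans, values):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets ({start}, {end}) out of range")
        elif text[start:end] != value:
            problems.append(
                f"{label}: text[{start}:{end}] is {text[start:end]!r}, expected {value!r}"
            )
    # Sort by start offset; a span that starts before the previous one ends overlaps it
    ordered = sorted(spans)
    for (s1, e1, l1), (s2, e2, l2) in zip(ordered, ordered[1:]):
        if s2 < e1:
            problems.append(f"{l1} ({s1}, {e1}) overlaps {l2} ({s2}, {e2})")
    return problems


sample_text = "Apple opened a new office in Shanghai."
clean = find_span_problems(sample_text, [(0, 5, "ORG"), (29, 37, "GPE")], ["Apple", "Shanghai"])
overlapping = find_span_problems(sample_text, [(0, 5, "ORG"), (3, 12, "ORG")], ["Apple", "le opened"])
print(clean)
print(overlapping)
```

Rows that return a non-empty list are worth fixing in the annotation tool before training, since the training loop would otherwise skip or mislearn them.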
Once the dataset is ready, we can move on to model training!
1. Convert the loaded dataset into the data format used to train the NER model:
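- import ast  # literal_eval parses the stored strings without executing arbitrary code
- def GroupData(content, label, TRAIN_DATA):
-     for line in range(len(content)):
-         parsed = ast.literal_eval(label[line])
-         Entities = []
-         for i in parsed:
-             # An int element means the cell is one flat (start, end, label) tuple
-             if isinstance(i, int):
-                 Entities.append(parsed)
-                 break
-             # Otherwise each element is itself a (start, end, label) tuple
-             if isinstance(i, tuple):
-                 Entities.append(i)
-         TRAIN_DATA.append((content[line], {"entities": Entities}))
-     return TRAIN_DATA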
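The two type checks in GroupData exist because a cell holding a single entity, written as `(0, 5, 'ORG')`, evaluates to one flat tuple, while a cell with several entities evaluates to a tuple (or list) of tuples. A minimal standalone sketch of that normalization, using the hypothetical helper name `normalize_entities`:

```python
import ast


def normalize_entities(cell):
    """Parse a stringified label cell into a list of (start, end, label) tuples."""
    parsed = ast.literal_eval(cell)
    # A single entity such as "(0, 5, 'ORG')" parses to a flat tuple whose
    # first element is an int, so wrap it in a list.
    if parsed and isinstance(parsed[0], int):
        return [tuple(parsed)]
    # Otherwise keep only the elements that are themselves tuples.
    return [item for item in parsed if isinstance(item, tuple)]


print(normalize_entities("(0, 5, 'ORG')"))
print(normalize_entities("(0, 5, 'ORG'), (29, 37, 'GPE')"))
print(normalize_entities("[(0, 5, 'ORG')]"))
```

All three input shapes end up as a list of tuples, which is the form the `"entities"` key expects.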
2. Define the model training function, where iterations is the number of training iterations:
- def train_spacy(TRAIN_DATA, iterations):
-     # Create a blank English pipeline ("en"); use "zh" for Chinese
-     nlp = spacy.blank("en")
-     # Add the NER component if it is missing, otherwise reuse the existing one
-     if "ner" not in nlp.pipe_names:
-         ner = nlp.add_pipe("ner", last=True)
-     else:
-         ner = nlp.get_pipe("ner")
-     # Register every entity label with the NER component
-     for _, annotations in TRAIN_DATA:
-         for ent in annotations.get("entities"):
-             ner.add_label(ent[2])
-     # Collect every pipeline component other than NER
-     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
-     # Train with the other components disabled so only NER is updated
-     with nlp.select_pipes(disable=other_pipes):
-         optimizer = nlp.initialize()
-         optimizer.learn_rate = 1e-3
-         for itn in range(iterations):
-             print("Starting iteration", itn + 1)
-             random.shuffle(TRAIN_DATA)
-             losses = {}
-             for text, annotations in TRAIN_DATA:
-                 try:
-                     doc = nlp.make_doc(text)
-                     example = Example.from_dict(doc, annotations)
-                     nlp.update([example], losses=losses, drop=0.4, sgd=optimizer)
-                 except Exception as err:
-                     # Skip examples spaCy rejects (e.g. overlapping spans), but report them
-                     print("Skipped example:", err)
-                     continue
-             print("Losses:", losses)
-     return nlp
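Before training, it can help to see which labels the function above will register with `ner.add_label` and how often each one occurs; rare or misspelled labels show up immediately. A minimal sketch, using the hypothetical helper name `count_labels` and invented sample data:

```python
from collections import Counter


def count_labels(train_data):
    """Count how many times each entity label appears in the training data."""
    counts = Counter()
    for _text, annotations in train_data:
        for _start, _end, label in annotations.get("entities", []):
            counts[label] += 1
    return counts


# Toy data in the same (text, {"entities": [...]}) shape used above
sample = [
    ("Apple opened a new office in Shanghai.", {"entities": [(0, 5, "ORG"), (29, 37, "GPE")]}),
    ("Apple hired more staff.", {"entities": [(0, 5, "ORG")]}),
]
print(count_labels(sample))
```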
Main function:
- if __name__ == '__main__':
-     df = pd.read_excel(r"C:\Users\xxx\Desktop\data\NER\ner-train.xlsx")
-     df = df.dropna()
-     # Convert the columns to lists; the label column must hold (start, end, label) tuples as text
-     content = df['Content'].values.tolist()
-     label = df['label'].values.tolist()
-     TRAIN_DATA = []
-     # Build the training data
-     GroupData(content, label, TRAIN_DATA)
-
-     # Train the model and time the run
-     begin = time.perf_counter()
-     trained_nlp = train_spacy(TRAIN_DATA, 55)
-     joblib.dump(trained_nlp, r"C:\Users\xxx\Desktop\data\NER\NER.m")
-     end_time = time.perf_counter()
-     run_time = end_time - begin
-     print('Model built successfully; total run time:', run_time, 's')
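At this point you should have successfully trained a custom NER model. The next article in this series continues with: (4) Calling the NER model and evaluating its performance.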