NLP Essentials (3): Building a Named Entity Recognition Model with spaCy

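NER (named entity recognition) is a foundational natural language processing technique. It extracts entities from a given text and classifies them into predefined categories such as company names, person names, and place names. spaCy lets us either update an existing model for a specific context or train a new model. In this article we look at how to build a custom NER model.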

1. First, load the required dependencies:

    import os
    import pandas as pd
    import spacy
    import random
    import joblib
    from spacy.training.example import Example
    import time

2. After annotating the data in Label Studio, export the JSON files, then merge the batch of files and convert them to Excel:

    # After exporting from Label Studio, read each JSON file with pandas
    path = 'C:/Users/xxx/downloads/'
    dfs = []
    for file in os.listdir(path):
        if not file.endswith('.json'):
            continue  # skip anything that is not an exported JSON file
        data = pd.read_json(os.path.join(path, file), encoding='utf-8')
        # Excel cannot store timezone-aware datetimes, so drop the timezone
        data['created_at'] = data['created_at'].apply(lambda x: x.replace(tzinfo=None) if x is not None else None)
        data['updated_at'] = data['updated_at'].apply(lambda x: x.replace(tzinfo=None) if x is not None else None)
        dfs.append(data)
    # Concatenate all the DataFrames
    merged_df = pd.concat(dfs)
    # Write the merged DataFrame to an Excel file
    merged_df.to_excel(path + 'labeling output.xlsx', index=False)

3. Convert the Excel output above into a training dataset:

    import ast

    df = pd.read_excel("C:/Users/xxx/downloads/labeling output.xlsx")
    df = df.dropna()
    label = df['label'].values.tolist()
    content = df['Content'].values.tolist()
    all_label, all_values = [], []
    # Process the label column: each cell is the string form of a list of annotations
    for cell in label:
        ner_label, values = [], []
        # ast.literal_eval only parses literals, so unlike eval it cannot run arbitrary code
        for item in ast.literal_eval(cell):
            start = item['start']
            end = item['end']
            labels = item['labels'][0]
            value = item['text']
            ner_label.append((start, end, labels))
            values.append(value)
        all_label.append(ner_label)
        all_values.append(values)
    df['ner-label'] = all_label
    df['values'] = all_values
    df.to_excel("C:/Users/xxx/downloads/ner.xlsx", index=False)
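
To make the conversion in the loop above concrete, here is a minimal sketch that parses a single label cell. It assumes the cell is the string form of a list of dictionaries with the keys `start`, `end`, `labels` and `text`, which is what the loop reads. The helper name `parse_label_cell` and the sample sentence are made up for illustration.

```python
import ast


def parse_label_cell(cell):
    """Turn one stringified annotation list into (spans, values)."""
    spans, values = [], []
    for item in ast.literal_eval(cell):
        # Keep only the first label of each annotation, as the loop above does
        spans.append((item["start"], item["end"], item["labels"][0]))
        values.append(item["text"])
    return spans, values


# Hypothetical cell for the sentence "Apple hired Tim Cook"
sample_cell = (
    "[{'start': 0, 'end': 5, 'labels': ['ORG'], 'text': 'Apple'}, "
    "{'start': 12, 'end': 20, 'labels': ['PER'], 'text': 'Tim Cook'}]"
)
spans, values = parse_label_cell(sample_cell)
print(spans)   # character offsets plus label for each entity
print(values)  # the annotated text of each entity
```

The offsets are character positions, with `end` pointing one past the last character of the entity.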

4. Check the data quality of ner.xlsx and confirm that it meets the training requirements. The data should contain the following columns:

content | ner-label | values
--- | --- | ---
Text 1 | (start_id1, end_id1, Entity1), (start_id2, end_id2, Entity2), ... | ['entity text 1', 'entity text 2', ...]
Text 2 | (start_id3, end_id3, Entity3), (start_id4, end_id4, Entity2), ... | ['entity text 3', 'entity text 4', ...]
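
Part of this quality check can be automated. The sketch below is one possible approach and not part of the original workflow: it verifies that each span stays inside the text, that the text at the offsets matches the recorded value, and that no two spans overlap (as far as I know, spaCy's NER does not accept overlapping entities). The function name `find_span_problems` is hypothetical.

```python
def find_span_problems(text, spans, values):
    """Return a list of human-readable problems found in one row."""
    problems = []
    for (start, end, label), value in zip(spans, values):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets ({start}, {end}) fall outside the text")
        elif text[start:end] != value:
            problems.append(
                f"{label}: text[{start}:{end}] is {text[start:end]!r}, expected {value!r}"
            )
    # After sorting by start offset, any overlap shows up between neighbours
    ordered = sorted(spans)
    for (s1, e1, l1), (s2, e2, l2) in zip(ordered, ordered[1:]):
        if s2 < e1:
            problems.append(f"{l1} ({s1}, {e1}) overlaps {l2} ({s2}, {e2})")
    return problems


text = "Apple hired Tim Cook"
clean = find_span_problems(text, [(0, 5, "ORG"), (12, 20, "PER")], ["Apple", "Tim Cook"])
broken = find_span_problems(text, [(0, 6, "ORG"), (5, 11, "X")], ["Apple", "hired"])
print(clean)
for problem in broken:
    print(problem)
```

Rows that produce a non-empty list are worth fixing in the annotation tool before training.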

5. Model training: convert the loaded dataset into the data format used to train the NER model:

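    import ast

    def GroupData(content, label, TRAIN_DATA):
        """Append (text, {"entities": [...]}) pairs to TRAIN_DATA and return it."""
        for line in range(len(content)):
            # ast.literal_eval is a safer replacement for eval on literal data
            parsed = ast.literal_eval(label[line])
            Entity = {}
            Entities = []
            for i in parsed:
                if isinstance(i, int):
                    # The cell holds a single (start, end, label) tuple: keep it whole
                    Entities.append(parsed)
                    break
                if isinstance(i, tuple):
                    # The cell holds several tuples: collect them one by one
                    Entities.append(i)
            Entity["entities"] = Entities
            TRAIN_DATA.append((content[line], Entity))
        return TRAIN_DATA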
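
The two branches in GroupData exist because a cell with one entity parses to a flat tuple, while a cell with several entities parses to a tuple of tuples. The following sketch shows that normalization in isolation and the shape of the resulting training pair. The helper name `normalize_entities` and the sample cells are assumptions for illustration.

```python
import ast


def normalize_entities(cell):
    """Parse a label cell into a list of (start, end, label) tuples."""
    parsed = ast.literal_eval(cell)
    if len(parsed) == 3 and isinstance(parsed[0], int):
        # A single entity parses to a flat tuple such as (0, 5, 'ORG')
        return [tuple(parsed)]
    # Several entities parse to a tuple (or list) of tuples
    return [tuple(span) for span in parsed]


single = normalize_entities("(0, 5, 'ORG')")
several = normalize_entities("(0, 5, 'ORG'), (12, 20, 'PER')")

# One training pair in the (text, {"entities": [...]}) shape built by GroupData
pair = ("Apple hired Tim Cook", {"entities": several})
print(single)
print(pair)
```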

6. The model training function. iterations is the number of training iterations:

    def train_spacy(TRAIN_DATA, iterations):
        # Create a blank model: "en" for English, "zh" for Chinese
        nlp = spacy.blank("en")
        # Add the NER component if the pipeline does not have one yet
        if "ner" not in nlp.pipe_names:
            ner = nlp.add_pipe("ner", last=True)
        else:
            ner = nlp.get_pipe("ner")
        # Register every entity label with the NER component
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get("entities"):
                ner.add_label(ent[2])
        # Train with only the NER component enabled so other components are unaffected
        with nlp.select_pipes(enable=["ner"]):
            # initialize() is the spaCy v3 replacement for the deprecated begin_training()
            optimizer = nlp.initialize()
            optimizer.learn_rate = 1e-3
            for itn in range(iterations):
                print("Starting iteration", itn + 1)
                random.shuffle(TRAIN_DATA)
                losses = {}
                for text, annotations in TRAIN_DATA:
                    try:
                        doc = nlp.make_doc(text)
                        example = Example.from_dict(doc, annotations)
                        nlp.update([example], losses=losses, sgd=optimizer)  # drop=0.4
                    except Exception as err:
                        # Skip examples that spaCy rejects, but report them instead of hiding them
                        print("Skipping example:", err)
                        continue
        return nlp
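
train_spacy trains on every example it receives. If you plan to measure the model later, one option is to set aside a few examples before training. The sketch below is an optional addition, not part of the original workflow; `split_train_data`, `dev_ratio` and `seed` are hypothetical names.

```python
import random


def split_train_data(train_data, dev_ratio=0.2, seed=42):
    """Shuffle a copy of the data and return (train_part, dev_part)."""
    shuffled = list(train_data)            # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Ten dummy examples in the (text, {"entities": [...]}) shape
data = [(f"text {i}", {"entities": []}) for i in range(10)]
train_part, dev_part = split_train_data(data, dev_ratio=0.2, seed=0)
print(len(train_part), len(dev_part))
```

Only `train_part` would then be passed to train_spacy, with `dev_part` kept for evaluation.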

7. The main function:

    if __name__ == '__main__':
        df = pd.read_excel(r"C:\Users\xxx\Desktop\data\NER\ner-train.xlsx")
        df = df.dropna()
        # Convert the columns to lists
        content = df['Content'].values.tolist()
        label = df['label'].values.tolist()
        TRAIN_DATA = []
        # Assemble the training data
        GroupData(content, label, TRAIN_DATA)
        # Train the model and time the run
        begin = time.perf_counter()
        trained_nlp = train_spacy(TRAIN_DATA, 50)
        joblib.dump(trained_nlp, r"C:\Users\xxx\Desktop\data\NER\NER.m")
        end_time = time.perf_counter()
        run_time = end_time - begin
        print('Model built successfully, training run time:', run_time, 's')

By now you should have trained your own custom NER model. The next article continues from here: NLP Essentials (4): Using and Evaluating the NER Model.
