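The previous article described how to build an NER model with spaCy; this one covers how to load that model and evaluate how well it performs.
First, load the required libraries: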
    import pandas as pd
    import joblib
Next, load the model and the validation dataset, then extract the entities from every text:
    ## Load the NER model
    ner = joblib.load(r"C:\Users\xxx\Desktop\ner\NER.m")
    print('NER model loaded successfully')

    ## Extract entities with the NER model
    df = pd.read_excel(r"C:\Users\xxx\Desktop\NER\NER-validate.xlsx")
    ## Convert the text column to a list
    content = df['Content'].values.tolist()
    AllEntity = []
    for text in content:
        doc = ner(text)
        ## Collect every (text, label) pair the model predicts
        Entity = [(ent.text, ent.label_) for ent in doc.ents]
        ## Remove duplicates while keeping first-occurrence order
        Entity = list(dict.fromkeys(Entity))
        # print('Entities:', Entity)
        AllEntity.append(Entity)
    df['AllEntity'] = AllEntity
    Result_Path = r"C:\Users\xxx\Desktop\NER\ner-ValidateResult1.xlsx"
    df.to_excel(Result_Path, index=False)
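The evaluation step parses the `AllEntity` column from strings, which suggests that each list ends up in its Excel cell as its text representation. The following is a minimal sketch of that round trip together with the order-preserving deduplication. The sample entities are made up for illustration, `dedupe_keep_order` is a hypothetical helper name, and `ast.literal_eval` is used here as a safer stand-in for `eval`, on the assumption that the cells contain only Python literals.

```python
import ast


def dedupe_keep_order(pairs):
    """Drop repeated (text, label) pairs, keeping first-occurrence order."""
    return list(dict.fromkeys(pairs))


# Made-up predictions for one text, with a repeated mention.
entities = [("Apple", "ORG"), ("Tim Cook", "PERSON"), ("Apple", "ORG")]
unique = dedupe_keep_order(entities)

# Simulate the spreadsheet round trip: the cell holds the list's string form.
cell_text = str(unique)

# Parse the string back into a list of tuples without executing arbitrary code.
restored = ast.literal_eval(cell_text)
print(restored)
```

Unlike `eval`, `ast.literal_eval` raises an error instead of running code if a cell contains anything other than a literal, which may be preferable when the spreadsheet has been edited by hand.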
The next step is to analyze the Excel file written above:
    ## Validate the NER model
    import ast  # literal_eval parses the list strings read back from Excel

    df = pd.read_excel(Result_Path)
    label = df['label'].tolist()
    values = df['value'].tolist()
    AllEntity = df['AllEntity'].tolist()
    total = len(label)
    Manual_Labeling, mark_AllEnt = [], []
    right_count = all_source_count = all_pred_count = count = fail_count = 0

    for line in range(len(label)):
        ## Parse the cell strings back into lists/tuples (safer than eval)
        col1 = ast.literal_eval(label[line])      # manual annotations
        col2 = ast.literal_eval(values[line])     # annotated entity texts
        col3 = ast.literal_eval(AllEntity[line])  # predictions as (text, label)
        print('\nAllEntity:', col3)

        ## Pair each annotated text with its label (third field of each annotation)
        new_list = [(col2[j], col1[j][2]) for j in range(len(col2))]
        gold = list(set(new_list))
        print('labelled_list:', gold)
        Manual_Labeling.append(gold)

        all_source_count += len(gold)
        all_pred_count += len(col3)

        # Count correct and incorrect predictions (used for precision and recall)
        for item in col3:
            if isinstance(item, tuple) and item in new_list:
                right_count += 1
            else:
                fail_count += 1

        # Flag texts in which every labelled entity was predicted
        # (extra predictions do not clear the flag)
        if all(item in col3 for item in new_list):
            count += 1
            mark_AllEnt.append(1)
        else:
            mark_AllEnt.append(0)

        # Running totals after each text
        print('Total labelled entities=', all_source_count, '; correct predictions=', right_count, '; texts with all labelled entities found=', count, '; wrong predictions=', fail_count)

    df['Manual_Labeling'] = Manual_Labeling
    df['AllEnt_Correction'] = mark_AllEnt
    df.to_excel(Result_Path, index=False)
    print('\nNER test texts=', total, '; labelled entities=', all_source_count, '; correct predictions=', right_count, '; recall=', round(right_count/all_source_count, 2), '; precision=', round(1-(fail_count/all_pred_count), 2))
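The same bookkeeping can be packaged as a small function that is easy to check on toy data. The sketch below is one possible formulation rather than the script above: `score_entities` is a hypothetical name, the sample entities are invented, and the per-text flag here is stricter than the script's, since it requires the predicted set to equal the labelled set exactly.

```python
def score_entities(gold_per_text, pred_per_text):
    """Return (precision, recall, exact_flags) for parallel lists of entity lists."""
    correct = n_gold = n_pred = 0
    exact_flags = []
    for gold, pred in zip(gold_per_text, pred_per_text):
        gold_set, pred_set = set(gold), set(pred)
        correct += len(gold_set & pred_set)   # predictions that match a label
        n_gold += len(gold_set)               # labelled entities
        n_pred += len(pred_set)               # predicted entities
        exact_flags.append(int(gold_set == pred_set))
    # Guard against empty inputs to avoid division by zero.
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall, exact_flags


# Invented (text, label) pairs for three texts.
gold = [
    [("Apple", "ORG"), ("Tim Cook", "PERSON")],
    [("Paris", "GPE")],
    [("Berlin", "GPE")],
]
pred = [
    [("Apple", "ORG")],                      # one entity missed
    [("Paris", "GPE"), ("France", "GPE")],   # one extra prediction
    [("Berlin", "GPE")],                     # exact match
]

precision, recall, exact = score_entities(gold, pred)
print(round(precision, 2), round(recall, 2), exact)  # 0.75 0.75 [0, 0, 1]
```

Precision is the share of predicted entities that are correct, and recall is the share of labelled entities that were found, which matches the two ratios printed at the end of the script.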
Output: [figure: console output of the evaluation script]
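If the NER model does not perform well enough, there are several things to try: 1. increase the number of training iterations; 2. tune the learning rate and the drop (dropout) parameter; 3. add more training data.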
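One way to explore the first two suggestions systematically is a small grid search over the settings. The sketch below is framework-agnostic and makes several assumptions: `grid_search` and `train_and_score` are hypothetical names, the parameter values are arbitrary, and the scoring function shown is a fake stand-in. A real version would train the spaCy model with the given settings and return a validation score such as the recall computed earlier.

```python
from itertools import product


def grid_search(train_and_score, n_iters, learn_rates, drops):
    """Try every combination of settings and return (best_score, best_params)."""
    best_score, best_params = float("-inf"), None
    for n_iter, learn_rate, drop in product(n_iters, learn_rates, drops):
        params = {"n_iter": n_iter, "learn_rate": learn_rate, "drop": drop}
        score = train_and_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def fake_train_and_score(n_iter, learn_rate, drop):
    """Stand-in for real training: peaks at n_iter=30, learn_rate=0.001, drop=0.3."""
    return -abs(n_iter - 30) / 100 - abs(learn_rate - 0.001) * 100 - abs(drop - 0.3)


best_score, best_params = grid_search(
    fake_train_and_score,
    n_iters=[10, 30, 50],
    learn_rates=[0.001, 0.01],
    drops=[0.2, 0.3, 0.5],
)
print(best_params)
```

Scoring each combination on the held-out validation file, rather than on the training data, keeps the comparison honest about how the model generalizes.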