
NLP Primer (5): Implementing Named Entity Recognition (NER) with Deep Learning


Preface

  In the article NLP Primer (4): Named Entity Recognition (NER), I introduced two off-the-shelf NER tools: NLTK and Stanford NLP. In this article, we will learn how to implement NER step by step ourselves with deep learning tools; if you stick with it to the end, it should be well worth your while.
  OK, without further ado, let's get to it.
  Almost all NLP work depends on a solid corpus. The corpus used in this project is as follows (file name train.txt, 42,000 lines in total; only the first 15 lines are shown here, and the full file can be downloaded from the GitHub address at the end of the article):

played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
......

A quick note on the structure of this corpus: it contains 42,000 lines, in groups of three. Within each group, the first line is an English sentence, the second line gives the part-of-speech tag of each word (for English POS tags, see the article NLP Primer (3): Lemmatization), and the third line is the NER annotation, whose meaning will be explained later.
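To make the three-line grouping concrete, here is a minimal sketch of how such triples can be parsed, using a small hypothetical sample in place of the real train.txt (this is an illustration, not the project's actual loader):

```python
# A minimal sketch of parsing the three-line corpus format described above:
# line 1 = words, line 2 = POS tags, line 3 = NER tags (space-separated).
sample = """American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O"""

lines = sample.splitlines()
sentences = []
for i in range(0, len(lines), 3):
    words = lines[i].split()
    pos_tags = lines[i + 1].split()
    ner_tags = lines[i + 2].split()
    # one (word, pos, tag) triple per token
    sentences.append(list(zip(words, pos_tags, ner_tags)))

print(sentences[0])
# [('American', 'NNP', 'B-MISC'), ('League', 'NNP', 'I-MISC')]
```

The project's own load_data() below does essentially this, but builds a pandas DataFrame and adds a sentence number column.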
  Our NER project is named DL_4_NER and is structured as follows:

Figure: DL_4_NER project structure

The role of each file in the project:

  • utils.py: project configuration and data loading
  • data_processing.py: data exploration
  • Bi_LSTM_Model_training.py: model creation and training
  • Bi_LSTM_Model_predict.py: NER prediction on new sentences

  Next, I will walk through the project step by step alongside the code files. Once all the steps are covered, the project is complete, and you will know how to implement named entity recognition (NER) with deep learning.
  Let's begin!

Project Configuration

  The first step is project configuration and data loading, implemented in utils.py. The complete code is as follows:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

# basic settings for the DL_4_NER project
BASE_DIR = "F://NERSystem"
CORPUS_PATH = "%s/train.txt" % BASE_DIR

KERAS_MODEL_SAVE_PATH = '%s/Bi-LSTM-4-NER.h5' % BASE_DIR
WORD_DICTIONARY_PATH = '%s/word_dictionary.pk' % BASE_DIR
INVERSE_WORD_DICTIONARY_PATH = '%s/inverse_word_dictionary.pk' % BASE_DIR
LABEL_DICTIONARY_PATH = '%s/label_dictionary.pk' % BASE_DIR
OUTPUT_DICTIONARY_PATH = '%s/output_dictionary.pk' % BASE_DIR

CONSTANTS = [
    KERAS_MODEL_SAVE_PATH,
    INVERSE_WORD_DICTIONARY_PATH,
    WORD_DICTIONARY_PATH,
    LABEL_DICTIONARY_PATH,
    OUTPUT_DICTIONARY_PATH
]

# load data from the corpus into a pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)

    # transform the data into matrix format for the neural network;
    # note this loop drops the last two sentence groups (see the counts below)
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i-1]:index[i]]
        sentence_no = np.array([i] * len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)
    input_data = pd.DataFrame(np.concatenate([item for item in input_data]),
                              columns=['word', 'pos', 'tag', 'sent_no'])
    return input_data

This code first sets the corpus file path CORPUS_PATH, the Keras model save path KERAS_MODEL_SAVE_PATH, and the save paths (as pickle files) of the four dictionaries used throughout the project: WORD_DICTIONARY_PATH, INVERSE_WORD_DICTIONARY_PATH, LABEL_DICTIONARY_PATH, and OUTPUT_DICTIONARY_PATH. Then comes the load_data() function, which loads the corpus text into a pandas DataFrame. The first 30 rows of this DataFrame are shown below:

    word        pos  tag     sent_no
0   played      VBD  O       1
1   on          IN   O       1
2   Monday      NNP  O       1
3   (           (    O       1
4   home        NN   O       1
5   team        NN   O       1
6   in          IN   O       1
7   CAPS        NNP  O       1
8   )           )    O       1
9   :           :    O       1
10  American    NNP  B-MISC  2
11  League      NNP  I-MISC  2
12  Cleveland   NNP  B-ORG   3
13  2           CD   O       3
14  DETROIT     NNP  B-ORG   3
15  1           CD   O       3
16  BALTIMORE   VB   B-ORG   4
17  12          CD   O       4
18  Oakland     NNP  B-ORG   4
19  11          CD   O       4
20  (           (    O       4
21  10          CD   O       4
22  innings     NN   O       4
23  )           )    O       4
24  TORONTO     TO   B-ORG   5
25  5           CD   O       5
26  Minnesota   NNP  B-ORG   5
27  3           CD   O       5
28  Milwaukee   NNP  B-ORG   6
29  3           CD   O       6

In this DataFrame, the word column holds the words of the corpus, the pos column holds each word's part-of-speech tag, the tag column holds the NER annotation, and the sent_no column records which sentence the word belongs to.

Data Exploration

  The second step is data exploration: reviewing the loaded data (input_data). The complete code (data_processing.py) is as follows:

# -*- coding: utf-8 -*-
import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
from utils import BASE_DIR, CONSTANTS, load_data

# data review
def data_review():
    # load the data
    input_data = load_data()

    # basic statistics
    sent_num = input_data['sent_no'].astype(int).max()
    print("There are %s sentences in total.\n" % sent_num)
    vocabulary = input_data['word'].unique()
    print("There are %d words in total." % len(vocabulary))
    print("The first 10 words are: %s.\n" % vocabulary[:10])
    pos_arr = input_data['pos'].unique()
    print("POS tags of the words: %s.\n" % pos_arr)
    ner_tag_arr = input_data['tag'].unique()
    print("NER tag list: %s.\n" % ner_tag_arr)
    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print("Sentence length -> frequency dictionary:\n%s." % dict(Counter(sent_len_list)))

    # bar chart of sentence lengths and their frequencies
    sort_sent_len_dist = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dist]
    sent_count_data = [item[1] for item in sort_sent_len_dist]
    plt.bar(sent_no_data, sent_count_data)
    plt.title("Sentence length frequency")
    plt.xlabel("Sentence length")
    plt.ylabel("Frequency")
    plt.savefig("%s/sentence_length_frequency.png" % BASE_DIR)
    plt.close()

    # cumulative distribution function (CDF) of sentence lengths
    sent_pentage_list = [(count / sent_num) for count in accumulate(sent_count_data)]

    # find the sentence length at the given quantile
    quantile = 0.9992
    for length, per in zip(sent_no_data, sent_pentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print("\nSentence length at quantile %s: %d." % (quantile, index))

    # plot the CDF
    plt.plot(sent_no_data, sent_pentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("CDF of sentence length")
    plt.xlabel("Sentence length")
    plt.ylabel("Cumulative frequency")
    plt.savefig("%s/sentence_length_cdf.png" % BASE_DIR)
    plt.close()

# data processing
def data_processing():
    # load the data
    input_data = load_data()

    # label list and vocabulary
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    # dictionaries (index 0 is reserved for padding, hence the i + 1 offsets)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i + 1 for i, label in enumerate(labels)}
    output_dictionary = {i + 1: label for i, label in enumerate(labels)}

    dict_list = [word_dictionary, inverse_word_dictionary, label_dictionary, output_dictionary]

    # save the dictionaries as pickle files
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)

# data_review()

Calling the data_review() function produces the following output:

There are 13998 sentences in total.

There are 24339 words in total.
The first 10 words are: ['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':'].

POS tags of the words: ['VBD' 'IN' 'NNP' '(' 'NN' ')' ':' 'CD' 'VB' 'TO' 'NNS' ',' 'VBP' 'VBZ'
 '.' 'VBG' 'PRP$' 'JJ' 'CC' 'JJS' 'RB' 'DT' 'VBN' '"' 'PRP' 'WDT' 'WRB'
 'MD' 'WP' 'POS' 'JJR' 'WP$' 'RP' 'NNPS' 'RBS' 'FW' '$' 'RBR' 'EX' "''"
 'PDT' 'UH' 'SYM' 'LS' 'NN|SYM'].

NER tag list: ['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC' 'sO'].

Sentence length -> frequency dictionary:
{1: 177, 2: 1141, 3: 620, 4: 794, 5: 769, 6: 639, 7: 999, 8: 977, 9: 841, 10: 501, 11: 395, 12: 316, 13: 339, 14: 291, 15: 275, 16: 225, 17: 229, 18: 212, 19: 197, 20: 221, 21: 228, 22: 221, 23: 230, 24: 210, 25: 207, 26: 224, 27: 188, 28: 199, 29: 214, 30: 183, 31: 202, 32: 167, 33: 167, 34: 141, 35: 130, 36: 119, 37: 105, 38: 112, 39: 98, 40: 78, 41: 74, 42: 63, 43: 51, 44: 42, 45: 39, 46: 19, 47: 22, 48: 19, 49: 15, 50: 16, 51: 8, 52: 9, 53: 5, 54: 4, 55: 9, 56: 2, 57: 2, 58: 2, 59: 2, 60: 3, 62: 2, 66: 1, 67: 1, 69: 1, 71: 1, 72: 1, 78: 1, 80: 1, 113: 1, 124: 1}.

Sentence length at quantile 0.9992: 60.

This corpus contains 13,998 sentences, two fewer than the expected 42000/3 = 14,000. There are 24,339 words, a fairly large vocabulary; note that no preprocessing is applied here, and the words are kept exactly as they appear in the corpus (something to optimize later). For the POS tags, see the article NLP Primer (3): Lemmatization. Note that the NER tag list is ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'B-LOC', 'I-PER', 'I-LOC', 'sO'], so this project distinguishes four entity types: PER (person), LOC (location), ORG (organization), and MISC. The B- prefix marks the beginning of an entity, I- marks its continuation, and O marks a token outside any entity; 'sO' is a rare, irregular tag that appears in this particular corpus.
  Next, let's look at sentence lengths, which will guide the padding length used later in modeling. The bar chart of sentence lengths and their frequencies is shown below:

Figure: sentence length frequency

As the chart shows, sentence lengths are mostly below 60, which is also visible in the length-frequency dictionary above. Can we pick a principled padding length for the model? Yes: use a quantile of the cumulative distribution function of the frequencies. Here we choose the 0.9992 quantile, which corresponds to a sentence length of 60, as shown below:

Figure: CDF of sentence length
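The quantile lookup just described can be sketched without pandas or matplotlib. The sentence lengths below are made-up stand-ins for the corpus statistics, and the target quantile is 0.9 rather than 0.9992:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sentence lengths standing in for the corpus statistics.
sent_len_list = [2, 3, 3, 5, 5, 5, 8, 8, 9, 60]
length_counts = sorted(Counter(sent_len_list).items())
lengths = [l for l, _ in length_counts]
counts = [c for _, c in length_counts]
total = sum(counts)

# Cumulative share of sentences at or below each length.
cdf = [c / total for c in accumulate(counts)]

# Smallest length whose cumulative share reaches the target quantile.
quantile = 0.9
cut = next(l for l, p in zip(lengths, cdf) if p >= quantile)
print(cut)  # 9
```

Using `>=` rather than an exact `round(per, 4) == quantile` match (as the project's loop does) is a bit more robust: it still finds a cutoff when no cumulative share lands exactly on the chosen quantile.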

  Next comes the data-processing function data_processing(), whose main job is to build the word and label dictionaries and save them as pickle files for direct reuse later.
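As a quick illustration of what data_processing() pickles, here is the same dictionary construction on a toy vocabulary (the words and labels below are hypothetical):

```python
# Toy version of the four dictionaries data_processing() builds and saves.
# Index 0 is reserved for padding, hence the i + 1 offsets.
vocabulary = ['played', 'on', 'Monday']
labels = ['O', 'B-ORG', 'I-ORG']

word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
label_dictionary = {label: i + 1 for i, label in enumerate(labels)}
output_dictionary = {i + 1: label for i, label in enumerate(labels)}

print(word_dictionary)       # {'played': 1, 'on': 2, 'Monday': 3}
print(output_dictionary[2])  # B-ORG
```

word_dictionary maps words to integer indices for the model's input; inverse_word_dictionary and output_dictionary invert those mappings so that predictions can be turned back into words and tag names.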

Modeling

  In the third step, we build and train a Bi-LSTM model. The complete Python code (Bi_LSTM_Model_training.py) is as follows:

# -*- coding: utf-8 -*-
import pickle
import numpy as np
import pandas as pd
from utils import BASE_DIR, CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed

# prepare the model's input data
def input_data_for_model(input_shape):
    # load the data
    input_data = load_data()
    # build and save the dictionaries
    data_processing()
    # load the dictionaries
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)
    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # group the rows into (word, pos, tag) triples per sentence
    aggregate_function = lambda sub_df: [(word, pos, label) for word, pos, label in
                                         zip(sub_df['word'].values.tolist(),
                                             sub_df['pos'].values.tolist(),
                                             sub_df['tag'].values.tolist())]
    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]
    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]
    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary

# define the deep learning model: Bi-LSTM
def create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_units, activation=activation,
                                 return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# model training
def model_train():
    # split the dataset into training and test sets at a 9:1 ratio
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x) * 0.9)
    train_x, train_y = x[0:train_end], np.array(y[0:train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    # model hyperparameters
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    # train the model
    lstm_model = create_Bi_LSTM(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # save the model
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='%s/LSTM_model.png' % BASE_DIR)

    # evaluate on the test set, one sentence at a time
    N = test_x.shape[0]  # number of test samples
    avg_accuracy = 0     # running sum of per-sample accuracy
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        gold_sequences, pred_sequences = [], []
        for i in range(0, len(y_predict[0])):
            pred_sequences.append(np.argmax(y_predict[0][i]))
            gold_sequences.append(np.argmax(test_y[start][i]))
        score = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (score[0], score[1] * 100))
        avg_accuracy += score[1]
        pred_sequences = ' '.join([output_dictionary[key] for key in pred_sequences if key != 0]).split()
        gold_sequences = ' '.join([output_dictionary[key] for key in gold_sequences if key != 0]).split()
        pred_gold_comparison = pd.DataFrame([sentence, pred_sequences, gold_sequences]).T
        print(pred_gold_comparison.dropna())
        print('#' * 80)
    avg_accuracy /= N
    print("Average prediction accuracy on the test samples: %.2f%%." % (avg_accuracy * 100))

model_train()

In the code above, input_data_for_model() prepares the data that enters the model; its argument input_shape is the length to which sentences are padded. Then create_Bi_LSTM() builds the Bi-LSTM model, illustrated below:

Figure: Bi-LSTM model architecture

Finally, the model is trained on the input data, with the original data split into a training set and a test set at a 9:1 ratio, for 10 epochs.
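Before moving on, it may help to see, in plain Python with no Keras dependency, what the post-padding and one-hot steps do to a single short sentence (the indices below are made up for illustration):

```python
# A dependency-free sketch of what pad_sequences(padding='post') and
# to_categorical do to one sentence before it enters the Bi-LSTM.
input_shape = 5  # fixed padded length (60 in the real project)
label_size = 3   # number of distinct NER tags in this toy example

x = [3, 1, 7]    # word indices of a 3-word sentence
y = [1, 1, 2]    # NER label indices for the same sentence

# 'post' padding: append zeros up to the fixed length.
x_padded = x + [0] * (input_shape - len(x))
y_padded = y + [0] * (input_shape - len(y))

# One-hot encode each label over label_size + 1 classes (class 0 = padding).
y_onehot = [[1.0 if k == label else 0.0 for k in range(label_size + 1)]
            for label in y_padded]

print(x_padded)     # [3, 1, 7, 0, 0]
print(y_onehot[0])  # [0.0, 1.0, 0.0, 0.0]
```

This is why the Embedding layer uses vocab_size + 1 inputs with mask_zero=True, and why the output Dense layer has label_size + 1 units: index 0 is the padding class on both sides.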

Model Training

  Running the training code above for 10 epochs takes roughly 500 s. Accuracy on the training set exceeds 99%, and average accuracy on the test set exceeds 95%. Here are the predictions on the last few test samples:

...... (earlier output omitted)
Test Accuracy: loss = 0.000986 accuracy = 100.00%
          0      1      2
0   Cardiff  B-ORG  B-ORG
1         1      O      O
2  Brighton  B-ORG  B-ORG
3         0      O      O
################################################################################
1/1 [==============================] - 0s 10ms/step
Test Accuracy: loss = 0.000274 accuracy = 100.00%
          0      1      2
0  Carlisle  B-ORG  B-ORG
1         0      O      O
2      Hull  B-ORG  B-ORG
3         0      O      O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.000479 accuracy = 100.00%
           0      1      2
0    Chester  B-ORG  B-ORG
1          1      O      O
2  Cambridge  B-ORG  B-ORG
3          1      O      O
################################################################################
1/1 [==============================] - 0s 9ms/step
Test Accuracy: loss = 0.003092 accuracy = 100.00%
            0      1      2
0  Darlington  B-ORG  B-ORG
1           4      O      O
2     Swansea  B-ORG  B-ORG
3           1      O      O
################################################################################
1/1 [==============================] - 0s 8ms/step
Test Accuracy: loss = 0.000705 accuracy = 100.00%
             0      1      2
0       Exeter  B-ORG  B-ORG
1            2      O      O
2  Scarborough  B-ORG  B-ORG
3            2      O      O
################################################################################
Average prediction accuracy on the test samples: 95.55%.

  The model's recognition performance on the original data is quite decent.
  After training, BASE_DIR contains the following files:

Figure: files in BASE_DIR after training

Model Prediction

  Finally comes what may be the most exciting part of the whole project: testing the model's recognition on new data. The complete Python code for predicting on new sentences (Bi_LSTM_Model_predict.py) is as follows:

# -*- coding: utf-8 -*-
# Named entity recognition on new data

# import the necessary modules
import pickle
import numpy as np
from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize

# load the dictionaries
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # data preprocessing
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # load the model
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # model prediction
    y_predict = lstm_model.predict(x)
    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))
    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # drop tokens whose NER tag is O
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # print the model's NER results
    print("NER results:")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i + 1
                while end <= len(ner_reg_list) - 1 and ner_reg_list[end][1].startswith('I'):
                    end += 1
                ner_type = item[1].split('-')[1]
                ner_type_dict = {'PER': 'PERSON: ',
                                 'LOC': 'LOCATION: ',
                                 'ORG': 'ORGANIZATION: ',
                                 'MISC': 'MISC: '
                                 }
                print(ner_type_dict[ner_type],
                      ' '.join([item[0] for item in ner_reg_list[i:end]]))
    else:
        print("The model did not recognize any named entities.")
except KeyError as err:
    print("The input sentence contains a word outside the vocabulary; please try another sentence!")
    print("Out-of-vocabulary word: %s." % err)

The output is:

['New', 'York', 'is', 'the', 'biggest', 'city', 'in', 'America', '.']
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: New York
LOCATION: America
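The merging of B-/I- tags into whole entities, which the prediction script performs inline, can be isolated as a small standalone function (a sketch, not the exact code above):

```python
# A small sketch of the B-/I- grouping logic used to merge tagged tokens
# into complete entities, independent of the Keras model.
def bio_to_entities(words, tags):
    entities = []
    i = 0
    while i < len(tags):
        if tags[i].startswith('B-'):
            ent_type = tags[i].split('-')[1]
            j = i + 1
            # extend while the following tokens continue the same entity
            while j < len(tags) and tags[j] == 'I-' + ent_type:
                j += 1
            entities.append((ent_type, ' '.join(words[i:j])))
            i = j
        else:
            i += 1
    return entities

words = ['New', 'York', 'is', 'in', 'America', '.']
tags = ['B-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'O']
print(bio_to_entities(words, tags))
# [('LOC', 'New York'), ('LOC', 'America')]
```

Unlike the inline loop in the script, this version walks over the full tag sequence (O tags included), so two adjacent entities of the same type are never merged by accident.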

  Next, let's test three sentences of my own:

The input:

sent = 'James is a world famous actor, whose home is in London.'

The output:

['James', 'is', 'a', 'world', 'famous', 'actor', ',', 'whose', 'home', 'is', 'in', 'London', '.']
['B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER results:
PERSON: James
LOCATION: London

The input:

sent = 'Oxford is in England, Jack is from here.'

The output:

['Oxford', 'is', 'in', 'England', ',', 'Jack', 'is', 'from', 'here', '.']
['B-PER', 'O', 'O', 'B-LOC', 'O', 'B-PER', 'O', 'O', 'O', 'O']
NER results:
PERSON: Oxford
LOCATION: England
PERSON: Jack

The input:

sent = 'I love Shanghai.'

The output:

['I', 'love', 'Shanghai', '.']
['O', 'O', 'B-LOC', 'O']
NER results:
LOCATION: Shanghai

In the examples above, only Oxford was recognized poorly: the model tagged it as PERSON, whereas in this sentence it should be a LOCATION (the city; it would be an ORGANIZATION only when referring to the university).

  Next are three sentences taken from CNN and Wikipedia:

The input:

sent = "the US runs the risk of a military defeat by China or Russia"

The output:

['the', 'US', 'runs', 'the', 'risk', 'of', 'a', 'military', 'defeat', 'by', 'China', 'or', 'Russia']
['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC']
NER results:
LOCATION: US
LOCATION: China
LOCATION: Russia

The input:

sent = "Home to the headquarters of the United Nations, New York is an important center for international diplomacy."

The output:

['Home', 'to', 'the', 'headquarters', 'of', 'the', 'United', 'Nations', ',', 'New', 'York', 'is', 'an', 'important', 'center', 'for', 'international', 'diplomacy', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
NER results:
ORGANIZATION: United Nations
LOCATION: New York

The input:

sent = "The United States is a founding member of the United Nations, World Bank, International Monetary Fund."

The output:

['The', 'United', 'States', 'is', 'a', 'founding', 'member', 'of', 'the', 'United', 'Nations', ',', 'World', 'Bank', ',', 'International', 'Monetary', 'Fund', '.']
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O']
NER results:
LOCATION: United States
ORGANIZATION: United Nations
ORGANIZATION: World Bank
ORGANIZATION: International Monetary Fund

  All three of these examples are recognized correctly.

Summary

  That just about wraps up the project, so a brief summary is in order.
  First, the strengths. This project lets you implement NER step by step from scratch: beyond the corpus itself, you now know the steps needed to build an NER system and have a more concrete understanding of deep learning models and their application, so the benefit is clear. Bear in mind, though, that in real-world work, preparing the corpus is by far the most time-consuming part, often 90% or more of the effort; a good corpus is the prerequisite for everything else.
  Now the weaknesses. First, the corpus is not very large; about 14,000 sentences is workable, but no text preprocessing is applied, so some inflected word forms may never make it into the vocabulary. Second, there is no handling of unseen words: as soon as a sentence contains a word outside the vocabulary, the model cannot process it, which needs to be fixed later. Third, the padding length is 60; for input sentences longer than 60 tokens, everything beyond that point cannot be recognized.
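For the second weakness, one common remedy (not implemented in this project) is to reserve a dedicated unknown-word index instead of letting the lookup raise a KeyError; the vocabulary and index choice below are hypothetical:

```python
# Sketch of out-of-vocabulary handling via a reserved UNK index.
word_dictionary = {'New': 1, 'York': 2, 'is': 3}  # toy vocabulary
UNK_INDEX = len(word_dictionary) + 1              # hypothetical choice

def encode(tokens, word_dict, unk_index=UNK_INDEX):
    # map known words to their indices and unknown words to unk_index
    return [word_dict.get(tok, unk_index) for tok in tokens]

print(encode(['New', 'York', 'rocks'], word_dictionary))
# [1, 2, 4]
```

To use this with the model, the Embedding layer's input_dim would also need one extra slot for the UNK index (vocab_size + 2 instead of vocab_size + 1).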
  So there is plenty of follow-up work to do; building a Chinese NER system is also worth considering.
  The project has been uploaded to GitHub at https://github.com/percent4/DL_4_NER . Feel free to take a look!


References

  1. Book: Applied Natural Language Processing with Python, Taweh Beysolow II
  2. Website: https://github.com/Apress/applied-natural-language-processing-w-python
  3. Website: NLP Primer (4): Named Entity Recognition (NER): https://www.jianshu.com/p/16e1f6a7aaef

Reposted from: https://www.cnblogs.com/jclian91/p/9970281.html
