
Natural Language Processing (NLP): Named Entity Recognition (NER) with Deep Learning

        Almost every NLP task depends on a solid corpus. The corpus used for NER in this project is train.txt, which contains 42,000 lines; only the first 15 lines are shown here (the full file can be downloaded from the GitHub address at the end of the article):

played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O

        A brief note on the structure of this corpus: it consists of 42,000 lines arranged in groups of three. In each group, the first line is an English sentence, the second line gives the part-of-speech tag of every word in that sentence (for English POS tagging, see the article 自然语言处理(NLP)之英文单词词性还原 on the IT之一小佬 CSDN blog), and the third line gives the NER annotation; the meaning of these tags is explained later.
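        A couple of lines of Python are enough to sanity-check this three-line grouping (a minimal sketch; the path matches the CORPUS_PATH used in utils.py below, and split() without arguments handles either tab- or space-separated lines):

# Peek at the first word/POS/tag group in the corpus.
with open('./data/train.txt', 'r') as f:
    lines = [line.strip() for line in f]

words, pos_tags, ner_tags = (line.split() for line in lines[:3])
for triple in zip(words, pos_tags, ner_tags):
    print(triple)   # e.g. ('played', 'VBD', 'O')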

Our NER project is named NLP_NER. The role of each file in the project is as follows:

  • utils.py: project configuration and data loading
  • data_processing.py: data exploration
  • Bi_LSTM_Model_training.py: model creation and training
  • Bi_LSTM_Model_predict.py: NER prediction on new sentences

Project Configuration

  The first step is project configuration and data loading, implemented in utils.py. The complete code is as follows:

import pandas as pd
import numpy as np

# file paths for the corpus, the saved Keras model and the pickled dictionaries
CORPUS_PATH = './data/train.txt'
KEYS_MODEL_SAVE_PATH = './data/bi_lstm_ner.h5'
WORD_DICTIONARY_PATH = './data/word_dictionary.pk'
INVERSE_WORD_DICTIONARY_PATH = './data/inverse_word_dictionary.pk'
LABEL_DICTIONARY_PATH = './data/label_dictionary.pk'
OUTPUT_DICTIONARY_PATH = './data/output_dictionary.pk'

CONSTANTS = [
    KEYS_MODEL_SAVE_PATH,
    WORD_DICTIONARY_PATH,
    INVERSE_WORD_DICTIONARY_PATH,
    LABEL_DICTIONARY_PATH,
    OUTPUT_DICTIONARY_PATH
]


# load data from the corpus into a pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)

    # transform the data into matrix format for the neural network
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i - 1]: index[i]]
        sentence_no = np.array([i] * len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)
    input_data = pd.DataFrame(np.concatenate([item for item in input_data]),
                              columns=['word', 'pos', 'tag', 'sent_no'])
    return input_data


if __name__ == '__main__':
    data = load_data()
    print(data)

        This code first sets the corpus file path CORPUS_PATH, the Keras model save path KEYS_MODEL_SAVE_PATH, and the save paths of the four dictionaries that will be used throughout the project (stored as pickle files): WORD_DICTIONARY_PATH, INVERSE_WORD_DICTIONARY_PATH, LABEL_DICTIONARY_PATH and OUTPUT_DICTIONARY_PATH. It then defines the load_data() function, which presents the corpus text as a pandas DataFrame. Running it produces the following output:

            word  pos     tag sent_no
0         played  VBD       O       1
1             on   IN       O       1
2         Monday  NNP       O       1
3              (    (       O       1
4           home   NN       O       1
...          ...  ...     ...     ...
201110        75   CD       O   13997
201111      .409   CD       O   13997
201112        28   CD       O   13997
201113   CENTRAL  NNP  B-MISC   13998
201114  DIVISION  NNP  I-MISC   13998

[201115 rows x 4 columns]

        In this DataFrame, the word column contains the words of the corpus, the pos column their part-of-speech tags, the tag column the NER annotation, and the sent_no column the index of the sentence each word belongs to.
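        For example, the first sentence can be reassembled from the DataFrame like this (a small sketch that assumes utils.py has been set up as above; note that sent_no is stored as a string):

# Rebuild the first sentence from the DataFrame returned by load_data().
from utils import load_data

data = load_data()
first = data[data['sent_no'] == '1']     # sent_no was saved with dtype=str
print(' '.join(first['word']))           # played on Monday ( home team in CAPS ) :
print(' '.join(first['tag']))            # O O O O O O O O O O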

Data Exploration

  The second step is data exploration, i.e. a review of the loaded data (input_data). The complete code (data_processing.py) is as follows:

import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
import matplotlib as mpl

from utils import CONSTANTS, load_data

# set the matplotlib font so that Chinese labels render correctly
mpl.rcParams['font.sans-serif'] = ['SimHei']


# data review
def data_review():
    # load the data
    input_data = load_data()

    # basic review of the data
    sent_num = input_data['sent_no'].astype(int).max()
    print('一共有%s个句子。' % sent_num)

    vocabulary = input_data['word'].unique()
    print('一个有%d个单词。' % len(vocabulary))
    print('前10个单词为:%s' % vocabulary[:11])

    pos_arr = input_data['tag'].unique()
    print('单词的词性列表:%s.' % pos_arr)

    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print('句子长度及出现的频数字典:\n%s.' % dict(Counter(sent_len_list)))

    # bar chart of sentence lengths and their frequencies
    sort_sent_len_dict = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dict]
    sent_count_data = [item[1] for item in sort_sent_len_dict]
    plt.bar(sent_no_data, sent_count_data)
    plt.title('句子长度及出现频数统计图')
    plt.xlabel('句子长度')
    plt.ylabel('句子长度出现的频数')
    plt.savefig('./data/句子长度及出现频数统计图.png')
    plt.close()

    # cumulative distribution function (CDF) of sentence length
    sent_pentage_list = [(count / sent_num) for count in accumulate(sent_count_data)]

    # find the sentence length at the chosen quantile
    quantile = 0.9992
    for length, per in zip(sent_no_data, sent_pentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print('分位点为%s的句子长度为:%d' % (quantile, index))

    # plot the CDF
    plt.plot(sent_no_data, sent_pentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("句子长度累积分布函数图")
    plt.xlabel("句子长度")
    plt.ylabel("句子长度累积频率")
    plt.savefig("./data/句子长度累积分布函数图.png")
    plt.close()


# data processing
def data_processing():
    # load the data
    input_data = load_data()

    # label list and vocabulary
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    # build the dictionaries (index 0 is reserved for padding)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i + 1 for i, label in enumerate(labels)}
    output_dictionary = {i + 1: label for i, label in enumerate(labels)}

    dict_list = [word_dictionary, inverse_word_dictionary, label_dictionary, output_dictionary]

    # save the dictionaries as pickle files
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)


if __name__ == '__main__':
    data_review()

Calling data_review() produces the following output:

一共有13998个句子。
一个有24339个单词。
前10个单词为:['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':' 'American']
单词的词性列表:['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC'
 'sO'].
句子长度及出现的频数字典:
{10: 501, 5: 769, 9: 841, 6: 639, 4: 794, 37: 105, 21: 228, 40: 78, 23: 230, 38: 112, 25: 207, 18: 212, 19: 197, 8: 977, 2: 1141, 41: 74, 20: 221, 11: 395, 7: 999, 30: 183, 34: 141, 16: 225, 13: 339, 15: 275, 3: 620, 29: 214, 22: 221, 14: 291, 31: 202, 26: 224, 33: 167, 24: 210, 27: 188, 42: 63, 39: 98, 17: 229, 1: 177, 35: 130, 36: 119, 12: 316, 32: 167, 48: 19, 51: 8, 28: 199, 46: 19, 52: 9, 47: 22, 44: 42, 43: 51, 113: 1, 49: 15, 45: 39, 50: 16, 58: 2, 69: 1, 59: 2, 53: 5, 66: 1, 71: 1, 72: 1, 54: 4, 55: 9, 57: 2, 62: 2, 67: 1, 124: 1, 80: 1, 56: 2, 60: 3, 78: 1}.
分位点为0.9992的句子长度为:60

        This corpus contains 13,998 sentences, two fewer than the expected 42000 / 3 = 14000 (the loop bounds in load_data() skip the last two three-line groups). It contains 24,339 distinct words, a fairly large vocabulary; note that the words are kept exactly as they appear in the corpus, with no normalization applied (something that could be optimized later). What we need to pay attention to is the NER tag list: ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'B-LOC', 'I-PER', 'I-LOC', 'sO']. The NER task in this project therefore distinguishes four entity classes: PER (person), LOC (location), ORG (organization) and MISC, where B marks the first token of an entity, I marks a continuation token, O marks tokens that do not belong to any entity, and 'sO' is an irregular tag that appears in the corpus.
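        As a concrete illustration of how the B/I/O tags group into entities, a small helper along the following lines (illustrative only, not part of the project code) turns a tagged token sequence into entity spans:

def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) pairs from a BIO-tagged sequence."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current:
            current.append(token)
        else:                       # 'O' or an irregular tag such as 'sO'
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((' '.join(current), current_type))
    return entities

print(decode_bio(['BALTIMORE', '12', 'Oakland', '11'],
                 ['B-ORG', 'O', 'B-ORG', 'O']))
# [('BALTIMORE', 'ORG'), ('Oakland', 'ORG')]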

        Next, let us look at the sentence lengths, which will serve as a reference when we choose the padding length during modelling. The bar chart of sentence lengths and their frequencies looks as follows:

        As the chart shows, sentences are essentially all shorter than 60 tokens, which can also be seen from the printed length-frequency dictionary. Can we then pick a principled cutoff to use as the padding length for the model? Yes: use a quantile of the cumulative distribution function (CDF) of the sentence-length frequencies. Here we choose the 0.9992 quantile, which corresponds to a sentence length of 60, as shown in the CDF plot:
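        An alternative, more direct way to arrive at roughly the same padding length is to take the quantile of the sentence-length distribution with numpy (a sketch; interpolation may make the result differ slightly from the CDF lookup in data_review()):

# Pick the padding length directly from the sentence-length distribution.
import numpy as np
from utils import load_data

input_data = load_data()
sent_len = input_data.groupby('sent_no')['word'].count().values
print(int(np.quantile(sent_len, 0.9992)))   # roughly 60 for this corpus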

        Next comes the data-processing function data_processing(), whose job is to build the word and label dictionaries and save them as pickle files so that they can be loaded directly later on.
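        Once data_processing() has run, the pickled dictionaries can be loaded back and spot-checked like this (a minimal sketch; indices start at 1 because 0 is reserved for padding):

# Spot-check the pickled dictionaries (run after data_processing() has executed).
import pickle
from utils import CONSTANTS

with open(CONSTANTS[1], 'rb') as f:   # word_dictionary.pk
    word_dictionary = pickle.load(f)
with open(CONSTANTS[3], 'rb') as f:   # label_dictionary.pk
    label_dictionary = pickle.load(f)

print(word_dictionary['played'])   # 1 -- 'played' is the first word in the vocabulary
print(label_dictionary['B-ORG'])   # 4, given the tag order shown in the output above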

Model Building

  In the third step we build and train a Bi-LSTM model. The complete Python code (Bi_LSTM_Model_training.py) is as follows:

import pickle
import numpy as np
import pandas as pd

from utils import CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed


# prepare the input data for the model
def input_data_for_model(input_shape):
    # load the data
    input_data = load_data()
    # build and save the dictionaries
    data_processing()

    # load the dictionaries
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)

    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # group the data by sentence and turn words/labels into padded index sequences
    aggregate_function = lambda input: [(word, pos, label) for word, pos, label in
                                        zip(input['word'].values.tolist(),
                                            input['pos'].values.tolist(),
                                            input['tag'].values.tolist())]
    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]

    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary


# define the deep learning model: Bi-LSTM
def create_bi_lstm(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_units, activation=activation, return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model


# model training
def model_train():
    # split the data into a training set and a test set with a 9:1 ratio
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x) * 0.9)
    train_x, train_y = x[0:train_end], np.array(y[0:train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    # model hyperparameters
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    # train the model
    lstm_model = create_bi_lstm(vocab_size, label_size, input_shape, output_dim,
                                n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # save the model
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='./data/lstm_model.png')

    # evaluate the model on the test set, one sentence at a time
    N = test_x.shape[0]  # number of test samples
    avg_accuracy = 0     # average prediction accuracy
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        input_sequences, output_sequences = [], []
        for i in range(0, len(y_predict[0])):
            output_sequences.append(np.argmax(y_predict[0][i]))
            input_sequences.append(np.argmax(test_y[start][i]))
        eval_result = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (eval_result[0], eval_result[1] * 100))
        avg_accuracy += eval_result[1]
        output_sequences = ' '.join([output_dictionary[key] for key in output_sequences if key != 0]).split()
        input_sequences = ' '.join([output_dictionary[key] for key in input_sequences if key != 0]).split()
        output_input_comparison = pd.DataFrame([sentence, output_sequences, input_sequences]).T
        print(output_input_comparison.dropna())
        print('#' * 80)
    avg_accuracy /= N
    print("测试样本的平均预测准确率:%.2f%%." % (avg_accuracy * 100))


if __name__ == '__main__':
    model_train()

        In the code above, input_data_for_model() prepares the data fed to the model; its parameter input_shape is the length to which sentences are padded. The Bi-LSTM model itself is built by create_bi_lstm(), and plot_model() saves its schematic to ./data/lstm_model.png.

        Finally, the model is trained on the prepared data: the original data are split into a training set and a test set at a 9:1 ratio, and training runs for 10 epochs. The resulting layer stack can be inspected as sketched below.
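        The following snippet is only a sketch: it instantiates the network with the sizes reported during data exploration (24,339 words, 10 tags), whereas the real training run obtains vocab_size and label_size from the pickled dictionaries rather than hard-coding them.

# Illustrative only: build the Bi-LSTM with the hyperparameters above and
# the corpus sizes from the data review, then print its structure.
from Bi_LSTM_Model_training import create_bi_lstm

model = create_bi_lstm(vocab_size=24339, label_size=10, input_shape=60,
                       output_dim=20, n_units=100,
                       out_act='softmax', activation='selu')
model.summary()
# Embedding (None, 60, 20) -> Bidirectional LSTM (None, 60, 200)
#   -> TimeDistributed Dense (None, 60, 11): a softmax over 10 tags + padding per time step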

Model Training

  Running the training code above trains the model for 10 epochs in roughly 500 s. The accuracy on the training set exceeds 99%, and the average accuracy on the test set exceeds 93%. Part of the training and test-set evaluation output is shown below:

Epoch 1/10
394/394 [==============================] - 13s 29ms/step - loss: 0.2133 - accuracy: 0.8241
Epoch 2/10
394/394 [==============================] - 11s 29ms/step - loss: 0.0603 - accuracy: 0.9191
Epoch 3/10
394/394 [==============================] - 11s 29ms/step - loss: 0.0292 - accuracy: 0.9670
Epoch 4/10
394/394 [==============================] - 12s 30ms/step - loss: 0.0157 - accuracy: 0.9840
Epoch 5/10
394/394 [==============================] - 12s 31ms/step - loss: 0.0093 - accuracy: 0.9904
Epoch 6/10
394/394 [==============================] - 12s 31ms/step - loss: 0.0063 - accuracy: 0.9935
Epoch 7/10
394/394 [==============================] - 12s 30ms/step - loss: 0.0043 - accuracy: 0.9955
Epoch 8/10
394/394 [==============================] - 12s 29ms/step - loss: 0.0032 - accuracy: 0.9964
Epoch 9/10
394/394 [==============================] - 11s 29ms/step - loss: 0.0022 - accuracy: 0.9978
Epoch 10/10
394/394 [==============================] - 12s 30ms/step - loss: 0.0014 - accuracy: 0.9988
1/1 [==============================] - 0s 337ms/step - loss: 0.1548 - accuracy: 0.9375
Test Accuracy: loss = 0.154795 accuracy = 93.75%

The model's recognition performance on the original data is quite good.
  After training, the ./data directory holds all of the artifacts produced so far:
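        These can be listed with a few lines of Python (a sketch; the expected file names come from the paths defined in utils.py, data_processing.py and the training script):

# List the artifacts produced so far.
import os

for name in sorted(os.listdir('./data')):
    print(name)
# Expected, among others: train.txt, bi_lstm_ner.h5, lstm_model.png,
# word_dictionary.pk, inverse_word_dictionary.pk,
# label_dictionary.pk, output_dictionary.pk, and the two PNG plots from data_review()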

Model Prediction

  Finally comes what may be the most exciting part of the whole project: testing the model's recognition on new data. The complete Python code for predicting on new sentences (Bi_LSTM_Model_predict.py) is as follows:

# Import the necessary modules
import pickle
import numpy as np

from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize

# load the dictionaries
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # preprocess the input sentence
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # load the trained model
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # predict the NER tags
    y_predict = lstm_model.predict(x)
    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))
    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # drop the tokens whose NER tag is O
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # print the recognized named entities
    print("NER识别结果:")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i + 1
                while end <= len(ner_reg_list) - 1 and ner_reg_list[end][1].startswith('I'):
                    end += 1
                ner_type = item[1].split('-')[1]
                ner_type_dict = {'PER': 'PERSON: ',
                                 'LOC': 'LOCATION: ',
                                 'ORG': 'ORGANIZATION: ',
                                 'MISC': 'MISC: '
                                 }
                print(ner_type_dict[ner_type], ' '.join([item[0] for item in ner_reg_list[i:end]]))
    else:
        print("模型并未识别任何有效命名实体。")
except KeyError as err:
    print("您输入的句子有单词不在词汇表中,请重新输入!")
    print("不在词汇表中的单词为:%s." % err)

The output is as follows:

['New', 'York', 'is', 'the', 'biggest', 'city', 'in', 'America', '.']
['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
NER识别结果:
LOCATION: New York
LOCATION: America