
Getting Started with NLP (4): Named Entity Recognition (NER)


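This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).
Named entity recognition (NER) is an important building block for applications such as information extraction, question answering, syntactic parsing, and machine translation, and it plays a significant role in making NLP techniques practical. In general, the task is to identify the named entities in a text, which fall into three major classes (entity, time, and number) and seven subcategories (person, organization, location, time, date, money, and percent).
As a simple example, running NER on the sentence “小明早上8点去学校上课。” ("Xiao Ming goes to school for class at 8 in the morning.") should extract the following information: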

Person: 小明 (Xiao Ming); Time: 早上8点 (8 in the morning); Location: 学校 (school).
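The grouping of the seven subcategories into three major classes can be written down as a small lookup table. This is only an illustrative sketch: the names `NER_CLASSES` and `major_class` are made up for this example, and the entity list is typed in by hand from the sentence above rather than produced by a model.

```python
# Hypothetical lookup table: the seven subcategories grouped under the
# three major classes described above.
NER_CLASSES = {
    "entity": ["person", "organization", "location"],
    "time": ["time", "date"],
    "number": ["money", "percent"],
}


def major_class(subcategory):
    """Return the major class that a subcategory belongs to."""
    for major, subcategories in NER_CLASSES.items():
        if subcategory in subcategories:
            return major
    raise KeyError(f"unknown subcategory: {subcategory}")


# Entities from the example sentence, written out by hand (not model output).
example_entities = [
    ("Xiao Ming", "person"),
    ("8 in the morning", "time"),
    ("school", "location"),
]

for text, subcategory in example_entities:
    print(f"{text}: {subcategory} ({major_class(subcategory)} class)")
```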

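This article introduces a few tools for NER. If the opportunity arises, later posts will try implementing NER with HMMs, CRFs, or deep learning.
First, let's look at how NLTK and Stanford NLP categorize named entities, as shown in the figure below: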

Figure: named entity categories in NLTK and Stanford NLP

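In the figure above, LOCATION and GPE overlap. GPE usually denotes a geo-political entity, such as a city, state, country, or continent. LOCATION covers these as well, and can also denote natural features such as famous mountains and rivers. FACILITY usually denotes a well-known monument or man-made artifact.
Below we use two tools to perform NER: NLTK and Stanford NLP.
First, NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows: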

FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.

The Python code for NER is as follows:

    import re

    import nltk
    import pandas as pd


    def parse_document(document):
        """Split a document into a list of stripped sentences."""
        # check the type first, so a non-string input raises the intended error
        if not isinstance(document, str):
            raise ValueError('Document is not string!')
        # replace line breaks with spaces so sentences are not cut mid-line
        document = re.sub('\n', ' ', document).strip()
        sentences = nltk.sent_tokenize(document)
        return [sentence.strip() for sentence in sentences]


    # sample document
    text = """
    FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
    Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
    membership now comprises 211 national associations. Member countries must each also be members of one of
    the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
    and the Caribbean, Oceania, and South America.
    """

    # tokenize sentences
    sentences = parse_document(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

    # tag sentences and use nltk's Named Entity Chunker
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]

    # extract all named entities
    named_entities = []
    for ne_tagged_sentence in ne_chunked_sents:
        for tagged_tree in ne_tagged_sentence:
            # extract only chunks having NE labels (plain tokens are tuples)
            if hasattr(tagged_tree, 'label'):
                entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
                entity_type = tagged_tree.label()  # get NE category
                named_entities.append((entity_name, entity_type))

    # get unique named entities
    named_entities = list(set(named_entities))

    # store named entities in a data frame
    entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

    # display results
    print(entity_frame)

The output is as follows:

            Entity Name   Entity Type
    0              FIFA  ORGANIZATION
    1   Central America  ORGANIZATION
    2           Belgium           GPE
    3         Caribbean      LOCATION
    4              Asia           GPE
    5            France           GPE
    6           Oceania           GPE
    7           Germany           GPE
    8     South America           GPE
    9           Denmark           GPE
    10           Zürich           GPE
    11           Africa        PERSON
    12           Sweden           GPE
    13      Netherlands           GPE
    14            Spain           GPE
    15      Switzerland           GPE
    16            North           GPE
    17           Europe           GPE

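As we can see, NLTK does a reasonably good job overall: it recognizes FIFA as an ORGANIZATION, and Belgium and Asia as GPE. Some results are less satisfactory, though. It labels Central America as ORGANIZATION when it should be GPE, and it labels Africa as PERSON when it should also be GPE.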

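Next, we try the Stanford NLP tools, mainly the Stanford NER tagger. Before using it, you need to install Java (usually the JDK) on your machine and add it to the system path. You also need to download the English NER package, stanford-ner-2018-10-16.zip (172 MB); the download address is https://nlp.stanford.edu/soft... On the author's machine, Java is located at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the Stanford NER zip file is extracted to the folder E://stanford-ner-2018-10-16, as shown in the figure below: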

Figure: the E://stanford-ner-2018-10-16 folder

The classifiers folder contains the following files:

Figure: the E://stanford-ner-2018-10-16/classifiers folder

The models differ in the entity classes they recognize:

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
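One way to use this list is to pick the smallest model that covers the labels you need. The sketch below is hypothetical: `pick_model` is a helper invented for illustration, and only the 7-class file name appears in this article; the 3-class and 4-class file names are assumptions that should be checked against the contents of your own classifiers folder.

```python
# Label sets taken from the list above. Only the 7-class file name is used in
# this article; the other two file names are assumptions.
MODELS = [
    ("english.all.3class.distsim.crf.ser.gz",
     {"LOCATION", "PERSON", "ORGANIZATION"}),
    ("english.conll.4class.distsim.crf.ser.gz",
     {"LOCATION", "PERSON", "ORGANIZATION", "MISC"}),
    ("english.muc.7class.distsim.crf.ser.gz",
     {"LOCATION", "PERSON", "ORGANIZATION", "MONEY", "PERCENT", "DATE", "TIME"}),
]


def pick_model(needed_labels):
    """Return the file name of the smallest model covering all needed labels."""
    needed = {label.upper() for label in needed_labels}
    candidates = [(len(labels), name) for name, labels in MODELS if needed <= labels]
    if not candidates:
        raise ValueError(f"no single model covers {sorted(needed)}")
    # min() compares the label-set size first, so the smallest model wins
    return min(candidates)[1]


print(pick_model({"person", "location"}))
print(pick_model({"date", "location"}))
```

The sample text in this article contains a year (1904), which is why the 7-class model is the one loaded in the code that follows.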

Stanford NER can be used from Python. The complete code is as follows:

    import os
    import re

    import nltk
    import pandas as pd
    from nltk.tag import StanfordNERTagger


    def parse_document(document):
        """Split a document into a list of stripped sentences."""
        # check the type first, so a non-string input raises the intended error
        if not isinstance(document, str):
            raise ValueError('Document is not string!')
        # replace line breaks with spaces so sentences are not cut mid-line
        document = re.sub('\n', ' ', document).strip()
        sentences = nltk.sent_tokenize(document)
        return [sentence.strip() for sentence in sentences]


    # sample document
    text = """
    FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
    Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
    membership now comprises 211 national associations. Member countries must each also be members of one of
    the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
    and the Caribbean, Oceania, and South America.
    """

    sentences = parse_document(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

    # set java path in environment variables
    java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
    os.environ['JAVAHOME'] = java_path

    # load stanford NER
    sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                           path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

    # tag sentences
    ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]

    # extract named entities
    named_entities = []
    for sentence in ne_annotated_sentences:
        temp_entity_name = ''
        temp_named_entity = None
        for term, tag in sentence:
            # get terms with NE tags
            if tag != 'O':
                temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
                temp_named_entity = (temp_entity_name, tag)  # get NE and its category
            else:
                if temp_named_entity:
                    named_entities.append(temp_named_entity)
                    temp_entity_name = ''
                    temp_named_entity = None
        # keep an entity that runs up to the end of the sentence
        if temp_named_entity:
            named_entities.append(temp_named_entity)

    # get unique named entities
    named_entities = list(set(named_entities))

    # store named entities in a data frame
    entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

    # display results
    print(entity_frame)

The output is as follows:

                    Entity Name   Entity Type
    0                      1904          DATE
    1                   Denmark      LOCATION
    2                     Spain      LOCATION
    3   North & Central America  ORGANIZATION
    4             South America      LOCATION
    5                   Belgium      LOCATION
    6                    Zürich      LOCATION
    7           the Netherlands      LOCATION
    8                    France      LOCATION
    9                 Caribbean      LOCATION
    10                   Sweden      LOCATION
    11                  Oceania      LOCATION
    12                     Asia      LOCATION
    13                     FIFA  ORGANIZATION
    14                   Europe      LOCATION
    15                   Africa      LOCATION
    16              Switzerland      LOCATION
    17                  Germany      LOCATION

As we can see, Stanford NER does better on this sample: it recognizes Africa as LOCATION and 1904 as a date (DATE), which NLTK did not pick up. It still gets North & Central America wrong, however, labeling it as ORGANIZATION.
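A side note on the extraction step in the Stanford NER script above: its loop joins every run of consecutive non-O tokens into one entity and keeps the tag of the last token, so two neighbouring entities of different types would be merged into one. A minimal alternative sketch, assuming the tagger returns plain (token, tag) pairs with 'O' marking non-entities as in that script, groups tokens by tag instead. The function name `extract_entities` and the tagged list are made up for illustration; the list is written by hand and is not real tagger output.

```python
from itertools import groupby


def extract_entities(tagged_sentence, outside_tag="O"):
    """Join consecutive tokens that share the same non-O tag into one entity."""
    entities = []
    for tag, group in groupby(tagged_sentence, key=lambda pair: pair[1]):
        if tag != outside_tag:
            name = " ".join(token for token, _ in group)
            entities.append((name, tag))
    return entities


# Hand-written (token, tag) pairs for illustration only.
tagged = [
    ("FIFA", "ORGANIZATION"), ("was", "O"), ("founded", "O"), ("in", "O"),
    ("1904", "DATE"), ("Zürich", "LOCATION"), ("South", "LOCATION"),
    ("America", "LOCATION"), (".", "O"),
]

print(extract_entities(tagged))
```

Here "1904" and "Zürich South America" come out as separate entities because their tags differ. Two adjacent entities of the same type are still merged, since the tags carry no begin/inside markers.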
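Note that this does not mean Stanford NER is always better than NLTK's NER. The two tools may differ in their target domains, training corpora, and algorithms, so you should choose a tool according to your own needs.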
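To see exactly where the two tools disagree on this sample, the two result tables above can be compared directly. The sketch below copies the entity names and labels from those tables by hand, and treats NLTK's GPE as equivalent to Stanford's LOCATION; that mapping is a simplifying assumption made for the comparison, not something either tool defines. The helper names `normalize` and `compare` are illustrative.

```python
# Results copied by hand from the two output tables above.
nltk_gpe = ["Belgium", "Asia", "France", "Oceania", "Germany", "South America",
            "Denmark", "Zürich", "Sweden", "Netherlands", "Spain",
            "Switzerland", "North", "Europe"]
nltk_result = {name: "GPE" for name in nltk_gpe}
nltk_result.update({"FIFA": "ORGANIZATION", "Central America": "ORGANIZATION",
                    "Caribbean": "LOCATION", "Africa": "PERSON"})

stanford_loc = ["Denmark", "Spain", "South America", "Belgium", "Zürich",
                "the Netherlands", "France", "Caribbean", "Sweden", "Oceania",
                "Asia", "Europe", "Africa", "Switzerland", "Germany"]
stanford_result = {name: "LOCATION" for name in stanford_loc}
stanford_result.update({"1904": "DATE", "FIFA": "ORGANIZATION",
                        "North & Central America": "ORGANIZATION"})


def normalize(label):
    """Assumption for this comparison: treat GPE as a kind of LOCATION."""
    return "LOCATION" if label == "GPE" else label


def compare(a, b):
    """Return label conflicts on shared names, plus names unique to each side."""
    shared = set(a) & set(b)
    conflicts = {name: (a[name], b[name]) for name in sorted(shared)
                 if normalize(a[name]) != normalize(b[name])}
    return conflicts, sorted(set(a) - set(b)), sorted(set(b) - set(a))


conflicts, only_nltk, only_stanford = compare(nltk_result, stanford_result)
print("Label conflicts:", conflicts)
print("Only in NLTK output:", only_nltk)
print("Only in Stanford output:", only_stanford)
```

Apart from the label of Africa, the differences are all about entity boundaries (for example "Netherlands" versus "the Netherlands") and the date that only the 7-class Stanford model reports.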
That's all for this post. If the opportunity arises, later posts will try implementing NER with HMMs, CRFs, or deep learning.

