Natural Language Processing (NLP): Lemmatizing English Words, with Code Combining Lemmatization and POS Tagging

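Lemmatization is an important part of text preprocessing and is closely related to stemming.

Put simply, lemmatization reduces an inflected word to its base form (its lemma), usually by removing affixes. The result is normally a word that appears in a dictionary, whereas the output of stemming is not necessarily a dictionary word. For example, lemmatizing "cars" gives "car", and lemmatizing "ate" gives "eat".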
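To make the contrast concrete, here is a minimal sketch in plain Python. It does not use nltk: `naive_stem`, `toy_lemmatize`, the short `SUFFIXES` list and the `LEMMA_TABLE` are all hypothetical stand-ins invented for illustration, and they are far simpler than a real stemmer or a WordNet lookup.

```python
# Toy illustration only: neither function reflects how nltk works internally.

# Suffixes the naive stemmer chops off, longest first (hypothetical list).
SUFFIXES = ("ies", "ing", "es", "s")

# Hand-written lookup table standing in for a real dictionary (hypothetical).
LEMMA_TABLE = {
    "cars": "car",
    "ponies": "pony",
    "ate": "eat",
}


def naive_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ("cars", "ponies", "ate"):
    print(w, naive_stem(w), toy_lemmatize(w))
```

In this toy setup the stemmer turns "ponies" into "pon", which is not a word, and cannot do anything with the irregular form "ate", while the table lookup returns "pony" and "eat".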

Python's nltk package provides a robust WordNet-based lemmatizer, WordNetLemmatizer, as the following example shows:

    from nltk.stem import WordNetLemmatizer

    # Requires the WordNet data: run nltk.download('wordnet') once beforehand.
    wnl = WordNetLemmatizer()
    # lemmatize nouns
    print(wnl.lemmatize('cars', 'n'))
    print(wnl.lemmatize('men', 'n'))
    # lemmatize verbs
    print(wnl.lemmatize('running', 'v'))
    print(wnl.lemmatize('ate', 'v'))
    # lemmatize adjectives
    print(wnl.lemmatize('saddest', 'a'))
    print(wnl.lemmatize('fancier', 'a'))

Output:

    car
    men
    run
    eat
    sad
    fancy

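In the code above, wnl.lemmatize() performs the lemmatization. Its first argument is the word and its second is the word's part of speech (noun, verb, adjective, and so on); it returns the lemmatized form of the input word. If the second argument is omitted, the word is treated as a noun.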

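Lemmatization itself is usually straightforward, but in practice it is important to specify the word's part of speech; otherwise the result may be poor, as in the following code: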

    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    print(wnl.lemmatize('ate', 'n'))
    print(wnl.lemmatize('fancier', 'v'))

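With the wrong part of speech, neither word is reduced: both calls return their input unchanged ('ate' and 'fancier').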

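So how do we obtain a word's part of speech? In NLP this is done with part-of-speech (POS) tagging. In nltk, nltk.pos_tag() returns the part of speech of each word in a sentence, as in the following Python code: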

    from nltk import word_tokenize
    from nltk import pos_tag

    sentence = 'The brown fox is quick and he is jumping over the lazy dog'
    tokens = word_tokenize(sentence)
    tagged_sent = pos_tag(tokens)
    print(tokens)
    print(tagged_sent)

The output is as follows:

    ['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
    [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
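The tags in this output follow the Penn Treebank convention, in which the first letter indicates the broad category: J for adjectives, N for nouns, V for verbs. The sketch below groups the tagged tokens from the output above by that first letter. It is plain Python over a copy of the printed pairs, and the helper name `group_by_coarse_tag` is made up for illustration.

```python
from collections import defaultdict

# Copy of the (word, tag) pairs printed above.
tagged_sent = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
               ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
               ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
               ('lazy', 'JJ'), ('dog', 'NN')]


def group_by_coarse_tag(tagged):
    """Group words by the first letter of their tag (hypothetical helper)."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag[0]].append(word)
    return dict(groups)


groups = group_by_coarse_tag(tagged_sent)
print(groups['J'])  # adjectives
print(groups['N'])  # nouns
print(groups['V'])  # verbs
```

Grouping by the first letter is what makes it possible to map the fine-grained tags (VBZ, VBG, NNS, and so on) onto the handful of part-of-speech values the lemmatizer accepts.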

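Now that we know how to obtain each word's part of speech in a sentence, we can combine it with the lemmatizer to get good lemmatization results. Example Python code: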

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer


    # Map a Penn Treebank POS tag to the corresponding WordNet POS constant
    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return None


    sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
    print(sentence)
    tokens = word_tokenize(sentence)  # tokenize the sentence
    tagged_sent = pos_tag(tokens)     # tag each token with its part of speech
    print(tagged_sent)

    wnl = WordNetLemmatizer()
    lemmas_sent = []
    for word, tag in tagged_sent:
        # Fall back to noun when the tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
        lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos))  # lemmatize

    print(lemmas_sent)

The output is as follows:

    football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.
    [('football', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('family', 'NN'), ('of', 'IN'), ('team', 'NN'), ('sports', 'NNS'), ('that', 'WDT'), ('involve', 'VBP'), (',', ','), ('to', 'TO'), ('varying', 'VBG'), ('degrees', 'NNS'), (',', ','), ('kicking', 'VBG'), ('a', 'DT'), ('ball', 'NN'), ('to', 'TO'), ('score', 'VB'), ('a', 'DT'), ('goal', 'NN'), ('.', '.')]
    ['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']

The last list is the result of lemmatizing every word in the sentence.
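The tag-mapping and lemmatization steps can also be wrapped into small reusable functions. The sketch below is one possible arrangement, not the article's code: `penn_to_wordnet`, `lemmatize_tagged` and `echo_lemmatizer` are hypothetical names, and the lemmatizer is passed in as a callable so that the mapping logic can be run without nltk installed. The one-letter codes are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# Penn Treebank tag prefix -> WordNet POS code.
PREFIX_TO_POS = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def penn_to_wordnet(tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PREFIX_TO_POS.get(tag[:1], default)


def lemmatize_tagged(tagged, lemmatize_fn):
    """Lemmatize (word, tag) pairs with a callable taking (word, pos)."""
    return [lemmatize_fn(word, penn_to_wordnet(tag)) for word, tag in tagged]


# Stand-in lemmatizer so the sketch runs without nltk (assumption: a real
# run would pass WordNetLemmatizer().lemmatize instead).
def echo_lemmatizer(word, pos):
    return f"{word}/{pos}"


print(lemmatize_tagged([('sports', 'NNS'), ('kicking', 'VBG'), ('to', 'TO')],
                       echo_lemmatizer))
```

With nltk and its WordNet data available, passing `WordNetLemmatizer().lemmatize` as `lemmatize_fn` and the output of `pos_tag` as `tagged` should give the same kind of result as the loop in the example above. Using a dictionary lookup on the first letter of the tag is a design choice that keeps the noun fallback in one place.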
