
NLP: Lemmatisation and Stemming with NLTK (pos_tag, word_tokenize)


Reference article: "NLP Lemmatisation(词性还原) 和 Stemming(词干提取) NLTK pos_tag word_tokenize" by 心之所向 (CSDN blog)

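Lemmatisation reduces a word to its general (dictionary) form, which still carries a complete meaning; the method is relatively complex. Stemming extracts the stem or root of a word, which does not necessarily carry a complete meaning; the method is relatively simple.

Stemming
Stemming is based on the rules of the language, such as the rules for forming English noun plurals. Because it is rule-based, words that fall outside the rules can be handled incorrectly.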

    # Porter Stemmer, based on the Porter stemming algorithm
    from nltk.stem.porter import PorterStemmer
    porter_stemmer = PorterStemmer()
    porter_stemmer.stem('leaves')
    # Output: 'leav'
    # The expected result would be the noun 'leaf'
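
To make the limitation of rule-based stemming concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is a toy and not the Porter algorithm: the rule table `SUFFIX_RULES` and the function `toy_stem` are made up for illustration only.

```python
# Toy rule-based stemmer. NOT the Porter algorithm, only an illustration
# of how fixed suffix rules behave on regular and irregular words.
SUFFIX_RULES = [
    ("ies", "y"),   # ponies -> pony
    ("ves", "v"),   # leaves -> leav (the dictionary form would be "leaf")
    ("ing", ""),    # walking -> walk
    ("ed", ""),     # walked -> walk
    ("s", ""),      # cats -> cat
]


def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters so very short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["ponies", "cats", "walking", "leaves", "mice"]:
    print(w, "->", toy_stem(w))
```

Regular plurals come out fine, while "leaves" is cut to a non-word and the irregular plural "mice" is not touched at all, because no rule covers it.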

NLTK mainly provides the following stemmers:

    # Based on the Porter stemming algorithm
    from nltk.stem.porter import PorterStemmer
    porter_stemmer = PorterStemmer()
    porter_stemmer.stem('maximum')
    # Based on the Lancaster stemming algorithm
    from nltk.stem.lancaster import LancasterStemmer
    lancaster_stemmer = LancasterStemmer()
    lancaster_stemmer.stem('maximum')
    # Based on the Snowball stemming algorithm
    from nltk.stem import SnowballStemmer
    snowball_stemmer = SnowballStemmer("english")
    snowball_stemmer.stem('maximum')
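
Since the three stemmers can disagree on the same word, it may help to run them side by side. The sketch below assumes NLTK is installed; the helper name `compare_stemmers` and the word list are illustrative and not part of NLTK.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def compare_stemmers(words):
    """Return {word: {stemmer_name: stem}} for the three NLTK stemmers."""
    stemmers = {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),
    }
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


# Print one line per word so the differences are easy to spot.
for word, stems in compare_stemmers(["maximum", "running", "leaves"]).items():
    print(word, stems)
```

The stemmers need no extra NLTK data downloads, so this runs on a bare installation.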

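Lemmatisation:
Lemmatisation is based on a dictionary mapping. NLTK's WordNetLemmatizer expects the part of speech to be passed explicitly; when it is omitted, every word is treated as a noun, so verbs and adjectives may come back unchanged or wrong. The usual order is therefore: tokenise, tag parts of speech, then lemmatise.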

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('leaves')  # pos defaults to 'n' (noun)
    # Output: 'leaf'
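
The reason the part of speech matters can be shown with a toy dictionary lookup. This is only a sketch of the idea: the table `TOY_LEMMAS` and the function `toy_lemmatize` are hand-written for illustration, whereas the real lemmatizer consults WordNet.

```python
# Hand-written lookup table keyed by (word, part of speech).
# Purely illustrative; WordNetLemmatizer uses the WordNet database instead.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("loves", "n"): "love",
    ("loves", "v"): "love",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("leaves"))           # noun reading
print(toy_lemmatize("leaves", pos="v"))  # verb reading
print(toy_lemmatize("better"))           # no noun entry, returned unchanged
print(toy_lemmatize("better", pos="a"))  # adjective reading
```

The same surface form maps to different lemmas depending on the tag, and a word looked up under the wrong tag simply falls through unchanged.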

The full pipeline:

    from nltk import word_tokenize, pos_tag
    from nltk.stem import WordNetLemmatizer

    # Step 1: split the sentence into tokens
    word_tokenize("apples % , I've loves green")

    # Step 2: tag every token with its part of speech
    pos_tag(word_tokenize("apples % , I've loves green"))

    # Step 3: lemmatise a single token, passing its part of speech
    wnl = WordNetLemmatizer()
    wnl.lemmatize('apples', pos='n')

    def lemmatize_all(sentence):
        """Yield the lemma of each token, using its tag to choose the WordNet pos."""
        wnl = WordNetLemmatizer()
        for word, tag in pos_tag(word_tokenize(sentence)):
            if tag.startswith('NN'):      # noun
                yield wnl.lemmatize(word, pos='n')
            elif tag.startswith('VB'):    # verb
                yield wnl.lemmatize(word, pos='v')
            elif tag.startswith('JJ'):    # adjective
                yield wnl.lemmatize(word, pos='a')
            elif tag.startswith('R'):     # adverb (RB*; also matches RP)
                yield wnl.lemmatize(word, pos='r')
            else:
                yield word

    # train_feature and test_feature are lists of sentences defined elsewhere
    train_f = [' '.join(lemmatize_all(s)) for s in train_feature]
    test_f = [' '.join(lemmatize_all(s)) for s in test_feature]
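
The tag-to-pos branching in `lemmatize_all` can also be pulled out into a small helper, which makes it testable without any NLTK data. This is a sketch, not the article's method: `penn_to_wordnet`, `lemmatize_tagged` and `stand_in_lemmatize` are hypothetical names, the input pairs are tagged by hand, and the helper matches `RB` rather than `R`, so particles (`RP`) are left alone.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos letter, or None if unsupported."""
    if tag.startswith("NN"):
        return "n"
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("RB"):
        return "r"
    return None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatise (word, tag) pairs with a callable taking (word, pos=...).

    With NLTK, `lemmatize` would be WordNetLemmatizer().lemmatize.
    """
    result = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        result.append(lemmatize(word, pos=pos) if pos else word)
    return result


def stand_in_lemmatize(word, pos="n"):
    """Crude stand-in so the sketch runs without WordNet: strip a trailing 's'."""
    return word.rstrip("s") if pos in ("n", "v") else word


# Hand-tagged input; in the pipeline above these pairs would come from
# pos_tag(word_tokenize(sentence)).
tagged = [("She", "PRP"), ("loves", "VBZ"), ("green", "JJ"), ("apples", "NNS")]
print(lemmatize_tagged(tagged, stand_in_lemmatize))
```

Passing the lemmatising function in as an argument keeps the mapping logic separate from WordNet, so the real lemmatizer can be swapped in without changing the helper.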

NLTK part-of-speech tags (Penn Treebank tagset):

    Tag   Meaning                                       Examples
    CC    coordinating conjunction                      and, or, but
    CD    cardinal number                               twenty-four, 1991, 14:24
    DT    determiner                                    the, a, some, most, every, no
    EX    existential "there"                           there
    FW    foreign word                                  dolce, ersatz, esprit, quo, maitre
    IN    preposition or subordinating conjunction      on, of, at, with, by, into, under
    JJ    adjective                                     new, good, high, special, big, local
    JJR   adjective, comparative                        bleaker, braver, breezier, briefer, brighter, brisker
    JJS   adjective, superlative                        calmest, cheapest, choicest, classiest, cleanest, clearest
    LS    list item marker                              A, A., B, B., C, C., D, E, F, First, G, H, I, J, K
    MD    modal auxiliary                               can, cannot, could, couldn't
    NN    noun, singular or mass                        year, home, time, education
    NNS   noun, plural                                  undergraduates, scotches
    NNP   proper noun, singular                         Alison, Africa, April, Washington
    NNPS  proper noun, plural                           Americans, Americas, Amharas, Amityvilles
    PDT   predeterminer                                 all, both, half, many
    POS   possessive marker                             ' 's
    PRP   personal pronoun                              hers, herself, him, himself, hisself
    PRP$  possessive pronoun                            her, his, mine, my, our, ours
    RB    adverb                                        occasionally, unabatingly, maddeningly
    RBR   adverb, comparative                           further, gloomier, grander
    RBS   adverb, superlative                           best, biggest, bluntest, earliest
    RP    particle                                      aboard, about, across, along, apart
    SYM   symbol                                        % &
    TO    the word "to"                                 to
    UH    interjection                                  Goodbye, Goody, Gosh, Wow
    VB    verb, base form                               ask, assemble, assess
    VBD   verb, past tense                              dipped, pleaded, swiped
    VBG   verb, present participle or gerund            telegraphing, stirring, focusing
    VBN   verb, past participle                         multihulled, dilapidated, aerosolized
    VBP   verb, present tense, non-3rd person singular  predominate, wrap, resort, sue
    VBZ   verb, present tense, 3rd person singular      bases, reconstructs, marks
    WDT   wh-determiner                                 which, that, what, whatever
    WP    wh-pronoun                                    that, what, whatever
    WP$   possessive wh-pronoun                         whose
    WRB   wh-adverb                                     how, where, when, why

    # Show the documentation for a tag
    import nltk
    nltk.help.upenn_tagset('JJ')
