Reference article: "NLP Lemmatisation and Stemming: NLTK pos_tag, word_tokenize" (心之所向, CSDN blog)
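Lemmatization reduces a word to its dictionary base form (the lemma), which is a valid word that carries full meaning; the method is comparatively complex. Stemming strips a word down to its stem or root, which is not necessarily a valid word; the method is comparatively simple.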
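Stemming:
Stemming relies on language-specific rules, such as the rules for forming English plurals. Because it is purely rule-based, it mishandles words that fall outside the rules.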
- # PorterStemmer implements the Porter stemming algorithm
- from nltk.stem.porter import PorterStemmer
- porter_stemmer = PorterStemmer()
- porter_stemmer.stem('leaves')
- # Output: 'leav'
- # but the correct base form is the noun 'leaf'
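To make the "rule-based, so it misses irregular words" point concrete, here is a minimal sketch of a suffix-stripping stemmer. The three rules and the names `TOY_RULES` and `toy_stem` are invented for illustration; this is not the Porter algorithm, only a toy in the same spirit.

```python
# Toy suffix-stripping rules, tried in order; the first matching rule wins.
# These rules are illustrative assumptions, NOT the full Porter algorithm.
TOY_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # ponies  -> pony
    ("s", ""),       # cats    -> cat
]


def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["classes", "ponies", "cats", "leaves"]:
    print(w, "->", toy_stem(w))
```

Regular plurals come out as expected, but "leaves" only loses its final "s": no rule knows that the singular is "leaf". Handling such cases needs a dictionary, which is what lemmatization adds.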
NLTK mainly provides the following stemmers:
- # Porter stemming algorithm
- from nltk.stem.porter import PorterStemmer
- porter_stemmer = PorterStemmer()
- porter_stemmer.stem('maximum')
-
- # Lancaster stemming algorithm
- from nltk.stem.lancaster import LancasterStemmer
- lancaster_stemmer = LancasterStemmer()
- lancaster_stemmer.stem('maximum')
-
- # Snowball stemming algorithm
- from nltk.stem import SnowballStemmer
- snowball_stemmer = SnowballStemmer("english")
- snowball_stemmer.stem('maximum')
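Since the three stemmers can disagree on the same word, it may help to run them side by side. The following is a small sketch, assuming NLTK is installed; the helper name `compare_stems` and the word list are chosen here for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer, keyed by a label used only in this sketch.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}


def compare_stems(words):
    """Return {word: {stemmer_label: stem}} for every word in `words`."""
    return {
        word: {label: stemmer.stem(word) for label, stemmer in stemmers.items()}
        for word in words
    }


for word, stems in compare_stems(["maximum", "leaves", "running"]).items():
    print(word, stems)
```

Printing the stems next to each other makes it easy to see which algorithm is more aggressive on your own vocabulary before committing to one.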
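Lemmatization:
Lemmatization is a dictionary-based mapping. In NLTK, WordNetLemmatizer.lemmatize treats every word as a noun unless the part of speech is passed explicitly through the pos argument, so verbs, adjectives and adverbs may come back unchanged or wrong. The usual order is therefore: tokenize, tag parts of speech, then lemmatize.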
- from nltk.stem import WordNetLemmatizer
- lemmatizer = WordNetLemmatizer()
- lemmatizer.lemmatize('leaves')
- # Output: 'leaf'
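Why the part of speech matters for a dictionary-based mapping can be sketched with a tiny hand-written lexicon. Everything here (`TOY_LEXICON`, `toy_lemmatize`, the four entries) is hypothetical and only mimics the idea; WordNetLemmatizer consults the WordNet database rather than a dict like this.

```python
# Hypothetical miniature lexicon keyed by (surface form, POS code).
# The same surface form can map to different lemmas depending on POS.
TOY_LEXICON = {
    ("leaves", "n"): "leaf",   # noun: plural of "leaf"
    ("leaves", "v"): "leave",  # verb: third person of "leave"
    ("better", "a"): "good",   # adjective: comparative of "good"
    ("apples", "n"): "apple",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("leaves"))           # treated as a noun by default
print(toy_lemmatize("leaves", pos="v"))  # treated as a verb
print(toy_lemmatize("better"))           # no noun entry, returned unchanged
```

With the default noun reading, "better" has no entry and is returned as-is; only when the caller supplies the adjective tag does the lookup succeed. This is the reason POS tagging normally precedes lemmatization.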
Full pipeline:
- from nltk import pos_tag, word_tokenize
- word_tokenize("apples % , I've loves green")
- pos_tag(word_tokenize("apples % , I've loves green"))
- wnl = WordNetLemmatizer()
- wnl.lemmatize('apples', pos='n')
- def lemmatize_all(sentence):
-     # Tokenize, POS-tag, then lemmatize each token with the matching WordNet POS
-     wnl = WordNetLemmatizer()
-     for word, tag in pos_tag(word_tokenize(sentence)):
-         if tag.startswith('NN'):
-             yield wnl.lemmatize(word, pos='n')
-         elif tag.startswith('VB'):
-             yield wnl.lemmatize(word, pos='v')
-         elif tag.startswith('JJ'):
-             yield wnl.lemmatize(word, pos='a')
-         elif tag.startswith('RB'):
-             yield wnl.lemmatize(word, pos='r')
-         else:
-             yield word
-
- # train_feature and test_feature are assumed to be lists of sentences (strings) defined elsewhere
- train_f = []
- test_f = []
- for i in range(0, len(train_feature)):
-     train_f.append(' '.join(lemmatize_all(train_feature[i])))
- for i in range(0, len(test_feature)):
-     test_f.append(' '.join(lemmatize_all(test_feature[i])))
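The tag-to-POS branching above can also be factored into a small lookup, which keeps the mapping in one place and makes it testable without downloading any NLTK data. This is a sketch under assumptions: the names `PENN_PREFIX_TO_WORDNET`, `penn_to_wordnet` and `lemmatize_tagged` are invented here, and the lemmatizer is passed in as a callable instead of being created inside the function.

```python
# Map Penn Treebank tag prefixes to the single-letter POS codes
# ('n', 'v', 'a', 'r') accepted by WordNetLemmatizer.lemmatize.
PENN_PREFIX_TO_WORDNET = (
    ("NN", "n"),  # NN, NNS, NNP, NNPS
    ("VB", "v"),  # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "a"),  # JJ, JJR, JJS
    ("RB", "r"),  # RB, RBR, RBS (RP is a particle and is not matched)
)


def penn_to_wordnet(tag):
    """Return the WordNet POS code for a Penn tag, or None if there is none."""
    for prefix, wn_pos in PENN_PREFIX_TO_WORDNET:
        if tag.startswith(prefix):
            return wn_pos
    return None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs; `lemmatize` is any callable(word, pos=...)."""
    result = []
    for word, tag in tagged_tokens:
        wn_pos = penn_to_wordnet(tag)
        # Tokens without a WordNet POS (punctuation, pronouns, ...) pass through.
        result.append(lemmatize(word, pos=wn_pos) if wn_pos else word)
    return result
```

With NLTK and its data installed, a call would look like `lemmatize_tagged(pos_tag(word_tokenize(sentence)), WordNetLemmatizer().lemmatize)`.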
NLTK POS tags (Penn Treebank tagset):
- CC coordinating conjunction: and, or, but
- CD cardinal number: twenty-four, 1991, 14:24
- DT determiner: the, a, some, most, every, no
- EX existential "there": there, there's
- FW foreign word: dolce, ersatz, esprit, quo, maitre
- IN preposition or subordinating conjunction: on, of, at, with, by, into, under, if, while, although
- JJ adjective: new, good, high, special, big, local
- JJR comparative adjective: bleaker, braver, breezier, briefer, brighter, brisker
- JJS superlative adjective: calmest, cheapest, choicest, classiest, cleanest, clearest
- LS list item marker: A A. B B. C C. D E F First G H I J K
- MD modal verb: can, cannot, could, couldn't
- NN noun, singular or mass: year, home, time, education
- NNS noun, plural: undergraduates, scotches, costs
- NNP proper noun, singular: Alison, Africa, April, Washington
- NNPS proper noun, plural: Americans, Americas, Amharas, Amityvilles
- PDT predeterminer: all, both, half, many
- POS possessive ending: ' 's
- PRP personal pronoun: hers, herself, him, himself, hisself
- PRP$ possessive pronoun: her, his, mine, my, our, ours
- RB adverb: occasionally, unabatingly, maddeningly
- RBR comparative adverb: further, gloomier, grander
- RBS superlative adverb: best, biggest, bluntest, earliest
- RP particle: aboard, about, across, along, apart
- SYM symbol: % & ' '' ''. ) )
- TO the word "to": to
- UH interjection: Goodbye, Goody, Gosh, Wow
- VB verb, base form: ask, assemble, assess
- VBD verb, past tense: dipped, pleaded, swiped
- VBG verb, gerund or present participle: telegraphing, stirring, focusing
- VBN verb, past participle: multihulled, dilapidated, aerosolized
- VBP verb, present tense, not 3rd person singular: predominate, wrap, resort, sue
- VBZ verb, present tense, 3rd person singular: bases, reconstructs, marks
- WDT wh-determiner: which, whichever, what, whatever, that
- WP wh-pronoun: who, whom, what, whatever, that
- WP$ possessive wh-pronoun: whose
- WRB wh-adverb: how, when, where, why
- # Show the description of a tag (requires the NLTK tagset help data to be downloaded)
- import nltk
- nltk.help.upenn_tagset('JJ')