You need the nltk natural-language-processing package; Anaconda ships with it by default.
You also need the NLTK corpora: http://www.nltk.org/data.html
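The corpora can also be fetched from inside Python. A minimal sketch (the `tokenizers/punkt` resource is what `sent_tokenize`/`word_tokenize` below rely on; `nltk.download` fetches data into the local `nltk_data` directory):

```python
import nltk

# Fetch the 'punkt' sentence-tokenizer model, but only if it is not
# already present in a local nltk_data directory.
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
```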
Tokenization splits text into sentences and words. The sample paragraph below (its typos and odd spacing are left in deliberately) gives the tokenizers something to work on:
import nltk.tokenize as tk

doc = 'Are you curious about tokenization ? Let’s see how it works! we need to analyze a couple of sentences with puntuations to see it in action.'
print(doc)
print('-' * 72)

# Sentence tokenization
tokens = tk.sent_tokenize(doc)
for token in tokens:
    print(token)
print('-' * 72)

# Word tokenization
tokens = tk.word_tokenize(doc)
for token in tokens:
    print(token)
print('-' * 72)

# Words and punctuation as separate tokens
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for token in tokens:
    print(token)
print('-' * 72)
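The difference between `word_tokenize` and `WordPunctTokenizer` shows up on contractions such as "Let's": `word_tokenize` keeps "'s" together as one token, while `WordPunctTokenizer` splits words and punctuation into separate runs. Under the hood, `WordPunctTokenizer` behaves like a regex tokenizer with the pattern `\w+|[^\w\s]+`; a minimal re-implementation, for illustration only:

```python
import re

# WordPunctTokenizer tokenizes with the pattern \w+|[^\w\s]+ :
# a token is either a run of word characters or a run of punctuation,
# so apostrophes always become separate tokens.
def wordpunct(text):
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct("Let's see how it works!"))
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```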
Stemming reduces inflected forms to a common stem: plays/playing/player all map back to play. NLTK ships three stemmers:
Porter: lenient, simple, and fast, but rather crude.
Lancaster: strict, complex, and slow; the most aggressive of the three.
Snowball: between the other two in both accuracy and efficiency.
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

# Stem with the Porter stemmer
stemmer = pt.PorterStemmer()
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-' * 72)

# Stem with the Lancaster stemmer
stemmer = lc.LancasterStemmer()
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-' * 72)

# Stem with the Snowball stemmer (the language must be specified)
stemmer = sb.SnowballStemmer('english')
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-' * 72)
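SnowballStemmer is the only one of the three that takes a language argument, and the class exposes the languages it supports. A small sketch, assuming NLTK is installed:

```python
import nltk.stem.snowball as sb

# SnowballStemmer.languages is a tuple of supported language names
# ('english', 'french', 'german', ...).
print(sb.SnowballStemmer.languages)

# The same stemmer instance can then be reused across words.
print(sb.SnowballStemmer('english').stem('beaches'))
```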