
Natural Language Processing (NLP), Part 1: Tokenization, Sentence Splitting, and Stemming


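This requires the nltk natural language processing package, which Anaconda installs by default.
You also need to install the NLTK data (corpora and models): http://www.nltk.org/data.html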

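Natural language basics:

1. Tokenization
鱼香肉丝里面多放点辣椒 (Put extra chili in the yuxiang shredded pork)
对称加密需要DES处理引擎 (Symmetric encryption requires a DES processing engine)
天儿冷了多穿点 (It's getting cold, dress warmly)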
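Chinese sentences like the ones above contain no spaces, so they have to be segmented into words before any further processing. As a rough illustration only, the sketch below uses forward maximum matching against a tiny hand-written dictionary. The function name `forward_max_match` and the vocabulary `toy_vocab` are hypothetical and not part of nltk; real segmenters rely on large dictionaries and statistical models.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit;
        # fall back to a single character when nothing matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionary covering only the first example sentence.
toy_vocab = {"鱼香肉丝", "里面", "多", "放", "点", "辣椒"}
sentence = "鱼香肉丝里面多放点辣椒"
segmented = forward_max_match(sentence, toy_vocab)
print(" / ".join(segmented))
```

Greedy matching is easy to follow but picks the wrong split whenever a longer dictionary entry overlaps the intended word boundary, which is one reason practical tools do not stop at this approach.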

Are you curious about tokenization? Let’s see how it works! We need to analyze a couple of sentences with punctuation to see it in action.

import nltk.tokenize as tk

doc = 'Are you curious about tokenization? Let’s see how it works! We need to analyze a couple of sentences with punctuation to see it in action.'
print(doc)
print('-'*72)

# Sentence splitting (uses the Punkt model from the NLTK data installed earlier)
sentences = tk.sent_tokenize(doc)
for sentence in sentences:
    print(sentence)
print('_'*72)

# Word tokenization
tokens = tk.word_tokenize(doc)
for token in tokens:
    print(token)
print('_'*72)

# Words and punctuation: every run of punctuation becomes its own token
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for token in tokens:
    print(token)
print('_'*72)
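To make the difference between the tokenizers more concrete, here is a standard-library sketch of the two simplest ideas. The second regular expression is the pattern that the NLTK documentation gives for the word-and-punctuation tokenizer; the sentence splitter is a deliberately naive stand-in and is not how `sent_tokenize` works (Punkt is a trained model that also copes with abbreviations such as "Dr."). The sample text is shortened and uses a straight apostrophe for readability.

```python
import re

doc = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences."

# Naive sentence split: break on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", doc.strip())

# Word/punctuation split: runs of word characters, or runs of other
# non-space characters, each become one token.
word_punct = re.findall(r"\w+|[^\w\s]+", sentences[1])

print(sentences)
print(word_punct)
```

Note how the apostrophe in "Let's" becomes a separate token under this pattern, whereas `word_tokenize` applies additional rules for contractions.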
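2. Stemming

play -> plays/playing/player: one stem (play) and some of the word forms built on it. Stemming maps such forms back toward the shared stem.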
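Porter: the most lenient of the three; simple and fast, but its rules are fairly crude
Lancaster: the strictest and most aggressive; it trims words heavily, so the resulting stems are often not recognizable words
Snowball: sits between the other two in strictness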
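To show what "extracting a stem" means mechanically, the following is a toy suffix stripper. It is not the Porter, Lancaster, or Snowball algorithm; the suffix list `SUFFIXES` and the function `toy_stem` are hypothetical and exist only for illustration. The real stemmers apply many more rules and conditions, so they may treat some of these words differently.

```python
# Hypothetical suffix list, checked in order; far from complete.
SUFFIXES = ("ing", "er", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["plays", "playing", "player", "play", "is"]:
    print(w, "->", toy_stem(w))
```

The `min_stem` guard keeps very short words such as "is" intact, a simplified version of the length conditions that real stemmers use to avoid over-stripping.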

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table','probably','wolves','playing','is','dog','the','beaches','grounded','dreamt','envision']

# Stem each word with the Porter stemmer
stemmer = pt.PorterStemmer()
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-'*72)

# Stem each word with the Lancaster stemmer
stemmer = lc.LancasterStemmer()
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-'*72)

# Stem each word with the Snowball stemmer (a language must be specified)
stemmer = sb.SnowballStemmer('english')
for word in words:
    stem = stemmer.stem(word)
    print(stem)
print('-'*72)