Chinese and English Tokenization with NLTK

I. Setting Up the NLTK Environment

1. Install the nltk package (if the install starts fine and then suddenly errors out, simply retry it a few times)

pip install nltk

2. In the Python console

# 1. Import the package
import nltk

# 2. Download the basic data (this opens the NLTK downloader)
nltk.download()

Note: if the online download fails, you can download the data manually from the official NLTK website and place it in one of the folders that NLTK searches for data.
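As a minimal sketch of both options (the resource names 'punkt' and 'averaged_perceptron_tagger' are the ones the examples below rely on; the local folder path is only an illustrative placeholder), you can download just the required resources and tell NLTK where to look:

import nltk

# Download only the resources the following examples rely on,
# instead of the full collection offered by nltk.download():
nltk.download('punkt')                        # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')   # POS tagger model

# If the data was downloaded manually, append its folder to the search path
# (the path below is just a placeholder, not a required location):
nltk.data.path.append('/path/to/nltk_data')
print(nltk.data.path)   # directories NLTK searches for corpora and models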

II. English Tokenization with NLTK

Anaconda is used here as the Python interpreter/environment.

1. Splitting a paragraph into sentences

import nltk

# A sample paragraph of text
text = "In the coming new term, there will be many challenging exams. Firstly, in June, there is a College English Test Band Four. In May, Certificate of Accounting Professional is around the corner. Without sufficient preparations, I can hardly expect to pass those exams. So I have to plan more time to take enough preparation."

# (1) Split the paragraph into sentences
# language is a keyword argument; it defaults to 'english'
# sent_tokenize splits the text into sentences, using the trained Punkt
# model to recognise '.', '!' and '?' as sentence boundaries

tokenize = nltk.sent_tokenize(text, language='english')
print(tokenize)

Output:

['In the coming new term, there will be many challenging exams.', 'Firstly, in June, there is a College English Test Band Four.', 'In May, Certificate of Accounting Professional is around the corner.', 'Without sufficient preparations, I can hardly expect to pass those exams.', 'So I have to plan more time to take enough preparation.']

2. Word tokenization

# (2) Word tokenization
# word_tokenize splits a sentence into word tokens
# append adds each sentence's token list to the words[] list

words = []
for word in tokenize:
    words.append(nltk.word_tokenize(word))
print(words)

Output:

[['In', 'the', 'coming', 'new', 'term', ',', 'there', 'will', 'be', 'many', 'challenging', 'exams', '.'], ['Firstly', ',', 'in', 'June', ',', 'there', 'is', 'a', 'College', 'English', 'Test', 'Band', 'Four', '.'], ['In', 'May', ',', 'Certificate', 'of', 'Accounting', 'Professional', 'is', 'around', 'the', 'corner', '.'], ['Without', 'sufficient', 'preparations', ',', 'I', 'can', 'hardly', 'expect', 'to', 'pass', 'those', 'exams', '.'], ['So', 'I', 'have', 'to', 'plan', 'more', 'time', 'to', 'take', 'enough', 'preparation', '.']]
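A slightly more compact, equivalent way to build the same nested structure (plus a flat token list, which is often handier for counting or frequency analysis) is a list comprehension; this sketch assumes the tokenize list produced in step 1:

# Tokenize every sentence with a list comprehension ...
words = [nltk.word_tokenize(sent) for sent in tokenize]

# ... and flatten the nested lists into a single token list
flat_tokens = [tok for sent in words for tok in sent]
print(flat_tokens[:10])   # first ten tokens of the paragraph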

3. Part-of-speech tagging

# (3) Part-of-speech tagging
# pos_tag assigns a part-of-speech tag to every token in a sentence

wordtagging = []
for cixing in words:          # cixing: one sentence's token list
    wordtagging.append(nltk.pos_tag(cixing))
print(wordtagging)

Output:

[[('In', 'IN'), ('the', 'DT'), ('coming', 'VBG'), ('new', 'JJ'), ('term', 'NN'), (',', ','), ('there', 'EX'), ('will', 'MD'), ('be', 'VB'), ('many', 'JJ'), ('challenging', 'VBG'), ('exams', 'NNS'), ('.', '.')], [('Firstly', 'RB'), (',', ','), ('in', 'IN'), ('June', 'NNP'), (',', ','), ('there', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('College', 'NNP'), ('English', 'NNP'), ('Test', 'NNP'), ('Band', 'NNP'), ('Four', 'NNP'), ('.', '.')], [('In', 'IN'), ('May', 'NNP'), (',', ','), ('Certificate', 'NNP'), ('of', 'IN'), ('Accounting', 'NNP'), ('Professional', 'NNP'), ('is', 'VBZ'), ('around', 'IN'), ('the', 'DT'), ('corner', 'NN'), ('.', '.')], [('Without', 'IN'), ('sufficient', 'JJ'), ('preparations', 'NNS'), (',', ','), ('I', 'PRP'), ('can', 'MD'), ('hardly', 'RB'), ('expect', 'VB'), ('to', 'TO'), ('pass', 'VB'), ('those', 'DT'), ('exams', 'NNS'), ('.', '.')], [('So', 'RB'), ('I', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('plan', 'VB'), ('more', 'JJR'), ('time', 'NN'), ('to', 'TO'), ('take', 'VB'), ('enough', 'JJ'), ('preparation', 'NN'), ('.', '.')]]
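If a tag such as 'VBG' or 'NNP' is unfamiliar, NLTK can print its definition and examples via nltk.help.upenn_tagset; this needs the 'tagsets' resource to be downloaded once, as sketched below:

import nltk

# One-time download of the tag documentation
nltk.download('tagsets')

# Print the meaning of a Penn Treebank tag, with examples
nltk.help.upenn_tagset('VBG')   # verb, gerund or present participle
nltk.help.upenn_tagset('NNP')   # proper noun, singular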

III. Chinese Tokenization with NLTK

1. Splitting a paragraph into sentences

import nltk

# Working with Chinese text
# For Chinese sentence splitting, the delimiter must be an ASCII '.'
# (followed by a space) for sent_tokenize to recognise the boundary,
# since the default model does not know Chinese punctuation

text1 = '同是风华正茂,怎敢甘拜下风 . 保持学习,保持饥饿'
Juzi_chinese = nltk.sent_tokenize(text1)
print(Juzi_chinese)

Output:

['同是风华正茂,怎敢甘拜下风 .', '保持学习,保持饥饿']
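Because the default Punkt model knows nothing about Chinese sentence-ending punctuation (。!?), the trick above of inserting an ASCII '. ' is needed. A common alternative, sketched here purely as an illustration outside NLTK (and requiring Python 3.7+ for zero-width splits), is a regular-expression split on the Chinese punctuation itself:

import re

# Illustrative variant of the text, written with Chinese punctuation
text1 = '同是风华正茂,怎敢甘拜下风。保持学习,保持饥饿!'

# Split after each sentence-ending mark, keeping the mark attached to its sentence
sentences = [s for s in re.split(r'(?<=[。!?])', text1) if s]
print(sentences)   # ['同是风华正茂,怎敢甘拜下风。', '保持学习,保持饥饿!']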

2. Word tokenization

# Here word tokenization is applied to the whole text, not to the split sentences
# word_tokenize performs the word tokenization

tokens = nltk.word_tokenize(text1)
print(tokens)

Output:

['同是', '风华', '正茂,怎敢', '甘拜', '下风', '.', '保持', '学习,保持', '饥饿']
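As the output shows, NLTK's word_tokenize essentially splits on whitespace and ASCII punctuation and does not perform genuine Chinese word segmentation. For real Chinese segmentation a dedicated segmenter is normally used instead; the sketch below relies on the third-party jieba library (not part of NLTK, installed with pip install jieba) purely as an illustration:

# pip install jieba   (third-party Chinese word segmentation library)
import jieba

text1 = '同是风华正茂,怎敢甘拜下风。保持学习,保持饥饿'

# lcut returns the segmentation as a plain Python list;
# the exact split depends on jieba's built-in dictionary
tokens = jieba.lcut(text1)
print(tokens)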

—— Stay hungry, keep learning.
Jackson_MVP