I. NLTK Environment Setup
1. Install the nltk package (if the install works at first but then suddenly errors out in red, just run it a few more times)
pip install nltk
2. In the Python console
# 1. Import the package
import nltk
# 2. Download the basic data
nltk.download()
Note: if the online download fails, you can download the data from the official site yourself and put it into the designated folder.
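If the nltk.download() window keeps failing, you can also fetch just the resources this tutorial needs by name, or register a local data folder. A minimal sketch (the folder path below is only an example, adjust it to your machine):
import nltk
# download only the resources used below, instead of opening the full downloader GUI
nltk.download('punkt')                        # Punkt sentence-tokenizer models
nltk.download('averaged_perceptron_tagger')   # English POS-tagger model
# for a fully offline setup, unzip the packages into a local nltk_data folder
# (keeping its sub-folder layout) and tell NLTK where to look
nltk.data.path.append(r'D:\nltk_data')        # example path only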
II. English Tokenization with NLTK
Anaconda is used as the interpreter here.
1. Splitting a paragraph into sentences
import nltk
# A piece of text to work with
text = "In the coming new term, there will be many challenging exams. Firstly, in June, there is a College English Test Band Four. In May, Certificate of Accounting Professional is around the corner. Without sufficient preparations, I can hardly expect to pass those exams. So I have to plan more time to take enough preparation."
# (1) Split the paragraph into sentences
# language is a keyword argument (shown in red in the IDE); the default is 'english'
# sent_tokenize splits the text into sentences (using sentence-final punctuation such as '.' as the delimiter)
tokenize = nltk.sent_tokenize(text, language='english')
print(tokenize)
Output:
['In the coming new term, there will be many challenging exams.', 'Firstly, in June, there is a College English Test Band Four.', 'In May, Certificate of Accounting Professional is around the corner.', 'Without sufficient preparations, I can hardly expect to pass those exams.', 'So I have to plan more time to take enough preparation.']
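Note that sent_tokenize uses the pre-trained Punkt model rather than a naive split on '.', so it is normally not fooled by common abbreviations. A quick check with a sentence of my own:
import nltk
sample = "I met Dr. Smith yesterday. He was very kind."
print(nltk.sent_tokenize(sample, language='english'))
# expected: ['I met Dr. Smith yesterday.', 'He was very kind.']
# (no break is made after the abbreviation 'Dr.')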
2. Word tokenization
# (2) Tokenize each sentence into words
# append adds each sentence's token list to the words list
# word_tokenize splits a sentence into word tokens
words = []
for word in tokenize:
    words.append(nltk.word_tokenize(word))
print(words)
Output:
[['In', 'the', 'coming', 'new', 'term', ',', 'there', 'will', 'be', 'many', 'challenging', 'exams', '.'], ['Firstly', ',', 'in', 'June', ',', 'there', 'is', 'a', 'College', 'English', 'Test', 'Band', 'Four', '.'], ['In', 'May', ',', 'Certificate', 'of', 'Accounting', 'Professional', 'is', 'around', 'the', 'corner', '.'], ['Without', 'sufficient', 'preparations', ',', 'I', 'can', 'hardly', 'expect', 'to', 'pass', 'those', 'exams', '.'], ['So', 'I', 'have', 'to', 'plan', 'more', 'time', 'to', 'take', 'enough', 'preparation', '.']]
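Because word_tokenize is called once per sentence, words is a list of lists. A common follow-up step is to flatten it and count token frequencies with nltk.FreqDist; here is a minimal sketch that reuses the words variable built above:
from itertools import chain
import nltk
flat_words = list(chain.from_iterable(words))   # flatten the per-sentence token lists
freq = nltk.FreqDist(w.lower() for w in flat_words if w.isalpha())
print(freq.most_common(5))                      # the five most frequent word tokens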
3. Part-of-speech tagging
# (3) Part-of-speech tagging
# pos_tag performs the part-of-speech analysis
wordtagging = []
for cixing in words:
    wordtagging.append(nltk.pos_tag(cixing))
print(wordtagging)
Output:
[[('In', 'IN'), ('the', 'DT'), ('coming', 'VBG'), ('new', 'JJ'), ('term', 'NN'), (',', ','), ('there', 'EX'), ('will', 'MD'), ('be', 'VB'), ('many', 'JJ'), ('challenging', 'VBG'), ('exams', 'NNS'), ('.', '.')], [('Firstly', 'RB'), (',', ','), ('in', 'IN'), ('June', 'NNP'), (',', ','), ('there', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('College', 'NNP'), ('English', 'NNP'), ('Test', 'NNP'), ('Band', 'NNP'), ('Four', 'NNP'), ('.', '.')], [('In', 'IN'), ('May', 'NNP'), (',', ','), ('Certificate', 'NNP'), ('of', 'IN'), ('Accounting', 'NNP'), ('Professional', 'NNP'), ('is', 'VBZ'), ('around', 'IN'), ('the', 'DT'), ('corner', 'NN'), ('.', '.')], [('Without', 'IN'), ('sufficient', 'JJ'), ('preparations', 'NNS'), (',', ','), ('I', 'PRP'), ('can', 'MD'), ('hardly', 'RB'), ('expect', 'VB'), ('to', 'TO'), ('pass', 'VB'), ('those', 'DT'), ('exams', 'NNS'), ('.', '.')], [('So', 'RB'), ('I', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('plan', 'VB'), ('more', 'JJR'), ('time', 'NN'), ('to', 'TO'), ('take', 'VB'), ('enough', 'JJ'), ('preparation', 'NN'), ('.', '.')]]
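The tags in the output are Penn Treebank tags (NN = singular noun, VBG = gerund, and so on). NLTK can print a short description of any tag via nltk.help.upenn_tagset, which needs the 'tagsets' resource to be downloaded first:
import nltk
nltk.download('tagsets')          # tag documentation used by nltk.help
nltk.help.upenn_tagset('NN')      # noun, singular or mass
nltk.help.upenn_tagset('VBG')     # verb, gerund or present participle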
III. Chinese Tokenization with NLTK
1. Splitting a paragraph into sentences
import nltk
# Processing Chinese text
# For sentence splitting on Chinese text, the delimiter must be '.' to be recognized correctly (and the '.' must be followed by a space)
text1 = '同是风华正茂,怎敢甘拜下风 . 保持学习,保持饥饿'
Juzi_chinese = nltk.sent_tokenize(text1)
print(Juzi_chinese)
Output:
['同是风华正茂,怎敢甘拜下风 .', '保持学习,保持饥饿']
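For comparison, if the two sentences are joined with the Chinese full stop '。' instead of '. ', the English Punkt model does not treat it as a sentence boundary, so the whole string should come back as a single sentence:
import nltk
text2 = '同是风华正茂,怎敢甘拜下风。保持学习,保持饥饿'
print(nltk.sent_tokenize(text2))
# expected: the whole string as one element, because '。' is not a
# sentence-ending character for the English Punkt model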
2. Word tokenization
# Here tokenization is applied to the whole text, not sentence by sentence
# word_tokenize performs the tokenization
tokens=nltk.word_tokenize(text1)
print(tokens)
Output:
['同是', '风华', '正茂,怎敢', '甘拜', '下风', '.', '保持', '学习,保持', '饥饿']
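As the output shows, word_tokenize has no Chinese word-segmentation model, so it only breaks the string on spaces and the ASCII period rather than on real word boundaries. A common workaround, sketched here under the assumption that the third-party jieba package is installed (pip install jieba; it is not part of NLTK), is to segment the text with jieba first and then hand the tokens to NLTK utilities:
import jieba
import nltk
text1 = '同是风华正茂,怎敢甘拜下风 . 保持学习,保持饥饿'
# jieba performs dictionary-based Chinese word segmentation
tokens = jieba.lcut(text1)
print(tokens)
# the segmented tokens can then be used with NLTK tools, e.g. a frequency count
freq = nltk.FreqDist(t for t in tokens if t.strip())
print(freq.most_common(3))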
———— Stay hungry, keep learning
Jackson_MVP