English Text Tokenization with NLTK

Author: 繁依Fanyi0 | 2024-04-01 12:49:44

1. Installing NLTK

First, open a terminal (Anaconda Prompt) and install nltk:

pip install nltk

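Then open a Python shell or Anaconda's Spyder and run the following to download the NLTK data packages. Called with no arguments, nltk.download() opens an interactive downloader; the tokenizers used in this article only need the Punkt tokenizer data ('punkt', or 'punkt_tab' on recent NLTK releases):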

import nltk
nltk.download()

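Note: for detailed steps or other installation methods, see the article "Installing the jieba and NLTK libraries with Anaconda3" (Anaconda3安装jieba库和NLTK库).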

2. Word and Sentence Tokenization with NLTK

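English sentences are essentially sequences of words separated by spaces and punctuation, so tokenization largely amounts to splitting the text at those boundaries and collecting the pieces in a list. This makes the task comparatively simple:
(1) Word tokenization: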

from nltk import word_tokenize     # splits text into word and punctuation tokens (more than a plain split on spaces)
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
words = word_tokenize(paragraph)
print(words)

Output:

['The', 'first', 'time', 'I', 'heard', 'that', 'song', 'was', 'in', 'Hawaii', 'on', 'radio', '.', 'I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'What', 'a', 'fantastic', 'song', '!']
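
For comparison, the "split on spaces and punctuation" idea described above can be approximated with a regular expression from the standard library. This is only a rough sketch, not how word_tokenize works internally; the function name simple_tokenize is made up for illustration, and NLTK treats cases such as contractions differently.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I was just a kid, and loved it very much!"))
# ['I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!']
```

On plain sentences like this one the result matches the NLTK output shown above, which is why the regex is a useful mental model even though it is not a replacement.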

(2) Sentence tokenization:

from nltk import sent_tokenize    # splits text into sentences at sentence-final punctuation
text = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentences = sent_tokenize(text)
print(sentences)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

Note: both word_tokenize and sent_tokenize return their results as a Python list.
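
To see what sentence splitting involves, here is a naive standard-library sketch that breaks the text after ".", "!" or "?" when whitespace follows. The name simple_sent_split is hypothetical. Unlike sent_tokenize, which relies on the pretrained Punkt model, this version has no notion of abbreviations and splits them incorrectly.

```python
import re

def simple_sent_split(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "I was just a kid, and loved it very much! What a fantastic song!"
print(simple_sent_split(text))
# ['I was just a kid, and loved it very much!', 'What a fantastic song!']

# Limitation: an abbreviation is treated as a sentence end.
print(simple_sent_split("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```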

3. Removing Punctuation after Tokenization

from nltk import word_tokenize
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()   # lowercase the text first
cutwords1 = word_tokenize(paragraph)   # tokenize
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # punctuation tokens to drop
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # remove punctuation tokens
print('\n[Result after removing punctuation:]')
print(cutwords2)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Result after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']
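
The hand-written punctuation list above covers only the symbols it names. As a possible alternative, the sketch below filters against Python's built-in string.punctuation and also drops multi-character tokens such as '...'. The helper name strip_punctuation is an assumption for illustration, and it only recognizes ASCII punctuation.

```python
import string

PUNCTUATION = set(string.punctuation)  # ASCII punctuation characters only

def strip_punctuation(tokens):
    """Keep tokens that contain at least one non-punctuation character."""
    # Tokens made up entirely of punctuation (and empty strings) are dropped.
    return [tok for tok in tokens if any(ch not in PUNCTUATION for ch in tok)]

tokens = ['what', 'a', 'fantastic', 'song', '!', '...', "n't"]
print(strip_punctuation(tokens))
# ['what', 'a', 'fantastic', 'song', "n't"]
```

Tokens that merely contain punctuation, such as "n't", are kept because they still carry a letter.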

4. Removing Stop Words after Tokenization
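
A minimal sketch of this step, assuming a token list like cutwords2 from section 3. The small STOP_WORDS set and the function name remove_stop_words are illustrative assumptions. In practice NLTK provides a fuller English list through stopwords.words('english') in nltk.corpus, available after nltk.download('stopwords').

```python
# Illustrative stop-word set (an assumption for this sketch, not NLTK's list).
STOP_WORDS = {"the", "i", "a", "and", "it", "in", "on", "was", "that", "what", "very", "just"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

cutwords2 = ['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio']
cutwords3 = remove_stop_words(cutwords2)
print(cutwords3)
# ['first', 'time', 'heard', 'song', 'hawaii', 'radio']
```

Storing the stop words in a set keeps each membership check constant-time, which matters once the token list grows large.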
