English Text Tokenization with NLTK

1. Installing NLTK

First, open a terminal (for example, Anaconda Prompt) and install nltk:

pip install nltk

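Then open a Python shell (or Spyder in Anaconda) and run the following to open the NLTK downloader and fetch the data packages (models and corpora) that NLTK needs: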

import nltk
nltk.download()

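Note: for detailed steps or other installation methods, see the article "Anaconda3安装jieba库和NLTK库" (Installing the jieba and NLTK libraries with Anaconda3).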
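
Calling nltk.download() with no arguments opens an interactive downloader. For scripts or machines without a display, it may be more convenient to fetch only the resources the examples in this article need. The sketch below is a hypothetical helper (ensure_nltk_resources is not part of NLTK) built on nltk.download(resource_id, quiet=True). It assumes the tokenizers need the Punkt sentence model, which older NLTK releases load from "punkt" and recent ones from "punkt_tab"; check which identifier your installed version asks for.

```python
def ensure_nltk_resources(resource_ids=("punkt", "punkt_tab")):
    """Try to download each NLTK resource; return {resource_id: succeeded}.

    Hypothetical helper, not part of NLTK. The default IDs are an assumption:
    older releases load the Punkt model from "punkt", newer ones from "punkt_tab".
    """
    import nltk  # imported here so the helper can be defined before NLTK is installed

    results = {}
    for resource_id in resource_ids:
        # quiet=True suppresses progress output; the return value is truthy on success.
        results[resource_id] = bool(nltk.download(resource_id, quiet=True))
    return results
```

Calling ensure_nltk_resources() once, with network access, before running the tokenization examples should be enough. A False entry for one of the two IDs is to be expected when that ID does not exist for your NLTK version.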

2. Word and Sentence Tokenization with NLTK

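English sentences are essentially made up of words, spaces, and punctuation, so tokenization mostly comes down to splitting the text into a list on spaces and punctuation marks. This makes it considerably simpler than tokenizing a language that is written without spaces between words:
(1) Word tokenization: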

from nltk import word_tokenize     # splits text into word and punctuation tokens
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
words = word_tokenize(paragraph)
print(words)

Output:

['The', 'first', 'time', 'I', 'heard', 'that', 'song', 'was', 'in', 'Hawaii', 'on', 'radio', '.', 'I', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'What', 'a', 'fantastic', 'song', '!']
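
To make the "split on spaces and punctuation" idea concrete, here is a minimal standard-library sketch. simple_tokenize is a hypothetical helper, not an NLTK function, and it only approximates what word_tokenize does. For the sample paragraph it yields the same tokens, but it splits a contraction such as "don't" into don, ' and t, whereas word_tokenize yields do and n't.

```python
import re

def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks, in order."""
    # \w+ matches a run of letters, digits, or underscores; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

paragraph = ("The first time I heard that song was in Hawaii on radio. "
             "I was just a kid, and loved it very much! What a fantastic song!")
tokens = simple_tokenize(paragraph)
print(tokens)
print(simple_tokenize("don't"))  # ['don', "'", 't']
```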

(2) Sentence splitting:

from nltk import sent_tokenize    # splits text into sentences at sentence-ending punctuation
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
sentences = sent_tokenize(paragraph)
print(sentences)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

Note: both word_tokenize and sent_tokenize return their results as a Python list.
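
The idea of splitting on punctuation can also be sketched for sentences using only the standard library. naive_sent_split is a hypothetical name, not an NLTK API. It reproduces the three sentences from the sample paragraph and, like the NLTK functions, returns a plain list. It also breaks after abbreviations such as "Dr.", a case the pre-trained model behind sent_tokenize is designed to handle better.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    # The lookbehind keeps each punctuation mark attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())

paragraph = ("The first time I heard that song was in Hawaii on radio. "
             "I was just a kid, and loved it very much! What a fantastic song!")
sentences = naive_sent_split(paragraph)
print(sentences)

# The naive rule wrongly treats the period in "Dr." as a sentence end.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```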

3. Removing Punctuation After Tokenization

from nltk import word_tokenize
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!".lower()
cutwords1 = word_tokenize(paragraph)   # tokenize the lowercased text
print('[NLTK tokenization result:]')
print(cutwords1)

interpunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']   # punctuation tokens to drop
cutwords2 = [word for word in cutwords1 if word not in interpunctuations]   # keep only non-punctuation tokens
print('\n[Tokens after removing punctuation:]')
print(cutwords2)

Output:

[NLTK tokenization result:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', '.', 'i', 'was', 'just', 'a', 'kid', ',', 'and', 'loved', 'it', 'very', 'much', '!', 'what', 'a', 'fantastic', 'song', '!']

[Tokens after removing punctuation:]
['the', 'first', 'time', 'i', 'heard', 'that', 'song', 'was', 'in', 'hawaii', 'on', 'radio', 'i', 'was', 'just', 'a', 'kid', 'and', 'loved', 'it', 'very', 'much', 'what', 'a', 'fantastic', 'song']
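
A hand-written punctuation list is easy to leave incomplete; the one above has no quotation marks or hyphen, for example. One possible alternative, sketched here with only the standard library, is to drop every token that consists entirely of characters from string.punctuation. remove_punctuation is a hypothetical helper name, and the sample tokens are made up for illustration.

```python
import string

def remove_punctuation(tokens):
    """Drop tokens made up entirely of ASCII punctuation characters."""
    punct = set(string.punctuation)
    # Tokens that mix letters and punctuation, such as "n't", are kept.
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]

sample_tokens = ['what', 'a', 'fantastic', 'song', '!', '...', '``', "n't", '--']
cleaned = remove_punctuation(sample_tokens)
print(cleaned)  # ['what', 'a', 'fantastic', 'song', "n't"]
```

Note that string.punctuation covers ASCII punctuation only, so typographic characters such as curly quotes would need to be added separately.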

4. Removing Stop Words After Tokenization
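
The text of this section is cut off in the source, so what follows is not the author's code. It is a minimal sketch of how stop-word removal is commonly done, using the same list-comprehension pattern as the punctuation step. remove_stopwords and DEMO_STOP_WORDS are hypothetical names, and the tiny stop list is made up for illustration. With NLTK, a fuller English list is available from nltk.corpus.stopwords after downloading the "stopwords" resource.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words (case-insensitive)."""
    stop = {w.lower() for w in stop_words}
    return [tok for tok in tokens if tok.lower() not in stop]

# Hypothetical mini stop list, for illustration only.
DEMO_STOP_WORDS = {"the", "i", "that", "was", "in", "on", "a", "and", "it"}

# With NLTK installed and the "stopwords" resource downloaded, one could use:
#   from nltk.corpus import stopwords
#   stop_words = stopwords.words("english")

tokens = ['the', 'first', 'time', 'i', 'heard', 'that', 'song']
filtered = remove_stopwords(tokens, DEMO_STOP_WORDS)
print(filtered)  # ['first', 'time', 'heard', 'song']
```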