
NLTK Tutorial (continuously updated…)


诸神缄默不语 - Personal CSDN Blog Post Index

NLTK is a widely used Python package for natural language processing; this article is a simple tutorial for it.
NLTK API documentation: NLTK :: nltk package

1. tokenize: word and sentence tokenization

The punkt folder I downloaded contains no chinese.pickle file (I have seen some GitHub issues and Google Groups posts online that mention one, which puzzles me, but in any case I don't have it), so I believe Chinese tokenization is not supported.
The language can be switched via the function's language parameter, but the default is English and Chinese is not available; since I don't know the other supported languages, I haven't tried them.
tokenize documentation: NLTK :: nltk.tokenize package
punkt documentation: NLTK :: nltk.tokenize.punkt module
punkt source code: NLTK :: nltk.tokenize.punkt

English word tokenization (requires the Punkt sentence tokenization model to be installed):

from nltk.tokenize import word_tokenize

sentence = "We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes."
tokenized_result = word_tokenize(sentence)
print(tokenized_result)

Output:
['We', 'trained', 'a', 'large', ',', 'deep', 'convolutional', 'neural', 'network', 'to', 'classify', 'the', '1.2', 'million', 'high-resolution', 'images', 'in', 'the', 'ImageNet', 'LSVRC-2010', 'contest', 'into', 'the', '1000', 'different', 'classes', '.']


Simple English tokenization (rule-based only, i.e. splitting on whitespace and punctuation):

from nltk.tokenize import wordpunct_tokenize

sentence = "We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes."
tokenized_result = wordpunct_tokenize(sentence)
print(tokenized_result)

Output:
['We', 'trained', 'a', 'large', ',', 'deep', 'convolutional', 'neural', 'network', 'to', 'classify', 'the', '1', '.', '2', 'million', 'high', '-', 'resolution', 'images', 'in', 'the', 'ImageNet', 'LSVRC', '-', '2010', 'contest', 'into', 'the', '1000', 'different', 'classes', '.']

English sentence tokenization, and sentence tokenization followed by word tokenization:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)] 
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

2. stem

stem documentation: NLTK :: nltk.stem package

2.1 nltk.stem.wordnet

nltk.stem.wordnet module documentation: NLTK :: nltk.stem.wordnet module

English lemmatization with the WordNet Lemmatizer:
It uses WordNet's built-in morphy function to perform lemmatization; that function's documentation: morphy(7WN) | WordNet
