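Language processing task | NLTK modules | Functionality |
Accessing and processing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
String processing | nltk.tokenize, nltk.stem | Word tokenization, sentence splitting, stemming |
Collocation discovery | nltk.collocations | t-test, chi-squared (χ²), pointwise mutual information (PMI) |
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
Chunking | nltk.chunk | Regular expression, n-gram, named entity |
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
Semantic interpretation | nltk.sem, nltk.inference | Lambda (λ) calculus, first-order logic, model checking |
Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
Probability and estimation | nltk.probability | Probability distributions, smoothed probability distributions |
Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
Linguistic fieldwork | nltk.toolbox | Processing data in SIL Toolbox format |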
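Corpus | Description |
gutenberg | A small selection of texts from the Project Gutenberg archive, mostly classic literature |
webtext | Text collected from the web, such as forum posts, reviews, and personal ads |
nps_chat | A corpus of more than 10,000 posts, mainly instant-messaging chat |
brown | A million-word English corpus, categorized by genre |
reuters | The Reuters corpus: more than 10,000 news documents and over a million words, classified into 90 topics and split into a training set and a test set |
inaugural | A speech corpus of several dozen texts, all US presidential inaugural addresses |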
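Method | Description |
fileids() | Returns the list of file names in the corpus |
fileids(categories=[c1,c2]) | Returns the file names that belong to the given categories |
raw(fileids=[f1,f2]) | Returns the raw text of the given files as a string |
raw(categories=[c1,c2]) | Returns the raw text of the given categories |
sents(fileids=[f1,f2]) | Returns the sentences of the given files as a list |
sents(categories=[c1,c2]) | Returns the sentences of the given categories as a list |
words(fileids=[f1,f2]) | Returns the words of the given files as a list |
words(categories=[c1,c2]) | Returns the words of the given categories as a list |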
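To try the methods in the table without downloading any corpus data, the sketch below builds a tiny throwaway corpus on disk and reads it with `CategorizedPlaintextCorpusReader`. The file names, the texts, and the helpers `build_toy_corpus` and `summarize` are invented for illustration; the category of each file is assumed to be the name of its parent directory.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_toy_corpus(root):
    """Write three tiny text files under root (hypothetical sample data)."""
    docs = {
        "news/a.txt": "Stocks rose today.",
        "news/b.txt": "Oil prices fell.",
        "sport/c.txt": "The home team won.",
    }
    for rel_path, text in docs.items():
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)


def summarize(root):
    """Call the corpus reader methods from the table on the toy corpus."""
    # cat_pattern takes the category from the directory part of each file id
    reader = CategorizedPlaintextCorpusReader(
        root, r".*\.txt", cat_pattern=r"(\w+)/.*"
    )
    return {
        "categories": reader.categories(),
        "news_files": reader.fileids(categories=["news"]),
        "sport_words": list(reader.words(categories=["sport"])),
        "raw_a": reader.raw(fileids=["news/a.txt"]),
    }


with tempfile.TemporaryDirectory() as root:
    build_toy_corpus(root)
    summary = summarize(root)

print(summary)
```

The built-in corpora such as `reuters` and `brown` expose the same methods, so the calls carry over once the corresponding data has been downloaded. `sents()` is left out here because it additionally needs the Punkt sentence tokenizer data.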
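from nltk.corpus import reuters  # the corpus data must be downloaded first: nltk.download('reuters')
print(reuters.categories())  # print the categories of the Reuters corpus
print(len(reuters.sents()))  # print the number of sentences in the Reuters corpus
print(len(reuters.words()))  # print the number of words in the Reuters corpus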
import nltk

# Both tokenizers need the Punkt data: nltk.download('punkt') ('punkt_tab' on recent NLTK releases)
sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
"to examine that event. They found that the reversal took about as long as many scientists previously " \
"believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
"sedimentary and Antarctic ice core data, to examine that event. "
tokens = nltk.sent_tokenize(sentence)  # split the text into sentences
tokens2 = nltk.word_tokenize(sentence)  # split the text into words and punctuation marks
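`sent_tokenize` and `word_tokenize` depend on the downloadable Punkt model. For a quick experiment, NLTK also ships purely rule-based tokenizers that need no extra data. The comparison below is a small sketch; the sample string is made up, and which tokenizer suits a given task is a judgment call.

```python
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

sample = "They didn't go."

# wordpunct_tokenize splits on every run of punctuation, including the apostrophe
wordpunct_tokens = wordpunct_tokenize(sample)

# The Treebank tokenizer keeps the contraction "n't" together as one token
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

print(wordpunct_tokens)  # ['They', 'didn', "'", 't', 'go', '.']
print(treebank_tokens)
```

The difference matters downstream: a stop word list or a tagger that expects `n't` as a token will behave differently on the output of `wordpunct_tokenize`.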
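Method | Purpose |
B() | Returns the number of distinct samples (bins), that is, the size of the vocabulary |
plot(n, cumulative=False) | Plots the frequency distribution of the n most frequent samples; if cumulative is True, plots the cumulative frequency distribution |
tabulate() | Prints the frequency distribution as a table |
most_common(n) | Returns the n most frequent words together with their counts |
hapaxes() | Returns the words that occur only once |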
import nltk

# Read the sample file and plot the cumulative counts of its 30 most frequent tokens
with open('demo.txt') as f:
    text = f.read()
fdist = nltk.FreqDist(nltk.word_tokenize(text))
fdist.plot(30, cumulative=True)
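The non-plotting methods from the table are easy to check on a handful of tokens. The following sketch uses an invented token list and `str.split()` instead of `word_tokenize`, so it runs without `demo.txt` or any downloaded data.

```python
from nltk.probability import FreqDist

# Hypothetical sample tokens, split on whitespace for simplicity
tokens = "the field flipped and the records show the field flipped".split()
fdist = FreqDist(tokens)

distinct = fdist.B()            # number of distinct tokens
total = fdist.N()               # total number of tokens counted
top = fdist.most_common(1)      # the single most frequent token with its count
once = sorted(fdist.hapaxes())  # tokens that occur exactly once
share = fdist.freq("the")       # relative frequency of "the"

print(distinct, total, top, once, share)
```

Here `B()` is 6 and `N()` is 10, which shows the difference between vocabulary size and token count.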
import nltk

# Draw a lexical dispersion plot: where each word occurs across the text
with open('demo.txt') as f:
    words = f.read()
text = nltk.text.Text(nltk.word_tokenize(words))
text.dispersion_plot(['time', 'about', 'field', 'magnetic', 'records', 'underway'])
Part-of-speech (POS) tagging is a way of analyzing the constituents of a sentence: it identifies the part of speech of each word.
Tag | Part of speech | Examples |
ADJ | Adjective | new, good, high, special, big, local |
ADV | Adverb | really, already, still, early, now |
CONJ | Conjunction | and, or, but, if, while, although |
DET | Determiner | the, a, some, most, every, no |
EX | Existential | there, there’s |
MOD | Modal verb | will, can, would, may, must, should |
NN | Noun | year, home, costs, time |
NNP | Proper noun | April, China, Washington |
NUM | Number | fourth, 2016, 09:30 |
PRON | Pronoun | he, they, us |
P | Preposition | on, over, with, of |
TO | The word to | to |
UH | Interjection | ah, ha, oops |
VB | Verb | get, do, make, see, run |
VBD | Verb, past tense | made, said, went |
VBG | Present participle | going, lying, playing |
VBN | Past participle | taken, given, gone |
WH | Wh-determiner | who, where, when, what |
import nltk

sentence = "They found that the reversal took about as long as many scientists previously believed it did, " \
"just a few thousand years."
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)  # list of (word, tag) pairs
print(tagged_sent)
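`nltk.pos_tag` relies on a pretrained model that has to be downloaded. To see how a tagger assigns tags in the first place, a unigram tagger with a backoff can be trained on a few hand-tagged sentences. This is only a toy sketch, not the model behind `pos_tag`; the training sentences and the choice of `NN` as the default tag are assumptions made for illustration.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-tagged training data: a list of sentences of (word, tag) pairs
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Words never seen in training fall back to the default tag "NN"
default_tagger = DefaultTagger("NN")
tagger = UnigramTagger(train_sents, backoff=default_tagger)

tagged = tagger.tag(["the", "cat", "barks", "loudly"])
print(tagged)
```

The unseen word `loudly` gets the default tag `NN`, which is wrong for an adverb; that is the kind of error a larger training corpus or a better backoff chain is meant to reduce.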
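After simple tokenization, a text still contains many common words that carry little meaning, such as a, the, and he. These words are so frequent that almost every page contains them. If a search engine indexed them as keywords, every site would match and the index would have no discriminating power, so such words are usually removed rather than treated as keywords. NLTK provides an English stop word list that can be used directly.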
import nltk

# The stop word list must be downloaded first: nltk.download('stopwords')
sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
"to examine that event. They found that the reversal took about as long as many scientists previously " \
"believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
"sedimentary and Antarctic ice core data, to examine that event. "
tokens = nltk.word_tokenize(sentence)
stops = set(nltk.corpus.stopwords.words('english'))
# Keep only the tokens that are not stop words (compared case-insensitively)
tokens = [word for word in tokens if word.lower() not in stops]
print(tokens)
from nltk.corpus import wordnet

syn = wordnet.synsets("dynamic")
print("Definition:", syn[0].definition())
print("Examples:", syn[0].examples())

# Collect the synonyms from the lemmas of the first synset
synonyms = []
for lemma in syn[0].lemmas():
    synonyms.append(lemma.name())
print("Synonyms:", synonyms)

# Collect the antonyms across all synsets
antonyms = []
for ss in syn:
    for lemma in ss.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print("Antonyms:", antonyms)
import nltk

sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
"to examine that event. They found that the reversal took about as long as many scientists previously " \
"believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
"sedimentary and Antarctic ice core data, to examine that event. "
tokens = nltk.word_tokenize(sentence)
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
print("sentence: " + sentence)
print("PorterStemmer: ")
print([porter.stem(t) for t in tokens])
print("LancasterStemmer: ")
print([lancaster.stem(t) for t in tokens])
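One practical use of stemming is to merge the inflected forms of a word into a single key. The helper below, `group_by_stem`, is a hypothetical name introduced for this sketch; it groups a word list by Porter stem and needs no downloaded data.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(
    ["record", "records", "recorded", "recording", "study", "studies"]
)
print(groups)
```

Note that the key for `study` and `studies` is `studi`, which is not a dictionary word. A stem is only a shared prefix-like form, whereas lemmatization returns an actual word.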
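import nltk
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    # Convert a Penn Treebank tag into the matching WordNet part of speech
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Same sample text as in the stemming example
sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
"to examine that event. They found that the reversal took about as long as many scientists previously " \
"believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
"sedimentary and Antarctic ice core data, to examine that event. "
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
wnl = nltk.WordNetLemmatizer()
print("WordNetLemmatizer:")
# Fall back to NOUN when the tag has no WordNet equivalent
print([wnl.lemmatize(t[0], get_wordnet_pos(t[1]) or wordnet.NOUN) for t in tagged_sent])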