NLTK design goals:
The NLTK natural language processing library consists mainly of the following modules and their corresponding functionality:
Language processing task | NLTK module | Functionality |
---|---|---|
Accessing and processing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
String processing | nltk.tokenize, nltk.stem | Tokenization, sentence splitting, stemming |
Collocation discovery | nltk.collocations | t-test, chi-squared test, pointwise mutual information (PMI) |
Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
Classification | nltk.classify, nltk.cluster | Decision trees, maximum entropy, naive Bayes, EM, k-means |
Chunking | nltk.chunk | Regular expressions, n-grams, named entities |
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
Linguistic fieldwork | nltk.toolbox | Processing data in SIL Toolbox format |
The nltk.corpus package provides several annotated corpora, as shown in the table below:
Corpus | Description |
---|---|
gutenberg | A selection of texts from the Project Gutenberg archive, mostly classic literary works |
webtext | Web-collected text such as forum discussions and personal ads |
nps_chat | A corpus of more than ten thousand instant-messaging chat posts |
brown | A million-word English corpus, categorized by genre |
reuters | The Reuters corpus: more than ten thousand news documents, over a million words, grouped into 90 topics and split into a training set and a test set |
inaugural | A corpus of several dozen texts, all presidential inaugural addresses |
Method | Description |
---|---|
fileids() | Return the list of file identifiers in the corpus |
fileids(categories=[]) | Return the file identifiers for the specified categories |
raw(fileids=[f1,f2]) | Return the raw text of the specified files as a string |
raw(categories=[]) | Return the raw text of the specified categories |
sents(fileids=[f1,f2]) | Return the list of sentences of the specified files |
sents(categories=[c1,c2]) | Return the list of sentences of the specified categories |
words(fileids=[f1,f2]) | Return the list of words of the specified files |
words(categories=[]) | Return the list of words of the specified categories |
from nltk.corpus import reuters

print(reuters.categories())   # categories of the reuters corpus
print(len(reuters.sents()))   # number of sentences in the reuters corpus
print(len(reuters.words()))   # number of words in the reuters corpus
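To illustrate the remaining access methods from the table, here is a minimal sketch using the same reuters corpus; the category name 'trade' is only an example, and any value printed by reuters.categories() would do.

from nltk.corpus import reuters

# File identifiers reflect the training/test split mentioned in the table,
# e.g. 'training/9865' and 'test/14826'.
fids = reuters.fileids()
print(fids[:5])

# Restrict the file list to a single category.
print(reuters.fileids(categories='trade')[:5])

# Raw text of one document, and the word list of a whole category.
print(reuters.raw(fileids=fids[0])[:200])
print(reuters.words(categories='trade')[:20])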
tokenize is NLTK's tokenization package; its functions recognize English words and punctuation and split text into sentences or words.
import nltk
import numpy as np

sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
           "to examine that event. They found that the reversal took about as long as many scientists previously " \
           "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
           "sedimentary and Antarctic ice core data, to examine that event. "

tokens = nltk.sent_tokenize(sentence)    # split the text into sentences
print('sent_tokenize:')
print(np.array(tokens))

tokens2 = nltk.word_tokenize(sentence)   # split the text into words and punctuation
print('word_tokenize:')
print(tokens2)
Frequency counting is implemented in NLTK by the FreqDist class. It records how many times each word occurs and can present the counts as a table or a plot. Its structure is simple: it is implemented as an ordered dictionary.
Method | Description |
---|---|
B() | Return the number of distinct samples (the length of the dictionary) |
plot(title, cumulative=False) | Plot the frequency distribution; if cumulative is True, plot the cumulative distribution |
tabulate() | Print the frequency distribution in tabular form |
most_common() | Return the most frequent samples together with their counts |
hapaxes() | Return the samples that occur only once |
import nltk

text = open('demo.txt').read()
fdist = nltk.FreqDist(nltk.word_tokenize(text))   # count word frequencies
fdist.plot(30, cumulative=True)                   # cumulative plot of the 30 most frequent words
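The remaining methods from the table can be exercised the same way; a minimal sketch, reusing the fdist built above:

print(fdist.B())                 # number of distinct words
print(fdist.most_common(10))     # the 10 most frequent words and their counts
print(fdist.hapaxes()[:10])      # a few words that occur only once
fdist.tabulate(10)               # frequency table of the 10 most frequent words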
A dispersion plot shows where the specified words occur across the text.
import nltk

words = open('demo.txt').read()
text = nltk.text.Text(nltk.word_tokenize(words))
# Plot the positions at which each of these words appears in the text.
text.dispersion_plot(["time", 'about', 'field', 'magnetic', 'records', 'underway'])
Part-of-speech (POS) tagging is a way of analyzing the components of a sentence: it identifies the part of speech of each word.
Tag | Part of speech | Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big, local |
ADV | adverb | really, already, still, early, now |
CONJ | conjunction | and, or, but, if, while, although |
DET | determiner | the, a, some, most, every, no |
EX | existential "there" | there, there's |
MOD | modal verb | will, can, would, may, must, should |
NN | noun | year, home, costs, time |
NNP | proper noun | April, China, Washington |
NUM | numeral | fourth, 2016, 09:30 |
PRON | pronoun | he, they, us |
P | preposition | on, over, with, of |
TO | the word "to" | to |
UH | interjection | ah, ha, oops |
VB | verb | |
VBD | verb, past tense | made, said, went |
VBG | present participle | going, lying, playing |
VBN | past participle | taken, given, gone |
WH | wh-determiner | who, where, when, what |
import nltk

sentence = "They found that the reversal took about as long as many scientists previously believed it did, " \
           "just a few thousand years."
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)   # tag each token with its part of speech
print(tagged_sent)
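By default pos_tag returns Penn Treebank tags (NN, VBD, ...). It also accepts a tagset argument that maps the output onto the coarser "universal" tagset, closer to the short tags in the table above; a minimal sketch, assuming the universal_tagset mapping data has been downloaded:

# Map the Penn Treebank tags onto the coarse universal tagset.
print(nltk.pos_tag(tokens, tagset='universal'))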
After simple tokenization, the text still contains many common words with little real meaning. Because such words (a, the, he, and so on) occur with very high frequency and appear on almost every page, indexing them as keywords would match every document and provide no discrimination; they are therefore usually removed and not treated as keywords. NLTK ships with an English stop-word list that can be used directly.
import nltk

sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
           "to examine that event. They found that the reversal took about as long as many scientists previously " \
           "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
           "sedimentary and Antarctic ice core data, to examine that event. "

tokens = nltk.word_tokenize(sentence)
stops = set(nltk.corpus.stopwords.words('english'))              # NLTK's English stop-word list
tokens = [word for word in tokens if word.lower() not in stops]  # drop stop words
print(tokens)
WordNet is a lexical database built for natural language processing. For a word it provides sets of synonyms (synsets) together with short definitions.
from nltk.corpus import wordnet

syn = wordnet.synsets("dynamic")
print("Definition:", syn[0].definition())
print("Examples:", syn[0].examples())

synonyms = []
for lemma in syn[0].lemmas():
    synonyms.append(lemma.name())
print("Synonyms:", synonyms)

antonyms = []
for ss in syn:
    for lemma in ss.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print("Antonyms:", antonyms)
After obtaining a corpus, NLP work usually begins with text preprocessing. For English, NLTK's preprocessing covers tokenization, stop-word removal, stemming and related steps. For stop-word removal, nltk.corpus includes a stopwords corpus. For stemming, NLTK provides two stemmers, Porter and Lancaster. It also provides a WordNetLemmatizer for lemmatization: its lemmatize() function takes the word as the first argument and the word's part of speech as the second.
Stemmers are usually implemented with regular expressions based on grammatical rules, so they cover a wide range of words but apply their rules rather rigidly. A lemmatizer is dictionary-based, which makes it slower, and its coverage depends on the size of the dictionary.
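A minimal sketch contrasting the two on a single word (the word "leaves" is just an illustrative input):

import nltk

porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

# Rule-based suffix stripping versus dictionary lookup:
print(porter.stem("leaves"))          # 'leav' -- not a real word
print(wnl.lemmatize("leaves", "n"))   # 'leaf' -- dictionary-based lemma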
Stemming is one of the main operations in text preprocessing: it removes a word's prefixes and suffixes to obtain its root. It extracts the stem or root form based on rules, which keeps the method simple but does not always preserve the full meaning.
import nltk

sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
           "to examine that event. They found that the reversal took about as long as many scientists previously " \
           "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
           "sedimentary and Antarctic ice core data, to examine that event. "

tokens = nltk.word_tokenize(sentence)
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

print("sentence: " + sentence)
print("PorterStemmer: ")
print([porter.stem(t) for t in tokens])
print("LancasterStemmer: ")
print([lancaster.stem(t) for t in tokens])
The Porter and Lancaster stemmers strip affixes according to their own rules. In this example the Porter stemmer handles the word "women" correctly, while the Lancaster stemmer cuts it unnecessarily.
Lemmatization is an important part of text preprocessing and is very similar to stemming. Lemmatization removes a word's affixes and extracts its base form; the result is normally a word found in a dictionary, whereas the output of stemming is not necessarily a dictionary word. For example, the word "cars" lemmatizes to "car", and "ate" lemmatizes to "eat".
import nltk
from nltk.corpus import wordnet


def get_wordnet_pos(tag):
    # Map a Penn Treebank tag to the corresponding WordNet part of speech.
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


# Reuses the example sentence defined in the stemming example above.
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
wnl = nltk.WordNetLemmatizer()
print("WordNetLemmatizer:")
print([wnl.lemmatize(t[0], get_wordnet_pos(t[1]) or wordnet.NOUN) for t in tagged_sent])