The Project Gutenberg corpus in NLTK contains the following files:
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
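Usage:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
# num_chars, the character count of the raw text, includes whitespace characters
# raw() returns the contents of the file without any linguistic processing
# sents() splits the text into sentences, where each sentence is a list of words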
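The three accessors above (raw(), words(), sents()) are enough to compute simple per-file statistics. What follows is a minimal sketch, not the original author's code: the helper names corpus_stats and gutenberg_stats are hypothetical, and the second helper assumes NLTK is installed and the corpus has been fetched with nltk.download('gutenberg').

```python
def corpus_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity).

    raw   -- the unprocessed text as one string
    words -- a sequence of word tokens
    sents -- a sequence of sentences, each a list of word tokens
    """
    num_chars = len(raw)        # counts whitespace characters as well
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})  # distinct words, case-folded
    return (num_chars / num_words,
            num_words / num_sents,
            num_words / num_vocab)


def gutenberg_stats(fileid):
    """Hypothetical convenience wrapper around the Gutenberg corpus reader."""
    # Imported lazily so corpus_stats stays usable without NLTK installed.
    from nltk.corpus import gutenberg
    return corpus_stats(gutenberg.raw(fileid),
                        gutenberg.words(fileid),
                        gutenberg.sents(fileid))
```

Calling gutenberg_stats('austen-emma.txt') would return the three ratios for Emma. Because num_chars is taken from raw(), the average word length it yields is slightly inflated by spaces and line breaks.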
Informal text corpus
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print(fileid, webtext.raw(fileid)[:65], '...')
Instant messaging chat session corpus
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[12]