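NLTK is a natural language processing toolkit. With a plain Python installation you need to install the package before you can use it. I use Jupyter Notebook (Anaconda3), where it is already available, so it can be imported directly with the following code: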
import nltk
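This post focuses on downloading the punkt data with nltk.download() and loading it with nltk.data.load(), so the other functions in nltk are not covered. The tokenizer loaded this way is used to split large amounts of text into sentences.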
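Before using it, you need to download the pickle file for the relevant language (these files can be found under tokenizers/punkt), so first download punkt as shown below. The output shows where the package is installed on your computer, which may be needed later. Unfortunately, there is no Chinese model.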
nltk.download('punkt')
[nltk_data] Downloading package punkt to C:\Users\my
[nltk_data] computer\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
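The code I first typed had neither the r prefix nor the encoding argument, and for reasons I could not work out it raised errors, the path in particular kept failing. Adding those two made it work. Later I found that it also works without them, and I still do not know why it kept failing at first. If you have run into something similar, feel free to leave a comment. (According to the help text for nltk.data.load, encoding is only used for text formats, so it should have no effect on a pickle resource.)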
ch=nltk.data.load(r'tokenizers\\punkt\\english.pickle',encoding='utf-8')
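To see what the r prefix changes in the line above, here is a minimal sketch that compares the strings Python actually builds. It makes no NLTK call, and the variable names are only illustrative. It does not explain the original error; it only shows that the raw string with doubled backslashes is a different string from the non-raw one.

```python
# Compare how Python interprets the same resource path written three ways.
raw_path = r'tokenizers\\punkt\\english.pickle'   # raw string: each \\ stays as two backslashes
plain_path = 'tokenizers\\punkt\\english.pickle'  # normal string: each \\ is one escaped backslash
slash_path = 'tokenizers/punkt/english.pickle'    # forward slashes need no escaping

print(raw_path)    # tokenizers\\punkt\\english.pickle
print(plain_path)  # tokenizers\punkt\english.pickle
print(slash_path)  # tokenizers/punkt/english.pickle

# Count the backslash characters that each string really contains.
print(raw_path.count('\\'), plain_path.count('\\'), slash_path.count('\\'))  # 4 2 0
```

Writing the resource name with forward slashes, as in tokenizers/punkt above, avoids the escaping question entirely.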
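As the output below shows, the text was not actually split: it contains no sentence-ending punctuation, so the tokenizer returns it as a single sentence. Interestingly, adding punctuation gives a different result. (Punkt looks for sentence boundaries at '.', '?' and '!', so it is a period rather than a comma that produces a split.)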
ch.tokenize('I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people')
['I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people']
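The effect of punctuation can be checked with a small sketch. It assumes NLTK is installed and, to avoid depending on the downloaded english.pickle, uses an untrained PunktSentenceTokenizer, which only approximates the pretrained English model. The example sentences are shortened stand-ins for the one used above.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained Punkt tokenizer: no english.pickle and no download required.
# Assumption: it behaves like the pretrained model on these simple inputs.
tokenizer = PunktSentenceTokenizer()

no_punct = tokenizer.tokenize('I am Chinese I love the Chinese people')
with_comma = tokenizer.tokenize('I am Chinese, I love the Chinese people')
with_period = tokenizer.tokenize('I am Chinese. I love the Chinese people.')

print(len(no_punct))    # 1: nothing marks a sentence boundary
print(len(with_comma))  # 1: a comma is not a sentence boundary
print(with_period)      # ['I am Chinese.', 'I love the Chinese people.']
```

The pretrained model additionally knows common abbreviations, so on real text its results can differ from this untrained sketch.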
Below is the help text for nltk.data.load, including its parameters:
help(nltk.data.load)
Help on function load in module nltk.data:

load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)
    Load a given resource from the NLTK data package.  The following
    resource formats are currently supported:

      - ``pickle``
      - ``json``
      - ``yaml``
      - ``cfg`` (context free grammars)
      - ``pcfg`` (probabilistic CFGs)
      - ``fcfg`` (feature-based CFGs)
      - ``fol`` (formulas of First Order Logic)
      - ``logic`` (Logical formulas to be parsed by the given logic_parser)
      - ``val`` (valuation of First Order Logic model)
      - ``text`` (the file contents as a unicode string)
      - ``raw`` (the raw file contents as a byte string)

    If no format is specified, ``load()`` will attempt to determine a
    format based on the resource name's file extension.  If that
    fails, ``load()`` will raise a ``ValueError`` exception.

    For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
    it tries to decode the raw contents using UTF-8, and if that doesn't
    work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
    is specified.

    :type resource_url: str
    :param resource_url: A URL specifying where the resource should be
        loaded from.  The default protocol is "nltk:", which searches
        for the file in the the NLTK data package.
    :type cache: bool
    :param cache: If true, add this resource to a cache.  If load()
        finds a resource in its cache, then it will return it from the
        cache rather than loading it.
    :type verbose: bool
    :param verbose: If true, print a message when loading a resource.
        Messages are not displayed when a resource is retrieved from
        the cache.
    :type logic_parser: LogicParser
    :param logic_parser: The parser that will be used to parse logical
        expressions.
    :type fstruct_reader: FeatStructReader
    :param fstruct_reader: The parser that will be used to parse the
        feature structure of an fcfg.
    :type encoding: str
    :param encoding: the encoding of the input; only used for text formats.