
Using nltk.data.load() and things to watch out for


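NLTK is a natural language processing toolkit for Python. With a plain Python installation it has to be installed before it can be used. I use Jupyter Notebook (Anaconda3), which already includes it, so the following import works directly: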

import nltk

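This post is about using nltk.data.load(), so the other functions in NLTK are not covered. Here it is used to load the Punkt tokenizer, which splits large amounts of text into sentences.
Before using it, you need to download the pickle file for the language you want (it can be found under tokenizers/punkt). So first download punkt as shown below; the output shows where it is installed on the machine, which may be needed later. Unfortunately, there is no model for Chinese.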

    nltk.download('punkt')
      [nltk_data] Downloading package punkt to C:\Users\my
      [nltk_data]     computer\AppData\Roaming\nltk_data...
      [nltk_data]   Package punkt is already up-to-date!
      
      
      
      
      
      True
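
As a hedged sketch of how one might check where NLTK looks for its data and whether punkt is already present, the snippet below uses `nltk.data.path` and `nltk.data.find`. The helper names `resource_available` and `ensure_resource` are hypothetical and introduced only for illustration. The resource and package names follow this post; newer NLTK releases may ship the Punkt data under a different package name, so treat them as assumptions.

```python
import nltk


def resource_available(resource_name: str) -> bool:
    """Return True if NLTK can locate the resource on its search path.

    Hypothetical helper: nltk.data.find raises LookupError when the
    resource cannot be found in any directory listed in nltk.data.path.
    """
    try:
        nltk.data.find(resource_name)
        return True
    except LookupError:
        return False


def ensure_resource(resource_name: str, package: str) -> bool:
    """Download `package` only if `resource_name` is missing (needs network)."""
    if resource_available(resource_name):
        return True
    return bool(nltk.download(package))


# nltk.data.path is the list of directories NLTK searches for data files.
for directory in nltk.data.path:
    print(directory)
```

Calling `ensure_resource('tokenizers/punkt', 'punkt')` would then download the package only when it is not already installed.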
      

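At first my code had neither the `r` prefix nor the `encoding` argument, and for reasons I could not pin down it raised errors, mostly about the path. Adding those two made it work, but I later found that it also works without them, and I still do not know why it kept failing at first. If you have run into something similar, feel free to leave a comment. Two things worth noting: according to the help text further down, `encoding` is only used for text formats, so it should have no effect on a pickle resource; and NLTK resource names are normally written with forward slashes, as in 'tokenizers/punkt/english.pickle'.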

      ch=nltk.data.load(r'tokenizers\\punkt\\english.pickle',encoding='utf-8')
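
Part of the path confusion may come from how Python treats backslashes. The sketch below uses only plain Python strings (no NLTK call) to show what the `r` prefix does to the path literal above; the variable names are illustrative only.

```python
# With the r prefix, backslashes are kept literally, so '\\' stays as
# two backslash characters in the resulting string.
raw_path = r'tokenizers\\punkt\\english.pickle'

# Without the prefix, each '\\' is an escape for one backslash.
plain_path = 'tokenizers\\punkt\\english.pickle'

# Forward slashes need no escaping and read the same on every platform.
slash_path = 'tokenizers/punkt/english.pickle'

print(raw_path.count('\\'))    # number of backslash characters: 4
print(plain_path.count('\\'))  # number of backslash characters: 2

# Replacing each doubled backslash recovers the forward-slash form.
normalized = raw_path.replace('\\\\', '/')
print(normalized == slash_path)  # True
```

So combining the `r` prefix with doubled backslashes produces a different string than either convention on its own, which is one reason to prefer the forward-slash form.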

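As the call below shows, the text is not actually split: the input has no sentence-ending punctuation, so it comes back as a single sentence. Interestingly, the result changes once punctuation is added. I originally noticed this with a comma, though as far as I can tell it is sentence-ending punctuation such as a period that Punkt, being a sentence tokenizer, actually splits on.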

        ch.tokenize('I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people')
          ['I am Chinese I love the Chinese Communist Party to support the fundamental interests of the Chinese people']
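
To make the effect of punctuation concrete, here is a minimal sketch comparing three variants of a short input. It assumes an untrained `PunktSentenceTokenizer` built with default parameters, which needs no downloaded data; this is a stand-in for the pretrained english.pickle model used above, whose behaviour on less trivial text can differ.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Assumption: an untrained tokenizer with default parameters, used here
# instead of the pretrained English model so no data download is needed.
tokenizer = PunktSentenceTokenizer()

no_punctuation = 'I am Chinese I love natural language processing'
with_comma = 'I am Chinese, I love natural language processing'
with_periods = 'I am Chinese. I love natural language processing.'

for text in (no_punctuation, with_comma, with_periods):
    sentences = tokenizer.tokenize(text)
    # Only the variant with periods yields more than one sentence.
    print(len(sentences), sentences)
```

Only the period-separated variant is split into two sentences; the comma alone does not create a sentence boundary.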
          

The parameters of nltk.data.load are documented as follows:

          help(nltk.data.load)
            Help on function load in module nltk.data:
            
            load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)
                Load a given resource from the NLTK data package.  The following
                resource formats are currently supported:
                
                  - ``pickle``
                  - ``json``
                  - ``yaml``
                  - ``cfg`` (context free grammars)
                  - ``pcfg`` (probabilistic CFGs)
                  - ``fcfg`` (feature-based CFGs)
                  - ``fol`` (formulas of First Order Logic)
                  - ``logic`` (Logical formulas to be parsed by the given logic_parser)
                  - ``val`` (valuation of First Order Logic model)
                  - ``text`` (the file contents as a unicode string)
                  - ``raw`` (the raw file contents as a byte string)
                
                If no format is specified, ``load()`` will attempt to determine a
                format based on the resource name's file extension.  If that
                fails, ``load()`` will raise a ``ValueError`` exception.
                
                For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
                it tries to decode the raw contents using UTF-8, and if that doesn't
                work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
                is specified.
                
                :type resource_url: str
                :param resource_url: A URL specifying where the resource should be
                    loaded from.  The default protocol is "nltk:", which searches
                    for the file in the the NLTK data package.
                :type cache: bool
                :param cache: If true, add this resource to a cache.  If load()
                    finds a resource in its cache, then it will return it from the
                    cache rather than loading it.
                :type verbose: bool
                :param verbose: If true, print a message when loading a resource.
                    Messages are not displayed when a resource is retrieved from
                    the cache.
                :type logic_parser: LogicParser
                :param logic_parser: The parser that will be used to parse logical
                    expressions.
                :type fstruct_reader: FeatStructReader
                :param fstruct_reader: The parser that will be used to parse the
                    feature structure of an fcfg.
                :type encoding: str
                :param encoding: the encoding of the input; only used for text formats.
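
The decoding rule described in the help text (try UTF-8, fall back to ISO-8859-1 unless `encoding` is given) can be illustrated with a small standalone sketch. This is not NLTK's own code; `decode_with_fallback` is a hypothetical helper that mirrors the documented behaviour for text formats only.

```python
from typing import Optional


def decode_with_fallback(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode bytes following the rule described in the help text.

    Hypothetical illustration, not NLTK's implementation: an explicit
    encoding is used as given; otherwise UTF-8 is tried first and
    ISO-8859-1 (Latin-1) is used if UTF-8 decoding fails.
    """
    if encoding is not None:
        return raw.decode(encoding)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1')


utf8_bytes = 'café'.encode('utf-8')
latin1_bytes = 'café'.encode('iso-8859-1')

# Both byte strings decode to the same text without naming an encoding.
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

Since a pickle is not a text format, this rule does not apply to english.pickle, which is consistent with the `encoding` argument making no difference there.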
            