
Environment Setup | NLP Libraries: Installation, Usage Examples, How They Work, and Troubleshooting


1. The spaCy Library

1.1. Introduction

spaCy is a text-preprocessing library: an advanced natural language processing library for Python and Cython. It is built on recent research and was designed from the start for use in real products. spaCy ships with pretrained statistical models and word vectors and currently supports tokenization for more than 20 languages. It provides one of the fastest syntactic parsers available, convolutional neural network models for tagging, parsing and named entity recognition, and integration with deep learning. It is commercial open-source software released under the MIT license. [1]

1.2. Installation

Environment: Windows 10, PyCharm, and an Anaconda virtual environment (take care not to install the same package with both pip and conda).

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
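If you would rather stay entirely within conda, a commonly documented alternative (a sketch assuming the conda-forge channel is available) is:

    conda install -c conda-forge spacy

Whichever tool you pick, use it consistently for spaCy and its models to avoid the pip/conda duplication warned about above.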

1.3. Usage Examples

Different languages require additional model packages (the original post lists them in a table); a sketch of the download commands for the models used in this article follows below.
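As a minimal sketch (the model names are taken from spaCy's released pretrained pipelines, assuming spaCy 3.x), the pipelines used in the examples below can be downloaded with:

    python -m spacy download en_core_web_sm
    python -m spacy download en_core_web_lg
    python -m spacy download zh_core_web_sm
    python -m spacy download ko_core_news_sm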

 

1.3.1. English Tokenization

    import spacy  # import the library

    ######### English tokenization ##########
    # Load the English model
    nlp = spacy.load("en_core_web_sm")
    # Run the pipeline on a sentence
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    # Print the tokenization result
    print([token.text for token in doc])

Result:
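The original post shows the output as a screenshot; for reference, this sentence is the tokenization example from spaCy's own documentation, and the tokenizer should produce roughly:

    ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']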

1.3.2. Chinese Tokenization and Word Embeddings

    ######### Chinese tokenization and word embeddings ##########
    import spacy  # import the library

    # Load the model, excluding the components we do not need
    nlp1 = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
    # Process a sentence
    doc = nlp1("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。")
    # Loop over each token and its corresponding vector
    for token in doc:
        # Only the first 5 dimensions are printed for readability; the model actually
        # encodes each Chinese token as a 96-dimensional vector
        print(token.text, token.tensor[:5])

Result:

1.3.3. Korean Tokenization and Dependency Parsing

    ######## Korean dependency parsing ##########
    # Download the Korean model in the virtual environment first:
    # python -m spacy download ko_core_news_sm
    import spacy  # import the library
    from spacy.lang.ko.examples import sentences

    nlp2 = spacy.load("ko_core_news_sm")
    doc = nlp2(sentences[0])
    print(doc.text)
    for token in doc:
        print(token.text, token.pos_, token.dep_)

Result:

See also [2].

1.3.4. Detecting Named Entities and Entity Types in English Text

    import spacy

    # Load the English NLP model
    nlp = spacy.load('en_core_web_sm')

    # The text we want to examine
    text = """London is the capital and most populous city of England and
    the United Kingdom. Standing on the River Thames in the south east
    of the island of Great Britain, London has been a major settlement
    for two millennia. It was founded by the Romans, who named it Londinium.
    """

    # Parse the text with spaCy. This runs the entire pipeline.
    doc = nlp(text)

    # 'doc' now contains a parsed version of text. We can use it to do anything we want!
    # For example, this will print out all the named entities that were detected:
    for entity in doc.ents:
        print(f"{entity.text} ({entity.label_})")

This produces a list of the named entities detected in the document together with their entity types (shown as a screenshot in the original post).
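To see the entities highlighted in context rather than as a flat list, spaCy's built-in displaCy visualizer can render the same doc. A small sketch (outside a Jupyter notebook, displacy.render returns an HTML string; the output file name here is just an example):

    from spacy import displacy

    # Render the entity spans as HTML markup (pass jupyter=True inside a notebook instead)
    html = displacy.render(doc, style="ent")
    with open("entities.html", "w", encoding="utf-8") as f:
        f.write(html)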

1.3.5. Lexical and Text Similarity

    import spacy

    # python -m spacy download en_core_web_lg
    nlp = spacy.load("en_core_web_lg")

    # Lexical semantic similarity (relatedness)
    banana = nlp.vocab['banana']
    dog = nlp.vocab['dog']
    fruit = nlp.vocab['fruit']
    animal = nlp.vocab['animal']
    print(dog.similarity(animal), dog.similarity(fruit))        # 0.6618534 0.23552845
    print(banana.similarity(fruit), banana.similarity(animal))  # 0.67148364 0.2427285

    # Text semantic similarity (relatedness)
    target = nlp("Cats are beautiful animals.")
    doc1 = nlp("Dogs are awesome.")
    doc2 = nlp("Some gorgeous creatures are felines.")
    doc3 = nlp("Dolphins are swimming mammals.")
    # Compare each candidate document against the target
    print(target.similarity(doc1), target.similarity(doc2), target.similarity(doc3))

1.4. How It Works

Pipeline components: tok2vec, tagger, morphologizer, parser, lemmatizer (trainable_lemmatizer), senter, ner.

spaCy's Processing Pipeline

When you call nlp on a text, spaCy first tokenizes it to produce a Doc object. The Doc is then processed in several further steps, collectively referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

tok2vec: the token-to-vector layer that embeds each token into a context-sensitive vector. Downstream components such as the tagger, parser and ner can share these representations, and this is where the token.tensor values used in the Chinese example above come from.
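A quick way to check which of these components a loaded pipeline actually contains, and to switch some of them off, is sketched below (assuming the small English model):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Component names in pipeline order,
    # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
    print(nlp.pipe_names)

    # Components can be disabled temporarily when they are not needed, which speeds things up
    with nlp.select_pipes(disable=["parser", "ner"]):
        doc = nlp("Only tokenization and tagging run here.")
        print([(t.text, t.pos_) for t in doc])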

1.5. Troubleshooting

Error 1

After pip install spacy, running the code reports that spacy.load() does not exist.

Uninstall spacy:

pip uninstall spacy

Then reinstall it:

pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple

Cause: the error was caused by naming a local file "spacy", which creates a naming conflict; Python imports the local file instead of the installed library.

Fix: rename the file spacy.py so that it does not share its name with the spacy library.

Error 2

Running python -m spacy download en_core_web_sm produces the following error:

E:\Anaconda3\envs\tf24\lib\site-packages\h5py\__init__.py:39: UserWarning: h5py is running against HDF5 1.10.5 when it was built against 1.10.6, this may cause problems
  '{0}.{1}.{2}'.format(*version.hdf5_built_version_tuple)
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.10.6, library is 1.10.5

Cause: PyCharm upgraded the package to a newer version, so the installed h5py no longer matches the HDF5 library version it was built against.

Fix (my versions: h5py 2.10.0, tensorflow 2.4.0, Python 3.7):

Uninstall it: pip uninstall h5py

Reinstall the matching version: pip install h5py==2.10.0

After this change the download succeeds.

2. The Textacy Library

Textacy is a Python library for performing a variety of natural language processing tasks. It is built on top of the high-performance spaCy library and implements several common information-extraction algorithms on top of spaCy.
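Textacy is not bundled with spaCy, so install it separately (using the same mirror as above, if needed):

    pip install textacy -i https://pypi.tuna.tsinghua.edu.cn/simple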

Example

    import spacy
    import textacy.extract

    # Load the English NLP model
    nlp = spacy.load('en_core_web_sm')

    # The text we want to examine
    text = """London is the capital and most populous city of England and the United Kingdom.
    Standing on the River Thames in the south east of the island of Great Britain,
    London has been a major settlement for two millennia. It was founded by the Romans,
    who named it Londinium.
    """

    # Parse the document with spaCy
    doc = nlp(text)

    # Extract semi-structured statements
    # (with newer textacy versions this positional call raises the TypeError shown in Error 1 below)
    statements = textacy.extract.semistructured_statements(doc, "London")

    # Print the results
    print("Here are the things I know about London:")
    for statement in statements:
        subject, verb, fact = statement
        print(f" - {fact}")

Error 1

 Traceback (most recent call last):
  File "G:/NLP/bert-master/bert-master/nlpbase/textacypre.py", line 18, in <module>
    statements = textacy.extract.semistructured_statements(doc, "London")
TypeError: semistructured_statements() takes 1 positional argument but 2 were given
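Likely cause (an assumption based on textacy's newer API rather than something confirmed in the original post): starting with textacy 0.11, semistructured_statements() takes the entity and the cue verb as keyword-only arguments, so the old positional call fails. A hedged sketch of the corrected call:

    # Assumes textacy >= 0.11, where entity and cue must be passed as keywords
    statements = textacy.extract.semistructured_statements(doc, entity="London", cue="be")

    print("Here are the things I know about London:")
    for entity, cue, fragment in statements:
        print(" -", " ".join(tok.text for tok in fragment))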

References

[1] Trained Models & Pipelines · spaCy Models Documentation

[2] eunjeon / mecab-ko / README.md — Bitbucket (bitbucket.org)

[3] spaCy, a text-processing library for English — Jianshu (jianshu.com)
