spaCy: a text-preprocessing library, and an advanced natural language processing library for Python and Cython. It is built on the latest research and was designed from the start to be used in real products. spaCy ships with pretrained statistical models and word vectors and currently supports tokenization for more than 20 languages. It features a very fast syntactic parser, convolutional neural network models for tagging, parsing and named entity recognition, and integration with deep learning. It is commercial open-source software released under the MIT license. [1]
Environment: Windows 10, PyCharm, an Anaconda virtual environment (take care not to mix pip and conda installs of the same package).
pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
Different languages require additional dependencies of their own (separately installed trained pipelines).
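The language-specific pipelines used in the examples below are installed via spaCy's download command; a typical setup (run inside the virtual environment) looks like this:

```shell
# Download the trained pipelines used in the examples below
python -m spacy download en_core_web_sm   # English (small)
python -m spacy download zh_core_web_sm   # Chinese (small)
python -m spacy download ko_core_news_sm  # Korean (small)
python -m spacy download en_core_web_lg   # English (large, includes word vectors)
```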
```python
import spacy  # import the package

######### English tokenization ##########
# Load the English model
nlp = spacy.load("en_core_web_sm")

# Use the model: just pass in a sentence
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Get the tokenization result
print([token.text for token in doc])
```
Result:
```python
######### Chinese tokenization and word embeddings ##########
import spacy  # import the package

# Load the model, excluding the components we do not need
nlp1 = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
# Process the sentence
doc = nlp1("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。")
# Loop over each token and its corresponding vector
for token in doc:
    # Only the first 5 dimensions are shown here for brevity;
    # the model actually encodes each Chinese token as a 96-dimensional vector
    print(token.text, token.tensor[:5])
```
Result:
```python
######## Korean dependency parsing ##########
# (model download command, run in the virtual environment)
# python -m spacy download ko_core_news_sm

import spacy  # import the package
from spacy.lang.ko.examples import sentences

nlp2 = spacy.load("ko_core_news_sm")
doc = nlp2(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
Result:
See [2] for reference.
```python
import spacy

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east
of the island of Great Britain, London has been a major settlement
for two millennia. It was founded by the Romans, who named it Londinium.
"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
```
This produces a list of the named entities detected in our document, along with each entity's type:
```python
import spacy

# python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# Word-level semantic similarity (relatedness)
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']

print(dog.similarity(animal), dog.similarity(fruit))        # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal))  # 0.67148364 0.2427285

# Document-level semantic similarity (relatedness)
target = nlp("Cats are beautiful animals.")

doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")

# Compare the target document against each candidate
print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))
```
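Under the hood, `.similarity()` is essentially cosine similarity between vectors. A minimal plain-Python sketch of that computation, using tiny made-up 3-dimensional vectors (real `en_core_web_lg` vectors are 300-dimensional, and these toy values are invented for illustration only):

```python
import math

def cosine_similarity(u, v):
    # cosine similarity = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, NOT real embeddings
dog    = [0.8, 0.3, 0.1]
animal = [0.7, 0.4, 0.2]
fruit  = [0.1, 0.9, 0.6]

# "dog" should be closer to "animal" than to "fruit"
print(cosine_similarity(dog, animal))
print(cosine_similarity(dog, fruit))
```

The ordering of the scores (dog is closer to animal than to fruit) mirrors what the real spaCy vectors report above, even though the numbers here are toys.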
Pipeline components: tok2vec, tagger, morphologizer, parser, lemmatizer (trainable_lemmatizer), senter, ner.
spaCy's Processing Pipeline
When you call the model on a text, spaCy first tokenizes it to produce a Doc object. The Doc is then processed in several different steps, which together are called the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
tok2vec: the token-to-vector layer that embeds tokens and whose representations are shared by the downstream components.
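The pipeline mechanism described above can be sketched in plain Python, with no spaCy at all. The `Doc` class and the toy components below are illustrative stand-ins, not spaCy's actual API; the point is only the shape of the flow: tokenize first, then pass the same Doc through each component in turn.

```python
class Doc:
    """A toy document object, standing in for spaCy's Doc."""
    def __init__(self, text):
        self.text = text
        self.tokens = text.split()   # crude stand-in for real tokenization
        self.annotations = {}

def tagger(doc):
    # Toy tagger: capitalized tokens get "NOUN", everything else "X"
    doc.annotations["tags"] = ["NOUN" if t[0].isupper() else "X" for t in doc.tokens]
    return doc

def lemmatizer(doc):
    # Toy lemmatizer: lowercase and strip trailing punctuation
    doc.annotations["lemmas"] = [t.lower().rstrip(".") for t in doc.tokens]
    return doc

pipeline = [tagger, lemmatizer]

def nlp(text):
    doc = Doc(text)                  # tokenization happens first
    for component in pipeline:       # each component returns the processed Doc
        doc = component(doc)         # ...which is passed to the next component
    return doc

doc = nlp("London is big.")
print(doc.annotations["tags"])    # ['NOUN', 'X', 'X']
print(doc.annotations["lemmas"])  # ['london', 'is', 'big']
```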
If, after `pip install spacy`, running your script reports that `spacy.load()` does not exist:
Uninstall spaCy:
pip uninstall spacy
then reinstall it:
pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
Cause analysis: the error was actually caused by naming the script file "spacy", which obviously creates a naming conflict with the library.
Solution: rename spacy.py; the file must not share the spaCy library's name.
Running `python -m spacy download en_core_web_sm` produced the following error:
```
E:\Anaconda3\envs\tf24\lib\site-packages\h5py\__init__.py:39: UserWarning: h5py is running against HDF5 1.10.5 when it was built against 1.10.6, this may cause problems
'{0}.{1}.{2}'.format(*version.hdf5_built_version_tuple)
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.10.6, library is 1.10.5
```
Cause analysis: PyCharm had upgraded h5py to a newer version, so the installed h5py no longer matched the HDF5 version it was built against.
Solution (my versions: h5py 2.10.0, tensorflow 2.4.0, Python 3.7):
Uninstall: pip uninstall h5py
Install the matching version: pip install h5py==2.10.0
After this change, it worked.
textacy: a Python library for performing a variety of natural language processing tasks, built on the high-performance spaCy library; it implements several common data-extraction algorithms on top of spaCy.
Example:
```python
import spacy
import textacy.extract

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# The text we want to examine
text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain,
London has been a major settlement for two millennia. It was founded by the Romans,
who named it Londinium.
"""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements
statements = textacy.extract.semistructured_statements(doc, "London")

# Print the results
print("Here are the things I know about London:")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")
```
Error 1:
```
Traceback (most recent call last):
  File "G:/NLP/bert-master/bert-master/nlpbase/textacypre.py", line 18, in <module>
    statements = textacy.extract.semistructured_statements(doc, "London")
TypeError: semistructured_statements() takes 1 positional argument but 2 were given
```
This appears to be an API change: in newer textacy releases the arguments after the Doc became keyword-only, so the entity has to be passed by name (e.g. `entity="London"`); check the documentation for your installed textacy version.
[1] Trained Models & Pipelines · spaCy Models Documentation