
spaCy Usage

Official documentation: https://spacy.io/usage

spaCy is a Python natural language processing toolkit that appeared in mid-2014 and bills itself as "Industrial-Strength Natural Language Processing in Python". It makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it genuinely practical for industrial applications.

Loading the model

# Import the toolkit and the English model.
# Install the model first; on Windows, run CMD as administrator:
#   python -m spacy download en_core_web_sm
# (older spaCy versions used the shortcut: python -m spacy download en)

import spacy
nlp = spacy.load('en_core_web_sm')  # spacy.load('en') on spaCy < 3
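A quick sanity check (not in the original post): nlp.pipe_names lists the processing components the loaded model runs over each text.

# List the components of the loaded pipeline (output from spaCy 2.x;
# newer models list more components, e.g. tok2vec and lemmatizer)
print(nlp.pipe_names)
OUT:
['tagger', 'parser', 'ner']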

Text processing

doc = nlp('Weather is good, very windy and sunny. We have no classes in the afternoon.')
# Tokenization
for token in doc:
    print(token)
OUT:
Weather
is
good
,
very
windy
and
sunny
.
We
have
no
classes
in
the
afternoon
.
---------------------------------
# Sentence segmentation
for sent in doc.sents:
    print(sent)
OUT:
Weather is good, very windy and sunny.
We have no classes in the afternoon.
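Tokens carry more than their surface text. A minimal sketch, using standard Token attributes on the same doc, that prints each token's lemma plus its stop-word and punctuation flags:

# Standard Token attributes: lemma, stop-word flag, punctuation flag
# e.g. "is" prints as lemma "be", is_stop True, is_punct False
for token in doc:
    print('{}\t{}\t{}\t{}'.format(token.text, token.lemma_, token.is_stop, token.is_punct))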

Part-of-speech tagging (reference: https://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/)

for token in doc:
    print('{}-{}'.format(token, token.pos_))
OUT:
Weather-PROPN
is-VERB
good-ADJ
,-PUNCT
very-ADV
windy-ADJ
and-CCONJ
sunny-ADJ
.-PUNCT
We-PRON
have-VERB
no-DET
classes-NOUN
in-ADP
the-DET
afternoon-NOUN
.-PUNCT 
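If any tag above is unfamiliar, spacy.explain returns a short human-readable description:

# Look up descriptions for a few of the tags printed above
for tag in ['PROPN', 'CCONJ', 'ADP', 'DET']:
    print('{} - {}'.format(tag, spacy.explain(tag)))
OUT:
PROPN - proper noun
CCONJ - coordinating conjunction
ADP - adposition
DET - determiner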

Named entity recognition

doc_2 = nlp("I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print('{}-{}'.format(ent, ent.label_))
OUT:
Paris-GPE
Jack-PERSON
----
from spacy import displacy
doc = nlp('I went to Paris where I met my old friend Jack from uni.')
displacy.render(doc, style='ent', jupyter=True)

[displaCy renders the sentence inline, with Paris highlighted as GPE and Jack as PERSON]
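Besides the rendered view, each entity Span carries character offsets into the original string, which is handy for highlighting matches yourself; a minimal sketch using standard Span attributes:

# Character offsets of each entity in the raw string
for ent in doc.ents:
    print('{} [{}:{}] {}'.format(ent.text, ent.start_char, ent.end_char, ent.label_))
OUT:
Paris [10:15] GPE
Jack [42:46] PERSON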

Exercise: find the names of all the characters in the book (Pride and Prejudice).

def read_file(file_name):
    with open(file_name, 'r', encoding='utf-8') as file:
        return file.read()

# Load the text data
text = read_file('./data/pride_and_prejudice.txt')
processed_text = nlp(text)
sentences = [s for s in processed_text.sents]
print(len(sentences))
OUT:
6469

There are 6,469 sentences in total.
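For reference, spaCy rejects inputs longer than nlp.max_length characters (1,000,000 by default). This novel fits in a single call, but for larger corpora a common pattern is to stream texts through nlp.pipe. An illustrative sketch; the single-item batch here just reuses the novel loaded above:

# Illustrative only: stream a batch of texts through the pipeline
texts = [text]  # e.g. one string per document or per chapter
for d in nlp.pipe(texts):
    print(len(list(d.sents)))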

from collections import Counter

def find_person(doc):
    c = Counter()
    for ent in doc.ents:  # iterate over the argument, not the global doc
        if ent.label_ == 'PERSON':
            c[ent.lemma_] += 1
    return c.most_common(10)

print(find_person(processed_text))
OUT:
[('elizabeth', 604), ('darcy', 276), ('jane', 274), ('bennet', 233), ('bingley', 189), ('collins', 179), ('wickham', 170), ('gardiner', 95), ('lizzy', 94), ('lady catherine', 77)]
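One caveat with these counts: the same character appears under several surface forms ('elizabeth' vs. 'lizzy'). A hedged sketch that folds known aliases together before counting; the alias map is hand-written and hypothetical, not something spaCy produces:

from collections import Counter

ALIASES = {'lizzy': 'elizabeth'}  # hypothetical, hand-curated alias map

def find_person_merged(doc):
    c = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            name = ALIASES.get(ent.lemma_, ent.lemma_)
            c[name] += 1
    return c.most_common(10)

print(find_person_merged(processed_text))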

Done!
