Contents
3. Part-of-speech tagging
5. Named entity recognition
9. Coreference resolution
spaCy can be used for tokenization, named entity recognition, part-of-speech tagging, and more.
pip install spacy
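After installing, you also need to download an official trained model. Each language has its own models; only the Chinese model is used for the demonstration here: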
python -m spacy download zh_core_web_sm
Usage in code:
- import spacy
- nlp = spacy.load("zh_core_web_sm")
Official model documentation:
Trained Models & Pipelines · spaCy Models Documentation
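Each language also has several models. For Chinese, besides the zh_core_web_sm model downloaded above, there are others such as zh_core_web_md and zh_core_web_trf. They differ in accuracy and size: zh_core_web_sm is small but less accurate than zh_core_web_trf, which in turn is much larger. This lets you choose a model that fits your scenario.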
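The model names follow the pattern lang_type_genre_size. As a minimal sketch of how to read them, the helper below splits a package name into those four parts. `ModelName` and `parse_model_name` are hypothetical names introduced for illustration and are not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "zh"
    kind: str   # pipeline type, e.g. "core"
    genre: str  # training text genre, e.g. "web"
    size: str   # "sm", "md", "lg" or "trf"


def parse_model_name(name: str) -> ModelName:
    """Split a package name of the form lang_type_genre_size."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected model name: {name!r}")
    return ModelName(*parts)


for name in ("zh_core_web_sm", "zh_core_web_md", "zh_core_web_trf"):
    print(parse_model_name(name))
```

The size suffix is the part that trades accuracy against download size.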
The zh_core_web_sm model is used as the example here. Its pipeline components are:
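tok2vec: token-to-vector layer that produces the token representations used by the later components (it is not the tokenizer; tokenization happens before the pipeline runs)
tagger: part-of-speech tagging
parser: dependency parsing
senter: sentence segmentation
ner: named entity recognition
attribute_ruler: rule-based mapping of token attributes (for example, from fine-grained tags to coarse-grained POS)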
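The components listed above run one after another on the same document object. The following is only a conceptual sketch of that idea in plain Python, not spaCy's implementation: `toy_senter`, `toy_counter`, and `run_pipeline` are hypothetical stand-ins, and the document is a plain dict.

```python
def toy_senter(doc: dict) -> dict:
    # Hypothetical stand-in for a sentence segmenter:
    # split on the Chinese full stop and put the stop back.
    doc["sents"] = [s + "。" for s in doc["text"].split("。") if s]
    return doc


def toy_counter(doc: dict) -> dict:
    # A later component can rely on what earlier ones stored.
    doc["n_sents"] = len(doc["sents"])
    return doc


def run_pipeline(text: str, components: list) -> dict:
    """Apply each (name, component) pair in order to one document."""
    doc = {"text": text}
    for _name, component in components:
        doc = component(doc)
    return doc


pipeline = [("senter", toy_senter), ("counter", toy_counter)]
doc = run_pipeline("小米董事长叶凡决定投资华为。在2002年,他还创作了<遮天>。", pipeline)
print(doc["sents"], doc["n_sents"])
```

The order matters: `toy_counter` would fail if it ran before `toy_senter`, just as a real component can depend on annotations set earlier in the pipeline.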
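The model page lists which POS tags, dependency labels, and entity types the model includes.
Explanations of some phrase labels and POS tag names: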
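IP: simple clause
NP: noun phrase
VP: verb phrase
PU: punctuation, usually a period, question mark, exclamation mark, etc.
LCP: localizer phrase
PP: prepositional phrase
CP: phrase formed with 的 that expresses a modifying relation
DNP: phrase formed with 的 that expresses possession
ADVP: adverb phrase
ADJP: adjective phrase
DP: determiner phrase
QP: quantifier phrase
NN: common noun
NR: proper noun
NT: temporal noun
PN: pronoun
VV: verb
VC: 是 (copula)
CC: coordinating conjunction
VE: 有 (as the main verb)
VA: predicative adjective
AS: aspect marker (e.g. 了)
VRD: verb-resultative compound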
CD: cardinal number
DT: determiner
EX: existential "there"
FW: foreign word
IN: preposition or subordinating conjunction
JJ: adjective or ordinal numeral
JJR: adjective, comparative
JJS: adjective, superlative
LS: list item marker
MD: modal auxiliary
PDT: predeterminer
POS: genitive marker
PRP: personal pronoun
RB: adverb
RBR: adverb, comparative
RBS: adverb, superlative
RP: particle
SYM: symbol
TO: "to" as preposition or infinitive marker
WDT: wh-determiner
WP: wh-pronoun
WP$: wh-pronoun, possessive
WRB: wh-adverb
The official glossary for POS tags, dependency labels, and entity types:
- def explain(term):
-     """Get a description for a given POS tag, dependency label or entity type.
-     term (str): The term to explain.
-     RETURNS (str): The explanation, or `None` if not found in the glossary.
-     EXAMPLE:
-         >>> spacy.explain(u'NORP')
-         >>> doc = nlp(u'Hello world')
-         >>> print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])
-     """
-     if term in GLOSSARY:
-         return GLOSSARY[term]
-
-
- GLOSSARY = {
- # POS tags
- # Universal POS Tags
- # http://universaldependencies.org/u/pos/
- "ADJ": "adjective",
- "ADP": "adposition",
- "ADV": "adverb",
- "AUX": "auxiliary",
- "CONJ": "conjunction",
- "CCONJ": "coordinating conjunction",
- "DET": "determiner",
- "INTJ": "interjection",
- "NOUN": "noun",
- "NUM": "numeral",
- "PART": "particle",
- "PRON": "pronoun",
- "PROPN": "proper noun",
- "PUNCT": "punctuation",
- "SCONJ": "subordinating conjunction",
- "SYM": "symbol",
- "VERB": "verb",
- "X": "other",
- "EOL": "end of line",
- "SPACE": "space",
- # POS tags (English)
- # OntoNotes 5 / Penn Treebank
- # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- ".": "punctuation mark, sentence closer",
- ",": "punctuation mark, comma",
- "-LRB-": "left round bracket",
- "-RRB-": "right round bracket",
- "``": "opening quotation mark",
- '""': "closing quotation mark",
- "''": "closing quotation mark",
- ":": "punctuation mark, colon or ellipsis",
- "$": "symbol, currency",
- "#": "symbol, number sign",
- "AFX": "affix",
- "CC": "conjunction, coordinating",
- "CD": "cardinal number",
- "DT": "determiner",
- "EX": "existential there",
- "FW": "foreign word",
- "HYPH": "punctuation mark, hyphen",
- "IN": "conjunction, subordinating or preposition",
- "JJ": "adjective (English), other noun-modifier (Chinese)",
- "JJR": "adjective, comparative",
- "JJS": "adjective, superlative",
- "LS": "list item marker",
- "MD": "verb, modal auxiliary",
- "NIL": "missing tag",
- "NN": "noun, singular or mass",
- "NNP": "noun, proper singular",
- "NNPS": "noun, proper plural",
- "NNS": "noun, plural",
- "PDT": "predeterminer",
- "POS": "possessive ending",
- "PRP": "pronoun, personal",
- "PRP$": "pronoun, possessive",
- "RB": "adverb",
- "RBR": "adverb, comparative",
- "RBS": "adverb, superlative",
- "RP": "adverb, particle",
- "TO": 'infinitival "to"',
- "UH": "interjection",
- "VB": "verb, base form",
- "VBD": "verb, past tense",
- "VBG": "verb, gerund or present participle",
- "VBN": "verb, past participle",
- "VBP": "verb, non-3rd person singular present",
- "VBZ": "verb, 3rd person singular present",
- "WDT": "wh-determiner",
- "WP": "wh-pronoun, personal",
- "WP$": "wh-pronoun, possessive",
- "WRB": "wh-adverb",
- "SP": "space (English), sentence-final particle (Chinese)",
- "ADD": "email",
- "NFP": "superfluous punctuation",
- "GW": "additional word in multi-word expression",
- "XX": "unknown",
- "BES": 'auxiliary "be"',
- "HVS": 'forms of "have"',
- "_SP": "whitespace",
- # POS Tags (German)
- # TIGER Treebank
- # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
- "$(": "other sentence-internal punctuation mark",
- "$,": "comma",
- "$.": "sentence-final punctuation mark",
- "ADJA": "adjective, attributive",
- "ADJD": "adjective, adverbial or predicative",
- "APPO": "postposition",
- "APPR": "preposition; circumposition left",
- "APPRART": "preposition with article",
- "APZR": "circumposition right",
- "ART": "definite or indefinite article",
- "CARD": "cardinal number",
- "FM": "foreign language material",
- "ITJ": "interjection",
- "KOKOM": "comparative conjunction",
- "KON": "coordinate conjunction",
- "KOUI": 'subordinate conjunction with "zu" and infinitive',
- "KOUS": "subordinate conjunction with sentence",
- "NE": "proper noun",
- "NNE": "proper noun",
- "PAV": "pronominal adverb",
- "PROAV": "pronominal adverb",
- "PDAT": "attributive demonstrative pronoun",
- "PDS": "substituting demonstrative pronoun",
- "PIAT": "attributive indefinite pronoun without determiner",
- "PIDAT": "attributive indefinite pronoun with determiner",
- "PIS": "substituting indefinite pronoun",
- "PPER": "non-reflexive personal pronoun",
- "PPOSAT": "attributive possessive pronoun",
- "PPOSS": "substituting possessive pronoun",
- "PRELAT": "attributive relative pronoun",
- "PRELS": "substituting relative pronoun",
- "PRF": "reflexive personal pronoun",
- "PTKA": "particle with adjective or adverb",
- "PTKANT": "answer particle",
- "PTKNEG": "negative particle",
- "PTKVZ": "separable verbal particle",
- "PTKZU": '"zu" before infinitive',
- "PWAT": "attributive interrogative pronoun",
- "PWAV": "adverbial interrogative or relative pronoun",
- "PWS": "substituting interrogative pronoun",
- "TRUNC": "word remnant",
- "VAFIN": "finite verb, auxiliary",
- "VAIMP": "imperative, auxiliary",
- "VAINF": "infinitive, auxiliary",
- "VAPP": "perfect participle, auxiliary",
- "VMFIN": "finite verb, modal",
- "VMINF": "infinitive, modal",
- "VMPP": "perfect participle, modal",
- "VVFIN": "finite verb, full",
- "VVIMP": "imperative, full",
- "VVINF": "infinitive, full",
- "VVIZU": 'infinitive with "zu", full',
- "VVPP": "perfect participle, full",
- "XY": "non-word containing non-letter",
- # POS Tags (Chinese)
- # OntoNotes / Chinese Penn Treebank
- # https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
- "AD": "adverb",
- "AS": "aspect marker",
- "BA": "把 in ba-construction",
- # "CD": "cardinal number",
- "CS": "subordinating conjunction",
- "DEC": "的 in a relative clause",
- "DEG": "associative 的",
- "DER": "得 in V-de const. and V-de-R",
- "DEV": "地 before VP",
- "ETC": "for words 等, 等等",
- # "FW": "foreign words"
- "IJ": "interjection",
- # "JJ": "other noun-modifier",
- "LB": "被 in long bei-const",
- "LC": "localizer",
- "M": "measure word",
- "MSP": "other particle",
- # "NN": "common noun",
- "NR": "proper noun",
- "NT": "temporal noun",
- "OD": "ordinal number",
- "ON": "onomatopoeia",
- "P": "preposition excluding 把 and 被",
- "PN": "pronoun",
- "PU": "punctuation",
- "SB": "被 in short bei-const",
- # "SP": "sentence-final particle",
- "VA": "predicative adjective",
- "VC": "是 (copula)",
- "VE": "有 as the main verb",
- "VV": "other verb",
- # Noun chunks
- "NP": "noun phrase",
- "PP": "prepositional phrase",
- "VP": "verb phrase",
- "ADVP": "adverb phrase",
- "ADJP": "adjective phrase",
- "SBAR": "subordinating conjunction",
- "PRT": "particle",
- "PNP": "prepositional noun phrase",
- # Dependency Labels (English)
- # ClearNLP / Universal Dependencies
- # https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
- "acl": "clausal modifier of noun (adjectival clause)",
- "acomp": "adjectival complement",
- "advcl": "adverbial clause modifier",
- "advmod": "adverbial modifier",
- "agent": "agent",
- "amod": "adjectival modifier",
- "appos": "appositional modifier",
- "attr": "attribute",
- "aux": "auxiliary",
- "auxpass": "auxiliary (passive)",
- "case": "case marking",
- "cc": "coordinating conjunction",
- "ccomp": "clausal complement",
- "clf": "classifier",
- "complm": "complementizer",
- "compound": "compound",
- "conj": "conjunct",
- "cop": "copula",
- "csubj": "clausal subject",
- "csubjpass": "clausal subject (passive)",
- "dative": "dative",
- "dep": "unclassified dependent",
- "det": "determiner",
- "discourse": "discourse element",
- "dislocated": "dislocated elements",
- "dobj": "direct object",
- "expl": "expletive",
- "fixed": "fixed multiword expression",
- "flat": "flat multiword expression",
- "goeswith": "goes with",
- "hmod": "modifier in hyphenation",
- "hyph": "hyphen",
- "infmod": "infinitival modifier",
- "intj": "interjection",
- "iobj": "indirect object",
- "list": "list",
- "mark": "marker",
- "meta": "meta modifier",
- "neg": "negation modifier",
- "nmod": "modifier of nominal",
- "nn": "noun compound modifier",
- "npadvmod": "noun phrase as adverbial modifier",
- "nsubj": "nominal subject",
- "nsubjpass": "nominal subject (passive)",
- "nounmod": "modifier of nominal",
- "npmod": "noun phrase as adverbial modifier",
- "num": "number modifier",
- "number": "number compound modifier",
- "nummod": "numeric modifier",
- "oprd": "object predicate",
- "obj": "object",
- "obl": "oblique nominal",
- "orphan": "orphan",
- "parataxis": "parataxis",
- "partmod": "participal modifier",
- "pcomp": "complement of preposition",
- "pobj": "object of preposition",
- "poss": "possession modifier",
- "possessive": "possessive modifier",
- "preconj": "pre-correlative conjunction",
- "prep": "prepositional modifier",
- "prt": "particle",
- "punct": "punctuation",
- "quantmod": "modifier of quantifier",
- "rcmod": "relative clause modifier",
- "relcl": "relative clause modifier",
- "reparandum": "overridden disfluency",
- "root": "root",
- "vocative": "vocative",
- "xcomp": "open clausal complement",
- # Dependency labels (German)
- # TIGER Treebank
- # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
- # currently missing: 'cc' (comparative complement) because of conflict
- # with English labels
- "ac": "adpositional case marker",
- "adc": "adjective component",
- "ag": "genitive attribute",
- "ams": "measure argument of adjective",
- "app": "apposition",
- "avc": "adverbial phrase component",
- "cd": "coordinating conjunction",
- "cj": "conjunct",
- "cm": "comparative conjunction",
- "cp": "complementizer",
- "cvc": "collocational verb construction",
- "da": "dative",
- "dh": "discourse-level head",
- "dm": "discourse marker",
- "ep": "expletive es",
- "hd": "head",
- "ju": "junctor",
- "mnr": "postnominal modifier",
- "mo": "modifier",
- "ng": "negation",
- "nk": "noun kernel element",
- "nmc": "numerical component",
- "oa": "accusative object",
- "oc": "clausal object",
- "og": "genitive object",
- "op": "prepositional object",
- "par": "parenthetical element",
- "pd": "predicate",
- "pg": "phrasal genitive",
- "ph": "placeholder",
- "pm": "morphological particle",
- "pnc": "proper noun component",
- "rc": "relative clause",
- "re": "repeated element",
- "rs": "reported speech",
- "sb": "subject",
- "sbp": "passivized subject (PP)",
- "sp": "subject or predicate",
- "svp": "separable verb prefix",
- "uc": "unit component",
- "vo": "vocative",
- # Named Entity Recognition
- # OntoNotes 5
- # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
- "PERSON": "People, including fictional",
- "NORP": "Nationalities or religious or political groups",
- "FACILITY": "Buildings, airports, highways, bridges, etc.",
- "FAC": "Buildings, airports, highways, bridges, etc.",
- "ORG": "Companies, agencies, institutions, etc.",
- "GPE": "Countries, cities, states",
- "LOC": "Non-GPE locations, mountain ranges, bodies of water",
- "PRODUCT": "Objects, vehicles, foods, etc. (not services)",
- "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
- "WORK_OF_ART": "Titles of books, songs, etc.",
- "LAW": "Named documents made into laws.",
- "LANGUAGE": "Any named language",
- "DATE": "Absolute or relative dates or periods",
- "TIME": "Times smaller than a day",
- "PERCENT": 'Percentage, including "%"',
- "MONEY": "Monetary values, including unit",
- "QUANTITY": "Measurements, as of weight or distance",
- "ORDINAL": '"first", "second", etc.',
- "CARDINAL": "Numerals that do not fall under another type",
- # Named Entity Recognition
- # Wikipedia
- # http://www.sciencedirect.com/science/article/pii/S0004370212000276
- # https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
- "PER": "Named person or family.",
- "MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
- # https://github.com/ltgoslo/norne
- "EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
- "PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
- "DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
- "GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
- "GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
- }
url: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
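To show how the glossary lookup behaves without installing anything, here is a small sketch that copies a few entries from the glossary above into a local dict. `GLOSSARY_EXCERPT` is a hypothetical name and covers only these entries; in real code you would call `spacy.explain` instead.

```python
# A handful of entries copied from the glossary quoted above.
GLOSSARY_EXCERPT = {
    "NR": "proper noun",
    "NT": "temporal noun",
    "AS": "aspect marker",
    "PROPN": "proper noun",
    "nsubj": "nominal subject",
    "GPE": "Countries, cities, states",
}


def explain(term):
    """Return the description for a term, or None if it is unknown."""
    return GLOSSARY_EXCERPT.get(term)


tagged = [("叶凡", "NR"), ("2002年", "NT"), ("了", "AS")]
for text, tag in tagged:
    print(text, tag, explain(tag))

# Terms missing from the glossary give None rather than raising.
print(explain("NOT_A_TAG"))
```

Because unknown terms return `None`, it is worth checking the result before using it in a string.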
- import spacy
-
-
- s = "小米董事长叶凡决定投资华为。在2002年,他还创作了<遮天>。"
-
- nlp = spacy.load("zh_core_web_sm")
- doc = nlp(s)
-
- # 1. Sentence segmentation
- for sent in doc.sents:
-     print(sent)
-
-
-
- """
- 小米董事长叶凡决定投资华为。
- 在2002年,他还创作了<遮天>。
- """
- # 2. Tokenization
- print([w.text for w in doc])
-
-
- """
- ['小米', '董事长', '叶凡', '决定', '投资', '华为', '。', '在', '2002年', ',', '他', '还', '创作', '了', '<遮天>', '。']
- """
Fine-grained tags (token.tag_)
- print([(w.text, w.tag_) for w in doc])
-
-
- """
- [('小米', 'NR'), ('董事长', 'NN'), ('叶凡', 'NR'), ('决定', 'VV'), ('投资', 'VV'), ('华为', 'NR'), ('。', 'PU'), ('在', 'P'), ('2002年', 'NT'), (',', 'PU'), ('他', 'PN'), ('还', 'AD'), ('创作', 'VV'), ('了', 'AS'), ('<遮天>', 'NN'), ('。', 'PU')]
- """
Coarse-grained tags (token.pos_)
- print([(w.text, w.pos_) for w in doc])
-
- """
- [('小米', 'PROPN'), ('董事长', 'NOUN'), ('叶凡', 'PROPN'), ('决定', 'VERB'), ('投资', 'VERB'), ('华为', 'PROPN'), ('。', 'PUNCT'), ('在', 'ADP'), ('2002年', 'NOUN'), (',', 'PUNCT'), ('他', 'PRON'), ('还', 'ADV'), ('创作', 'VERB'), ('了', 'PART'), ('<遮天>', 'NOUN'), ('。', 'PUNCT')]
- """
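The two outputs above can be lined up to see how each fine-grained tag corresponds to a coarse-grained one. The sketch below only works with the printed results copied from this post (it does not call spaCy). In the trained pipeline this mapping is likely what attribute_ruler applies, but that is an assumption here.

```python
# Printed results copied from the two outputs above.
fine = [('小米', 'NR'), ('董事长', 'NN'), ('叶凡', 'NR'), ('决定', 'VV'),
        ('投资', 'VV'), ('华为', 'NR'), ('。', 'PU'), ('在', 'P'),
        ('2002年', 'NT'), (',', 'PU'), ('他', 'PN'), ('还', 'AD'),
        ('创作', 'VV'), ('了', 'AS'), ('<遮天>', 'NN'), ('。', 'PU')]
coarse = [('小米', 'PROPN'), ('董事长', 'NOUN'), ('叶凡', 'PROPN'), ('决定', 'VERB'),
          ('投资', 'VERB'), ('华为', 'PROPN'), ('。', 'PUNCT'), ('在', 'ADP'),
          ('2002年', 'NOUN'), (',', 'PUNCT'), ('他', 'PRON'), ('还', 'ADV'),
          ('创作', 'VERB'), ('了', 'PART'), ('<遮天>', 'NOUN'), ('。', 'PUNCT')]

# Collect, for every fine-grained tag, the coarse tags it was paired with.
mapping = {}
for (_, tag), (_, pos) in zip(fine, coarse):
    mapping.setdefault(tag, set()).add(pos)

for tag in sorted(mapping):
    print(tag, "->", sorted(mapping[tag]))
```

In this sample every fine-grained tag maps to exactly one coarse tag, and two fine-grained tags (NN and NT) share the coarse tag NOUN.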
- print([(w.text, w.is_stop) for w in doc])
-
-
- """
- [('小米', False), ('董事长', False), ('叶凡', False), ('决定', True), ('投资', False), ('华为', False), ('。', True), ('在', True), ('2002年', False), (',', True), ('他', True), ('还', True), ('创作', False), ('了', True), ('<遮天>', False), ('。', True)]
- """
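A common use of the is_stop flag is to drop stop words and keep only content words. The sketch below filters the printed result copied from the output above. With a real `doc`, the same list comprehension would read `token.is_stop` directly.

```python
# (text, is_stop) pairs copied from the output above.
flags = [('小米', False), ('董事长', False), ('叶凡', False), ('决定', True),
         ('投资', False), ('华为', False), ('。', True), ('在', True),
         ('2002年', False), (',', True), ('他', True), ('还', True),
         ('创作', False), ('了', True), ('<遮天>', False), ('。', True)]

# Keep only the tokens that are not flagged as stop words.
content_words = [text for text, is_stop in flags if not is_stop]
print(content_words)
```

Note that this model's stop list also covers punctuation here, and that 决定 ("decide") is flagged as a stop word, so check the result before relying on it for keyword extraction.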
- # Named entity recognition
- print([(e.text, e.label_) for e in doc.ents])
-
-
-
- """
- [('小米', 'PERSON'), ('叶凡', 'PERSON'), ('2002年', 'DATE')]
- """
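Entities are often easier to use when grouped by label. A minimal sketch, using the (text, label) pairs copied from the output above rather than a live `doc.ents`:

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
ents = [('小米', 'PERSON'), ('叶凡', 'PERSON'), ('2002年', 'DATE')]

# Group the entity texts under their label.
by_label = defaultdict(list)
for text, label in ents:
    by_label[label].append(text)

print(dict(by_label))
```

The result also shows the limits of the small model on this sentence: 小米 (a company) is labeled PERSON, and 华为 is not recognized at all.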
- print([(w.text, w.dep_) for w in doc])
-
-
- """
- [('小米', 'nmod:assmod'), ('董事长', 'appos'), ('叶凡', 'nsubj'), ('决定', 'ROOT'), ('投资', 'ccomp'), ('华为', 'dobj'), ('。', 'punct'), ('在', 'case'), ('2002年', 'nmod:prep'), (',', 'punct'), ('他', 'nsubj'), ('还', 'advmod'), ('创作', 'ROOT'), ('了', 'aux:asp'), ('<遮天>', 'dobj'), ('。', 'punct')]
- """
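Lemmatization finds the base form of a word: am, is, are, and have been are reduced to be, plurals are reduced to the singular (cats -> cat), and past tense is reduced to present tense (had -> have). The Chinese model does not provide this feature, so an English model is used for the demonstration.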
- import spacy
- nlp = spacy.load('en_core_web_sm')
-
- txt = "A magnetic monopole is a hypothetical elementary particle."
- doc = nlp(txt)
-
- lem = [token.lemma_ for token in doc]
- print(lem)
-
-
- """
- ['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
- """
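To make the idea of lemmatization concrete, here is a toy lookup-based sketch of the cases described above. It is deliberately naive and is not how spaCy's lemmatizer works: `IRREGULAR` and `toy_lemma` are hypothetical names, and the plural rule only strips a trailing "s".

```python
# Irregular forms that cannot be handled by a suffix rule.
IRREGULAR = {
    "am": "be", "is": "be", "are": "be", "been": "be",
    "had": "have", "has": "have",
}


def toy_lemma(word: str) -> str:
    """Return a crude base form: lookup first, then strip a plural 's'."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Naive plural rule; skips short words and words ending in "ss".
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


print([toy_lemma(w) for w in ["is", "cats", "had", "particle"]])
```

A real lemmatizer also uses the part of speech, which a lookup like this ignores.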
The Chinese model does not provide noun chunk extraction, so the English model is used for the demonstration.
- noun_chunks = [nc for nc in doc.noun_chunks]
- print(noun_chunks)
-
- """
- [A magnetic monopole, a hypothetical elementary particle]
- """
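Coreference resolution finds the entity that a pronoun such as he, she, or it refers to. This requires the pretrained neural coreference model from neuralcoref; if it is not installed yet, run: pip install neuralcoref
The Chinese model does not provide this feature, so the English model is used for the demonstration. Note that neuralcoref was written for spaCy 2.x, so the code below may not run under spaCy 3.x.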
- txt = "My sister has a son and she loves him."
-
- # Add the pretrained neural coreference component to the spaCy pipeline
- import neuralcoref
- neuralcoref.add_to_pipe(nlp)
-
- doc = nlp(txt)
- print(doc._.coref_clusters)
-
- """
- [My sister: [My sister, she], a son: [a son, him]]
- """
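Once the clusters are known, each pronoun can be replaced by the main mention of its cluster. The sketch below does this with plain string handling on the clusters shown in the output above. `resolve` is a hypothetical helper and the clusters are written out by hand as a dict, not read from neuralcoref.

```python
import re

text = "My sister has a son and she loves him."
# Main mention -> all mentions, copied from the output above.
clusters = {
    "My sister": ["My sister", "she"],
    "a son": ["a son", "him"],
}


def resolve(text: str, clusters: dict) -> str:
    """Replace every non-main mention with its cluster's main mention."""
    replacements = {
        mention: main
        for main, mentions in clusters.items()
        for mention in mentions
        if mention != main
    }
    if not replacements:
        return text
    # Match whole words only, so "she" does not match inside other words.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(m) for m in replacements) + r")\b"
    )
    return pattern.sub(lambda m: replacements[m.group(1)], text)


print(resolve(text, clusters))
```

String replacement is enough for this sentence. With repeated pronouns that refer to different entities, you would need the character offsets of each mention instead.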
- from spacy import displacy
-
-
-
- # Visualize the dependency parse
- html_str = displacy.render(doc, style="dep")
-
-
- # Visualize named entities
- # html_str = displacy.render(doc, style="ent")
-
-
- with open("D:\\data\\ss.html", "w", encoding="utf8") as f:
-     f.write(html_str)
-
-
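html_str is an HTML string. It is saved to the local file ss.html, which can then be opened in a browser to see the result:
Dependency relations
Named entities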
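There is also an official visualization package, spacy-streamlit, built specifically for visualizing spaCy NLP results.
streamlit itself is a library for building visualization apps.
spacy-streamlit has a usage demo: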
https://share.streamlit.io/ines/spacy-streamlit-demo/app.py
GitHub repository for the demo