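In linguistics, part of speech (Part-Of-Speech, POS) refers to the grammatical category of a word, also called its word class. Words in the same category share similar grammatical properties, and the set of all parts of speech is called a POS tag set. Different corpora use different tag sets, but most include common categories such as adjectives, verbs, and nouns.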
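Sequence labeling is the problem of, given a sequence $x = x_1 x_2 \cdots x_n$, finding the label sequence $y = y_1 y_2 \cdots y_n$ that assigns one label to each element. The set of all possible values of a label $y_i$ is called the tag set. For example, the input could be a sequence of natural numbers and the output their parity.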
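The parity example can be written out as a tiny sequence labeler. This is only an illustrative sketch: the function name `tag_parity` and the tag set `{"odd", "even"}` are chosen here for illustration and do not come from any library.

```python
from typing import List, Sequence


def tag_parity(xs: Sequence[int]) -> List[str]:
    """Assign each element x_i one label y_i from the tag set {"odd", "even"}."""
    return ["even" if x % 2 == 0 else "odd" for x in xs]


x = [1, 2, 3, 4, 10]
y = tag_parity(x)

# The output sequence has exactly one label per input element.
for xi, yi in zip(x, y):
    print(xi, yi)
```

Here each label depends only on its own element. Real taggers differ in that a label usually also depends on the neighbouring elements and labels.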
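A model that solves a sequence labeling problem is usually called a sequence labeler. It typically learns from an annotated dataset and then makes predictions. In NLP, x is usually a sequence of characters or words, and y is the label to be predicted, such as a character's role in word formation or a word's part of speech. Chinese word segmentation, POS tagging, and named entity recognition can all be cast as sequence labeling problems.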
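Tag | Name | Explanation |
---|---|---|
Ag | adjectival morpheme | A morpheme with adjectival character. The adjective code is a; the morpheme code g is prefixed with A. |
a | adjective | First letter of the English word "adjective". |
ad | adverbial adjective | An adjective used directly as an adverbial. Combines the adjective code a with the adverb code d. |
an | nominal adjective | An adjective that functions as a noun. Combines the adjective code a with the noun code n. |
b | distinguishing word | Initial consonant of the Chinese character "别". |
c | conjunction | First letter of the English word "conjunction". |
Dg | adverbial morpheme | A morpheme with adverbial character. The adverb code is d; the morpheme code g is prefixed with D. |
d | adverb | Second letter of "adverb", because the first letter is already used for adjectives. |
e | interjection | First letter of the English word "exclamation". |
f | locative word | Initial consonant of the Chinese character "方". |
g | morpheme | Most morphemes can serve as the "root" of a compound word; initial consonant of the Chinese character "根" (root). |
h | prefix component | First letter of the English word "head". |
i | idiom | First letter of the English word "idiom". |
j | abbreviation | Initial consonant of the Chinese character "简". |
k | suffix component | |
l | idiomatic expression | An idiomatic expression that has not yet become a set idiom and is somewhat "temporary" (临时); initial consonant of "临". |
m | numeral | Third letter of the English word "numeral", because n and u are already in use. |
Ng | nominal morpheme | A morpheme with nominal character. The noun code is n; the morpheme code g is prefixed with N. |
n | noun | First letter of the English word "noun". |
nr | person name | Combines the noun code n with the initial consonant of "人" (ren). |
ns | place name | Combines the noun code n with the place-word code s. |
nt | organization | The initial consonant of "团" is t; combines the noun code n with t. |
nz | other proper noun | The first letter of the initial consonant of "专" is z; combines the noun code n with z. |
o | onomatopoeia | First letter of the English word "onomatopoeia". |
p | preposition | First letter of the English word "prepositional". |
q | classifier (measure word) | First letter of the English word "quantity". |
r | pronoun | Second letter of the English word "pronoun", because p is already used for prepositions. |
s | place word | First letter of the English word "space". |
Tg | temporal morpheme | A morpheme with time-word character. The time-word code is t; the morpheme code g is prefixed with T. |
t | time word | First letter of the English word "time". |
u | particle | Second letter of the English word "auxiliary", because a is already used for adjectives. |
Vg | verbal morpheme | A morpheme with verbal character. The verb code is v; the morpheme code g is prefixed with V. |
v | verb | First letter of the English word "verb". |
vd | adverbial verb | A verb used directly as an adverbial. Combines the verb and adverb codes. |
vn | nominal verb | A verb that functions as a noun. Combines the verb and noun codes. |
w | punctuation | |
x | non-morpheme character | A non-morpheme character is merely a symbol; the letter x commonly stands for unknowns and symbols. |
y | modal particle | Initial consonant of the Chinese character "语". |
z | state word | First letter of the initial consonant (zh) of the Chinese character "状". |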
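Tags like the ones in the table above usually appear in annotated corpora attached to each word. The sketch below assumes a whitespace-separated `word/tag` layout, one sentence per line. The helper name `parse_tagged_line` and the sample line are hypothetical, and real corpora may add extras (such as bracketed compound entities) that this sketch does not handle.

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split a line such as '希望/n 的/u' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that a word containing '/' keeps it.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"token is not in word/tag form: {token!r}")
        pairs.append((word, tag))
    return pairs


# Hypothetical sample line, using tags from the table above.
sample = "他/r 的/u 希望/n 是/v 希望/v 上学/v"
pairs = parse_tagged_line(sample)
words = [w for w, _ in pairs]
tags = [t for _, t in pairs]
print(words)
print(tags)
```

Separating the words (the observation sequence x) from the tags (the label sequence y) produces the training pairs that a sequence labeler learns from.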
Tag | English term | Chinese term |
---|---|---|
ADJ | adjective | 形容词 |
ADP | adposition | 介词 |
ADV | adverb | 副词 |
AUX | auxiliary verb | 助动词 |
CCONJ | coordinating conjunction | 并列连词 |
DET | determiner | 限定词 |
INTJ | interjection | 感叹词 |
NOUN | noun | 名词 |
NUM | numeral | 数词 |
PART | particle | 助词 |
PRON | pronoun | 代词 |
PROPN | proper noun | 专有名词 |
PUNCT | punctuation | 标点符号 |
SCONJ | subordinating conjunction | 从属连词 |
SYM | symbol | 符号 |
VERB | verb | 动词 |
X | other | 其他 |
The official spaCy glossary, which explains POS tags, dependency labels, and entity types:
```python
def explain(term):
    """Get a description for a given POS tag, dependency label or entity type.

    term (str): The term to explain.
    RETURNS (str): The explanation, or `None` if not found in the glossary.

    EXAMPLE:
        >>> spacy.explain(u'NORP')
        >>> doc = nlp(u'Hello world')
        >>> print([(w.text, w.tag_, spacy.explain(w.tag_)) for w in doc])
    """
    if term in GLOSSARY:
        return GLOSSARY[term]


GLOSSARY = {
    # POS tags
    # Universal POS Tags
    # http://universaldependencies.org/u/pos/
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CONJ": "conjunction",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "EOL": "end of line",
    "SPACE": "space",
    # POS tags (English)
    # OntoNotes 5 / Penn Treebank
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    ".": "punctuation mark, sentence closer",
    ",": "punctuation mark, comma",
    "-LRB-": "left round bracket",
    "-RRB-": "right round bracket",
    "``": "opening quotation mark",
    '""': "closing quotation mark",
    "''": "closing quotation mark",
    ":": "punctuation mark, colon or ellipsis",
    "$": "symbol, currency",
    "#": "symbol, number sign",
    "AFX": "affix",
    "CC": "conjunction, coordinating",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential there",
    "FW": "foreign word",
    "HYPH": "punctuation mark, hyphen",
    "IN": "conjunction, subordinating or preposition",
    "JJ": "adjective (English), other noun-modifier (Chinese)",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "LS": "list item marker",
    "MD": "verb, modal auxiliary",
    "NIL": "missing tag",
    "NN": "noun, singular or mass",
    "NNP": "noun, proper singular",
    "NNPS": "noun, proper plural",
    "NNS": "noun, plural",
    "PDT": "predeterminer",
    "POS": "possessive ending",
    "PRP": "pronoun, personal",
    "PRP$": "pronoun, possessive",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "RBS": "adverb, superlative",
    "RP": "adverb, particle",
    "TO": 'infinitival "to"',
    "UH": "interjection",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
    "WP": "wh-pronoun, personal",
    "WP$": "wh-pronoun, possessive",
    "WRB": "wh-adverb",
    "SP": "space (English), sentence-final particle (Chinese)",
    "ADD": "email",
    "NFP": "superfluous punctuation",
    "GW": "additional word in multi-word expression",
    "XX": "unknown",
    "BES": 'auxiliary "be"',
    "HVS": 'forms of "have"',
    "_SP": "whitespace",
    # POS Tags (German)
    # TIGER Treebank
    # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
    "$(": "other sentence-internal punctuation mark",
    "$,": "comma",
    "$.": "sentence-final punctuation mark",
    "ADJA": "adjective, attributive",
    "ADJD": "adjective, adverbial or predicative",
    "APPO": "postposition",
    "APPR": "preposition; circumposition left",
    "APPRART": "preposition with article",
    "APZR": "circumposition right",
    "ART": "definite or indefinite article",
    "CARD": "cardinal number",
    "FM": "foreign language material",
    "ITJ": "interjection",
    "KOKOM": "comparative conjunction",
    "KON": "coordinate conjunction",
    "KOUI": 'subordinate conjunction with "zu" and infinitive',
    "KOUS": "subordinate conjunction with sentence",
    "NE": "proper noun",
    "NNE": "proper noun",
    "PAV": "pronominal adverb",
    "PROAV": "pronominal adverb",
    "PDAT": "attributive demonstrative pronoun",
    "PDS": "substituting demonstrative pronoun",
    "PIAT": "attributive indefinite pronoun without determiner",
    "PIDAT": "attributive indefinite pronoun with determiner",
    "PIS": "substituting indefinite pronoun",
    "PPER": "non-reflexive personal pronoun",
    "PPOSAT": "attributive possessive pronoun",
    "PPOSS": "substituting possessive pronoun",
    "PRELAT": "attributive relative pronoun",
    "PRELS": "substituting relative pronoun",
    "PRF": "reflexive personal pronoun",
    "PTKA": "particle with adjective or adverb",
    "PTKANT": "answer particle",
    "PTKNEG": "negative particle",
    "PTKVZ": "separable verbal particle",
    "PTKZU": '"zu" before infinitive',
    "PWAT": "attributive interrogative pronoun",
    "PWAV": "adverbial interrogative or relative pronoun",
    "PWS": "substituting interrogative pronoun",
    "TRUNC": "word remnant",
    "VAFIN": "finite verb, auxiliary",
    "VAIMP": "imperative, auxiliary",
    "VAINF": "infinitive, auxiliary",
    "VAPP": "perfect participle, auxiliary",
    "VMFIN": "finite verb, modal",
    "VMINF": "infinitive, modal",
    "VMPP": "perfect participle, modal",
    "VVFIN": "finite verb, full",
    "VVIMP": "imperative, full",
    "VVINF": "infinitive, full",
    "VVIZU": 'infinitive with "zu", full',
    "VVPP": "perfect participle, full",
    "XY": "non-word containing non-letter",
    # POS Tags (Chinese)
    # OntoNotes / Chinese Penn Treebank
    # https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
    "AD": "adverb",
    "AS": "aspect marker",
    "BA": "把 in ba-construction",
    # "CD": "cardinal number",
    "CS": "subordinating conjunction",
    "DEC": "的 in a relative clause",
    "DEG": "associative 的",
    "DER": "得 in V-de const. and V-de-R",
    "DEV": "地 before VP",
    "ETC": "for words 等, 等等",
    # "FW": "foreign words"
    "IJ": "interjection",
    # "JJ": "other noun-modifier",
    "LB": "被 in long bei-const",
    "LC": "localizer",
    "M": "measure word",
    "MSP": "other particle",
    # "NN": "common noun",
    "NR": "proper noun",
    "NT": "temporal noun",
    "OD": "ordinal number",
    "ON": "onomatopoeia",
    "P": "preposition excluding 把 and 被",
    "PN": "pronoun",
    "PU": "punctuation",
    "SB": "被 in short bei-const",
    # "SP": "sentence-final particle",
    "VA": "predicative adjective",
    "VC": "是 (copula)",
    "VE": "有 as the main verb",
    "VV": "other verb",
    # Noun chunks
    "NP": "noun phrase",
    "PP": "prepositional phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "ADJP": "adjective phrase",
    "SBAR": "subordinating conjunction",
    "PRT": "particle",
    "PNP": "prepositional noun phrase",
    # Dependency Labels (English)
    # ClearNLP / Universal Dependencies
    # https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
    "acl": "clausal modifier of noun (adjectival clause)",
    "acomp": "adjectival complement",
    "advcl": "adverbial clause modifier",
    "advmod": "adverbial modifier",
    "agent": "agent",
    "amod": "adjectival modifier",
    "appos": "appositional modifier",
    "attr": "attribute",
    "aux": "auxiliary",
    "auxpass": "auxiliary (passive)",
    "case": "case marking",
    "cc": "coordinating conjunction",
    "ccomp": "clausal complement",
    "clf": "classifier",
    "complm": "complementizer",
    "compound": "compound",
    "conj": "conjunct",
    "cop": "copula",
    "csubj": "clausal subject",
    "csubjpass": "clausal subject (passive)",
    "dative": "dative",
    "dep": "unclassified dependent",
    "det": "determiner",
    "discourse": "discourse element",
    "dislocated": "dislocated elements",
    "dobj": "direct object",
    "expl": "expletive",
    "fixed": "fixed multiword expression",
    "flat": "flat multiword expression",
    "goeswith": "goes with",
    "hmod": "modifier in hyphenation",
    "hyph": "hyphen",
    "infmod": "infinitival modifier",
    "intj": "interjection",
    "iobj": "indirect object",
    "list": "list",
    "mark": "marker",
    "meta": "meta modifier",
    "neg": "negation modifier",
    "nmod": "modifier of nominal",
    "nn": "noun compound modifier",
    "npadvmod": "noun phrase as adverbial modifier",
    "nsubj": "nominal subject",
    "nsubjpass": "nominal subject (passive)",
    "nounmod": "modifier of nominal",
    "npmod": "noun phrase as adverbial modifier",
    "num": "number modifier",
    "number": "number compound modifier",
    "nummod": "numeric modifier",
    "oprd": "object predicate",
    "obj": "object",
    "obl": "oblique nominal",
    "orphan": "orphan",
    "parataxis": "parataxis",
    "partmod": "participal modifier",
    "pcomp": "complement of preposition",
    "pobj": "object of preposition",
    "poss": "possession modifier",
    "possessive": "possessive modifier",
    "preconj": "pre-correlative conjunction",
    "prep": "prepositional modifier",
    "prt": "particle",
    "punct": "punctuation",
    "quantmod": "modifier of quantifier",
    "rcmod": "relative clause modifier",
    "relcl": "relative clause modifier",
    "reparandum": "overridden disfluency",
    "root": "root",
    "vocative": "vocative",
    "xcomp": "open clausal complement",
    # Dependency labels (German)
    # TIGER Treebank
    # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
    # currently missing: 'cc' (comparative complement) because of conflict
    # with English labels
    "ac": "adpositional case marker",
    "adc": "adjective component",
    "ag": "genitive attribute",
    "ams": "measure argument of adjective",
    "app": "apposition",
    "avc": "adverbial phrase component",
    "cd": "coordinating conjunction",
    "cj": "conjunct",
    "cm": "comparative conjunction",
    "cp": "complementizer",
    "cvc": "collocational verb construction",
    "da": "dative",
    "dh": "discourse-level head",
    "dm": "discourse marker",
    "ep": "expletive es",
    "hd": "head",
    "ju": "junctor",
    "mnr": "postnominal modifier",
    "mo": "modifier",
    "ng": "negation",
    "nk": "noun kernel element",
    "nmc": "numerical component",
    "oa": "accusative object",
    "oc": "clausal object",
    "og": "genitive object",
    "op": "prepositional object",
    "par": "parenthetical element",
    "pd": "predicate",
    "pg": "phrasal genitive",
    "ph": "placeholder",
    "pm": "morphological particle",
    "pnc": "proper noun component",
    "rc": "relative clause",
    "re": "repeated element",
    "rs": "reported speech",
    "sb": "subject",
    "sbp": "passivized subject (PP)",
    "sp": "subject or predicate",
    "svp": "separable verb prefix",
    "uc": "unit component",
    "vo": "vocative",
    # Named Entity Recognition
    # OntoNotes 5
    # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FACILITY": "Buildings, airports, highways, bridges, etc.",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "PRODUCT": "Objects, vehicles, foods, etc. (not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws.",
    "LANGUAGE": "Any named language",
    "DATE": "Absolute or relative dates or periods",
    "TIME": "Times smaller than a day",
    "PERCENT": 'Percentage, including "%"',
    "MONEY": "Monetary values, including unit",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": '"first", "second", etc.',
    "CARDINAL": "Numerals that do not fall under another type",
    # Named Entity Recognition
    # Wikipedia
    # http://www.sciencedirect.com/science/article/pii/S0004370212000276
    # https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
    "PER": "Named person or family.",
    "MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
    # https://github.com/ltgoslo/norne
    "EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
    "PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
    "DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
    "GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
    "GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
```
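The HMM is a directed probabilistic graphical model. Its derivations are among the more involved in classical machine learning, and it relies on dynamic programming; for the mathematical details, see *Statistical Learning Methods* (《统计学习方法》).

A hidden Markov model (Hidden Markov Model, HMM) is a probabilistic model that describes the joint distribution p(x, y) of two time series. The x sequence is visible from outside (that is, to the observer) and is called the observation sequence. The y sequence is not visible from outside and is called the state sequence. For example, if the observations x are words and the states y are parts of speech, we must infer the parts of speech from the word sequence.

The model is called "hidden" because, from the outside, the state sequence (for example, the parts of speech) cannot be seen; it is the dependent variable to be solved for. From this point of view, states are also called hidden states and observations visible states. It is called a "Markov model" because it satisfies the Markov assumption.

Going from data to an HMM to predicted POS tags involves three problems: probability computation, learning, and prediction. The prediction problem is to find, given an observation sequence, the state sequence (that is, the POS sequence) with the highest probability.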
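The prediction problem is typically solved with the Viterbi algorithm, a dynamic program over the states. Below is a minimal sketch for a first-order HMM. The function `viterbi`, the three-tag state set, and every probability in the toy model are made up for illustration; they are not estimated from any corpus and are not HanLP's implementation.

```python
from typing import Dict, List


def viterbi(obs: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable state sequence for obs under a first-order HMM."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = (best[t - 1][prev] * trans_p[prev][s]
                          * emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model with hand-picked (hypothetical) probabilities.
states = ["r", "v", "n"]  # pronoun, verb, noun
start_p = {"r": 0.6, "v": 0.2, "n": 0.2}
trans_p = {
    "r": {"r": 0.1, "v": 0.7, "n": 0.2},
    "v": {"r": 0.2, "v": 0.4, "n": 0.4},
    "n": {"r": 0.1, "v": 0.6, "n": 0.3},
}
emit_p = {
    "r": {"我": 1.0},
    "v": {"希望": 0.4, "上学": 0.6},
    "n": {"希望": 0.9, "上学": 0.1},
}

tags = viterbi(["我", "希望", "上学"], states, start_p, trans_p, emit_p)
print(tags)
```

In this toy model the noun reading of "希望" has the higher emission probability, yet the transition probabilities after a pronoun make the verb reading win on the best path. This is the kind of context sensitivity the joint model p(x, y) provides.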
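The CRF is an undirected probabilistic graphical model. Its mathematical treatment is broadly similar to that of the HMM; for the details, see *Statistical Learning Methods* (《统计学习方法》).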
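For orientation, the two models can be written side by side in their standard textbook forms. The notation here (feature functions $f_k$, weights $w_k$, normalizer $Z(x)$, and a start symbol $y_0$) is assumed for illustration and is not taken from the original post. A first-order HMM factorizes the joint distribution, while a linear-chain CRF models the conditional distribution directly:

$$
p(x, y) = p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{n} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
$$

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{n} \sum_{k} w_k\, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{n} \sum_{k} w_k\, f_k(y'_{t-1}, y'_t, x, t) \Big)
$$

Both can be decoded with the same kind of dynamic programming over adjacent labels, which is one sense in which the mathematics is similar.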
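Because the CRF implementation in HanLP runs on the Java virtual machine, it is slower than a C++ implementation, so the author recommends installing CRF++ directly on the local machine. On macOS this is simple: `brew install crf++`. Other platforms require a different installation procedure.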
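The perceptron can be viewed as a single-layer neural network; stacking perceptrons into multiple layers gives a multilayer perceptron, which is the starting point for deep learning models. For details, see *Statistical Learning Methods* (《统计学习方法》).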
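Experience from Chinese word segmentation shows that the perceptron can exploit rich contextual features, which makes it a better choice than the hidden Markov model. The same holds for POS tagging.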
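To make "rich contextual features" concrete, here is a deliberately simplified sketch: a per-token multiclass perceptron whose features are the current, previous, and next word. It is not the structured perceptron that HanLP uses (there is no Viterbi decoding over tag sequences and no weight averaging), and the names `features` and `PerceptronTagger` as well as the single training sentence are made up for illustration.

```python
from collections import defaultdict
from typing import List, Tuple


def features(words: List[str], i: int) -> List[str]:
    """Context features for position i: current, previous and next word."""
    prev_w = words[i - 1] if i > 0 else "<BOS>"
    next_w = words[i + 1] if i + 1 < len(words) else "<EOS>"
    return [f"w={words[i]}", f"prev={prev_w}", f"next={next_w}"]


class PerceptronTagger:
    def __init__(self, tags: List[str]):
        self.tags = list(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> weight

    def _best_tag(self, feats: List[str]) -> str:
        # Score each tag as the sum of its feature weights; ties go to the
        # tag that comes first in self.tags.
        return max(self.tags,
                   key=lambda t: sum(self.weights.get((f, t), 0.0) for f in feats))

    def predict(self, words: List[str]) -> List[str]:
        return [self._best_tag(features(words, i)) for i in range(len(words))]

    def train(self, data: List[Tuple[List[str], List[str]]], epochs: int = 5) -> None:
        for _ in range(epochs):
            for words, gold in data:
                for i, gold_tag in enumerate(gold):
                    feats = features(words, i)
                    guess = self._best_tag(feats)
                    if guess != gold_tag:
                        # Perceptron update: reward the gold tag, penalize the guess.
                        for f in feats:
                            self.weights[(f, gold_tag)] += 1.0
                            self.weights[(f, guess)] -= 1.0


# One hypothetical training sentence; "希望" is a noun once and a verb once.
sentence = ["他", "的", "希望", "是", "希望", "上学"]
gold = ["r", "u", "n", "v", "v", "v"]

tagger = PerceptronTagger(["n", "v", "r", "u"])
tagger.train([(sentence, gold)])
predicted = tagger.predict(sentence)
print(list(zip(sentence, predicted)))
```

The two occurrences of "希望" share the same word feature, so only the neighbouring-word features can separate them. An HMM, which emits each word from its tag alone, cannot use such overlapping features directly.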
$$
\text{Accuracy} = \frac{\text{number of correctly predicted tags}}{\text{total number of tags}}
$$
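The accuracy metric can be computed directly from a gold tag sequence and a predicted one. The function name `tagging_accuracy` and the two example sequences below are illustrative only, not output from any of the taggers discussed here.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Accuracy = number of correctly predicted tags / total number of tags."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


# Hypothetical example: the fifth tag is wrong, so 5 of 6 tags are correct.
gold_tags = ["r", "u", "n", "v", "v", "v"]
pred_tags = ["r", "u", "n", "v", "n", "v"]
accuracy = tagging_accuracy(gold_tags, pred_tags)
print(f"{accuracy:.4f}")
```

When evaluating on a whole test set, the counts are usually accumulated over all sentences before dividing, so that long sentences are not under-weighted.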
```python
from pyhanlp import *
# This uses the January 1998 People's Daily corpus
from tests.book.ch07.pku import PKU199801_TRAIN

HMMPOSTagger = JClass('com.hankcs.hanlp.model.hmm.HMMPOSTagger')
AbstractLexicalAnalyzer = JClass('com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer')
PerceptronSegmenter = JClass('com.hankcs.hanlp.model.perceptron.PerceptronSegmenter')
FirstOrderHiddenMarkovModel = JClass('com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModel')
SecondOrderHiddenMarkovModel = JClass('com.hankcs.hanlp.model.hmm.SecondOrderHiddenMarkovModel')


def train_hmm_pos(corpus, model):
    tagger = HMMPOSTagger(model)  # create the POS tagger
    tagger.train(corpus)  # train
    # The tagger does not segment text, so it only accepts an already segmented word sequence
    # print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    # Adding an analyzer performs segmentation and POS tagging together
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    # Output with abbreviated tags
    # print(analyzer.analyze("今年元旦我要去看升国旗!"))  # segmentation + POS tagging
    # Output with the abbreviated tags expanded into readable names
    print(analyzer.analyze("他的希望是希望上学").translateLabels())  # segmentation + POS tagging
    return tagger


# Guard the training calls so that importing this module (as the CRF and
# perceptron demos do) does not trigger training as a side effect.
if __name__ == '__main__':
    # First-order hidden Markov model
    tagger = train_hmm_pos(PKU199801_TRAIN, FirstOrderHiddenMarkovModel())
    # Second-order hidden Markov model
    tagger = train_hmm_pos(PKU199801_TRAIN, SecondOrderHiddenMarkovModel())
```
```python
from pyhanlp import *
from tests.book.ch07.demo_hmm_pos import AbstractLexicalAnalyzer, PerceptronSegmenter
from tests.book.ch07.pku import POS_MODEL, PKU199801_TRAIN

CRFPOSTagger = JClass('com.hankcs.hanlp.model.crf.CRFPOSTagger')


def train_crf_pos(corpus):
    # Option 1: train with HanLP's Java API (slow)
    tagger = CRFPOSTagger(None)  # create an empty tagger
    tagger.train(corpus, POS_MODEL)  # train
    tagger = CRFPOSTagger(POS_MODEL)  # load
    # Option 2: train with CRF++ and load with HanLP (option 1 prints the training command)
    # tagger = CRFPOSTagger(POS_MODEL + ".txt")
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上学"))  # segmentation + POS tagging
    return tagger


if __name__ == '__main__':
    tagger = train_crf_pos(PKU199801_TRAIN)
```
```python
from pyhanlp import *
from tests.book.ch07.demo_hmm_pos import AbstractLexicalAnalyzer, PerceptronSegmenter
from tests.book.ch07.pku import PKU199801_TRAIN, POS_MODEL

POSTrainer = JClass('com.hankcs.hanlp.model.perceptron.POSTrainer')
PerceptronPOSTagger = JClass('com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger')


def train_perceptron_pos(corpus):
    trainer = POSTrainer()
    # trainer.train(corpus, POS_MODEL)  # train (commented out: the saved model is loaded below)
    tagger = PerceptronPOSTagger(POS_MODEL)  # load
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上学")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上学"))  # segmentation + POS tagging
    return tagger


if __name__ == '__main__':
    train_perceptron_pos(PKU199801_TRAIN)
```
```python
# -*- coding: utf-8 -*-
"""
Created on Fri Apr 1 18:10:18 2022

@author: He Zekai
"""
from __future__ import unicode_literals, print_function

import random
from pathlib import Path

import plac
import spacy
from LAC import LAC
from spacy.tokens import Doc
from spacy.training import Example

# Tag set produced by LAC, mapped to the Chinese name of each tag
TAG_MAP = {
    'n': {'pos': '普通名词'},
    'f': {'pos': '方位名词'},
    's': {'pos': '处所名词'},
    'nw': {'pos': '作品名'},
    'nz': {'pos': '其他专名'},
    'v': {'pos': '普通动词'},
    'vd': {'pos': '动副词'},
    'vn': {'pos': '名动词'},
    'a': {'pos': '形容词'},
    'ad': {'pos': '副形词'},
    'an': {'pos': '名形词'},
    'd': {'pos': '副词'},
    'm': {'pos': '数量词'},
    'q': {'pos': '量词'},
    'r': {'pos': '代词'},
    'p': {'pos': '介词'},
    'c': {'pos': '连词'},
    'u': {'pos': '助词'},
    'xc': {'pos': '其他虚词'},
    'w': {'pos': '标点符号'},
    'PER': {'pos': '人名'},
    'LOC': {'pos': '地名'},
    'ORG': {'pos': '机构名'},
    'TIME': {'pos': '时间'}
}

# Read the raw news text, one item per line
text = []
with open('train_text.txt', encoding='utf8') as f:
    for i in f.readlines():
        text.append(i.strip('\u200b\u200b\u200b\n'))

random.seed(123)
lac = LAC(mode='lac')
train = random.sample(text, 20)  # randomly sample 20 news items
data = lac.run(train)  # LAC supplies the word segmentation and the tags

# Build (words, {'tags': tags}) pairs as training data
TRAIN_DATA = []
for i in range(len(data)):
    txt = (data[i][0], {'tags': data[i][1]})
    TRAIN_DATA.append(txt)


@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='zh', output_dir="C:/Users/11752/Desktop/大三下/自然语言处理/作业5--自然语言处理作业zip/model/", n_iter=300):
    nlp = spacy.blank(lang)
    tagger = nlp.add_pipe('tagger')  # add the tagger to the pipeline
    print("names2:", nlp.pipe_names)
    for tag, values in TAG_MAP.items():
        print("tag:", tag)
        tagger.add_label(tag)

    optimizer = nlp.begin_training()  # initialize the model
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            # One Doc per sentence, built from the LAC words with no trailing spaces
            example = Example.from_dict(
                Doc(nlp.vocab, words=text, spaces=[False] * len(text)), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print("i:", str(i) + str(losses))

    test_text = random.sample(train, 5)  # randomly sample 5 of the training news items
    test_text = u','.join(test_text)
    doc = nlp(test_text)
    print("doc:", [t.text for t in doc])
    print('Test_Tags', [(t.text, t.tag_, TAG_MAP[t.tag_]['pos']) for t in doc])

    # Save the model to the output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # Reload the saved model and tag the same text again
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, TAG_MAP[t.tag_]['pos']) for t in doc])


if __name__ == '__main__':
    plac.call(main)
```
https://blog.csdn.net/yuebowhu/article/details/112006712
https://blog.csdn.net/weixin_45965387/article/details/123953377