Part-of-speech (POS) tagging is the process of assigning the correct part of speech to each word in an input text. Its main uses are predicting the part of speech of an upcoming word and providing a foundation for tasks such as syntactic parsing and information extraction. Typical approaches include HMMs (hidden Markov models) and deep learning methods (RNN, LSTM, etc.). Chinese, however, makes POS tagging especially hard: the language has almost no inflectional morphology, so there are no word-form cues to rely on; many common words belong to several word classes; and annotation standards differ between researchers for subjective reasons.
This article shows how to implement POS tagging in Python and how to train a Chinese POS tagging model with spaCy.
First, for the given training data:
Process it with a spaCy pipeline, initialize a tag list and a text string, join the segmented tokens with "/", and append each token's POS tag to the tag list. The code is as follows:
# Imports for the whole script (training and testing code below)
import random
from pathlib import Path

import plac
import spacy
from spacy.training import Example


def train_data(train_path):
    # Use the pretrained Chinese pipeline for segmentation and POS annotation
    nlp = spacy.load('zh_core_web_sm')
    train_list = []
    for line in open(train_path, "r", encoding="utf8"):
        train_list.append(line)
    result = []      # training examples in the form (text, {'tags': [...]})
    train_dict = {}  # every POS tag seen in the training data
    for i in train_list:
        doc = nlp(i)
        label = []
        text = ""
        for j in doc:
            text += j.text + "/"    # join tokens with "/"
            # Keep the full POS tag (e.g. NOUN) so tags sharing a first letter do not collide
            label.append(j.pos_)
            train_dict[j.pos_] = {"pos": j.pos_}
        result.append((text[:-1], {'tags': label}))  # drop the trailing "/"
    return result, train_dict
Printed out, result is a list of (slash-joined sentence, {'tags': [...]}) pairs, and train_dict maps each POS tag found in the data to a {'pos': tag} entry, roughly as sketched below.
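As a quick sanity check, the function can be called directly. This is only a hypothetical usage sketch: the file path is a placeholder, and the commented output illustrates the shape of the two return values rather than actual results.

# Placeholder path -- substitute the real training file
result, train_dict = train_data("train.txt")
print(result[0])     # e.g. ('词1/词2/...', {'tags': ['NOUN', 'VERB', ...]}) -- shape only
print(train_dict)    # e.g. {'NOUN': {'pos': 'NOUN'}, 'VERB': {'pos': 'VERB'}, ...}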
Next, train the model:
@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='zh', output_dir=None, n_iter=25):
    nlp = spacy.blank(lang)              # create an empty pipeline for the given language ('zh' by default)
    tagger = nlp.add_pipe('tagger')
    # Add the tags. This needs to be done before you start training.
    for tag in train_dict:
        tagger.add_label(tag)
    optimizer = nlp.begin_training()     # initialize the model
    for i in range(n_iter):
        random.shuffle(result)           # shuffle the training examples each iteration
        losses = {}
        for text, annotations in result:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)
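As an aside, the spaCy v3 Example API used in the update step can be exercised on its own. The following self-contained sketch (invented tokens and tags, independent of the script above) builds the Doc from pre-segmented words so the token count is guaranteed to match the tag list:

import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank('zh')
tagger = nlp.add_pipe('tagger')
for t in ('PRON', 'VERB', 'PROPN'):    # labels must be registered before training
    tagger.add_label(t)
optimizer = nlp.begin_training()

# Build the Doc from already-segmented words (invented example)
doc = Doc(nlp.vocab, words=['我', '爱', '北京'])
example = Example.from_dict(doc, {'tags': ['PRON', 'VERB', 'PROPN']})

losses = {}
nlp.update([example], sgd=optimizer, losses=losses)
print(losses)                          # e.g. {'tagger': ...}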
Running the training loop prints the tagger loss once per iteration.
Finally, process the test data in the same way. This code continues inside main(), after the training loop:
    # -- continuation of main(), after the training loop --
    test_path = r"E:\1\Study\大三下\自然语言处理\第五章作业\test.txt"
    test_list = []
    for line in open(test_path, "r", encoding="utf8"):
        test_list.append(line)
    for z in test_list:
        txt = nlp(z)
        test_text = ""
        for word in txt:
            test_text += word.text + "/"
        print('test_data:', [(t.text, t.tag_, t.pos_) for t in txt])
    # save the model to the output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)   # test_text holds the last (slash-joined) test line
        print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
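Assuming the three pieces above live in a single script, one way to wire them together is a plac entry point. The training-file path and script name here are placeholders, not taken from the original post:

if __name__ == '__main__':
    # Placeholder path -- point this at the real training file
    result, train_dict = train_data("train.txt")
    plac.call(main)     # parses -l/-o/-n from the command line

It could then be run as, for example, python pos_tagger.py -n 25 -o ./model (hypothetical script name and output directory).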
The final print statement shows the (text, tag, pos) triples that the reloaded model predicts for the test text, verifying that saving and loading the model work.
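Once saved, the model can also be reloaded in a later session. A minimal sketch, assuming ./model is whatever directory was passed as -o and using an invented slash-joined input in the same format as the training text:

import spacy

nlp2 = spacy.load('./model')    # placeholder output directory
doc = nlp2('我/爱/北京')         # invented input, slash-joined like the training data
print([(t.text, t.tag_) for t in doc])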