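Shallow parsing, also called chunking, is a shallow analysis method that sits between part-of-speech tagging and constituency parsing. It identifies the minimal phrase chunks in a text, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).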
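In the figure above, for example, the noun phrase chunks (NP-chunks) are extracted from the text "We saw the yellow dog", which yields the corresponding shallow syntactic structure.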
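In terms of method, chunking resembles named entity recognition (NER): both are sequence labeling problems. Common tag schemes are BMES, BIO, and BIOE. A tag is combined with the chunk type X; for example, B-NP marks the beginning of a noun phrase chunk.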
The figure is taken from a blog post.
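The BIO scheme described above can be sketched in a few lines. This is a minimal illustration, not a library API: the helper name `spans_to_bio` and the chunk spans are assumptions made for the example, with "We" and "the yellow dog" taken as the two NP chunks of the sample sentence.

```python
def spans_to_bio(tokens, chunks):
    """Turn labelled chunk spans into one BIO tag per token.

    `chunks` is a list of (start, end, label) tuples, `end` exclusive.
    """
    tags = ["O"] * len(tokens)            # tokens outside every chunk
    for start, end, label in chunks:
        tags[start] = f"B-{label}"        # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens of the chunk
    return tags


tokens = ["We", "saw", "the", "yellow", "dog"]
# Assumed NP chunks for the example sentence: "We" and "the yellow dog".
np_chunks = [(0, 1, "NP"), (2, 5, "NP")]

for token, tag in zip(tokens, spans_to_bio(tokens, np_chunks)):
    print(f"{token}\t{tag}")
```

Once every token carries a tag like this, chunking becomes an ordinary per-token classification problem, which is why it can be handled in the same way as NER.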
The phrase chunks in a sentence generally fall into the following types:
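However, existing tools (spaCy, TextBlob, etc.) generally handle only the NP-chunking task and extract only the noun phrase chunks from a text sequence. The CoNLL-2000 chunking task extracts NP, VP, and PP chunks, and a corresponding dataset is provided for it.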
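CoNLL-2000 style data stores one token per line as `word POS chunk-tag`, with a blank line between sentences. The sketch below reads that layout and counts the chunk types. The function name `read_conll_chunks` is hypothetical, and the sample text is a short hand-written fragment in the same format, not a file loaded from the dataset.

```python
from collections import Counter


def read_conll_chunks(text):
    """Parse "word POS chunk-tag" lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk_tag = line.split()
        current.append((word, pos, chunk_tag))
    if current:                           # last sentence without trailing blank
        sentences.append(current)
    return sentences


# Illustrative fragment in CoNLL-2000 layout (assumed sample).
sample = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
"""

sentences = read_conll_chunks(sample)
# Every chunk starts with exactly one B- tag, so counting B- tags counts chunks.
chunk_counts = Counter(
    tag[2:] for sent in sentences for _, _, tag in sent if tag.startswith("B-")
)
print(chunk_counts)
```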
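Chunking can be done with rule-based methods or with machine learning methods.
Rule-based methods require a hand-written chunk grammar, and nested structures need care.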
```python
import nltk


def preprocess(doc):
    """Split text into sentences, tokenize each one, and POS-tag the tokens."""
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences


sentence = "The blogger taught the reader to chunk"
sentence = preprocess(sentence)
print(sentence)

# Pattern: optional determiner + zero or more adjectives + one noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
NPChunker = nltk.RegexpParser(grammar)
result = NPChunker.parse(sentence[0])
print(result)
```
Output:
```
[[('The', 'DT'), ('blogger', 'NN'), ('taught', 'VBD'), ('the', 'DT'), ('reader', 'NN'), ('to', 'TO'), ('chunk', 'VB')]]
(S
  (NP The/DT blogger/NN)
  taught/VBD
  (NP the/DT reader/NN)
  to/TO
  chunk/VB)
```
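To make the tag pattern in the grammar more concrete, here is a simplified sketch that applies the same pattern with Python's `re` module alone. It is an illustration of the idea, not NLTK's implementation, and the function name `regex_np_chunk` is hypothetical. The POS sequence is encoded as a string such as `<DT><NN><VBD>`, so an ordinary regular expression can scan it.

```python
import re


def regex_np_chunk(tagged):
    """Find NP chunks matching <DT>?<JJ>*<NN> in a POS-tagged sentence."""
    # Encode the tag sequence as one string, e.g. "<DT><NN><VBD>".
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)
    # Optional determiner, any number of adjectives, then exactly one noun.
    pattern = re.compile(r"(<DT>)?(<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


tagged = [('The', 'DT'), ('blogger', 'NN'), ('taught', 'VBD'),
          ('the', 'DT'), ('reader', 'NN'), ('to', 'TO'), ('chunk', 'VB')]
print(regex_np_chunk(tagged))
```

Because the closing `>` is part of the pattern, `<NN>` does not match tags such as `<NNS>` or `<NNP>`; plural and proper nouns would need their own alternatives in the pattern.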
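The input can take two forms: raw text alone, or raw text plus part-of-speech tags. The second form gives much higher accuracy than the first.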
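The example here uses the conll2000 corpus bundled with NLTK to train a maximum entropy classifier that automatically extracts noun phrase (NP), verb phrase (VP), and prepositional phrase (PP) chunks from text. The corpus can be downloaded with the following commands:
```python
import nltk
nltk.download("conll2000")
```
The code is as follows: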
```python
import pickle

import nltk
from nltk.corpus import conll2000


def tags_since_dt(sentence, i):
    """Return the POS tags seen since the most recent determiner, joined by '+'."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))


def npchunk_features(sentence, i, history):
    """Build the feature dict for token i of a POS-tagged sentence."""
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    if i == len(sentence) - 1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i + 1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevword": prevword,
            "nextword": nextword,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "prevpos+pos+nextpos": "%s+%s+%s" % (prevpos, pos, nextpos),
            "prevword+word+nextword": "%s+%s+%s" % (prevword, word, nextword),
            "tags-since-dt": tags_since_dt(sentence, i)}


class ConsecutiveNPChunkTagger(nltk.TaggerI):
    """Assigns a chunk tag to each (word, POS) pair, one token at a time."""

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='IIS', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # Return a list rather than a one-shot zip iterator
        return list(zip(sentence, history))


# Model and feature construction
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Convert each chunk tree to ((word, POS), chunk tag) pairs
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        # Conversions between the two representations:
        #   iob_tagged = tree2conlltags(chunked_sentence)
        #   chunk_tree = conlltags2tree(iob_tagged)
        # Corpus size:
        #   len(conll2000.chunked_sents())  # 10948
        #   len(conll2000.chunked_words())  # 166433
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)


# Load the training and test data
train_sents = conll2000.chunked_sents('train.txt')
test_sents = conll2000.chunked_sents('test.txt')

# Train the model
chunker = ConsecutiveNPChunker(train_sents)

# Evaluate (newer NLTK releases deprecate evaluate() in favour of accuracy())
print(chunker.evaluate(test_sents))

# Save the model
with open("chunker.bin", "wb") as f:
    pickle.dump(chunker, f)

# Load the model
with open("chunker.bin", "rb") as f:
    chunker = pickle.load(f)

# Test example: the parser expects a list of (word, POS) pairs
sentence = 'It is the 2019 novel coronavirus that has breaks out worldwide.'
test_sent_words = nltk.word_tokenize(sentence)
test_sent_pos = nltk.pos_tag(test_sent_words)
print(chunker.parse(test_sent_pos))
```
Output:
```
ChunkParse score:
    IOB Accuracy: 93.9%%
    Precision: 89.0%%
    Recall: 92.1%%
    F-Measure: 90.5%%
(S
  (NP It/PRP)
  (VP is/VBZ)
  (NP the/DT 2019/CD novel/NN coronavirus/NN)
  (NP that/WDT)
  (VP has/VBZ breaks/VBN)
  out/RP
  (NP worldwide/NN)
  ./.)
```
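The precision, recall, and F-measure in the score above are chunk-level metrics. The sketch below shows one common way to compute them from BIO tag sequences, assuming a predicted chunk counts as correct only when its span and label both match a gold chunk exactly. The helper names `bio_to_chunks` and `chunk_prf` and the toy tag sequences are made up for the illustration; this is not NLTK's scoring code.

```python
def bio_to_chunks(tags):
    """Return the set of (start, end, label) chunks encoded by a BIO sequence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):     # sentinel flushes the last chunk
        # An open chunk continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            chunks.add((start, i, label))
            start, label = None, None
        # B- always opens a chunk; a stray I- opens one if none is open.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return chunks


def chunk_prf(gold_tags, pred_tags):
    """Chunk-level precision, recall, and F-measure with exact matching."""
    gold, pred = bio_to_chunks(gold_tags), bio_to_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: the prediction splits the second gold NP into two chunks.
gold = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"]
pred = ["B-NP", "B-VP", "B-NP", "I-NP", "B-NP", "I-NP", "O"]
precision, recall, f_measure = chunk_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

In the toy example only one tag differs, yet the whole split chunk counts as wrong under exact matching. This is how chunk-level precision and recall can fall well below token-level IOB accuracy.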