
Shallow Parsing

Shallow parsing, also known as chunking, is a lightweight analysis that sits between part-of-speech tagging and constituency parsing. It identifies the minimal phrase chunks in a text, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).

Introduction

[Figure: extracting NP chunks from the sentence "We saw the yellow dog"]

For example, in the figure above, the noun phrase chunks, called NP-chunks, are extracted from the text "We saw the yellow dog". The resulting shallow syntactic structure is shown below:

[Figure: the resulting shallow chunk structure]

In terms of approach, chunking is similar to named entity recognition (NER): both are sequence labeling problems. Commonly used tagging schemes include BMES, BIO, and BIOES. The scheme tags are combined with a chunk type X; for example, B-NP marks the beginning of a noun phrase chunk.

[Figure: sequence labeling tag schemes (image from a blog post)]
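
To make the tag scheme concrete, here is a minimal sketch using NLTK's tree2conlltags on a hand-built chunk tree for "We saw the yellow dog"; it shows how a chunk structure maps to word-level BIO tags:

import nltk
from nltk.tree import Tree

# A hand-built chunk tree for "We saw the yellow dog"
tree = Tree('S', [
    Tree('NP', [('We', 'PRP')]),
    ('saw', 'VBD'),
    Tree('NP', [('the', 'DT'), ('yellow', 'JJ'), ('dog', 'NN')]),
])

# tree2conlltags flattens the chunk tree into (word, POS, IOB tag) triples
print(nltk.chunk.tree2conlltags(tree))
# [('We', 'PRP', 'B-NP'), ('saw', 'VBD', 'O'), ('the', 'DT', 'B-NP'),
#  ('yellow', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP')]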

The phrase chunks in a sentence generally fall into a few common types:
[Figure: common chunk types; the CoNLL-2000 task, for example, uses NP, VP, PP, ADVP, ADJP, SBAR, PRT, INTJ, and a few others]

However, existing tools (spaCy, TextBlob, etc.) generally focus only on the NP-chunking task, extracting just the noun phrase chunks from a text. The CoNLL-2000 chunking task covers NP, VP, and PP chunks, and provides a corresponding dataset.
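
As an illustration, a minimal NP-chunking sketch with spaCy (assuming the en_core_web_sm model has been installed, e.g. via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We saw the yellow dog")
# spaCy exposes base noun phrases through doc.noun_chunks
print([chunk.text for chunk in doc.noun_chunks])
# Expected output along the lines of: ['We', 'the yellow dog']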

Practice

Both rule-based and machine-learning-based methods can be used.

Rule-based method

The rule-based method requires manually defining a chunking grammar, and nested structures need special care (a multi-rule grammar handling nesting is sketched after the example below).

import nltk


def preprocess(doc):
    # Split into sentences, tokenize, and POS-tag each sentence
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences


sentence = "The blogger taught the reader to chunk"
sentence = preprocess(sentence)
print(sentence)

# Pattern: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
NPChunker = nltk.RegexpParser(grammar)
result = NPChunker.parse(sentence[0])
print(result)

Output:

[[('The', 'DT'), ('blogger', 'NN'), ('taught', 'VBD'), ('the', 'DT'), ('reader', 'NN'), ('to', 'TO'), ('chunk', 'VB')]]
(S
  (NP The/DT blogger/NN)
  taught/VBD
  (NP the/DT reader/NN)
  to/TO
  chunk/VB)
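
As noted above, a grammar can contain several rules applied as a cascade, so that later rules (PP, VP) wrap the NP chunks found by earlier ones. A minimal sketch of such a nested grammar (the rules and the loop value are illustrative assumptions, not a tuned grammar):

import nltk

# Cascaded grammar: the PP and VP rules refer to NP chunks, and loop=2
# re-runs the cascade so the nesting can be resolved on a second pass
grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # determiner + adjectives + noun(s)
  PP: {<IN><NP>}                  # preposition followed by an NP
  VP: {<VB.*><NP|PP>*}            # verb with optional NP/PP complements
"""
chunker = nltk.RegexpParser(grammar, loop=2)
tagged = nltk.pos_tag(nltk.word_tokenize("The dog saw the cat in the garden"))
print(chunker.parse(tagged))
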
Machine-learning method (maximum entropy classifier)

The input can take two forms: raw text alone, or raw text plus POS tags (the latter gives much higher accuracy than the former).

Here we use the conll2000 corpus bundled with NLTK, downloadable with the following commands, to train a maximum entropy classifier that automatically extracts the noun phrase (NP), verb phrase (VP), and prepositional phrase (PP) chunks from text:

import nltk
nltk.download("conll2000")
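
Before training, the corpus itself can be inspected. A quick sketch (the sentence counts in the comments are the standard CoNLL-2000 splits):

import nltk
from nltk.corpus import conll2000

# Each corpus sentence is stored as a shallow tree of NP/VP/PP chunks
print(len(conll2000.chunked_sents('train.txt')))  # 8936 training sentences
print(len(conll2000.chunked_sents('test.txt')))   # 2012 test sentences
print(conll2000.chunked_sents('train.txt')[0])
# chunk_types restricts the trees to a subset of chunk types, e.g. NP only
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[0])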

The code is as follows:

def tags_since_dt(sentence, i):
    # Feature: the set of POS tags seen since the most recent determiner (DT)
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()  # reset whenever a determiner is encountered
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))


def npchunk_features(sentence, i, history):
    # Build the feature dict for token i: the word, its POS tag, and the
    # surrounding context (previous/next words and tag combinations)
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    if i == len(sentence) - 1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i + 1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevword": prevword,
            "nextword": nextword,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "prevpos+pos+nextpos": "%s+%s+%s" % (prevpos, pos, nextpos),
            "prevword+word+nextword": "%s+%s+%s" % (prevword, word, nextword),
            "tags-since-dt": tags_since_dt(sentence, i)}


# Sequence tagger: a maximum entropy classifier predicts each token's IOB
# chunk tag from its features and the tags predicted so far (the history)
class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='IIS', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

# Model and feature construction
class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        # word -> POS tag -> chunk tag
        # iob_tagged = tree2conlltags(chunked_sentence)
        # chunk_tree = conlltags2tree(iob_tagged)
        # len(conll2000.chunked_sents())  # 10948
        # len(conll2000.chunked_words())  # 166433
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)


from nltk.corpus import conll2000

# Load the training and test data
train_sents = conll2000.chunked_sents('train.txt')
test_sents = conll2000.chunked_sents('test.txt')
# Train the model
chunker = ConsecutiveNPChunker(train_sents)
# Evaluate on the test set
print(chunker.evaluate(test_sents))


# Save the model
import pickle
pickle.dump(chunker, open("chunker.bin", "wb"))

# Load the model
chunker = pickle.load(open("chunker.bin", "rb"))

# Test example: nltk.pos_tag already returns (word, pos) tuples, which is
# exactly the input format that chunker.parse expects
sentence = 'It is the 2019 novel coronavirus that has breaks out worldwide.'
test_sent_words = nltk.word_tokenize(sentence)
test_sent_pos = nltk.pos_tag(test_sent_words)
print(chunker.parse(test_sent_pos))

Output:

ChunkParse score:
    IOB Accuracy:  93.9%
    Precision:     89.0%
    Recall:        92.1%
    F-Measure:     90.5%
(S
  (NP It/PRP)
  (VP is/VBZ)
  (NP the/DT 2019/CD novel/NN coronavirus/NN)
  (NP that/WDT)
  (VP has/VBZ breaks/VBN)
  out/RP
  (NP worldwide/NN)
  ./.)