Part-of-speech tagging (POS tagging), also known as grammatical tagging or word-category disambiguation, is a text-processing technique in corpus linguistics that labels each word in a corpus with its part of speech according to its meaning and context.
POS tagging can be done manually or by algorithms; implementing it with machine learning methods is a research topic in Natural Language Processing (NLP). Common POS tagging algorithms include the Hidden Markov Model (HMM) and Conditional Random Fields (CRFs).
POS tagging is mainly used in text mining and NLP, where it serves as a preprocessing step for text-based machine learning tasks such as semantic analysis and coreference resolution.
As with word segmentation, POS tagging algorithms fall into two broad classes: dictionary lookup based on string matching, and statistical methods. jieba combines the two: for words recognized during segmentation it looks the POS up directly in its dictionary, while for out-of-vocabulary words it uses an HMM together with the Viterbi algorithm.
Dictionary lookup based on string matching
First segment the sentence, then look up each word's POS in the dictionary and tag it accordingly. This is how jieba tags the words it recognizes. The approach is simple and easy to follow, but it cannot handle words that admit more than one POS, so some error is unavoidable.
jieba's dictionary stores one entry per line with three fields: the word, its frequency, and its POS tag. Once segmentation is done, tagging a recognized word only requires looking its POS up in this dictionary.
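As a rough illustration of this lookup approach (not jieba's actual implementation; the file name dict.txt and the fallback tag "x" are assumptions), the sketch below loads a jieba-style dictionary and tags an already-segmented sentence:

# Minimal dictionary-lookup tagger. Assumes a jieba-style dictionary file
# where each line is "word frequency tag", e.g. "北京 34488 ns".
def load_dict(path="dict.txt"):            # hypothetical path
    word2tag = {}
    for line in open(path, encoding="utf-8"):
        parts = line.split()
        if len(parts) == 3:
            word, _freq, tag = parts
            word2tag[word] = tag           # only one tag is kept per word,
    return word2tag                        # which is exactly the ambiguity problem

def lookup_tag(words, word2tag):
    # words: a list produced by a segmenter; unknown words fall back to "x"
    return [(w, word2tag.get(w, "x")) for w in words]

Because each word maps to a single tag, a word such as "计划", which can be either a noun or a verb, always receives the same tag regardless of context.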
Statistical POS tagging
As with segmentation, an HMM can be used for POS tagging. The observation sequence is the segmented sentence and the hidden sequence is the sequence of POS tags. The initial, emission, and transition probabilities play much the same role as in segmentation and can be estimated from a large tagged corpus. Decoding from the observation sequence to the hidden sequence is done with the Viterbi algorithm using those estimated probabilities; once the hidden sequence is found, tagging is complete.
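Concretely, writing pi[j] for the probability that a sentence starts with tag j, A[j][w] for the probability that tag j emits word w, and B[k][j] for the probability that tag k is followed by tag j (the same roles these variables play in the code below), the standard Viterbi recursion in log space is

$$dp[0][j] = \log \pi_j + \log A[j][x_0], \qquad dp[i][j] = \max_k \bigl( dp[i-1][k] + \log B[k][j] \bigr) + \log A[j][x_i]$$

where x_i is the i-th word of the sentence; recording the maximizing k as a backpointer at every step lets the best tag sequence be read off backwards from the last position.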
tag2id, id2tag = {}, {}    # map tag <-> id, e.g. tag2id: {"VB": 0, "NNP": 1, ...}, id2tag: {0: "VB", 1: "NNP", ...}
word2id, id2word = {}, {}  # map word <-> id

for line in open('traindata.txt'):
    items = line.split('/')
    word, tag = items[0], items[1].rstrip()   # extract the word and its tag from each line
    if word not in word2id:
        word2id[word] = len(word2id)
        id2word[len(id2word)] = word
    if tag not in tag2id:
        tag2id[tag] = len(tag2id)
        id2tag[len(id2tag)] = tag

M = len(word2id)   # M: size of the vocabulary (# of words in dictionary)
N = len(tag2id)    # N: number of distinct tags (# of tags in tag set)
print(word2id)
{'Newsweek': 0, ',': 1, 'trying': 2, 'to': 3, 'keep': 4, 'pace': 5, 'with': 6, 'rival': 7, 'Time': 8, 'magazine': 9, 'announced': 10, 'new': 11, 'advertising': 12, 'rates': 13, 'for': 14, '1990': 15, 'and': 16, 'said': 17, 'it': 18, 'will': 19, 'introduce': 20, 'a': 21, 'incentive': 22, 'plan': 23, 'advertisers': 24, '.': 25, 'The': 26, 'ad': 27, 'from': 28, 'unit': 29, 'of': 30, 'the': 31, 'Washington': 32, 'Post': 33, 'Co.': 34, 'is': 35, 'second': 36, 'has': 37, 'offered': 38, 'in': 39, 'three': 40, 'years': 41, 'Plans': 42, 'that': 43, 'give': 44, 'discounts': 45, 'maintaining': 46, 'or': 47, 'increasing': 48, 'spending': 49, ......}
# Build pi, A, B
import numpy as np

pi = np.zeros(N)        # pi[i]: probability that tag i appears at the first position of a sentence
A = np.zeros((N, M))    # A[i][j]: probability of emitting word j given tag i
B = np.zeros((N, N))    # B[i][j]: probability of transitioning from tag i to tag j

prev_tag = ""
for line in open('traindata.txt'):
    items = line.split('/')
    wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]
    if prev_tag == "":                 # this word starts a sentence
        pi[tagId] += 1
        A[tagId][wordId] += 1
    else:                              # not at the start of a sentence
        A[tagId][wordId] += 1
        B[tag2id[prev_tag]][tagId] += 1
    if items[0] == ".":
        prev_tag = ""
    else:
        prev_tag = items[1].rstrip()

# normalize the counts into probabilities
pi = pi / sum(pi)
for i in range(N):
    A[i] /= sum(A[i])
    B[i] /= sum(B[i])
print(A)
# At this point all model parameters pi, A, B have been estimated
array([[5.16155673e-04, 0.00000000e+00, 0.00000000e+00, ...,
5.16155673e-05, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 9.99801725e-01, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 1.45846544e-02, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
...,
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
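A quick way to peek at the estimated parameters (my addition, not part of the original post):

# Most likely sentence-initial tag, and the word most often emitted as "NNP"
print(id2tag[int(np.argmax(pi))])
j = tag2id["NNP"]                        # assumes "NNP" occurs in the tag set
print(id2word[int(np.argmax(A[j]))])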
Define a log function with smoothing: np.log(0) evaluates to -inf, so a tiny constant is added to keep the Viterbi scores finite.
def log(v):
    if v == 0:
        return np.log(v + 0.000001)
    return np.log(v)
def viterbi(x, pi, A, B):
    """
    x:  the input sentence, e.g. "I like playing soccer"
    pi: initial probability of each tag
    A:  probability of each word given a tag (emission)
    B:  transition probabilities between tags
    """
    x = [word2id[word] for word in x.split(" ")]   # x: [4521, 412, 542, ...]
    T = len(x)

    dp = np.zeros((T, N))    # dp[i][j]: best log score of w1...wi assuming wi has the j-th tag
    ptr = np.array([[0 for x in range(N)] for y in range(T)])   # T*N backpointers
    # TODO: ptr = np.zeros((T, N), dtype=int)

    for j in range(N):                   # base case of the DP
        dp[0][j] = log(pi[j]) + log(A[j][x[0]])

    for i in range(1, T):                # for each word
        for j in range(N):               # for each tag
            # TODO: the loop below can be vectorized for efficiency
            dp[i][j] = -9999999
            for k in range(N):           # every tag k that could precede j
                score = dp[i-1][k] + log(B[k][j]) + log(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k

    # decoding: recover and print the best tag sequence
    best_seq = [0] * T                   # e.g. best_seq = [1, 5, 2, 23, 4, ...]
    # step 1: the best tag for the last word
    best_seq[T-1] = np.argmax(dp[T-1])
    # step 2: walk backwards to recover each earlier word's tag
    for i in range(T-2, -1, -1):         # T-2, ..., 1, 0
        best_seq[i] = ptr[i+1][best_seq[i+1]]

    # best_seq now holds the tag sequence for x
    for i in range(len(best_seq)):
        print(id2tag[best_seq[i]])
x = "Social Security number , passport number and details about the services provided for the payment"
viterbi(x, pi, A, B)
MD
IN
VB
NNP
MD
VB
CD
VBD
NN
MD
VBD
VBZ
NN
MD
VB
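The TODO inside viterbi notes that the loop over the previous tag k can be vectorized. A sketch of such a variant (it reuses the same global word2id, N and id2tag; the uniform eps smoothing and the nan_to_num guard are my assumptions and differ slightly from the log() helper above):

def viterbi_vec(x, pi, A, B):
    # Vectorized Viterbi: the inner loops over tags are replaced by numpy ops.
    eps = 1e-6
    A, B = np.nan_to_num(A), np.nan_to_num(B)   # guard against all-zero rows normalized to NaN
    log_pi, log_A, log_B = np.log(pi + eps), np.log(A + eps), np.log(B + eps)
    x = [word2id[w] for w in x.split(" ")]
    T = len(x)
    dp = np.zeros((T, N))
    ptr = np.zeros((T, N), dtype=int)
    dp[0] = log_pi + log_A[:, x[0]]
    for i in range(1, T):
        scores = dp[i-1][:, None] + log_B       # scores[k, j] = dp[i-1][k] + log B[k][j]
        ptr[i] = scores.argmax(axis=0)          # best previous tag k for every tag j
        dp[i] = scores.max(axis=0) + log_A[:, x[i]]
    best_seq = [int(dp[T-1].argmax())]
    for i in range(T-2, -1, -1):                # follow the backpointers
        best_seq.insert(0, int(ptr[i+1][best_seq[0]]))
    for t in best_seq:
        print(id2tag[t])

Calling viterbi_vec(x, pi, A, B) on the same sentence should print the same tag sequence, just faster when the tag set is large.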
Putting everything together, the complete script:

# -*- coding: utf-8 -*-
import numpy as np


def build_maps():
    # Build the word <-> id and tag <-> id mappings from the training data.
    global word2id, id2word, tag2id, id2tag
    for line in open("traindata.txt"):
        items = line.split("/")
        word, tag = items[0], items[1].rstrip()
        if word not in word2id:
            word2id[word] = len(word2id)
            id2word[len(id2word)] = word
        if tag not in tag2id:
            tag2id[tag] = len(tag2id)
            id2tag[len(id2tag)] = tag


# Estimate pi, A, B from counts
def pi_A_B():
    global pi, A, B
    prev_tag = ""
    for line in open("traindata.txt"):
        items = line.split("/")
        wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]
        if prev_tag == "":
            pi[tagId] += 1
            A[tagId][wordId] += 1
        else:
            A[tagId][wordId] += 1
            B[tag2id[prev_tag]][tagId] += 1
        if items[0] == ".":
            prev_tag = ""
        else:
            prev_tag = items[1].rstrip()
    pi = pi / sum(pi)
    for i in range(N):
        A[i] /= sum(A[i])
        B[i] /= sum(B[i])


def log(v):
    if v == 0:
        return np.log(v + 0.0001)
    return np.log(v)


def viterbi(x, pi, A, B):
    x = [word2id[word] for word in x.split(" ")]
    T = len(x)
    dp = np.zeros((T, N))
    ptr = np.array([[0 for x in range(N)] for y in range(T)])
    for j in range(N):
        dp[0][j] = log(pi[j]) + log(A[j][x[0]])
    for i in range(1, T):
        for j in range(N):
            dp[i][j] = -99999
            for k in range(N):
                score = dp[i-1][k] + log(B[k][j]) + log(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k
    best_seq = [0] * T
    best_seq[T-1] = np.argmax(dp[T-1])
    for i in range(T-2, -1, -1):
        best_seq[i] = ptr[i+1][best_seq[i+1]]
    for i in range(len(best_seq)):
        print(id2tag[best_seq[i]])


if __name__ == "__main__":
    tag2id, id2tag = {}, {}    # map tag <-> id, e.g. tag2id: {"VB": 0, "NNP": 1}, id2tag: {0: "VB", 1: "NNP"}
    word2id, id2word = {}, {}  # map word <-> id
    build_maps()
    M = len(word2id)   # M: size of the vocabulary (# of words in dictionary)
    N = len(tag2id)    # N: number of distinct tags (# of tags in tag set)
    pi = np.zeros(N)
    A = np.zeros((N, M))
    B = np.zeros((N, N))
    pi_A_B()
    x = "Social Security number , passport number and details about the services provided for the payment"
    viterbi(x, pi, A, B)
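One practical gap in the script: word2id[word] raises a KeyError for any word that never occurs in traindata.txt. A minimal workaround (my addition, not part of the original code) is to map unknown words to a sentinel id and give them a uniform emission probability in the decoder:

UNK = -1   # sentinel id for out-of-vocabulary words (hypothetical)

def encode(sentence):
    # Fall back to UNK instead of raising KeyError on unseen words.
    return [word2id.get(w, UNK) for w in sentence.split(" ")]

def emission(j, word_id):
    # log A[j][word] for known words, a uniform log(1/M) for unknown ones.
    return -np.log(M) if word_id == UNK else log(A[j][word_id])

Inside viterbi, the line x = [word2id[word] for word in x.split(" ")] would then become x = encode(x), and every log(A[j][x[i]]) would be replaced by emission(j, x[i]).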
jieba performs POS tagging at the same time as segmentation, so tagging speed is not a problem. Tags for recognized words come from a dictionary lookup, while tags for out-of-vocabulary words come from the HMM, which together tag the whole sentence. As noted above, though, dictionary lookup cannot resolve words with more than one possible POS, i.e. POS ambiguity, so some accuracy is still lost.
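For reference, jieba exposes this combined behaviour through its posseg module; a minimal usage example (the exact tags printed depend on jieba's bundled dictionary and HMM model):

import jieba.posseg as pseg

# Segment and tag in a single pass: dictionary words get their stored tag,
# words recovered by the HMM get a tag predicted by jieba's built-in model.
for word, flag in pseg.cut("我爱自然语言处理"):
    print(word, flag)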