赞
踩
- 目标:词性标注
-
- s = w1w2w3......wn 单词
-
- z = (z1z2......zn) 词性
-
-
- 目的:argmax p(z|s) -> Noisy Channel Model
-
- = argmax p(s|z) p(z)
-
-
- p(s|z) - Translation Model
-
- p(z) - Language Model
-
- = argmax p(w1w2...wn|z1z2....zn)p(z1z2....zn) (假设条件独立)
-
- = argmax p(w1|z1) p(w2|z2) ..... p(wn|zn)p(z1)p(z2|z1)p(z3|z1z2)......
-
- (马尔科夫假设)
-
- = argmax Pi(i=1..n) P(wi|zi) * p(z1)p(z2|z1)p(z3|z2)......
-
- => argmax logpi(i=1...n)p(wi|zi)p(z1) pi(i=2..n)p(zj|zj-1)
- = argmax sum(i=1..n) log p(wi|zi) + logp(z1) + sum(j=2..n)logp(zj|zj-1)
-
- z' = argmax sum(i=1..n) log p(wi|zi) + logp(z1) + sum(j=2..n)logp(zj|zj-1)
-
-
-
- "计算pi、A、B代码,traindata.txt文件数据见文章结尾"
-
- tag2id, id2tag = {}, {} # maps tag to id . tag2id: {"VB": 0, "NNP":1,..} , id2tag: {0: "VB", 1: "NNP"....}
- word2id, id2word = {}, {} # maps word to id
-
- for line in open('traindata.txt'):
- items = line.split('/')
- word, tag = items[0], items[1].rstrip() # 抽取每一行里的单词和词性
-
- if word not in word2id:
- word2id[word] = len(word2id)
- id2word[len(id2word)] = word
- if tag not in tag2id:
- tag2id[tag] = len(tag2id)
- id2tag[len(id2tag)] = tag
-
- M = len(word2id) # M: 词典的大小、# of words in dictionary
- N = len(tag2id) # N: 词性的种类个数 # of tags in tag set
-
-
- # 构建 pi, A, B
- import numpy as np
- pi = np.zeros(N) # 每个词性出现在句子中第一个位置的概率, N: # of tags pi[i]: tag i出现在句子中第一个位置的概率
- A = np.zeros((N, M)) # A[i][j]: 给定tag i, 出现单词j的概率。 N: # of tags M: # of words in dictionary
- B = np.zeros((N,N)) # B[i][j]: 之前的状态是i, 之后转换成转态j的概率 N: # of tags
-
-
- prev_tag = ""
- for line in open('traindata.txt'):
- items = line.split('/')
- wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]
- if prev_tag == "": # 这意味着是句子的开始
- pi[tagId] += 1
- A[tagId][wordId] += 1
- else: # 如果不是句子的开头
- A[tagId][wordId] += 1
- B[tag2id[prev_tag]][tagId] += 1
-
- if items[0] == ".":
- prev_tag = ""
- else:
- prev_tag = items[1].rstrip()
-
- # normalize
- pi = pi/sum(pi)
- for i in range(N):
- A[i] /= sum(A[i])
- B[i] /= sum(B[i])
-
- # 到此为止计算完了模型的所有的参数: pi, A, B
-
-
-
知道了pi、A、B,需要求出最优的z
维特比算法最终为一个动态规划寻找最优路径的问题,最终代码如下:
- def log(v):
- if v == 0:
- return np.log(v+0.000001)
- return np.log(v)
-
- def viterbi(x, pi, A, B):
- """
- x: user input string/sentence: x: "I like playing soccer"
- pi: initial probability of tags
- A: 给定tag, 每个单词出现的概率
- B: tag之间的转移概率
- """
- x = [word2id[word] for word in x.split(" ")] # x: [4521, 412, 542 ..]
- T = len(x)
-
- dp = np.zeros((T,N)) # dp[i][j]: w1...wi, 假设wi的tag是第j个tag
- ptr = np.array([[0 for x in range(N)] for y in range(T)] ) # T*N
- # TODO: ptr = np.zeros((T,N), dtype=int)
-
- for j in range(N): # basecase for DP算法
- dp[0][j] = log(pi[j]) + log(A[j][x[0]])
-
- for i in range(1,T): # 每个单词
- for j in range(N): # 每个词性
- # TODO: 以下几行代码可以写成一行(vectorize的操作, 会使得效率变高)
- dp[i][j] = -9999999
- for k in range(N): # 从每一个k可以到达j
- score = dp[i-1][k] + log(B[k][j]) + log(A[j][x[i]])
- if score > dp[i][j]:
- dp[i][j] = score
- ptr[i][j] = k
-
- # decoding: 把最好的tag sequence 打印出来
- best_seq = [0]*T # best_seq = [1,5,2,23,4,...]
- # step1: 找出对应于最后一个单词的词性
- best_seq[T-1] = np.argmax(dp[T-1])
-
- # step2: 通过从后到前的循环来依次求出每个单词的词性
- for i in range(T-2, -1, -1): # T-2, T-1,... 1, 0
- best_seq[i] = ptr[i+1][best_seq[i+1]]
-
- # 到目前为止, best_seq存放了对应于x的 词性序列
- for i in range(len(best_seq)):
- print (id2tag[best_seq[i]])
-
-
最终验证,输入一句话,可以得出对应的词性:
- x = "Social Security number , passport number and details about the services provided for the payment"
- print(viterbi(x, pi, A, B))
-
- NNP
- NNP
- NN
- ,
- NN
- NN
- CC
- NNS
- IN
- DT
- NNS
- VBN
- IN
- DT
- NN
- traindata.txt 部分训练语料如下所示:
-
- Newsweek/NNP
- ,/,
- trying/VBG
- to/TO
- keep/VB
- pace/NN
- with/IN
- rival/JJ
- Time/NNP
- magazine/NN
- ,/,
- announced/VBD
- new/JJ
- advertising/NN
- rates/NNS
- for/IN
- 1990/CD
- and/CC
- said/VBD
- it/PRP
- will/MD
- introduce/VB
- a/DT
- new/JJ
- incentive/NN
- plan/NN
- for/IN
- advertisers/NNS
- ./.
- The/DT
- new/JJ
- ad/NN
- plan/NN
- from/IN
- Newsweek/NNP
- ,/,
- a/DT
- unit/NN
- of/IN
- the/DT
- Washington/NNP
- Post/NNP
- Co./NNP
- ,/,
- is/VBZ
- the/DT
- second/JJ
- incentive/NN
- plan/NN
- the/DT
- magazine/NN
- has/VBZ
- offered/VBN
- advertisers/NNS
- in/IN
- three/CD
- years/NNS
- ./.
- Plans/NNS
- that/WDT
- give/VBP
- advertisers/NNS
- discounts/NNS
- for/IN
- maintaining/VBG
- or/CC
- increasing/VBG
- ad/NN
- spending/NN
- have/VBP
- become/VBN
- permanent/JJ
- fixtures/NNS
- at/IN
- the/DT
- news/NN
- weeklies/NNS
- and/CC
- underscore/VBP
- the/DT
- fierce/JJ
- competition/NN
- between/IN
- Newsweek/NNP
- ,/,
- Time/NNP
- Warner/NNP
- Inc./NNP
- 's/POS
- Time/NNP
- magazine/NN
- ,/,
- and/CC
- Mortimer/NNP
- B./NNP
- Zuckerman/NNP
- 's/POS
- U.S./NNP
- News/NNP
- &/CC
- World/NNP
- Report/NNP
- ./.
- Alan/NNP
- Spoon/NNP
- ,/,
- recently/RB
- named/VBN
- Newsweek/NNP
- president/NN
- ,/,
- said/VBD
- Newsweek/NNP
- 's/POS
- ad/NN
- rates/NNS
- would/MD
- increase/VB
- 5/CD
- %/NN
- in/IN
- January/NNP
- ./.
- A/DT
- full/JJ
- ,/,
- four-color/JJ
- page/NN
- in/IN
- Newsweek/NNP
- will/MD
- cost/VB
- $/$
- 100,980/CD
- ./.
- In/IN
- mid-October/NNP
- ,/,
- Time/NNP
- magazine/NN
- lowered/VBD
- its/PRP$
- guaranteed/VBN
- circulation/NN
- rate/NN
- base/NN
- for/IN
- 1990/CD
- while/IN
- not/RB
- increasing/VBG
- ad/NN
- page/NN
- rates/NNS
- ;/:
- with/IN
- a/DT
- lower/JJR
- circulation/NN
- base/NN
- ,/,
- Time/NNP
- 's/POS
- ad/NN
- rate/NN
- will/MD
- be/VB
- effectively/RB
- 7.5/CD
- %/NN
- higher/JJR
- per/IN
- subscriber/NN
- ;/:
- a/DT
- full/JJ
- page/NN
- in/IN
- Time/NNP
- costs/VBZ
- about/IN
- $/$
- 120,000/CD
- ./.
- U.S./NNP
- News/NNP
- has/VBZ
- yet/RB
- to/TO
- announce/VB
- its/PRP$
- 1990/CD
- ad/NN
- rates/NNS
- ./.
- Newsweek/NNP
- said/VBD
- it/PRP
- will/MD
- introduce/VB
- the/DT
- Circulation/NNP
- Credit/NNP
- Plan/NNP
- ,/,
- which/WDT
- awards/VBZ
- space/NN
- credits/NNS
- to/TO
- advertisers/NNS
- on/IN
- ``/``
- renewal/NN
- advertising/NN
- ./.
- ''/''
- The/DT
- magazine/NN
- will/MD
- reward/VB
- with/IN
- ``/``
- page/NN
- bonuses/NNS
- ''/''
- advertisers/NNS
- who/WP
- in/IN
- 1990/CD
- meet/VBP
- or/CC
- exceed/VBP
- their/PRP$
- 1989/CD
- spending/NN
- ,/,
- as/RB
- long/RB
- as/IN
- they/PRP
- spent/VBD
- $/$
- 325,000/CD
- in/IN
- 1989/CD
- and/CC
- $/$
- 340,000/CD
- in/IN
- 1990/CD
- ./.
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。