NLP Basics 003: Part-of-Speech Tagging
Goal: part-of-speech (POS) tagging. Given a sentence $s = w_1 w_2 \cdots w_n$ (the words), find the tag sequence $z = (z_1, z_2, \dots, z_n)$ (the parts of speech).

Objective: $\arg\max_z p(z \mid s)$. By the Noisy Channel Model (Bayes' rule, dropping the constant $p(s)$):

$$
\arg\max_z p(z \mid s) = \arg\max_z p(s \mid z)\, p(z)
$$

where $p(s \mid z)$ is the translation model and $p(z)$ is the language model. Expanding:

$$
= \arg\max_z p(w_1 w_2 \cdots w_n \mid z_1 z_2 \cdots z_n)\, p(z_1 z_2 \cdots z_n)
$$

Assuming the words are conditionally independent given their tags, and applying the chain rule to $p(z)$:

$$
= \arg\max_z p(w_1 \mid z_1)\, p(w_2 \mid z_2) \cdots p(w_n \mid z_n)\, p(z_1)\, p(z_2 \mid z_1)\, p(z_3 \mid z_1 z_2) \cdots
$$

By the first-order Markov assumption, $p(z_j \mid z_1 \cdots z_{j-1}) = p(z_j \mid z_{j-1})$:

$$
= \arg\max_z \prod_{i=1}^{n} p(w_i \mid z_i) \cdot p(z_1) \prod_{j=2}^{n} p(z_j \mid z_{j-1})
$$

Taking logarithms (a monotonic transform, so the argmax is unchanged) turns the product into a sum:

$$
z' = \arg\max_z \sum_{i=1}^{n} \log p(w_i \mid z_i) + \log p(z_1) + \sum_{j=2}^{n} \log p(z_j \mid z_{j-1})
$$
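This final objective is just a sum of log-probabilities, so scoring any candidate tag sequence is straightforward. Below is a minimal sketch of such a scoring function; the name score_sequence and the epsilon smoothing are illustrative additions, and pi, A, B are the parameter matrices estimated in the next section:

```python
import numpy as np

def score_sequence(word_ids, tag_ids, pi, A, B, eps=1e-6):
    """Evaluate log p(z1) + sum_i log p(wi|zi) + sum_j log p(zj|zj-1).
    eps guards against log(0), mirroring the log() helper used later."""
    score = np.log(pi[tag_ids[0]] + eps)                         # initial term log p(z1)
    for i in range(len(word_ids)):
        score += np.log(A[tag_ids[i]][word_ids[i]] + eps)        # emission term log p(wi|zi)
        if i > 0:
            score += np.log(B[tag_ids[i-1]][tag_ids[i]] + eps)   # transition term log p(zj|zj-1)
    return score
```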

  1. "计算pi、A、B代码,traindata.txt文件数据见文章结尾"
  2. tag2id, id2tag = {}, {} # maps tag to id . tag2id: {"VB": 0, "NNP":1,..} , id2tag: {0: "VB", 1: "NNP"....}
  3. word2id, id2word = {}, {} # maps word to id
  4. for line in open('traindata.txt'):
  5. items = line.split('/')
  6. word, tag = items[0], items[1].rstrip() # 抽取每一行里的单词和词性
  7. if word not in word2id:
  8. word2id[word] = len(word2id)
  9. id2word[len(id2word)] = word
  10. if tag not in tag2id:
  11. tag2id[tag] = len(tag2id)
  12. id2tag[len(id2tag)] = tag
  13. M = len(word2id) # M: 词典的大小、# of words in dictionary
  14. N = len(tag2id) # N: 词性的种类个数 # of tags in tag set
  15. # 构建 pi, A, B
  16. import numpy as np
  17. pi = np.zeros(N) # 每个词性出现在句子中第一个位置的概率, N: # of tags pi[i]: tag i出现在句子中第一个位置的概率
  18. A = np.zeros((N, M)) # A[i][j]: 给定tag i, 出现单词j的概率。 N: # of tags M: # of words in dictionary
  19. B = np.zeros((N,N)) # B[i][j]: 之前的状态是i, 之后转换成转态j的概率 N: # of tags
  20. prev_tag = ""
  21. for line in open('traindata.txt'):
  22. items = line.split('/')
  23. wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]
  24. if prev_tag == "": # 这意味着是句子的开始
  25. pi[tagId] += 1
  26. A[tagId][wordId] += 1
  27. else: # 如果不是句子的开头
  28. A[tagId][wordId] += 1
  29. B[tag2id[prev_tag]][tagId] += 1
  30. if items[0] == ".":
  31. prev_tag = ""
  32. else:
  33. prev_tag = items[1].rstrip()
  34. # normalize
  35. pi = pi/sum(pi)
  36. for i in range(N):
  37. A[i] /= sum(A[i])
  38. B[i] /= sum(B[i])
  39. # 到此为止计算完了模型的所有的参数: pi, A, B
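As a quick sanity check on the estimated parameters (an addition, not part of the original code): pi should sum to 1, and each row of A and B should be a probability distribution over words and next tags, respectively.

```python
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)  # each tag's emission distribution sums to 1
assert np.allclose(B.sum(axis=1), 1.0)  # each tag's transition distribution sums to 1
```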

With pi, A, and B estimated, we still need to find the optimal tag sequence z.

The Viterbi algorithm casts this as a dynamic-programming search for the best path; the code is as follows:

```python
def log(v):
    # avoid log(0): add a tiny epsilon when v == 0
    if v == 0:
        return np.log(v + 0.000001)
    return np.log(v)

def viterbi(x, pi, A, B):
    """
    x: user input string/sentence, e.g. x: "I like playing soccer"
    pi: initial probability of each tag
    A: emission probabilities -- probability of each word given a tag
    B: transition probabilities between tags
    """
    x = [word2id[word] for word in x.split(" ")]  # x: [4521, 412, 542, ...]
    T = len(x)

    dp = np.zeros((T, N))              # dp[i][j]: best log-score of w1..wi with wi tagged as tag j
    ptr = np.zeros((T, N), dtype=int)  # T*N backpointer table

    for j in range(N):  # base case of the DP
        dp[0][j] = log(pi[j]) + log(A[j][x[0]])

    for i in range(1, T):   # for each word
        for j in range(N):  # for each tag
            # TODO: the inner loop below can be vectorized for efficiency
            dp[i][j] = -9999999
            for k in range(N):  # each tag k can transition to j
                score = dp[i-1][k] + log(B[k][j]) + log(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k

    # decoding: recover and print the best tag sequence
    best_seq = [0] * T  # e.g. best_seq = [1, 5, 2, 23, 4, ...]
    # step 1: find the tag of the last word
    best_seq[T-1] = np.argmax(dp[T-1])
    # step 2: walk backwards to recover the tag of each earlier word
    for i in range(T-2, -1, -1):  # T-2, T-3, ..., 1, 0
        best_seq[i] = ptr[i+1][best_seq[i+1]]

    # best_seq now holds the tag sequence corresponding to x
    for i in range(len(best_seq)):
        print(id2tag[best_seq[i]])
```
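The vectorization TODO above can be realized with one NumPy broadcast per position: dp[i-1][:, None] + log B is an N×N matrix whose entry (k, j) scores reaching tag j from tag k, and its column-wise max/argmax fill dp[i] and ptr[i] at once. A sketch under the same global names (word2id, id2tag, N); note it smooths all probabilities with a fixed epsilon rather than only the zeros, so scores can differ marginally from viterbi():

```python
def viterbi_vectorized(x, pi, A, B, eps=1e-6):
    """Same DP as viterbi(), with the two inner loops replaced by broadcasting."""
    x = [word2id[word] for word in x.split(" ")]
    T = len(x)
    log_pi, log_A, log_B = np.log(pi + eps), np.log(A + eps), np.log(B + eps)
    dp = np.zeros((T, N))
    ptr = np.zeros((T, N), dtype=int)
    dp[0] = log_pi + log_A[:, x[0]]                  # base case
    for i in range(1, T):
        scores = dp[i-1][:, None] + log_B            # scores[k, j] = dp[i-1][k] + log B[k][j]
        ptr[i] = scores.argmax(axis=0)               # best previous tag k for each tag j
        dp[i] = scores.max(axis=0) + log_A[:, x[i]]  # add the emission term once per tag j
    best_seq = [int(np.argmax(dp[T-1]))]
    for i in range(T-2, -1, -1):
        best_seq.insert(0, int(ptr[i+1][best_seq[0]]))
    return [id2tag[t] for t in best_seq]
```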

Final verification: given an input sentence, the corresponding POS tags are produced:

  1. x = "Social Security number , passport number and details about the services provided for the payment"
  2. print(viterbi(x, pi, A, B))
  3. NNP
  4. NNP
  5. NN
  6. ,
  7. NN
  8. NN
  9. CC
  10. NNS
  11. IN
  12. DT
  13. NNS
  14. VBN
  15. IN
  16. DT
  17. NN
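One caveat: viterbi() looks each input word up in word2id, so any word that never appeared in traindata.txt raises a KeyError. A simple pre-flight check (an addition, not part of the original code):

```python
unknown = [w for w in x.split(" ") if w not in word2id]
if unknown:
    print("out-of-vocabulary words:", unknown)  # viterbi() would fail on these
```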
A portion of the traindata.txt training corpus is shown below:

```
Newsweek/NNP
,/,
trying/VBG
to/TO
keep/VB
pace/NN
with/IN
rival/JJ
Time/NNP
magazine/NN
,/,
announced/VBD
new/JJ
advertising/NN
rates/NNS
for/IN
1990/CD
and/CC
said/VBD
it/PRP
will/MD
introduce/VB
a/DT
new/JJ
incentive/NN
plan/NN
for/IN
advertisers/NNS
./.
The/DT
new/JJ
ad/NN
plan/NN
from/IN
Newsweek/NNP
,/,
a/DT
unit/NN
of/IN
the/DT
Washington/NNP
Post/NNP
Co./NNP
,/,
is/VBZ
the/DT
second/JJ
incentive/NN
plan/NN
the/DT
magazine/NN
has/VBZ
offered/VBN
advertisers/NNS
in/IN
three/CD
years/NNS
./.
Plans/NNS
that/WDT
give/VBP
advertisers/NNS
discounts/NNS
for/IN
maintaining/VBG
or/CC
increasing/VBG
ad/NN
spending/NN
have/VBP
become/VBN
permanent/JJ
fixtures/NNS
at/IN
the/DT
news/NN
weeklies/NNS
and/CC
underscore/VBP
the/DT
fierce/JJ
competition/NN
between/IN
Newsweek/NNP
,/,
Time/NNP
Warner/NNP
Inc./NNP
's/POS
Time/NNP
magazine/NN
,/,
and/CC
Mortimer/NNP
B./NNP
Zuckerman/NNP
's/POS
U.S./NNP
News/NNP
&/CC
World/NNP
Report/NNP
./.
Alan/NNP
Spoon/NNP
,/,
recently/RB
named/VBN
Newsweek/NNP
president/NN
,/,
said/VBD
Newsweek/NNP
's/POS
ad/NN
rates/NNS
would/MD
increase/VB
5/CD
%/NN
in/IN
January/NNP
./.
A/DT
full/JJ
,/,
four-color/JJ
page/NN
in/IN
Newsweek/NNP
will/MD
cost/VB
$/$
100,980/CD
./.
In/IN
mid-October/NNP
,/,
Time/NNP
magazine/NN
lowered/VBD
its/PRP$
guaranteed/VBN
circulation/NN
rate/NN
base/NN
for/IN
1990/CD
while/IN
not/RB
increasing/VBG
ad/NN
page/NN
rates/NNS
;/:
with/IN
a/DT
lower/JJR
circulation/NN
base/NN
,/,
Time/NNP
's/POS
ad/NN
rate/NN
will/MD
be/VB
effectively/RB
7.5/CD
%/NN
higher/JJR
per/IN
subscriber/NN
;/:
a/DT
full/JJ
page/NN
in/IN
Time/NNP
costs/VBZ
about/IN
$/$
120,000/CD
./.
U.S./NNP
News/NNP
has/VBZ
yet/RB
to/TO
announce/VB
its/PRP$
1990/CD
ad/NN
rates/NNS
./.
Newsweek/NNP
said/VBD
it/PRP
will/MD
introduce/VB
the/DT
Circulation/NNP
Credit/NNP
Plan/NNP
,/,
which/WDT
awards/VBZ
space/NN
credits/NNS
to/TO
advertisers/NNS
on/IN
``/``
renewal/NN
advertising/NN
./.
''/''
The/DT
magazine/NN
will/MD
reward/VB
with/IN
``/``
page/NN
bonuses/NNS
''/''
advertisers/NNS
who/WP
in/IN
1990/CD
meet/VBP
or/CC
exceed/VBP
their/PRP$
1989/CD
spending/NN
,/,
as/RB
long/RB
as/IN
they/PRP
spent/VBD
$/$
325,000/CD
in/IN
1989/CD
and/CC
$/$
340,000/CD
in/IN
1990/CD
./.
```

 
