
Counting unigram and bigram word frequencies


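In natural language processing, we often need n-gram language models.

To work with them, we need a few concepts related to Chinese word segmentation, for example: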

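unigram (1-gram segmentation): split the sentence into individual Chinese characters
bigram (2-gram segmentation): from the start of the sentence to the end, group every two adjacent characters into one unit
trigram (3-gram segmentation): from the start of the sentence to the end, group every three adjacent characters into one unit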
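The three definitions can be illustrated with a minimal sketch. It assumes the usual sliding-window reading, in which consecutive units overlap by n-1 characters; the function name `char_ngrams` is hypothetical and used only for illustration.

```python
def char_ngrams(sentence, n):
    """Return the overlapping n-character windows of a sentence (assumed reading)."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


sentence = "自然语言处理"
print(char_ngrams(sentence, 1))  # ['自', '然', '语', '言', '处', '理']
print(char_ngrams(sentence, 2))  # ['自然', '然语', '语言', '言处', '处理']
print(char_ngrams(sentence, 3))  # ['自然语', '然语言', '语言处', '言处理']
```

A sentence shorter than n characters yields an empty list, since no full window fits.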

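Let's do a simple exercise:

The input is text that has already been segmented into words, one sentence per line. The units here are the words of the segmented text, not single characters.
Count the frequencies of the word unigrams and bigrams, and write them to the two files `data.uni` and `data.bi` respectively.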

    #!/usr/bin/env python3
    import sys


    class NGram(object):
        def __init__(self, n):
            # n is the order of the n-gram language model (1 or 2 here)
            self.n = n
            self.unigram = {}
            self.bigram = {}

        # Scan the sentences, extract the n-grams, update their
        # frequencies, and write the counts to the output file.
        #
        # @param sentence list{str}  one segmented sentence per item
        # @return none
        def scan(self, sentence):
            for line in sentence:
                self.ngram(line.split())
            # unigram
            if self.n == 1:
                self.write_counts(self.unigram, "data.uni")
            # bigram
            if self.n == 2:
                self.write_counts(self.bigram, "data.bi")

        # Write one "ngram count" line per entry to the given file.
        # UTF-8 is assumed for the output encoding.
        #
        # @param counts dict{str: int}
        # @param path str
        # @return none
        def write_counts(self, counts, path):
            try:
                with open(path, "w", encoding="utf-8") as fout:
                    for key in counts:
                        fout.write("%s %d\n" % (key, counts[key]))
            except IOError:
                print("failed to open %s" % path, file=sys.stderr)

        # Count the n-grams of one sentence.
        #
        # @param words list{str}
        # @return none
        def ngram(self, words):
            # unigram: every word counts once
            if self.n == 1:
                for word in words:
                    self.unigram[word] = self.unigram.get(word, 0) + 1
            # bigram: every pair of adjacent words; pairs overlap,
            # so "a b c" yields "a b" and "b c"
            if self.n == 2:
                for first, second in zip(words, words[1:]):
                    key = first + " " + second
                    self.bigram[key] = self.bigram.get(key, 0) + 1


    if __name__ == "__main__":
        try:
            # UTF-8 is assumed for the input encoding
            fip = open(sys.argv[1], "r", encoding="utf-8")
        except (IndexError, IOError):
            print("failed to open input file", file=sys.stderr)
            sys.exit(1)
        with fip:
            # keep non-empty lines only
            sentence = [line.strip() for line in fip if line.strip()]
        uni = NGram(1)
        bi = NGram(2)
        uni.scan(sentence)
        bi.scan(sentence)
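To sanity-check the contents of `data.uni` and `data.bi` on a small input, the counting can also be sketched in memory with `collections.Counter` from the standard library. This is only a sketch under stated assumptions: bigrams are taken as overlapping pairs of adjacent words, and the function name `count_ngrams` and the two sample sentences are made up for illustration.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count word n-grams over pre-segmented sentences (words separated by spaces)."""
    counts = Counter()
    for line in sentences:
        words = line.split()
        # slide a window of n words over the sentence
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


# hypothetical sample input: two segmented sentences
sentences = ["我 爱 自然 语言", "我 爱 学习"]
uni = count_ngrams(sentences, 1)
bi = count_ngrams(sentences, 2)
print(uni["我"], uni["爱"], uni["自然"])  # 2 2 1
print(bi["我 爱"], bi["爱 自然"])  # 2 1
```

`Counter` returns 0 for keys it has never seen, so looking up an n-gram that is absent from the input does not raise an error.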

