In natural language processing we often need n-gram language models. Before we start, a few concepts from Chinese word segmentation are worth knowing, for example:

unigram: split the sentence into individual characters
bigram: walk the sentence from start to end, taking every two adjacent characters as a unit
trigram: walk the sentence from start to end, taking every three adjacent characters as a unit
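To make the character-level idea concrete, here is a minimal sketch (the helper `char_ngrams` is hypothetical and not part of the exercise below; it assumes overlapping windows):

```python
def char_ngrams(sentence, n):
    """Return the consecutive n-character substrings of sentence (overlapping)."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

print(char_ngrams("自然语言处理", 1))  # unigrams: one character each
print(char_ngrams("自然语言处理", 2))  # bigrams: overlapping two-character spans
```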
Let's try a simple exercise:

The input is pre-segmented text, one sentence per line.
Count the frequencies of the word unigrams and bigrams, and write them to the files `data.uni` and `data.bi`, respectively.
```python
#!/usr/bin/env python3
import sys


class NGram(object):

    def __init__(self, n):
        # n is the order of the n-gram language model
        self.n = n
        self.unigram = {}
        self.bigram = {}

    # Scan a list of sentences, extract the n-grams, update their
    # frequencies, and write the counts to data.uni or data.bi.
    #
    # @param sentences list{str}
    # @return none
    def scan(self, sentences):
        for line in sentences:
            self.ngram(line.split())
        # unigram
        if self.n == 1:
            try:
                fip = open("data.uni", "w")
            except IOError:
                print("failed to open data.uni", file=sys.stderr)
                return
            for word in self.unigram:
                fip.write("%s %d\n" % (word, self.unigram[word]))
            fip.close()
        # bigram
        if self.n == 2:
            try:
                fip = open("data.bi", "w")
            except IOError:
                print("failed to open data.bi", file=sys.stderr)
                return
            for pair in self.bigram:
                fip.write("%s %d\n" % (pair, self.bigram[pair]))
            fip.close()

    # Calculate the n-grams of the words and update their counts.
    #
    # @param words list{str}
    # @return none
    def ngram(self, words):
        # unigram
        if self.n == 1:
            for word in words:
                self.unigram[word] = self.unigram.get(word, 0) + 1

        # bigram: every pair of adjacent words, overlapping
        if self.n == 2:
            for first, second in zip(words, words[1:]):
                pair = first + " " + second
                self.bigram[pair] = self.bigram.get(pair, 0) + 1


if __name__ == "__main__":
    try:
        fip = open(sys.argv[1], "r")
    except IOError:
        sys.exit("failed to open input file")
    sentences = [line.strip() for line in fip if line.strip()]
    fip.close()
    uni = NGram(1)
    bi = NGram(2)
    uni.scan(sentences)
    bi.scan(sentences)
```
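The same counts can be produced more compactly with the standard library's `collections.Counter`; a sketch (the function name `count_ngrams` and the sample sentences are illustrative, not part of the exercise):

```python
from collections import Counter


def count_ngrams(sentences):
    """Count word unigrams and overlapping bigrams in pre-segmented sentences."""
    uni, bi = Counter(), Counter()
    for line in sentences:
        words = line.split()
        uni.update(words)
        bi.update(" ".join(pair) for pair in zip(words, words[1:]))
    return uni, bi


uni, bi = count_ngrams(["我 爱 北京", "我 爱 自然 语言 处理"])
print(uni["我"])    # 2
print(bi["我 爱"])  # 2
```

`Counter.most_common()` then gives the n-grams sorted by frequency, which is often what you want when inspecting the output files.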
