First, by the chain rule of probability we have
$$P(w_{1},w_{2},\cdots,w_{n})=P(w_{1})P(w_{2}|w_{1})\cdots P(w_{n}|w_{1},w_{2},\cdots,w_{n-1})$$
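As a concrete instance (using words from the toy corpus introduced later), the chain rule factorizes a three-word sentence as

$$P(I,am,Sam)=P(I)\,P(am|I)\,P(Sam|I,am)$$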
How do we compute each word's conditional probability? The answer is maximum likelihood estimation (MLE): when the sample is large enough, relative frequencies approximate probabilities. This means the corpus needs to be reasonably large, and the probabilities are then estimated simply by counting:
$$P(w_{i}|w_{i-1})=\frac{P(w_{i-1}w_{i})}{P(w_{i-1})}=\frac{C(w_{i-1}w_{i})}{C(w_{i-1})}$$
$$\cdots$$
$$P(w_{i}|w_{i-1}\cdots w_{2}w_{1})=\frac{P(w_{1}\cdots w_{i-1}w_{i})}{P(w_{1}w_{2}\cdots w_{i-1})}=\frac{C(w_{1}w_{2}\cdots w_{i})}{C(w_{1}w_{2}\cdots w_{i-1})}$$
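A minimal generic sketch of these count-based MLE estimates in Python; the token sequence and the helper names `count_ngrams` / `mle_prob` are illustrative assumptions, not from the original post:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(history, word, counts_n, counts_hist):
    """P(word | history) = C(history word) / C(history)."""
    denom = counts_hist[tuple(history)]
    return counts_n[tuple(history) + (word,)] / denom if denom else 0.0

tokens = "I am Sam Sam I am".split()
bigrams = count_ngrams(tokens, 2)
unigrams = count_ngrams(tokens, 1)
print(mle_prob(["I"], "am", bigrams, unigrams))  # C(I am) / C(I) = 2/2 = 1.0
```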
Here $C(\cdot)$ denotes the number of times a subsequence occurs in the training corpus.
For an arbitrarily long natural-language sentence, directly computing $P(w_{i}|w_{1}w_{2}\cdots w_{i-1})$ is clearly impractical.
To solve this problem we introduce the Markov assumption: the probability of the current word depends only on the previous $n-1$ words, which gives

$$P(w_{i}|w_{1}w_{2}\cdots w_{i-1})=P(w_{i}|w_{i-n+1}\cdots w_{i-1})$$
Based on the equation above, the n-gram language model is defined as follows:
$$n=1\ \ \text{unigram}:\ P(w_{1},w_{2},\cdots,w_{n})=\prod_{i=1}^{n}P(w_{i})$$
$$n=2\ \ \text{bigram}:\ P(w_{1},w_{2},\cdots,w_{n})=\prod_{i=1}^{n}P(w_{i}|w_{i-1})$$
$$n=3\ \ \text{trigram}:\ P(w_{1},w_{2},\cdots,w_{n})=\prod_{i=1}^{n}P(w_{i}|w_{i-2},w_{i-1})$$
$$\cdots$$
Here, when $n>1$, one or more start symbols <s> must be prepended to the sequence so that the conditional probability of the sentence-initial word is well defined. In other words, the start symbol exists precisely to represent the conditional probability of a word appearing at the start of a sentence.
Taking the bigram model as an example, the sequence now becomes $<s>, w_{1}, w_{2}, \cdots, w_{n}$. At this point some readers will surely ask: doesn't the usual n-gram sequence also carry an end symbol </s>? What is that symbol for?
We use a tiny corpus of three sentences, adding the start symbol <s> and the end symbol </s> at the beginning and end of each sentence. Here is the corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Using the bigram model, some of the probabilities are:
$$P(I|<s>)=\dfrac{2}{3},\quad P(Sam|<s>)=\dfrac{1}{3},\quad P(am|I)=\dfrac{2}{3}$$
$$P(</s>|Sam)=\dfrac{1}{2},\quad P(Sam|am)=\dfrac{1}{2},\quad P(do|I)=\dfrac{1}{3}$$
That completes the bigram probability estimates; the probability of a whole sentence is obtained by picking out the corresponding factors and multiplying them. Now, why do we add the start symbol <s> and the end symbol </s> to every sentence? The purpose is to make the probabilities conditioned on any given word sum to 1, so that the model defines a proper probability distribution. For example, in the corpus above the words that can follow $Sam$ are </s> and $I$, and indeed $P(I|Sam)+P(</s>|Sam)=\dfrac{1}{2}+\dfrac{1}{2}=1$. Of course, for a word such as $I$ this issue does not arise even without the end symbol, but the conclusion is that the end symbol must always be added.
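For example, the bigram probability of the first sentence is just the product of its factors:

$$P(<s>\ I\ am\ Sam\ </s>)=P(I|<s>)\,P(am|I)\,P(Sam|am)\,P(</s>|Sam)=\dfrac{2}{3}\cdot\dfrac{2}{3}\cdot\dfrac{1}{2}\cdot\dfrac{1}{2}=\dfrac{1}{9}$$

A minimal Python sketch that reproduces these numbers from the toy corpus (the helper name `bigram_prob` is an illustrative choice, not code from the original post):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)                  # C(w)
    bigram_counts.update(zip(tokens, tokens[1:]))  # C(w_{i-1} w_i)

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))     # 2/3
print(bigram_prob("I", "am"))      # 2/3
print(bigram_prob("Sam", "</s>"))  # 1/2
```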
The calculation above has a serious problem: with a finite corpus, we will very likely encounter words or phrases at prediction time that were never seen in training, i.e. out-of-vocabulary (OOV) items. Since the estimates are just frequencies, this makes the numerator or the denominator of the formula equal to 0. A zero numerator drives the whole sentence probability, which is a product of factors, to 0; a zero denominator makes the expression mathematically undefined. How do we deal with this?
There are several approaches:
Smoothing: add a small value to the count of every $w$, with the goal of keeping the sum of all n-gram probabilities equal to 1 while ensuring that no n-gram probability is 0. Common smoothing methods:
Laplace Smoothing
Add-one: force every n-gram to occur at least once, which simply means adding to the numerator and the denominator. The drawback is that most n-grams never occur at all, so this easily assigns them too much probability mass.
$$P(w_{n}|w_{n-1})=\dfrac{C(w_{n-1}w_{n})+1}{C(w_{n-1})+|V|}$$
Add-K: a small modification of Add-one where, instead of adding 1, we add a constant K smaller than 1. The drawback is that K still has to be chosen manually and may differ from corpus to corpus.
$$P(w_{n}|w_{n-1})=\dfrac{C(w_{n-1}w_{n})+k}{C(w_{n-1})+k|V|}$$
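A minimal sketch of add-k smoothing (add-one when k=1), building on the `bigram_counts` and `unigram_counts` from the toy-corpus sketch above; the helper name and the way |V| is taken from the unigram table are illustrative assumptions:

```python
def addk_bigram_prob(prev, word, k=1.0):
    """Add-k smoothed estimate: (C(prev word) + k) / (C(prev) + k * |V|)."""
    vocab_size = len(unigram_counts)  # |V|: distinct token types seen (incl. <s>, </s> here)
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

print(addk_bigram_prob("I", "am"))              # add-one smoothing: (2+1)/(3+12) = 0.2
print(addk_bigram_prob("green", "Sam", k=0.1))  # unseen bigram still gets a small non-zero probability
```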
Other smoothing methods include:
- Good-Turing smoothing
- Jelinek-Mercer smoothing (interpolation)
- Katz smoothing
- Witten-Bell smoothing
- Absolute discounting
- Kneser-Ney smoothing
Backoff (Katz backoff): fall back from the N-gram to the (N-1)-gram when the higher-order statistics are unavailable, e.g. $Count(the,dog)\sim Count(dog)$.
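A crude sketch of the backoff idea, again reusing the counts from the toy-corpus sketch above. This is closer to a simple "stupid backoff" than to full Katz backoff, which also discounts and redistributes probability mass; the 0.4 back-off weight and the helper name are illustrative assumptions:

```python
def backoff_bigram_prob(prev, word, alpha=0.4):
    """Use the bigram MLE estimate if the bigram was seen,
    otherwise back off to a scaled unigram estimate."""
    if bigram_counts[(prev, word)] > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    total_tokens = sum(unigram_counts.values())
    return alpha * unigram_counts[word] / total_tokens  # fall back to P(word)

print(backoff_bigram_prob("I", "am"))       # bigram seen: plain MLE, 2/3
print(backoff_bigram_prob("green", "Sam"))  # unseen bigram: 0.4 * P(Sam)
```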
To wrap up, the pros and cons of statistical n-gram language models:
Pros: simple and interpretable; the parameters are just corpus counts, so training and lookup are fast.
Cons: severe data sparsity (many n-grams are never observed, hence the OOV problem above) and no generalization across similar words; dependencies longer than n-1 words cannot be captured, and the model grows rapidly as n increases.