Autocomplete and Language Models
Overview
You will:
- Create a language model (LM) from a text corpus to:
  - Estimate the probability of word sequences
  - Estimate the probability of a word following a sequence of words
- Apply this concept to autocomplete a sentence with the most likely suggestions
![image-20220213160942471](https://img-blog.csdnimg.cn/img_convert/ff3eb38a220c4297cc807ff302c5a1f4.png)
Other applications of language models (LMs):
- Speech recognition: the language model tells us that the transcription is
  "I saw a van"
  rather than
  "eyes awe of an"
- Spelling correction: e.g. "He entered the ship to buy some groceries"; since P(entered the shop to buy) > P(entered the ship to buy), "shop" is the likely correction
- Augmentative communication: predict the most likely word from a menu for people unable to physically talk or sign (Newell et al., 1998)
Learning objectives on the way to sentence auto-complete:
- Process a text corpus into an N-gram language model that predicts the next word from the preceding words
- Out-of-vocabulary words: adapt the model to handle words not seen during training
- Smoothing for previously unseen N-grams: pretend each unseen N-gram occurred at least once
- Language model evaluation
N-grams and probabilities
Introduction to N-grams
An N-gram is a sequence of N words (the items can also be characters or other elements). When processing the corpus, punctuation is treated as words, while other special characters such as code are removed.
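As a concrete illustration (not from the course materials), here is a minimal Python sketch of cutting a tokenized sentence into N-grams with a sliding window; the helper name `ngrams` and the sample tokens are made up for this example.

```python
# Minimal sketch: cut a tokenized sentence into N-grams with a sliding window.
# The helper name `ngrams` and the sample tokens are illustrative only.

def ngrams(tokens, n):
    """Return the list of N-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "am", "happy", "because", "I", "am", "learning", "."]

print(ngrams(tokens, 1))  # unigrams: ('I',), ('am',), ...
print(ngrams(tokens, 2))  # bigrams:  ('I', 'am'), ('am', 'happy'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'am', 'happy'), ...
```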
Sequence probabilities
- Given a sentence, how do we compute its probability?
  P(the teacher drinks tea) = ?
- Recall conditional probability and the chain rule:
  P(the teacher drinks tea) = P(the) P(teacher|the) P(drinks|the teacher) P(tea|the teacher drinks)
  This is the ideal case, but in practice it has limitations.
Sentence not in corpus:
- Problem: the corpus almost never contains the exact sentence we are interested in, or even its longer sub-sequences!
  Input:
  the teacher drinks tea
  The chain rule gives P(the teacher drinks tea) = P(the) P(teacher|the) P(drinks|the teacher) P(tea|the teacher drinks), but a term like P(tea|the teacher drinks) must be estimated from counts of the exact phrases "the teacher drinks" and "the teacher drinks tea".
  As the sentence gets longer, the exact ordering of more and more words becomes less and less likely, so P(the teacher drinks tea) < P(drinks), and the counts needed to estimate the longer terms are usually zero.
Approximation of sequence probabilities
For P(the teacher drinks tea), instead of conditioning on all the words that come before "tea", we only look at the single previous word: P(tea|the teacher drinks) ≈ P(tea|drinks). This is the bigram approximation, which simplifies the formula: we apply the Markov assumption rather than looking back at all previous words.
- Markov assumption: only the last N words matter
- Bigram: P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})
- N-gram: P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-N+1} ... w_{n-1})
- Entire sentence modeled with bigrams: P(w_1 ... w_n) ≈ ∏ P(w_i | w_{i-1})
Starting and ending sentences
Start of sentence token <s>
the teacher drinks tea
P(the teacher drinks tea) ≈ P(the) P(teacher|the) P(drinks|teacher) P(tea|drinks)
The term P(the) has no context, so its bigram probability cannot be computed. To fix this, we prepend a special token <s>: "<s> the teacher drinks tea".
Now every word has a preceding word, so every term can be computed:
P(<s> the teacher drinks tea) ≈ P(the|<s>) P(teacher|the) P(drinks|teacher) P(tea|drinks)
The same principle applies to N-grams.
Start of sentence token <s> for N-grams:
- Trigram: P(the teacher drinks tea) ≈ P(the) P(teacher|the) P(drinks|the teacher) P(tea|teacher drinks)
  The missing context of the first terms is fixed by adding two start tokens:
  the teacher drinks tea => <s> <s> the teacher drinks tea
  so the sentence probability now becomes
  P(<s> <s> the teacher drinks tea) ≈ P(the|<s> <s>) P(teacher|<s> the) P(drinks|the teacher) P(tea|teacher drinks)
- N-gram model: add N-1 start tokens <s>
End of sentence token </s>
Motivation
![image-20220214162708562](https://img-blog.csdnimg.cn/img_convert/bb6ceea4653950b579079caf8a707347.png)
Summing C(drinks w) over all words w gives 1, while C(drinks) = 2, so the bigram probabilities conditioned on "drinks" no longer sum to 1. To keep using the simple count-based formula for conditional probabilities, we need to add an end-of-sentence token.
![image-20220214164556683](https://img-blog.csdnimg.cn/img_convert/f5cd7bfe97694aeb84af53c951774f9a.png)
![image-20220214165310904](https://img-blog.csdnimg.cn/img_convert/10bf1d15ce4400a642a12229830768fd.png)
![image-20220214165726464](https://img-blog.csdnimg.cn/img_convert/fae74305b07ab1d39d9b3c17aee127c3.png)
We want the probabilities of sentences of all lengths to sum to 1, as shown in the figure below.
![image-20220214165912460](https://img-blog.csdnimg.cn/img_convert/8dbfe08e83676f94b7d66ae06608a773.png)
Solution
- Bigram
  <s> the teacher drinks tea => <s> the teacher drinks tea </s>
  P(the|<s>) P(teacher|the) P(drinks|teacher) P(tea|drinks) P(</s>|tea)
  Corpus:
  <s> Lyn drinks chocolate </s>
  <s> John drinks </s>
- N-grams: still just one </s>
  E.g. trigram:
  the teacher drinks tea => <s> <s> the teacher drinks tea </s>
  (a short sketch of adding these tokens follows)
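A small sketch of how the N-1 start tokens and the single end token could be added before counting; the helper name `add_sentence_tokens` is illustrative, not from the assignment.

```python
# Sketch: prepend N-1 <s> tokens and append one </s> token before counting N-grams.
# The helper name is made up for this example.

def add_sentence_tokens(tokens, n):
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

sentence = "the teacher drinks tea".split()
print(add_sentence_tokens(sentence, 2))  # ['<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
print(add_sentence_tokens(sentence, 3))  # ['<s>', '<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
```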
Example-bigram
![image-20220214171741290](https://img-blog.csdnimg.cn/img_convert/c72246a8924000f0ca3d0dfb1ae5eeb4.png)
Now try to compute the probability that a sentence starts with "Lyn", i.e. P(Lyn|<s>): <s> appears three times in the corpus and "<s> Lyn" appears twice, so P(Lyn|<s>) = 2/3.
For the sentence <s> Lyn drinks chocolate </s>:
P(Lyn|<s>) = 2/3, P(drinks|Lyn) = 1/2, P(chocolate|drinks) = 1/2, P(</s>|chocolate) = 2/2.
The product is 1/6, lower than the 1/3 we might expect given that this is one of the three sentences in the corpus.
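The same 1/6 can be reproduced with a few lines of Python. The toy corpus below is assumed from the figure; what matters is that it is consistent with the counts used above (2/3, 1/2, 1/2, 2/2).

```python
# Sketch: bigram sentence probability on a toy corpus consistent with the counts above.
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>",   # assumed from the figure
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

tokens = "<s> Lyn drinks chocolate </s>".split()
prob = 1.0
for prev, word in zip(tokens, tokens[1:]):
    prob *= bigram_prob(prev, word)

print(prob)  # 2/3 * 1/2 * 1/2 * 1 = 1/6 ≈ 0.1667
```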
N-gram language model
Outline:
- Count matrix: collect the N-gram counts of the corpus into a matrix
- Probability matrix: convert the count matrix into a matrix of probabilities
- Language model
- Log probability to avoid underflow
- Generative language model
Count matrix
- Rows: unique corpus (N-1)-grams
- Columns: unique corpus words
- Bigram count matrix: slide a two-word window over the corpus and add one to the corresponding cell at each position (a sketch combining this with the probability matrix follows below)
Probability matrix
- Divide each cell by its row sum
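A minimal sketch of both steps using pandas; the toy corpus and variable names are illustrative.

```python
# Sketch: bigram count matrix and its row-normalized probability matrix.
import pandas as pd

corpus = "<s> I study I learn </s>".split()
bigrams = list(zip(corpus, corpus[1:]))

rows = sorted(set(w for w, _ in bigrams))   # unique (N-1)-grams (here: single words)
cols = sorted(set(w for _, w in bigrams))   # unique next words

count_matrix = pd.DataFrame(0, index=rows, columns=cols)
for prev, word in bigrams:                  # slide a two-word window over the corpus
    count_matrix.loc[prev, word] += 1

# Probability matrix: divide each cell by its row sum
prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)

print(count_matrix)
print(prob_matrix)
```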
Language model
- Probability matrix => language model
- Sentence probability: the product of the conditional probabilities of its N-grams
- Next word prediction: pick the most likely word in the row of the previous (N-1)-gram
- Log probability: sum log probabilities instead of multiplying them, to avoid numerical underflow (see the sketch below)
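A sketch of how the probability matrix can be used as a language model; here it is hard-coded as a nested dict consistent with the toy corpus "<s> I study I learn </s>" used above, and the function names are made up.

```python
# Sketch: sentence log probability and next-word prediction from bigram probabilities.
import math

probs = {
    "<s>":   {"I": 1.0},
    "I":     {"study": 0.5, "learn": 0.5},
    "study": {"I": 1.0},
    "learn": {"</s>": 1.0},
}

def sentence_log_prob(tokens):
    """Sum log probabilities instead of multiplying, to avoid underflow."""
    return sum(math.log(probs[prev][word]) for prev, word in zip(tokens, tokens[1:]))

def predict_next(prev):
    """Next-word prediction: the most likely word in the row of `prev`."""
    return max(probs[prev], key=probs[prev].get)

tokens = "<s> I learn </s>".split()
print(sentence_log_prob(tokens))  # log(1.0) + log(0.5) + log(1.0) ≈ -0.693
print(predict_next("I"))          # 'study' or 'learn' (tie; max picks the first)
```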
Generative Language model
Generate text from scratch or from a very short prompt.
![image-20220220181331239](https://img-blog.csdnimg.cn/img_convert/af98e75b4101e35d12d8f294a28fb5c2.png)
Algorithm (a sketch follows the list):
- Choose a sentence start, i.e. a bigram beginning with <s>
- Choose the next bigram starting with the previous word
- Continue until </s> is picked
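A minimal sketch of this sampling loop, assuming a hand-written bigram probability table; in practice the table would come from the probability matrix.

```python
# Sketch: start from <s>, repeatedly sample the next word from the bigram
# distribution of the previous word, stop when </s> is picked.
# The probability table is assumed for illustration.
import random

probs = {
    "<s>":       {"Lyn": 2/3, "John": 1/3},
    "Lyn":       {"drinks": 1/2, "eats": 1/2},
    "John":      {"drinks": 1.0},
    "drinks":    {"chocolate": 1/2, "tea": 1/2},
    "eats":      {"chocolate": 1.0},
    "chocolate": {"</s>": 1.0},
    "tea":       {"</s>": 1.0},
}

def generate():
    word, sentence = "<s>", []
    while word != "</s>":
        words, weights = zip(*probs[word].items())
        word = random.choices(words, weights=weights)[0]
        if word != "</s>":
            sentence.append(word)
    return " ".join(sentence)

print(generate())  # e.g. "Lyn drinks tea"
```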
Model evaluation
Test data
- Split the corpus into train/validation/test sets (a toy split is sketched after the figure below)
- For small corpora:
  - 80% train
  - 10% validation
  - 10% test
- For large corpora (typical for text):
  - 98% train
  - 1% validation
  - 1% test
Test data: split method
![image-20220220182541515](https://img-blog.csdnimg.cn/img_convert/bcc45cfcb551cbf30301e6cb7ff7cc1e.png)
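A toy sketch of the 80/10/10 random split; the placeholder sentences are illustrative only.

```python
# Sketch: 80/10/10 random split of a small corpus of sentences.
# For large text corpora the typical split is 98/1/1 instead.
import random

sentences = [f"sentence {i}" for i in range(100)]  # placeholder corpus
random.shuffle(sentences)

n = len(sentences)
train = sentences[: int(0.8 * n)]
validation = sentences[int(0.8 * n): int(0.9 * n)]
test = sentences[int(0.9 * n):]

print(len(train), len(validation), len(test))  # 80 10 10
```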
Perplexity
Perplexity measures the complexity of a sample, e.g. of a text.
It tells us whether a text looks like it was written by a human rather than by a simple program choosing words at random: human-written text has low perplexity, randomly generated text has high perplexity.
- W → test set containing m sentences s
- s_i → i-th sentence in the test set, each ending with </s>
- m → number of all words in the entire test set W, including </s> but not including <s>
PP(W) = P(s_1, s_2, ..., s_m)^(-1/m)
E.g. m = 100 (100 words):
First model: P(W) = 0.9. A test-set probability of 0.9 means the model predicts the test set well, and the perplexity is very low: PP(W) = 0.9^(-1/100) ≈ 1.001.
Second model: a much smaller P(W) would give a much higher perplexity.
- The lower the perplexity, the better the model. Good models have a perplexity between 20 and 60.
- Character-level models have a lower perplexity than word-based models.
Perplexity for bigram models
PP(W) = ( ∏_{i=1}^{m} ∏_{j=1}^{|s_i|} 1 / P(w_j^(i) | w_{j-1}^(i)) )^(1/m), where w_j^(i) is the j-th word of the i-th sentence.
For the same per-word probabilities, the larger the test set (larger m), the lower the perplexity.
- Concatenate all sentences in W
  If all the sentences in the test set are concatenated into one sequence, the formula simplifies to PP(W) = ( ∏_{i=1}^{m} 1 / P(w_i | w_{i-1}) )^(1/m).
Log perplexity
Using logs makes the computation easier: log PP(W) = -(1/m) ∑_{i=1}^{m} log2 P(w_i | w_{i-1}). Since good models have a perplexity of 20-60, their log perplexity is roughly 4.3-5.9 (a computational sketch follows the figure below).
![image-20220220190646021](https://img-blog.csdnimg.cn/img_convert/c83e0f94ed6cfc7490a33c597fc335ce.png)
Unigram models have a very high perplexity, bigram models are more reasonable, and trigram models have an even lower perplexity.
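A small sketch of computing log perplexity (and perplexity) of a bigram model on a tiny test set; the probability table is assumed, and m is counted as defined above (including </s>, excluding <s>).

```python
# Sketch: log perplexity of a bigram model on a test set.
import math

probs = {
    "<s>":   {"I": 1.0},
    "I":     {"study": 0.5, "learn": 0.5},
    "study": {"I": 1.0},
    "learn": {"</s>": 1.0},
}

test_set = ["<s> I study I learn </s>"]

log_prob_sum, m = 0.0, 0
for sentence in test_set:
    tokens = sentence.split()
    m += len(tokens) - 1                      # every word and </s>, but not <s>
    for prev, word in zip(tokens, tokens[1:]):
        log_prob_sum += math.log2(probs[prev][word])

log_pp = -log_prob_sum / m                    # log perplexity
print(log_pp, 2 ** log_pp)                    # 0.4, perplexity = 2**0.4 ≈ 1.32
```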
<UNK> Unknown word
Out of vocabulary words
- Closed vs. open vocabularies
- Unknown word = out-of-vocabulary word (OOV)
- Model unknown words with a special tag <UNK> in the corpus and in the input
Using <UNK> in the corpus
- Build the vocabulary V
- Replace any word in the corpus that is not in V with <UNK>
- Count the probabilities with <UNK> as with any other word
![image-20220220192229188](https://img-blog.csdnimg.cn/img_convert/ce96426714b9233236eb67455323f169.png)
How to build the vocabulary
- Criteria:
  - Minimum word frequency f: words whose frequency reaches f go into V, and all other words are replaced by <UNK> (see the sketch after this list)
  - Maximum |V|: decide on a maximum vocabulary size and include words by frequency until it is full
- Use <UNK> sparingly: <UNK> lowers the perplexity, but the model may end up generating sequences of <UNK> tokens instead of meaningful sentences
- Perplexity: only compare LMs with the same vocabulary V
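A minimal sketch of the minimum-frequency criterion; the corpus and the threshold f = 2 are made up for illustration.

```python
# Sketch: build a vocabulary with a minimum word frequency f and replace
# everything else with <UNK>. Corpus and threshold are illustrative.
from collections import Counter

corpus = "the cat sat on the mat the dog sat".split()
min_freq = 2

counts = Counter(corpus)
vocab = {word for word, c in counts.items() if c >= min_freq}     # min-frequency criterion
processed = [word if word in vocab else "<UNK>" for word in corpus]

print(vocab)      # {'the', 'sat'}
print(processed)  # ['the', '<UNK>', 'sat', '<UNK>', 'the', '<UNK>', 'the', '<UNK>', 'sat']
```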
Smoothing
![image-20220220200937653](https://img-blog.csdnimg.cn/img_convert/324792fe65f88520eccc23f29c8cba31.png)
Add-one smoothing (Laplacian smoothing): P(w_n|w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + |V|)
Add-k smoothing: P(w_n|w_{n-1}) = (C(w_{n-1} w_n) + k) / (C(w_{n-1}) + k|V|) (a sketch follows)
More advanced smoothing methods:
Kneser-Ney
Good-Turing
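A small sketch of add-k smoothing for bigram probabilities (add-one is the special case k = 1); the toy corpus is illustrative.

```python
# Sketch: add-k smoothing for bigrams,
# P(w | prev) = (C(prev w) + k) / (C(prev) + k * |V|).
from collections import Counter

corpus = "<s> I study I learn </s>".split()
vocab = set(corpus)          # |V| = 5 here (<s>, I, study, learn, </s>)
k = 1.0

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def smoothed_prob(prev, word):
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * len(vocab))

print(smoothed_prob("I", "study"))  # seen bigram:   (1 + 1) / (2 + 5) = 2/7 ≈ 0.286
print(smoothed_prob("I", "</s>"))   # unseen bigram: (0 + 1) / (2 + 5) = 1/7 ≈ 0.143
```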
Backoff
- If the N-gram is missing => use the (N-1)-gram; if that is missing too, use the (N-2)-gram, and so on
- Probability discounting, e.g. Katz backoff
  In general, lower-order N-grams occur more often (every lower-order tuple is contained in some higher-order tuple, and the higher the order, the sparser the counts), so they are more reliable. Unseen higher-order N-grams can therefore be smoothed using the lower-order N-grams they contain. In short: use the higher-order model when it is reliable, otherwise fall back to a lower-order model.
- "Stupid" backoff
  Whenever the count of some N-gram is 0, back off to the probability with one word less of context, multiplied by a constant weight of 0.4.
- Interpolation
  E.g. for a trigram model: P_interp(w_n | w_{n-2} w_{n-1}) = λ1 P(w_n | w_{n-2} w_{n-1}) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n), with λ1 + λ2 + λ3 = 1.
  (A sketch of both backoff and interpolation follows this list.)
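A sketch of "stupid" backoff and linear interpolation for a missing trigram; the probability tables, the 0.4 weight and the λ values are illustrative.

```python
# Sketch: two ways to handle a missing trigram, with made-up probability tables.

trigram_probs = {}                                  # ("the", "teacher", "drinks") missing
bigram_probs = {("teacher", "drinks"): 0.3}
unigram_probs = {"drinks": 0.1}

def stupid_backoff(w1, w2, w3, alpha=0.4):
    """Back off to a lower-order N-gram, multiplying by a constant weight each time."""
    if (w1, w2, w3) in trigram_probs:
        return trigram_probs[(w1, w2, w3)]
    if (w2, w3) in bigram_probs:
        return alpha * bigram_probs[(w2, w3)]
    return alpha * alpha * unigram_probs.get(w3, 0.0)

def interpolated(w1, w2, w3, lambdas=(0.7, 0.2, 0.1)):
    """Linear interpolation: always mix trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas                            # weights must sum to 1
    return (l3 * trigram_probs.get((w1, w2, w3), 0.0)
            + l2 * bigram_probs.get((w2, w3), 0.0)
            + l1 * unigram_probs.get(w3, 0.0))

print(stupid_backoff("the", "teacher", "drinks"))   # 0.4 * 0.3 = 0.12
print(interpolated("the", "teacher", "drinks"))     # 0.2*0.3 + 0.1*0.1 = 0.07
```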
Summary
- N-grams and probabilities
- Approximate sentence probability from N-grams
- Build a language model from a corpus
- Fix missing information:
  - Out-of-vocabulary words with <UNK>
  - Missing N-grams in the corpus with smoothing, backoff and interpolation
- Evaluate the language model with perplexity
- Coding assignment!