NLP入门——语言结构/语言建模

一、Linguistics 语言学

  • words
  • morphology 形态学:词的构成和内部结构研究。如英语的dog、dogs和dog-catcher有相当的关系
  • morpheme 语素:最小的语法单位,是最小的音义结合体
  • lexeme 词位:词的意义的基本抽象单位,是一组通过形态变化相联系的词的集合。例如 run、runs、ran 和 running 都属于同一个词位,可标识为 RUN。词位也是词法学分析中,指代相同词根(stems)、不同形式的一组词的单位。词性不同词位会不同
  • lemma 词目:词位按惯例确认的典型形式。例如词位 RUN 的词目是 run。词目是词典的基本查检单位和解释对象,位于词条首位,不规则的形态变化、不常见的其他形式常在词条以下列出。不过,有的语言学字典视两者为同样的概念。
  • lexicon:词库,所有词汇
  • root and affix:词根和词缀(包含prefix, infix, and suffix)(eg., mothers-in-law, passersby)
  • parts of speech(POS) 词类
  1. Open classes: one that commonly accepts the addition of new words (e.g., nouns, verbs, adjectives, adverbs) 通常包含大多数词
  2. Closed classes: one to which new items are very rarely added (e.g., prepositions 介词, determiners 限定词, pronouns 代词, conjunctions 连词, auxiliary verbs 助动词)
  • syntax 句法:句子的构成和内部结构的研究
  • semantics 语义:句子意义的研究
  • discourse 语篇理解文档中的句子之间是如何关联起来的。因此,当我们浏览一个文档时,语篇为我们提供了一个关于该文档的连贯的故事线
  • metonymy 借代或称转喻:用与被修饰对象相关的其他事物来指代被修饰对象 for example using the White House for the US president
  • hyponym/hypernym:下位词,上位词 (eg., Red是color的下位词)
  • meronym / holonym:部分词 / 整体词(eg., leg 是 chair 的部分词,chair 是 leg 的整体词)
  • antonym:反义词
  • polysemous:多义词
  • anaphoric / Cataphoric:回指 / 后指
  • Phonetics语音学:人类语言声音的研究
  • Phonology音韵学:人类语言中声音系统的研究
  • Pragmatics语用学:研究具有语义意义的句子用于特定交际目标的方式

1. 语素 morphemes

词素是最小的有意义的语言单位,不能够进一步划分为更小的单位而不破坏或彻底改变其词汇意义或语法意义。

语素和词的区别在于,许多语素不能独立存在。而能够单独存在并且有意义的语素叫做词根(Stems);不能独立存在,要借助其他语素表达意义的语素则称之为词缀 (affixes)。每个词都包含最少一个语素。按能否独立成词对语素分类:

  • 自由语素(Free morphemes):可以独立成词,如cat,town
  • 粘着语素(Bound morphemes):不能独立成词,必须和词根或者其他语素一起成词,如 -tion、-ing、-s
    所有的粘着语素(Bound morphemes)都是词缀(affixes),从功能上可分为派生词缀和屈折词缀。位于词根前的词缀,叫做前缀(prefix),跟在词根后面的词缀,叫做后缀(suffix),如果插入在词干中间的,叫做中缀(infix)

粘着语素可以进一步分类,分为:

  • 派生语素 ( Derivational morpheme):可以和其他语素结合成为一个新的词,带来新的语义。如英语中的派生后缀-ly,使形容词slow变为副词slowly。前缀un-在一些形容词和动词前面表示否定的意义,如unhealthy
  • 屈折语素 ( Inflectional morpheme ):通常改变词的语法范畴,而不改变词义。屈折语素改变动词的时、体、式、态等范畴,改变名词、代词、形容词的格、数和性等范畴。比如说英语的动词加-s表示第三人称单数,但不改变其动词属性

需要特别指出,有时候派生语素和屈折语素会采用同一种表现形式。比如说,-er,当这个后缀跟在形容词后面,它作为屈折语素,表示比较级;当跟在动词后面,则是一个派生语素,形成一个新词,如cook-cooker。

按意义的虚实来分,语素又可分为:

  • 内容语素(Content morpheme):携带具体的语义,如 cat、-able 和 un-。内容语素包括名词、动词、副词、形容词,也包括粘着语素中的派生语素(Derivational morpheme)
  • 功能语素(Function morpheme):仅仅提供语法信息或句法一致(agreement)信息,如 and、复数 -s。包括介词、连词、代词和限定词,以及粘着语素中的屈折语素(Inflectional morpheme)

2. 形态学 Morphology

词的分类

  1. Variable and invariable words 可变化词与不可变化词(可变化词多为实义词,不可变化词多为语法词)
  2. Lexical words and grammatical words 实义词和语法词
  3. Open class and closed class 开放类与封闭类(开放类多为实义词,封闭类多为语法词)

3. Laws of text 语言统计学三大定律

  • Zipf’s Law:在给定的语料中,对于任意一个term,其频度(freq)的排名(rank)和freq的乘积大致是一个常数
  • Benford’s Law:在自然形成的十进制数据中,任何一个数据的第一个数字d出现的概率大致log10(1+1/d)
  • Heap’s Law:在给定的语料中,其独立的term数(vocabulary的size)v(n)大致是语料大小(n)的一个幂函数

3.1. Zipf’s Law 齐夫定律

齐夫定律是一个实验定律。在自然语言的语料库里,一个单词出现的频率与它在频率表里的排名成反比。所以,频率最高的单词出现的频率大约是出现频率第二位的单词的2倍,而出现频率第二位的单词则是出现频率第四位的单词的2倍

给出一组服从齐夫分布的频率,按照从最常见到最不常见排列,第二常见词的频率是最常见词频率的 1/2,第三常见词的频率是最常见词频率的 1/3,第 n 常见词的频率是最常见词频率的 1/n。然而,这并不精确,因为所有的项必须出现整数次,一个单词不可能出现 2.5 次。
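
下面给出一个极简的 Python 验证示意(假设有一个已分词的词列表 tokens,并非任何特定语料;只演示 rank × freq 是否大致为常数的检查方式):

```python
from collections import Counter

def zipf_check(tokens):
    """按频率降序排列词项,打印 rank、freq 及其乘积(齐夫定律下乘积应大致为常数)。"""
    counts = Counter(tokens)
    ranked = counts.most_common()                     # [(word, freq), ...] 按频率降序
    for rank, (word, freq) in enumerate(ranked[:10], start=1):
        print(rank, word, freq, rank * freq)

# 用法示意:tokens 为任意已分词文本
zipf_check("the cat sat on the mat the cat ran".split())
```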

3.2. Benford’s Law 本福特定律

Benford's law 描述了真实数字数据集中首位数字的频率分布。一堆从实际生活得出的数据中,以1为首位数字的数的出现概率约为总数的三成,接近直觉得出之期望值1/9的3倍。推广来说,越大的数,以它为首几位的数出现的概率就越低。它可用于检查各种数据是否有造假。但要注意使用条件:1.数据之间的差距应该足够大。2.不能有人为操控。

本福特定律说明在b进位制中,以数n起头的数出现的概率为

P(n)=\log _{b}(n+1)-\log _{b}(n)=\log _{b}\left({\frac {n+1}{n}}\right) 其中,n=1,2,...,b-1

本福特定律(Benford's law)的直观解释 - 知乎

3.3 Heap’s Law

Heaps定律将词项的数目估计成文档集大小的一个函数 V(n)=kn^{\beta},其中 n 是文档集合中的词条(token)个数。参数 k 和 β 的典型取值为 30 ≤ k ≤ 100,β ≈ 0.5

Heaps定律认为,文档集大小和词汇量之间可能存在的最简单的关系,就是它们在对数空间(log-log space)中存在线性关系。

不同的文档集下参数k的取值差异会比较大,这是因为词汇量的大小依赖于文档集本身以及对它进行处理的方法。

  • 大小写转换和词干还原会降低词汇量的增长率
  • 允许加入数字和容忍拼写错误会增加该增长率

不论在某个特定的文档集中的参数取值如何,Heaps定律提出了如下两点假设:

  1. 随着文档数目的增加,词汇量会持续增长而不会稳定到一个最大值;
  2. 大规模文档集的词汇量也会非常大。

因此在大规模文档集情况下,词典压缩对于建立一个有效的信息检索系统来说是非常重要的。

4. 句法学 Syntax

  • constituents(句法成分):Constituents are the units of language that work together to build a sentence. They can be morphemes, phrases, and clauses. The vital constituents within a sentence are the subject and its predicate. A subject is who/what the sentence is about, and a predicate is the part of a sentence that adds detail or information to the subject (they usually contain a verb.)
    汉语中:主语、谓语、动语、宾语、定语、状语
  • Heads and phrases 中心词和短语:In an X-phrase (e.g. NP), the key occurrence of X (e.g. N) is called the head, and controls how the phrase interacts (both syntactically and semantically) with the rest of the sentence.

Noun(N): Noun Phrase(NP)
Adjective(A): Adjective Phrase(AP)
Verb(V): Verb Phrase(VP)
Preposition(P): Prepositional Phrase(PP)

Morphemes are the smallest unit of lexical units; they can be free (can stand alone) or bound (cannot stand alone). In simple terms, free morphemes are individual words that cannot be broken down further (e.g., cat, he, laptop, etc.), and bound morphemes are the affixes we attach to them (e.g., the -ly in quickly, the un- in unfinished, the -s in houses, etc.).
Phrases are constituents that contain a group of words that act as a particular word class, i.e., a noun phrase works as a noun. Phrases can be categorized into noun phrases, verb phrases, prepositional phrases, adverb phrases, and adjective phrases.
A clause is a constituent that contains a subject and a predicate. There are two types of clauses: independent and dependent. Independent clauses can stand alone, whereas dependent clauses must be connected to independent clauses to get their meaning. (because ..., it...)

5. 语法 Grammar

5.1 Desirable properties of a grammar

Chomsky specified two properties that make a grammar “interesting and satisfying”:

  • It should be a finite specification of the strings of the language, rather than a list of its sentences.
  • It should be revealing, in allowing strings to be associated with meaning (semantics) in a systematic way.

We can add another desirable property: It should capture structural and distributional properties of the language. (E.g. where heads of phrases are located; how a sentence transforms into a question; which phrases can float around the sentence.)

Context-free grammars (CFGs) provide a pretty good approximation.

Some features of NLs are more easily captured using mildly context-sensitive grammars.
There are also more modern grammar formalisms that better capture structural and distributional properties of human languages. (E.g. combinatory categorial grammar.)

5.2 Context-free grammars (CFGs)

A context-free grammar defines:

  • A set of sentential forms: strings that have nonterminals and terminals mixed together and can be derived from the start nonterminal
  • A language: this is the set of all terminal strings that are derivable from the grammar

5.3 Recursion 递归

Recursion in a grammar makes it possible to generate an infinite number of sentences.

In direct recursion, a non-terminal on the LHS of a rule also appears on its RHS. The following rules add direct recursion to G1:

VP → VP Conj VP
Conj → and | or

In indirect recursion, some non-terminal can be expanded (via several steps) to a sequence of symbols containing that non-terminal:

NP → Det N PP
PP → Prep NP

5.4 Sentence types

  • Declarative: I prefer a morning flight.
  • Imperative: Give me the newspaper.
  • Yes-no question: Do any of these flights have stops?
  • Wh-questions: What is your name?

5.5 Chomsky Normal Form

A context-free grammar is in Chomsky normal form if all productions are of the form A → B C or A → a where A, B, C are nonterminals (or preterminals) in the grammar and a is a word in the grammar.

Disregarding the empty string, every CFG is equivalent to a grammar in Chomsky normal form (the grammars’ string languages are identical)

5.5.1 转换到 Chomsky Normal Form

S → A1 A2 A3 transforms into S → A1 B, B → A2 A3

5.6 Agreement phenomena

Constituents of a sentence (often separated) may be constrained to agree on an attribute such as person, number, gender.

• You, I imagine, are unable to attend.
• The hills are looking lovely today, aren’t they?
• He came very close to injuring himself.

One solution is to refine our grammar by splitting certain non-terminals according to various attributes. (Node-splitting via attributes)

Person: 1st, 2nd, 3rd
Number: singular, plural
Gender: masculine, feminine, neuter
Case: nominative, accusative, dative, . . .
Tense: present, past, future, . . .

比如: Split NP on person, number, case (e.g. NP[3,sg,nom]),
Split VP on person, number, tense (e.g. VP[3,sg,fut]).

5.7 Parameterized CFG productions

We can often use such attributes to enforce agreement constraints. This works because of the head phrase structure typical of NLs.

我们可以写出一些例子 parameterized rules

S → NP[p,n,nom] VP[p,n]
NP[3,n,c] → Det[n] Nom[n]

6. Lexical Semantics 词汇语义

6.1 synsets: sets of synonymous words

WordNet: a hand-built ontology containing 117,000 synsets: sets of synonymous words.

Synsets are connected by relations such as:
hyponym/hypernym (IS-A: chair-furniture) 下位词/上位词。上下位关系是指一个子类型的下位词与一个父类型的上位词之间的语义关系
meronym/holonymy (PART-WHOLE: leg-chair):部位词/群体词
antonym (OPPOSITES: good-bad):反义词

6.2 Word Sense Ambiguity 词义歧义

Two completely different words can be spelled the same (homonyms) 同形异义词

More generally, words can have multiple (related or unrelated) senses (polysemes) 一词多义

6.3 Word sense disambiguation (WSD) 词义消歧

Classifiers for WSD

Semi-supervised WSD (bootstrapping, the Yarowsky algorithm)

Unsupervised WSD (Word Sense Induction): use clustering

6.4 Sparse Word Vectors 稀疏词向量

Distributional hypothesis:similar contexts imply similar meanings
Linguistic distribution: the set of contexts that a particular item (here, word) occurs in 【Sometimes displayed in Keyword In Context (KWIC) format】
Distributional semantics:vector-space models
词语搭配 (Collocation):经常一起出现的词组上下文,衡量方法:

1. pointwise mutual information (PMI) 点互信息: 用 MLE 估计各概率后计算 \log\frac{p(x,y)}{p(x)p(y)},衡量“collocation-ness”(搭配强度)。得到的向量维度很高且非常稀疏(通常将 PMI < 0 截断为 0,即 PPMI)

2. TF-IDF (Term frequency - Inverse document frequency): Words are represented as a row of the term-document matrix; each dimension of this vector corresponds to a document
即一个单词被表示为它在各个文档中出现情况(权重)构成的向量

How to measure similarity:Vector space representation(word embeddings)。cosine of the angle: \frac{\vec{v}\cdot\vec{w}}{|\vec{v}||\vec{w}|}
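
下面是对 PPMI 与余弦相似度的一个最小化 numpy 示意(counts 为假设的玩具共现矩阵,行是目标词、列是上下文词;并非课程给定实现,仅演示计算流程):

```python
import numpy as np

# 假设的词-上下文共现计数矩阵(行:目标词,列:上下文词)
counts = np.array([[10., 0., 2.],
                   [ 3., 4., 1.],
                   [ 0., 6., 5.]])

total = counts.sum()
p_xy = counts / total                        # 联合概率的 MLE 估计
p_x = p_xy.sum(axis=1, keepdims=True)        # 目标词边缘概率
p_y = p_xy.sum(axis=0, keepdims=True)        # 上下文词边缘概率

with np.errstate(divide='ignore', invalid='ignore'):
    pmi = np.log2(p_xy / (p_x * p_y))
ppmi = np.maximum(pmi, 0)                    # PMI < 0 截断为 0,得到 PPMI
ppmi = np.nan_to_num(ppmi)                   # 防御性处理可能出现的 nan

def cosine(v, w):
    """余弦相似度: v·w / (|v||w|)"""
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(ppmi[0], ppmi[1]))              # 两个目标词(行向量)的相似度
```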

6.5 Dense Word Vectors

to represent a word with vectors that are short (with length in the 100s or 1000s) and dense (most values are non-zero)

  • Static word vectors “cram” all senses of a word into its representation.
  • Contextualised dense vectors instead depend not just on a word’s identity, but also on their context in each sentence

Skip-gram modelling (or word2vec):预测与给定中心词相对应的上下文词。 relying on the idea of pairing words with dense context and target vectors. If a word co-occurs with a context word w^c , then its target vector should be similar to the context vector of w^c

The computational problem with skip-gram models and an example solution: negative sampling 负采样 (采样得到一个上下文词和一个目标词,生成一个正样本(positive example),生成一个负样本(negative example),则是用与正样本相同的上下文词,再在字典中随机选择一个单词)

6.5.1 Skip-gram modelling (or word2vec)

Train a classifier on a binary prediction task:

  • A by-product (衍生物/副产品) of learning this classifier will be the context and target vectors discussed. 
  • These are the *parameters* of the classifier, and we will use these parameters as our word embeddings.
  • No need for hand-labelled supervision - use text with co-occurrence
6.5.2 Prediction with Skip-Grams

6.5.3 Skip-gram with Negative Sampling 负采样

Problem with skip-grams: Computing the denominator requires computing dot product between each word in V and the target word wj , which may take a long time 计算需要遍历词汇表里的所有词

我们想让模型预测一个二分类问题:两个词是否共同出现。Given a pair of target and context words, predict + or - (telling whether they co-occur together or not) 所以提出负采样方法 Solution: randomly sample “negative” examples

use the dot product to indicate the positive/negative label, coupled with logistic regression.
【a contrastive loss based on a distance metric (Euclidean, cosine, etc.)】

maximise:

\log\sigma(\mathbf{t}_{w}\cdot\mathbf{c}_{w^c}) + \sum_{i=1}^{k}\log\sigma(-\mathbf{t}_{w}\cdot\mathbf{c}_{w^{n_i}})

where we have k negative-sampled words w^{n_1}, \cdots, w^{n_k}(\mathbf{t}_w 为目标词 w 的 target vector,\mathbf{c} 为 context vector,\sigma 为 sigmoid 函数)

6.5.4 Properties of Embeddings: Word Analogies

Offsets between embeddings can capture semantic relations between words, e.g. vector(king)+ (vector(woman) − vector(man)) is close to vector(queen). . . . . . but also grammatical relations (e.g., tense) and ontology predicates (e.g., capital of)
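
下面用 numpy 给出词类比计算的示意(embeddings 为假设的小词表随机向量,仅演示 vector(king) − vector(man) + vector(woman) 与候选词做余弦比较的流程;真实效果依赖训练好的词向量):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
embeddings = {w: rng.normal(size=50) for w in vocab}   # 假设的词向量,仅作占位

def analogy(a, b, c, embeddings):
    """返回与 vector(b) - vector(a) + vector(c) 余弦最接近的词(排除 a、b、c 本身)。"""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    def cos(v, w):
        return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(embeddings[w], target))

# man : king :: woman : ?(若换成训练好的向量,期望输出 queen)
print(analogy("man", "king", "woman", embeddings))
```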

continuous bag of words (CBOW) 连续词袋模型:给定上下文,来预测input word

7. Semantic Analysis:meaning representation language (MRL)

  • semantic parsing:represent meanings of utterances(e.g., dialogue systems)
  • how to deal with compositionality

7.1 What is Compositionality

Frege’s principle:The meaning of an expression is a function of the meanings of its parts and of the way they are syntactically combined.

The principle of compositionality is often argued from:

7.2 meaning representation language (MRL)

Basic assumption:The symbols in our meaning representations correspond to objects, properties, and relations in the world (a model or knowledge base of known facts).

What do we want from an MRL?:

  1. Compositional: The meaning of an expression is a function of the meanings of its parts and of the way they are syntactically combined.
  2. Verifiable: Can use the MR of a sentence to determine whether the sentence is true with respect to some given model of the world
  3. Unambiguous: an MR should have exactly one interpretation. So, an ambiguous sentence should have a different MR for each sense
  4. Canonical form 标准形式: sentences with the same (literal) meaning should have the same MR.
  5. Inference: we should be able to verify sentences not only directly, but also by drawing conclusions based on the input MR and facts in the knowledge base.
  6. Expressivity: the MRL should allow us to handle a wide range of meanings and express appropriate relationships between the words in a sentence.

7.3 FOL: First-order Logic (Predicate Logic)

例子: ∃x. movie(x) ∧ ∀y. person(y) ⇒ loves(y, x)

Expressions are constructed from terms:

  • constant and variable symbols that represent entities
  • function symbols that allow us to indirectly specify entities
  • predicate symbols that represent properties of entities and relations between entities

Terms can be combined into predicate-argument structures, which in turn are combined into complex expressions using:

Logical connectives: ∨ (logical or) ∧ (logical and) ¬ (logical not) ⇒ (logical implication)
Quantifiers: ∀ (universal quantifier, i.e., “for all”) ∃ (existential quantifier, i.e. “exists”)

Constants in FOL: Each constant symbol denotes exactly one entity

example:The Evening Star ≡ Venus

Predicates in FOL:Predicates with one argument represent properties of entities;Predicates with multiple arguments represent relations between entities;We write “/N” to indicate that a predicate has arity N (takes N arguments)

examples:nation(Scotland);introduced(John, Marie, Sue); introduced/3

7.4 The semantics of predicates

A predicate of arity N denotes the set of N-tuples that satisfy it.

Functions in FOL:

Like constants, are used to specify (denote) unique entities.
Unlike constants, they refer to entities indirectly, so we don’t need to store as many constants

example:president-of(FIFA), right-knee(Kim), owner-of(Kalpna)

Syntactically, they look like unary predicates, but denote entities, not sets.

Logical connectives:Given FOL expressions P and Q, the meaning of an expression containing P and Q is determined from the meaning of each part and the logical connective.

Variables and Quantifiers in FOL:Variable symbols (e.g., x, y, z) range over entities. (举例子:likes(x, Kim) = { John, Marie })

A predicate with a variable among its arguments only has a truth value if it is bound by a quantifier. (∀x.likes(x, Kim); Cats are mammals has MR ∀x.cat(x) ⇒ mammal(x))

Marie owns a cat has MR ∃x.cat(x) ∧ owns(Marie,x)

∃x.cat(x) ∧ own(Marie, x) vs ∃x.cat(x) ⇒ own(Marie, x):
There is something that is a cat and Marie owns it vs There is something that if it’s a cat, Marie owns it

7.5 quantifier scope ambiguity

例如句子 “Everyone loves a movie”:• No ambiguity in POS tags, syntactic structure, or word senses. • But this sentence is still ambiguous!

(a) There is a single movie that everyone loves
(b) Everyone loves at least one movie, but the movies might be different

Determiners (a, some, every) and coordination (if, and, or) can often be expressed with logical connectives and quantifiers.

7.6 Compositionality- Lambda (λ) Expressions

• Extension to FOL, allows us to work with ‘partially constructed’ formulae.

A λ-expression consists of: Greek letter λ, followed by a variable (formal parameter); + a FOL expression that may involve that variable. 例子:

λx.sleep(x) : ‘The function from an entity x to the FOL expression sleep(x)’

λ-Reduction: λx.sleep(x)(Marie) → sleep(Marie),两者等价;其中 functor 为 λ 表达式,Marie 是 argument

Nested λ-expressions:Allows predicates with several arguments to accept them one by one. 比如 λy. λx. love(x,y)
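
一个极简的示意:用 Python 的 lambda 和字符串粗略模拟 λ-表达式的柯里化与 λ-归约(这不是正式的 FOL 实现,仅帮助理解“参数逐个接收”):

```python
# λx.sleep(x):从实体 x 到 FOL 表达式 sleep(x) 的函数
sleep = lambda x: f"sleep({x})"

# λy.λx.love(x,y):嵌套 λ,使二元谓词可以逐个接收参数
love = lambda y: lambda x: f"love({x},{y})"

print(sleep("Marie"))        # λ-归约: λx.sleep(x)(Marie) -> sleep(Marie)
print(love("Kim")("John"))   # 先给 y=Kim,再给 x=John -> love(John,Kim)
```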

7.7 Verbal (event) MRs

但是参数长度固定,以下情况不能满足:

John gave Mary a book for Susan. ->  Giving2(John, Mary, Book, Susan)
John gave Mary a book for Susan on Wednesday. ->  Giving3(John, Mary, Book, Susan, Wednesday)

Predicates in First-order Logic have fixed arity!!
Requires separate Giving predicate for each syntactic subcategorisation frame (number/type/position of arguments).
• Separate predicates have no logical relation, but they ought to. (【if Giving3(a, b, c, d, e) is true, then so are Giving2(a, b, c, d) and Giving1(a, b, c).】)

7.8 Reification of events

We can solve these problems by reifying events(具象化):give events the same status as entities. In practice, introduce variables for events, which we can quantify over.

The giving event is now a single predicate of arity 1: Giving(e); 

这种表示方法给出事件之间的逻辑关系 logical entailment relations between events.

Entailment relations: (“A entails B” means “A ⇒ B”.)

Given this way of representing meanings, how do we compute meaning representations from sentences?? 需要 semantic analysis or semantic parsing. 语义分析

7.9 Semantic Analysis

a compositional rule-to-rule approach based on FOL augmented with λ-expressions. Based on the principle of compositionality (meaning of the whole built up from the meaning of the parts).

Build up the MR by augmenting CFG rules with semantic composition rules. • Representation produced is literal meaning: context independent and free of inference

Example of final analysis:

7.10 CFG Rules with Semantic Attachments

To compute the final MR, we add semantic attachments to our CFG rules.

These specify how to compute the MR of the parent from those of its children.

规则形式: A → α1 . . . αn        {f (αj .sem, . . . , αk .sem)}

其中, A.sem (the MR for A) is computed by applying the function f to the MRs of some subset of A’s children.

Unary rules normally just copy the semantics of the child to the parents (例子:NP → ProperNoun    {ProperNoun.sem})

λs allowed us to compose arguments with predicate. (比如serving在例子:λy. λx. Serving(x,y)里) 对于动词形式的semantic attachment,它对应的MR:

Verb → serves      { λy. λx. ∃e. Serving(e) ∧ Server(e, x) ∧ Served(e, y) }

我们需要表达更大的语义结构Building larger constituents,用λs, 那么VP rule is:

VP → Verb NP      {Verb.sem(NP.sem)}

那么对于AyCaramba serves meat来说, 最终规则 S → NP VP    {VP.sem(NP.sem)}

但存在问题,比如句子 Every child sleeps. 其中没有宾语,而且有 every 限定词:

它的MR是 ∀ x. Child(x) ⇒ ∃ e. Sleeping(e) ∧ Sleeper(e, x)。如果套用之前的规则(把 S.sem 定义为 VP.sem(NP.sem)),得到: λy. ∃ e. Sleeping(e) ∧ Sleeper(e, y) (∀ x. Child(x) ⇒ Q(x)),where Q(x) is some (potentially quite complex) expression with a predicate-like meaning

但其不是合法有效的FOL格式 complex expressions cannot be arguments to predicates!!

解决方法:S.sem as NP.sem(VP.sem),并使 NP.sem变为 a functor by adding λ , 于是:
λQ ∀ x. Child(x) ⇒ Q(x)

λQ ∀ x. Child(x) ⇒ Q(x) (λy. ∃ e. Sleeping(e) ∧ Sleeper(e, y))

∀ x. Child(x) ⇒ (λy. ∃ e. Sleeping(e) ∧ Sleeper(e, y)) (x)

∀ x. Child(x) ⇒ ∃ e. Sleeping(e) ∧ Sleeper(e, x)  得证!

NP.sem 则具体定义为:NP → Det Nominal   {Det.sem(Nominal.sem)},其中 Det → every   {λP. λQ. ∀x. P(x) ⇒ Q(x)},Nominal → child   {λx. Child(x)};这样 NP.sem 即为 λQ. ∀x. Child(x) ⇒ Q(x)

没有限定词呢?如Kate sleeps,

Assign a different MR to proper nouns, allowing them to take VPs as arguments:
 ProperNoun → Kate {λP. P(Kate)}
这里我们进行了类型提升。Terminology: we type-raised the argument a of a function f, turning it into a function g that takes f as its argument.

7.11 Semantic parsing algorithms

有了a CFG with semantic attachments,如何获得semantic analysis of a sentence?

  • One option (integrated): Modify syntactic parser to apply semantic attachments at the time syntactic constituents are constructed.
  • Second option (pipelined): Complete the syntactic parse, then walk the tree bottom-up to apply semantic attachments.

8. Other semantic phenomena 其他困难的语义现象

Meaning is Hard:

  • anaphoric回指现象 (-It's raining. -Yes)
  • implicit explanation 隐式因果关系 (John went to jail. He embezzled the company funds.)
  • Locutionary / Illocutionary / Perlocutionary: 表义,隐含警告,可能后果 (Smoking is bad for your health)
  • emotional content 情感上看待结果或者预测 (hopefully and fortunately)

Reference is Hard:

  • common sense knowledge 提供背景信息

Coreference is Hard:上下句联系

1. John can open Bill’s safe. He knows the combination.
2. John can open Bill’s safe. He should change the combination.
Q: What does he refer to?

Charles just graduated, and Bob wants Alice to give him a job. Q: What does him refer to?

1. Kim kicked Sandy. It hurt.
2. John said that Kim kicked Sandy. But Bill didn’t believe it.
Q:What does it refer to? • In 1, it refers to the kicking event. • In 2, it refers to the proposition of the first sentence.

Temporal Reasoning is Hard:时序推断,依赖上下文

1. Max fell. John helped him up. 2. Max fell. John pushed him. Q: Which event happened first?

Lexical Semantics is Hard:词义

  • Metaphor
  • metonym
  • Multi-word expressions
  • Telicity(终结性/有界性):在语言学中,telicity 是动词或动词短语的一种属性,它表示具有特定终点(endpoint)的动作或事件。具有此属性的动词或动词短语称为 telic;如果它描述的情况没有达到任何特定的终点,则称其为 atelic。(下面例子中的 play 是 atelic,win 是 telic)

Q: If the sentence “I was playing in the park / I was winning the match” is true, does it necessarily mean that the sentence “I played in the park / I won the match” is true?(对 atelic 的 play 成立,对 telic 的 win 则不一定)

Aspect is Hard:

In English, aspect is often compositional (depending on syntactic relationships to complements and modifiers), rather than lexical (inherent to the verb) as it is in other languages.

Evidentiality is Hard:Language can make explicit the evidentiality (source of information) for a statement

Politeness is Hard:

Information Structure is Hard: Prosody often indicates focus, or new information.

Discourse is hard:

Semantics is Hard

9. Ethics and Bias in NLP

Types of harms:

  • Harms to fairness and justice: different populations have different levels of access to benefits, and are subject to different degrees of harm.
  • Harms to privacy and security: data aggregation and inference can lead to many types of harm to individuals.
  • Harms to transparency and autonomy: Many machine learning systems cannot explain outcomes, and individuals may not be able to challenge outcomes that harm them.

Demographic bias is inherent to NLP

  • Feedback loops: reinforcement of social norms that already foster inequity, aka representational harm.
  • Exclusion of people from already marginalised demographics, aka allocative harm.

Performance reviews are biased

Many applications of NLP have dual use

​​​​​​​Ethics and legality are not equivalent

二、语言建模

1. 概率建模原则

  • 句子概率作为衡量句子是否符合语法的标准是没有用的(sentence probability is useless as a measure of whether a sentence is grammatical)
  • 语言概率建模:自回归因子分解(两个问题:不是所有条件概率都有效;无限序列时概率会丢失)

2. Language Models 

2.1 Formal language theory

例如:

2.2 language models

即一个简单的语言模型应至少满足概率和为一:

3. 词性标注(POS tagging ) 

3.1 介绍

  • Parts-of-speech(也称为词性、词类或句法类别)很有用,因为它们揭示了一个单词及其相邻词的很多信息。知道一个单词是名词还是动词可以告诉我们可能的相邻单词(名词前面有限定词和形容词,动词前面有名词)和句法结构单词(名词通常是名词短语的一部分),使词性标注成为解析的一个关键方面。
  • 词性标注是为输入文本中的每个单词分配词性标记的过程。标注算法的输入是一系列(标记化的)单词和一个标记集,输出是一系列标记,每个单词对应一个标记。
  • 两种经典的词性标注(词性赋值)算法:一种是生成式的隐马尔可夫模型(HMM),另一种是判别式的最大熵马尔可夫模型(MEMM)

一些tagsets

  • Penn Treebank 45 tags 
  • Brown corpus 87 tags
  • C7 tagset 146 tags

​​​​​

3.2 Criteria for classifying words

  1. Distributional criteria: Where can the words occur? 上下文位置
  2. Morphological criteria: What form does the word have? (E.g. -tion, -ize). What affixes can it take? (E.g. -s, -ing, -est). 单词的形态
  3. Notional (or semantic) criteria: What sort of concept does the word refer to? (E.g. nouns often refer to ‘people, places or things’). More problematic: less useful for us. 具体词汇代指的含义

3.3 Open and closed classes 自然语言的封闭词类和开放词类

  • Open classes are typically large, have fluid membership, and are often stable under translation. Four major open classes are widely found in languages worldwide: nouns, verbs, adjectives, adverbs. 所有语言至少有前两种类型(名词和动词)
  • Closed classes are typically small, have relatively fixed membership, and the repertoire of classes varies widely between languages. E.g. prepositions (English, German), post-positions (Hungarian, Urdu, Korean), particles (Japanese), classifiers (Chinese), etc.
  • Closed-class words (e.g. of, which, could) often play a structural role in the grammar as function words.

一些closed class words:

  • prepositions: on, under, over, near, by, at, from, to, with 
  • determiners: a, an, the
  • conjunctions: and, but, or, as, if, when
  • particles: up, down, on, off, in, out, at, by
  • numerals: one, two, three, first, second, third

3.4 Types of lexical ambiguity 词性不确定

Part of Speech (PoS) Ambiguity: e.g., still:

  1. adverb: at present, as yet
  2. noun: (1) silence; (2) individual frame from a film; (3) vessel for distilling alcohol
  3. adjective: motionless, quiet
  4. transitive verb: to calm

Sense Ambiguity: e.g., bark:

  1. sound a dog makes (The dog’s bark woke me up in the morning)
  2. outer covering of a tree (The tree’s bark was rough to the touch)

3.5 Rule-based tagging 基于规则的标注

Very old POS taggers used to work in two stages, based on hand-written rules: the first stage identifies a set of possible POS for each word in the sentence (based on a lexicon), and the second uses a set of hand-crafted rules to select a single POS from each word's list. Example:

  1. If a word belongs to the set of determiners, tag it as DT.
  2. Tag a word next to a determiner as an adjective (JJ), if it ends with -ed.
  3. If a word appears after “is” or “are” and it ends with ing, tag it as a verb VBG.

3.6 Type-ambiguity versus token-ambiguity

Let w1, . . . ,wn be a corpus of words taken from W and t1, . . . ,tn their parts of speech. Let W_{ambg} = \{w \in W | \exists i \neq j \ s.t. \ w_i = w_j = w, t_i \neq t_j \}.
Calculating type-ambiguity: the size of the set W_{ambg} divided by the size of W.
Calculating token-ambiguity: the size of the set \{i | 1 \leq i \leq n, \ w_i \in W_{ambg} \} divided by n.
The bottom line: The type-ambiguous words are the more frequent words!

3.7 tagging tasks (sequence labelling)

  • Case restoration: If we just get lowercased text, we may want to restore proper casing, e.g. the river Thames
  • Named entity recognition: it may also be useful to find names of persons, organizations, etc. in the text, e.g. Barack Obama
  • Information field segmentation: Given specific type of text (classified advert, bibliography entry), identify which words belong to which “fields” (price/size/#bedrooms, author/title/year)
  • Prosodic marking 韵律标记: In speech synthesis, which words/syllables have stress/intonation changes, e.g. He’s going. vs He’s going?

3.8 Universal POS tags

NOUN (nouns)
VERB (verbs)
ADJ (adjectives)
ADV (adverbs)
PRON (pronouns)
DET (determiners and articles)
ADP (prepositions and postpositions)
NUM (numerals)
CONJ (conjunctions)
PRT (particles)
’.’ (punctuation marks)
X (anything else, such as abbreviations or foreign words)

3.9 A probabilistic model for tagging

Let’s define a new generative process for sentences. To generate sentence of length n:

Let t0 = <s>
For i = 1 to n
        Choose a tag conditioned on previous tag: P(t_i | t_{i-1})
        Choose a word conditioned on its tag: P(w_i |t_i)

模型假设:

  • Each tag depends only on previous tag: a bigram model over tags.
  • Words are conditionally independent given tags
Arrows indicate probabilistic dependencies

使用 Probabilistic finite-state machine 有限机模型

  • Prob of moving from state s to s′ (transition probability): P(t_i = s' |t_{i-1} = s)
  • Prob of emitting w from state s (emission probability): P(w_i = w|t_i = s)

Let S = w1 . . .wn be the sentence and T = t1 . . .tn be the corresponding tag sequence. Then:

p(S,T) = \prod_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)

举例子:
对句子 This/DT is/VB a/DT simple/JJ sentence/NN 来说:

 

3.10 HMM(隐马尔科夫)词性标注

在本节中,我们将介绍隐马尔可夫模型在词性标注中的应用。HMM是一个序列模型。序列模型或序列分类器是一个模型,其工作是为序列中的每个单元分配一个标签或类从而将一个观察序列(观察状态)映射到一个标签序列(隐藏状态)。HMM是一种概率序列模型:给定一个单位序列(单词、字母、语素、句子等等),它计算可能的标签序列的概率分布,并选择最佳标签序列

  • Evaluation: Given a sentence S, calculate P(S) (“summing out” T)
  • Decoding: Given a sentence S, calculate argmax_T P(T | S).
  • Learning: Given a set of sentences S (1) , S (2) , ..., find the HMM that maximizes their log-likelihood

对于任意tag序列T,有:

P(S,T)=P(S|T)P(T)= \prod_{i} P(w_i|t_i)P(t_i|t_{i-1})

隐马尔可夫模型(HMM)允许我们讨论观察到的事件(比如我们在输入中看到的单词)和我们在概率模型中认为是因果因素的隐藏事件(比如词性标记)。HMM由以下部分组成:状态集合 Q(这里对应词性标记)、状态转移概率矩阵 A、观测(发射)概率 B、初始概率分布 Π,以及观测序列 O(这里对应单词序列)。

一阶隐马尔可夫模型实例化了两个简化假设。首先,和一阶马尔可夫链一样,特定状态的概率只取决于前一种状态(即任意时刻的隐藏状态只依赖于它前一个隐藏状态):

第二,输出观测 o_i 的概率只取决于产生该观测的状态 q_i,而不取决于任何其他状态或任何其他观测(即任意时刻的观察状态(单词)仅仅依赖于当前时刻的隐藏状态(词性))

HMM模型的三个基本问题

  1. 评估观察序列概率(Evaluation)。即给定模型 λ =(A,B,Π) 和观测序列 O ={o_1, o_2,...o_n} ,计算在模型λ下观测序列O出现的概率 P(O|λ) 。这个问题的求解需要用到前向算法或后向算法(Forward / Backward algorithm)
  2. 模型参数学习问题(Learning)。即给定观测序列 O={o1,o2,...o_n} ,估计模型 λ =(A,B,Π) 的参数,使该模型下观测序列的条件概率 P(O|λ)  最大。这个问题的求解需要用到基于EM算法的鲍姆-韦尔奇算法(Forward-Backward(Baum-Welch) algorithm)
  3. 预测问题,也称为解码(decoding)问题。即给定模型 λ =(A,B,Π)  和观测序列(单词) O ={o_1, o_2,...o_n}  ,求给定观测序列条件下,最可能出现的对应的隐藏状态(词性)序列,这个问题的求解需要用到基于动态规划的维特比算法。更关注第三个问题

Decoding,求解Tagger序列:

一个例子

我们标记句子“Janet will back the bill”; 目标是正确的标签系列(另见图8.6)

(8.21) Janet/NNP will/MD back/VB the/DT bill/NN

设HMM由图8.7和图8.8中的两个表定义。图8.7列出了隐藏状态(词性标记)之间转换的 a_ij 概率。图8.8表示 b_i(o_t) 概率,即给定标签的单词的观察概率

有 N = 5 个状态列。我们从第 1 列(对应单词 Janet)开始,将每个单元格中的维特比值设置为两项的乘积:从 <s> 进入该状态 i 的转移概率(即该状态的初始概率,取自图8.7的 <s> 行),以及该单元格对应标签生成单词 Janet 的观测似然(取自图8.8)。
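
下面给出维特比解码的一个简化 Python 示意(transition、emission 为假设的嵌套字典,<s> 表示句首;数值并非图8.7/8.8 的真实概率,仅演示递推与回溯的流程):

```python
def viterbi(words, tags, transition, emission):
    """transition[prev][t] = P(t|prev), emission[t][w] = P(w|t);返回最优标签序列。"""
    V = [{}]        # V[i][t]: 以标签 t 结尾、覆盖前 i+1 个词的最优路径概率
    back = [{}]
    for t in tags:  # 初始化:从 <s> 转移到 t 并发射第一个词
        V[0][t] = transition['<s>'].get(t, 0) * emission[t].get(words[0], 0)
        back[0][t] = None
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev, best_score = None, 0.0
            for prev in tags:
                score = V[i-1][prev] * transition[prev].get(t, 0) * emission[t].get(words[i], 0)
                if score > best_score:
                    best_prev, best_score = prev, score
            V[i][t], back[i][t] = best_score, best_prev
    # 回溯得到最优标签序列
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# 用法示意(概率为随意设置的玩具数值)
tags = ['DT', 'NN']
transition = {'<s>': {'DT': 0.8, 'NN': 0.2}, 'DT': {'NN': 0.9, 'DT': 0.1}, 'NN': {'DT': 0.5, 'NN': 0.5}}
emission = {'DT': {'the': 0.7}, 'NN': {'dog': 0.4}}
print(viterbi(['the', 'dog'], tags, transition, emission))   # ['DT', 'NN']
```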

Learning 学习更新HMM的参数,即 (tag, word) 发射概率和 (tag, tag) 转移概率:

前向算法(算出观测序列的概率):

后向算法:

两种算法复杂度都为 O(TN^2)
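
作为补充,两者的标准递推可以写成如下通用形式(a 为转移概率,b 为发射概率,N 为状态数,T 为序列长度;记号以具体教材为准):

\alpha_1(j) = \pi_j b_j(o_1), \qquad \alpha_t(j) = \sum_{i=1}^{N}\alpha_{t-1}(i)\,a_{ij}\,b_j(o_t), \qquad P(O|\lambda) = \sum_{j=1}^{N}\alpha_T(j)

\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j), \qquad P(O|\lambda) = \sum_{j=1}^{N}\pi_j b_j(o_1)\beta_1(j)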

我们的训练数据为 D 个观测序列 \{O_1, O_2, \ldots, O_D\},其中任意一个观测序列 O_d = \{o_1^{(d)}, o_2^{(d)}, \ldots, o_T^{(d)}\},其对应的未知的隐藏状态序列表示为 I_d = \{i_1^{(d)}, i_2^{(d)}, \ldots, i_T^{(d)}\}

首先看鲍姆-韦尔奇算法的E步,我们需要先计算联合分布 P(O,I|\lambda) 的表达式如下:

P(O,I|\lambda)=\pi_{i_1}b_{i_1}(o_1)a_{i_1i_2}b_{i_2}(o_2)...a_{i_{T-1}i_T}b_{i_T}(o_T)

我们的E步得到的期望表达式为:

L(\lambda, \overline{\lambda}) = \sum\limits_{d=1}^{D}\sum\limits_{I}P(I|O,\overline{\lambda})\log P(O,I|\lambda)

在M步我们要极大化上式。由于 P(I|O,\overline{\lambda}) = P(I,O|\overline{\lambda})/P(O|\overline{\lambda}), 而 P(O|\overline{\lambda}) 是常数,因此我们要极大化的式子等价于:

\overline{\lambda} = \arg\max_{\lambda}\sum\limits_{d=1}^{D}\sum\limits_{I}P(O,I|\overline{\lambda})\log P(O,I|\lambda)

我们将上面 P(O,I|\lambda) 的表达式代入我们的极大化式子,得到的表达式如下

\overline{\lambda} = \arg\max_{\lambda}\sum\limits_{d=1}^{D}\sum\limits_{I}P(O,I|\overline{\lambda})\left(\log\pi_{i_1} + \sum\limits_{t=1}^{T-1}\log a_{i_t i_{t+1}} + \sum\limits_{t=1}^{T}\log b_{i_t}(o_t)\right)

我们的隐藏模型参数 λ=(A,B,Π), 因此下面我们只需要对上式分别对 A,B,Π 求导即可得到我们更新的模型参数。

对模型参数 Π 的求导:

\overline{\pi_i} = \arg\max_{\pi_{i_1}} \sum\limits_{d=1}^{D}\sum\limits_{I}P(O,I|\overline{\lambda})\log\pi_{i_1} = \arg\max_{\pi_{i}} \sum\limits_{d=1}^{D}\sum\limits_{i=1}^{N}P(O,i_1^{(d)} = i|\overline{\lambda})\log\pi_{i}

由于 \sum\limits_{i=1}^{N}\pi_i = 1,因此根据拉格朗日乘子法,我们得到 \pi_i 要极大化的拉格朗日函数为:

\arg\max_{\pi_{i}}\sum\limits_{d=1}^{D}\sum\limits_{i=1}^{N}P(O,i_1^{(d)} = i|\overline{\lambda})\log\pi_{i} + \gamma\left(\sum\limits_{i=1}^{N}\pi_i - 1\right)

其中,\gamma 为拉格朗日系数。

鲍姆-韦尔奇算法求解HMM参数-机器学习原理

这里我们概括总结下鲍姆-韦尔奇算法的流程

输入:D个观测序列样本 \{O_1, O_2, \ldots, O_D\}

输出:HMM模型参数

1)随机初始化所有的 \pi_i, a_{ij},b_{j}(k)

2) 对于每个样本 d = 1,2,...D,用前向后向算法计算 \gamma_t^{(d)}(i),\ \xi_t^{(d)}(i,j),\ t = 1,2,\ldots,T

3) 更新模型参数:

\pi_i = \frac{\sum\limits_{d=1}^{D}\gamma_1^{(d)}(i)}{D}

a_{ij} = \frac{\sum\limits_{d=1}^{D}\sum\limits_{t=1}^{T-1}\xi_t^{(d)}(i,j)}{\sum\limits_{d=1}^{D}\sum\limits_{t=1}^{T-1}\gamma_t^{(d)}(i)}

b_{j}(k) = \frac{\sum\limits_{d=1}^{D}\sum\limits_{t=1, o_t^{(d)}=v_k}^{T}\gamma_t^{(d)}(j)}{\sum\limits_{d=1}^{D}\sum\limits_{t=1}^{T}\gamma_t^{(d)}(j)}

4) 如果 \pi_i, a_{ij},b_{j}(k) 的值已经收敛,则算法结束,否则回到第2)步继续迭代。

4. Parsing algorithms

Two reasons to care about syntactic structure (parse tree):

As a guide to the semantic interpretation of the sentence  
As a way to prove whether a sentence is grammatical or not

We also need a parsing algorithm to compute the parse tree for a given input string and grammar.

More specific: suppose that we have a grammar in Chomsky normal form. How many possible parses are there for a sentence of n + 1 words? Imagine that every nonterminal can rewrite as every pair of nonterminals (A→B C) and every terminal (A→a)

Let C(n) be the number of binary trees over a sentence of length n. The root of this tree has two subtrees: one over k words (1 ≤ k < n), and one over n − k words. Hence, for all values of k, we can combine any subtree over k words with any subtree over n − k words:

C(n) = \sum_{k=1}^{n-1} C(k) \times C(n - k)

C(n + 1) = \frac{(2n)!}{(n + 1)!n!}

4.1 CYK: The general algorithm

• In any case, for a fixed grammar, the CYK algorithm runs in time O(n^3 ) on an input string of n tokens. The algorithm identifies all possible parses.
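
下面是 CKY 识别算法的一个简化 Python 示意(文法以 Chomsky 范式给出,unary_rules、binary_rules 为假设的数据结构;只判断“句子能否由文法推导出”,不构建分析树):

```python
def cky_recognize(words, unary_rules, binary_rules, start='S'):
    """unary_rules: {word: {A, ...}} 对应 A -> word;
       binary_rules: {(B, C): {A, ...}} 对应 A -> B C。"""
    n = len(words)
    # chart[i][j]: 能够推导出 words[i:j] 的非终结符集合
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(unary_rules.get(w, set()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # 尝试所有分割点
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary_rules.get((B, C), set())
    return start in chart[0][n]

# 用法示意(一个极小的 CNF 文法)
unary_rules = {'she': {'NP'}, 'eats': {'V'}, 'fish': {'NP'}}
binary_rules = {('V', 'NP'): {'VP'}, ('NP', 'VP'): {'S'}}
print(cky_recognize(['she', 'eats', 'fish'], unary_rules, binary_rules))  # True
```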

CYK Parsing | Armin's Blog

https://fightinggg.github.io/material/QV7MPR.html

5. Probabilistic parsing

A probabilistic context-free grammar (PCFG) is a CFG where each rule A → α (where α is a symbol sequence) is assigned a probability P(α|A).

• With PCFGs, we are interested in the most likely derivation given the sentence

5.1 Probabilistic CKY

Sentence S of length n and CFG grammar with V non-terminals

Ordinary CKY:
2-d (n + 1) × (n + 1) array, where the value in cell (i, j) is a list of non-terminals spanning positions i through j in S.

Probabilistic CKY:
3-d (n + 1) × (n + 1) × V array, where the value in cell (i, j, K) is the probability of non-terminal K spanning positions i through j in S.

We have analogues to HMM algorithms.

• The inside algorithm computes the probability of the sentence (analogous to forward algorithm)         
        • Same as above, but instead of storing the best parse for A, store the sum of all parses.
• The inside-outside algorithm is a form of EM that learns grammar rule probabilities from unannotated sentences (analogous to forward-backward).

5.2 Evaluating parse accuracy

• Output constituent is counted correct if there is a gold constituent that spans the same sentence positions.
• Pre-terminals don’t count as constituents.

5.3 Ways to fix PCFGs

Vanilla PCFGs: no lexical dependencies(把规则里的具体词换掉,规则概率却完全不变)
Vanilla PCFGs: no global structural preferences(例如 NP → Pro 在主语位置远比在宾语位置常见,he 和 him 在不同结构位置上的分布差异很大,但普通 PCFG 对同一条规则在不同位置给出相同的概率)

解决方法:

1. parent annotation:Automatically create new categories that include the old category and its parent.

Ex. rules: • S^ROOT → NP^S VP^S
• NP^S → Pro^NP
• NP^S → NP^NP PP^NP

2. lexicalization: adding the lexical head of the phrase

5.4 Best-first probabilistic parsing

因为由 Treebank 导出的文法规则很多,可能的成分与分析数量巨大,而其中大部分的概率都很小,没有必要全部构建出来

Use probabilities of subtrees to decide which ones to build up further.

• Each time we find a new constituent, we give it a score (“figure of merit”) and add it to an agenda, which is ordered by score.
• Then we pop the next item off the agenda, add it to the chart, and see which new constituents we can make using it.
• We add those to the agenda, and iterate.

Notice we are no longer filling the chart in any fixed order. Many variations on this idea, often limiting the size of the agenda by pruning out low-scoring edges (elements in the chart; this is beam search).

6. Dependency parsing

6.1 Dependency syntax

An alternative approach to sentence structure.

• A fully lexicalized formalism: no phrasal categories. 没有短语类
• Assumes binary, asymmetric grammatical relations between words: head-dependent relations, shown as directed edges: 

6.2 Dependency trees

A valid dependency tree for a sentence requires:

• A single distinguished root word.
• All other words have exactly one incoming edge.
• A unique path from the root to each other word.

Labelled dependencies: It is often useful to distinguish different kinds of head → modifier relations, by labelling edges:

In languages with free word order, phrase structure (constituency) grammars don’t make as much sense.

6.3 Pros and Cons

• Sensible framework for free word order languages.
• Identifies syntactic relations directly. (using CFG, how would you identify the subject of a sentence?)
• Dependency pairs/chains can make good features in classifiers, for information extraction, etc.
• Parsers can be very fast (coming up...)

But, the assumption of asymmetric binary relations isn’t always right... e.g., how to parse dogs and cats?

6.4 How to annotate dependencies

Two options:

1. Annotate dependencies directly.
2. Convert phrase structure annotations to dependencies. (Convenient if we already have a phrase structure treebank.)

Constituency Tree → Dependency Tree:the lexical head of the phrase can be used to collapse down to a dependency tree


How to find each phrase’s head? - Head Rules

6.5 Projectivity vs. Nonprojectivity

 Projectivity:当单词按线性顺序排列时,没有交叉的依赖弧,所有的弧都在单词的上方

注意,画出来有交叉弧的依存图并不一定就是非投射树;判断的关键在于:在词序固定、所有弧都画在词上方的前提下,弧之间是否必然交叉(等价地,某条弧所覆盖的区间内是否存在不属于该弧头结点子树的词)。如果交叉无法避免,那么该树就是非投射树。

也就是说,需要检查所有弧两两之间的关系。

6.6 Dependency parsing algorithm 依赖解析算法

Some of the algorithms you have seen for PCFGs can be adapted to dependency parsing.

• CKY can be adapted, though efficiency is a concern: the obvious approach is O(Gn^5); the Eisner algorithm brings it down to O(Gn^3) (G is the size of the grammar). We assume a model that can score trees, similar to parsing with PCFGs. Related: parse into constituency, and convert into dependency.
• Shift-reduce (transition-based): more efficient, doesn’t even require a grammar!
• Graph-based: use an algorithm that finds a directed maximum spanning tree.

There are two main approaches to dependency parsing:

• Graph-based methods 基于图的方式:(Chu-Liu-Edmonds algorithm) 朱-刘/艾德蒙德算法(寻找有向图的最大/最小生成树形图,解析时取得分最高的树形图)
• Transition-based methods 基于转换的方式(Shift-reduce):定义可能转换的状态机,以创建从输入句到依存句法树的映射。学习问题是创建一个可以根据转移历史来预测状态机中的下一个转换的模型。分析问题是使用在学习问题中得到的模型对输入句子构建一个最优的转移序列

6.7 Graph parsing for dependency parsing

  • Projective trees: dynamic programming
  • Nonprojective trees: maximum spanning tree algorithms such as CLE
6.7.1 Projective trees 

• Stage 1: Start with a complete graph, all edges connected to each other with some weight

        - 每条边 e 与句子 x 一起对应一个 “feature” vector ϕ(e, x),模型有一个权重向量 w;权重向量与特征向量的点积 w · ϕ(e, x) 就是这条边的 score

• Stage 2: Run a parsing algorithm to find the highest scoring tree (maximum spanning tree algorithms),即找到使整棵解析树得分最大的树(argmax)

6.7.2 Nonprojective case - Chu-Liu-Edmonds illustrated 艾德蒙德算法

给定一个图 G (with edges weighted):

Step 1: Select a root. Remove all incoming nodes into the root node. 选定根节点

Step 2: For each node, choose the highest scoring edge into the node. If the resulting edge selection is a tree, we are done. Otherwise:

        • There is a cycle. Collapse the cycle into a single node to create a new graph G′ (nontrivial collapse).
        • Call the algorithm recursively on the new graph G′ .
        • Expand the resulting tree back to a tree on G

如何获得边的权重w?

对训练样本 i(真实树为 t_i),先用当前权重 w 推断出一棵树 t^*(Infer a tree t^* using w for training example i with true tree t_i),然后更新:w ← w + ϕ(t_i) − ϕ(t^*)

6.8 Transition-based dependency parsing

  • arc-standard always returns projective trees
  • Other more complex systems may return non-projective trees

example:

Feature functions

6.9 Evaluating dependency parsers

Labelled Attachment Score (LAS): Proportion of words where we predicted the correct head and label.

Unlabelled Attachment Score (UAS): Proportion of words where we predicted the correct head, regardless of label.
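
下面是 UAS/LAS 计算方式的一个简单示意(gold、pred 为假设的 (head, label) 列表,每个元素对应句中一个词;仅演示定义,与任何评测脚本无关):

```python
def attachment_scores(gold, pred):
    """gold/pred: [(head_index, label), ...],按词的位置一一对应。"""
    assert len(gold) == len(pred)
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    las_hits = sum(1 for (gh, gl), (ph, pl) in zip(gold, pred) if gh == ph and gl == pl)
    n = len(gold)
    return uas_hits / n, las_hits / n

# 用法示意:三词句子,第 3 个词的 head 预测正确但标签错误
gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'nmod')]
print(attachment_scores(gold, pred))   # (1.0, 0.666...)
```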

三、Computational methods 计算模型

  • finite state machines (morphological analysis, POS tagging)
  • grammars and parsing (CKY, statistical parsing)
  • probabilistic models and machine learning (HMMS, PCFGs, logistic regression, neural networks)
  • vector spaces (distributional semantics)
  • lambda calculus (compositional semantics)

1. 有限状态机 finite state machines

状态机一般指有限状态机(英语:finite-state machine,缩写:FSM)又称有限状态自动机(英语:finite-state automaton,缩写:FSA),是表示有限个状态以及在这些状态之间的转移和动作等行为的数学计算模型

有限状态机是在自动机理论和计算理论中研究的一类自动机。如下图所示,有限状态机归属于自动机理论范畴,从下面的自动机理论的领域分层图可以看出,越往外层,概念越复杂。

状态机中有几个术语:state(状态) 、transition(转移) 、action(动作) 、transition condition(转移条件) 。

  • state(状态) :将一个系统离散化,可以得到很多种状态,当然这些状态是有限的。例如:门禁闸机可以划分为开启状态、关闭状态;电扇可以划分为关、一档、二档、三档等状态。
  • transition(转移) :一个状态接收一个输入执行了某些动作到达了另外一个状态的过程就是一个transition(转移)。定义transition(转移)就是在定义状态机的转移流程。
  • transition condition(转移条件) :也叫做Event(事件),在某一状态下,只有达到了transition condition(转移条件),才会按照状态机的转移流程转移到下一状态,并执行相应的动作。
  • action(动作):在状态机的运转过程中会有很多种动作。如:进入动作(entry action)[在进入状态时进行]、退出动作(exit action)[在退出状态时进行]、转移动作[在进行特定转移时进行]。
  • emissions 输出: letters produced at each transition

下图,就定义了一个只有opened和closed两种状态的状态机。当系统处于opened状态,在收到输入“关闭事件”后,达到了状态机转移条件,系统就转移到了closed状态,并执行相应的动作。此例有一个进入动作(entry action):进入closed状态时,会执行close door动作。

有限状态机数学模型为(Σ, S, s0, δ, F)(若带输出,则再加上输出集合 O):

  • Σ,alphabet表示有效输入
  • S 有限的,非空的一组状态
  • s0 初始状态,S的一个元素
  • δ,transition表示从一种状态到另一种状态的规则方法
  • F是一组有限状态,是S的子集(可能为空)
  • O一组输出(可能为空)

以售票机为例:

  • Σ (m, t, r) : 付款m, 出票t, 退款r
  • S (1, 2) : 没付款状态, 付款状态
  • s0 (1) :初始状态
  • δ (shown below) : transition function: δ : S x Σ → S
  • F : empty
  • O (p/d) :打印票据p, 退款d

有限状态机可以被分为不同的类型,主要可以分为:acceptors(接收器)和 transducers(转换器)两大类。

Acceptors(接收器) 产生二进制输出,指示是否接受接收到的输入。接受者的每个状态要么接受(accepting)要么不接受(non accepting)。一旦接收到所有输入,如果当前状态为接受(accepting)状态,则该输入被接受;否则将被拒绝。通常,输入是一个符号(字符)序列;没有动作(actions)。

如下图是一种acceptors(接收器) 类型有限状态机,用来识别所输入的字符串是否为nice,其总共被划分为了七种状态,其中只有第七种状态Success被认为是可接受状态。如果所输入的字串不是nice,则会被转移到第六种状态Error

如下给出acceptors(接收器) 的数学形式化定义,acceptors(接收器) 型有限状态机是一个五元组(Σ,S,s0,δ,F) 其中:

  • Σ 是输入字符集合(有限的非空符号集合);
  • S 是有限非空状态集合
  • s0 是初始状态,属于 S 中的元素;
  • δ 是状态转移函数:δ:S×Σ→ S;
  • F 是最终状态集合,是 S 的子集。

Transducers(转换器) 根据给定的输入与/或当前状态产生输出,输出的同时可能也伴随着状态的转移(不是必须)。常应用在控制系统中,又分为两种:moore machine(摩尔型有限状态机)和 mealy machine(米利型有限状态机)

  • 若输出只和状态有关而与输入无关,则称为moore状态机
  • 输出不仅和状态有关而且和输入有关系,则称为mealy状态机

1.1 moore状态机 (moore machine)

如下所示的状态机为moore状态机,其有q0​、q1、q2​、q3四个状态,X,Y,Z三个输入,a、b、c三个输出。可以看出其四个状态q0​、q1​、q2、q3​对应的输出分别为b、a、a、c,就是说输出已经和状态绑定好,不管输入为哪一个,均不影响输出。其中q0为初始状态,假设输入为XYZY,可以看出输出为bcac;假设输入为ZXYZ,则输出为baca,可以看出,虽然输出只和状态有关而与输入无关,但改变输入的序列顺序,输出序列也会改变。

1.2 mealy状态机

mealy状态机的输出与当前状态和输入都有关。但是对于每个mealy状态机都有一个等价的moore机。如下所示为一个简单的mealy状态机,它有一个输入和一个输出。在每一个有向边上,标注出了输入(红色)和输出(蓝色)。这个状态机的初始状态为Si,当输入为0,输出0,状态变为S0,接着输入0,输出0,状态还是为S0,在此状态下一直输入0,输出会一直是0;但输入为1时,输出才为1,状态变为S1,在此状态再接着输入1,输出一直还是0,直到遇到输入为0,输出才变为1。此状态机其实实现了一个边缘触发检测器,每次输入从0到1或从1到0发生跳变时,输出为1,否则输出为0。如下所示时序图,当输入为0111001110时,输出为0100101001

如下给出transducers(转换器) 的数学形式化定义,transducers(转换器) 型有限状态机是一个六元组(Σ, Γ, S, s0, δ, ω),其中:

  • Σ 是输入字符集合(有限的非空符号集合);
  • Γ 是输出字符集合(有限的非空符号集合);
  • S是有限非空状态集合;
  • s0​是初始状态,属于S中的元素;
  • δ是状态转移函数:δ:S×Σ→S;
  • ω 是输出函数

如果输出函数 ω 依赖于状态和输入(ω:S×Σ→Γ),则定义的是mealy状态机;如果输出函数仅仅依赖于状态(ω:S→Γ),那么定义的是moore状态机。如果,有限状态机没有输出函数 ω 这一项,那么可以称作transition system(转移系统) 。很多应用程序用到的有限状态机并没有输出序列,仅仅用到了状态机的转移过程和动作,其实可以称为转移系统。

1.3 代码

1.3.1 简单有限状态机

```python
class Transition:
    """A change from one state to a next"""

    def __init__(self, current_state, state_input, next_state):
        self.current_state = current_state
        self.state_input = state_input
        self.next_state = next_state

    def match(self, current_state, state_input):
        """Determines if the state and the input satisfy this transition relation"""
        return self.current_state == current_state and self.state_input == state_input


class FSM:
    """A basic model of computation"""

    def __init__(self, states=[], alphabet=[], accepting_states=[], initial_state=''):
        self.states = states
        self.alphabet = alphabet
        self.accepting_states = accepting_states
        self.initial_state = initial_state
        self.valid_transitions = False

    def add_transitions(self, transitions=[]):
        """Before we use a list of transitions, verify they only apply to our states"""
        for transition in transitions:
            if transition.current_state not in self.states:
                print('Invalid transition. {} is not a valid state'.format(
                    transition.current_state))
                return
            if transition.next_state not in self.states:
                print('Invalid transition. {} is not a valid state'.format(
                    transition.next_state))
                return
        self.transitions = transitions
        self.valid_transitions = True

    def __accept(self, current_state, state_input):
        """Looks to see if the input for the given state matches a transition rule"""
        # If the input is valid in our alphabet
        if state_input in self.alphabet:
            for rule in self.transitions:
                if rule.match(current_state, state_input):
                    return rule.next_state
            print('No transition for state and input')
            return None
        return None

    def accepts(self, sequence):
        """Process an input stream to determine if the FSM will accept it"""
        # Check if we have transitions
        if not self.valid_transitions:
            print('Cannot process sequence without valid transitions')
            return False
        print('Starting at {}'.format(self.initial_state))
        # When an empty sequence is provided, we simply check
        # if the initial state is an accepting one
        if len(sequence) == 0:
            return self.initial_state in self.accepting_states
        # Process the first input from the initial state
        current_state = self.__accept(self.initial_state, sequence[0])
        if current_state is None:
            return False
        # Continue with the rest of the sequence
        for state_input in sequence[1:]:
            print('Current state is {}'.format(current_state))
            current_state = self.__accept(current_state, state_input)
            if current_state is None:
                return False
        print('Ending state is {}'.format(current_state))
        # Check if the state we've transitioned to is an accepting state
        return current_state in self.accepting_states
```
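
下面是一个简单的使用示意(假设的状态名、字母集与转移,仅演示如何用上面的 FSM 类构造并运行一个接收器):

```python
# 示意:构造一个接受输入序列 ['a', 'b'] 的简单 FSM(状态与转移为假设示例)
states = ['s0', 's1', 's2']
alphabet = ['a', 'b']
fsm = FSM(states=states, alphabet=alphabet,
          accepting_states=['s2'], initial_state='s0')
fsm.add_transitions([
    Transition('s0', 'a', 's1'),   # 读入 a: s0 -> s1
    Transition('s1', 'b', 's2'),   # 读入 b: s1 -> s2
])
print(fsm.accepts(['a', 'b']))   # True:结束于接受状态 s2
print(fsm.accepts(['a', 'a']))   # False:没有匹配的转移
```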

2. 状态机生成语言

2.1 regular languages and regular expressions and Chomsky Hierarchy

  • Languages produced by FSMs are called regular languages
  • Reg. languages can also be described with regular expressions 正则表达式.
  • 正式地,一个正则表达式包含以下运算:
    • 从字母集中抽样得到的符号:Σ
    • 空字符串:ε
    • 两个正则表达式的连接:RS
    • 两个正则表达式的交替:R∣S
    • 星号表示出现 0 次或者重复多次:R∗
    • 圆括号定义运算的有效范围:()
  • Chomsky Hierarchy:  four major classes of formal languages

    • Type 3:regular 正则语言(generated by finite state machines;usually assumed sufficient to describe phonology and morphology)
    • Type 2:context-free 上下文无关(和正则语言相比约束更少,能捕获更多自然语言的特性)
    • Type 1:context-sensitive 上下文敏感(possibly needed for some natural language phenomena)
    • Type 0:recursively enumerable 递归可枚举(anything a computer program can produce)

判断正则和非正则语言:

  • 一个语言是正则的当且仅当它能被有穷自动机接受
  • 正则语言共有而非正则语言不具备的两条性质:
  • 从左到右扫描一个字符串时,为了确定这个字符串最终是否在该语言中,所需要的存储量必须是有界的、事先固定的、只与该语言有关而与具体的输入字符串无关的
  • 有无穷多个字符串的正则语言要用带环的有穷自动机或含 Kleene 星号的正则表达式表示。这样的语言一定有具有某种简单重复构造的无穷子集。

比如,\{a^{n}b^{n}: n\geq 0\} 或者 \{a^{n}: n \text{ 是素数}\} 都是非正则的

2.2 状态机FSM用于形态学分析(morphological analysis)

2.2.1 有限状态接收器(Finite state acceptors,FSA)

FSA 包含:

  • 输入符号的字母集:Σ
  • 状态集合:Q
  • 起始状态:q0 ∈ Q
  • 最终状态:F ⊆ Q
  • 转移函数:符号和状态 → 下一个状态

Build components as separate FSAs

  • L: FSA for lexicon
  • D: FSA for derivational morphology
  • I: FSA for inflectional morphology

Concatenate L + D + I (there are standard algorithms)

2.2.2 有限状态转换器(Finite State Transducers,FST)

FST 在 FSA 的基础上增加了字符串的输出能力:

  • 包含一个 输出字母集(output alphabet)
  • 现在,转移函数会采用输入符号并发出输出符号 (Q,Σ,Σ,Q)
    其中,第一个 Q 表示输入状态,最后一个 Q 表示输出状态,第一个 Σ 表示输入字母集,第二个 Σ 表示输出字母集。

3. Edit distance algorithm 一种动态规划

编辑距离(Minimum Edit Distance,MED),由俄罗斯科学家 Vladimir Levenshtein 在1965年提出,也因此而得名 Levenshtein Distance。在信息论、语言学和计算机科学领域,Levenshtein Distance 是用来度量两个序列相似程度的指标。通俗地来讲,编辑距离指的是在两个单词<w1, w2>之间,由其中一个单词w1转换为另一个单词w2所需要的最少单字符编辑操作次数。

在这里定义的单字符编辑操作有且仅有三种:

  • 插入(Insertion)
  • 删除(Deletion)
  • 替换(Substitution)

一般时候,可以设置插入和删除cost值为1,替换cost值为2.

譬如,"kitten" 和 "sitting" 这两个单词,由 "kitten" 转换为 "sitting" 需要的最少单字符编辑操作有:

1.kitten → sitten (substitution of "s" for "k")
2.sitten → sittin (substitution of "i" for "e")
3.sittin → sitting (insertion of "g" at the end)

因此,"kitten" 和 "sitting" 这两个单词之间的编辑距离为 3 。

4. 编辑距离自动机 (Edit Distance Automata)

\delta(q,a,a,q) = \delta(q,b,b,q)=0 \\ \delta(q,a,b,q) = \delta(q,b,a,q)=1 \\ \delta(q,a,\varepsilon,q) = \delta(q,b,\varepsilon,q)=1 \\ \delta(q,\varepsilon,a,q) = \delta(q,\varepsilon,b,q)=1

其中,δ 是状态转移的记分函数,在编辑距离的 WFST(加权FST) 实现下,我们只需要一个状态 q。上面的式子表明:将 a 替换为 a 或者将 b 替换为 b 的代价为 0;将 a 替换为 b 或者将 b 替换为 a 的代价为 1;将 a 删除或者将 b 删除的代价为 1;插入一个 a 或者 b 的代价为 1。

5. 下推自动机(PushDown Automaton,PDA):自带栈的有限状态机

为什么要在有限自动机的基础上给其增加一个栈来构建下推自动机呢?其根本原因在于有限状态机本身无法存储数据,而加上了栈之后的下推自动机则具有了数据存储能力。具有了存储能力之后,下推自动机的计算能力相较于有限自动机得到了进一步的增强。

我们已经知道,有限自动机在进行状态转移的时候,需要读取一个输入,然后根据当前状态已经输入来进行状态转移。下推自动机在进行状态转移的时候同样要依赖于当前状态和输入,不过,因为下推自动机有了栈的概念,所以在进行状态转移的时候我们还需要一点别的东西 —— 读栈和写栈操作。例如下图这样的一个下推自动机:

(图:下推自动机)

我们可以把这个下推自动机用这样的转移列表列出来,注意这里的 $ 符号,因为栈为空的状态并不是很好判断,所以我们添加了这个字符来代表栈为空,也就是说栈的最底部永远应该保持有 $ 这个字符。最下面的虚线代表着当栈为空且状态为 2 时(我们使用虚线来代表不需要有输入的转移情形,即自由移动),此时不需要任何输入,栈会自动的弹出 $ 然后再压入 $,随后状态从 2 变为 1。

| 当前状态 | 输入字符 | 转移状态 | 栈顶字符(读入 / 弹出) | 压入字符 |
| --- | --- | --- | --- | --- |
| 1 | ( | 2 | $ | b$ |
| 2 | ( | 2 | b | bb |
| 2 | ) | 2 | b | 不压入 |
| 2 | 无输入 | 1 | $ | $ |

四、语言模型

n-gram模型、隐马尔可夫模型(HMM)和条件随机场(CRF)

1. n-gram模型

1.1 unigram(bag-of-words)模型

如何计算单词概率:用极大似然估计,即 P(w) = \frac{count(w)}{N},其中 N 为语料中的总词数

1.2 bigram 或 trigram模型

MLE计数估计概率参数:P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}

trigram: P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}

对于给定的 x_{i-1},所有可能的 x_i 的条件概率之和为一

要注意 p(x_i \mid x_{i-1}) 中下标的顺序!
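
下面是 bigram 语言模型 MLE 估计(以及可选的加一平滑)的最小 Python 示意(corpus 为假设的小语料,<s>、</s> 为句子边界符号;仅演示计数与条件概率的计算方式):

```python
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigram = Counter()
bigram = Counter()
for sent in corpus:
    unigram.update(sent)
    bigram.update(zip(sent, sent[1:]))        # (w_{i-1}, w_i) 对

vocab = set(unigram)

def p_mle(w, prev):
    """MLE: P(w | prev) = C(prev, w) / C(prev)"""
    return bigram[(prev, w)] / unigram[prev]

def p_add_one(w, prev):
    """加一平滑: (C(prev, w) + 1) / (C(prev) + |V|)"""
    return (bigram[(prev, w)] + 1) / (unigram[prev] + len(vocab))

print(p_mle("cat", "the"))       # 0.5
print(p_add_one("dog", "cat"))   # 未见过的 bigram 也得到非零概率
```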

2. 平滑方法(smoothing) 概率计数

平滑策略的引入,主要是为了解决语言模型计算过程中出现的零概率问题。零概率问题又会给语言模型中N-gram模型的Perplexity评估带来困难。

  • Additive smoothing:Add-one smoothing / 加一或加权平滑
  • Good-Turing estimate
  • Jelinek-Mercer smoothing (interpolation) 插值

2.1 Add-one smoothing 加一平滑,拉普拉斯平滑

Add-α smoothing:P_{add\text{-}\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha|V|}(α = 1 时即为加一平滑)

形式上近似于使用 Dirichlet 先验(dirichlet prior)的贝叶斯估计(bayesian estimation)

2.2 Good-Turing smoothing

基本思想: 用观察计数较高的 N-gram 数量来重新估计概率量大小,并把它指派给那些具有零计数或较低计数的 N-gram

2.3 卡茨退避法(Katz backoff)

Katz平滑方法通过加入高阶模型与低阶模型的结合,扩展了Good-Turing估计方法。

当某一事件在样本中的频率大于k时,运用最大似然估计MLE经过减值来估计其概率。当某一事件的频率小于k时,运用低阶的语法模型作为代替高阶语法模型的后备,而这种代替必须受归一化因子α的作用。根据低阶的语法模型分配由于减值而节省下来的剩余概率给未见事件,这比将剩余概率均匀分配给未见事件更合理。

2.4 Interpolation 线性插值(Jelinek-Mercer 插值

简单来讲,就是把不同阶的模型结合起来:用线性插值把不同阶的 N-gram 结合起来,这里结合了 trigram、bigram 和 unigram,并用 λ 进行加权:
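
线性插值的标准形式如下(以 trigram 为例;λ 为非负权重且和为 1,通常在留出数据上调得):

P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\hat{P}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\hat{P}(w_i \mid w_{i-1}) + \lambda_3\hat{P}(w_i), \qquad \lambda_1+\lambda_2+\lambda_3=1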

2.5 绝对减值法(Absolute Discounting)

  • 从每个 观测到的 n-gram 计数中 “借” 一个 固定的概率质量(a fixed probability mass) d。(因为对于add-one平滑来说出现次数多的词汇会损失更多有效计数)
  • 然后将其 重新分配(redistributes) 到 未知的 n-grams 上。

和Add-one平滑一样,如果我们想计算 Absolute Discounting 平滑概率,只需要用 Absolute Discounting 平滑下的有效计数除以上下文在语料库中出现的总次数即可

2.6 Kneser-Ney Smoothing

Kneser-Ney 平滑采用的方法是基于当前单词 在多少个 不同的上下文 中出现过,或者说是基于该单词的 多功能性(versatility)或者 延续概率(continuation probability)。

对比 Katz Backoff 的概率公式,在 Kneser-Ney 平滑概率公式中,只有一项不同:

P_{\text{KN}}(w_i\mid w_{i-1})=\begin{cases}\dfrac{C(w_{i-1},w_i)-D}{C(w_{i-1})}\;, & \text{if } C(w_{i-1},w_i)>0\\\\ \alpha(w_{i-1})\color{red}{P_{\text{cont}}(w_i)}\;, & \text{otherwise} \end{cases}

其中,

P_{\text{cont}}(w_i)=\dfrac{|\{w_{i-1}:C(w_{i-1},w_i)>0\}|}{\sum_{w_i}|\{w_{i-1}:C(w_{i-1},w_i)>0\}|}

可以看到,对于观测到的 n-grams,Kneser-Ney 概率和 Katz Backoff 概率两者是相同的。而对于那些未知的 n-grams,Kneser-Ney 概率中从观测 n-grams 中 “借来” 的概率质量 \alpha(w_{i-1}) 与之前一样保持不变,而之前关于低阶模型 P(w_i) 的部分则由我们所说的延续概率 P_{\text{cont}}(w_i) 来替代。

在延续概率的计算公式中P_{\text{cont}}(w_i):分子部分计算的是一共有多少个 唯一 的上下文单词w_i-1 和当前单词w_i 共同出现过(co-occurrence);而分母部分就是将所有可能的单词 w_i 对应的不同共现上下文单词数量(即分子部分)进行一个累加。

3. 对数线性模型(Log-linear Model)

Log-linear语言模型的本质是把语言模型的建立看作是一个多元分类问题,核心是从上文中提取特征。

形式化地说,假设上文为 h,log-linear 语言模型需要一个特征函数 φ(h),其将上文信息作为输入,输出一个 N 维特征向量(注意这里是 feature vector 而不是 eigenvector),这个特征向量 x = φ(h) 就是对上文信息的一个表示(这里的 N 不是 N 元语法里面的那个 N,不表示上文单词数量!)

log-linear语言模型还允许灵活加入其它特征,这也是该模型的一大长处。常见的特征还包括

  • 上文的语义类别。可以使用聚类方法将相似单词聚类,这样,上文每个单词的独热编码不再是单词表长度,而是聚类得到的类别个数
  • 上文单词的其它语义信息,例如词性标注(POS-Tag)信息
  • 词袋特征。此时,不止考虑前面少数几个单词,而是考虑前面所有单词,统计它们出现的个数。注意在这种情况下会失去单词的位置信息,不过可以捕捉到单词的共现信息

Summary

Part I: Words
* Inflectional and derivational morphology
* Finite state methods and Regular expressions
* Bag-of-Words models and their applications
* Word Classes and Parts of speech
* Sequence Models (n-gram and Hidden Markov models, smoothing)
* The Viterbi algorithm, Forward Backward, EM

Part II: Syntax
* Syntactic Concepts (e.g., constituency, subcategorisation, bounded and unbounded dependencies, feature representations)
* Analysis in CFG - Greedy algorithms---Shift-reduce parsing
* Divide-and-conquer algorithms---CKY
* Lexicalised grammar formalisms (e.g., CCG, dependency grammar)
* Statistical parsing (PCFGs, dependency parsing)

Part III: Semantics and Discourse
* Logical semantics and compositionality
* Semantic derivations in grammar
* Lexical Semantics (e.g., word embeddings, word senses, semantic roles)
* Discourse (e.g., anaphora)

References

集合: 

https://blog.csdn.net/bensonrachel/category_8109347.html

multi language NLP - 知乎

Archive - YEY 的博客 | YEY Blog

单篇:

语言学小知识-什么是语素 - 知乎

什么是状态机?一篇文章就够了 - 掘金

自然语言处理 03:N-gram 语言模型 - YEY 的博客 | YEY Blog

词性标注(Part-of-Speech Tagging),HMM - 知乎
