Word embedding may be the most frequently mentioned term in NLP, and the idea behind it is actually quite simple.
What word embedding means is this: given a document, i.e. a sequence of words such as "A B A C B F G", we want to learn, for every distinct word in the document, a corresponding (usually low-dimensional) vector representation. For the sequence "A B A C B F G", we might end up with A mapped to the vector [0.1, 0.6, -0.5] and B mapped to [-0.2, 0.9, 0.7] (the numbers here are purely illustrative).
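In other words, an embedding is just a lookup table from distinct words to dense vectors. Below is a minimal sketch of that table for the toy sequence above; the vectors are randomly initialized for illustration, whereas in practice they would come from training (word2vec, GloVe, etc.).

```python
import numpy as np

# A word embedding is essentially a lookup table: each distinct word in the
# corpus maps to a (low-dimensional) dense vector. Here the vectors are random
# placeholders purely for illustration; real values come from training.

corpus = "A B A C B F G".split()
vocab = sorted(set(corpus))          # distinct words: A, B, C, F, G

dim = 3                              # embedding dimensionality (illustrative)
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=dim) for w in vocab}   # word -> vector

# The document is then represented as a sequence of vectors:
doc_vectors = np.stack([embedding[w] for w in corpus])
print(doc_vectors.shape)             # (7, 3): 7 tokens, 3 dimensions each
```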
Word embedding is not a new topic; people worked on it long ago. One example is Bengio's paper "Neural probabilistic language models", and even that is not the earliest: Hinton had already proposed the idea of distributed representations in "Learning distributed representations of concepts" (though not applied to word embeddings). When I asked Hinton at AAAI 2015 what he thought of Google's word2vec, he said he had done this sort of thing twenty years ago, haha; presumably he was referring to that paper.
If word embedding is such an old topic, why did it suddenly take off? The reason is the two papers Tomas Mikolov published while at Google: "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".
Both papers appeared in 2013, and as of this writing (August 2015) each has already accumulated several hundred citations, which shows how influential they are. Of course, there are many other word embedding approaches; commonly cited ones include the following (a minimal training sketch follows the list):
1. Distributed Representations of Words and Phrases and their Compositionality
2. Efficient Estimation of Word Representations in Vector Space
3. GloVe: Global Vectors for Word Representation
4. Neural probabilistic language models
5. Natural language processing (almost) from scratch
6. Learning word embeddings efficiently with noise contrastive estimation
7. A scalable hierarchical distributed language model
8. Three new graphical models for statistical language modelling
9. Improving word representations via global context and multiple word prototypes
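As a concrete illustration of how word2vec-style embeddings are obtained in practice, here is a minimal sketch using the gensim library (parameter names follow the gensim 4.x API; older versions used `size` instead of `vector_size`). The tiny corpus and hyperparameter values are purely illustrative.

```python
from gensim.models import Word2Vec

# Minimal skip-gram word2vec training sketch with gensim (4.x API assumed).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every word, even if it appears only once
    sg=1,             # 1 = skip-gram (with negative sampling by default)
    epochs=50,        # extra passes because the toy corpus is tiny
)

print(model.wv["cat"])                 # the learned vector for "cat"
print(model.wv.most_similar("cat"))    # nearest words by cosine similarity
```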
As of this writing (August 2015), the models in word2vec still hold a number of open questions, so quite a few papers have tried to explain some of these puzzles or to establish connections with other models. Here is a paper list (a small sketch of the matrix-factorization view follows it):
1. Neural Word Embeddings as Implicit Matrix Factorization
2. Linguistic Regularities in Sparse and Explicit Word Representations
3. Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings
4. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
5. Linking GloVe with word2vec
6. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
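To give a feel for the matrix-factorization view taken in the first paper above (which relates skip-gram with negative sampling to factorizing a shifted PMI matrix), here is a rough sketch that builds a PPMI word-context matrix from co-occurrence counts and factorizes it with a truncated SVD to obtain word vectors. This only illustrates the general idea; it is not the exact derivation or procedure from those papers, and the corpus, window size, and dimensionality are made-up toy values.

```python
import numpy as np

# Sketch of the "embeddings as matrix factorization" view: PPMI matrix + SVD.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
window = 2  # symmetric context window (illustrative choice)

# Word-context co-occurrence counts within the window.
counts = np.zeros((V, V))
for i, w in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1

# Positive PMI: max(0, log p(w, c) / (p(w) * p(c))).
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)  # log(0) = -inf gets clipped to 0 here

# Truncated SVD of the PPMI matrix yields low-dimensional word vectors.
U, S, _ = np.linalg.svd(ppmi)
dim = 3
word_vectors = U[:, :dim] * np.sqrt(S[:dim])
print({w: np.round(word_vectors[idx[w]], 2) for w in vocab})
```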