
Vector Representation of Words: A Survey of Methods

Word Vectorization

PART I: Classical Machine Learning

Why vectorize words?

"Vectorization" can be understood as "numericalization". Why numericalize? Because text cannot be operated on mathematically, while numbers can. 【明天探索者 2021】 Whatever method is used to vectorize words, the goal is the same: to feed them into a model for training in the next step. Summed up in one sentence, the purpose is to put words into a vector space.

What do the features of words in text have to do with vectorization?

In NLP, characters, words, word frequencies, n-grams, part-of-speech tags, and so on can all be treated as features; once these features are vectorized, they can be fed into a model for computation. For example, the bag-of-words model uses word-frequency features, while word2vec can be viewed as exploiting co-occurrence features of the text within a window. 【明天探索者 2021】

The figure below shows how the algorithms mentioned in this article relate to one another.

Bag-of-Words (BoW): The BoW model separately matches and counts each element in the document to form a vector representation of a document. [Dongyang Yan 2020 Network-Based]

Concretely: a document is mapped to a vector v = [x1, x2, ..., xn], where xi denotes the occurrence of the ith word among the basic terms.

        - The basic terms (reduced to base forms, e.g. ate --> eat, jumping --> jump, and with stopwords such as 'a' and 'the' removed) are usually the top n highest-frequency words collected from the datasets (note: from all documents, not just the single document being analysed).

        - The occurrence value xi can be binary, a term frequency, or a term frequency-inverse document frequency (TF-IDF) weight. A binary value denotes only whether the ith word is present in a document, disregarding word weight. The term frequency is the number of occurrences of each word. TF-IDF assumes that the importance of a word increases proportionally with its frequency in a document but is offset by its frequency across the whole corpus. [Dongyang Yan 2020 Network-Based]
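For reference, a common textbook formulation of TF-IDF (the cited paper may use a smoothed variant) is, for a term $t$ in document $d$, with $N$ the total number of documents and $\mathrm{df}(t)$ the number of documents containing $t$:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$

A word that occurs in every document thus gets weight $\log(N/N) = 0$, no matter how frequent it is locally.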

Example: 我喜欢水,很想喝水。("I like water, and I really want to drink water.") [Jonathan Hui]

        basic terms: [我, 喜, 欢, 水, 很, 想, 喝]; using term frequency, the document vector is [1, 1, 1, 2, 1, 1, 1], since 水 appears twice.
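The following is a minimal sketch of this pipeline in Python, assuming character-level tokens as in the example above; the helper names build_basic_terms and vectorize are illustrative, not from any particular library:

```python
from collections import Counter

def build_basic_terms(documents, n):
    """Collect the top-n highest-frequency tokens across ALL documents."""
    counts = Counter(token for doc in documents for token in doc)
    return [token for token, _ in counts.most_common(n)]

def vectorize(doc, basic_terms, mode="tf"):
    """Map one document to v = [x1, ..., xn] over the basic terms."""
    counts = Counter(doc)
    if mode == "binary":                          # presence/absence only
        return [int(t in counts) for t in basic_terms]
    return [counts[t] for t in basic_terms]       # raw term frequency

docs = [list("我喜欢水很想喝水")]   # tokenized, punctuation stripped
terms = build_basic_terms(docs, n=7)
print(terms)                        # 7 unique characters
print(vectorize(docs[0], terms))    # 水 occurs twice -> its entry is 2
```

Note that most_common orders the basic terms by frequency, so 水 comes first here rather than appearing in sentence order as in the listing above; the resulting vector carries the same information either way.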
