PART I: Classical Machine Learning
Why vectorize words?
"Vectorization" can be understood as "numericalization". Why numericalize? Because text cannot be computed with directly, while numbers can. 【明天探索者 2021】 Whatever method is used to vectorize words, the goal is the same: to feed them into a model for training and computation in the next step. In one sentence, the purpose is to put words into a vector space.
What is the relationship between word features in text and vectorization?
In NLP, characters, words, word frequencies, n-grams, part-of-speech tags, and so on can all be regarded as features; once these features are vectorized, they can be fed into a model for computation. For example, the bag-of-words model uses word-frequency features, while word2vec can be seen as using the co-occurrence of words within a context window. 【明天探索者 2021】
The figure below shows the algorithm-level connections among the methods mentioned in this article.
Bag-of-Words (BoW): The BoW model separately matches and counts each element in the document to form a vector representation of a document. [Dongyang Yan 2020 Network-Based]
Concretely: a document is mapped into a vector v = [x1, x2, ..., xn], where xi denotes the occurrence of the ith word in the basic terms.
- The basic terms (lemmatized: ate --> eat, jumping --> jump; stopwords such as 'a' and 'the' removed) are usually the top n highest-frequency words collected from the datasets (note: from all documents, not only the single document being analysed).
- The value of the occurrence feature can be binary, term frequency, or term frequency-inverse document frequency (TF-IDF). A binary value denotes whether the ith word is present in a document, which ignores the weight of words. The term frequency is the number of occurrences of each word. TF-IDF assumes that the importance of a word increases proportionally to its frequency in a document but is offset by its frequency in the whole corpus. [Dongyang Yan 2020 Network-Based]
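The three occurrence features above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the toy corpus and the choice of all unique words as basic terms (rather than the top-n highest-frequency words the text describes) are simplifying assumptions.

```python
import math

# Toy corpus: each document is already tokenized (an assumption for this sketch)
docs = [
    ["i", "like", "water"],
    ["i", "want", "to", "drink", "water"],
]
# Basic terms: all unique words across the corpus, sorted for a stable order
terms = sorted({w for d in docs for w in d})

def binary_vector(doc):
    # 1 if the ith basic term appears in the document, else 0
    return [1 if t in doc else 0 for t in terms]

def tf_vector(doc):
    # Term frequency: count of each basic term in the document
    return [doc.count(t) for t in terms]

def tfidf_vector(doc):
    # TF-IDF: term frequency offset by how many documents contain the term
    n = len(docs)
    return [doc.count(t) * math.log(n / sum(t in d for d in docs))
            for t in terms]

print(terms)                  # ['drink', 'i', 'like', 'to', 'want', 'water']
print(tf_vector(docs[0]))     # [0, 1, 1, 0, 0, 1]
print(binary_vector(docs[1])) # [1, 1, 0, 1, 1, 1]
```

Note that words appearing in every document (here "i" and "water") get a TF-IDF weight of zero, since log(N/df) = log(1) = 0: the offset mentioned above in action.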
Example: 我喜欢水,很想喝水。("I like water and really want to drink it.") [Jonathan Hui]
basic terms: [我, 喜, 欢, 水, 很, …]
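The example above can be reproduced with a short sketch, assuming character-level tokenization (common for Chinese BoW examples) and dropping punctuation:

```python
from collections import Counter

sentence = "我喜欢水,很想喝水"
# Character-level tokens, dropping punctuation (an assumption for this sketch)
tokens = [ch for ch in sentence if ch not in ",。,"]
# Basic terms: unique characters in first-appearance order
terms = list(dict.fromkeys(tokens))
counts = Counter(tokens)
# Term-frequency BoW vector for the sentence
vector = [counts[t] for t in terms]
print(terms)   # ['我', '喜', '欢', '水', '很', '想', '喝']
print(vector)  # [1, 1, 1, 2, 1, 1, 1]
```

The character 水 ("water") appears twice, so its entry in the term-frequency vector is 2 while every other entry is 1.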