    # coding: utf8
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    import math

    """
    arr1 = [[1, 2]]
    arr2 = [[4, 5]]
    """
    # Note: np.arange(2) has an integer dtype, so the floats assigned below
    # are truncated to ints (see the printed arrays in the output).
    arr1 = np.arange(2).reshape(1, 2)
    arr1[0] = [100 * 1 / math.sqrt(5), 100 * 2 / math.sqrt(5)]
    arr2 = np.arange(2).reshape(1, 2)
    arr2[0] = [100 * 4 / math.sqrt(41), 100 * 5 / math.sqrt(41)]
    print(arr1)
    print(arr2)
    ret = cosine_similarity(arr1, arr2)
    print(ret)
    """
    [[44 89]]
    [[62 78]]
    [[0.97751451]]
    """
sklearn implements it with the following code:
    def cosine_similarity(X, Y=None, dense_output=True):
        """Compute cosine similarity between samples in X and Y.

        Cosine similarity, or the cosine kernel, computes similarity as the
        normalized dot product of X and Y:

            K(X, Y) = <X, Y> / (||X||*||Y||)

        On L2-normalized data, this function is equivalent to linear_kernel.

        Read more in the :ref:`User Guide <cosine_similarity>`.

        Parameters
        ----------
        X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
            Input data.

        Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features), \
                default=None
            Input data. If ``None``, the output will be the pairwise
            similarities between all samples in ``X``.

        dense_output : bool, default=True
            Whether to return dense output even when the input is sparse. If
            ``False``, the output is sparse if both input arrays are sparse.

            .. versionadded:: 0.17
               parameter ``dense_output`` for dense output.

        Returns
        -------
        kernel matrix : ndarray of shape (n_samples_X, n_samples_Y)
        """
        # to avoid recursive import

        X, Y = check_pairwise_arrays(X, Y)

        X_normalized = normalize(X, copy=True)
        if X is Y:
            Y_normalized = X_normalized
        else:
            Y_normalized = normalize(Y, copy=True)

        K = safe_sparse_dot(X_normalized, Y_normalized.T,
                            dense_output=dense_output)

        return K
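To convince myself of the formula in the docstring, here is a minimal pure-Python sketch that computes K(X, Y) = <X, Y> / (||X||*||Y||) by hand. The helper names `cosine_similarity_1d` and `l2_normalize` are my own and are not part of sklearn. It also compares the original vectors [1, 2] and [4, 5] with the integer-truncated ones printed by the script at the top.

```python
import math


def cosine_similarity_1d(u, v):
    """Cosine similarity of two equal-length vectors: <u, v> / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2_normalize(u):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


# Direct application of the formula to the original vectors.
raw = cosine_similarity_1d([1, 2], [4, 5])

# Same idea as the sklearn code: normalize first, then take a plain dot product.
u_hat, v_hat = l2_normalize([1, 2]), l2_normalize([4, 5])
via_normalized = sum(a * b for a, b in zip(u_hat, v_hat))

# The integer-truncated vectors that the script at the top actually printed.
truncated = cosine_similarity_1d([44, 89], [62, 78])

print(raw, via_normalized, truncated)
```

The first two values agree, which is why normalizing and then taking a dot product is enough. The third is slightly smaller because truncating to integers nudges the direction of each vector.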
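I won't spend much time defining word2vec, and my own understanding of it is not very deep, but my first impression was that it could be used for memorizing vocabulary.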
The resulting vector from “king-man+woman” doesn’t exactly equal “queen”, but “queen” is the closest word to it from the 400,000 word embeddings we have in this collection.
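Whether in Chinese or in English, words carry a degree of semantic relatedness. If I mention Zhang San, you may think of Li Si; mention a lion and you think of a tiger. word2vec is precisely a technique that makes this property concrete in the form of word vectors. So among many semantically related candidate words, following word2vec's lead often lets you learn one word and pick up its neighbors along the way, which makes memorizing vocabulary both more efficient and faster.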
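A toy sketch of the "king - man + woman" idea, in plain Python. The 3-dimensional vectors below are made up by me purely for illustration (real embeddings are learned and have hundreds of dimensions), and `nearest_word` is a hypothetical helper, not a gensim function.

```python
import math

# Hypothetical toy "embeddings", invented for illustration only.
# The axes loosely stand for: royalty, maleness, femaleness.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "lion": [0.3, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_word(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    candidates = {w: vec for w, vec in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Element-wise king - man + woman.
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Skip the three query words so the answer is a genuinely new word.
answer = nearest_word(target, toy_vectors, exclude=("king", "man", "woman"))
print(answer)
```

With these hand-picked numbers the closest remaining word is "queen", but as the quote says, the arithmetic result only lands near that vector rather than exactly on it.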
To learn more about word2vec, see the two articles below; I think they explain it very well.
How does word2vec obtain word vectors? (word2Vec 是如何得到词向量的)
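Relatively speaking, the most suitable sources are texts related to Wikipedia, the CET-4 and CET-6 English exams, IELTS, and the like; that kind of content is a natural fit. Collecting the material is tedious and requires writing targeted crawler scripts. I don't need that level of precision yet, so for now I am using the English Harry Potter script instead.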
sentences = word2vec.LineSentence("haripoter.txt")
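`LineSentence` expects one sentence per line with tokens separated by whitespace, so the raw text is worth cleaning first. The `'corridor,'` neighbor in the results later in this post suggests my corpus still had punctuation glued to words. Below is a rough sketch of a cleanup step; `to_line_sentences` is a hypothetical helper of my own, the sample sentence is invented, and the sentence splitting is deliberately naive.

```python
import re


def to_line_sentences(raw_text):
    """Split raw text into lowercase, punctuation-free sentences (one string each)."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    cleaned = []
    for sentence in sentences:
        # Keep only runs of letters and apostrophes, dropping commas, quotes, etc.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if tokens:
            cleaned.append(" ".join(tokens))
    return cleaned


# Invented sample text, just to show the effect.
sample = "Harry ran down the corridor, past the hall. He reached the kitchen!"
lines = to_line_sentences(sample)
for line in lines:
    print(line)
```

Joining `lines` with newlines and writing them to the corpus file would give a file in the shape `LineSentence` reads.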
Here I use the gensim library to train and build the model, which is quick and convenient.
    from gensim.models import word2vec

    sentences = word2vec.LineSentence("haripoter.txt")
    model = word2vec.Word2Vec(
        sentences,
        sg=0,  # 0 = CBOW, 1 = skip-gram
        # size=100,
        # window=2,
        # negative=3,
        # sample=0,
        # hs=1,
        # workers=4,
        size=250,  # embedding dimension; this argument is named vector_size in gensim 4.x
    )
    model.save("haripoter.model")
    from gensim.models import word2vec

    model = word2vec.Word2Vec.load("haripoter.model")
    # Query through model.wv; the model-level most_similar shortcut was removed in gensim 4.x.
    print(model.wv.most_similar("kitchen", topn=3))
[('hall', 0.9646997451782227), ('corridor,', 0.9580787420272827), ('hole', 0.9575302600860596)]
A half-finished vocabulary-memorization app: https://github.com/guoruibiao/vocabulary
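1 Crawl more material to build a more accurate model.
2 Follow up on the "build sentences from words" need for words I got wrong; in my view this is also a hot topic in NLP.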
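This post started from a mathematical formula (cosine similarity), moved on to the word2vec algorithm, and then applied it to a concrete tool (not yet in use, because the material is incomplete). The overall line of thought is fairly clear, and it would be worth investing more effort to optimize it later. Time is limited, so I'll stop here.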
