A Python interface to Google's word2vec.
Training is done with the original C code; everything else is pure Python with numpy.
To install the word2vec library, open a command line and type:
pip install word2vec
This fails:
ERROR: Failed building wheel for word2vec
ERROR: Could not build wheels for word2vec which use PEP 517 and cannot be installed directly
pip3 fails the same way.
Try upgrading pip:
pip3 install --upgrade pip
Still no luck. The actual fix: building the package needs a gcc compiler. Following the posts 《windows照样命令行gcc/g++》 and 《python3安装word2vec模块错误处理》, install Dev-C++ and add its compiler bin directory to the PATH environment variable.
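A quick sanity check that the build step can now find the compiler (prints gcc's full path, or None if it still isn't on PATH):
import shutil

# prints the full path to gcc if it is visible on PATH, otherwise None
print(shutil.which('gcc'))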
Run pip install word2vec again, and this time it installs successfully.
Following the word2vec Jupyter notebook, download the text8.zip dataset from the URL given there; it looks like this:
This is the corpus. It looks like an unstructured jumble of words, but let's follow along anyway.
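First, a minimal peek at the file (assuming text8.zip has been unzipped; replace the placeholder path with your own):
# text8 is a single long line of lowercase, space-separated words
with open('...your path/text8', encoding='utf-8') as f:
    print(f.read(200))  # peek at the first 200 characters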
word2phrase: combine words like "Los Angeles" into "Los_Angeles".
word2vec.word2phrase('...your path/text8','...your path/text8-phrases', verbose=True)
Running command: word2phrase -train D:/University_Study/2021_Graduation_project/Code/text8/text8 -output D:/University_Study/2021_Graduation_project/Code/text8/text8-phrases -min-count 5 -threshold 100 -debug 2
Starting training using file D:/University_Study/2021_Graduation_project/Code/text8/text8
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
This creates a text8-phrases file that makes a better input for word2vec. Note that this step is easy to skip: the plain text data can be used as input to word2vec directly.
word2vec: now train the word2vec model.
word2vec.word2vec('../data/text8-phrases', '../data/text8.bin', size=100, binary=True, verbose=True)
Running command: word2vec -train D:/University_Study/2021_Graduation_project/Code/text8/text8-phrases -output D:/University_Study/2021_Graduation_project/Code/text8/text8.bin -size 100 -window 5 -sample 1e-3 -hs 0 -negative 5 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -debug 2 -binary 1 -cbow 1
Starting training using file D:/University_Study/2021_Graduation_project/Code/text8/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000136 Progress: 99.52% Words/thread/sec: 1339.35k
This creates a text8.bin file containing the word vectors in binary format.
word2clusters: generate clusters of the vectors based on the trained model.
word2vec.word2clusters('../data/text8', '../data/text8-clusters.txt', 100, verbose=True)
Running command: word2vec -train D:/University_Study/2021_Graduation_project/Code/text8/text8 -output D:/University_Study/2021_Graduation_project/Code/text8/text8-clusters.txt -size 100 -window 5 -sample 1e-3 -hs 0 -negative 5 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -debug 2 -binary 0 -cbow 1 -classes 100
Starting training using file D:/University_Study/2021_Graduation_project/Code/text8/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000129 Progress: 99.55% Words/thread/sec: 1238.97k
This creates a text8-clusters.txt file containing the cluster assigned to every word in the vocabulary.
Running the program raised an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb1 in position 0: ordinal not in range(128)
A suggested fix, from the post 《解决UnicodeDecodeError》:
import sys
reload(sys)                     # Python 2 only: reload() is not a builtin in Python 3
sys.setdefaultencoding('utf8')  # Python 2 only: removed in Python 3
That doesn't help (and the trick above only ever worked on Python 2). What actually fixed it was changing the Chinese characters in the file path to English.
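A quick way to spot an offending path (str.isascii() needs Python 3.7+; the path below is just an illustration):
# False means the path contains non-ASCII characters (e.g. Chinese),
# which is what triggered the 'ascii' codec error here
path = 'D:/University_Study/2021_Graduation_project/Code/text8/text8'
print(path.isascii())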
Let's look at the resulting clusters.
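A minimal peek at the output file, assuming the standard text format of one "word cluster_id" pair per line:
# print the first few lines of the clusters file:
# each line is a word followed by its cluster id
with open('...your path/text8-clusters.txt', encoding='utf-8') as f:
    for _ in range(5):
        print(f.readline().rstrip())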
Now import the word2vec binary file created above:
model = word2vec.load('../data/text8.bin')
We can look at the vocabulary as a numpy array:
> model.vocab
['</s>' 'the' 'of' ... 'dupree' 'rauchbier' 'erythropoietin']
Or look at the whole matrix:
> model.vectors.shape
(98331, 100)
> model.vectors
[[-0.16299245 -0.12382638 -0.11257623 ... 0.08034179 -0.12345736 -0.11461084]
[ 0.01057408 0.04060991 0.00174766 ... 0.15856159 0.03144943 -0.06085312]
[-0.07785802 -0.0693066 -0.15338072 ... 0.06256067 0.00675137 0.06014424]
...
[ 0.08107938 0.19705792 -0.20361277 ... -0.06893674 -0.06521299 0.05619101]
[ 0.13266037 0.10146856 -0.18063438 ... -0.02563198 -0.02282242 -0.02883757]
[ 0.04729127 0.27832553 -0.26021555 ... 0.00669185 -0.05703441 0.11786372]]
We can retrieve the vector for an individual word:
> model['dog'].shape
(100,)
> model['dog'][:10]
[-0.01505851 0.15153734 -0.17255363 -0.05107318 0.01521252 -0.10319622 -0.06863069 -0.14023714 0.06822077 0.16727005]
We can calculate the distance between two or more words (all pairwise combinations):
> model.distance("dog", "cat", "fish")
[('dog', 'cat', 0.8169705260977251), ('dog', 'fish', 0.5983957374283233), ('cat', 'fish', 0.6765780316732376)]
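As a sanity check, this "distance" is cosine similarity, and the dog/cat value can be reproduced by hand with numpy:
import numpy as np

# cosine similarity between the two word vectors; should print ~0.8170,
# matching the ('dog', 'cat', 0.8169...) entry above
dog, cat = model['dog'], model['cat']
print(np.dot(dog, cat) / (np.linalg.norm(dog) * np.linalg.norm(cat)))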
We can do simple queries based on cosine similarity, retrieving words similar to "dog":
> indexes, metrics = model.similar("dog")
> indexes, metrics
[ 2436 4762 5473 3774 7611 9571 6955 17003 2428 17265] [0.81697053 0.75968385 0.75632396 0.75286234 0.74466203 0.73437095 0.73330186 0.73220809 0.73136231 0.72864531]
This returns a tuple with two items: a numpy array with the indexes of the similar words in the vocabulary, and a numpy array with the cosine similarity of each.
We can get the words for those indexes:
> model.vocab[indexes]
['cat' 'bull' 'cow' 'grey' 'goat' 'frog' 'pink' 'coyote' 'bear' 'stuffed']
There is a helper function that combines the two into a numpy record array:
> model.generate_response(indexes, metrics)
[('cat', 0.81697053) ('bull', 0.75968385) ('cow', 0.75632396)
('grey', 0.75286234) ('goat', 0.74466203) ('frog', 0.73437095)
('pink', 0.73330186) ('coyote', 0.73220809) ('bear', 0.73136231)
('stuffed', 0.72864531)]
It's easy to turn that numpy array into a pure Python response:
> model.generate_response(indexes, metrics).tolist()
[('cat', 0.8169705260977251), ('bull', 0.7596838526023106), ('cow', 0.7563239593288873), ('grey', 0.752862335507535), ('goat', 0.7446620309365173), ('frog', 0.7343709509684961), ('pink', 0.7333018648258744), ('coyote', 0.732208090693304), ('bear', 0.7313623065960382), ('stuffed', 0.7286453086029295)]
Since we trained the model on the output of word2phrase, we can also ask for the similarity of "phrases", i.e. multi-word tokens such as "Los Angeles":
> indexes, metrics = model.similar('los_angeles')
> model.generate_response(indexes, metrics).tolist()
[('san_francisco', 0.8607764509277348), ('detroit', 0.8342075097061693), ('chicago', 0.8261110777891385), ('boston', 0.8202495210798733), ('kansas_city', 0.7897960239270461), ('california', 0.788759543323913), ('melbourne', 0.7883319200147727), ('cincinnati', 0.7871500441116634), ('st_louis', 0.7845951620359091), ('new_jersey', 0.7816740879950232)]
It can do more complex queries too, such as analogies: king - man + woman = queen. This method returns the same word indexes and metrics as the cosine-similarity queries above:
> indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
> indexes, metrics
[ 648 344 1006 1335 7530 1087 6756 1145 1770 1269] [0.29450657 0.29016691 0.28739799 0.28567908 0.28342768 0.27816884 0.27671222 0.273693 0.26996926 0.26953137]
> model.generate_response(indexes, metrics).tolist()
[('emperor', 0.294506569146581), ('son', 0.2901669060777821), ('daughter', 0.2873979913641791), ('wife', 0.2856790781367689), ('empress', 0.28342767855598494), ('queen', 0.27816883698775996), ('roman_emperor', 0.2767122238883053), ('prince', 0.27369300450863804), ('pope', 0.26996926448193403), ('mary', 0.26953137270309613)]
The model feels a little undertrained here: the top result is "emperor", and "queen" is buried much further down the list...
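For intuition, here is a rough numpy sketch of what the analogy query does (assuming the rows of model.vectors are unit-normalised; the library's exact ranking and handling of query words may differ):
import numpy as np

# target vector: king - man + woman, re-normalised
target = model['king'] - model['man'] + model['woman']
target = target / np.linalg.norm(target)

# cosine similarity of every word against the target, highest first
sims = np.dot(model.vectors, target)
best = np.argsort(sims)[::-1][:13]
print([w for w in model.vocab[best] if w not in ('king', 'man', 'woman')][:10])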
Finally, load the clusters created earlier:
clusters = word2vec.load_clusters('../data/text8-clusters.txt')
We can get the cluster number for individual words. First, the clusters object exposes its vocabulary:
> clusters.vocab
['</s>' 'the' 'of' ... 'koba' 'skirting' 'selectors']
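An individual word's cluster id can then be looked up by indexing the clusters object, as in the reference notebook:
print(clusters['dog'])  # cluster number assigned to 'dog'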
And we can get all the words in a particular cluster:
> clusters.get_words_on_cluster(90).shape
(483,)
> clusters.get_words_on_cluster(90)[:10]
['wrote' 'appeared' 'met' 'says' 'told' 'doctor' 'felt' 'got' 'heard' 'learned']
We can add the clusters to the word2vec model and generate a response that includes the cluster number of each word (the third element in each tuple below):
> model.clusters = clusters
> indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
> model.generate_response(indexes, metrics).tolist()
[('vienna', 0.28506898987953155, 85), ('munich', 0.27632371395430655, 53), ('moscow', 0.27571983939821915, 13), ('grammar_school', 0.27490988138939293, 30), ('leipzig', 0.27365748314826543, 75), ('geneva', 0.272046117377734, 1), ('berlin', 0.2700350465374118, 96), ('prague', 0.26876277033638435, 75), ('trinity_college', 0.26694007759783456, 24), ('forest_lawn', 0.26564294888051326, 55)]
The complete code is below:
import word2vec

# Training
word2vec.word2phrase('...your path/text8', '...your path/text8-phrases', verbose=True)
word2vec.word2vec('...your path/text8-phrases', '...your path/text8.bin', size=100, binary=True, verbose=True)
word2vec.word2clusters('...your path/text8', '...your path/text8-clusters.txt', 100, verbose=True)

# Predictions
model = word2vec.load('...your path/text8.bin')
print(model.vocab)
print(model.vectors.shape)
print(model.vectors)
print(model['dog'].shape)
print(model['dog'][:10])
print(model.distance('dog', 'cat', 'fish'))

# Similarity
indexes, metrics = model.similar("dog")
print(indexes, metrics)
print(model.vocab[indexes])
print(model.generate_response(indexes, metrics))
print(model.generate_response(indexes, metrics).tolist())

# Phrases
indexes, metrics = model.similar('los_angeles')
print(model.generate_response(indexes, metrics).tolist())

# Analogies
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
print(indexes, metrics)
print(model.generate_response(indexes, metrics).tolist())

# Clusters
clusters = word2vec.load_clusters('...your path/text8-clusters.txt')
print(clusters.vocab)
print(clusters.get_words_on_cluster(90).shape)
print(clusters.get_words_on_cluster(90)[:10])
model.clusters = clusters
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
print(model.generate_response(indexes, metrics).tolist())
To sum up: installed and took a first quick look at Google's word2vec library, covering the basic vocabulary representation, the vector matrix, similarity queries, analogy queries, clustering, and so on.
The previous post covered converting GloVe vectors to word2vec format with gensim; next up is getting started with word2vec in gensim and training on a Chinese corpus.