
[NLP] 2. Installing the word2vec Library, with an Example on the text8 Dataset


1. Installing the word2vec library

This is a Python interface to Google's word2vec.

Training is done with the original C code; everything else is pure Python with numpy.

To install the word2vec library, open a command prompt and run:

pip install word2vec

This failed with:

ERROR: Failed building wheel for word2vec
ERROR: Could not build wheels for word2vec which use PEP 517 and cannot be installed directly

Using pip3 instead made no difference.

Upgrading pip first:

pip3 install --upgrade pip

That did not help. The actual fix: building the package requires a gcc compiler. Following the reference posts on using gcc/g++ from the Windows command line and on fixing word2vec installation errors under Python 3, install Dev-C++ and add its compiler's bin directory to the PATH environment variable.
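Before retrying, you can confirm the compiler is actually reachable. A minimal sketch using only the standard library (shutil.which resolves executables through PATH):

import shutil

# If PATH was set correctly, this prints the full path to gcc.exe;
# None means the Dev-C++ (MinGW) bin directory is still missing from PATH.
print(shutil.which('gcc'))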

Running pip install word2vec again then succeeds.


2. A word2vec example on the text8 dataset

2.1 Training

Following the word2vec Jupyter notebook, download the text8.zip dataset (http://mattmahoney.net/dc/text8.zip). Unzipped, it is a single file containing one long stream of lowercase, space-separated English words. This is the corpus; it looks like an unstructured jumble at first glance, but let's follow along:

word2phrase: group words that frequently appear together, so that e.g. "Los Angeles" becomes the single token "Los_Angeles".

word2vec.word2phrase('...your path/text8', '...your path/text8-phrases', verbose=True)
Running command: word2phrase -train D:/University_Study/2021_Graduation_project/Code/text8/text8 -output D:/University_Study/2021_Graduation_project/Code/text8/text8-phrases -min-count 5 -threshold 100 -debug 2
Starting training using file D:/University_Study/2021_Graduation_project/Code/text8/text8

Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206

This creates a text8-phrases file that serves as a better input for word2vec. Note that you could easily skip this step and use the raw text data as the input to word2vec instead; a minimal sketch of that shortcut follows.
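For reference, the shortcut is just the same training call as below, pointed at the raw file (placeholder paths as in the rest of the post):

# Optional shortcut: train directly on the raw corpus, skipping word2phrase.
# Phrase tokens such as 'los_angeles' will then be absent from the vocabulary.
word2vec.word2vec('...your path/text8', '...your path/text8.bin', size=100, binary=True, verbose=True)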

word2vec: now train the word2vec model.

word2vec.word2vec('../data/text8-phrases', '../data/text8.bin', size=100, binary=True, verbose=True)
Running command: word2vec -train D:/University_Study/2021_Graduation_project/Code/text8/text8-phrases -output D:/University_Study/2021_Graduation_project/Code/text8/text8.bin -size 100 -window 5 -sample 1e-3 -hs 0 -negative 5 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -debug 2 -binary 1 -cbow 1
Starting training using file D:/University_Study/2021_Graduation_project/Code/text8/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000136  Progress: 99.52%  Words/thread/sec: 1339.35k

This creates a text8.bin file containing the word vectors in binary format.

word2clusters: cluster the vectors based on the trained model.

word2vec.word2clusters('../data/text8', '../data/text8-clusters.txt', 100, verbose=True)
Running command: word2vec -train D:/University_Study/2021_Graduation_project/Code/text8/text8 -output D:/University_Study/2021_Graduation_project/Code/text8/text8-clusters.txt -size 100 -window 5 -sample 1e-3 -hs 0 -negative 5 -threads 12 -iter 5 -min-count 5 -alpha 0.025 -debug 2 -binary 0 -cbow 1 -classes 100
Starting training using file D:/University_Study/2021_Graduation_project/Code/text8/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000129  Progress: 99.55%  Words/thread/sec: 1238.97k

This creates text8-clusters.txt, which records the cluster assigned to every word in the vocabulary.

Running the program raised an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb1 in position 0: ordinal not in range(128)

One commonly suggested fix (from posts on resolving UnicodeDecodeError):

import sys
reload(sys)                      # Python 2 only: reload is a builtin there
sys.setdefaultencoding('utf8')   # removed in Python 3, so this snippet cannot work there

This did not solve it (and as noted in the comments, it is Python 2 only anyway). What actually fixed the error was replacing the Chinese characters in the file path with English ones.
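A quick sanity check if you hit the same problem: the training functions shell out to C binaries, which on Windows can choke on non-ASCII paths. A minimal sketch (str.isascii needs Python 3.7+):

# The path from the logs above; True means no Chinese (non-ASCII) characters remain.
path = 'D:/University_Study/2021_Graduation_project/Code/text8/text8'
print(path.isascii())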

The run then completes, and the resulting cluster assignments can be inspected in text8-clusters.txt.

2.2 Predictions

Load the word2vec binary file created above:

model = word2vec.load('../data/text8.bin')

We can take a look at the vocabulary as a numpy array:

> model.vocab
['</s>' 'the' 'of' ... 'dupree' 'rauchbier' 'erythropoietin']

Or take a look at the whole embedding matrix:

> model.vectors.shape
(98331, 100)
> model.vectors
[[-0.16299245 -0.12382638 -0.11257623 ...  0.08034179 -0.12345736  -0.11461084]
 [ 0.01057408  0.04060991  0.00174766 ...  0.15856159  0.03144943  -0.06085312]
 [-0.07785802 -0.0693066  -0.15338072 ...  0.06256067  0.00675137   0.06014424]
 ...
 [ 0.08107938  0.19705792 -0.20361277 ... -0.06893674 -0.06521299   0.05619101]
 [ 0.13266037  0.10146856 -0.18063438 ... -0.02563198 -0.02282242  -0.02883757]
 [ 0.04729127  0.27832553 -0.26021555 ...  0.00669185 -0.05703441   0.11786372]]

We can retrieve the vector of an individual word:

> model['dog'].shape
(100,)
> model['dog'][:10]
[-0.01505851  0.15153734 -0.17255363 -0.05107318  0.01521252 -0.10319622 -0.06863069 -0.14023714  0.06822077  0.16727005]

We can calculate the distance between two or more words (all pairwise combinations):

> model.distance("dog", "cat", "fish")
[('dog', 'cat', 0.8169705260977251), ('dog', 'fish', 0.5983957374283233), ('cat', 'fish', 0.6765780316732376)]

2.3 Similarity

We can run simple queries to retrieve words similar to "dog", ranked by cosine similarity:

> indexes, metrics = model.similar("dog")
> indexes, metrics
[ 2436  4762  5473  3774  7611  9571  6955 17003  2428 17265] [0.81697053 0.75968385 0.75632396 0.75286234 0.74466203 0.73437095 0.73330186 0.73220809 0.73136231 0.72864531]

This returns a tuple with 2 items:

  1. a numpy array with the indexes of the similar words in the vocabulary
  2. a numpy array with the cosine similarity to each word

Combining the two by hand is sketched right below.
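A minimal sketch of pairing the two arrays manually; generate_response (shown further down) packages the same information into a record array:

# Pair each vocabulary index with its cosine similarity.
for i, m in zip(indexes, metrics):
    print(model.vocab[i], m)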

We can get the words for those indexes:

> model.vocab[indexes]
['cat' 'bull' 'cow' 'grey' 'goat' 'frog' 'pink' 'coyote' 'bear' 'stuffed']

There is a helper function that builds a combined response as a numpy record array:

> model.generate_response(indexes, metrics)
[('cat', 0.81697053) ('bull', 0.75968385) ('cow', 0.75632396)
 ('grey', 0.75286234) ('goat', 0.74466203) ('frog', 0.73437095)
 ('pink', 0.73330186) ('coyote', 0.73220809) ('bear', 0.73136231)
 ('stuffed', 0.72864531)]

It is easy to turn that numpy array into a pure Python response:

> model.generate_response(indexes, metrics).tolist()
[('cat', 0.8169705260977251), ('bull', 0.7596838526023106), ('cow', 0.7563239593288873), ('grey', 0.752862335507535), ('goat', 0.7446620309365173), ('frog', 0.7343709509684961), ('pink', 0.7333018648258744), ('coyote', 0.732208090693304), ('bear', 0.7313623065960382), ('stuffed', 0.7286453086029295)]

2.4 Phrases

Since we trained the model on the output of word2phrase, we can ask for the similarity of "phrases", i.e. compound tokens such as "Los Angeles":

> indexes, metrics = model.similar('los_angeles')
> model.generate_response(indexes, metrics).tolist()
[('san_francisco', 0.8607764509277348), ('detroit', 0.8342075097061693), ('chicago', 0.8261110777891385), ('boston', 0.8202495210798733), ('kansas_city', 0.7897960239270461), ('california', 0.788759543323913), ('melbourne', 0.7883319200147727), ('cincinnati', 0.7871500441116634), ('st_louis', 0.7845951620359091), ('new_jersey', 0.7816740879950232)]

2.5 Analogies

It can also handle more complex queries, such as analogies: king - man + woman = queen. This method returns the same word indexes and metrics as the cosine-similarity queries above:

> indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
> indexes, metrics
[ 648  344 1006 1335 7530 1087 6756 1145 1770 1269] [0.29450657 0.29016691 0.28739799 0.28567908 0.28342768 0.27816884 0.27671222 0.273693   0.26996926 0.26953137]
> model.generate_response(indexes, metrics).tolist()
[('emperor', 0.294506569146581), ('son', 0.2901669060777821), ('daughter', 0.2873979913641791), ('wife', 0.2856790781367689), ('empress', 0.28342767855598494), ('queen', 0.27816883698775996), ('roman_emperor', 0.2767122238883053), ('prince', 0.27369300450863804), ('pope', 0.26996926448193403), ('mary', 0.26953137270309613)]

The model seems slightly undertrained here: the top result is "emperor", and "queen" only shows up well down the list...
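To put a number on that impression, a minimal sketch reading off where "queen" lands in the returned ranking (given the output above, this prints 5, i.e. sixth place):

# model.vocab[indexes] maps the returned indexes back to the ranked words.
ranked_words = list(model.vocab[indexes])
print(ranked_words.index('queen'))  # 0-based position of 'queen' among the candidates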

2.6 Clusters

clusters = word2vec.load_clusters('../data/text8-clusters.txt')

We can get the cluster number for individual words; the cluster object also exposes the vocabulary (see the sketch after the next output):

> clusters.vocab
['</s>' 'the' 'of' ... 'koba' 'skirting' 'selectors']
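Looking up one word's cluster id, a hedged sketch assuming the Clusters object supports word indexing as in the library's notebook (the exact id varies between training runs):

# Cluster id assigned to a single word.
print(clusters['dog'])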

We can get all the words grouped in a specific cluster:

> clusters.get_words_on_cluster(90).shape
(483,)
> clusters.get_words_on_cluster(90)[:10]
['wrote' 'appeared' 'met' 'says' 'told' 'doctor' 'felt' 'got' 'heard' 'learned']

We can attach the clusters to the word2vec model and generate a response that includes each word's cluster number:

> model.clusters = clusters
> indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
> model.generate_response(indexes, metrics).tolist()
[('vienna', 0.28506898987953155, 85), ('munich', 0.27632371395430655, 53), ('moscow', 0.27571983939821915, 13), ('grammar_school', 0.27490988138939293, 30), ('leipzig', 0.27365748314826543, 75), ('geneva', 0.272046117377734, 1), ('berlin', 0.2700350465374118, 96), ('prague', 0.26876277033638435, 75), ('trinity_college', 0.26694007759783456, 24), ('forest_lawn', 0.26564294888051326, 55)]

The complete code:

import word2vec

# Training
word2vec.word2phrase('...your path/text8', '...your path/text8-phrases', verbose=True)
word2vec.word2vec('...your path/text8-phrases', '...your path/text8.bin', size=100, binary=True, verbose=True)
word2vec.word2clusters('...your path/text8', '...your path/text8-clusters.txt', 100, verbose=True)

# Predictions
model = word2vec.load('...your path/text8.bin')

print(model.vocab)
print(model.vectors.shape)
print(model.vectors)
print(model['dog'].shape)
print(model['dog'][:10])
print(model.distance('dog','cat','fish'))

# Similarity
indexes, metrics = model.similar("dog")
print(indexes, metrics)
print(model.vocab[indexes])
print(model.generate_response(indexes, metrics))
print(model.generate_response(indexes, metrics).tolist())

# Phrases
indexes, metrics = model.similar('los_angeles')
print(model.generate_response(indexes, metrics).tolist())

# Analogies
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
print(indexes, metrics)
print(model.generate_response(indexes, metrics).tolist())

# Clusters
clusters = word2vec.load_clusters('...your path/text8-clusters.txt')
print(clusters.vocab)
print(clusters.get_words_on_cluster(90).shape)
print(clusters.get_words_on_cluster(90)[:10])

model.clusters = clusters
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
print(model.generate_response(indexes, metrics).tolist())

Summary

We installed Google's word2vec library and took a first pass at it, covering the basic vocabulary representation, the vector matrix, similarity queries, analogy queries, clustering, and so on.

The previous post covered converting GloVe vectors to word2vec format with gensim; next, I want to look at getting started with word2vec in gensim and at training on a Chinese corpus.
