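Google used word2vec to pretrain 300-dimensional word vectors on a news corpus and released them as GoogleNews-vectors-negative300.bin, which is 3.39 GB once decompressed.
The file can be loaded with gensim, but the machine needs enough memory to hold all of the vectors at once.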
- # Load the word vectors pretrained by Google
- import gensim
- model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
- print(model['love'])
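To get a rough sense of how much memory "enough" means here, the size of the vector matrix can be estimated from the vocabulary size and the dimension. This is only a back-of-the-envelope sketch: the figure of 3 million entries is taken from the published description of the GoogleNews model, 4 bytes per value assumes 32-bit floats, and `estimate_vector_bytes` is a hypothetical helper, not part of gensim.

```python
def estimate_vector_bytes(vocab_size, dim, bytes_per_value=4):
    """Bytes needed for a dense vocab_size x dim matrix (vectors only, no vocabulary strings)."""
    return vocab_size * dim * bytes_per_value


GIB = 1024 ** 3

# Assumption: about 3,000,000 words and phrases, 300 dimensions, float32 values.
googlenews_bytes = estimate_vector_bytes(3_000_000, 300)
print("{:.2f} GiB".format(googlenews_bytes / GIB))  # prints "3.35 GiB"
```

The estimate lands close to the size of the decompressed file, so roughly that much free RAM is needed for the vectors alone. If memory is tight, `load_word2vec_format` also accepts a `limit` argument that loads only the first N vectors in the file.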
The 300-dimensional GloVe word vectors are 5.25 GB.
- # To open GloVe vectors with gensim, prepend one header line to the file: <number of words> <vector dimension>
- import gensim
- import shutil
- from sys import platform
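- # Count the lines in the file, which equals the number of words
- def getFileLineNums(filename):
-     count = 0
-     # GloVe files contain non-ASCII tokens, so read them as UTF-8
-     with open(filename, 'r', encoding='utf-8') as f:
-         for line in f:
-             count += 1
-     return count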
-
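- # Open the word-vector file (on Linux or Windows) and write a copy with one extra line at the start
- def prepend_line(infile, outfile, line):
-     with open(infile, 'r', encoding='utf-8') as old:
-         with open(outfile, 'w', encoding='utf-8') as new:
-             new.write(str(line) + "\n")
-             # Copy the rest of the file in buffered chunks
-             shutil.copyfileobj(old, new)
-
- # Same result as prepend_line, but copies the file line by line
- def prepend_slow(infile, outfile, line):
-     with open(infile, 'r', encoding='utf-8') as fin:
-         with open(outfile, 'w', encoding='utf-8') as fout:
-             fout.write(line + "\n")
-             for row in fin:
-                 fout.write(row)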
-
- def load(filename):
-     num_lines = getFileLineNums(filename)
-     gensim_file = 'glove_model.txt'
-     # Header line: word count, then the vector dimension (300 for this file)
-     gensim_first_line = "{} {}".format(num_lines, 300)
-     # Prepend the header line.
-     if platform.startswith("linux"):
-         prepend_line(filename, gensim_file, gensim_first_line)
-     else:
-         prepend_slow(filename, gensim_file, gensim_first_line)
-
-     model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file)
-     return model
-
- model = load('glove.840B.300d.txt')
The generated glove_model.txt is a model file that gensim can open directly.
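The header step can also be sketched without hard-coding the dimension, by reading it from the first line of the GloVe file. The snippet below is a minimal illustration under a few assumptions: the input is UTF-8 text with one token followed by space-separated values per line, `add_word2vec_header` is a hypothetical helper name, and the two-word sample file is synthetic and only stands in for a real download.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Write out_path as "<count> <dim>" followed by the GloVe lines; return (count, dim)."""
    count = 0
    dim = 0
    # First pass: count the lines and infer the dimension from the first one.
    with open(glove_path, "r", encoding="utf-8") as src:
        for line in src:
            if count == 0:
                # One token followed by dim values, separated by single spaces
                dim = len(line.rstrip("\n").split(" ")) - 1
            count += 1
    # Second pass: write the header, then copy the original lines unchanged.
    with open(glove_path, "r", encoding="utf-8") as src:
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.write("{} {}\n".format(count, dim))
            for line in src:
                dst.write(line)
    return count, dim


# Tiny synthetic file standing in for a real GloVe download (assumption for the demo).
workdir = tempfile.mkdtemp()
sample_path = os.path.join(workdir, "tiny_glove.txt")
converted_path = os.path.join(workdir, "tiny_glove_w2v.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("love 0.1 0.2 0.3\n")
    f.write("hate -0.1 -0.2 -0.3\n")

header = add_word2vec_header(sample_path, converted_path)
print(header)  # (2, 3)
```

The converted file can then be passed to `gensim.models.KeyedVectors.load_word2vec_format` in the same way as glove_model.txt. gensim 4.0 and later also accept `no_header=True` in that call, which may make the header step unnecessary.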