A rant up front: I find many of Gensim's design decisions baffling, as if nobody thought them through. Example code below:
```python
import datetime

import gensim


def Doc2VecModelBuilding():
    sentence = '0 INT_EQUAL ARG3 CONST'.split(' ')

    print('Start reading the corpus')
    # Each line of the file becomes one document, tagged with its line number.
    sentences = gensim.models.doc2vec.TaggedLineDocument('ALLCorpus.txt')

    print('Start building the model')
    time_1 = datetime.datetime.now()
    # Passing the corpus to the constructor builds the vocabulary and runs
    # an initial round of training.
    model = gensim.models.Doc2Vec(sentences, dm=1, vector_size=256, window=5)
    time_2 = datetime.datetime.now()
    print('Total elapsed time for building the model (s): '
          + str((time_2 - time_1).total_seconds()))

    print('Start training')
    # model.build_vocab(sentences)
    time_1 = datetime.datetime.now()
    model.train(sentences, total_examples=model.corpus_count, epochs=100)
    time_2 = datetime.datetime.now()
    print('Total elapsed time for training (s): '
          + str((time_2 - time_1).total_seconds()))

    print(len(model.dv))

    print(model.infer_vector(sentence))
    print(model.infer_vector(sentence))

    with open('ALLCorpus2.txt') as corpus_file:
        lines = corpus_file.readlines()

    sentences2 = gensim.models.doc2vec.TaggedLineDocument('ALLCorpus2.txt')
    # model.build_vocab(sentences2, update=True)
    model.train(sentences2, total_examples=len(lines), epochs=100)
    print(len(model.dv))
    print(model.infer_vector(sentence))


Doc2VecModelBuilding()
```
The script uses two corpus files, and I have commented out two lines. If they are left uncommented, the second `build_vocab` call crashes with:

Segmentation fault (core dumped)
Note also that even after training on additional data, `len(model.dv)` does not change: the set of document vectors is fixed when the vocabulary is first built, so embeddings for new sentences have to be produced with `infer_vector`. And calling `infer_vector` twice in a row on the same sentence returns two different vectors.
Heh. This library really is a mess.