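Posting the code and results for now; I'll write a proper summary when I have time.
I use cosine distance to compute similarity and judge how well each vectorization method works.
tf-idf and doc2bow give sparse vectors and suit short texts.
doc2vec is hit-or-miss: results vary a lot from run to run, so it is not stable.
lsi and lda work well and are fairly stable, but lda is computationally heavier.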
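To make the comparison metric concrete, here is a minimal pure-Python sketch of cosine similarity over sparse bag-of-words vectors (cosine distance is just 1 minus this value). It is not the gensim code used in this post: the helper names `bag_of_words` and `cosine_similarity` and the toy token lists are made up for illustration, and the `{token_id: count}` dict only imitates the kind of sparse output doc2bow gives.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocab):
    """Return a sparse vector {token_id: count}; unseen tokens get a new id in vocab."""
    counts = Counter(tokens)
    return {vocab.setdefault(tok, len(vocab)): n for tok, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {index: weight} dicts."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-tokenized documents (in practice the tokens would come from jieba).
vocab = {}
doc_a = bag_of_words(["text", "vector", "similarity"], vocab)
doc_b = bag_of_words(["text", "vector", "model"], vocab)
doc_c = bag_of_words(["weather", "forecast"], vocab)

sim_ab = cosine_similarity(doc_a, doc_b)  # two of three tokens shared
sim_ac = cosine_similarity(doc_a, doc_c)  # no tokens shared
print(sim_ab, sim_ac)
```

Only the non-zero entries are stored, which is why sparse representations like these stay cheap for short texts: the dot product touches just the tokens a document actually contains.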
    import os

    import jieba
    from gensim import corpora, models
    from gensim.models import doc2vec
    from gensim.similarities.docsim import Similarity

    # Read every file under the test directory, stripping spaces, tabs and line breaks.
    raw_documents = []
    for root, _, files in os.walk('C:/Users/Administrator/Desktop/testdata/'):
        for file in files:
            # os.path.join also handles subdirectories, where root has no trailing slash
            with open(os.path.join(root, file), encoding='utf8') as f:
                s = f.read().replace(' ', '').replace('\t', '').replace('\r', '').replace('\n', '')
            raw_documents.append(s)
    print('data ok!')

    # Tokenize each document with jieba.
    corpora_documents = []
    corpora_documents2 = []
    for i, item_text in enumerate(raw_documents):
        words_list = list(jieba.cut(item_text))
        document = doc2vec.TaggedDocument(words