
Text Mining: Similarity Comparison

1. Compute the similarity between the three documents data1.txt, data2.txt, and data3.txt.

We compare the similarity of Daomu Biji (盗墓笔记), Gui Chui Deng (鬼吹灯), and Jin Jiu Men (金九门).
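The script below follows a standard pipeline: fetch the raw texts, tokenize them with jieba, drop rare words, build a gensim dictionary and corpus, weight with TF-IDF, and query the third document against the first two. The rare-word filtering step on its own can be sketched with the standard library alone; the toy token lists here are made up for illustration (jieba would produce Chinese words), and a lower threshold than the script's is used so the effect is visible on tiny inputs:

```python
from collections import defaultdict

# Toy stand-ins for two tokenized documents
texts = [["gold", "tomb", "lamp", "tomb"],
         ["tomb", "ghost", "lamp", "tomb"]]

# Count how often each token appears across all documents
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only tokens that occur more than once
# (the full script uses a threshold of 5)
filtered = [[token for token in text if frequency[token] > 1]
            for text in texts]
print(filtered)  # [['tomb', 'lamp', 'tomb'], ['tomb', 'lamp', 'tomb']]
```

Note that the filter must look up the word currently being tested, not the loop variable left over from the counting loop.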

import jieba
from gensim import corpora, models, similarities
import urllib.request
from collections import defaultdict

# Fetch the two documents from a local phpStudy web server
doc1 = urllib.request.urlopen("http://127.0.0.1/daomubiji.html").read().decode("utf-8", "ignore")
doc2 = urllib.request.urlopen("http://127.0.0.1/jjgc.html").read().decode("utf-8", "ignore")
# Tokenize with jieba and join the tokens with spaces
data11 = " ".join(jieba.cut(doc1))
data21 = " ".join(jieba.cut(doc2))
documents = [data11, data21]
texts = [document.split() for document in documents]

# Count how often each token appears across both documents
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only words that occur more than 5 times
texts = [[word for word in text if frequency[word] > 5]
         for text in texts]
dictionary = corpora.Dictionary(texts)
# dictionary.save("H:/wenben2.txt")
# Fetch and tokenize the third (query) document
doc3 = urllib.request.urlopen("http://127.0.0.1/llmk.html").read().decode("utf-8", "ignore")
data31 = " ".join(jieba.cut(doc3))
new_doc = data31
new_vec = dictionary.doc2bow(new_doc.split())
# Build the bag-of-words corpus, weight it with TF-IDF, and index it
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
featureNum = len(dictionary.token2id.keys())
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=featureNum)
# Cosine similarity of the query document against each corpus document
sims = index[tfidf[new_vec]]
print(sims)
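Under the hood, `SparseMatrixSimilarity` returns the cosine similarity between the query's TF-IDF vector and each corpus document's TF-IDF vector. A minimal stdlib-only sketch of that final comparison (the two-element vectors here are made up for illustration, not real TF-IDF values from the novels):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical TF-IDF vectors: two corpus documents and one query
corpus_vecs = [[0.8, 0.6], [0.6, 0.8]]
query_vec = [1.0, 0.0]

# Same shape of result as sims above: one score per corpus document
sims = [cosine(query_vec, doc) for doc in corpus_vecs]
best = max(range(len(sims)), key=lambda i: sims[i])
print(sims, "most similar:", best)
```

A score near 1.0 means the query shares most of its weighted vocabulary with that document; `sims.argmax()` (on the numpy array gensim returns) picks the closest corpus document the same way `best` does here.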

