
Implementing TF-IDF in Python with jieba


Tokenization:

import pandas as pd
import jieba
import jieba.analyse

# Data source: one article per row, tab-separated id and content columns
df_news = pd.read_table('C:/Users/Shirley/Desktop/python/article.txt',
                        names=['id', 'content'], encoding='utf-8')
# Drop rows with missing values
df_news = df_news.dropna()

content = df_news.content.values.tolist()
content_S = []
for line in content:
    current_segment = jieba.lcut(line)
    # Skip lines whose segmentation is empty or just a line break
    if len(current_segment) > 1 and current_segment != ['\r\n']:
        content_S.append(current_segment)

df_content = pd.DataFrame({'content_S': content_S})
print(df_content.head())

Filtering out stopwords:

# Stopword list: one stopword per line
stopwords = pd.read_csv("C:/Users/Shirley/Desktop/python/stopwords_3.txt", index_col=False,
                        sep='\t', quoting=3, names=['stopwords'], encoding='utf-8')
stopwords.head()

def drop_stopwords(contents, stopwords):
    contents_clean = []
    all_words = []
    for line in contents:
        # Keep only words that are not in the stopword list
        line_clean = [word for word in line if word not in stopwords]
        contents_clean.append(line_clean)
        all_words.extend(line_clean)
    return contents_clean, all_words

# Pass a plain list so membership tests check words, not DataFrame column labels
contents_clean, all_words = drop_stopwords(content_S, stopwords.stopwords.values.tolist())
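The TF-IDF weight jieba computes can also be sketched by hand, which makes the formula concrete: term frequency within a document, scaled down by how many documents contain the word. A stdlib-only sketch on toy tokenized documents (the sample documents are illustrative, not from the article's data):

```python
import math
from collections import Counter

docs = [
    ["机器", "学习", "模型"],
    ["深度", "学习", "网络"],
    ["机器", "翻译", "系统"],
]

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n / df[word])
            for word, count in tf.items()
        })
    return scores

scores = tf_idf(docs)
# "学习" appears in 2 of 3 documents, so its IDF (and score) is lower than "模型"'s
print(scores[0])
```

Real implementations differ in smoothing details (e.g. adding 1 inside the logarithm), but the ordering of keywords within a document follows this same tf × idf pattern.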