1. Overview
TF-IDF (term frequency-inverse document frequency). Quoting Baidu Baike: TF-IDF is a common weighting technique used in information retrieval and data mining.
TF stands for term frequency: how often a character or word occurs in one text of the corpus.
A common way to compute it: TF = (number of times the term appears in a sentence) / (total number of terms in that sentence)
IDF stands for inverse document frequency and is based on how many sentences contain the term.
A common way to compute it: IDF = log( total number of sentences in the corpus / (number of sentences containing the term + 1) )
TF-IDF = TF * IDF
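To make the formulas concrete, here is a tiny hand-rolled sketch (my own illustration with made-up toy data, not code from the repository) that computes TF, IDF and TF-IDF exactly as defined above, treating each tokenized sentence as one document:

import math

# toy corpus: each "sentence" is a list of tokens
corpus = [['大漠', '帝国'], ['紫色', 'Angle'], ['大漠', '大漠', '帝国']]

def tf(term, sentence):
    # TF = occurrences of the term in the sentence / total terms in the sentence
    return sentence.count(term) / len(sentence)

def idf(term, corpus):
    # IDF = log( total sentences / (sentences containing the term + 1) )
    contain = sum(1 for sent in corpus if term in sent)
    return math.log(len(corpus) / (contain + 1))

def tf_idf(term, sentence, corpus):
    return tf(term, sentence) * idf(term, corpus)

print(tf('大漠', corpus[2]))               # 2/3 ≈ 0.667
print(idf('大漠', corpus))                 # log(3 / (2 + 1)) = 0.0
print(tf_idf('大漠', corpus[2], corpus))   # 0.0

Note how the +1 in the denominator pushes the idf of a very common word down to zero (or below); the libraries below differ mainly in whether and how they smooth this.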
So much for the textbook definition, which is also how I described TF-IDF in an earlier post and how I was asked to hand-write it in an interview. Is that how real implementations behave? Let's find out.
GitHub repository:
https://github.com/yongzhuo/Tookit-Sihui/tree/master/tookit_sample/tf_idf_compare
This post walks through four ways to compute tf-idf, each with its own strengths:
- 1. gensim
- 2. jieba
- 3. sklearn
- 4. by_hand (a manual implementation)
2. Pros and cons (sklearn recommended)
a. gensim: corpora builds the token dictionary, doc2bow builds the bag-of-words, and TfidfModel computes tf-idf; the idf values are exposed through idfs.
Words never seen in training get no idf and are simply dropped. A middle-of-the-road option; it accepts either a list or a file path as input.
b. jieba: ships with idf.txt, a precomputed idf table; words missing from it fall back to the median idf of the table (which is unfriendly for short sentences, especially single-word ones).
c. sklearn: CountVectorizer counts term frequencies, TfidfTransformer computes tf-idf, and the result is stored as a compressed csr_matrix.
You can choose n-gram features, smoothing, max_features and more; with so many options, this is the one I recommend.
d. by_hand: a configurable manual implementation that counts frequency dictionaries batch by batch and merges them, so large corpora (for example a wiki corpus) can be processed in limited memory; a minimal sketch of the idea follows this list.
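To illustrate the batch-and-merge idea behind (d), here is a minimal sketch of my own (using collections.Counter rather than the repository's dict_add helper): count each slice of the corpus separately, merge the partial dictionaries, and only one slice ever has to sit in memory at a time.

from collections import Counter

def count_in_batches(sentences, batch_size=100000):
    """Term counts computed batch by batch and merged into one Counter."""
    total = Counter()
    for start in range(0, len(sentences), batch_size):
        for sent in sentences[start:start + batch_size]:
            total.update(tok for tok in sent if tok.strip())
    return total

print(count_in_batches([['大漠', '帝国'], ['大漠', '大漠'], ['紫色', 'Angle']], batch_size=2))

The real implementation in section 3.4 does the same thing with plain dicts (count_tf / count_idf per batch, merged with dict_add) and additionally tracks document frequencies for the idf.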
3. Implementations and code notes
3.1 gensim
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/7/31 21:20
# @author :Mo
# @function :

from gensim import corpora, models
import jieba


def tfidf_from_questions(corpora_documents):
    """
    Build a tf-idf model from a list of tokenized questions
    :param corpora_documents: list of token lists, e.g. [['大漠', '帝国'], ...]
    :return: dictionary, tfidf_model
    """
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model


def tfidf_from_corpora(sources_path):
    """
    Read a corpus from a file and build a tf-idf model
    :param sources_path: path of the corpus file, one question per line
    :return: dictionary, tfidf_model
    """
    from tookit_sihui.utils.file_utils import txt_read, txt_write
    questions = txt_read(sources_path)
    corpora_documents = []
    for item_text in questions:
        item_seg = list(jieba.cut(str(item_text).strip()))
        corpora_documents.append(item_seg)

    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model


if __name__ == '__main__':
    # test 1: from a list of tokenized questions
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this']]
    dictionary, tfidf_model = tfidf_from_questions(corpora_documents)
    sentence = '大漠 大漠 大漠'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    bow = dictionary.doc2bow(['i', 'i', '大漠', '大漠', '大漠'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)

    # test 2: from a text file
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    dictionary, tfidf_model = tfidf_from_corpora(path_tf_idf_corpus)
    sentence = '大漠帝国'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    bow = dictionary.doc2bow(['sihui'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    gg = 0
# Results
# [(12, 1)]
# [(12, 1.0)]
# []
# []
# [(172, 1), (173, 1)]
# [(172, 0.7071067811865475), (173, 0.7071067811865475)]
# []
# []


# Notes:
# 1. The left number is the dictionary id, the right one is the word's tf-idf.
# 2. Stop words (e.g. 'the') and single letters (e.g. 'i') are NOT removed.
# 3. Words never seen in training, such as 'sihui', are dropped and not scored.
# 4. Calculation details
# 4.1 idf = add + log_{log_base}(totaldocs / docfreq), as in the gensim source below;
#     eps = 1e-12, only idfs larger than eps are kept
def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    import numpy as np
    # np.log() with no base is the natural log; by the change-of-base rule
    # log_a(b) = log_c(b) / log_c(a), the expression below is log_2(totaldocs / docfreq).
    # Stepping through with a debugger shows there is no smoothing, i.e. it really is
    # log_2(number of documents / number of documents containing the word). That makes
    # sense: a word only enters the dictionary if it occurs in at least one document,
    # so the +1 in the denominator is unnecessary.
    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
# see also the self.initialize(corpus) method
# 4.2 tf: from the snippet below (and from debugging) gensim's tf is the raw count,
#     i.e. a word contributes as many times as it appears; for the sentence
#     '大漠 大漠 大漠' the tf of '大漠' is 3.
# termid_array, tf_array = [], []
# for termid, tf in bow:
#     termid_array.append(termid)
#     tf_array.append(tf)
#
# tf_array = self.wlocal(np.array(tf_array))
#
# vector = [
#     (termid, tf * self.idfs.get(termid))
#     for termid, tf in zip(termid_array, tf_array)
#     if abs(self.idfs.get(termid, 0.0)) > self.eps
# ]
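To check those two details yourself (raw-count tf, unsmoothed log-base-2 idf), a small experiment like the one below can be run. This is my own sketch with toy data; it relies on gensim's Dictionary.dfs / num_docs and TfidfModel.idfs attributes and turns normalization off so the raw tf * idf products stay visible:

import math
from gensim import corpora, models

docs = [['大漠', '帝国'], ['大漠', '大漠'], ['紫色', 'Angle']]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(doc) for doc in docs]
tfidf_model = models.TfidfModel(bows, normalize=False)   # keep raw tf * idf

term_id = dictionary.token2id['大漠']
df = dictionary.dfs[term_id]                        # documents containing the term: 2
manual_idf = math.log2(dictionary.num_docs / df)    # log2(3 / 2), no smoothing
print(tfidf_model.idfs[term_id], manual_idf)        # the two values should match

# tf is the raw in-document count, so '大漠' twice gives tf = 2
print(tfidf_model[dictionary.doc2bow(['大漠', '大漠'])])   # [(term_id, 2 * idf)]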
3.2 jieba
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/7/31 21:21
# @author :Mo
# @function :


import jieba.analyse
import jieba

sentence = '大漠 帝国 和 紫色 Angle'
seg = jieba.cut(sentence)  # jieba.cut returns a generator
print(seg)
tf_idf = jieba.analyse.extract_tags(sentence, withWeight=True)
print(tf_idf)

# Result
# [('Angle', 2.988691875725), ('大漠', 2.36158258893), ('紫色', 2.10190405216), ('帝国', 1.605909794915)]


# Notes:
# 1.1 idf: jieba's idf comes from the bundled idf.txt file,
#     where each passage was treated as one document when the table was built.
#     Words missing from idf.txt fall back to the median idf of the table, about 11.9.
#
# 1.2 tf: tf only counts occurrences within the current sentence, divided by the number
#     of kept words; for '大漠 帝国 和 紫色 Angle' the tf of '大漠' is 1/4, because the
#     stop word '和' (and the spaces) are filtered out by extract_tags before counting.
#     tf calculation in jieba:
#     freq[w] = freq.get(w, 0.0) + 1.0
#     total = sum(freq.values())
#     for k in freq:
#         kw = k.word if allowPOS and withFlag else k
#         freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
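Those two points can be checked against jieba's own data structures. The sketch below is my own; it assumes the attribute names jieba.analyse.default_tfidf, idf_freq and median_idf that recent jieba versions use internally, and it drops the stop word '和' by hand to mimic what extract_tags does:

import jieba
import jieba.analyse

sentence = '大漠 帝国 和 紫色 Angle'
dt = jieba.analyse.default_tfidf            # the object behind extract_tags
print(dt.median_idf)                        # ~11.9, fallback idf for words missing from idf.txt

# re-derive the weight of '大漠': count the kept words, then weight = count * idf / total
kept = [w for w in jieba.cut(sentence) if w.strip() and w != '和']
total = len(kept)                           # 4 here: 大漠 帝国 紫色 Angle
idf_damo = dt.idf_freq.get('大漠', dt.median_idf)
print(idf_damo / total)                     # should match the 2.36... weight printed above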
3.3 sklearn
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/7/31 21:21
# @author :Mo
# @function :


from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


def tfidf_from_ngram(questions):
    """
    Compute tf-idf over n-gram features with TfidfVectorizer
    :param questions: list, e.g. ['孩子气', '大漠帝国']
    :return: fitted TfidfVectorizer
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    import jieba

    def jieba_cut(x):
        x = list(jieba.cut(x))
        return ' '.join(x)

    questions = [jieba_cut(''.join(ques)) for ques in questions]
    tfidf_model = TfidfVectorizer(ngram_range=(1, 2),  # n-gram features, default is (1, 1)
                                  max_features=10000,
                                  token_pattern=r"(?u)\b\w+\b",  # keep single-character tokens (the default pattern drops them)
                                  min_df=1,
                                  max_df=0.9,
                                  use_idf=1,
                                  smooth_idf=1,
                                  sublinear_tf=1)
    tfidf_model.fit(questions)
    print(tfidf_model.transform(['紫色 ANGEL 是 虾米 回事']))
    return tfidf_model


if __name__ == "__main__":
    # test 1
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this'], ['大漠', '大漠']]
    # join with spaces so CountVectorizer can split the tokens again
    corpora_documents = [' '.join(ques) for ques in corpora_documents]
    # count term frequencies
    vectorizer = CountVectorizer()
    # compute tf-idf from the counts
    transformer = TfidfTransformer()
    # the inner fit_transform turns the texts into a term-count matrix,
    # the outer one turns that matrix into tf-idf weights
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpora_documents))
    print(tfidf)
    # all terms in the vocabulary (get_feature_names_out() on newer scikit-learn)
    word = vectorizer.get_feature_names()
    print(word)
    weight = tfidf.toarray()
    print(weight)


    # test 2: n-gram features via TfidfVectorizer
    tf_idf_model = tfidf_from_ngram(corpora_documents)
    print(tf_idf_model.transform(['你 谁 呀, 小老弟']))


# Notes:
# sklearn also provides TfidfVectorizer (which inherits from CountVectorizer), so
# n-gram extraction and tf-idf weighting can be done in one step, as in tfidf_from_ngram above.
# With smooth_idf enabled the idf is computed as:
# df += int(self.smooth_idf)          # smoothing
# n_samples += int(self.smooth_idf)   # smoothing
# idf = np.log(n_samples / df) + 1    # note the extra +1
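The smoothed formula quoted above is easy to confirm against what the fitted model stores in idf_. A small check of my own (toy data, same token_pattern as the code above):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['大漠 帝国', '大漠 大漠', '紫色 Angle']   # already space-separated tokens

vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", smooth_idf=True)
vec.fit(docs)

term = '大漠'
df = sum(1 for d in docs if term in d.split())     # documents containing the term: 2
n = len(docs)
manual_idf = np.log((n + 1) / (df + 1)) + 1        # smoothed: ln(4 / 3) + 1
print(manual_idf, vec.idf_[vec.vocabulary_[term]])  # the two values should agree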
3.4 by_hand
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/6/19 21:32
# @author :Mo
# @function :tf-idf


from tookit_sihui.utils.file_utils import save_json
from tookit_sihui.utils.file_utils import load_json
from tookit_sihui.utils.file_utils import txt_write
from tookit_sihui.utils.file_utils import txt_read
import jieba
import json
import math
import os


from tookit_sihui.conf.logger_config import get_logger_root
logger = get_logger_root()


def count_tf(questions):
    """
    Count character or word frequencies (tf)
    :param questions: list, corpus; char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict of term counts, e.g. {'我': 1, '爱': 2}
    """
    tf_char = {}
    for question in questions:
        for char in question:
            if char.strip():
                if char not in tf_char:
                    tf_char[str(char).encode('utf-8', 'ignore').decode('utf-8')] = 1
                else:
                    tf_char[str(char).encode('utf-8', 'ignore').decode('utf-8')] = tf_char[char] + 1
    tf_char['[LENS]'] = sum([v for k, v in tf_char.items()])  # total token count
    return tf_char


def count_idf(questions):
    """
    Count document frequencies for idf
    :param questions: list, corpus; char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict of document frequencies, e.g. {'我': 1, '爱': 2}
    """
    idf_char = {}
    for question in questions:
        question_set = set(question)  # duplicates inside one sentence are counted once
        for char in question_set:
            if char.strip():  # skip empty strings
                if char not in idf_char:  # first occurrence counts as 1
                    idf_char[char] = 1
                else:
                    idf_char[char] = idf_char[char] + 1
    idf_char['[LENS]'] = len(questions)  # store the total number of sentences
    return idf_char


def count_tf_idf(freq_char, freq_document, ndigits=12, smooth=0):
    """
    Compute tf, idf and tf-idf dictionaries
    :param freq_char: dict, raw term counts (from count_tf)
    :param freq_document: dict, document frequencies (from count_idf)
    :return: dict, dict, dict -- tf, idf and tf-idf
    """

    len_tf = freq_char['[LENS]']
    len_tf_mid = int(len(freq_char) / 2)
    len_idf = freq_document['[LENS]']
    len_idf_mid = int(len(freq_document) / 2)
    # tf
    tf_char = {}
    for k2, v2 in freq_char.items():
        tf_char[k2] = round((v2 + smooth) / (len_tf + smooth), ndigits)
    # idf
    idf_char = {}
    for ki, vi in freq_document.items():
        idf_char[ki] = round(math.log((len_idf + smooth) / (vi + smooth), 2), ndigits)
    # tf-idf
    tf_idf_char = {}
    for kti, vti in freq_char.items():
        tf_idf_char[kti] = round(tf_char[kti] * idf_char[kti], ndigits)

    # drop the bookkeeping entry
    tf_char.pop('[LENS]')
    idf_char.pop('[LENS]')
    tf_idf_char.pop('[LENS]')

    # average / max / min / median; materialize the value lists first so the
    # stats keys added below do not feed back into the calculation
    tf_char_values = list(tf_char.values())
    idf_char_values = list(idf_char.values())
    tf_idf_char_values = list(tf_idf_char.values())

    tf_char['[AVG]'] = round(sum(tf_char_values) / len(tf_char_values), ndigits)
    idf_char['[AVG]'] = round(sum(idf_char_values) / len(idf_char_values), ndigits)
    tf_idf_char['[AVG]'] = round(sum(tf_idf_char_values) / len(tf_idf_char_values), ndigits)
    tf_char['[MAX]'] = max(tf_char_values)
    idf_char['[MAX]'] = max(idf_char_values)
    tf_idf_char['[MAX]'] = max(tf_idf_char_values)
    tf_char['[MIN]'] = min(tf_char_values)
    idf_char['[MIN]'] = min(idf_char_values)
    tf_idf_char['[MIN]'] = min(tf_idf_char_values)
    tf_char['[MID]'] = sorted(tf_char_values)[len_tf_mid]
    idf_char['[MID]'] = sorted(idf_char_values)[len_idf_mid]
    tf_idf_char['[MID]'] = sorted(tf_idf_char_values)[len_idf_mid]

    return tf_char, idf_char, tf_idf_char


def save_tf_idf_dict(path_dir, tf_char, idf_char, tf_idf_char):
    """
    Sort and save tf, idf and tf-idf as text files
    :param path_dir: str, directory the files are written to
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # sort and save
    tf_char_sorted = sorted(tf_char.items(), key=lambda d: d[1], reverse=True)
    tf_char_sorted = [tf[0] + '\t' + str(tf[1]) + '\n' for tf in tf_char_sorted]
    txt_write(tf_char_sorted, path_dir + 'tf.txt')

    idf_char_sorted = sorted(idf_char.items(), key=lambda d: d[1], reverse=True)
    idf_char_sorted = [idf[0] + '\t' + str(idf[1]) + '\n' for idf in idf_char_sorted]
    txt_write(idf_char_sorted, path_dir + 'idf.txt')

    tf_idf_char_sorted = sorted(tf_idf_char.items(), key=lambda d: d[1], reverse=True)
    tf_idf_char_sorted = [tf_idf[0] + '\t' + str(tf_idf[1]) + '\n' for tf_idf in tf_idf_char_sorted]
    txt_write(tf_idf_char_sorted, path_dir + 'tf_idf.txt')


def save_tf_idf_json(path_dir, tf_freq, idf_freq, tf_char, idf_char, tf_idf_char):
    """
    Save the raw frequencies and tf, idf, tf-idf as json files
    :param path_dir: str, directory the files are written to
    :param tf_freq: dict, raw term counts
    :param idf_freq: dict, document frequencies
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # raw frequencies
    save_json([tf_freq], path_dir + '/tf_freq.json')
    save_json([idf_freq], path_dir + '/idf_freq.json')
    # json_tf = json.dumps([tf_char])
    save_json([tf_char], path_dir + '/tf.json')
    # json_idf = json.dumps([idf_char])
    save_json([idf_char], path_dir + '/idf.json')
    # json_tf_idf = json.dumps([tf_idf_char])
    save_json([tf_idf_char], path_dir + '/tf_idf.json')


def load_tf_idf_json(path_tf_freq=None, path_idf_freq=None, path_tf=None, path_idf=None, path_tf_idf=None):
    """
    Load the raw frequencies and tf, idf, tf-idf from json files
    :param path_tf_freq: path of tf_freq.json
    :param path_idf_freq: path of idf_freq.json
    :param path_tf: path of tf.json
    :param path_idf: path of idf.json
    :param path_tf_idf: path of tf_idf.json
    :return: the five dicts, in the same order as the parameters
    """
    json_tf_freq = load_json(path_tf_freq)
    json_idf_freq = load_json(path_idf_freq)
    json_tf = load_json(path_tf)
    json_idf = load_json(path_idf)
    json_tf_idf = load_json(path_tf_idf)
    return json_tf_freq[0], json_idf_freq[0], json_tf[0], json_idf[0], json_tf_idf[0]


def dict_add(dict1, dict2):
    """
    Merge two dicts, summing the values of shared keys
    :param dict1: dict, merged into and returned
    :param dict2: dict, merged from
    :return: dict1
    """
    for i, j in dict2.items():
        if i in dict1.keys():
            dict1[i] += j
        else:
            dict1.update({f'{i}': dict2[i]})
    return dict1


class TFIDF:
    def __init__(self, questions=None, path_tf=None,
                 path_idf=None, path_tf_idf=None,
                 path_tf_freq=None, path_idf_freq=None,
                 ndigits=12, smooth=0):
        """
        Build tf / idf / tf-idf dicts from a corpus, or load precomputed ones from json
        :param questions: list, corpus; char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
        """
        self.epsilon = 1e-16
        self.questions = questions
        self.path_tf_freq = path_tf_freq
        self.path_idf_freq = path_idf_freq
        self.path_tf = path_tf
        self.path_idf = path_idf
        self.path_tf_idf = path_tf_idf
        self.ndigits = ndigits
        self.smooth = smooth
        self.create_tfidf()

    def create_tfidf(self):
        if self.questions is not None:  # a corpus (list of token lists) was passed in
            self.tf_freq = count_tf(self.questions)
            self.idf_freq = count_idf(self.questions)
            self.tf, self.idf, self.tfidf = count_tf_idf(self.tf_freq,
                                                         self.idf_freq,
                                                         ndigits=self.ndigits,
                                                         smooth=self.smooth)
        else:  # otherwise load the precomputed dicts from json
            self.tf_freq, self.idf_freq, \
            self.tf, self.idf, self.tfidf = load_tf_idf_json(path_tf_freq=self.path_tf_freq,
                                                             path_idf_freq=self.path_idf_freq,
                                                             path_tf=self.path_tf,
                                                             path_idf=self.path_idf,
                                                             path_tf_idf=self.path_tf_idf)
        self.chars = [idf for idf in self.idf.keys()]

    def extract_tfidf_of_sentence(self, ques):
        """
        Average tf-idf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tfidf[char]
                score_list[char] = self.tfidf[char]
            else:  # unseen words get a tiny epsilon instead of zero
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average so sentence length does not dominate
        logger.info(score_list)
        logger.info({ques: score})
        return score

    def extract_tf_of_sentence(self, ques):
        """
        Average tf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tf[char]
                score_list[char] = self.tf[char]
            else:  # unseen words get a tiny epsilon instead of zero
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average so sentence length does not dominate
        logger.info(score_list)
        logger.info({ques: score})
        return score

    def extract_idf_of_sentence(self, ques):
        """
        Average idf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.idf[char]
                score_list[char] = self.idf[char]
            else:  # unseen words get a tiny epsilon instead of zero
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average so sentence length does not dominate
        logger.info(score_list)
        logger.info({ques: score})
        return score


def create_TFIDF(path):
    # test 1: build the dictionaries from a corpus
    import time
    time_start = time.time()
    # first build tf-idf from the whole corpus, then use it
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    from tookit_sihui.utils.file_utils import txt_write, txt_read

    path_wiki = path if path else path_tf_idf_corpus
    # tf-idf, computed batch by batch
    path_dir = 'tf_idf_freq/'
    # ques = ['大漠帝国最强', '花落惊飞羽最漂亮', '紫色Angle最有气质', '孩子气最活泼', '口袋巧克力和过路蜻蜓最好最可爱啦', '历历在目最烦恼']
    # questions = [list(q.strip()) for q in ques]
    # questions = [list(jieba.cut(que)) for que in ques]
    questions = txt_read(path_wiki)
    len_questions = len(questions)
    batch_size = 1000000
    size_trade = len_questions // batch_size
    print(size_trade)
    size_end = size_trade * batch_size
    # count tf-freq and idf-freq batch by batch
    ques_tf_all, ques_idf_all = {}, {}
    # the second range has to run one step further than the first so that the
    # final full batch [size_end - batch_size, size_end) is included
    for i, (start, end) in enumerate(zip(range(0, size_end, batch_size),
                                         range(batch_size, size_end + batch_size, batch_size))):
        print("batch {}".format(i))
        question = questions[start: end]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_idf = count_idf(questionss)
        ques_tf = count_tf(questionss)
        print('tf_idf_{}: '.format(i) + str(time.time() - time_start))
        # merge the dicts, summing the values
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('dict_add_{}: '.format(i) + str(time.time() - time_start))
        # sanity check on the very frequent word '的'
        print("tf of '的': {}".format(ques_tf_all['的']))
        print("idf of '的': {}".format(ques_idf_all['的']))
    # the tail that does not fill a whole batch
    if len_questions - size_end > 0:
        print("batch {}".format('last'))
        question = questions[size_end: len_questions]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_tf = count_tf(questionss)
        ques_idf = count_idf(questionss)
        # tf_char, idf_char, tf_idf_char = count_tf_idf(ques_tf, ques_idf)
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('{}: '.format('last') + str(time.time() - time_start))
        print("tf of '的': {}".format(ques_tf_all['的']))
        print("idf of '的': {}".format(ques_idf_all['的']))
    # compute tf-idf from the merged counts
    tf_char, idf_char, tf_idf_char = count_tf_idf(ques_tf_all, ques_idf_all)
    print(len(tf_char))
    print('tf-idf ' + str(time.time() - time_start))
    print('tf-idf ok!')
    # save tf, idf and tf-idf
    save_tf_idf_json(path_dir, ques_tf_all, ques_idf_all, tf_char, idf_char, tf_idf_char)
    gg = 0


if __name__ == "__main__":
    # test 1
    path = None  # corpus path; the file should contain pre-segmented sentences, e.g. '孩子 气 和 紫色 angle'
    create_TFIDF(path)

    # # test 2: load the class from json and score user input
    # path_dir = 'tf_idf_freq/'
    # path_tf_freq = path_dir + '/tf_freq.json'
    # path_idf_freq = path_dir + '/idf_freq.json'
    # path_tf = path_dir + '/tf.json'
    # path_idf = path_dir + '/idf.json'
    # path_tf_idf = path_dir + '/tf_idf.json'
    #
    # tfidf = TFIDF(path_tf_freq=path_tf_freq, path_idf_freq=path_idf_freq,
    #               path_tf=path_tf, path_idf=path_idf, path_tf_idf=path_tf_idf)
    # score1 = tfidf.extract_tf_of_sentence('大漠帝国')
    # score2 = tfidf.extract_idf_of_sentence('大漠帝国')
    # score3 = tfidf.extract_tfidf_of_sentence('大漠帝国')
    # print('tf: ' + str(score1))
    # print('idf: ' + str(score2))
    # print('tfidf: ' + str(score3))
    # while True:
    #     print("Input a sentence: ")
    #     ques = input()
    #     tfidf_score = tfidf.extract_tfidf_of_sentence(ques)
    #     print('tfidf:' + str(tfidf_score))
Hope this helps!