
Comparing TF-IDF Implementations: gensim, jieba, sklearn, and By Hand

1. Overview

        TF-IDF (term frequency-inverse document frequency) is, quoting Baidu Baike, a common weighting technique used in information retrieval and data mining.

        TF means term frequency: in a corpus made of sentences, it is how often a character or word appears in a text.

                 Typically: TF = (number of times the term appears in the sentence) / (total number of terms in the sentence)

        IDF means inverse document frequency. It is based on the number of sentences that contain the term: the fewer sentences contain it, the larger its IDF.

                 Typically: IDF = log( total number of sentences in the corpus / (number of sentences containing the term + 1) )

        TF-IDF = TF * IDF
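
         To make these formulas concrete, here is a minimal hand-rolled sketch (plain Python; the toy corpus and names are illustrative, not from the repo):

import math

# Toy corpus: each "document" is one tokenized sentence.
corpus = [['大漠', '帝国'], ['紫色', 'Angle'], ['大漠', '大漠', '大漠']]

def tf(term, doc):
    # occurrences of the term in the sentence / total terms in the sentence
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log( total sentences / (sentences containing the term + 1) )
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (n_containing + 1))

print(tf('帝国', corpus[0]) * idf('帝国', corpus))  # tf-idf of '帝国' in the first sentence, ~0.2

         The +1 in the denominator keeps the division safe for unseen terms; as the sections below show, gensim, jieba and sklearn each make slightly different choices on this point.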

         That is how TF-IDF was described in an earlier post, and I also happened to hand-write this algorithm in an interview. Is that really how it works in practice? Let's find out.

          GitHub repo: https://github.com/yongzhuo/Tookit-Sihui/tree/master/tookit_sample/tf_idf_compare

         This article covers four ways to implement tf-idf, each with its own strengths:

  1. gensim
  2. jieba
  3. sklearn
  4. by_hand (manual)

2. Pros and Cons (sklearn recommended)

            a. gensim: corpora builds the token dictionary, doc2bow builds the bag-of-words, and tfidf_model computes tf-idf; the idfs are exposed. Words unseen in training get no idf and are simply dropped from the output. A middle-of-the-road model; it accepts a list or a file path as input.

            b. jieba: ships with idf.txt, i.e. precomputed idf values; unseen words fall back to the median idf of the table (which is unfriendly for sentences, especially single-word ones).

            c. sklearn: CountVectorizer counts term frequencies, TfidfTransformer computes tf-idf, and csr_matrix keeps the data compressed. It supports n-gram features, idf smoothing, and a max_features cap; with so many options, this is the one I recommend.

            d. by_hand: the hand-rolled version is fully configurable; it computes frequency dictionaries batch by batch and merges them, so it can process large corpora (e.g. wikicorpus) in limited memory.

 

3. Implementations and Code Notes

    3.1 gensim

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/7/31 21:20
# @author   :Mo
# @function :

from gensim import corpora, models
import jieba


def tfidf_from_questions(corpora_documents):
    """
        build tf-idf from a list of tokenized documents
    :param corpora_documents: list of token lists
    :return: dictionary, tfidf_model
    """
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model


def tfidf_from_corpora(sources_path):
    """
        read a corpus from a file and compute tf-idf
    :param sources_path: str, path of the corpus file
    :return: dictionary, tfidf_model
    """
    from tookit_sihui.utils.file_utils import txt_read, txt_write
    questions = txt_read(sources_path)
    corpora_documents = []
    for item_text in questions:
        item_seg = list(jieba.cut(str(item_text).strip()))
        corpora_documents.append(item_seg)
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model


if __name__ == '__main__':
    # test 1: from a list of questions
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this']]
    dictionary, tfidf_model = tfidf_from_questions(corpora_documents)
    sentence = '大漠 大漠 大漠'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    bow = dictionary.doc2bow(['i', 'i', '大漠', '大漠', '大漠'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)

    # test 2: from a text file
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    dictionary, tfidf_model = tfidf_from_corpora(path_tf_idf_corpus)
    sentence = '大漠帝国'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    bow = dictionary.doc2bow(['sihui'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    gg = 0

# Results
# [(12, 1)]
# [(12, 1.0)]
# []
# []
# [(172, 1), (173, 1)]
# [(172, 0.7071067811865475), (173, 0.7071067811865475)]
# []
# []

# Notes:
# 1. The left number is the dictionary id, the right one is the word's tf-idf.
# 2. Stopwords (e.g. 'the') and single letters (e.g. 'i') are NOT removed.
# 3. Words never seen in training (e.g. 'sihui') are dropped and not scored.
# 4. Calculation details:
# 4.1 idf = add + log_{log_base}(totaldocs / docfreq), as below
#     (eps = 1e-12; only idf values with magnitude above eps are kept):
def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    import numpy as np
    # np.log() with no base is the natural log; by the change-of-base rule
    # log_a(b) = log_c(b) / log_c(a), the expression below is log_2(totaldocs / docfreq).
    # Stepping through with a debugger shows there is no smoothing, i.e.
    # log_2(number of documents / number of documents containing the word).
    # That makes sense: a word in the dictionary appears in at least one
    # document, so there is no need to add 1 to the denominator.
    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
# (see also the self.initialize(corpus) method)
# 4.2 tf: from the snippet below (and from debugging) gensim uses the raw count
#     as tf, i.e. for the sentence '大漠 大漠 大漠' the tf of '大漠' is 3.
# termid_array, tf_array = [], []
# for termid, tf in bow:
#     termid_array.append(termid)
#     tf_array.append(tf)
#
# tf_array = self.wlocal(np.array(tf_array))
#
# vector = [
#     (termid, tf * self.idfs.get(termid))
#     for termid, tf in zip(termid_array, tf_array)
#     if abs(self.idfs.get(termid, 0.0)) > self.eps
# ]
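
To double-check that gensim really uses the unsmoothed log2 idf described above, here is a small sketch; it relies on models.TfidfModel exposing an idfs dict of {term_id: idf}, which it does in gensim 3.x/4.x:

import math
from gensim import corpora, models

docs = [['大漠', '帝国'], ['紫色', 'Angle'], ['大漠', '大漠']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
tfidf_model = models.TfidfModel(corpus)

term_id = dictionary.token2id['大漠']
docfreq = sum(1 for d in docs if '大漠' in d)  # appears in 2 of 3 documents
manual_idf = math.log(len(docs) / docfreq, 2)  # log2(3/2), no smoothing
print(tfidf_model.idfs[term_id], manual_idf)   # both ~0.585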

 

    3.2 jieba

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/7/31 21:21
# @author   :Mo
# @function :

import jieba.analyse
import jieba

sentence = '大漠 帝国 和 紫色 Angle'
seg = list(jieba.cut(sentence))
print(seg)
tf_idf = jieba.analyse.extract_tags(sentence, withWeight=True)
print(tf_idf)

# Result
# [('Angle', 2.988691875725), ('大漠', 2.36158258893), ('紫色', 2.10190405216), ('帝国', 1.605909794915)]

# Notes:
# 1.1 idf: jieba's idf values come from the bundled file idf.txt, where each idf
#     was computed treating one passage as one document. Words not in the file
#     fall back to the median idf of the table (self.median_idf), around 11.
# 1.2 tf: tf only counts the frequency within the current sentence divided by
#     the total number of words, e.g. for '大漠 帝国 和 紫色 Angle' the tf of
#     '大漠' is 1/5. The extractor drops stopwords such as '和'.
# tf computation (from jieba's source):
# freq[w] = freq.get(w, 0.0) + 1.0
# total = sum(freq.values())
# for k in freq:
#     kw = k.word if allowPOS and withFlag else k
#     freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
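
The quoted weighting is easy to reproduce outside jieba. A minimal sketch with a made-up idf table (the real values come from jieba's bundled idf.txt; the numbers below are illustrative only):

# Hypothetical idf table; jieba loads the real one from idf.txt.
idf_freq = {'大漠': 11.8, '帝国': 8.0, '紫色': 10.5}
median_idf = sorted(idf_freq.values())[len(idf_freq) // 2]  # fallback for unseen words

words = ['大漠', '帝国', '紫色', 'Angle']  # 'Angle' is not in the table
freq = {}
for w in words:
    freq[w] = freq.get(w, 0.0) + 1.0
total = sum(freq.values())
for k in freq:
    # tf (count / total words) times idf, with the median idf as the unseen fallback
    freq[k] *= idf_freq.get(k, median_idf) / total

print(sorted(freq.items(), key=lambda kv: kv[1], reverse=True))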

 

    3.3 sklearn

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/7/31 21:21
# @author   :Mo
# @function :

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


def tfidf_from_ngram(questions):
    """
        compute n-gram tf-idf with TfidfVectorizer
    :param questions: list, like ['孩子气', '大漠帝国']
    :return: fitted TfidfVectorizer
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    import jieba
    def jieba_cut(x):
        x = list(jieba.cut(x))
        return ' '.join(x)
    questions = [jieba_cut(''.join(ques)) for ques in questions]
    tfidf_model = TfidfVectorizer(ngram_range=(1, 2),  # n-gram features, default (1, 1)
                                  max_features=10000,
                                  token_pattern=r"(?u)\b\w+\b",  # keep single-char tokens (the default pattern drops them)
                                  min_df=1,
                                  max_df=0.9,
                                  use_idf=1,
                                  smooth_idf=1,
                                  sublinear_tf=1)  # sublinear tf: replace tf with 1 + log(tf)
    tfidf_model.fit(questions)
    print(tfidf_model.transform(['紫色 ANGEL 是 虾米 回事']))
    return tfidf_model


if __name__ == "__main__":
    # test 1
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this'], ['大漠', '大漠']]
    corpora_documents = [''.join(ques) for ques in corpora_documents]
    # count term frequencies
    vectorizer = CountVectorizer()
    # initialize, then fit and transform tf-idf
    transformer = TfidfTransformer()
    # the inner fit_transform builds the term-count matrix, the outer one computes tf-idf
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpora_documents))
    print(tfidf)
    # all words known to the model (on scikit-learn >= 1.2 use get_feature_names_out())
    word = vectorizer.get_feature_names()
    print(word)
    weight = tfidf.toarray()
    print(weight)
    # test 2: n-gram tf-idf
    tf_idf_model = tfidf_from_ngram(corpora_documents)
    print(tf_idf_model.transform(['你 谁 呀, 小老弟']))

# sklearn can also use TfidfVectorizer (which inherits from CountVectorizer)
# to extract n-gram features and compute tf-idf features directly.
# Smoothing (from sklearn's source):
# df += int(self.smooth_idf)        # smoothing
# n_samples += int(self.smooth_idf) # smoothing
# idf = np.log(n_samples / df) + 1  # note the extra +1
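
The smoothing lines quoted from sklearn's source can be verified against the documented idf_ attribute of TfidfTransformer, e.g.:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['大漠 帝国', '紫色 Angle', '大漠 大漠']
counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)
transformer = TfidfTransformer(smooth_idf=True).fit(counts)

n_samples = counts.shape[0]
df = np.asarray((counts > 0).sum(axis=0)).ravel()    # document frequency per term
manual_idf = np.log((1 + n_samples) / (1 + df)) + 1  # smoothed idf: log((1+n)/(1+df)) + 1
print(np.allclose(transformer.idf_, manual_idf))     # True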

    3.4 by_hand

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/6/19 21:32
# @author   :Mo
# @function :tf-idf

from tookit_sihui.utils.file_utils import save_json
from tookit_sihui.utils.file_utils import load_json
from tookit_sihui.utils.file_utils import txt_write
from tookit_sihui.utils.file_utils import txt_read
import jieba
import json
import math
import os

from tookit_sihui.conf.logger_config import get_logger_root
logger = get_logger_root()


def count_tf(questions):
    """
        count character or word frequencies (tf)
    :param questions: list, input corpus, char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict, frequencies, e.g. {'我':1, '爱':2}
    """
    tf_char = {}
    for question in questions:
        for char in question:
            if char.strip():
                char = str(char).encode('utf-8', 'ignore').decode('utf-8')  # normalize encoding
                tf_char[char] = tf_char.get(char, 0) + 1
    tf_char['[LENS]'] = sum(tf_char.values())  # total token count
    return tf_char


def count_idf(questions):
    """
        count document frequencies (for idf)
    :param questions: list, input corpus, char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict, document frequencies, e.g. {'我':1, '爱':2}
    """
    idf_char = {}
    for question in questions:
        question_set = set(question)  # duplicates within one sentence count only once
        for char in question_set:
            if char.strip():  # skip whitespace-only tokens
                idf_char[char] = idf_char.get(char, 0) + 1
    idf_char['[LENS]'] = len(questions)  # store the total number of sentences
    return idf_char


def count_tf_idf(freq_char, freq_document, ndigits=12, smooth=0):
    """
        compute tf, idf and tf-idf
    :param freq_char: dict, raw token counts (tf)
    :param freq_document: dict, document-frequency counts (for idf)
    :return: dict, dict, dict: tf, idf and tf-idf
    """
    len_tf = freq_char['[LENS]']
    len_tf_mid = int(len(freq_char) / 2)
    len_idf = freq_document['[LENS]']
    len_idf_mid = int(len(freq_document) / 2)
    # tf
    tf_char = {}
    for k2, v2 in freq_char.items():
        tf_char[k2] = round((v2 + smooth) / (len_tf + smooth), ndigits)
    # idf
    idf_char = {}
    for ki, vi in freq_document.items():
        idf_char[ki] = round(math.log((len_idf + smooth) / (vi + smooth), 2), ndigits)
    # tf-idf
    tf_idf_char = {}
    for kti, vti in freq_char.items():
        tf_idf_char[kti] = round(tf_char[kti] * idf_char[kti], ndigits)
    # drop the count bookkeeping entries
    tf_char.pop('[LENS]')
    idf_char.pop('[LENS]')
    tf_idf_char.pop('[LENS]')
    # average / max / min / median; materialize the value views first so the
    # statistics added below do not feed back into later computations
    tf_char_values = list(tf_char.values())
    idf_char_values = list(idf_char.values())
    tf_idf_char_values = list(tf_idf_char.values())
    tf_char['[AVG]'] = round(sum(tf_char_values) / len_tf, ndigits)
    idf_char['[AVG]'] = round(sum(idf_char_values) / len_idf, ndigits)
    tf_idf_char['[AVG]'] = round(sum(tf_idf_char_values) / len_idf, ndigits)
    tf_char['[MAX]'] = max(tf_char_values)
    idf_char['[MAX]'] = max(idf_char_values)
    tf_idf_char['[MAX]'] = max(tf_idf_char_values)
    tf_char['[MIN]'] = min(tf_char_values)
    idf_char['[MIN]'] = min(idf_char_values)
    tf_idf_char['[MIN]'] = min(tf_idf_char_values)
    tf_char['[MID]'] = sorted(tf_char_values)[len_tf_mid]
    idf_char['[MID]'] = sorted(idf_char_values)[len_idf_mid]
    tf_idf_char['[MID]'] = sorted(tf_idf_char_values)[len_idf_mid]
    return tf_char, idf_char, tf_idf_char


def save_tf_idf_dict(path_dir, tf_char, idf_char, tf_idf_char):
    """
        sort and save as text
    :param path_dir: str, output directory
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # sort and save
    tf_char_sorted = sorted(tf_char.items(), key=lambda d: d[1], reverse=True)
    tf_char_sorted = [tf[0] + '\t' + str(tf[1]) + '\n' for tf in tf_char_sorted]
    txt_write(tf_char_sorted, path_dir + 'tf.txt')

    idf_char_sorted = sorted(idf_char.items(), key=lambda d: d[1], reverse=True)
    idf_char_sorted = [idf[0] + '\t' + str(idf[1]) + '\n' for idf in idf_char_sorted]
    txt_write(idf_char_sorted, path_dir + 'idf.txt')

    tf_idf_char_sorted = sorted(tf_idf_char.items(), key=lambda d: d[1], reverse=True)
    tf_idf_char_sorted = [tf_idf[0] + '\t' + str(tf_idf[1]) + '\n' for tf_idf in tf_idf_char_sorted]
    txt_write(tf_idf_char_sorted, path_dir + 'tf_idf.txt')


def save_tf_idf_json(path_dir, tf_freq, idf_freq, tf_char, idf_char, tf_idf_char):
    """
        save as json
    :param path_dir: str, output directory
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # raw frequency counts
    save_json([tf_freq], path_dir + '/tf_freq.json')
    save_json([idf_freq], path_dir + '/idf_freq.json')
    # normalized tf, idf and tf-idf
    save_json([tf_char], path_dir + '/tf.json')
    save_json([idf_char], path_dir + '/idf.json')
    save_json([tf_idf_char], path_dir + '/tf_idf.json')


def load_tf_idf_json(path_tf_freq=None, path_idf_freq=None, path_tf=None, path_idf=None, path_tf_idf=None):
    """
        load tf, idf and tf_idf from json files
    :param path_tf:
    :param path_idf:
    :param path_tf_idf:
    :return:
    """
    json_tf_freq = load_json(path_tf_freq)
    json_idf_freq = load_json(path_idf_freq)
    json_tf = load_json(path_tf)
    json_idf = load_json(path_idf)
    json_tf_idf = load_json(path_tf_idf)
    return json_tf_freq[0], json_idf_freq[0], json_tf[0], json_idf[0], json_tf_idf[0]


def dict_add(dict1, dict2):
    """
        merge two dicts, summing the values of shared keys
    :param dict1:
    :param dict2:
    :return:
    """
    for i, j in dict2.items():
        if i in dict1.keys():
            dict1[i] += j
        else:
            dict1.update({f'{i}': dict2[i]})
    return dict1


class TFIDF:
    def __init__(self, questions=None, path_tf=None,
                 path_idf=None, path_tf_idf=None,
                 path_tf_freq=None, path_idf_freq=None,
                 ndigits=12, smooth=0):
        """
            build tf/idf/tf-idf from a corpus, or load them from json
        :param questions: list, input corpus, char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
        """
        self.epsilon = 1e-16
        self.questions = questions
        self.path_tf_freq = path_tf_freq
        self.path_idf_freq = path_idf_freq
        self.path_tf = path_tf
        self.path_idf = path_idf
        self.path_tf_idf = path_tf_idf
        self.ndigits = ndigits
        self.smooth = smooth
        self.create_tfidf()

    def create_tfidf(self):
        if self.questions is not None:  # a questions list (corpus) was passed in
            self.tf_freq = count_tf(self.questions)
            self.idf_freq = count_idf(self.questions)
            self.tf, self.idf, self.tfidf = count_tf_idf(self.tf_freq,
                                                         self.idf_freq,
                                                         ndigits=self.ndigits,
                                                         smooth=self.smooth)
        else:  # load pre-trained values
            self.tf_freq, self.idf_freq, \
            self.tf, self.idf, self.tfidf = load_tf_idf_json(path_tf_freq=self.path_tf_freq,
                                                             path_idf_freq=self.path_idf_freq,
                                                             path_tf=self.path_tf,
                                                             path_idf=self.path_idf,
                                                             path_tf_idf=self.path_tf_idf)
        self.chars = [idf for idf in self.idf.keys()]

    def extract_tfidf_of_sentence(self, ques):
        """
            tf-idf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tfidf[char]
                score_list[char] = self.tfidf[char]
            else:  # unseen words get a tiny epsilon
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average, so sentences of different lengths are comparable
        logger.info(score_list)
        logger.info({ques: score})
        return score

    def extract_tf_of_sentence(self, ques):
        """
            tf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tf[char]
                score_list[char] = self.tf[char]
            else:  # unseen words get a tiny epsilon
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average, so sentences of different lengths are comparable
        logger.info(score_list)
        logger.info({ques: score})
        return score

    def extract_idf_of_sentence(self, ques):
        """
            idf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.idf[char]
                score_list[char] = self.idf[char]
            else:  # unseen words get a tiny epsilon
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average, so sentences of different lengths are comparable
        logger.info(score_list)
        logger.info({ques: score})
        return score


def create_TFIDF(path):
    # test 1: build from a corpus
    import time
    time_start = time.time()
    # first build tf-idf from the full corpus, then use it
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    from tookit_sihui.utils.file_utils import txt_write, txt_read
    path_wiki = path if path else path_tf_idf_corpus
    path_dir = 'tf_idf_freq/'
    # ques = ['大漠帝国最强', '花落惊飞羽最漂亮', '紫色Angle最有气质', '孩子气最活泼', '口袋巧克力和过路蜻蜓最好最可爱啦', '历历在目最烦恼']
    # questions = [list(q.strip()) for q in ques]
    # questions = [list(jieba.cut(que)) for que in ques]
    questions = txt_read(path_wiki)
    len_questions = len(questions)
    batch_size = 1000000
    size_trade = len_questions // batch_size
    print(size_trade)  # number of full batches
    size_end = size_trade * batch_size
    # compute tf and df counts batch by batch
    # (the end range goes to size_end + batch_size so the last full batch is not skipped)
    ques_tf_all, ques_idf_all = {}, {}
    for i, (start, end) in enumerate(zip(range(0, size_end, batch_size),
                                         range(batch_size, size_end + batch_size, batch_size))):
        print("batch {}".format(i))
        question = questions[start: end]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_idf = count_idf(questionss)
        ques_tf = count_tf(questionss)
        print('tf_idf_{}: '.format(i) + str(time.time() - time_start))
        # merge the dicts, summing values
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('dict_add_{}: '.format(i) + str(time.time() - time_start))
        print('tf of 的: {}'.format(ques_tf_all['的']))
        print('idf of 的: {}'.format(ques_idf_all['的']))
    # remainder smaller than batch_size
    if len_questions - size_end > 0:
        print("batch {}".format('last'))
        question = questions[size_end: len_questions]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_tf = count_tf(questionss)
        ques_idf = count_idf(questionss)
        # tf_char, idf_char, tf_idf_char = count_tf_idf(ques_tf, ques_idf)
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('{}: '.format('last') + str(time.time() - time_start))
        print('tf of 的: {}'.format(ques_tf_all['的']))
        print('idf of 的: {}'.format(ques_idf_all['的']))
    # compute tf-idf
    tf_char, idf_char, tf_idf_char = count_tf_idf(ques_tf_all, ques_idf_all)
    print(len(tf_char))
    print('tf-idf ' + str(time.time() - time_start))
    print('tf-idf ok!')
    # save tf, idf and tf-idf
    save_tf_idf_json(path_dir, ques_tf_all, ques_idf_all, tf_char, idf_char, tf_idf_char)
    gg = 0


if __name__ == "__main__":
    # test 1
    path = None  # corpus path; lines are pre-tokenized sentences, e.g. '孩子 气 和 紫色 angle'
    create_TFIDF(path)

    # # test 2: load from json, then score user input
    # path_dir = 'tf_idf_freq/'
    # path_tf = path_dir + '/tf.json'
    # path_idf = path_dir + '/idf.json'
    # path_tf_idf = path_dir + '/tf_idf.json'
    #
    # tfidf = TFIDF(path_tf=path_tf, path_idf=path_idf, path_tf_idf=path_tf_idf)
    # score1 = tfidf.extract_tf_of_sentence('大漠帝国')
    # score2 = tfidf.extract_idf_of_sentence('大漠帝国')
    # score3 = tfidf.extract_tfidf_of_sentence('大漠帝国')
    # print('tf: ' + str(score1))
    # print('idf: ' + str(score2))
    # print('tfidf: ' + str(score3))
    # while True:
    #     print("请输入: ")
    #     ques = input()
    #     tfidf_score = tfidf.extract_tfidf_of_sentence(ques)
    #     print('tfidf:' + str(tfidf_score))
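
For completeness, a quick usage sketch of the hand-rolled pipeline above on a toy corpus (assuming count_tf, count_idf and count_tf_idf are defined or imported in the same module):

# Toy corpus, word level; the batch loop in create_TFIDF does the same at scale.
questions = [['大漠', '帝国'], ['紫色', 'Angle'], ['大漠', '大漠']]
tf_freq = count_tf(questions)    # raw token counts, plus '[LENS]' = total tokens
idf_freq = count_idf(questions)  # document counts, plus '[LENS]' = number of sentences
tf, idf, tfidf = count_tf_idf(tf_freq, idf_freq)
print(tfidf['大漠'], tfidf['帝国'])

Because count_tf and count_idf return plain dicts, batches can be computed independently and merged with dict_add before the final count_tf_idf call; that is what keeps the memory footprint small on corpora like wiki.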

 

Hope this helps!


 
