
The TF-IDF Algorithm and Its Implementation


I have recently been working through Mofan's (莫烦) NLP course. When it came to actually coding the TF-IDF algorithm I ran into a few small problems, mainly around how the values are computed, so I am writing the notes up here to consolidate what I learned.

There are many ways to compute TF-IDF. This post follows the formulation used by sklearn, which differs somewhat from the course's sample code and took a little effort to work out.

Contents

I. A brief introduction to TF-IDF

1. TF: Term Frequency

2. IDF: Inverse Document Frequency

3. TF-IDF = TF * IDF

II. Code examples

1. Hand-written functions

2. Using the sklearn library



I. A brief introduction to TF-IDF

TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and text mining. It uses simple statistics to estimate how important a word is to a given document and is often used to extract keywords from an article. Because it is simple and efficient, it is commonly used in the coarse-ranking stage of information retrieval.

The core idea of TF-IDF is to estimate statistically how important a word is to a document within a collection or corpus. A word's importance grows with the number of times it appears in the document and shrinks with the number of documents in the corpus that contain it. Weighting words this way suppresses the influence of common words on keyword extraction and makes the extracted keywords more relevant to the article.

1. TF: Term Frequency

TF is the number of times a word occurs in a given document. The formula is simply: TF = (count of the word in the document).

In other words, looking at a single article in isolation, a word that appears more often is more important, but this is not absolute. Words such as a, the, and of inevitably appear many times, yet they clearly carry little information. That is why IDF is introduced next.

Note: some definitions use TF = (count of the word in the document) / (total number of words in the document), but sklearn uses the raw count directly.
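To make this concrete, here is a minimal sketch of the raw-count TF used in this post (the sentence and variable names are just placeholders, not the corpus used later):

from collections import Counter

doc = "it is a dog and that is a cat"
counts = Counter(doc.lower().split())        # raw counts: the TF used in this post
print(counts["is"])                          # 2
print(counts["is"] / sum(counts.values()))   # ~0.22, the length-normalized variant mentioned above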

2. IDF: Inverse Document Frequency

IDF measures how discriminative a word is across a document collection or corpus. The (smoothed) formula used here is:

IDF = \log \frac{N_d + 1}{df(d, t) + 1} + 1

where N_d is the total number of documents in the training set and df(d, t) is the number of documents that contain the word. The +1 terms inside the fraction smooth the estimate and keep the denominator from being 0, and the trailing +1 keeps the IDF of a word that appears in every document from becoming exactly 0.

In other words, over a document collection or corpus, the fewer documents a word appears in, the larger its IDF, the more discriminative the word, and therefore the more important it is.

Note in particular that IDF is defined with respect to a specific document collection or corpus: IDF values computed on a computer-science corpus are usually not appropriate for, say, a medical corpus.
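A minimal sketch of this smoothed IDF over a toy list of tokenized documents (the documents and names below are placeholders, not the corpus used later):

import numpy as np

docs_tokens = [
    ["it", "is", "a", "good", "day"],
    ["it", "is", "sunny", "today"],
    ["today", "is", "a", "good", "day"],
]
n_docs = len(docs_tokens)

def idf(word):
    df = sum(word in d for d in docs_tokens)      # number of documents containing the word
    return 1 + np.log((n_docs + 1) / (df + 1))    # the smoothed IDF defined above

print(idf("is"))      # appears in every doc -> 1.0, least informative
print(idf("sunny"))   # appears in one doc   -> 1 + log(4/2), about 1.69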

3. TF-IDF = TF * IDF

Combining TF, the local information centered on one document, with IDF, which is based on global information about the corpus, gives the following formula:

TF-IDF = TF * IDF

One more important detail:

In sklearn, the TF-IDF values computed above are additionally normalized with the Euclidean (L2) norm, so that each document vector has unit length:

v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}
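As a quick numerical check of this pipeline, a sketch with made-up numbers (not the corpus below):

import numpy as np
from sklearn.preprocessing import normalize

tf  = np.array([[2.0], [1.0], [0.0]])        # raw counts of 3 vocabulary words in one document
idf = np.array([[1.0], [1.69], [2.10]])      # per-word IDF values (made up for illustration)
tf_idf = tf * idf                            # elementwise product -> [[2.0], [1.69], [0.0]]

unit = normalize(tf_idf, norm="l2", axis=0)  # Euclidean (L2) normalization, as sklearn does
print(unit.ravel())                          # approximately [0.76, 0.65, 0.0]
print(np.linalg.norm(unit))                  # 1.0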

II. Code examples

The code below is adapted from the source code of Mofan's NLP course.

Fifteen short documents are fed in, a 44-word vocabulary is built from them (after removing the two very frequent words a and i), and the 44x15 tf-idf matrix of the 15 documents is computed. A query sentence is then turned into its own tf-idf vector, the cosine similarity between that vector and each document's tf-idf vector is computed, and the three closest documents are returned as the search result.

The core idea is vectorization: once the documents and the query are both represented as vectors, how close they are can be measured by comparing the vectors. Note that the similarity used here is the cosine of the angle between the two vectors rather than the more familiar Euclidean distance: cosine similarity cares about the difference in direction between two vectors, not about the difference in their magnitudes.
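A tiny illustration of that difference, with toy vectors that are not from the post:

import numpy as np

def cos_sim(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 2 * a                            # same direction, twice the length
print(cos_sim(a, b))                 # 1.0   -> identical as far as cosine similarity is concerned
print(np.linalg.norm(a - b))         # ~3.74 -> clearly different by Euclidean distance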

1. Hand-written functions

import numpy as np
from collections import Counter
import itertools
from sklearn import preprocessing
from plot import show_tfidf  # visualization helper that ships with the course code
# from sklearn.metrics.pairwise import cosine_similarity

# 15 documents
docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

# vocablist holds 44 words after the two very frequent words are removed
docs_words = [d.lower().replace(",", "").split(" ") for d in docs]
wordlist = list(itertools.chain(*docs_words))  # flatten the nested lists; * unpacks docs_words so chain() receives each inner list
vocablist = list(set(wordlist))
vocablist.sort(key=wordlist.index)  # set() removes duplicates, the sort restores the original order => full vocabulary
vocablist.remove('a')  # removed to stay consistent with sklearn, whose default tokenizer drops 1-character tokens
vocablist.remove('i')
# print(vocablist)
v2i = {v: i for i, v in enumerate(vocablist)}  # word -> index, e.g. 'it': 0
i2v = {i: v for v, i in v2i.items()}           # index -> word, e.g. 0: 'it'
# print(v2i)
# print(i2v)


# tf = raw count of every word in every document
def get_tf():
    # term frequency: how frequent a word appears in a doc
    _tf = np.zeros((len(vocablist), len(docs)), dtype=np.float64)  # [n_vocab, n_doc] => 44 * 15 matrix
    for i, d in enumerate(docs_words):       # loop over the documents
        counter = Counter(d)
        for v in counter.keys():             # count the words in this document
            if v in v2i:
                _tf[v2i[v], i] = counter[v]  # raw count of this word
    return _tf


# idf = 1 + np.log((len(docs) + 1) / (number of documents containing the word + 1))
def get_idf():
    # inverse document frequency: a word that appears in many docs gets a low idf, i.e. it is less informative
    df = np.zeros((len(i2v), 1))
    for i in range(len(i2v)):                   # loop over every word in the vocabulary
        d_count = 0
        for d in docs_words:
            d_count += 1 if i2v[i] in d else 0  # number of documents containing this word
        df[i, 0] = d_count
    idf_fn = lambda x: 1 + np.log((len(docs) + 1) / (x + 1))
    return idf_fn(df)


def cosine_similarity(_tf_idf, q):
    # normalize each document column and the query to unit length, then take dot products
    unit_ds = _tf_idf / np.sqrt(np.sum(np.square(_tf_idf), axis=0, keepdims=True))
    unit_q = q / np.sqrt(np.sum(np.square(q)))
    similarity = unit_ds.T.dot(unit_q).ravel()
    return similarity


# score every document against the query; query words outside the vocabulary are ignored
def docs_score(q):
    q_words = q.replace(",", "").split(" ")
    counter = Counter(q_words)
    q_tf = np.zeros((len(idf), 1), dtype=float)
    for v in counter.keys():
        if v in v2i:
            q_tf[v2i[v], 0] = counter[v]  # raw count of this word in the query
    q_vec = q_tf * idf
    q_tf_idf = preprocessing.normalize(q_vec, norm='l2', axis=0)  # Euclidean (L2) norm normalization
    # q_scores = cosine_similarity(tf_idf.transpose(), q_tf_idf.transpose())  # if using the library's function, pass the normalized vectors
    q_scores = cosine_similarity(origin_tf_idf, q_vec)  # the hand-written function normalizes internally, so pass the raw vectors
    return q_scores


# print the n words with the highest tf-idf in each document
def get_keywords(n=2):
    for c in range(len(docs)):
        col = tf_idf[:, c]
        idx = np.argsort(col)[-n:][::-1]  # argsort is ascending, so take the last n indices and reverse them
        print("doc{}, top{} keywords {}".format(c, n, [i2v[i] for i in idx]))


# ---------- TEST
tf = get_tf()             # [n_vocab, n_doc]  44 * 15
idf = get_idf()           # [n_vocab, 1]      44 * 1
origin_tf_idf = tf * idf  # [n_vocab, n_doc]  44 * 15
tf_idf = preprocessing.normalize(origin_tf_idf, norm='l2', axis=0)  # Euclidean (L2) norm normalization
print("\ntf samples:\n", tf[:2])
print("\nidf sample:\n", idf[:2])
print("\ntf_idf sample:\n", tf_idf[:2])

# --- extract keywords
get_keywords()

# --- search for the most similar documents
q = "I get a coffee cup"
scores = docs_score(q)
d_ids = scores.ravel().argsort()[-3:][::-1]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in d_ids]))

show_tfidf(tf_idf.T, [i2v[i] for i in range(tf_idf.shape[0])], "tfidf_matrix")

2. Using the sklearn library

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from plot import show_tfidf  # visualization helper that ships with the course code

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)
# print("idf: ", [(n, idf) for idf, n in zip(vectorizer.idf_, vectorizer.get_feature_names())])  # get_feature_names_out() on sklearn >= 1.2
# print("v2i: ", vectorizer.vocabulary_)
# print(tf_idf)

q = "I get a coffee cup"
qtf_idf = vectorizer.transform([q])
res = cosine_similarity(tf_idf, qtf_idf)
res = res.ravel().argsort()[-3:]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in res[::-1]]))

i2v = {i: v for v, i in vectorizer.vocabulary_.items()}
dense_tfidf = tf_idf.todense()  # tf_idf is a sparse matrix
show_tfidf(dense_tfidf, [i2v[i] for i in range(dense_tfidf.shape[1])], "tfidf_sklearn_matrix")
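For reference, the TfidfVectorizer defaults that make this line up with the hand-written version above are, to the best of my knowledge, the following; this is only an explicit restatement of the defaults, not extra configuration:

vectorizer = TfidfVectorizer(
    norm="l2",           # L2-normalize each document vector
    smooth_idf=True,     # idf = ln((1 + n_docs) / (1 + df)) + 1, the formula from section I
    sublinear_tf=False,  # TF is the raw count, not 1 + log(count)
)
# The default token_pattern only keeps tokens of two or more characters,
# which is why "a" and "i" were removed from the hand-written vocabulary.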
