TF-IDF

I. TF-IDF

1. What is TF-IDF?

TF-IDF is a weighting technique commonly used in information retrieval and data mining.

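  • TF(t, d) is the term frequency (Term Frequency): the number of times term t appears in document d.
  • DF(t, D) is the document frequency: the number of documents in corpus D that contain term t.
  • |D| is the total number of documents in the corpus.
  • IDF is the inverse document frequency (Inverse Document Frequency), computed with the natural logarithm:
    $$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}$$
    The larger the ratio of |D| to DF(t, D), the rarer the term is across the corpus, and the better it distinguishes the documents that contain it. When every document contains the term, the ratio is 1 and the logarithm is 0. Without smoothing, the denominator DF(t, D) would be 0 for a term that occurs in no document, so 1 is added to the denominator. To keep the logarithm at 0 when |D| equals DF(t, D), 1 is added to the numerator as well.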
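
The smoothing described above can be checked with a few lines of plain Scala. This is a minimal sketch, not Spark's implementation: the object name `IdfSketch` and the corpus size of 3 are assumptions chosen for illustration.

```scala
object IdfSketch {
  /** Smoothed IDF: ln((|D| + 1) / (DF(t, D) + 1)). */
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val numDocs = 3L // assumed corpus size, for illustration only
    // docFreq = 0 would divide by zero without the +1 in the denominator.
    for (df <- 0L to numDocs)
      println(f"DF = $df -> IDF = ${idf(numDocs, df)}%.4f")
  }
}
```

The value falls as DF grows and reaches exactly 0 when the term occurs in every document (DF = |D|).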

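However, a document may contain many frequently repeated words that carry no real meaning, such as "a", "an" and "the", so raw term frequency alone is a poor measure of importance. TF-IDF is used to express how important a term is to a document:
$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)$$
The formula shows that a term with a high term frequency that also appears in many documents has a very small IDF, which pulls its weight down. Combining the two factors therefore gives a good measure of how important a term is to a document.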
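
To make the damping effect concrete, here is a small sketch in plain Scala. The counts are hypothetical (a 10-document corpus, a stop word with TF 5 that occurs in all 10 documents, and a rare term with TF 2 that occurs in 1 document), and `TfIdfWeightSketch` is an illustrative name, not part of any library.

```scala
object TfIdfWeightSketch {
  /** TF-IDF weight with smoothed IDF: tf * ln((|D| + 1) / (DF + 1)). */
  def tfIdf(tf: Long, numDocs: Long, docFreq: Long): Double =
    tf * math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val numDocs = 10L // hypothetical corpus size

    // A stop word such as "the": frequent in this document, present in every document.
    val stopWord = tfIdf(tf = 5, numDocs = numDocs, docFreq = 10)

    // A hypothetical rare term: fewer occurrences, but present in only one document.
    val rareTerm = tfIdf(tf = 2, numDocs = numDocs, docFreq = 1)

    println(s"stop word weight = $stopWord, rare term weight = $rareTerm")
  }
}
```

Although the stop word has the higher raw count, its weight is 0, while the rare term keeps a weight of about 3.41.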

2. The official Spark example

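import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

def tfidf(): Unit = {
    // The master URL is expected to be supplied externally, e.g. by spark-submit.
    val spark = SparkSession.builder().appName("TFIDF").getOrCreate()

    // Three labelled sentences; each sentence is treated as one document.
    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes"),
      (1.0, "Logistic regression models are neat")
    )).toDF("label", "sentence")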


    /**
     * Split each sentence into words.
     */
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    /*
    +-----+-----------------------------------+------------------------------------------+
    |label|sentence                           |words                                     |
    +-----+-----------------------------------+------------------------------------------+
    |0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
    |0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
    |1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
    +-----+-----------------------------------+------------------------------------------+
     */


    /**
     * Build term-frequency feature vectors with hashingTF.transform().
     */
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeature")
    val featurizedData =  hashingTF.transform(wordsData)
    featurizedData.show(10,false)
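/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+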
|label|sentence                           |words                                     |rawFeature                                                                          |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+

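In this table, the five tokens [hi, i, heard, about, spark] hash to the five indices [18700,19036,33808,66273,173558]. The indices are listed in ascending order, not in token order; for instance, "i" maps to 19036, the only index shared by the first two rows. [1.0,1.0,1.0,1.0,1.0] gives the number of times each term occurs in the sentence.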
 */


    /**
     * Rescale the raw term frequencies with IDF. IDF is an Estimator: calling fit() on the
     * feature vectors produces an IDFModel, whose transform() applies the IDF weights.
     */
    val idf = new IDF().setInputCol("rawFeature").setOutputCol("feature")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData.show(10, false)

/*
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|sentence                           |words                                     |rawFeature                                                                          |feature                                                                                                                                                                                       |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |(262144,[46243,58267,91006,160975,190884],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
+-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

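The table shows that "hi" appears only in the first sentence, so its TF-IDF value (ln(4/2) ≈ 0.693) is larger than that of "i" (ln(4/3) ≈ 0.288), which also appears in the second sentence. "hi" is therefore more representative of the first sentence.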
 */
  }
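
The weights in the last table can be cross-checked without Spark. The sketch below recomputes them in plain Scala from the three example sentences, assuming lower-casing plus whitespace splitting (which is what Tokenizer does) and the smoothed IDF formula from section 1. The object and method names are illustrative and not part of Spark.

```scala
object TfIdfCrossCheck {
  val sentences: Seq[String] = Seq(
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic regression models are neat"
  )

  // Lower-case each sentence and split it on whitespace.
  val docs: Seq[Seq[String]] = sentences.map(_.toLowerCase.split("\\s+").toSeq)

  // DF(t, D): the number of documents that contain each term.
  val docFreq: Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

  // Smoothed IDF: ln((|D| + 1) / (DF(t, D) + 1)).
  def idf(term: String): Double =
    math.log((docs.size + 1.0) / (docFreq.getOrElse(term, 0) + 1.0))

  // TF-IDF weight of every term in one document, keyed by the term itself.
  def tfIdf(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (t, occ) => t -> occ.size * idf(t) }

  def main(args: Array[String]): Unit =
    docs.foreach(d => println(tfIdf(d)))
}
```

Because this version keys the weights by term rather than by hashed index, it is easy to see that "i" (DF = 2) gets ln(4/3) while every other term (DF = 1) gets ln(4/2).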

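The rawFeature vectors have size 262144 (2^18, the default numFeatures of HashingTF) because each term is mapped to a bucket by hashing rather than by a vocabulary lookup. The following is a rough sketch of that hashing trick in plain Scala. It uses `scala.util.hashing.MurmurHash3.stringHash` as a stand-in hash, so it will not reproduce the indices Spark prints; `HashingTfSketch`, `bucket` and `termFrequencies` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object HashingTfSketch {
  val numFeatures: Int = 1 << 18 // 262144, the vector size shown in the output

  // Non-negative bucket index for a term (stand-in hash, not Spark's exact one).
  def bucket(term: String): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(term), numFeatures)

  // Sparse term-frequency vector as (index, count) pairs sorted by index.
  // Terms that collide in the same bucket have their counts added together.
  def termFrequencies(tokens: Seq[String]): Seq[(Int, Double)] =
    tokens.groupBy(bucket)
      .map { case (index, ts) => index -> ts.size.toDouble }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    println(termFrequencies(Seq("hi", "i", "heard", "about", "spark")))
}
```

Sorting by index mirrors the ascending index order seen in the sparse vectors above, and it is the reason the index list does not follow the token order.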