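TF-IDF is a weighting technique commonly used in information retrieval and data mining.
However, a document may contain many words that are repeated often yet carry no real meaning, such as a, an, and the. TF-IDF is used to express how important a term actually is to a document.
From the TF-IDF formula we can see that if a term has a high frequency but also appears in many documents, its IDF is small. Combining the two therefore gives a good measure of how important a term is to a document.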
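The formula referred to above did not survive in this copy of the article. As a stand-in, the definition below is the one given in the Spark MLlib documentation (an assumption about what the author had in mind, though it agrees with the output values shown in the example that follows). Here $t$ is a term, $d$ a document, $D$ the corpus, and $\mathrm{DF}(t, D)$ the number of documents that contain $t$.

```latex
\begin{align}
\mathrm{TF}(t, d)       &= \text{number of times } t \text{ appears in } d \\
\mathrm{IDF}(t, D)      &= \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1} \\
\mathrm{TFIDF}(t, d, D) &= \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
\end{align}
```

The $+1$ terms are smoothing: a term that appears in every document gets $\mathrm{IDF} = \log 1 = 0$, and a term outside the corpus does not cause a division by zero.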
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

def tfidf(): Unit = {
  val spark = SparkSession.builder().appName("TFIDF").getOrCreate()

  val sentenceData = spark.createDataFrame(Seq(
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
  )).toDF("label", "sentence")

  /**
   * Split each sentence into words
   */
  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  val wordsData = tokenizer.transform(sentenceData)
  /*
  +-----+-----------------------------------+------------------------------------------+
  |label|sentence                           |words                                     |
  +-----+-----------------------------------+------------------------------------------+
  |0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
  |0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
  |1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
  +-----+-----------------------------------+------------------------------------------+
  */

  /**
   * Create the term-frequency feature vectors with hashingTF.transform()
   */
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeature")
  val featurizedData = hashingTF.transform(wordsData)
  featurizedData.show(10, false)
  /*
  +-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
  |label|sentence                           |words                                     |rawFeature                                                                          |
  +-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
  |0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |
  |0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
  |1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |
  +-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+
  In this table the tokens [hi, i, heard, about, spark] hash to the indices
  [18700,19036,33808,66273,173558]. A sparse vector lists its indices in ascending order,
  so positions need not match word order. [1.0,1.0,1.0,1.0,1.0] holds the number of times
  each term occurs in the sentence.
  */

  /**
   * Use IDF to rescale the feature vectors. IDF is an Estimator; calling its fit()
   * method on the feature vectors produces an IDFModel.
   */
  val idf = new IDF().setInputCol("rawFeature").setOutputCol("feature")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)
  rescaledData.show(10, false)
  /*
  +-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  |label|sentence                           |words                                     |rawFeature                                                                          |feature                                                                                                                                                                                       |
  +-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  |0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                     |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
  |0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(262144,[19036,20719,55551,58672,98717,109547,192310],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
  |1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                    |(262144,[46243,58267,91006,160975,190884],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
  +-----+-----------------------------------+------------------------------------------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  As the table shows, hi appears only in the first sentence, so its TF-IDF value (about 0.693)
  is larger than that of i (about 0.288, index 19036, which appears in the first two sentences).
  hi is therefore more representative of the first sentence.
  */
}
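The two distinct values in the feature column can be checked by hand without Spark. The sketch below is a minimal illustration, assuming the smoothed IDF from the Spark MLlib documentation, log((|D| + 1) / (DF + 1)). The object name `IdfCheck` and its helper methods are hypothetical names introduced here and are not part of Spark. It also skips hashing and works on the lowercased tokens directly.

```scala
object IdfCheck {
  // Smoothed IDF (assumed to match Spark's IDF): log((|D| + 1) / (DF(t, D) + 1))
  def smoothedIdf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // Number of documents that contain each term (each document counts at most once).
  def docFrequencies(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }

  def main(args: Array[String]): Unit = {
    // Same three sentences as in the example, lowercased and split on whitespace.
    val docs = Seq(
      "Hi I heard about Spark",
      "I wish Java could use case classes",
      "Logistic regression models are neat"
    ).map(_.toLowerCase.split("\\s+").toSeq)

    val df = docFrequencies(docs)
    // Every token occurs once per sentence here (TF = 1), so TF-IDF equals IDF.
    for (term <- Seq("hi", "i")) {
      println(s"$term: df=${df(term)}, idf=${smoothedIdf(docs.size, df(term))}")
    }
  }
}
```

With three documents, hi (found in one sentence) gets log(4/2), roughly 0.693, and i (found in two sentences) gets log(4/3), roughly 0.288, which agrees with the feature column above.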