
Keyword Extraction with TF-IDF


Introduction to TF-IDF

TF-IDF is the product of two parts: TF and IDF. The two terms are explained in turn below.

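TF: term frequency. Put simply, it is how often a word occurs in a document. It is also simple to compute:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{i,k}}$$

That is, the term frequency of word j in document i equals the number of occurrences n_{i,j} of word j in document i divided by the total number of words in document i.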

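IDF: inverse document frequency. Start with the document frequency DF: the fraction of all documents in which a word appears. For example, if there are 100 documents in total and the word appears in 10 of them, its DF is 0.1. IDF is the reciprocal of this DF, which is 10 here. In practice, 1 is added to the denominator (the number of documents containing the word) so that it can never be 0, and then the logarithm is taken. IDF keeps common words from dominating the frequency ranking, which would otherwise turn the extracted keywords into meaningless common words (prepositions, for example).

$$idf_j = \log \frac{|D|}{1 + |\{d : t_j \in d\}|}$$

That is, the inverse document frequency of word j equals the total number of documents |D| divided by one plus the number of documents that contain word j, with the logarithm taken of the result.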

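The final TF-IDF score multiplies the term frequency by the inverse document frequency. Because common words receive a low IDF, they are pushed down the ranking, and the highest-scoring words can be taken as the keywords of the document.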
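The two formulas above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used later in this post: the function name `tf_idf` and the toy corpus are made up for the example, and the documents are assumed to be already tokenized.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    docs is a list of token lists. TF is count / document length and
    IDF is log(N / (1 + number of documents containing the word)).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / (1 + df[word]))
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "keyword extraction finds keyword candidates in text".split(),
    "the network method ranks words in a graph".split(),
    "text mining is a broad field".split(),
    "the graph has nodes".split(),
]
scores = tf_idf(corpus)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
for word, score in top:
    print(word, round(score, 4))
```

One consequence of the +1 in the denominator is worth noting: a word that appears in every document gets a slightly negative IDF, since log(N / (N + 1)) is below zero. Some implementations use a different smoothing for that reason.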

Implementation

    from jieba.analyse import extract_tags

    string = "Automatic keyword extraction is to extract topical and important words or phrases form document or document set. It is a basic and necessary work in text mining tasks such as text retrieval and text summarization. This paper discusses the connotation of keyword extraction and automatic keyword extraction. In the light of linguistics, cognitive science, complexity science, psychology and social science, this paper studies the theoretical basis of automatic keyword extraction. From macro, meso and micro perspectives, the development, techniques and methods of automatic keyword extraction are reviewed and analyzed. This paper summarizes the current key technologies and research progress of automatic keyword extraction methods, including statistical methods, topic based methods, and network based methods. The evaluation approach of automatic keyword extraction is analyzed, and the challenges and trends of automatic keyword extraction are also predicted."

    # Print each extracted keyword with its TF-IDF weight (top 20 by default).
    for keyword, weight in extract_tags(string, withWeight=True):
        print('%s %s' % (keyword, weight))

Keyword Extraction Based on Complex Networks

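        Complex-network-based extraction relies mainly on the co-occurrence of words in the text. If word A and word B appear in the same sentence, an edge of weight 1 is created between them in the network (if the edge already exists, its weight is simply increased by 1). A complex network is built this way, and topological features of the network are then used to measure how important each word is, which achieves keyword extraction. Measures such as degree and closeness centrality reflect the importance of nodes in the network, but further processing is still needed; otherwise the same problem as with raw term frequency appears: common words interfere too much. In English, words such as and/is/or carry no real meaning yet occur very frequently (this also shows up in the results of the code later on), so keyword extraction usually has to combine several different topological features of the network.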
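The edge-weight rule described above (create an edge with weight 1, or add 1 if it already exists) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `build_cooccurrence_graph` and the three sample sentences are invented for the example, and each pair of words is counted once per sentence.

```python
from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(sentences):
    """Build an undirected graph whose edge weights count co-occurrences."""
    graph = nx.Graph()
    for sentence in sentences:
        # Each distinct pair of words in a sentence co-occurs once.
        words = sorted(set(sentence.split()))
        for a, b in combinations(words, 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph

# Hypothetical sentences, invented for illustration only.
sentences = [
    "keyword extraction uses graphs",
    "keyword extraction uses statistics",
    "graphs have nodes",
]
G = build_cooccurrence_graph(sentences)
print(G["keyword"]["extraction"]["weight"])

# Weighted degree (strength): the sum of the weights of a node's edges.
strength = dict(G.degree(weight="weight"))
print(sorted(strength.items(), key=lambda kv: kv[1], reverse=True))
```

Passing `weight="weight"` to `G.degree` makes repeated co-occurrences count more than once, which the plain (unweighted) degree does not capture.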

The implementation is as follows:

    # coding = utf-8
    import networkx as nx
    import matplotlib.pyplot as plt
    from operator import itemgetter

    string = "Automatic keyword extraction is to extract topical and important words or phrases form document or document set. It is a basic and necessary work in text mining tasks such as text retrieval and text summarization. This paper discusses the connotation of keyword extraction and automatic keyword extraction. In the light of linguistics, cognitive science, complexity science, psychology and social science, this paper studies the theoretical basis of automatic keyword extraction. From macro, meso and micro perspectives, the development, techniques and methods of automatic keyword extraction are reviewed and analyzed. This paper summarizes the current key technologies and research progress of automatic keyword extraction methods, including statistical methods, topic based methods, and network based methods. The evaluation approach of automatic keyword extraction is analyzed, and the challenges and trends of automatic keyword extraction are also predicted."

    G = nx.Graph()
    # Split the text into sentences on '.'.
    sentences = string.split('.')
    for s in sentences:
        # Splitting on a single space keeps empty strings and attached
        # punctuation as tokens, which is why '' and 'methods,' show up in the results.
        words = s.split(' ')
        # Connect every pair of distinct words that share a sentence.
        for w1 in words:
            for w2 in words:
                if w1 != w2:
                    G.add_edge(w1, w2)

    # Draw the network, save it to a file, and display it.
    nx.draw(G, pos=nx.spring_layout(G), with_labels=True, node_size=50)
    plt.savefig('network.png')
    plt.show()

    print('------------Degree ranking-------------')
    degree = sorted(G.degree(), key=itemgetter(1), reverse=True)
    print(degree)

    print('----------Closeness centrality-----------')
    closeness = nx.closeness_centrality(G)
    c = sorted(closeness.items(), key=itemgetter(1), reverse=True)
    print(c)

Results:

TF-IDF method:

    keyword 1.0395450002521738
    extraction 1.0395450002521738
    automatic 0.7796587501891303
    methods 0.6497156251576086
    text 0.38982937509456517
    paper 0.38982937509456517
    science 0.38982937509456517
    document 0.25988625006304344
    analyzed 0.25988625006304344
    based 0.25988625006304344
    Automatic 0.12994312503152172
    extract 0.12994312503152172
    topical 0.12994312503152172
    important 0.12994312503152172
    words 0.12994312503152172
    phrases 0.12994312503152172
    form 0.12994312503152172
    set 0.12994312503152172
    basic 0.12994312503152172
    necessary 0.12994312503152172

Complex network method:

    ------------Degree ranking-------------
    [('and', 78), ('', 67), ('keyword', 65), ('extraction', 65), ('the', 54), ('of', 54), ('automatic', 54), ('is', 40), ('paper', 35), ('methods', 31), ('are', 26), ('This', 23), ('summarizes', 21), ('current', 21), ('key', 21), ('technologies', 21), ('research', 21), ('progress', 21), ('methods,', 21), ('including', 21), ('statistical', 21), ('topic', 21), ('based', 21), ('network', 21), ('In', 19), ('light', 19), ('linguistics,', 19), ('cognitive', 19), ('science,', 19), ('complexity', 19), ('psychology', 19), ('social', 19), ('this', 19), ('studies', 19), ('theoretical', 19), ('basis', 19), ('From', 17), ('macro,', 17), ('meso', 17), ('micro', 17), ('perspectives,', 17), ('development,', 17), ('techniques', 17), ('reviewed', 17), ('analyzed', 17), ('The', 16), ('evaluation', 16), ('approach', 16), ('analyzed,', 16), ('challenges', 16), ('trends', 16), ('also', 16), ('predicted', 16), ('It', 15), ('a', 15), ('basic', 15), ('necessary', 15), ('work', 15), ('in', 15), ('text', 15), ('mining', 15), ('tasks', 15), ('such', 15), ('as', 15), ('retrieval', 15), ('summarization', 15), ('Automatic', 14), ('to', 14), ('extract', 14), ('topical', 14), ('important', 14), ('words', 14), ('or', 14), ('phrases', 14), ('form', 14), ('document', 14), ('set', 14), ('discusses', 10), ('connotation', 10)]
    ----------Closeness centrality-----------
    [('and', 1.0), ('', 0.8764044943820225), ('keyword', 0.8571428571428571), ('extraction', 0.8571428571428571), ('the', 0.7647058823529411), ('of', 0.7647058823529411), ('automatic', 0.7647058823529411), ('is', 0.6724137931034483), ('paper', 0.6446280991735537), ('methods', 0.624), ('are', 0.6), ('This', 0.5864661654135338), ('summarizes', 0.5777777777777777), ('current', 0.5777777777777777), ('key', 0.5777777777777777), ('technologies', 0.5777777777777777), ('research', 0.5777777777777777), ('progress', 0.5777777777777777), ('methods,', 0.5777777777777777), ('including', 0.5777777777777777), ('statistical', 0.5777777777777777), ('topic', 0.5777777777777777), ('based', 0.5777777777777777), ('network', 0.5777777777777777), ('In', 0.5693430656934306), ('light', 0.5693430656934306), ('linguistics,', 0.5693430656934306), ('cognitive', 0.5693430656934306), ('science,', 0.5693430656934306), ('complexity', 0.5693430656934306), ('psychology', 0.5693430656934306), ('social', 0.5693430656934306), ('this', 0.5693430656934306), ('studies', 0.5693430656934306), ('theoretical', 0.5693430656934306), ('basis', 0.5693430656934306), ('From', 0.5611510791366906), ('macro,', 0.5611510791366906), ('meso', 0.5611510791366906), ('micro', 0.5611510791366906), ('perspectives,', 0.5611510791366906), ('development,', 0.5611510791366906), ('techniques', 0.5611510791366906), ('reviewed', 0.5611510791366906), ('analyzed', 0.5611510791366906), ('The', 0.5571428571428572), ('evaluation', 0.5571428571428572), ('approach', 0.5571428571428572), ('analyzed,', 0.5571428571428572), ('challenges', 0.5571428571428572), ('trends', 0.5571428571428572), ('also', 0.5571428571428572), ('predicted', 0.5571428571428572), ('It', 0.5531914893617021), ('a', 0.5531914893617021), ('basic', 0.5531914893617021), ('necessary', 0.5531914893617021), ('work', 0.5531914893617021), ('in', 0.5531914893617021), ('text', 0.5531914893617021), ('mining', 0.5531914893617021), ('tasks', 0.5531914893617021), ('such', 0.5531914893617021), ('as', 0.5531914893617021), ('retrieval', 0.5531914893617021), ('summarization', 0.5531914893617021), ('Automatic', 0.5492957746478874), ('to', 0.5492957746478874), ('extract', 0.5492957746478874), ('topical', 0.5492957746478874), ('important', 0.5492957746478874), ('words', 0.5492957746478874), ('or', 0.5492957746478874), ('phrases', 0.5492957746478874), ('form', 0.5492957746478874), ('document', 0.5492957746478874), ('set', 0.5492957746478874), ('discusses', 0.5342465753424658), ('connotation', 0.5342465753424658)]
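In the network results above, 'and', the empty string, and tokens with trailing punctuation such as 'methods,' rank near the top. One possible form of the further processing mentioned earlier is to clean the tokens before building the network. The sketch below is only an illustration: the stop-word list is a small hand-picked assumption, the helper names `tokenize` and `degree_ranking` are invented for the example, and the sample text is a shortened stand-in, not the string used in this post.

```python
from itertools import combinations
from string import punctuation
import networkx as nx

# A tiny hand-picked stop-word list, assumed here for illustration only.
STOP_WORDS = {"a", "and", "are", "as", "in", "is", "it", "of", "or",
              "the", "this", "to"}

def tokenize(sentence):
    """Lowercase, strip surrounding punctuation, drop empties and stop words."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(punctuation).lower()
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

def degree_ranking(text):
    """Rank words by degree in the sentence-level co-occurrence network."""
    graph = nx.Graph()
    for sentence in text.split("."):
        words = sorted(set(tokenize(sentence)))
        graph.add_edges_from(combinations(words, 2))
    return sorted(graph.degree(), key=lambda kv: kv[1], reverse=True)

# Shortened sample text, made up for this sketch.
sample = ("Automatic keyword extraction is a basic and necessary task. "
          "This paper reviews the methods of automatic keyword extraction, "
          "including statistical methods and network based methods.")
for word, degree in degree_ranking(sample)[:5]:
    print(word, degree)
```

Lowercasing also merges 'Automatic' and 'automatic', which both of the original result lists count as separate words.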

For reference, the original text of string:

Automatic keyword extraction is to extract topical and important words or phrases form document or document set. It is a basic and necessary work in text mining tasks such as text retrieval and text summarization. This paper discusses the connotation of keyword extraction and automatic keyword extraction. In the light of linguistics, cognitive science, complexity science, psychology and social science, this paper studies the theoretical basis of automatic keyword extraction. From macro, meso and micro perspectives, the development, techniques and methods of automatic keyword extraction are reviewed and analyzed. This paper summarizes the current key technologies and research progress of automatic keyword extraction methods, including statistical methods, topic based methods, and network based methods. The evaluation approach of automatic keyword extraction is analyzed, and the challenges and trends of automatic keyword extraction are also predicted.

 

References:

http://www.jos.org.cn/html/2017/9/5301.htm
