
[Python] 13: English word frequency count & top K frequent elements


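1. English word frequency count

As a classic exercise for dictionaries (key-value mappings), word counting is one of the standard practice problems after learning key-value pairs in almost any language.
Main requirement:
Write a function wordcount that counts how many times each word appears in a text (a word frequency count). Once counting is done, sort the result by word frequency.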

from collections import Counter  # counting and frequency ranking
text = """
    Enterprise architects will appreciate new capabilities such as lightweight application isolation.
    Application developers will welcome an updated development environment and application-profiling tools. Read more at the Red Hat Developer Blog.
    System administrators will appreciate new management tools and expanded file-system options with improved performance and scalability.

    Deployed on physical hardware, virtual machines, or in the cloud, Red Hat Enterprise Linux 7 delivers the advanced features required for next-generation architectures.
    Where to go from here:

    Red Hat Enterprise Linux 7 Product Page

    The landing page for Red Hat Enterprise Linux 7 information. Learn how to plan, deploy, maintain, and troubleshoot your Red Hat Enterprise Linux 7 system.
    Red Hat Customer Portal

    Your central access point to finding articles, videos, and other Red Hat content, as well as manage your Red Hat support cases.

    Documentation

    Provides documentation related to Red Hat Enterprise Linux and other Red Hat offerings.
    Red Hat Subscription Management

    Web-based administration interface to efficiently manage systems.
    Red Hat Enterprise Linux Product Page

    Provides an entry point to Red Hat Enterprise Linux product offerings.
"""

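# 1. Extract all the words from the string.
#    split() breaks on whitespace only, so punctuation stays attached to words.
words = text.split()

# 2. Count how many times each word appears.
#    1) Where to store the counts: in a dictionary.
#    2) How to update them: see the loop below.
word_count_dict = {}
for word in words:
    if word not in word_count_dict:
        word_count_dict[word] = 1
    else:
        word_count_dict[word] += 1
print(word_count_dict)

# 3. Sort and get the most frequent words.
counter = Counter(word_count_dict)
print(counter.most_common(7))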
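
The requirement asks for a function named wordcount, while the script above runs at module level and treats "Page" and "page." as different words. A minimal sketch of such a function follows. The lowercasing, the regular expression used to strip punctuation, and the sample string are assumptions added here for illustration; they are not part of the original post.

```python
import re
from collections import Counter


def wordcount(text):
    """Return (word, count) pairs sorted by descending frequency."""
    # Assumption: words are case-insensitive runs of letters or digits,
    # optionally joined by inner hyphens or apostrophes ("next-generation").
    words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    # Counter counts the words; most_common() with no argument sorts them all.
    return Counter(words).most_common()


sample = "Red Hat Enterprise Linux 7. Red Hat offerings, and other Red Hat content."
print(wordcount(sample)[:2])  # [('red', 3), ('hat', 3)]
```

Passing the list of words straight to Counter also replaces the manual if/else counting loop, so the dictionary does not have to be built by hand first.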
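However, as covered in the previous post, the setdefault(key, value) method guarantees that a key is initialized to a default value before it is used, and if the key already exists, calling setdefault has no effect. The code above can therefore be simplified with this method. In addition, print writes the whole dictionary on a single line, so we import the pprint module to get more readable output.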

import pprint
from collections import Counter
words = text.split()

word_dict = {}  # key: word, value: count

for item in words:
    word_dict.setdefault(item, 0)  # initialize to 0 only if the key is missing
    word_dict[item] += 1
    
pprint.pprint(word_dict)

count = Counter(word_dict)
pprint.pprint(count.most_common(5))
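
The setdefault behaviour described above can be checked in isolation with a small sketch. The keys and values here are made up for illustration, and defaultdict is shown only as a common alternative from the standard library, not as something the original post uses.

```python
from collections import defaultdict

counts = {}
counts.setdefault("red", 0)  # key is missing: inserted with the value 0
counts["red"] += 1
counts.setdefault("red", 0)  # key already exists: the call changes nothing
counts["red"] += 1
print(counts)                # {'red': 2}

# defaultdict(int) supplies the initial 0 automatically on first access.
auto_counts = defaultdict(int)
for word in "red hat red".split():
    auto_counts[word] += 1
print(dict(auto_counts))     # {'red': 2, 'hat': 1}
```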

2. Top K frequent elements: topKFrequent.py

Given a non-empty array of integers, return the k most frequent elements.

For example,

given the array [1,1,1,2,2,3] and k = 2, return [1,2].

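from collections import Counter

nums = [1,1,1,2,2,3,3,2,1]
k = int(input('>>'))  # read k from the user
results = []

count = Counter(nums)
# most_common(k) returns (element, count) pairs; keep only the elements
for item in count.most_common(k):
    results.append(item[0])

print(results)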
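
For reuse and testing, the same idea can be wrapped in a function that takes k as a parameter instead of reading it from input(). The sketch below is one possible variant: the name top_k_frequent is hypothetical, and heapq.nlargest is swapped in for most_common to make the selection step explicit.

```python
import heapq
from collections import Counter


def top_k_frequent(nums, k):
    """Return the k most frequent elements of nums, most frequent first."""
    counts = Counter(nums)
    # Iterating a Counter yields its keys; rank them by their counts.
    return heapq.nlargest(k, counts, key=counts.get)


print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))  # [1, 2]
```

When several elements share the same count, which of them appears in the result depends on the order in which they were first seen, so the answer is not unique in that case.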

