
Several methods for word-frequency counting and sorting (hand-written / pandas / NLTK)

Word frequency sorting

# A small program for counting word frequencies in a list
'''
Use a dict to present the word frequencies of a list as key-value pairs
'''
ls = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
      "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
      "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
      "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
      "理工", "综合", "综合", "理工", "农林", "民族", "军事"]
d = {}
for word in ls:
    # If d has no such key, d.get(word, 0) returns 0, so the new word starts at a count of 1;
    # if the key already exists, its count is incremented by 1.
    d[word] = d.get(word, 0) + 1

Or alternatively:

word_dict = dict()
for word in ls:
    if word not in word_dict:
        word_dict[word] = 1
    else:
        word_dict[word] += 1

for k in d:
    print("{}:{}".format(k, d[k]))
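For comparison, the standard library's collections.Counter implements the same counting loop in a single call; a minimal sketch (Counter and its most_common method are standard-library facts, not part of the original post):

from collections import Counter

counts = Counter(ls)          # same counts as the hand-written dict loop
print(counts.most_common())   # (word, count) pairs, sorted by count descending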

Word-frequency counting for a Chinese txt file:

Steps:

1. Read all the data out of the txt file with read().

2. Run jieba word segmentation on the data.

3. Put the segmentation results into a dict.

4. Put the dict entries into a list in the desired format.

5. To sort, build list(dict1.items()) and then sort that list with a lambda key:

ls.sort(key=lambda x: x[1], reverse=True)

6. Build a new list and fill it with the sorted (word, count) pairs in the desired save format, or reuse ls and just change the format in place:

for i in range(100):
    ls[i] = "{}:{}".format(ls[i][0], ls[i][1])

7. Write the new list to a csv file with join.

import jieba

fi = open("天龙八部-网络版.txt", "r", encoding='utf-8')
fo = open("天龙八部-词语统计.txt", "w", encoding='utf-8')
txt = fi.read()
words = jieba.lcut(txt)  # without this line you would be counting single characters, not words
d = {}
for w in words:
    d[w] = d.get(w, 0) + 1
del d[' ']
del d['\n']
ls = []
for key in d:
    ls.append("{}:{}".format(key, d[key]))
fo.write(",".join(ls))
fi.close()
fo.close()
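The code above covers steps 1-4 and 7 but skips the sorting of steps 5-6. A minimal sketch of those two steps, reusing the dict d built above (it would go before the write and close calls):

ls = list(d.items())                       # step 5: dict -> list of (word, count) tuples
ls.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
# step 6: reformat the top 100 into "word:count" strings
top = ["{}:{}".format(w, c) for w, c in ls[:100]]
fo.write(",".join(top))                    # step 7: join and write out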

For sorting a list with sort(), see my other post:

https://blog.csdn.net/qq_41228218/article/details/87303183

Applying sort():

Sort Chinese word frequencies, print the top entries, and save them as csv

dict1 = {'w': 1, 'e': 2}
ls = list(dict1.items())
ls1 = []
for i in dict1:
    ls1.append("{}:{}".format(i, dict1[i]))
print(ls)
print(ls1)
# Result:
# [('w', 1), ('e', 2)]
# ['w:1', 'e:2']
# So ls, a list of (key, value) tuples, can be sorted by count, while ls1, a list of plain strings, cannot
ls.sort(key=lambda x: x[1], reverse=True)
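To sort without mutating the list, the built-in sorted() accepts the same key and reverse arguments and returns a new list (a standard-library fact, not from the original post):

top = sorted(dict1.items(), key=lambda x: x[1], reverse=True)
print(top)  # [('e', 2), ('w', 1)]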

On list.sort() with lambda:

dictListsort.sort(key=lambda x: x[1], reverse=True)

https://blog.csdn.net/qq_41228218/article/details/87303183

Full source:

import jieba

fi = open(r"FilePath", "r", encoding='utf-8')
fo = open("FilePath.csv", "w", encoding='utf-8')
txt = fi.read()
words = jieba.lcut(txt)  # without this line you would be counting single characters, not words
d = {}
for w in words:
    # if w in ''' \n,>;:'?!@#$%^&*()''':
    #     continue
    d[w] = d.get(w, 0) + 1
del d[' ']
del d['\n']
del d[","]
del d["。"]
del d["“"]
del d['”']
del d[':']
del d['?']
del d['…']
del d['!']
del d['、']
del d['‘']
del d["’"]
DictListSave = []
DictListSaveSort = []
for key in d:
    DictListSave.append("{}:{}".format(key, d[key]))
fo.write(",".join(DictListSave))  # save in unsorted order
dictListsort = list(d.items())
dictListsort.sort(key=lambda x: x[1], reverse=True)
# print the TOP 100
for i in range(100):
    word, count = dictListsort[i]
    print('{0:<20}{1:>10}'.format(word, count))
for l in dictListsort:
    DictListSaveSort.append("{}:{}".format(l[0], l[1]))
fo.write(",".join(DictListSaveSort))  # save again, sorted
fi.close()
fo.close()
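Joining "word:count" strings with commas produces one long line rather than structured rows. A minimal alternative sketch using the standard csv module (my substitution, not the original post's approach) writes one (word, count) pair per row:

import csv

with open("FilePath.csv", "w", encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])  # header row
    writer.writerows(dictListsort)      # one (word, count) tuple per row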

 

For English word-frequency counting, see my other post:

https://blog.csdn.net/qq_41228218/article/details/87305610

The pandas Series:

import pandas as pd

list_data = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
             "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
             "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
             "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
             "理工", "综合", "综合", "理工", "农林", "民族", "军事"]
data = pd.Series(list_data)
print(data.value_counts())
# print the top three:
print(data.value_counts()[0:3])
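value_counts() returns a Series already sorted by count in descending order, which is why plain slicing yields the top entries; head() is an equivalent spelling (a pandas fact, not shown in the original post):

print(data.value_counts().head(3))  # same as data.value_counts()[0:3]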

pandas with English text:

import pandas as pd
import re

a = 'We need to use window.load, not document.ready, because in Chrome'.lower()
list_data = a.split()
list01 = [re.sub(',', ' ', i) for i in list_data]  # replace commas stuck to tokens with a space
print(pd.Series(list01).value_counts())
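The same cleanup can stay inside pandas through the .str accessor (Series.str.replace is a pandas fact; dropping the comma outright rather than swapping in a space is my choice):

s = pd.Series(a.split()).str.replace(',', '', regex=False)  # strip commas from tokens
print(s.value_counts())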

 

Result: each word is printed with its count, which shows how quick and convenient Series is.

Filtering keywords:

import pandas as pd

list_data = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
             "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
             "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
             "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
             "理工", "综合", "综合", "理工", "农林", "民族", "军事"]
FILTER_WORDS = ['这个', '那个', '什么', '怎么', '如果']
keywords_counts = pd.Series(list_data)
# keep only words longer than one character (drops single-character tokens left over from segmentation)
keywords_counts = keywords_counts[keywords_counts.str.len() > 1]
# filter out meaningless words
keywords_counts = keywords_counts[~keywords_counts.str.contains('|'.join(FILTER_WORDS))]
# print the top 30
keywords_counts = keywords_counts.value_counts()[:30]
print(keywords_counts)
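Because str.contains treats the joined filter list as a regex alternation, it also drops words that merely contain a filter word as a substring. For exact matches only, Series.isin is a stricter alternative at the same point in the pipeline (a pandas fact, not part of the original post):

keywords_counts = keywords_counts[~keywords_counts.isin(FILTER_WORDS)]  # drop exact matches only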

Word-frequency counting with NLTK:

from nltk import FreqDist

list_data = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
             "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
             "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
             "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
             "理工", "综合", "综合", "理工", "农林", "民族", "军事"]
fdist = FreqDist(list_data)
print(fdist["综合"])
# sort and print the top 5
standard_freq_vector = fdist.most_common(5)
print(standard_freq_vector)
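FreqDist also exposes the total token count and relative frequencies directly (N() and freq() are NLTK facts, not shown in the original post):

print(fdist.N())           # total number of tokens counted
print(fdist.freq("综合"))  # relative frequency of a word (count / total)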

 
