Analysis of a small word-frequency counter for a list:

```python
# Word-frequency counting for a list
'''
Use a dict to present the word frequencies of a list as key-value pairs
'''
ls = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
      "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
      "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
      "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
      "理工", "综合", "综合", "理工", "农林", "民族", "军事"]

d = {}
for word in ls:
    # if the key is new, d.get(word, 0) returns 0 and the count starts at 1;
    # otherwise the existing count is incremented
    d[word] = d.get(word, 0) + 1
```

Or:

```python
word_dict = dict()
for word in ls:
    if word not in word_dict:
        word_dict[word] = 1
    else:
        word_dict[word] += 1
```

Printing the result:

```python
for k in d:
    print("{}:{}".format(k, d[k]))
```
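The same counting can also be done in one call with the standard library's collections.Counter (a minimal sketch using a shortened list, not from the original post):

```python
from collections import Counter

# Counter builds the word -> count mapping in one call
words = ["综合", "理工", "综合", "师范", "理工", "综合"]
d = Counter(words)

print(d["综合"])         # count of a single word
print(d.most_common(2))  # top 2 (word, count) pairs
```

Counter is a dict subclass, so the dict-based code above keeps working on it unchanged.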
Steps:
1. Read the entire txt file with read()
2. Segment the text with jieba
3. Put the segmentation results into a dict
4. Put the dict entries into a list in the chosen format
5. To sort, take list(dict1.items()) and sort that list with a lambda key:
ls.sort(key=lambda x: x[1], reverse=True)
6. Build a new list and fill it with the sorted pairs in the chosen save format,
or keep using ls, just reformatted:
for i in range(100):
    ls[i] = "{}:{}".format(ls[i][0], ls[i][1])
7. join() the new list and write it to the csv file
```python
import jieba

fi = open("天龙八部-网络版.txt", "r", encoding='utf-8')
fo = open("天龙八部-词语统计.txt", "w", encoding='utf-8')
txt = fi.read()
words = jieba.lcut(txt)  # without this line we would be counting single characters
d = {}
for w in words:
    d[w] = d.get(w, 0) + 1
del d[' ']   # drop whitespace tokens
del d['\n']
ls = []
for key in d:
    ls.append("{}:{}".format(key, d[key]))
fo.write(",".join(ls))
fi.close()
fo.close()
```
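One caveat with the del lines above: del d[' '] raises a KeyError when the text happens to contain no such token. dict.pop with a default removes a key only if it is present (a small sketch with made-up counts):

```python
d = {"天龙": 3, " ": 5, "\n": 2}

# pop(key, None) deletes the key if present and silently does nothing otherwise
for junk in [" ", "\n", ","]:
    d.pop(junk, None)

print(d)  # only real words remain
```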
For sorting a list with sort(), see my other post:
https://blog.csdn.net/qq_41228218/article/details/87303183
Applying sort(): sort Chinese word frequencies, print the top entries, and save them as csv.
```python
dict1 = {'w': 1, 'e': 2}
ls = list(dict1.items())
ls1 = []
for i in dict1:
    ls1.append("{}:{}".format(i, dict1[i]))
print(ls)
print(ls1)
# Result:
# [('w', 1), ('e', 2)]
# ['w:1', 'e:2']
# So ls, a list of (word, count) tuples, can be sorted by count,
# while ls1, a list of plain strings, cannot
ls.sort(key=lambda x: x[1], reverse=True)
```
On list.sort() with a lambda key, e.g.
dictListsort.sort(key=lambda x: x[1], reverse=True)
see: https://blog.csdn.net/qq_41228218/article/details/87303183
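The key=lambda x: x[1] idiom sorts (word, count) pairs by their count; sorted() is the variant that returns a new list instead of reordering in place (a minimal sketch):

```python
pairs = [("理工", 11), ("综合", 23), ("师范", 2)]

# sorted() leaves pairs untouched and returns a sorted copy
top2 = sorted(pairs, key=lambda x: x[1], reverse=True)[:2]
print(top2)

# sort() reorders the list itself, largest count first
pairs.sort(key=lambda x: x[1], reverse=True)
print(pairs)
```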
Full source:

```python
import jieba

fi = open(r"FilepPath", "r", encoding='utf-8')
fo = open("FilepPath.csv", "w", encoding='utf-8')

txt = fi.read()
words = jieba.lcut(txt)  # without this line we would be counting single characters

d = {}
for w in words:
    # if w in ''' \n,>;:'?!@#$%^&*()''':
    #     continue
    d[w] = d.get(w, 0) + 1

# drop whitespace and punctuation tokens; pop() avoids a KeyError
# when a character does not occur in the text
for junk in [' ', '\n', ',', '。', '“', '”', ':', '?', '…', '!', '、', '‘', '’']:
    d.pop(junk, None)

DictListSave = []
DictListSaveSort = []

for key in d:
    DictListSave.append("{}:{}".format(key, d[key]))
fo.write(",".join(DictListSave))  # saved in arbitrary (dict) order

dictListsort = list(d.items())
dictListsort.sort(key=lambda x: x[1], reverse=True)

# print the top 100
for word, count in dictListsort[:100]:
    print('{0:<20}{1:>10}'.format(word, count))

for l in dictListsort:
    DictListSaveSort.append("{}:{}".format(l[0], l[1]))
fo.write(",".join(DictListSaveSort))  # saved sorted by frequency
fi.close()
fo.close()
```
For English word-frequency counting, see my other post:
https://blog.csdn.net/qq_41228218/article/details/87305610
```python
import pandas as pd

list_data = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
             "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
             "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
             "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
             "理工", "综合", "综合", "理工", "农林", "民族", "军事"]

data = pd.Series(list_data)
print(data.value_counts())

# print the top three
print(data.value_counts()[0:3])
```
pandas with English text:

```python
import pandas as pd
import re

a = 'We need to use window.load, not document.ready, because in Chrome'.lower()
list_data = a.split()
# strip commas so "chrome," and "chrome" would count as the same word
list01 = [re.sub(',', '', i) for i in list_data]
print(pd.Series(list01).value_counts())
```
As the output shows, a Series makes this quick and convenient.
Filtering keywords:

```python
import pandas as pd

list_data = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
             "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
             "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
             "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
             "理工", "综合", "综合", "理工", "农林", "民族", "军事"]
FILTER_WORDS = ['这个', '那个', '什么', '怎么', '如果']
keywords_counts = pd.Series(list_data)
# drop single-character tokens
keywords_counts = keywords_counts[keywords_counts.str.len() > 1]
# drop meaningless filler words
keywords_counts = keywords_counts[~keywords_counts.str.contains('|'.join(FILTER_WORDS))]
# top 30
keywords_counts = keywords_counts.value_counts()[:30]

print(keywords_counts)
```
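The '|'.join(FILTER_WORDS) above builds a regex that matches any of the filter words, str.contains turns it into a boolean mask, and ~ inverts the mask. Isolated on a tiny made-up Series (a sketch, not from the original post):

```python
import pandas as pd

words = pd.Series(["这个", "综合", "什么", "理工"])
FILTER_WORDS = ["这个", "那个", "什么"]

# True where the word matches any filter word; ~ keeps everything else
mask = ~words.str.contains("|".join(FILTER_WORDS))
print(words[mask].tolist())
```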
Using nltk's FreqDist:

```python
from nltk import FreqDist

list_data = ["综合", "理工", "综合", "综合", "综合", "综合", "综合", "综合",
             "综合", "综合", "师范", "理工", "综合", "理工", "综合", "综合",
             "综合", "综合", "综合", "理工", "理工", "理工", "理工", "师范",
             "综合", "农林", "理工", "综合", "理工", "理工", "理工", "综合",
             "理工", "综合", "综合", "理工", "农林", "民族", "军事"]
fdist = FreqDist(list_data)
print(fdist["综合"])

# top 5 by frequency
standard_freq_vector = fdist.most_common(5)
print(standard_freq_vector)
```