jieba is an excellent third-party library for Chinese word segmentation.
jieba segments text by matching it against a built-in Chinese word dictionary.
jieba provides three segmentation modes (a short usage sketch follows this list):
Precise mode: splits the text exactly, with no redundant words.
Full mode: scans out every possible word in the text, so the result contains redundancy.
Search-engine mode: starts from the precise-mode result and further splits long words.
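A minimal sketch of the three modes, using jieba's lcut, lcut with cut_all=True, and lcut_for_search; the sample sentence here is only an illustration, not from the original text:

import jieba

s = "中华人民共和国是一个伟大的国家"
print(jieba.lcut(s))                  # precise mode: exact split, no redundancy
print(jieba.lcut(s, cut_all=True))    # full mode: every possible word, with redundancy
print(jieba.lcut_for_search(s))       # search-engine mode: long words split again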
The wordcloud library represents a word cloud as a WordCloud object.
import wordcloud
# Step 1: configure the WordCloud object's parameters
c = wordcloud.WordCloud()
# Step 2: load the text for the word cloud
c.generate("wordcloud by Python")
# Step 3: write the word cloud to an image file
c.to_file("pywordcloud.png")
import wordcloud
txt = "life is short,you need python"
# use a white background instead of the default black one
w = wordcloud.WordCloud(background_color="white")
w.generate(txt)
w.to_file("pywordcloud1.png")
import jieba
import wordcloud
txt = "程序设计语言是计算机能够理解和识别用户\
操作意图的一种交互体系,它按照特定规则组织计算机指令,\
使计算机能够自动进行各种运算处理"
# a Chinese font (font_path) is required, otherwise Chinese characters render as boxes
w = wordcloud.WordCloud(width=1000, font_path="msyh.ttc", height=700)
# segment the Chinese text with jieba and join with spaces so WordCloud can tokenize it
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("pywordcloud2.png")
#GovRptWordCloudv1.py
import jieba
import wordcloud
f = open("新时代中国特色社会主义.txt", "r", encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000, height=700,
                        background_color="white",
                        font_path="msyh.ttc")
w.generate(txt)
w.to_file("grwordcloud.png")
#GovRptWordCloudv2.py
import jieba
import wordcloud
from imageio import imread
mask = imread("fivestar.png")
#excludes = { }
f = open("新时代中国特色社会主义.txt", "r", encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000, height=700,
                        background_color="white",
                        font_path="msyh.ttc",
                        mask=mask)
w.generate(txt)
w.to_file("grwordcloudm.png")
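The commented-out excludes set in the script above suggests filtering out words that should not appear in the cloud. A minimal sketch of one way to do that is to drop them from the segmented list before joining; the stop words and the sample sentence below are purely illustrative, not taken from the source text:

import jieba
excludes = {"的", "和", "是"}                 # illustrative stop words only
ls = jieba.lcut("坚持和发展中国特色社会主义")  # illustrative sentence only
ls = [word for word in ls if word not in excludes]   # keep only words not in excludes
txt = " ".join(ls)
print(txt)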
Requirement: given an article, which words appear in it, and which appear most often?
Approach: first determine whether the article is in English or in Chinese, because English words are already separated by spaces while Chinese text must be segmented with jieba first (a small detection sketch follows).
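The original does not show how to make this check. One simple heuristic, sketched below, is to look at the proportion of CJK characters in the text; the function name is_chinese_text and the 0.3 threshold are illustrative assumptions, not part of the original programs:

def is_chinese_text(text, threshold=0.3):
    """Rough heuristic: treat the text as Chinese if enough characters fall in the CJK range."""
    if not text:
        return False
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')   # count CJK Unified Ideographs
    return cjk / len(text) >= threshold

print(is_chinese_text("life is short, you need python"))   # False
print(is_chinese_text("三国演义"))                           # True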
#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")    # replace special characters in the text with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1   # use 0 as the default if the word is not yet in the dict
items = list(counts.items())                 # convert to a list
items.sort(key=lambda x: x[1], reverse=True) # sort by the count, i.e. the second element of each pair
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
#CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)            # segment the text into a list of words
counts = {}                        # build a dictionary for counting
for word in words:
    if len(word) == 1:             # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())       # convert to a list
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:
曹操        953
孔明        836
将军        772
却说        656
玄德        585
关公        510
丞相        491
二人        469
不可        440
荆州        425
玄德曰      390
孔明曰      390
不能        384
如此        378
张飞        358
#CalThreeKingdomsV2.py
import jieba
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}   # words that are definitely not personal names
txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":   # merge aliases of the same person
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:              # drop the non-name words collected above
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:
曹操       1451
孔明       1383
刘备       1252
关羽        784
张飞        358
商议        344
如何        338
主公        331
军士        317
吕布        300