WordCloud词云图去除停用词的正确方法

作者：小小林熬夜学编程 | 2024-04-01 03:28:01

踩

wordcloud stopwords

前言

之前我们已经学习了如何使用wordcloud制作英文和中文词云，今天我们接着讲解，在实际制作词云中，有很多词是没有展示出的意义的，例如我，他等主语，那如何不显示这些词了，这就涉及到停用词。

wordcloud自带停用词

wordcloud自带一个停用词表，是一个集合的数据类型。


from wordcloud import STOPWORDS
 
print(STOPWORDS)

如果我们需要添入一些其他的词的话，也很简单，直接用add或者update方法即可（因为这是集合数据）。


from matplotlib import pyplot as plt
from wordcloud import WordCloud,STOPWORDS
 
text = 'my is luopan. he is zhangshan'
stopwords = STOPWORDS
stopwords.add('luopan')
 
wc = WordCloud(stopwords=stopwords)
wc.generate(text)
 
plt.imshow(wc)

中文停用词使用

用wordcloud库制作中文词云图，必须要分词，所以总结下来，中文中需要设置停用词的话可以有三种方法。

在分词前，将中文文本的停用词先过滤掉。
分词的时候，过滤掉停用词。
在wordcloud中设置stopwords。

在这里我们只讲解第三种方法，设置stopwords，我们需要先有一个中文停用词表，在网上下载即可，然后将停用词表清洗为集合数据格式。

首先我们读取停用词表的内容，设置为集合数据结构。


stopwords = set()
content = [line.strip() for line in open('hit_stopwords.txt','r').readlines()]
stopwords.update(content)
stopwords

接着，我们就对文本进行分词，制作词云图即可。


from matplotlib import pyplot as plt
from wordcloud import WordCloud
import jieba
 
text = '我叫罗攀，他叫关羽，我叫罗攀，他叫刘备'
cut_word = " ".join(jieba.cut(text))
 
stopwords = set()
content = [line.strip() for line in open('hit_stopwords.txt','r').readlines()]
stopwords.update(content)
 
wc = WordCloud(font_path = r'/System/Library/Fonts/Supplemental/Songti.ttc',
              stopwords = stopwords)
wc.generate(cut_word)
 
plt.imshow(wc)

最后，如何美化词云图，我们下期再见~

本文内容由网友自发贡献，转载请注明出处：【wpsshop博客】