
Python wordcloud: Source Code Analysis and Basic Usage


The Python word cloud module has gone from v1.0 in 2015 to v1.7 at the time of writing.

Download it here: https://pypi.org/project/wordcloud/

Basic usage of wordcloud:

    import jieba
    import wordcloud

    w = wordcloud.WordCloud(
        width=600,
        height=600,
        background_color='white',
        # A font with CJK glyphs is required to render Chinese (msyh.ttc is Microsoft YaHei)
        font_path='msyh.ttc'
    )
    text = '看到此标题,我也是感慨万千 首先弄清楚搞IT和被IT搞,谁是搞IT的?马云就是,马化腾也是,刘强东也是,他们都是叫搞IT的, 但程序员只是被IT搞的人,可以比作盖楼砌砖的泥瓦匠,你想想,四十岁的泥瓦匠能跟二十左右岁的年轻人较劲吗?如果你是老板你会怎么做?程序员只是技术含量高的泥瓦匠,社会是现实的,社会的现实是什么?利益驱动。当你跑的速度不比以前快了时,你就会被挨鞭子赶,这种窘境如果在做程序员当初就预料到的话,你就会知道,到达一定高度时,你需要改变行程。 程序员其实真的不是什么好职业,技术每天都在更新,要不停的学,你以前学的每天都在被淘汰,加班可能是标配了吧。 热点,你知道什么是热点吗?社会上啥热就是热点,我举几个例子:在早淘宝之初,很多人都觉得做淘宝能让自己发展,当初的规则是产品按时间轮候展示,也就是你的商品上架时间一到就会被展示,不论你星级多高。这种一律平等的条件固然好,但淘宝随后调整了显示规则,对产品和店铺,销量进行了加权,一下导致小卖家被弄到了很深的胡同里,没人看到自己的产品,如何卖?做广告费用也非常高,入不敷出,想必做过淘宝的都知道,再后来淘宝弄天猫,显然,天猫是上档次的商城,不同于淘宝的摆地摊,因为摊位费涨价还闹过事,闹也白闹,你有能力就弄,没能力就淘汰掉。前几天淘宝又推出C2M,客户反向定制,客户直接挂钩大厂家,没你小卖家什么事。 后来又出现了微商,在微商出现当天我就知道这东西不行,它比淘宝假货还下三滥.我对TX一直有点偏见,因为骗子都使用QQ 我说这么多只想说一个事,世界是变化的,你只能适应变化,否则就会被淘汰。 还是回到热点这个话题,育儿嫂这个职位有很多人了解吗?前几年放开二胎后,这个职位迅速串红,我的一个亲戚初中毕业,现在已经月入一万五,职务就是照看刚出生的婴儿28天,节假日要双薪。 你说这难到让我一个男的去当育儿嫂吗?扯,我只是说热点问题。你没踩在热点上,你赚钱就会很费劲 这两年的热点是什么?短视频,你可以看到抖音的一些作品根本就不是普通人能实现的,说明专业级人才都开始努力往这上使劲了。 我只会编程,别的不会怎么办?那你就去编程。没人用了怎么办?你看看你自己能不能雇佣你自己 学会适应社会,学会改变自己去适应社会 最后说一句:科大讯飞的刘鹏说的是对的。那我为什么还做程序员?他可以完成一些原始积累,只此而已。'
    # Segment the Chinese text with jieba and join the tokens with spaces,
    # so that wordcloud's regex-based tokenizer can split it into words
    new_str = ' '.join(jieba.lcut(text))
    w.generate(new_str)
    w.to_file('x.png')

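Now let's walk through the source code.

In the wordcloud source, generating a word cloud image involves three main steps:

1. Split the text into words

2. Generate the word cloud

3. Save the image

We start from generate(self, text), which does nothing more than call another method on the same object: self.generate_from_text(text)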

    def generate_from_text(self, text):
        """Generate wordcloud from text.
        """
        words = self.process_text(text)  # split the text into words
        self.generate_from_frequencies(words)  # main word cloud routine (analyzed in detail below)
        return self

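The source of process_text() is shown below. Its logic is fairly simple: tokenize the text, strip trailing 's, drop numbers, drop short words, and remove stopwords.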

    def process_text(self, text):
        """Splits a long text into words, eliminates the stopwords.

        Parameters
        ----------
        text : string
            The text to be processed.

        Returns
        -------
        words : dict (string, int)
            Word tokens with associated frequency.

        ..versionchanged:: 1.2.2
            Changed return type from list of tuples to dict.

        Notes
        -----
        There are better ways to do word tokenization, but I don't want to
        include all those things.
        """
        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode else 0)
        regexp = self.regexp if self.regexp is not None else r"\w[\w']+"
        # tokenize
        words = re.findall(regexp, text, flags)
        # strip trailing 's
        words = [word[:-2] if word.lower().endswith("'s") else word for word in words]
        # drop numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # drop short words: anything shorter than min_word_length is filtered out
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]
        # remove stopwords
        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)
        return word_counts
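To make the filtering order concrete, here is a minimal standalone sketch of the same pipeline using only the standard library. The function name simple_process_text is hypothetical, and the sketch deliberately leaves out the collocation (bigram) handling and the plural/case merging that the library does in unigrams_and_bigrams and process_tokens. It also shows why the usage example joins the jieba output with spaces: the default pattern treats an unbroken run of word characters as one token.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), min_word_length=0, include_numbers=False):
    """Hypothetical, simplified stand-in for WordCloud.process_text."""
    # Same default pattern as the source: one word character followed by one
    # or more word characters or apostrophes, so every token has >= 2 characters.
    words = re.findall(r"\w[\w']+", text)
    # Strip a trailing 's
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # Drop pure-digit tokens unless numbers are wanted
    if not include_numbers:
        words = [w for w in words if not w.isdigit()]
    # Drop short words
    if min_word_length:
        words = [w for w in words if len(w) >= min_word_length]
    # Drop stopwords (case-insensitive)
    stop = {s.lower() for s in stopwords}
    words = [w for w in words if w.lower() not in stop]
    return dict(Counter(words))


counts = simple_process_text("Tom's cat saw 42 cats and a dog", stopwords={"and"})
print(counts)
```

Note that the single-letter word "a" never becomes a token at all, because the pattern requires at least two characters.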

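Now for the main event.

The body of generate_from_frequencies(self, frequencies, max_font_size=None) is fairly long. Overall it breaks down into the following steps:

1. Sort

2. Normalize the word frequencies

3. Create the drawing objects

4. Determine the initial font size

5. Extend the word set

6. Determine the font size, position, rotation, and color of every word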
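Steps 1, 2, and 5 are plain list manipulation, so they can be sketched without any drawing code. The helper names normalize_frequencies and pad_with_repeats below are hypothetical, and the sample counts are made up for illustration; the arithmetic follows the library source quoted in this article.

```python
from math import ceil
from operator import itemgetter


def normalize_frequencies(frequencies, max_words=200):
    """Sort by count (descending), keep max_words entries, scale so the top is 1.0."""
    items = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if not items:
        raise ValueError("need at least one word")
    items = items[:max_words]
    top = float(items[0][1])
    return [(word, count / top) for word, count in items]


def pad_with_repeats(items, max_words):
    """Repeat the word list with geometrically shrinking weights (the repeat=True case)."""
    times_extend = int(ceil(max_words / len(items))) - 1
    original = list(items)
    downweight = items[-1][1]  # smallest normalized frequency
    padded = list(items)
    for i in range(times_extend):
        padded.extend((word, freq * downweight ** (i + 1)) for word, freq in original)
    return padded


normalized = normalize_frequencies({"python": 8, "cloud": 4, "word": 2})
padded = pad_with_repeats(normalized, max_words=6)
print(normalized)
print(padded)
```

Because each repeated copy is multiplied by another power of the smallest frequency, the padded list stays in descending order, which the later font-size logic relies on.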

The source is as follows (with comments I added based on my own understanding):

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        """Create a word_cloud from words and frequencies.

        Parameters
        ----------
        frequencies : dict from string to float
            A contains words and associated frequency.
        max_font_size : int
            Use this font-size instead of self.max_font_size

        Returns
        -------
        self
        """
        # make sure frequencies are sorted and normalized
        # 1. Sort
        # Sort the (word, frequency) list by frequency, descending
        frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
        if len(frequencies) <= 0:
            raise ValueError("We need at least 1 word to plot a word cloud, "
                             "got %d." % len(frequencies))
        # Keep at most max_words words; anything beyond that is discarded
        frequencies = frequencies[:self.max_words]
        # largest entry will be 1
        # The first word's frequency is the maximum frequency
        max_frequency = float(frequencies[0][1])
        # 2. Normalize the word frequencies
        # The words are already sorted, so after normalization the list looks like:
        # [('xxx', 1), ('xxx', 0.96), ('xxx', 0.87), ...]
        frequencies = [(word, freq / max_frequency)
                       for word, freq in frequencies]
        # Random object, used to draw the random number that decides whether to rotate by 90 degrees
        if self.random_state is not None:
            random_state = self.random_state
        else:
            random_state = Random()
        if self.mask is not None:
            boolean_mask = self._get_bolean_mask(self.mask)
            width = self.mask.shape[1]
            height = self.mask.shape[0]
        else:
            boolean_mask = None
            height, width = self.height, self.width
        # Used to find where a word can be placed, i.e. blank (text-free) areas inside the usable image region
        occupancy = IntegralOccupancyMap(height, width, boolean_mask)
        # 3. Create the drawing objects
        # create image
        img_grey = Image.new("L", (width, height))
        draw = ImageDraw.Draw(img_grey)
        img_array = np.asarray(img_grey)
        font_sizes, positions, orientations, colors = [], [], [], []
        last_freq = 1.
        # 4. Determine the initial font size
        # Determine the maximum font size
        if max_font_size is None:
            # if not provided use default font_size
            max_font_size = self.max_font_size
        # If the maximum font size is still unset, work out one to use as the initial font size
        if max_font_size is None:
            # figure out a good font size by trying to draw with
            # just the first two words
            if len(frequencies) == 1:
                # we only have one word. We make it big!
                font_size = self.height
            else:
                # Recurse into this function to get a self.layout_ that holds only the first two words,
                # then compute the initial font size from the font sizes those two words received
                self.generate_from_frequencies(dict(frequencies[:2]),
                                               max_font_size=self.height)
                # find font sizes
                sizes = [x[1] for x in self.layout_]
                try:
                    font_size = int(2 * sizes[0] * sizes[1]
                                    / (sizes[0] + sizes[1]))
                # quick fix for if self.layout_ contains less than 2 values
                # on very small images it can be empty
                except IndexError:
                    try:
                        font_size = sizes[0]
                    except IndexError:
                        raise ValueError(
                            "Couldn't find space to draw. Either the Canvas size"
                            " is too small or too much of the image is masked "
                            "out.")
        else:
            font_size = max_font_size
        # we set self.words_ here because we called generate_from_frequencies
        # above... hurray for good design?
        self.words_ = dict(frequencies)
        # 5. Extend the word set
        # If there are fewer words than max_words, pad the word set up to the maximum
        if self.repeat and len(frequencies) < self.max_words:
            # pad frequencies with repeating words.
            times_extend = int(np.ceil(self.max_words / len(frequencies))) - 1
            # get smallest frequency
            frequencies_org = list(frequencies)
            downweight = frequencies[-1][1]
            # Extend the word list; the added entries keep the descending frequency order
            for i in range(times_extend):
                frequencies.extend([(word, freq * downweight ** (i + 1))
                                    for word, freq in frequencies_org])
        # 6. Determine the font size, position, rotation, and color of every word
        # start drawing grey image
        for word, freq in frequencies:
            if freq == 0:
                continue
            # select the font size
            rs = self.relative_scaling
            if rs != 0:
                font_size = int(round((rs * (freq / float(last_freq))
                                       + (1 - rs)) * font_size))
            if random_state.random() < self.prefer_horizontal:
                orientation = None
            else:
                orientation = Image.ROTATE_90
            tried_other_orientation = False
            # Look for a place to put the word. If one attempt fails, try changing the
            # orientation or shrinking the font, then search again,
            # until a place is found or the font size drops below the minimum
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None
            if font_size < self.min_font_size:
                # we were unable to draw any more
                break
            # Collect this word's information: font size, position, rotation, color
            x, y = np.array(result) + self.margin // 2
            # actually draw the text
            # This drawing is only used to find places for later words; it is not the final
            # word cloud image, which is produced by a different method: to_image
            draw.text((y, x), word, fill="white", font=transposed_font)
            positions.append((x, y))
            orientations.append(orientation)
            font_sizes.append(font_size)
            colors.append(self.color_func(word, font_size=font_size,
                                          position=(x, y),
                                          orientation=orientation,
                                          random_state=random_state,
                                          font_path=self.font_path))
            # recompute integral image
            if self.mask is None:
                img_array = np.asarray(img_grey)
            else:
                img_array = np.asarray(img_grey) + boolean_mask
            # recompute bottom right
            # the order of the cumsum's is important for speed ?!
            occupancy.update(img_array, x, y)
            last_freq = freq
        # layout_ is the list of per-word information: word, frequency, font size, position,
        # rotation, and color. It is everything the later drawing step needs.
        self.layout_ = list(zip(frequencies, font_sizes, positions,
                                orientations, colors))
        return self
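The two font-size formulas in the source above are easy to miss inside the long method, so here they are isolated as a small sketch. The function names initial_font_size and next_font_size are hypothetical, and the value 0.5 for relative_scaling is only an illustrative choice. The first formula is the harmonic mean of the sizes that the two most frequent words received in the trial layout; the second blends the frequency ratio with 1 according to relative_scaling.

```python
def initial_font_size(size_a, size_b):
    """Harmonic mean of the two trial sizes, truncated to int as in the source."""
    return int(2 * size_a * size_b / (size_a + size_b))


def next_font_size(font_size, freq, last_freq, relative_scaling=0.5):
    """Font size for the next word, given the previous word's size and frequency."""
    rs = relative_scaling
    if rs == 0:
        # Frequencies are ignored; only the word's rank (via shrinking on failure) matters
        return font_size
    return int(round((rs * (freq / float(last_freq)) + (1 - rs)) * font_size))


start = initial_font_size(300, 150)
half = next_font_size(start, freq=0.5, last_freq=1.0, relative_scaling=0.5)
full = next_font_size(start, freq=0.5, last_freq=1.0, relative_scaling=1.0)
print(start, half, full)
```

With relative_scaling=1.0 a word that is half as frequent gets half the font size; with 0.5 it only shrinks to three quarters, which keeps less frequent words readable.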

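Note

When determining positions in step 6, the program uses a loop and random numbers to search for a suitable place to put each word. The source is as follows.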

    # Look for a place to put the word. If one attempt fails, try changing the
    # orientation or shrinking the font, then search again,
    # until a place is found or the font size drops below the minimum
    while True:
        # try to find a position
        font = ImageFont.truetype(self.font_path, font_size)
        # transpose font optionally
        transposed_font = ImageFont.TransposedFont(
            font, orientation=orientation)
        # get size of resulting text
        box_size = draw.textsize(word, font=transposed_font)
        # find possible places using integral image:
        result = occupancy.sample_position(box_size[1] + self.margin,
                                           box_size[0] + self.margin,
                                           random_state)
        if result is not None or font_size < self.min_font_size:
            # either we found a place or font-size went too small
            break
        # if we didn't find a place, make font smaller
        # but first try to rotate!
        if not tried_other_orientation and self.prefer_horizontal < 1:
            # (as written in the source, both branches evaluate to Image.ROTATE_90)
            orientation = (Image.ROTATE_90 if orientation is None else
                           Image.ROTATE_90)
            tried_other_orientation = True
        else:
            font_size -= self.font_step
            orientation = None
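The control flow of this loop can be tried out in isolation. The sketch below is a simplification under stated assumptions: place_word and its fits callback are hypothetical names, fits(font_size, rotated) stands in for the font measurement plus the sample_position call, and the word always starts out horizontal (the library picks the starting orientation at random).

```python
def place_word(fits, font_size, min_font_size=4, font_step=1, allow_rotation=True):
    """Return (font_size, rotated) once fits() succeeds, or None if the font got too small.

    fits(font_size, rotated) -> bool is a hypothetical callback that reports
    whether a free spot exists for the word at that size and orientation.
    """
    rotated = False
    tried_other_orientation = False
    while True:
        found = fits(font_size, rotated)
        if found or font_size < min_font_size:
            # Either a place was found or the font went below the minimum
            break
        if not tried_other_orientation and allow_rotation:
            # First fallback: rotate once, keeping the same size
            rotated = True
            tried_other_orientation = True
        else:
            # Second fallback: shrink the font and go back to horizontal
            font_size -= font_step
            rotated = False
    return (font_size, rotated) if found else None


# A canvas that only accepts horizontal words of size 10 or less
print(place_word(lambda size, rotated: size <= 10 and not rotated, 12))
```

Rotation is attempted only once per word; after that, every failed attempt costs one font_step.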

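In the source above, occupancy.sample_position() is the method that actually searches for a suitable position. But when you try to dig further into how it works, you find that Ctrl+click can no longer jump into the deeper code. The sad thing has happened after all... o(╥﹏╥)o

Near the top of wordcloud.py there is this line: from .query_integral_image import query_integral_image. Here query_integral_image is a .pyd file, that is, a compiled extension module, so an installed copy cannot be read directly (the project's repository does carry the Cython source it is built from, query_integral_image.pyx). For more on the pyd format, please look it up yourself.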
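The name of the module hints at the technique: an integral image (summed-area table) lets you test whether any rectangle is empty with four lookups. The following pure-Python sketch shows that general idea only. It is my own illustration under that assumption, not a transcription of the compiled module, and the names integral_image, free_positions, and sample_position are defined here for the example.

```python
import random


def integral_image(occupied):
    """Summed-area table with one row and column of zero padding.

    occupied is a list of rows of 0/1 values; table[i][j] is the number of
    occupied cells in the first i rows and first j columns.
    """
    h, w = len(occupied), len(occupied[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        row_sum = 0
        for j in range(w):
            row_sum += occupied[i][j]
            table[i + 1][j + 1] = table[i][j + 1] + row_sum
    return table


def free_positions(table, size_x, size_y):
    """All top-left corners (row, col) whose size_x-by-size_y box has no occupied cell."""
    h, w = len(table) - 1, len(table[0]) - 1
    hits = []
    for i in range(h - size_x + 1):
        for j in range(w - size_y + 1):
            area = (table[i + size_x][j + size_y] - table[i][j + size_y]
                    - table[i + size_x][j] + table[i][j])
            if area == 0:
                hits.append((i, j))
    return hits


def sample_position(table, size_x, size_y, rng):
    """Pick one free position at random, or None when the box fits nowhere."""
    hits = free_positions(table, size_x, size_y)
    return rng.choice(hits) if hits else None


grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
table = integral_image(grid)
print(free_positions(table, 2, 2))
print(sample_position(table, 3, 3, random.Random(0)))
```

Each candidate position costs a constant number of lookups regardless of the box size, which is what makes scanning the whole canvas for every word affordable.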

Back to generate_from_frequencies: at the end, the method gathers the data into self.layout_, which holds everything needed to draw each word. After that you can call to_file() to save the image.

    def to_file(self, filename):
        img = self.to_image()
        img.save(filename, optimize=True)
        return self

The core method to_image() takes the entries out of self.layout_ one by one and draws each word.

    def to_image(self):
        self._check_generated()
        if self.mask is not None:
            width = self.mask.shape[1]
            height = self.mask.shape[0]
        else:
            height, width = self.height, self.width
        img = Image.new(self.mode, (int(width * self.scale),
                                    int(height * self.scale)),
                        self.background_color)
        draw = ImageDraw.Draw(img)
        for (word, count), font_size, position, orientation, color in self.layout_:
            font = ImageFont.truetype(self.font_path,
                                      int(font_size * self.scale))
            transposed_font = ImageFont.TransposedFont(
                font, orientation=orientation)
            pos = (int(position[1] * self.scale),
                   int(position[0] * self.scale))
            draw.text(pos, word, fill=color, font=transposed_font)
        return self._draw_contour(img=img)
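One detail in to_image() is easy to overlook: the positions stored in layout_ are (row, column) pairs, while the drawing call expects (horizontal, vertical) pixel coordinates, so the two components are swapped and multiplied by scale. A tiny sketch of that conversion, with to_pixel as a hypothetical helper name:

```python
def to_pixel(position, scale=1):
    """Convert a stored (row, column) layout position to an (x, y) pixel coordinate."""
    row, col = position
    return (int(col * scale), int(row * scale))


print(to_pixel((10, 40), scale=2))
```

The font size is scaled the same way, which is how the scale parameter produces a larger image with the same layout.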

 

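Food for thought:

How would you implement the search for a suitable place to put each word? (Note: the gaps inside a word's strokes can also hold text in a smaller font size.)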

 

~ End ~
