Scraping JD.com Product Reviews with Python + Word-Cloud Visualization

Note: JD only exposes the first 100 pages of reviews.

This post uses a Python crawler to scrape review data for a JD.com product and then renders the reviews as a word cloud.

1. Scraping the review data

Search for 三只松鼠 (Three Squirrels) on JD.com and open one of its product pages.

Open the product reviews, select "current product only", and sort by time; each page holds 10 reviews.

Open Chrome DevTools and switch to the Network tab: JD stores the review data in a JSON payload.

Examine the Request URL. It carries a few key parameters: productId is the product's ID, sortType is the sort order of the reviews, page is the page number, and pageSize=10 means each page returns 10 reviews. Copy the Request URL and open it in a browser to see the raw response.
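As a small sketch of those parameters (the values are taken from the URL used later in the article; productId 4803334 is the example product), the URL for any page can be built like this:

```python
# Template of the comment API URL, with the productId and page slots left open.
# All other parameter values are the ones observed in the article.
BASE = ('https://club.jd.com/comment/skuProductPageComments.action'
        '?callback=fetchJSON_comment98&productId={pid}&score=0'
        '&sortType=6&page={page}&pageSize=10&isShadowSku=0&rid=0&fold=1')

def comment_url(pid: int, page: int) -> str:
    """Build the review-API URL for one product and one page."""
    return BASE.format(pid=pid, page=page)

print(comment_url(4803334, 3))
```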

Changing the page parameter pages through the reviews.

The Python crawler extracts the review text with a regular expression and saves it to a txt file. The code is as follows:

import asyncio
import aiohttp
import re
import logging
import datetime

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
start = datetime.datetime.now()


class Spider(object):
    def __init__(self):
        self.semaphore = asyncio.Semaphore(6)  # limit concurrency to 6 requests
        # Disguise the request headers
        self.header = {
            "Host": "club.jd.com",
            "Cookie": "shshshfpa=c003ed54-a640-d73d-ba32-67b4db85fd3e-1594895561; shshshfpb=i5%20TzLvWAV56AeaK%20C9q5ew%3D%3D; __jdu=629096461; unpl=V2_ZzNtbUVRFkZ8DUddfRxcBGIEE1hKXhBGIQEVVnNLD1IwBkBeclRCFnQUR1JnGloUZwEZXkZcQxVFCEdkeR1ZAmYBEV1yZ0IXJQ4SXS9NVAZiChAJQAdGFnJfRFQrGlUAMFdACUtVcxZ1OEdkfBpUBG8EF1pCZ3MVfQ92ZDBMAGshQlBtQldEEXAKTlZyGGwEVwMTWUFXQxZ1DkFkMHddSGAAGlxKUEYSdThGVXoYXQVkBBVeclQ%3d; __jdv=122270672|baidu|-|organic|not set|1596847892017; areaId=0; ipLoc-djd=1-72-55653-0; PCSYCityID=CN_0_0_0; __jda=122270672.629096461.1595821561.1596847892.1597148792.3; __jdc=122270672; shshshfp=4866c0c0f31ebd5547336a334ca1ef1d; 3AB9D23F7A4B3C9B=DNFMQBTRNFJAYXVX2JODGAGXZBU3L2TIVL3I36BT56BKFQR3CNHE5ZTVA76S56HSJ2TX62VY7ZJ2TPKNIEQOE7RUGY; jwotest_product=99; shshshsID=ba4014acbd1aea969254534eef9cf0cc_5_1597149339335; __jdb=122270672.5.629096461|3.1597148792; JSESSIONID=99A8EA65B8D93A7F7E8DAEE494D345BE.s1",
            "Connection": "keep-alive",
            "Referer": "https://item.jd.com/4803334.html",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
        }

    async def scrape(self, url):
        async with self.semaphore:
            session = aiohttp.ClientSession(headers=self.header)
            response = await session.get(url)
            result = await response.text()
            await session.close()
            return result

    async def scrape_page(self, page):
        url = f'https://club.jd.com/comment/skuProductPageComments.action?callback=fetchJSON_comment98&productId=4803334&score=0&sortType=6&page={page}&pageSize=10&isShadowSku=0&rid=0&fold=1'
        text = await self.scrape(url)
        await self.parse(text)

    async def parse(self, text):
        # Pull the review text out of the JSONP response
        content = re.findall('"guid":".*?","content":"(.*?)"', text)
        with open('datas.txt', 'a+') as f:
            for con in content:
                f.write(con + '\n')
                logging.info(con)

    def main(self):
        # 100 pages of data
        scrape_index_tasks = [asyncio.ensure_future(self.scrape_page(page)) for page in range(0, 100)]
        loop = asyncio.get_event_loop()
        tasks = asyncio.gather(*scrape_index_tasks)
        loop.run_until_complete(tasks)


if __name__ == '__main__':
    spider = Spider()
    spider.main()
    delta = (datetime.datetime.now() - start).total_seconds()
    print("Elapsed: {:.3f}s".format(delta))
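To see what the regular expression in parse() actually captures, here is a toy run on a made-up JSONP fragment shaped like the API response (the guid and content values are invented for illustration):

```python
import re

# A minimal, fabricated fragment in the shape of JD's JSONP response,
# used only to demonstrate what the regex extracts.
sample = ('fetchJSON_comment98({"comments":['
          '{"guid":"abc","content":"坚果很新鲜"},'
          '{"guid":"def","content":"包装不错"}]});')

# Same pattern as in parse(): capture the text between "content":" and the next quote.
contents = re.findall('"guid":".*?","content":"(.*?)"', sample)
print(contents)  # ['坚果很新鲜', '包装不错']
```

Note the regex is fragile: a review containing an escaped quote would truncate the match, so parsing the JSON body properly (after stripping the fetchJSON_comment98(...) wrapper) is a more robust alternative.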

2. Word-cloud visualization

The code is as follows:

import jieba
import collections
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt

with open('datas.txt') as f:
    data = f.read()
# Preprocess the text: drop useless characters, keep only Chinese
new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)
new_data = "/".join(new_data)

# Segment the text (full mode)
seg_list_exact = jieba.cut(new_data, cut_all=True)
result_list = []
with open('stop_words.txt', encoding='utf-8') as f:
    con = f.readlines()
    stop_words = set()
    for i in con:
        i = i.replace("\n", "")  # strip the trailing newline from each line
        stop_words.add(i)
for word in seg_list_exact:
    # Drop stop words and single-character tokens
    if word not in stop_words and len(word) > 1:
        result_list.append(word)
print(result_list)

# Count word frequencies after filtering
word_counts = collections.Counter(result_list)

# Draw the word cloud
my_cloud = WordCloud(
    background_color='white',  # background color (default is black)
    width=800, height=550,
    font_path='simhei.ttf',    # font that can render Chinese
    max_font_size=112,         # largest font size
    min_font_size=12,          # smallest font size
    random_state=80            # random state, i.e. the color-scheme seed
).generate_from_frequencies(word_counts)

# Show the generated word-cloud image
plt.imshow(my_cloud, interpolation='bilinear')
# Hide the axes
plt.axis('off')
plt.show()
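The frequency step above can be illustrated on a toy token list (the tokens here are invented): Counter maps each token to its count, and that mapping is exactly what generate_from_frequencies consumes.

```python
import collections

# Fabricated segmented tokens, standing in for result_list above.
tokens = ['新鲜', '好吃', '新鲜', '包装', '好吃', '新鲜']

# Counter builds a {token: count} mapping.
word_counts = collections.Counter(tokens)
print(word_counts.most_common(2))  # [('新鲜', 3), ('好吃', 2)]
```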

The result looks like this:


