
Scraping Dangdang Book Information

A crawler that fetches information on the top 100 hot-selling new books of the past seven days from Dangdang.

Result: (screenshot from the original post, not preserved)

Analysis: (screenshot from the original post, not preserved)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

version_0

Notice: not to be used for commercial purposes without permission.

Summary: data pulled with //div[@class="xxx"] may be incomplete when elements carry extra classes; in that case, consider extracting with //div[contains(@class,'xxx')] instead (a short sketch follows).
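To make the contrast concrete, here is a minimal, self-contained sketch using scrapy's Selector on made-up markup (the HTML and class names below are illustrative, not Dangdang's actual page):

from scrapy.selector import Selector

html = """
<div class="level_one">A</div>
<div class="level_one clearfix">B</div>
"""
sel = Selector(text=html)
# An exact class match misses the element that carries an extra class:
print(sel.xpath("//div[@class='level_one']/text()").extract())            # ['A']
# contains() matches both elements:
print(sel.xpath("//div[contains(@class,'level_one')]/text()").extract())  # ['A', 'B']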

   If you extract the data with the re module instead, fetching the category information from the home page (book.dangdang.com/index) raises a decode error. This happens because Dangdang embeds characters from other languages in the page, so the encoding is not uniform; a workaround is sketched below.
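One way around the mixed encoding is to decode the raw bytes leniently before applying re; a minimal sketch using requests (the 'gb18030' codec is an assumption here, so confirm against the page's declared charset):

import re
import requests

resp = requests.get("http://book.dangdang.com/index")
# Decode leniently: bytes that do not fit the codec are dropped instead of
# raising UnicodeDecodeError ('gb18030' is an assumption; check the page's
# <meta charset> declaration).
text = resp.content.decode("gb18030", errors="ignore")
titles = re.findall(r'<a[^>]+title="([^"]+)"', text)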

   When serving book listings and page turns, Dangdang uses no dynamic loading at all, and even prices are embedded directly in the page markup, so they are easy to extract; the next-page link is a plain href (see the urljoin example below).
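For instance, following a page turn is just a relative-URL join against the current page (both values below are hypothetical):

from urllib import parse

page_url = "http://category.dangdang.com/cp01.54.06.00.00.00.html"  # hypothetical category page
next_href = "/pg2-cp01.54.06.00.00.00.html"                         # hypothetical next-page href
print(parse.urljoin(page_url, next_href))
# http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html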

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Source code

# -*- coding: utf-8 -*-
import scrapy
import re
from copy import deepcopy
from pprint import pprint
from urllib import parse


class DdtsSpider(scrapy.Spider):
    name = 'ddts'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://book.dangdang.com/index']

    def process_info(self, con_list):
        """Take a list of strings, drop empty entries, strip whitespace and join the pieces."""
        con_list = [re.sub(r"\s|\n", '', i).strip() for i in con_list if i]
        s = str()
        for a_ in con_list:
            s += a_
        return s

    def parse(self, response):
        div_cate_list = response.xpath("//div[@class='con flq_body']//div[contains(@class,'level_one')]")
        # Skip the empty entries and the Dangdang Publishing block
        div_cate_list = div_cate_list[2:13] + div_cate_list[14:-4]
        for div_cate in div_cate_list:
            item = dict()
            # Top-level category title
            item["b_cate"] = div_cate.xpath(".//dl[contains(@class,'primary_dl')]/dt//text()").extract()
            item["b_cate"] = self.process_info(item["b_cate"])
            # All the pop-up layer lists under this category
            t_list = div_cate.xpath(".//dl[contains(@class,'inner_dl')]")
            for t in t_list:
                # Mid-level category title
                item["m_cate"] = t.xpath(".//dt//text()").extract()
                item["m_cate"] = self.process_info(item["m_cate"])
                # Small categories and their URLs
                a_list = t.xpath(".//dd/a")
                for a in a_list:
                    item["s_cate"] = a.xpath("./text()").extract()
                    item["s_cate"] = self.process_info(item["s_cate"])
                    s_href = a.xpath("./@href").extract_first()
                    # Request the small-category page; deepcopy so concurrent
                    # requests don't share (and overwrite) the same dict
                    yield scrapy.Request(
                        url=s_href,
                        callback=self.parse_s_cate,
                        meta={"item": deepcopy(item)}
                    )

    def parse_s_cate(self, response):
        item = deepcopy(response.meta["item"])
        # Select the book list
        book_li_list = response.xpath("//div[contains(@id,'search_nature_rg')]/ul[contains(@class,'bigimg')]/li")
        # The response for this URL already contains everything shown on the
        # page; nothing is loaded dynamically
        for book_li in book_li_list:
            book_info = dict()
            book_info["title"] = book_li.xpath(".//p[contains(@class,'name')]//a/@title").extract()
            book_info["title"] = self.process_info(book_info["title"])
            book_info["href"] = book_li.xpath(".//p[contains(@class,'name')]//a/@href").extract_first()
            # 'earch_now_price' is a deliberate substring match (it catches
            # the site's 'search_now_price' class)
            book_info["price"] = book_li.xpath(".//p[contains(@class,'price')]//span[contains(@class,'earch_now_price')]/text()").extract_first()
            # Drop the leading currency entity by splitting at the first ';'
            book_info["price"] = book_info["price"].split(r";", 1)[-1]
            book_info["author"] = book_li.xpath(".//a[contains(@name,'itemlist-author')]/text()").extract_first()
            book_info["press"] = book_li.xpath(".//a[contains(@name,'P_cbs')]/text()").extract_first()
            book_info["description"] = book_li.xpath(".//p[contains(@class,'detail')]//text()").extract_first()
            item["book_info"] = book_info
            pprint(item)
        # Follow the next-page link, if there is one
        url = response.xpath("//li[@class='next']/a/@href").extract_first()
        if url is not None:
            next_url = parse.urljoin(response.url, url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse_s_cate,
                meta={"item": response.meta["item"]}
            )
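Note that parse_s_cate only pprints each assembled item. To persist the results you would yield the item and enable an item pipeline; a minimal sketch, assuming a standard Scrapy project layout (the pipeline class and file name below are illustrative, not part of the original post):

# pipelines.py -- hypothetical JSON-lines sink, not part of the original spider
import json

class DdtsPipeline:
    def open_spider(self, spider):
        self.file = open("ddts_books.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; keep Chinese text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

With the pipeline registered in settings.py (e.g. ITEM_PIPELINES = {'yourproject.pipelines.DdtsPipeline': 300}, where 'yourproject' stands in for the actual project name), the spider runs with: scrapy crawl ddts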