
Scraping Python book data from Dangdang.com with Scrapy


1. Install the Scrapy framework

    pip install scrapy
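
To confirm the installation succeeded, you can print the installed version (the exact number depends on when you install):

    scrapy version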

2. Create a folder named scrapy01 on drive E and switch into it in a command prompt

3. Create the project: scrapy startproject <project name>

    scrapy startproject first_scrapy
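
This command generates the standard Scrapy project skeleton, roughly as below (the exact set of files can vary slightly between Scrapy versions):

    first_scrapy/
        scrapy.cfg                # deploy/config file
        first_scrapy/             # the project's Python package
            __init__.py
            items.py              # item definitions (step 5)
            middlewares.py
            pipelines.py          # item pipelines (step 8)
            settings.py           # project settings (step 7)
            spiders/              # spider code goes here (step 6)
                __init__.py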

4. Open the scrapy01 folder in PyCharm

5. Define the required fields in items.py; they will hold the scraped data

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class FirstScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # book title
    price = scrapy.Field()      # price
    author = scrapy.Field()     # author
    date = scrapy.Field()       # publication date
    publisher = scrapy.Field()  # publisher
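
An Item behaves much like a dictionary keyed by the declared fields, which is how the spider in the next step fills it in. A quick illustration (the assigned string is just a made-up value):

item = FirstScrapyItem()
item['title'] = 'Example title'   # assign like a dict
print(item['title'])
print(dict(item))                 # can be converted to a plain dict if needed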

6. Create the spider test.py in the spiders folder; the code is as follows:

    

# author:WN
# datetime:2019/11/3 15:29
from abc import ABC
import scrapy
from .. import items


class MySpider(scrapy.Spider, ABC):
    # spider name
    name = "mySpider"

    def start_requests(self):
        for num in range(1, 101):
            url = "http://search.dangdang.com/?key=Python&act=input&page_index=%d" % num
            # yield the request; the response is passed to the callback when it arrives
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        try:
            data = response.text
            # Scrapy locates data with XPath
            # create a Selector() object for querying
            select = scrapy.Selector(text=data)
            book_data = select.xpath("//ul[@class='bigimg']/li")
            # extract the individual fields
            for book in book_data:
                # create a new item for each book
                item = items.FirstScrapyItem()
                title = book.xpath("./a/img/@alt").extract_first().strip()
                price = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first().lstrip('¥')
                author = book.xpath("./p[@class='search_book_author']/span/a/@title").extract_first()
                date = book.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first().strip()
                publisher = book.xpath("./p[@class='search_book_author']/span/a[@name='P_cbs']/text()").extract_first()
                item['title'] = title if title else ''
                item['price'] = price if price else ''
                item['author'] = author if author else ''
                item['date'] = date if date else ''
                item['publisher'] = publisher if publisher else ''
                yield item
        except Exception as e:
            print(e)
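
If any of these XPath expressions stop matching (Dangdang occasionally changes its page markup), they can be tested interactively with scrapy shell before editing the spider; a rough example:

    scrapy shell "http://search.dangdang.com/?key=Python&act=input&page_index=1"
    # then, inside the shell:
    >>> response.xpath("//ul[@class='bigimg']/li/a/img/@alt").extract_first()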

7. Add the following to settings.py so that the items yielded by test.py are pushed to the pipeline class in pipelines.py

# register the item pipeline that receives the items
# format: '<project name>.pipelines.<class name>': priority
# 300 is a priority value (customarily 0-1000); lower values run earlier when several pipelines are enabled
ITEM_PIPELINES = {
    'first_scrapy.pipelines.FirstScrapyPipeline': 300,
}
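
If more pipeline classes are added later, these numbers decide the order in which each item flows through them (lower runs first). For example (the second class name is hypothetical):

ITEM_PIPELINES = {
    'first_scrapy.pipelines.FirstScrapyPipeline': 300,   # runs first
    'first_scrapy.pipelines.ExportJsonPipeline': 500,    # hypothetical second pipeline, runs later
}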

8. Write pipelines.py; before doing so, create the MySQL database book and the table books:

create database book;
use book;
set character_set_results=gbk;
create table books(
    bTitle varchar(256) primary key,
    bPrice varchar(50),
    bAuthor varchar(50),
    bDate varchar(32),
    bPublisher varchar(256)
);
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class FirstScrapyPipeline(object):
    # called once when the spider is opened
    def open_spider(self, spider):
        print('opened')
        try:
            # connect to the database
            self.con = pymysql.connect(host='localhost', port=3306, user='root',
                                       password='root', db='book', charset='utf8')
            # create a cursor
            self.cursor = self.con.cursor()
            self.opened = True
            self.count = 0
        except Exception as e:
            print(e)
            self.opened = False

    # called once when the spider is closed
    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("close")
        print("Total books scraped:", self.count)

    def process_item(self, item, spider):
        try:
            print(item['title'])
            print(item['price'])
            print(item['author'])
            print(item['date'])
            print(item['publisher'])
            if self.opened:
                self.cursor.execute(
                    'insert into books(bTitle,bPrice,bAuthor,bDate,bPublisher) values (%s,%s,%s,%s,%s)',
                    (item['title'], item['price'], item['author'], item['date'], item['publisher'])
                )
                self.count += 1
        except Exception as err:
            print(err)
        return item
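
After a crawl finishes, the stored rows can be spot-checked from the MySQL client (illustrative queries only):

use book;
select count(*) from books;
select bTitle, bPrice, bPublisher from books limit 5;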

9. Run the project

    (1) Run from the command prompt: scrapy crawl <spider name> -s LOG_ENABLED=False; the trailing option suppresses the debug/log output

              scrapy crawl mySpider -s LOG_ENABLED=False

    (2) Create run.py in the folder one level above spiders (the project's package folder); running this file runs the project without using a DOS window. The code is as follows:

          

# author:WN
# datetime:2019/11/3 15:36
from scrapy import cmdline
# run the crawl command without opening a DOS window
# scrapy crawl <spider name> -s LOG_ENABLED=False (suppresses the log output)
cmdline.execute("scrapy crawl mySpider -s LOG_ENABLED=False".split())
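
As an alternative to cmdline.execute, Scrapy also offers the CrawlerProcess API for running a spider from a script; a minimal sketch, assuming the file sits in the same place as run.py (the file name run_process.py is arbitrary):

# run_process.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load this project's settings.py, then run the spider by its name
process = CrawlerProcess(get_project_settings())
process.crawl("mySpider")
process.start()   # blocks until the crawl is finished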

 
