
Scraping Dangdang with a Python Crawler

The goal is to scrape book information (title, product URL, and comment count) from dangdang.com and store it in a database. (First, create the database: the database name is dd, the table is books, and the fields are title, link, and comment.)
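Below is a minimal sketch of that setup using pymysql, assuming a local MySQL server and the same root/123456 credentials used in the pipeline later in this article. The id column and the VARCHAR sizes are my own additions; the article only names the title, link, and comment fields.

import pymysql

# Connect without selecting a database so it can be created first.
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='123456')
cursor = conn.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS dd CHARACTER SET utf8mb4")
# The id column and the column sizes are assumptions, not from the article.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS dd.books ("
    "id INT AUTO_INCREMENT PRIMARY KEY,"
    "title VARCHAR(255),"
    "link VARCHAR(255),"
    "comment VARCHAR(50))"
)
conn.commit()
conn.close()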

1. Create the project:

scrapy startproject dangdang

2. Enter the project folder and generate the spider file:

scrapy genspider -t basic dd dangdang.com

3. Open the project in PyCharm.

Edit items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()

Edit dd.py:

# -*- coding: utf-8 -*-
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    # Start from page 1 of the book category; the bare homepage does not
    # contain the elements matched by the XPath expressions below.
    start_urls = ['http://category.dangdang.com/pg1-cp01.54.06.00.00.00.html']

    def parse(self, response):
        item = DangdangItem()
        item['title'] = response.xpath('//a[@class="pic"]/@title').extract()
        item['link'] = response.xpath('//a[@class="pic"]/@href').extract()
        item['comment'] = response.xpath('//a[@class="search_comment_num"]/text()').extract()
        yield item
        # Loop over pages 2-100 of the same category; Scrapy's duplicate
        # filter drops requests for pages already scheduled.
        for i in range(2, 101):
            url = 'http://category.dangdang.com/pg' + str(i) + '-cp01.54.06.00.00.00.html'
            yield Request(url, callback=self.parse)

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}
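Note that recent Scrapy project templates also set ROBOTSTXT_OBEY = True by default, and the site's robots.txt can cause the requests to be filtered out. Whether this applies depends on your Scrapy version and the site's current robots.txt (an assumption on my part, not something the original article covers); if the spider fetches nothing, try disabling it in settings.py:

ROBOTSTXT_OBEY = False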

Edit pipelines.py, which writes the data to the database:

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host='localhost', port=3306, user='root',
                               passwd='123456', db='dd')
        cursor = conn.cursor()
        # The three lists are parallel: index i holds one book's fields.
        for i in range(0, len(item['title'])):
            title = item['title'][i]
            link = item['link'][i]
            comment = item['comment'][i]
            # A parameterized query, instead of string concatenation,
            # keeps titles containing quotes from breaking the SQL.
            sql = "insert into books(title,link,comment) values(%s,%s,%s)"
            cursor.execute(sql, (title, link, comment))
        conn.commit()
        conn.close()
        return item
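Finally, run the spider from the project root (the argument is the spider's name attribute):

scrapy crawl dd

To spot-check the result, here is a quick count of the stored rows; a minimal sketch, assuming the same credentials as above:

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='123456', db='dd')
cursor = conn.cursor()
cursor.execute("select count(*) from books")
print(cursor.fetchone()[0])  # number of books written by the pipeline
conn.close()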
