Scrapy was installed in the previous section; now let's implement a first test program.
Scrapy is a crawler framework; its basic workflow is illustrated below (the screenshot is taken from the internet).
Simply put, we need to write an items file that defines the structure of the data to be returned, a spider file that does the actual crawling, and a pipeline file for follow-up processing such as saving the data.
Let's use dangdang.com as an example to see how to put this together.
In this example I want to crawl the first 20 pages of down-jacket listings, including each product's name, link, and comment count. First create the project and generate a basic spider named dd:
scrapy startproject dangdang
scrapy genspider -t basic dd dangdang.com
This automatically creates a spider project with the structure shown below.
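A freshly generated project typically looks roughly like this (the exact files can vary slightly between Scrapy versions):

dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py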
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    comment = scrapy.Field()
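Items behave much like dictionaries, which makes them easy to sanity-check; a minimal interactive sketch (the values here are made up purely for illustration):

from dangdang.items import DangdangItem

item = DangdangItem()
item['title'] = 'example down jacket'        # hypothetical value
item['url'] = 'http://example.com/product'   # hypothetical value
print(dict(item))   # {'title': 'example down jacket', 'url': 'http://example.com/product'}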
The genspider command in step two above already generated a spider template, so we just modify it.
dd.py
# -*- coding: utf-8 -*-

import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cid4010275.html']

    def parse(self, response):
        item = DangdangItem()
        # grab every product title, link and comment-count link on the listing page
        item['title'] = response.xpath(u"//a[@dd_name='单品标题']/text()").extract()
        item['url'] = response.xpath("//a[@dd_name='单品标题']/@href").extract()
        item['comment'] = response.xpath("//a[@dd_name='单品评论']/text()").extract()
        text = response.body
        # content_type = chardet.detect(text)
        # if content_type['encoding'] != 'UTF-8':
        #     text = text.decode(content_type['encoding'])
        #     text = text.encode('utf-8')
        # print(text)

        yield item

        # queue pages 2 through 20 (page 1 already comes from start_urls)
        for i in range(2, 21):
            url = 'http://category.dangdang.com/pg%d-cid4010275.html' % i
            yield Request(url, callback=self.parse)
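Before launching the full crawl, it can be worth trying the XPath expressions interactively with scrapy shell (these are the same selectors used in parse(); the actual output naturally depends on the live page):

scrapy shell "http://category.dangdang.com/pg1-cid4010275.html"
>>> response.xpath(u"//a[@dd_name='单品标题']/text()").extract()[:3]
>>> response.xpath("//a[@dd_name='单品标题']/@href").extract()[:3]
>>> response.xpath("//a[@dd_name='单品评论']/text()").extract()[:3]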
To enable the pipeline, the settings file needs a small change; while I'm at it, I also turn off the robots.txt check.
settings.py
ROBOTSTXT_OBEY = False   # don't check robots.txt before crawling

ITEM_PIPELINES = {
    # enable our pipeline; the number (0-1000) is its execution priority
    'dangdang.pipelines.DangdangPipeline': 300,
}
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                               db='dangdang', use_unicode=True, charset='utf8')
        for i in range(0, len(item['title'])):
            title = item['title'][i]
            link = item['url'][i]
            comment = item['comment'][i]

            print(type(title))
            print(title)
            sql = "insert into dd(title,link,comment) values('" + title + "','" + link + "','" + comment + "')"
            try:
                conn.query(sql)
            except Exception as err:
                pass
        conn.commit()   # persist the inserts before closing the connection
        conn.close()

        return item
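A small refinement, shown here only as a sketch: the connection can be opened once per crawl via the pipeline's open_spider/close_spider hooks, and a parameterized insert lets pymysql handle quoting, so titles containing quotes don't break the SQL. This assumes the same table and credentials as above:

import pymysql


class DangdangPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl instead of one per item
        self.conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                                    db='dangdang', use_unicode=True, charset='utf8')

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            for title, link, comment in zip(item['title'], item['url'], item['comment']):
                # %s placeholders are filled in safely by pymysql
                cursor.execute("insert into dd(title,link,comment) values (%s,%s,%s)",
                               (title, link, comment))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()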
The final data is to be saved in MySQL, which Python can talk to via pymysql. I created a database and an empty table in advance from the MySQL command line:
mysql> create database dangdang;
mysql> create table dd(id int auto_increment primary key, title varchar(100), link varchar(100), comment varchar(32));
Now run the spider from the project directory:
scrapy crawl dd
If you don't want to see the log output, you can use:
scrapy crawl dd --nolog
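If you only want a quick look at what was scraped, without involving MySQL at all, Scrapy's built-in feed export can dump the items to a file (the -o option is standard; the filename here is arbitrary):

scrapy crawl dd -o items.json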
To verify that the data actually made it into MySQL, I read the table back with a small standalone script:
test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='root',
                       db='dangdang', use_unicode=True, charset='utf8')

cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)
# SQL query: fetch everything the crawl inserted
cursor.execute("select * from dd")
row = cursor.fetchall()
for i in row:
    print(i)
conn.close()
The test succeeded: the scraped rows show up in the table.