This article uses the Scrapy framework to crawl movie information and store the scraped data in a MySQL database.
Install the required modules with whichever Python package manager you use. For example, with pip:
```
pip install scrapy
pip install pymysql
```
As with other Python frameworks, a single `scrapy startproject projectname` command creates the project.
If the command prints Scrapy's new-project message, the project was created successfully; if you see `command not found` or a similar error instead, you need to reinstall Scrapy. A successfully created project has the directory structure shown below.
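Assuming the project is named `movie` (the name implied by the `from movie.items import Movie` import used later in this post), a freshly generated project typically looks like this; the exact files can vary slightly across Scrapy versions:

```
movie/
├── scrapy.cfg          # deployment configuration
└── movie/
    ├── __init__.py
    ├── items.py        # item definitions (models)
    ├── middlewares.py
    ├── pipelines.py    # item processing / persistence
    ├── settings.py     # framework configuration
    └── spiders/        # spider code goes here
        └── __init__.py
```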
Here is what the main files and directories are for:

* `items.py`: holds your models, i.e. the entities you scrape.
* `pipelines.py`: processes the scraped data after the spider has pulled it from a page.
* `settings.py`: holds the framework configuration.
* `spiders/`: the directory for the spider (crawling) code.
Starting with `items.py`, we first need to decide which pieces of information to scrape:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DialogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class Movie(scrapy.Item):
    name = scrapy.Field()      # movie title
    href = scrapy.Field()      # link to the movie's detail page
    actor = scrapy.Field()     # actors
    status = scrapy.Field()    # status
    district = scrapy.Field()  # region
    director = scrapy.Field()  # director
    genre = scrapy.Field()     # genre
    intro = scrapy.Field()     # synopsis
```
Next, create `MovieSpider.py` under `spiders/` and define a `MovieSpider` class that inherits from `scrapy.Spider`. XPath is used here to locate elements on the page; a brief syntax reference follows below, and the XPath tutorial on 菜鸟教程 covers more usage.

```python
import scrapy

from movie.items import Movie


class MovieSpider(scrapy.Spider):
    # Spider name; the crawl is launched with this name
    name = 'MovieSpider'
    # Domain only; no scheme or path needed here
    allowed_domains = ['88ys.com']
    # Start URL(s): the first page(s) the spider fetches
    start_urls = ['https://www.88ys.com/vod-type-id-14-pg-1.html']

    def parse(self, response):
        urls = response.xpath('//li[@class="p1 m1"]')
        for item in urls:
            movie = Movie()
            movie['name'] = item.xpath('./a/span[@class="lzbz"]/p[@class="name"]/text()').extract_first()
            movie['href'] = 'https://www.88ys.com' + item.xpath('./a/@href').extract_first()

            # Follow the detail page, carrying the half-filled item in meta
            request = scrapy.Request(movie['href'], callback=self.crawl_details)
            request.meta['movie'] = movie
            yield request

    def crawl_details(self, response):
        movie = response.meta['movie']
        movie['actor'] = response.xpath('//div[@class="ct-c"]/dl/dt[2]/text()').extract_first()
        movie['status'] = response.xpath('//div[@class="ct-c"]/dl/dt[1]/text()').extract_first()
        movie['district'] = response.xpath('//div[@class="ct-c"]/dl/dd[4]/text()').extract_first()
        movie['director'] = response.xpath('//div[@class="ct-c"]/dl/dd[3]/text()').extract_first()
        movie['genre'] = response.xpath('//div[@class="ct-c"]/dl/dd[1]/text()').extract_first()
        movie['intro'] = response.xpath('//div[@class="ee"]/text()').extract_first()
        yield movie
```

Note how `parse()` hands the partially filled `Movie` item to `crawl_details()` through `request.meta`, so one item can be assembled from two pages: the list page and the detail page.
XPath usage

| Syntax | Meaning |
|---|---|
| `//` | recursive search through the entire document |
| `.` | selects the current node |
| `..` | selects the parent node |
| `text()` | selects the text inside a tag |
| `@attribute` | selects the value of that attribute |
| `label` | stands for a node name, i.e. an HTML tag |
| `div[@class="ct-c"]` | the `div` whose `class` attribute is `ct-c` |
| `/dl/dt[1]` | the first `dt` under `dl` |
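To experiment with these expressions outside a full crawl, Scrapy's `Selector` can parse a standalone snippet. The HTML fragment below is invented purely for illustration; only its shape mimics the detail pages:

```python
from scrapy.selector import Selector

# A made-up fragment resembling the detail page's structure
html = '''
<div class="ct-c">
  <dl>
    <dd>Action</dd>
    <dt>Completed</dt>
  </dl>
</div>
'''

sel = Selector(text=html)
# //div[@class="ct-c"] finds the div with class "ct-c" anywhere in the document;
# /dl/dd[1]/text() then takes the text of the first <dd> under its <dl>
print(sel.xpath('//div[@class="ct-c"]/dl/dd[1]/text()').extract_first())  # -> 'Action'
```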
Next comes `pipelines.py`, which writes every scraped item into MySQL:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class DialogPipeline(object):
    def __init__(self):
        # Keyword arguments are required by recent pymysql versions
        self.conn = pymysql.connect(host='localhost', user='huangwei',
                                    password='123456789', database='db_88ys')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = ("insert into tb_movie(name, href, actor, status, district, "
               "director, genre, intro) values(%s, %s, %s, %s, %s, %s, %s, %s)")
        self.cursor.execute(sql, (item['name'], item['href'], item['actor'],
                                  item['status'], item['district'],
                                  item['director'], item['genre'],
                                  item['intro']))
        self.conn.commit()
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```
Finally, adjust a few options in `settings.py`:

```python
# Whether to obey robots.txt
ROBOTSTXT_OBEY = False

# Imitate a browser when requesting pages
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Enable the pipeline so scraped data gets saved; the class lives in
# movie/pipelines.py, and 300 is its priority (0-1000, lower runs first)
ITEM_PIPELINES = {
    'movie.pipelines.DialogPipeline': 300,
}
```
To launch the spider, enter the project directory and run:

```
scrapy crawl MovieSpider
```

Relevant logs are printed while it runs; append `--nolog` to the command to suppress them. Before launching, of course, the data table must already exist.
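A minimal sketch of that preparation step, run once before the first crawl; the `tb_movie` column types here are assumptions, since the post does not show its actual schema:

```python
# One-off script to create the table used by DialogPipeline.
# Column types are assumptions; adjust them to your data.
import pymysql

conn = pymysql.connect(host='localhost', user='huangwei',
                       password='123456789', database='db_88ys')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS tb_movie (
            id INT AUTO_INCREMENT PRIMARY KEY,
            name VARCHAR(255),
            href VARCHAR(512),
            actor VARCHAR(255),
            status VARCHAR(64),
            district VARCHAR(64),
            director VARCHAR(255),
            genre VARCHAR(64),
            intro TEXT
        ) DEFAULT CHARSET = utf8mb4
    """)
conn.commit()
conn.close()
```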
Finally, check the database: the movie data has been scraped successfully!