As a step beyond basic crawlers, Scrapy can fetch target content concurrently (via asynchronous networking), simplify code logic and improve development efficiency, which is why crawler developers like it so much. Using a stock-quote website as the example, this article outlines how to build a crawler with Scrapy. It is intended for learning and reference only; corrections are welcome.
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It uses the efficient Twisted asynchronous networking framework to handle network communication. Scrapy architecture:
The parts of the Scrapy architecture are the Engine, the Scheduler, the Downloader, the Spiders, the Item Pipeline, and the downloader and spider middlewares, as shown below:
Scrapy data flow: the Engine takes requests from the Spider, queues them via the Scheduler, fetches them through the Downloader, and routes each response back to the Spider, whose yielded items are handed to the Item Pipeline.
Install Scrapy from the command line with pip install scrapy, as shown below:
When the following message appears, the installation has succeeded.
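If you want to double-check, asking Scrapy for its version should also work once the install has finished (assuming the scrapy executable is now on your PATH):

scrapy version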
From the command line, switch to the directory where the project will live and create the crawler project with scrapy startproject stockstar, as shown below:
Following the prompt, create a spider from the provided template [command format: scrapy genspider <spider name> <domain>], as shown below:
Note: the spider name must not be the same as the project name, otherwise an error is raised, as shown below:
Open the newly created Scrapy project in PyCharm, as shown below:
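What you see in PyCharm is the standard layout that scrapy startproject generates, roughly:

stockstar/
    scrapy.cfg           # deploy/run configuration
    stockstar/
        __init__.py
        items.py         # item (field) definitions
        middlewares.py   # spider/downloader middlewares
        pipelines.py     # item pipelines
        settings.py      # project settings
        spiders/
            __init__.py
            stock.py     # the spider created by genspider

For reference, scrapy genspider stock quote.stockstar.com fills stock.py with a skeleton roughly like the following (the exact template can vary slightly between Scrapy versions):

import scrapy


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']
    start_urls = ['http://quote.stockstar.com/']

    def parse(self, response):
        pass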
This example crawls the stock IDs and names from the quotes center of a securities website, as shown below:
Once the project has been created from the command line, the basic Scrapy skeleton is already in place; what remains is to fill in the business code.
Define the fields that need to be crawled, as shown below:
import scrapy


class StockstarItem(scrapy.Item):
    """
    Define the names of the fields to be crawled.
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type (board)
    stock_id = scrapy.Field()    # stock ID
    stock_name = scrapy.Field()  # stock name
The structure of a Scrapy spider is fixed: define a class that inherits from scrapy.Spider, set its attributes [spider name, allowed domains, start URLs], and override the parent's parse method; the body of parse is then tailored to the page being crawled, as shown below:
import scrapy

from stockstar.items import StockstarItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # allowed domains
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # starting URL

    def parse(self, response):
        """
        Parse the index page and yield one item per stock.
        :param response:
        :return:
        """
        styles = ['沪A', '沪B', '深A', '深B']  # Shanghai A/B and Shenzhen A/B boards
        for index, style in enumerate(styles):
            print('******************** crawling ' + style + ' stocks ********************')
            # Each board sits in its own <ul>, with ids index_data_0 .. index_data_3.
            ids = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall()
            names = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall()
            for stock_id, stock_name in zip(ids, names):
                item = StockstarItem()  # create a fresh item per stock, since each is yielded
                item['stock_type'] = style
                item['stock_id'] = stock_id
                item['stock_name'] = stock_name
                yield item
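Before hard-coding long XPath expressions like the ones above, it can help to try them interactively in Scrapy's shell; the URL and <ul> id below are the ones used in this example:

scrapy shell http://quote.stockstar.com/stock/stock_index.htm
>>> response.xpath('//ul[@id="index_data_0"]/li/span/a/text()').getall()

If a shorter expression anchored on the unique id already returns the expected list, the long absolute path in the spider can be trimmed accordingly.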
In the pipeline, the scraped data is processed; for simplicity, this example just prints each item to the console, as shown below:
class StockstarPipeline:
    def process_item(self, item, spider):
        print('stock type >>>> ' + item['stock_type']
              + ' stock ID >>>> ' + item['stock_id']
              + ' stock name >>>> ' + item['stock_name'])
        return item
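In a real project you would usually persist items rather than print them. As a minimal sketch (the output file name stocks.jl is an assumption, not part of this project), a pipeline that writes one JSON line per item could look like:

import json


class JsonWriterPipeline:
    """Hypothetical pipeline: append each item as one JSON line to stocks.jl."""

    def open_spider(self, spider):
        self.file = open('stocks.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

For simple dumps, Scrapy's built-in feed exports do the same job without any pipeline code, e.g. scrapy crawl stock -o stocks.csv.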
Note: when assigning values to an item, you can only assign via item['key'] = value; assigning via item.key = value is not supported.
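A quick illustration, using a hypothetical stock code; the attribute-style assignment fails with an AttributeError that points you back to the dict-style form:

item = StockstarItem()
item['stock_id'] = '600000'  # OK: dict-style assignment
item.stock_id = '600000'     # AttributeError: use item['stock_id'] = ... instead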
The project is configured through the settings.py file, covering request headers, pipelines, the robots protocol and so on, as shown below:
# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'stockstar'

SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'

# Obey robots.txt rules (whether to honor the robots protocol)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36',
    # 'Accept-Language': 'en,zh-CN,zh;q=0.9',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# The number (0-1000) sets the pipeline order; lower values run first.
ITEM_PIPELINES = {
    'stockstar.pipelines.StockstarPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Because a Scrapy project is split across separate modules rather than a single runnable script, the spider is run from the terminal, in the format scrapy crawl <spider name>, as shown below:
scrapy crawl stock
The result is shown in the figure below:
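That said, a spider can also be started from a small Python script via Scrapy's CrawlerProcess, which is convenient for running directly inside PyCharm; a minimal sketch (the file name run.py is an assumption), saved next to scrapy.cfg:

# run.py -- start the 'stock' spider without the scrapy CLI
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('stock')  # spider name as defined in StockSpider.name
process.start()         # blocks until the crawl finishes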
This example is fairly simple and only illustrates common Scrapy usage; everything crawled here is already present in the HTML returned by the first request, i.e. what you see is what you get. Sample source code
Two small issues are left open:
These two issues will be analyzed further when they come up later. To close, a poem by Tao Yuanming, Returning to My Gardens and Fields, to share with you.
Returning to My Gardens and Fields (I)
By Tao Yuanming (Wei-Jin period)
In youth I had no taste for the common tune; by nature I loved the hills and mountains. By mistake I fell into the dusty net, and once gone, stayed thirty years.
The caged bird pines for its old forest; the pond fish longs for its former depths. I clear wasteland at the edge of the southern wilds and, keeping my simplicity, return to garden and field.
My square plot is ten-odd mu, my thatched house eight or nine rooms. Elms and willows shade the back eaves; peach and plum stand ranged before the hall.
Hazy, hazy, the distant villages; soft and slow, the smoke of the hamlets. A dog barks deep in the lane; a cock crows atop a mulberry tree.
No dust or clutter within my gates; in the bare rooms, leisure to spare. Long a captive in the cage, I return at last to nature.