
Building a Scrapy Spider and Crawling a Whole Site with CrawlSpider

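Building a Scrapy spider takes four steps:

1. Create the project (scrapy startproject xxx): generate a new spider project.

2. Define the target (write items.py): declare the fields you want to scrape.

3. Write the spider (xxspider.py): implement the spider that crawls the pages.

4. Store the results (pipelines.py): design a pipeline that persists the scraped items.

Order of the files involved: main.py ---> items.py --> xxspider.py --> settings.py --> pipelines.py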

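main.py is used to launch the spider. It usually looks like this:

    from scrapy.cmdline import execute
    import sys, os

    # Add this file's directory to the import path so the project modules resolve
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    # execute(['scrapy', 'crawl', 'sun'])
    execute(['scrapy', 'crawl', 'your_spider_name'])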

If you download images, you can set a custom save path for them in settings.py.

The pipeline file can read values from the settings.
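As a rough sketch of reading a setting from inside a pipeline: Scrapy calls a pipeline's from_crawler class method, when one is defined, and hands it the crawler, whose settings object exposes the values from settings.py. IMAGES_STORE is the setting Scrapy's image pipeline uses for its save directory. The class name ImagePathPipeline, the image_dir key, and the stand-in crawler below are assumptions made for illustration, so the snippet runs without starting a crawl.

```python
from types import SimpleNamespace


class ImagePathPipeline:
    """Hypothetical pipeline that remembers the image directory from settings."""

    def __init__(self, images_store):
        self.images_store = images_store

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings holds the values defined in settings.py;
        # "images" is an assumed fallback when IMAGES_STORE is not set.
        return cls(images_store=crawler.settings.get("IMAGES_STORE", "images"))

    def process_item(self, item, spider):
        # Record where this item's images would be stored.
        item["image_dir"] = self.images_store
        return item


# Stand-in for a real crawler: a plain dict offers the same
# .get(name, default) call that is used on the settings object above.
fake_crawler = SimpleNamespace(settings={"IMAGES_STORE": "/tmp/scrapy_images"})
pipeline = ImagePathPipeline.from_crawler(fake_crawler)
result = pipeline.process_item({"title": "demo"}, spider=None)
print(result["image_dir"])
```

In a real project the only line needed in settings.py would be the IMAGES_STORE assignment; Scrapy builds the crawler object itself.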

 

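Before getting to CrawlSpider, a note on one regular-expression function:

re.sub(s1, s2, s3)      replaces every match of the pattern s1 in s3 with s2

It works the same way for incrementing a value by 10.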
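A small sketch of re.sub follows. The sample URL is made up, and reading the "increment by 10" remark as bumping a page number is only an assumption about what the author had in mind; it relies on the fact that the replacement argument of re.sub may be a function.

```python
import re

url = "http://example.com/list?type=4&page=30"

# re.sub(pattern, replacement, string): replace every match of the pattern.
masked = re.sub(r"\d+", "N", url)
print(masked)  # http://example.com/list?type=N&page=N


# The replacement may also be a function that receives each match object.
def add_ten(match):
    return "page=" + str(int(match.group(1)) + 10)


next_url = re.sub(r"page=(\d+)", add_ten, url)
print(next_url)  # http://example.com/list?type=4&page=40
```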

 

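CrawlSpider

Running scrapy genspider -t crawl xxx xxxxxx.com quickly generates the skeleton code for a CrawlSpider.

The generated xxx.py needs these imports:

1.

from scrapy.linkextractors import LinkExtractor

LinkExtractor extracts the links found in a page.

2.

from scrapy.spiders import CrawlSpider, Rule

CrawlSpider and Rule are the two classes used to process the extracted links.

3. Implementation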

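    # Rule and LinkExtractor are written together here. allow keeps links whose URL
    # matches the pattern; deny drops links that match. With no callback, follow
    # defaults to True and the matched links are followed.
    rules = (
        # Follow page 1, page 2, ...
        Rule(LinkExtractor(allow=r'type=4&page=\d+')),
        # With a callback, follow defaults to False: the links are not followed and
        # the response is handled by the callback. Pass the method name as a
        # string, without parentheses.
        Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml'), callback='parse_item'),
    )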
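To see which URLs the two allow patterns would select, here is a simplified model of the filtering, not the real LinkExtractor: it only mimics the regex search against each URL and ignores everything else the class does (domains, file extensions, deduplication). The helper keep_link and the sample URLs are hypothetical.

```python
import re


def keep_link(url, allow=None, deny=None):
    """Simplified stand-in for allow/deny filtering: deny wins over allow."""
    if allow and not re.search(allow, url):
        return False
    if deny and re.search(deny, url):
        return False
    return True


# Made-up URLs shaped like the ones the rules target.
urls = [
    "http://example.com/index.php/question/questionType?type=4&page=30",
    "http://example.com/html/question/201807/379074.shtml",
    "http://example.com/index.php/question/reply?type=2",
]

page_links = [u for u in urls if keep_link(u, allow=r"type=4&page=\d+")]
post_links = [u for u in urls if keep_link(u, allow=r"/html/question/\d+/\d+\.shtml")]
not_posts = [u for u in urls if keep_link(u, deny=r"\.shtml$")]

print(page_links)
print(post_links)
print(not_posts)
```

The dot before shtml is escaped so that it matches a literal "." rather than any character.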

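Note: in a CrawlSpider the callback must not be named parse, because CrawlSpider uses parse internally; any other name works.

response.url gives the URL of the current page.

allow keeps the links that match; deny excludes the links that match.
With a callback, follow defaults to False: the links are not followed and the callback handles the response. The callback is given as a string naming the method, without parentheses.
Without a callback, follow defaults to True and the links are followed.
Rule takes a process_links argument for filtering the extracted links; deal_links is the filtering function, which you define yourself.
Rule(pagelink, process_links = "deal_links")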

 

Attached is the code for a spider that crawls the Sunshine Hotline political inquiry platform (阳光热线问政平台).

main:

    from scrapy.cmdline import execute
    import sys, os
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    # execute(['scrapy', 'crawl', 'sun'])
    execute(['scrapy', 'crawl', 'dongdong'])

 

items:

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    import scrapy


    class DongguanItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        content = scrapy.Field()
        url = scrapy.Field()
        number = scrapy.Field()

 

xxx:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from dongguan.items import DongguanItem


    # CrawlSpider is used to crawl the whole site; take care when parsing each page
    class DongdongSpider(CrawlSpider):
        name = 'dongdong'
        allowed_domains = ['wz.sun0769.com']
        start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']
        # Matching rule for the listing pages
        pagelink = LinkExtractor(allow=("type=4"))
        # Matching rule for each post on a listing page
        contentlink = LinkExtractor(allow=(r"/html/question/\d+/\d+\.shtml"))
        rules = (
            # The web server serves malformed URLs, so process_links is used to
            # repair the URLs extracted from each page.
            # process_links is a built-in Rule parameter for filtering links.
            Rule(pagelink, process_links = "deal_links"),
            Rule(contentlink, callback = "parse_item")
        )

        # links is the list of links extracted from the current response
        def deal_links(self, links):
            for each in links:
                each.url = each.url.replace("?","&").replace("Type&","Type?")
            return links

        def parse_item(self, response):
            item = DongguanItem()
            # Title
            item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
            # Number
            item['number'] = item['title'].split(' ')[-1].split(":")[-1]
            # Content: try the rule for posts with images first; it returns a list of all text nodes
            content = response.xpath('//div[@class="contentext"]/text()').extract()
            # An empty list means no match, so fall back to the rule for posts without images
            if len(content) == 0:
                content = response.xpath('//div[@class="c1 text14_2"]/text()').extract()
            item['content'] = "".join(content).strip()
            # URL
            item['url'] = response.url
            yield item
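The two string manipulations in the spider can be tried without running a crawl. This is only a sketch: FakeLink stands in for Scrapy's link object (only its url attribute is modeled), and both the malformed URL and the title string are made-up inputs, since the article does not show what the site actually returns.

```python
from dataclasses import dataclass


@dataclass
class FakeLink:
    """Stand-in for a Scrapy link object; only the url attribute is modeled."""
    url: str


def deal_links(links):
    # Same rewrite as the spider method: turn every "?" into "&", then
    # restore the single "?" that belongs right after "Type".
    for each in links:
        each.url = each.url.replace("?", "&").replace("Type&", "Type?")
    return links


# Hypothetical malformed link: a second "?" appears where "&" is expected.
links = [FakeLink("http://wz.sun0769.com/index.php/question/questionType?type=4?page=30")]
fixed = deal_links(links)
print(fixed[0].url)

# Hypothetical title text: the number is whatever follows the last ":"
# in the last space-separated token.
title = "Sample complaint title No.:191166"
number = title.split(' ')[-1].split(":")[-1]
print(number)
```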

 

settings:

    # -*- coding: utf-8 -*-
    # Scrapy settings for dongguan project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'dongguan'

    SPIDER_MODULES = ['dongguan.spiders']
    NEWSPIDER_MODULE = 'dongguan.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'dongguan (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    # ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);',
        # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        # 'Accept-Language': 'en',
    }

    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'dongguan.middlewares.DongguanSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'dongguan.middlewares.DongguanDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'dongguan.pipelines.DongguanPipeline': 300,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

 

 pipelines:

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json


    class DongguanPipeline(object):
        def __init__(self):
            self.filename = open("dongguan.json", "wb+")

        def process_item(self, item, spider):
            text = json.dumps(dict(item), ensure_ascii = False) + ",\n"
            self.filename.write(text.encode("utf-8"))
            return item

        def close_spider(self, spider):
            self.filename.close()
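One thing worth knowing about the file this pipeline produces: each item is written as its own JSON object followed by a comma and a newline, so the file as a whole is not a single valid JSON document. The sketch below reproduces the write step on two made-up items and shows one possible way to read the file back line by line; the temporary path and the sample items are assumptions for illustration.

```python
import json
import os
import tempfile

# Made-up items standing in for scraped results.
items = [
    {"title": "标题一", "number": "1001"},
    {"title": "标题二", "number": "1002"},
]

path = os.path.join(tempfile.mkdtemp(), "dongguan.json")

# Write the way the pipeline does: UTF-8 bytes, one object per line,
# with ",\n" appended after each object.
with open(path, "wb+") as f:
    for item in items:
        text = json.dumps(item, ensure_ascii=False) + ",\n"
        f.write(text.encode("utf-8"))

# Parse line by line, dropping the trailing comma on each line first.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line.rstrip().rstrip(",")) for line in f if line.strip()]

print(loaded == items)
```

Because ensure_ascii is False, the Chinese text is stored as readable UTF-8 rather than as \u escape sequences.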

 

 

 
