Setting allowed_domains filters the domains that get crawled: with the OffsiteMiddleware spider middleware enabled (it is enabled by default), requests for domains outside this allowed list are filtered out and never crawled.
There is one caveat, though: in a case like the one below, the pages listed in start_urls are not filtered themselves; the middleware only filters requests for pages reached after those initial ones (the requests built from start_urls carry dont_filter=True, which lets them pass the offsite check). Still to be verified.
#!/usr/bin/env python
# coding: utf-8
import scrapy
# import sys
# import os
from scrapy_study.items import DemoItem


class DemoScrapy(scrapy.Spider):
    name = 'demoscrapy'
    # start_urls = ['http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/tutorial.html']
    # The allowed domain deliberately does not match the start URL below.
    allowed_domains = ["scrapypython.2org"]
    # start_urls = ['https://docs.python.org/2/library/os.path.html']
    start_urls = ['http://yogoup.sinaapp.com/']

    def parse(self, response):
        # The start page is still downloaded even though its domain is not
        # listed in allowed_domains.
        print(response.body)
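
To check the other half of the behavior, here is a minimal sketch (not from the original post; the spider name and the follow-up URLs are illustrative assumptions): requests yielded from a callback to a domain outside allowed_domains are dropped by OffsiteMiddleware (a "Filtered offsite request" debug message appears in the log), while a request marked dont_filter=True is still downloaded.

import scrapy


class OffsiteDemoSpider(scrapy.Spider):
    name = 'offsite_demo'
    allowed_domains = ['example.com']
    # The start URL is fetched even though its domain is not listed above,
    # because requests built from start_urls carry dont_filter=True.
    start_urls = ['http://yogoup.sinaapp.com/']

    def parse(self, response):
        # This follow-up request targets a domain outside allowed_domains,
        # so OffsiteMiddleware drops it and logs "Filtered offsite request".
        yield scrapy.Request('http://www.python.org/', callback=self.parse_page)
        # Setting dont_filter=True bypasses the offsite check, so this
        # off-domain request is still downloaded.
        yield scrapy.Request('http://www.python.org/about/',
                             callback=self.parse_page, dont_filter=True)

    def parse_page(self, response):
        print(response.url)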