We used to rely on selenium + PhantomJS to get the fully rendered page, but PhantomJS is no longer maintained and selenium has dropped support for it. What we do have is Headless Chrome, a mode of Chrome that gives us the same capability as PhantomJS.
How to use it: take scraping products from JD.com as an example. A JD search results page shows 30 products, but scrolling down makes it lazy-load another 30.
We use Headless Chrome to simulate that scrolling and collect all 60 products on a page.
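To see the scroll-and-load behaviour on its own before wiring it into Scrapy, here is a minimal standalone sketch (my own illustration; it reuses the spider's search URL, and the 60-item count is what the page usually yields, not a guarantee):

# Open the JD search page in headless Chrome, scroll to the bottom so the
# lazy-loaded second batch of goods renders, then count the goods nodes.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')

browser = webdriver.Chrome(chrome_options=chrome_options)
try:
    browser.get('https://search.jd.com/Search?keyword=%E7%BE%8E%E9%A3%9F&enc=utf-8')
    # Scrolling to the bottom is what triggers loading of the extra 30 items
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)  # give the page a moment to render the new items
    goods = browser.find_elements_by_xpath('//div[@id="J_goodsList"]/ul/li')
    print(len(goods))  # typically 60 after scrolling, 30 without
finally:
    browser.quit()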
The core idea of the selenium integration is that, in the downloader middleware's process_request, we turn our request into a selenium request.
jd.py
# -*- coding: utf-8 -*-
import scrapy
import re
from jingdong.items import JingdongItem
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class JdSpider(scrapy.Spider):

    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E7%BE%8E%E9%A3%9F&enc=utf-8']

    def __init__(self, *args, **kwargs):
        super(JdSpider, self).__init__(*args, **kwargs)
        chrome_options = Options()
        # This single flag is what makes Chrome behave like PhantomJS
        chrome_options.add_argument('--headless')
        # chrome_options.add_argument('--disable-gpu')
        self.browser = webdriver.Chrome(chrome_options=chrome_options)
        # self.browser = webdriver.Chrome()
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        # Called when the spider finishes; shut the shared browser down
        print("spider closed")
        self.browser.close()

    def parse(self, response):
        # The response here is the page already rendered by headless Chrome,
        # handed back from the downloader middleware below
        goods_list = response.xpath('//div[@id="J_goodsList"]/ul/li')
        for goods in goods_list:
            item = JingdongItem()
            item['goods_name'] = goods.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()').extract_first()
            item['price'] = goods.xpath('.//div[@class="p-price"]/strong/i/text()').extract_first()
            item['shop'] = goods.xpath('.//div[@class="p-shop"]/span/a/text()').extract_first()
            yield item
middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
import time


class JingdongDownloaderMiddleware(object):

    def process_request(self, request, spider):
        if spider.name == 'jd':
            try:
                # Load the page in the spider's shared headless browser and
                # scroll to the bottom so the lazy-loaded items are rendered
                spider.browser.get(request.url)
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException as e:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            time.sleep(2)
            # Returning an HtmlResponse means Scrapy's own downloader never
            # touches this request; the rendered page goes straight to parse()
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
Remember the return-value contract of process_request: if it returns None, Scrapy carries on with its normal flow; if it returns a Request, Scrapy re-schedules that request; and if it returns a Response, no other process_request or process_exception methods are called, which is as if the download has already happened, and the response moves straight into the response-handling flow.
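As a quick illustration of those three cases, here is a hypothetical middleware (not part of this project; the meta keys are made up) showing what each kind of return value does:

# A made-up middleware that only exists to illustrate process_request's
# three possible return values and how Scrapy reacts to each of them.
from scrapy.http import HtmlResponse


class ReturnValueDemoMiddleware(object):

    def process_request(self, request, spider):
        if request.meta.get('skip_download'):
            # Returning a Response: the downloader and the remaining
            # process_request/process_exception methods are skipped, and this
            # response goes straight into the response-handling flow.
            return HtmlResponse(url=request.url, body=b'<html></html>',
                                encoding='utf-8', request=request)
        if request.meta.get('redirect_to'):
            # Returning a Request: the current request is dropped and the new
            # one is scheduled, passing through the whole chain again.
            return request.replace(url=request.meta['redirect_to'])
        # Returning None: Scrapy continues through the remaining middlewares
        # and eventually downloads the request itself.
        return None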
So what does selenium actually do here? It effectively intercepts the original request, performs the fetch itself, and hands back the result; the request is never downloaded by Scrapy's downloader because selenium has taken over that step.
These two files are the key pieces; the remaining settings.py and items.py are no different from any other spider's: enable what needs enabling and define what needs defining.
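For completeness, here is a sketch of what those two files might contain in this project. The item fields are exactly the ones filled in parse() above; the middleware path follows from the project name jingdong, while the priority value 543 and ROBOTSTXT_OBEY = False are my assumptions based on a default Scrapy project:

items.py
# items.py -- one Field per value assigned in parse()
import scrapy


class JingdongItem(scrapy.Item):
    goods_name = scrapy.Field()
    price = scrapy.Field()
    shop = scrapy.Field()

settings.py
# settings.py -- the one switch that matters here: route requests
# through our downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
}
ROBOTSTXT_OBEY = False  # assumption: avoid being blocked by robots.txt during testing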