Continuing from the previous note. In the last part we implemented the data crawl and exported the results to a file. This time we get into the real work: simulating browser requests to get around Lagou's anti-crawling measures, crawling the second-level pages, and extracting concrete data such as job titles and salaries.
In the last part we crawled the category listings. When browsing the site by hand, we likewise click a category to reach a second-level page with the job list. The links we crawled last time are exactly those links, and we already have them:
Now we click Java to enter its second-level page. Suppose we want to grab the following information:
First, we reach the second-level page with the following request:
yield scrapy.Request(url=jobUrl, callback=self.parse_url)
callback is the callback function; we will implement this method below. But there is one thing we have to take care of first: getting past Lagou's anti-crawler mechanism. We do that by setting cookies on the request, so let's first go over how to obtain the cookie.
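Before walking through the browser steps, here is a minimal, self-contained sketch of how a cookie dict and a callback fit into scrapy.Request. The spider name, URL, and cookie values are placeholders for illustration only, not Lagou-specific:

import scrapy


class CookieDemoSpider(scrapy.Spider):
    # Placeholder spider, only to show the callback + cookies pattern.
    name = "cookie_demo"
    start_urls = ["https://httpbin.org/cookies"]

    # Stand-in for the cookie dict we will copy from the browser below.
    cookie = {"demo_key": "demo_value"}

    def parse(self, response):
        # Attach the cookie dict to the next request; Scrapy sends it as the
        # Cookie header and routes the response to parse_url.
        yield scrapy.Request(
            url="https://httpbin.org/cookies",
            cookies=self.cookie,
            callback=self.parse_url,
            dont_filter=True,  # same URL as start_urls, so skip the dupe filter
        )

    def parse_url(self, response):
        # httpbin echoes back the cookies it received, handy for verifying.
        self.logger.info(response.text)

Running this standalone file with scrapy runspider logs the cookies the server actually received, which is the same way you can later verify that the Lagou cookie is being sent.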
On the Java second-level page, press F12 to open the developer tools, then press F5 to refresh. As shown in the figure, four requests are sent; judging by its name, the first one should be the request we are after.
Copy the Cookie value from the request headers into a text editor first. We then need to turn it into key-value pairs (a small helper for this is sketched after the full code below) before handing it to Scrapy. The complete code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from First.items import FirstItem


class SecondSpider(scrapy.Spider):
    name = 'second'
    allowed_domains = []
    start_urls = ['https://www.lagou.com/']

    # Cookie copied from the browser, already split into key-value pairs.
    cookie = {
        "JSESSIONID": "ABAAABAAAGGABCB090F51A04758BF627C5C4146A091E618",
        "_ga": "GA1.2.1916147411.1516780498",
        "_gid": "GA1.2.405028378.1516780498",
        "Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6": "1516780498",
        "user_trace_token": "20180124155458-df9f65bb-00db-11e8-88b4-525400f775ce",
        "LGUID": "20180124155458-df9f6ba5-00db-11e8-88b4-525400f775ce",
        "X_HTTP_TOKEN": "98a7e947b9cfd07b7373a2d849b3789c",
        "index_location_city": "%E5%85%A8%E5%9B%BD",
        "TG-TRACK-CODE": "index_navigation",
        "LGSID": "20180124175810-15b62bef-00ed-11e8-8e1a-525400f775ce",
        "PRE_UTM": "",
        "PRE_HOST": "",
        "PRE_SITE": "https%3A%2F%2Fwww.lagou.com%2F",
        "PRE_LAND": "https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2FJava%2F%3FlabelWords%3Dlabel",
        "_gat": "1",
        "SEARCH_ID": "27bbda4b75b04ff6bbb01d84b48d76c8",
        "Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6": "1516788742",
        "LGRID": "20180124181222-1160a244-00ef-11e8-a947-5254005c3644"
    }

    def parse(self, response):
        for item in response.xpath('//div[@class="menu_box"]/div/dl/dd/a'):
            # The snippet was cut off here in the original post; the lines
            # below are a reconstruction based on part one of this series:
            # take the category name and link, then follow the link with the
            # cookie attached so parse_url receives the second-level page.
            jobClass = item.xpath('text()').extract()
            jobUrl = item.xpath('@href').extract_first()
            if jobUrl:
                yield scrapy.Request(url=jobUrl, cookies=self.cookie, callback=self.parse_url)

    def parse_url(self, response):
        # Implemented later in the post: extract job title, salary, etc.
        # from the second-level page and fill a FirstItem.
        pass
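For completeness, here is one way (a sketch, not from the original post) to turn the raw Cookie string copied from the developer tools into the key-value dict used in the cookie field above. The example string is shortened and purely illustrative:

def cookie_string_to_dict(raw_cookie):
    # Split "k1=v1; k2=v2; ..." into {"k1": "v1", "k2": "v2", ...}.
    cookie = {}
    for pair in raw_cookie.split(";"):
        pair = pair.strip()
        if "=" not in pair:
            continue  # skip empty fragments
        key, _, value = pair.partition("=")
        cookie[key.strip()] = value.strip()
    return cookie


# Example with a shortened, made-up cookie string:
raw = "JSESSIONID=ABAA...; _ga=GA1.2.1916147411.1516780498; _gat=1"
print(cookie_string_to_dict(raw))
# {'JSESSIONID': 'ABAA...', '_ga': 'GA1.2.1916147411.1516780498', '_gat': '1'}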