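First, note that Lagou (拉勾网) serves its job listings through a POST request, so the form data must be supplied with the request. Here we use from_data = {'first': 'true', 'pn': '1', 'kd': '设计'}, where pn is the current page number and kd is the job keyword being searched for. Second, remember to use a Session to obtain the dynamic cookies; otherwise the scraped data comes back empty, and your IP or account is more likely to get banned.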
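The two points above (POST form data, plus a Session that picks up cookies before the real request) can be sketched roughly as follows. This is a minimal sketch rather than the author's exact code: `build_form_data` and `fetch_page` are hypothetical helper names, and both URLs are passed in as parameters because the source does not show the endpoint that receives the POST.

```python
def build_form_data(page, keyword):
    """Build the POST form fields described above.

    'pn' is the page number and 'kd' the search keyword; 'first' is kept
    as 'true', exactly as in the example dictionary in the text.
    """
    return {"first": "true", "pn": str(page), "kd": keyword}


def fetch_page(session, list_url, ajax_url, page, keyword, headers):
    """Fetch one page of results as parsed JSON.

    session  -- a requests.Session-like object (must offer .get and .post)
    list_url -- the normal listing page, visited first to obtain cookies
    ajax_url -- the JSON endpoint that accepts the POST (an assumption here;
                the source does not name it)
    """
    # Step 1: visit the listing page so the session stores fresh cookies.
    session.get(list_url, headers=headers, timeout=3)
    # Step 2: POST the form data with the same session, so the cookies
    # from step 1 are sent along automatically.
    response = session.post(
        ajax_url,
        data=build_form_data(page, keyword),
        headers=headers,
        timeout=3,
    )
    response.raise_for_status()
    return response.json()
```

With `requests` installed, this might be called as `fetch_page(requests.Session(), list_url, ajax_url, 1, '设计', headers)`. Creating a new session per page, or reusing one until the cookies expire, is a design choice the source leaves open.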
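Lagou shows 15 postings per page and 30 pages by default, for 450 postings in total. I simply hard-coded these numbers here; adjust the number of pages to crawl as needed. You can also skip the job requirements ("岗位要求") field, or any other field you don't need. The saved file looks like this.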
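To make the page count and the set of saved fields adjustable instead of hard-coded, the crawl loop could be shaped along these lines. This is only a sketch under stated assumptions: `crawl` and its `fetch_jobs` argument are hypothetical names (a callable that maps a page number to a list of posting dicts), and the default field names are a small subset of the keys used in this post's code.

```python
PAGE_SIZE = 15   # postings per page, as stated above
MAX_PAGES = 30   # pages shown by default, as stated above


def crawl(fetch_jobs, pages=MAX_PAGES,
          fields=("positionName", "salary", "city")):
    """Collect postings page by page, keeping only the requested fields.

    fetch_jobs -- hypothetical callable: page number -> list of posting dicts
    pages      -- how many pages to crawl (1..pages inclusive)
    fields     -- keys to keep; leave a key out to skip that column
    """
    rows = []
    for page in range(1, pages + 1):
        for job in fetch_jobs(page):
            # Missing keys become None instead of raising KeyError.
            rows.append({key: job.get(key) for key in fields})
    return rows
```

Leaving the slow per-posting detail request (the job requirements) out of `fields` and out of `fetch_jobs` is the easiest way to shorten a run, since that lookup is what adds a pause for every posting.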
Import the required libraries
import pymongo
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import time
from datetime import datetime
from pymongo import MongoClient


# Fetch the job requirements from the job detail page
def getjobneeds(positionId):
    '''
    :param positionId: ID of the posting, used to build the detail-page URL
    :return: the job requirements text
    '''
    url = 'https://www.lagou.com/jobs/{}.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_%E8%AE%BE%E8%AE%A1/p-city_0?px=default',
        'Upgrade-Insecure-Requests': '1'
    }
    s = requests.Session()
    s.get(url.format(positionId), headers=headers, timeout=3)  # initial request, to obtain cookies
    cookie = s.cookies  # cookies obtained from that request
    response = s.get(url.format(positionId), headers=headers, cookies=cookie, timeout=3)  # fetch the page text
    time.sleep(5)  # pause for a while
    soup = BeautifulSoup(response.text, 'html.parser')
    need = ' '.join([p.text.strip() for p in soup.select('.job_bt div')])
    return need


# Extract the details of a single posting
def getjobdetails(jd):
    '''
    :param jd: one posting record (dict)
    :return: the result set
    '''
    results = {}
    results['businessZones'] = jd['businessZones']
    results['companyFullName'] = jd['companyFullName']  # company name
    results['companyLabelList'] = jd['companyLabelList']
    results['financeStage'] = jd['financeStage']
    results['skillLables'] = jd['skillLables']
    results['companySize'] = jd['companySize']
    results['latitude'] = jd['latitude']
    results['longitude'] = jd['longitude']
    results['city'] = jd['city']
    results['district'] = jd['district']
    results['salary'] = jd['salary']
    results['secondType'] = jd['secondType']
    results['workYear'] = jd['workYear']
    results['education'] = jd['education']
    results['firstType'] = jd['firstType']
    results['thirdType'] = jd['thirdType']
    results['positionName'] = jd['positionName']  # job title
    results['positionLables'] = jd['positionLables']
    results['positionAdvantage'] = jd['positionAdvantage']
    positionId = jd['positionId']
    results['need'] = getjobneeds(positionId)
    time.sleep(2)  # pause between requests to limit the request rate
    print(jd,