First, take a look at the Lianjia second-hand housing site for Shenzhen: https://sz.lianjia.com/ershoufang/
The screenshot below shows part of the page; the information we want to extract is the listing data highlighted in the red boxes.
Without further ado, let's write the common boilerplate code first:
import requests
from lxml import etree
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'Referer': 'https://sz.lianjia.com/'
}
url = 'https://sz.lianjia.com/ershoufang/'
r = requests.get(url, headers=headers)  # fetch the listing page
html = etree.HTML(r.text)  # parse the HTML into an lxml element tree
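Before parsing, it's worth confirming the download actually succeeded. This check is a small addition of mine, not part of the original post:

# hypothetical sanity check: stop early if the page did not download
if r.status_code != 200:
    raise RuntimeError('request failed with status %s' % r.status_code)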
Right-click the page and choose "Inspect". Inspection shows that everything we need lives inside one ul tag:
So we first locate that ul element; its class attribute is "sellListContent", which gives the following code:
outer = html.xpath('//ul[@class="sellListContent"]')[0]
Next we look for the individual listings inside this ul. Expanding it, we find that each li tag holds one complete listing.
So the code here is:
lis = outer.xpath('./li')
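As a quick sanity check (my addition, not the original author's), you can count the matched nodes; a full Lianjia listing page normally contains 30 li entries:

print(len(lis))  # expect around 30 listings on a full page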
Expanding the child tags under each li, we locate the fields we need; we want to scrape all of the following:
Rather than walk through every small tag one by one (the method is the same as before), here is the code directly:
houses = []
for li in lis:
    try:
        area = li.xpath('./a/img[@class="lj-lazy"]/@alt')[0]  # district
        title = li.xpath('./div[@class="info clear"]/div[@class="title"]/a/text()')[0]  # listing title
        position = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[1]  # location
        garden_name = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[0]  # estate name
        houseinfo = li.xpath('./div[@class="info clear"]/div[@class="address"]/div/text()')[0]  # house details
        followinfo = li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/text()')[0]  # follow info
        tag = li.xpath('./div[@class="info clear"]/div[@class="tag"]/span/text()')  # tags
        tag = "|".join(tag)
        price = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
        unit = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
        if unit:
            price = price[0] + unit[0]  # total price
        unitprice = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')[0]  # price per sqm
        house = {'area': area, 'title': title, 'position': position,
                 'garden_name': garden_name, 'houseinfo': houseinfo,
                 'followinfo': followinfo, 'tag': tag,
                 'price': price, 'unitprice': unitprice}
        houses.append(house)
    except Exception as err:
        print("current failed: " + str(err))
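To spot-check the parsed fields, printing one record is a handy debugging step (again an addition of mine, not in the original):

print(len(houses), 'listings parsed')
print(houses[0])  # one dict with keys like area, title, price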
Part of the scraped listing data looks like this:
Since we need more than just this one page, compare the URLs below:
Page 1: https://sz.lianjia.com/ershoufang/
Page 2: https://sz.lianjia.com/ershoufang/pg2/
The listing caps out at 100 pages, so we generate every page URL with the following code:
for x in range(100):
    url = 'https://sz.lianjia.com/ershoufang/pg' + str(x + 1)
By wrapping the earlier code into a function that takes a url, we can scrape however many pages we need.
We then define another function that saves the scraped records to a csv file:
def save_csv(path, houses):
    # append each listing as one "::"-separated line
    with open(path, 'a', encoding='utf-8') as f:
        for house in houses:
            f.write("::".join([house['area'], house['title'], house['position'],
                               house['garden_name'], house['houseinfo'], house['followinfo'],
                               house['tag'], house['price'], house['unitprice']]) + '\n')
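One caveat: despite the .csv extension, the file above uses "::" as the field separator, so spreadsheet tools won't read it as standard CSV. If genuine CSV output is wanted, a minimal alternative sketch using Python's built-in csv module (my suggestion, not the original author's code) could look like this:

import csv

FIELDS = ['area', 'title', 'position', 'garden_name',
          'houseinfo', 'followinfo', 'tag', 'price', 'unitprice']

def save_csv_std(path, houses):
    # append each listing as one comma-separated row
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        for house in houses:
            writer.writerow(house)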
Putting the whole process together, the complete code is:
import requests
from lxml import etree
import time


def get_info(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
        'Referer': 'https://sz.lianjia.com/'
    }
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    outer = html.xpath('//ul[@class="sellListContent"]')[0]
    lis = outer.xpath('./li')
    houses = []
    for li in lis:
        try:
            area = li.xpath('./a/img[@class="lj-lazy"]/@alt')[0]
            title = li.xpath('./div[@class="info clear"]/div[@class="title"]/a/text()')[0]
            position = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[1]
            garden_name = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[0]
            houseinfo = li.xpath('./div[@class="info clear"]/div[@class="address"]/div/text()')[0]
            followinfo = li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/text()')[0]
            tag = li.xpath('./div[@class="info clear"]/div[@class="tag"]/span/text()')
            tag = "|".join(tag)
            price = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
            unit = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
            if unit:
                price = price[0] + unit[0]
            unitprice = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')[0]
            house = {'area': area, 'title': title, 'position': position,
                     'garden_name': garden_name, 'houseinfo': houseinfo,
                     'followinfo': followinfo, 'tag': tag,
                     'price': price, 'unitprice': unitprice}
            houses.append(house)
        except Exception as err:
            print("current failed: " + str(err))
    return houses


def save_csv(path, houses):
    with open(path, 'a', encoding='utf-8') as f:
        for house in houses:
            f.write("::".join([house['area'], house['title'], house['position'],
                               house['garden_name'], house['houseinfo'], house['followinfo'],
                               house['tag'], house['price'], house['unitprice']]) + '\n')


if __name__ == '__main__':
    path = 'lianjia_ershoufang.csv'
    for x in range(100):
        url = 'https://sz.lianjia.com/ershoufang/pg' + str(x + 1)
        houses = get_info(url)
        save_csv(path, houses)
        print("page %s scraped" % (x + 1))
        time.sleep(0.2)  # be polite: pause between requests
Done: the second-hand housing listings have been scraped and saved!