
Python crawler: scraping Lianjia second-hand housing listings with lxml

Goal: scrape the second-hand housing listings on Lianjia's Shenzhen site, including the house details, floor area, total price and unit price, and then classify the results by community (小区) and layout (户型).

First, open Lianjia's second-hand housing site for Shenzhen: https://sz.lianjia.com/ershoufang/.
Each entry on the listing page carries the fields we want to capture: title, community, location, house details, follow info, tags and prices.

Without further ado, let's write down the boilerplate that the script starts with:

import requests
from lxml import etree
import time

# Impersonate a normal browser visit; Lianjia rejects obviously automated requests.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'Referer': 'https://sz.lianjia.com/'
}
url = 'https://sz.lianjia.com/ershoufang/'

r = requests.get(url, headers=headers)

# Parse the response into an element tree that we can query with XPath.
html = etree.HTML(r.text)
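Before parsing, it's worth confirming that the request actually succeeded. Here is a minimal defensive variant of the request above (the verification-page check is an assumption about how Lianjia handles suspected bots, not documented behavior):

r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()  # fail loudly on HTTP 4xx/5xx
# Assumption: blocked requests get redirected to a human-verification page.
if 'verify' in r.url:
    raise RuntimeError('Redirected to a verification page: ' + r.url)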

Right-click the page and choose "Inspect". Inspection shows that everything we need lives inside a single ul element whose class attribute is "sellListContent", so we select that element first:

outer = html.xpath('//ul[@class="sellListContent"]')[0]
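Note that indexing with [0] raises an IndexError when the XPath matches nothing, for example when the site returns a verification page instead of the listings. A safer sketch:

ul_list = html.xpath('//ul[@class="sellListContent"]')
if not ul_list:
    # Either the page layout changed or the request was blocked.
    raise RuntimeError('sellListContent not found on ' + url)
outer = ul_list[0]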

Next, we look for the individual listings inside this ul. Expanding it shows that each li element holds one complete listing, so:

lis = outer.xpath('./li')

Expanding the tags under each li reveals every field we want: title, community, location, house details, follow info, tags and prices. The sub-tags all follow the same pattern, so rather than walking through them one by one, here is the extraction code:

houses = []
for li in lis:
    try:
        area = li.xpath('./a/img[@class="lj-lazy"]/@alt')[0]  # district

        title = li.xpath('./div[@class="info clear"]/div[@class="title"]/a/text()')[0]  # listing title

        position = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[1]  # location
        garden_name = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[0]  # community name

        houseinfo = li.xpath('./div[@class="info clear"]/div[@class="address"]/div/text()')[0]  # house details

        followinfo = li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/text()')[0]  # follow info

        tag = li.xpath('./div[@class="info clear"]/div[@class="tag"]/span/text()')  # tags
        tag = "|".join(tag)

        price = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
        unit = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
        price = price[0] + (unit[0] if unit else '')  # total price, e.g. "500" + "万"

        unitprice = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')[0]  # unit price

        house = {'area': area, 'title': title, 'position': position, 'garden_name': garden_name,
                 'houseinfo': houseinfo, 'followinfo': followinfo, 'tag': tag,
                 'price': price, 'unitprice': unitprice}
        houses.append(house)
    except Exception as err:
        print("current failed: " + str(err))
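(One fix over the original: `price` is now always joined into a string; previously, when `unit` was empty, the raw list leaked into the dictionary and broke the save step later.) The stated goal also includes classifying listings by layout (户型), and the layout lives inside houseinfo as one "|"-separated string. A hedged sketch for splitting it (the example string and field order are assumptions; real listings may vary):

def parse_houseinfo(houseinfo):
    # Assumed shape: "3室2厅 | 89.5平米 | 南 | 精装 | 低楼层(共30层)"
    parts = [p.strip() for p in houseinfo.split('|')]
    layout = parts[0] if len(parts) > 0 else ''  # e.g. "3室2厅"
    size = parts[1] if len(parts) > 1 else ''    # e.g. "89.5平米"
    return layout, size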

Running this collects one page of listings into houses, one dictionary per listing.
One page is not enough, though. Comparing the URLs of successive pages reveals the pattern:
Page 1: https://sz.lianjia.com/ershoufang/
Page 2: https://sz.lianjia.com/ershoufang/pg2/
So we can generate every page's URL with:

for x in range(100):
    url = 'https://sz.lianjia.com/ershoufang/pg' + str(x+1)
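The first page carries no pg suffix, but pg1 appears to resolve to the same first page, which is why the loop can simply start at pg1 (an observation, not documented behavior). The equivalent list comprehension:

# Build all 100 page URLs up front (assumes pg1 serves page 1).
urls = ['https://sz.lianjia.com/ershoufang/pg{}/'.format(n) for n in range(1, 101)]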

Wrapping the earlier code in a function that takes a url lets us scrape as many pages as we like.
We also define a function that appends the scraped records to a file:

def save_csv(path, houses):
    # Append one "::"-separated line per listing. Not strictly CSV, but
    # "::" avoids clashing with the commas that appear inside the fields.
    with open(path, 'a', encoding='utf-8') as f:
        for house in houses:
            f.write("::".join([house['area'], house['title'], house['position'],
                               house['garden_name'], house['houseinfo'], house['followinfo'],
                               house['tag'], house['price'], house['unitprice']]) + '\n')
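If you would rather have a file that spreadsheet tools open without a custom delimiter, here is a sketch using the standard-library csv module (save_real_csv is a hypothetical name; the field names mirror the dictionary keys above):

import csv
import os

def save_real_csv(path, houses):
    fieldnames = ['area', 'title', 'position', 'garden_name',
                  'houseinfo', 'followinfo', 'tag', 'price', 'unitprice']
    write_header = not os.path.exists(path)  # write the header row only once
    # utf-8-sig lets Excel display Chinese text correctly; newline='' avoids blank rows on Windows.
    with open(path, 'a', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(houses)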

Putting the steps above together gives the complete script:

import requests
from lxml import etree
import time

def get_info(url):

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
        'Referer': 'https://sz.lianjia.com/'
    }

    r = requests.get(url, headers=headers)

    html = etree.HTML(r.text)

    outer = html.xpath('//ul[@class="sellListContent"]')[0]
    lis = outer.xpath('./li')

    houses = []
    for li in lis:
        try:
            area = li.xpath('./a/img[@class="lj-lazy"]/@alt')[0]
            title = li.xpath('./div[@class="info clear"]/div[@class="title"]/a/text()')[0]
            position = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[1]
            garden_name = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div/a/text()')[0]
            houseinfo = li.xpath('./div[@class="info clear"]/div[@class="address"]/div/text()')[0]
            followinfo = li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/text()')[0]
            tag = li.xpath('./div[@class="info clear"]/div[@class="tag"]/span/text()')
            tag = "|".join(tag)

            price = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
            unit = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
            price = price[0] + (unit[0] if unit else '')

            unitprice = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')[0]
            house = {'area': area, 'title': title, 'position': position, 'garden_name': garden_name,
                     'houseinfo': houseinfo, 'followinfo': followinfo, 'tag': tag,
                     'price': price, 'unitprice': unitprice}
            houses.append(house)
        except Exception as err:
            print("current failed: " + str(err))

    return houses

def save_csv(path, houses):
    with open(path, 'a', encoding='utf-8') as f:
        for house in houses:
            f.write("::".join([house['area'], house['title'], house['position'],
                               house['garden_name'], house['houseinfo'], house['followinfo'],
                               house['tag'], house['price'], house['unitprice']]) + '\n')

if __name__ == '__main__':

    path = 'lianjia_ershoufang.csv'

    for x in range(100):
        url = 'https://sz.lianjia.com/ershoufang/pg' + str(x + 1)
        houses = get_info(url)
        save_csv(path, houses)
        print("Page %d scraped" % (x + 1))
        time.sleep(0.2)  # brief pause between pages to avoid hammering the server
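Finally, the classification promised at the outset. A hedged sketch that groups the scraped records by community and by layout, reusing the parse_houseinfo helper sketched earlier (group_houses is a hypothetical name, not part of the script above):

from collections import defaultdict

def group_houses(houses):
    by_garden = defaultdict(list)  # community name -> listings
    by_layout = defaultdict(list)  # layout, e.g. "3室2厅" -> listings
    for house in houses:
        by_garden[house['garden_name']].append(house)
        layout, _size = parse_houseinfo(house['houseinfo'])
        by_layout[layout].append(house)
    return by_garden, by_layout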

And with that, the second-hand housing data has been scraped, saved, and is ready to classify!
