Scraping Lianjia Second-Hand Housing Data: A Beijing Lianjia Second-Hand Housing Crawler

Author: 知新_RL | 2024-07-07 02:29:13


Beijing Lianjia Second-Hand Housing Crawler

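Use Python crawler libraries to scrape listing information from Lianjia's Beijing second-hand housing pages (https://bj.lianjia.com/ershoufang/rs/), including the floor, district, total price, and unit price.

The page URLs follow this pattern:

Page 1: https://bj.lianjia.com/ershoufang/pg1/
Page 2: https://bj.lianjia.com/ershoufang/pg2/
Page 3: https://bj.lianjia.com/ershoufang/pg3/
Page n: https://bj.lianjia.com/ershoufang/pgn/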
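
The pattern above can be turned into a small helper. This is only a minimal sketch; the names `BASE_URL` and `page_urls` are made up for illustration and do not appear in the crawler itself.

```python
BASE_URL = "https://bj.lianjia.com/ershoufang/pg{}/"


def page_urls(start, stop):
    """Return the listing-page URLs for pages start..stop (inclusive)."""
    return [BASE_URL.format(n) for n in range(start, stop + 1)]


for url in page_urls(1, 3):
    print(url)
```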

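Determining the XPath Expressions

Inspect the page elements with Chrome DevTools to work out the XPath expressions. First, determine the "base expression" from the data to be scraped. Inspecting the element structure of one listing shows that its information is contained in the following markup: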

  <div class="info clear">
    <div class="title"><a class="" href="https://bj.lianjia.com/ershoufang/101122052862.html" target="_blank"
        data-log_index="1" data-el="ershoufang" data-housecode="101122052862" data-is_focus="" data-sl="">精装修大三居双卫，南北通透，有车位满五唯一</a><!-- 拆分标签 只留一个优先级最高的标签--><span
        class="goodhouse_tag tagBlock">必看好房</span></div>
    <div class="flood">
      <div class="positionInfo"><span class="positionIcon"></span><a href="https://bj.lianjia.com/xiaoqu/1111027377165/"
          target="_blank" data-log_index="1" data-el="region">金汉绿港三区 </a> - <a href="https://bj.lianjia.com/ershoufang/shunyicheng/"
          target="_blank">顺义城</a> </div>
    </div>
    <div class="address">
      <div class="houseInfo"><span class="houseIcon"></span>3室2厅 | 139.96平米 | 南 北 | 简装 | 低楼层(共20层) | 2009年 | 板塔结合</div>
    </div>
    <div class="followInfo"><span class="starIcon"></span>12人关注 / 29天以前发布</div>
    <div class="tag"><span class="subway">近地铁</span><span class="vr">VR房源</span><span class="taxfree">房本满五年</span><span
        class="haskey">随时看房</span></div>
    <div class="priceInfo">
      <div class="totalPrice totalPrice2"><i> </i><span class="">485</span><i>万</i></div>
      <div class="unitPrice" data-hid="101122052862" data-rid="1111027377165" data-price="34653"><span>34,653元/平</span></div>
    </div>
  </div>

1) Determine the base expression

The listing information to scrape is contained in the corresponding <div> tags, as shown below:

<div class="positionInfo">..</div>
<div class="address">...</div>
<div class="priceInfo">...</div>

Each page contains 30 listings, so to determine the XPath base expression we need to match a parent or ancestor of the following node:

<div class="info clear"></div>

Analyzing the page structure shows that all 30 listings on a page are contained in the following nodes:

<ul class="sellListContent" log-mod="list">
<li class="clear LOGVIEWDATA LOGCLICKDATA">
listing information..
</li>
</ul>

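Note: the class value "clear LOGVIEWDATA LOGCLICKDATA" should be taken from the raw page source (View Source), since that is the markup the crawler actually receives.
The XPath base expression is therefore: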

//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]
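
To see how the base expression behaves, here is a minimal sketch that runs it with lxml against a tiny hand-written stand-in for a listing page. The sample markup, including the `other` class on the third item, is invented for illustration and is not taken from the real site.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for one listing page
SAMPLE = """
<ul class="sellListContent" log-mod="list">
  <li class="clear LOGVIEWDATA LOGCLICKDATA"><div class="info clear">A</div></li>
  <li class="clear LOGVIEWDATA LOGCLICKDATA"><div class="info clear">B</div></li>
  <li class="other">not a listing</li>
</ul>
"""

BASE_XPATH = '//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]'

tree = etree.HTML(SAMPLE)
listings = tree.xpath(BASE_XPATH)

# @class="..." compares the whole attribute string, so the third <li> is not matched
print(len(listings))
```

Because the comparison is on the full attribute string, the class value has to match the source exactly, including the order of the class names.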

2) Determine the expressions for the scraped fields

Based on the page structure, the XPath expressions for the fields to scrape are as follows:

Community name: position = h.xpath('.//a[@data-el="region"]/text()')[0]
Listing details: hourseInfo_list = h.xpath('.//div[@class="houseInfo"]/text()')
Unit price: unitPrice = h.xpath('.//div[@class="unitPrice"]//text()')[0].strip()
Total price: totalPrice = h.xpath('.//div[@class="totalPrice totalPrice2"]/span/text()')[0].strip()
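
As a quick check, the sketch below applies these expressions to a single listing cut down from the sample markup shown earlier. It is an illustration under that sample's structure, not a guarantee about the live page. The total price is read from the inner <span> here, because in the sample the first text node under the totalPrice div belongs to the blank <i> element.

```python
from lxml import etree

# One listing, trimmed from the sample markup shown earlier
SAMPLE = """
<ul class="sellListContent">
 <li class="clear LOGVIEWDATA LOGCLICKDATA">
  <div class="info clear">
   <div class="positionInfo"><a data-el="region">金汉绿港三区 </a></div>
   <div class="houseInfo"><span class="houseIcon"></span>3室2厅 | 139.96平米 | 南 北 | 简装 | 低楼层(共20层) | 2009年 | 板塔结合</div>
   <div class="priceInfo">
    <div class="totalPrice totalPrice2"><i> </i><span class="">485</span><i>万</i></div>
    <div class="unitPrice" data-price="34653"><span>34,653元/平</span></div>
   </div>
  </div>
 </li>
</ul>
"""

h = etree.HTML(SAMPLE).xpath('//li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')[0]

position = h.xpath('.//a[@data-el="region"]/text()')[0].strip()
house_info = h.xpath('.//div[@class="houseInfo"]/text()')[0]
unit_price = h.xpath('.//div[@class="unitPrice"]//text()')[0].strip()
total_price = h.xpath('.//div[@class="totalPrice totalPrice2"]/span/text()')[0].strip()

print(position, unit_price, total_price)
print(house_info)
```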

The listing details mainly contain the following information:

<div class="houseInfo">
		<span class="houseIcon"></span>4室2厅 | 133.68平米 | 南 北 | 精装 | 顶层(共6层)  | 板楼
</div>

The matched hourseInfo_list therefore needs some processing before it yields the data we want, as shown below:

# Listing details: 3室2厅 | 147.95平米 | 南 东南 | 简装 | 中楼层(共18层)  | 塔楼
hourseInfo_list = h.xpath('.//div[@class="houseInfo"]/text()')
if hourseInfo_list:
    hourseInfo = hourseInfo_list[0].split("|")
    # Six fields are read below, so require at least six
    if len(hourseInfo) >= 6:
        # Layout (rooms and halls)
        item.append(hourseInfo[0].strip())
        # Floor area
        item.append(hourseInfo[1].strip())
        # Orientation
        item.append(hourseInfo[2].strip())
        # Decoration level
        item.append(hourseInfo[3].strip())
        # Floor
        item.append(hourseInfo[4].strip())
        # Building type: the last field, since some listings
        # insert a build year (e.g. 2009年) before it
        item.append(hourseInfo[-1].strip())
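
The same splitting logic can be tried on its own, outside the crawler. The helper below is a hedged sketch: the function name `parse_house_info` and the dictionary keys (borrowed from the database column names) are choices made for this example. It takes the building type from the last field, an assumption based on the two sample strings in this article, one of which carries a build year before the building type.

```python
def parse_house_info(text):
    """Split a houseInfo string into named fields.

    Returns None when fewer than six fields are present.
    """
    parts = [p.strip() for p in text.split("|")]
    if len(parts) < 6:
        return None
    return {
        "housetype": parts[0],
        "area": parts[1],
        "direction": parts[2],
        "hardcover": parts[3],
        "floor": parts[4],
        # Assumption: the building type is always the last field
        "buildtype": parts[-1],
    }


six = "4室2厅 | 133.68平米 | 南 北 | 精装 | 顶层(共6层)  | 板楼"
seven = "3室2厅 | 139.96平米 | 南 北 | 简装 | 低楼层(共20层) | 2009年 | 板塔结合"
print(parse_house_info(six))
print(parse_house_info(seven))
```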

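3) Improve scraping efficiency

To make scraping more robust against network fluctuations, we can set a rule: each request is given a timeout (5 seconds in the code below), a page whose request fails is tried up to three times, and if all attempts fail the crawler moves on to the next page.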

Writing the Program

With all the XPath expressions worked out, we can now write the crawler. The code is as follows:

# coding:utf8
import requests
import random
from lxml import etree
import time
# Package that supplies User-Agent strings
from fake_useragent import UserAgent
# MySQL database module
import pymysql


class LianJiaSpider(object):
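    # Constructor: sets the request URL, the per-page request counter,
    # the database connection, and the cursor
    def __init__(self):
        # URL template; {} is filled with the page number
        self.url = 'https://bj.lianjia.com/ershoufang/pg{}/'
        # Number of requests made for the current page, starting at 1
        self.count = 1
        # Database connection
        self.db = pymysql.connect(host="127.0.0.1", user='root', password="", db='lianjia')
        # Database cursor
        self.cursor = self.db.cursor()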

    # Pick a random User-Agent
    def get_headers(self):
        ua = UserAgent()
        headers = {'User-Agent': ua.random}
        return headers

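    # Fetch the HTML of a page
    def get_html(self, url):
        # Try each page at most three times; reset the counter for every new page
        self.count = 1
        while self.count <= 3:
            try:
                # Send the request with a random UA and a 5-second timeout
                resp = requests.get(url=url, headers=self.get_headers(), timeout=5)
                return resp.text
            except Exception as e:
                print(e)
                self.count += 1
        # All three attempts failed: return None so the caller moves on
        return None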

    # Parse the data
    def parse_html(self, url):
        html = self.get_html(url)
        # Rows for every listing on the page; stays empty if the request failed
        h_list = []
        if html:
            parse_html = etree.HTML(html)
            # The listing nodes on this page (30 per page)
            sellListContent = parse_html.xpath(
                '//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
            for h in sellListContent:
                item = []
                # Community name
                position = h.xpath('.//a[@data-el="region"]/text()')[0]
                item.append(position)
                # Listing details: 3室2厅 | 147.95平米 | 南 东南 | 简装 | 中楼层(共18层)  | 塔楼
                hourseInfo_list = h.xpath('.//div[@class="houseInfo"]/text()')
                if not hourseInfo_list:
                    continue
                hourseInfo = hourseInfo_list[0].split("|")
                # Six fields are read below; skip shorter listings so that
                # every row has the nine columns the insert statement expects
                if len(hourseInfo) < 6:
                    continue
                # Layout (rooms and halls)
                item.append(hourseInfo[0].strip())
                # Floor area
                item.append(hourseInfo[1].strip())
                # Orientation
                item.append(hourseInfo[2].strip())
                # Decoration level
                item.append(hourseInfo[3].strip())
                # Floor
                item.append(hourseInfo[4].strip())
                # Building type: the last field, since some listings
                # insert a build year (e.g. 2009年) before it
                item.append(hourseInfo[-1].strip())
                # Unit price
                unitPrice = h.xpath('.//div[@class="unitPrice"]//text()')[0].strip()
                # Total price: read from the <span>, because the first text
                # node under this div belongs to the blank <i> element
                totalPrice = h.xpath('.//div[@class="totalPrice totalPrice2"]/span/text()')[0].strip()
                item.append(unitPrice)
                item.append(totalPrice)

                h_list.append(item)
        return h_list

    def save_html(self, url):
        try:
            h_list = self.parse_html(url)
            # Convert each row from a list to a tuple
            h_list = list(map(tuple, h_list))
            # print(h_list)
            sql = "insert into house (position,housetype,area,direction,hardcover,floor,buildtype,unitprice,totalprice)values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
            self.cursor.executemany(sql, h_list)
            self.db.commit()
        except Exception as e:
            self.db.rollback()
            print("Database insert failed", str(e.args))

    def run(self):
        try:
            start = int(input("Enter the first page: "))
            stop = int(input("Enter the last page: "))
            for i in range(start, stop + 1):
                url = self.url.format(i)
                self.save_html(url)
                # Pause 2-3 seconds between pages
                time.sleep(random.randint(2, 3))

        except Exception as e:
            print("Scraping failed:", e)
        finally:
            self.cursor.close()
            self.db.close()


if __name__ == '__main__':
    begin = time.time()
    spider = LianJiaSpider()
    spider.run()
    end = time.time()
    print('Scraping finished, total time: %.2f seconds' % (end - begin))

138

Database Design

mysql> create database lianjia;
Query OK, 1 row affected (0.00 sec)

mysql> use lianjia
Database changed
mysql> create table house(
    -> position varchar(40),
    -> housetype varchar(40),
    -> area varchar(20),
    -> direction varchar(20),
    -> hardcover varchar(20),
    -> floor varchar(20),
    -> buildtype varchar(20),
    -> unitPrice varchar(10),
    -> totalPrice varchar(10)
    -> );
Query OK, 0 rows affected (0.05 sec)
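
To try the insert step without a running MySQL server, the sketch below uses Python's built-in sqlite3 module as a stand-in. This is a substitution made for illustration only, and the crawler itself uses pymysql. All columns are declared as text for simplicity, and sqlite3 uses `?` placeholders where pymysql uses `%s`. The sample row is assembled from the sample markup shown earlier.

```python
import sqlite3

COLUMNS = ("position", "housetype", "area", "direction", "hardcover",
           "floor", "buildtype", "unitprice", "totalprice")

# In-memory database, discarded when the connection is closed
conn = sqlite3.connect(":memory:")
conn.execute("create table house ({})".format(
    ", ".join(c + " text" for c in COLUMNS)))

# One tuple per listing, in the same column order as the insert statement
rows = [
    ("金汉绿港三区", "3室2厅", "139.96平米", "南 北", "简装",
     "低楼层(共20层)", "板塔结合", "34,653元/平", "485"),
]

sql = "insert into house ({}) values ({})".format(
    ", ".join(COLUMNS), ", ".join("?" for _ in COLUMNS))
conn.executemany(sql, rows)
conn.commit()

count = conn.execute("select count(*) from house").fetchone()[0]
print(count)
```

Each row must contain exactly nine values, one per column, which is why the parser has to produce complete rows before they are handed to `executemany`.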
