This post records, in code-plus-analysis form, how to crawl hotel information for multiple cities from Tripadvisor using the Scrapy framework and the requests library: 3,000,000+ records (1.09 GB), with a runtime of about 7 hours. Crawling multiple cities works the same way as crawling one, so to keep the code from becoming too long, this post uses the review data of London hotels as the only example. It assumes the reader already knows the basics of Python crawlers, the Scrapy framework, and HTML. Corrections and questions are welcome in the comments.
Goal: fetch the review data for every hotel in London and save the files into different folders according to hotel star rating.
This can be done in two steps: first collect the hotel URLs, then fetch each hotel's data.
Search for London under the Hotels section to reach the London hotels page (inspecting it in the browser console shows that the listing content is present in the page source). Each page shows 30 hotels; paging onward:
Page 2: https://www.tripadvisor.com/Hotels-g186338-oa30-London_England-Hotels.html
Page 3: https://www.tripadvisor.com/Hotels-g186338-oa60-London_England-Hotels.html
Continuing through the pages shows that the number after oa is the starting index of the hotels on the current page (it grows by 30 per page). Combined with the total hotel count shown on the page, a loop can generate every listing-page URL, and from those pages each hotel's URL and name can be collected.
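A minimal sketch of generating the listing-page URLs from the oa offset pattern, assuming the total of 4,364 hotels that the spider below also uses (the count shown on the page changes over time):

import requests

# The first page has no oa segment; every later page adds 30 to the offset.
total_hotels = 4364
page_urls = ['https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html']
for offset in range(30, total_hotels, 30):
    page_urls.append(
        f'https://www.tripadvisor.com/Hotels-g186338-oa{offset}-London_England-Hotels.html')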
Analyzing an individual hotel page shows that the static hotel data (name, star rating, location) lives in the page source, while the review-related data is loaded from https://www.tripadvisor.com/data/graphql/ids
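Each hotel URL already carries the two IDs the GraphQL endpoint needs: the number after g is the city's geoId and the number after d is the hotel's locationId. A minimal sketch of extracting them, matching the regex used in the main script later (the d number in this example URL is made up for illustration):

import re

# g186338 is London's geoId; d1234567 stands in for a hotel's locationId.
url = 'https://www.tripadvisor.com/Hotel_Review-g186338-d1234567-Reviews-Some_Hotel-London_England.html'
geoId, locationId = re.findall(r'\d+', url)[:2]
print(geoId, locationId)  # 186338 1234567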
This step is implemented with the Scrapy framework, mainly as practice; the requests library would work just as well.
import scrapy


class TripadivisorItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # hotel name
    hotel = scrapy.Field()
    # hotel URL
    href = scrapy.Field()
    # review count
    commentNum = scrapy.Field()
import scrapy
import bs4
from ..items import TripadivisorItem


class TripSpider(scrapy.Spider):
    # spider name
    name = 'trip'
    # restrict the domains the spider may crawl
    allowed_domains = ['www.tripadvisor.com']
    # the first listing page has no oa segment
    start_urls = ['https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html']
    # 4364 hotels in total, 30 per page
    pages = int(4364 / 30) + 1
    for page in range(1, pages):
        # oa is the starting index of the page, so it grows by 30 per page
        url = 'https://www.tripadvisor.com/Hotels-g186338-oa' + str(page * 30) + '-London_England-Hotels.html'
        start_urls.append(url)

    # extract each hotel's name, URL and review count from a listing page
    def parse(self, response):
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        datas = bs.find_all('div', class_='meta_listing')
        for data in datas:
            item = TripadivisorItem()
            data = data.find('div', class_='main_col')
            # hotel name
            item['hotel'] = data.find('div', class_='listing_title').find('a').text
            # hotel URL
            item['href'] = 'https://www.tripadvisor.com' + data.find('div', class_='listing_title').find('a')['href']
            # review count
            item['commentNum'] = data.find('a', class_='review_count').text
            yield item
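Assuming a standard Scrapy project layout (the project name tripadivisor is used here, matching the item import above), the spider is started from the project root with:

scrapy crawl trip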
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import openpyxl


class TripadivisorPipeline(object):
    # pipeline that writes every item into an Excel workbook

    def __init__(self):
        # runs once when the pipeline is instantiated
        self.wb = openpyxl.Workbook()  # create the workbook
        self.ws = self.wb.active  # get the active worksheet
        self.ws.append(['Hotel', 'URL', 'Review count'])  # header row

    def process_item(self, item, spider):
        # collect hotel name, hotel URL and review count into one row
        line = [item['hotel'], item['href'], item['commentNum']]
        # append the row to the worksheet
        self.ws.append(line)
        # hand the item back to the engine so any later pipelines can process it
        return item

    def close_spider(self, spider):
        # called when the spider finishes running
        self.wb.save('./all.xlsx')  # save the file
        self.wb.close()  # close the file
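As the template comment says, the pipeline must be enabled in settings.py; a minimal sketch, assuming the project package is named tripadivisor:

# settings.py: enable the Excel pipeline (300 is an arbitrary priority)
ITEM_PIPELINES = {
    'tripadivisor.pipelines.TripadivisorPipeline': 300,
}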
# default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "referer": "https://www.tripadvisor.com/Tourism-g60763-New_York_City_New_York-Vacations.html",
    "user-agent": USER_AGENT,  # fill in your own user-agent string
}
# do not obey robots.txt
ROBOTSTXT_OBEY = False
This step also uses Scrapy; apart from the spider itself, everything matches the previous step, so only the spider is shown.
import scrapy
import bs4
import openpyxl
from ..items import StarItem


class StarSpider(scrapy.Spider):
    name = 'star'
    allowed_domains = ['tripadvisor.com']
    start_urls = []
    # read the hotel URLs collected in the previous step from the Excel file
    wb = openpyxl.load_workbook('./all.xlsx')
    sheet = wb[wb.sheetnames[0]]
    rows = sheet.max_row
    cols = sheet.max_column
    for i in range(2, rows + 1):
        cellValue = sheet.cell(row=i, column=2).value
        start_urls.append(cellValue)

    def parse(self, response):
        item = StarItem()
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # hotel name
        item['hotel'] = bs.find('h1', id='HEADING').text
        # hotel URL
        item['url'] = response.url
        # hotel star rating
        try:
            item['star'] = bs.find('svg', class_='JXZuC')['aria-label'][0:3]
        except:
            item['star'] = 'None'
        # number of English reviews
        languages = bs.find_all('li', class_='ui_radio XpoVm')
        item['reviews'] = 0
        for language in languages:
            value = language.find('input')['value']
            # the extracted count is wrapped in parentheses; a regex can strip
            # them (or they can be removed later in Excel)
            if value == 'en':
                item['reviews'] = language.find('span', class_='POjZy').text
        yield item
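A minimal sketch of the parenthesis-stripping mentioned in the comment above, assuming the extracted span text looks like "(1,234)":

import re

raw = '(1,234)'  # illustrative review-count text as scraped from the page
reviews = int(re.sub(r'[^\d]', '', raw))  # strips '(', ')' and ',' -> 1234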
Fetching the reviews with the requests library:
# Fetch review data for the hotels listed in the Excel file.
# Note: the gevent monkey patch must come before the other imports,
# otherwise it raises an error.
from gevent import monkey
monkey.patch_all()
import requests
import os
import json
import openpyxl
import gevent
import time
import re


# fetch one page of reviews for one hotel
def getComment(geoId, locationId, page, star):
    try:
        post_url = 'https://www.tripadvisor.com/data/graphql/ids'
        data = [{"query": "0eb3cf00f96dd65239a88a6e12769ae1",
                 "variables": {"interaction": {"productInteraction": {
                     "interaction_type": "CLICK",
                     "site": {"site_name": "ta", "site_business_unit": "Hotels",
                              "site_domain": "www.tripadvisor.com"},
                     "pageview": {"pageview_request_uid": "b3ad9a52-d1c6-4bbe-8eae-a19f04fd67ff",
                                  "pageview_attributes": {"location_id": locationId, "geo_id": geoId,
                                                          "servlet_name": "Hotel_Review"}},
                     "user": {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
                              "site_persistent_user_uid": "web390a.218.17.207.101.187546A8056",
                              "unique_user_identifiers": {"session_id": "F61F132D22034DC242255F44CFE7A54C"}},
                     "search": {},
                     "item_group": {"item_group_collection_key": "b3ad9a52-d1c6-4bbe-8eae-a19f04fd67ff"},
                     "item": {"product_type": "Hotels", "item_id_type": "ta-location-id",
                              "item_id": locationId,
                              "item_attributes": {"element_type": "number", "action_name": "REVIEW_NAV",
                                                  "page_number": page, "offset": (page - 1) * 10,
                                                  "limit": 10}}}}}},
                {"query": "ea9aad8c98a6b21ee6d510fb765a6522",
                 "variables": {"locationId": locationId, "offset": (page - 1) * 10,
                               "filters": [{"axis": "LANGUAGE", "selections": ["en"]}],
                               "prefs": None, "initialPrefs": {}, "limit": 10,
                               "filterCacheKey": "locationReviewFilters_10810215",
                               "prefsCacheKey": "locationReviewPrefs_10810215",
                               "needKeywords": False,
                               "keywordVariant": "location_keywords_v2_llr_order_30_en"}},
                {"query": "dd297ef79164a42dba1997b10f33d055",
                 "variables": {"locationId": locationId, "application": "HOTEL_DETAIL",
                               "currencyCode": "HKD", "pricingMode": "BASE_RATE",
                               "sessionId": "F61F132D22034DC242255F44CFE7A54C",
                               "pageviewUid": "b3ad9a52-d1c6-4bbe-8eae-a19f04fd67ff",
                               "travelInfo": {"adults": 2, "rooms": 1, "checkInDate": "2023-04-18",
                                              "checkOutDate": "2023-04-19", "childAgesPerRoom": [],
                                              "usedDefaultDates": False},
                               "requestNumber": 2, "filters": None,
                               "route": {"page": "Hotel_Review",
                                         "params": {"detailId": locationId, "geoId": geoId,
                                                    "offset": "r10"}}}}]
        # the POST body must be JSON
        data = json.dumps(data)
        headers = {
            "user-agent": USER_AGENT,  # fill in your own user-agent string
            "content-type": "application/json; charset=UTF-8",
            "origin": "https://www.tripadvisor.com",
            "x-requested-by": "TNI1625!ADcEQn+9K+sw7mHgZbsyGI2UftS4iOyyNcdidQPAc+vtAMBvJBsHrS9UBz+Q8f+v5FCuxfo8nOBnILfs1y6pgcNquOYSBOwj3GtzKolFduhNO6O8lTRGC4Eyiv2wQEKhghYw3/0e4t12H4q6zCgiTy3gXUu6p6bZ6FOT8OyQCRVH",
        }
        try:
            response = requests.post(post_url, headers=headers, data=data, timeout=30)
        except:
            time.sleep(3)
            print('Retrying request')
            response = requests.post(post_url, headers=headers, data=data, timeout=30)
        # control characters that openpyxl refuses to write
        ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
        datas = response.json()[1]["data"]["locations"][0]['reviewListPage']['reviews']
        for data in datas:
            item = {}
            item['pages'] = pages
            item['page'] = page
            # hotel name
            item['hotel'] = response.json()[1]["data"]["locations"][0]["name"]
            # city id
            item['geoId'] = geoId
            # city the hotel is in
            item['city'] = data['location']['additionalNames']['geo']
            # hotel star rating
            item['star'] = star
            # reviewer id
            try:
                item['displayName'] = data['userProfile']['displayName']
            except:
                item['displayName'] = 'None'
            # strip illegal characters
            item['displayName'] = ILLEGAL_CHARACTERS_RE.sub(r'', item['displayName'])
            # reviewer's home location
            try:
                address = data["userProfile"]["hometown"]["location"]
                if address != None:
                    item['address'] = address['additionalNames']['long']
                    # strip illegal characters
                    item['address'] = ILLEGAL_CHARACTERS_RE.sub(r'', item['address'])
                else:
                    item['address'] = 'None'
            except:
                item['address'] = 'None'
            # reviewer's total review count and total helpful votes
            userProfile = data['userProfile']["contributionCounts"]
            if userProfile != None:
                # total reviews contributed by the reviewer
                item['contribution'] = userProfile['sumAllUgc']
                # total helpful votes received by the reviewer
                item['helpfulVotes'] = userProfile['helpfulVote']
            else:
                item['contribution'] = 'None'
                item['helpfulVotes'] = 'None'
            # helpful votes received by this review
            item['helpVote'] = data['helpfulVotes']
            # review date
            item['publishedDate'] = data['publishedDate']
            # stay date and trip type
            tripInfo = data['tripInfo']
            if tripInfo != None:
                item['stayDate'] = tripInfo['stayDate']
                item['tripType'] = tripInfo['tripType']
            else:
                item['stayDate'] = 'None'
                item['tripType'] = 'None'
            # overall rating
            item['rating'] = data["rating"]
            # per-aspect ratings: value, location, service, rooms, cleanliness, sleep quality
            item['value'] = 'None'
            item['location'] = 'None'
            item['service'] = 'None'
            item['rooms'] = 'None'
            item['cleanliness'] = 'None'
            item['sleepQuality'] = 'None'
            additionalRatings = data['additionalRatings']
            if additionalRatings != []:
                for rating in additionalRatings:
                    if rating["ratingLabel"] == "Value":
                        item['value'] = rating["rating"]
                    elif rating["ratingLabel"] == "Location":
                        item['location'] = rating["rating"]
                    elif rating["ratingLabel"] == "Service":
                        item['service'] = rating["rating"]
                    elif rating["ratingLabel"] == "Rooms":
                        item['rooms'] = rating["rating"]
                    elif rating["ratingLabel"] == "Cleanliness":
                        item['cleanliness'] = rating["rating"]
                    elif rating["ratingLabel"] == "Sleep Quality":
                        item['sleepQuality'] = rating["rating"]
            # number of photos attached to the review
            item['imgNum'] = len(data['photoIds'])
            # review text
            item['comment'] = data['text']
            item['comment'] = ILLEGAL_CHARACTERS_RE.sub(r'', item['comment'])
            line = [item['hotel'], item['star'], item['city'], item['displayName'],
                    item['address'], item['contribution'], item['helpfulVotes'],
                    item['helpVote'], item['publishedDate'], item['stayDate'],
                    item['tripType'], item['rating'], item['value'], item['location'],
                    item['service'], item['rooms'], item['cleanliness'],
                    item['sleepQuality'], item['imgNum'], item['comment']]
            reviewsList.append(line)
        print(item['hotel'] + ' reviews page ' + str(page))
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print('Request timed out, retrying')
        getComment(geoId, locationId, page, star)
    except:
        print('Request failed')
        # record the failed hotel and page so they can be retried later
        requestsList.append([geoId, locationId, page])


# save one hotel's reviews into the folder for its star rating
def storage(header, geoId, reviewsList, star, hotel):
    city = 'London'
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.title = "commentInfo"
    sheet.append(header)
    for reviewList in reviewsList:
        sheet.append(reviewList)
    foldername = f'./data/{city}/{star}-star'
    if not os.path.exists(foldername):
        os.makedirs(foldername)
    wb.save(f'./data/{city}/{star}-star/{hotel}.xlsx')


if __name__ == "__main__":
    try:
        # read the hotel URLs, star ratings and review counts from star.xlsx
        wb = openpyxl.load_workbook('star.xlsx')
        sheet = wb[wb.sheetnames[0]]
        rows = sheet.max_row
        cols = sheet.max_column
        # requests that failed (hotel and page), for retrying later
        requestsList = []
        # hotels whose data failed to save, for debugging
        storeList = []
        for i in range(1, rows + 1):
            cellValue = sheet.cell(row=i, column=2).value
            idList = re.findall(r'\d+', cellValue)
            # city id
            geoId = idList[0]
            # hotel id
            locationId = idList[1]
            # hotel star rating
            star = sheet.cell(row=i, column=3).value
            # number of English reviews
            reviews = int(sheet.cell(row=i, column=4).value)
            # limit how many hotels are crawled: skip hotels with fewer
            # than 1000 English reviews
            if reviews < 1000:
                continue
            # number of review pages (10 reviews per page)
            pages = int(reviews / 10) + 1
            taskList = []
            reviewsList = []
            # coroutines speed the crawl up enormously
            for page in range(1, pages + 1):
                task = gevent.spawn(getComment, geoId, locationId, page, star)
                taskList.append(task)
            gevent.joinall(taskList)
            # save the results
            header = ['Hotel name', 'Star rating', 'City', 'Reviewer id',
                      'Reviewer location', 'Reviewer contributions',
                      'Reviewer helpful votes', 'Helpful votes on this review',
                      'Review date', 'Date of stay', 'Trip type',
                      'Overall rating', 'Value', 'Location', 'Service',
                      'Rooms', 'Cleanliness', 'Sleep Quality',
                      'Photo count', 'Review text']
            try:
                # only save if any reviews actually came back
                if reviewsList != []:
                    # use the hotel name (with '/' replaced) as the file name
                    storage(header, geoId, reviewsList, star,
                            reviewsList[0][0].replace('/', '-'))
                else:
                    storeList.append([locationId])
            except OSError:
                # the hotel name may not be a legal file name; fall back to the geoId
                if reviewsList != []:
                    storage(header, geoId, reviewsList, star, geoId)
                else:
                    print(str(geoId) + ' failed to save')
                    storeList.append([locationId])
            except:
                storeList.append([locationId])
    except:
        print(requestsList)
        print(storeList)
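Since requestsList records exactly which pages failed, a follow-up pass can re-fetch them after the main run. A minimal sketch, assuming each entry is extended to carry the star rating as a fourth field (the original code stores only geoId, locationId and page):

# Retry sketch: re-request the pages recorded in requestsList.
retryList = requestsList[:]
requestsList = []  # getComment re-appends anything that fails again
for geoId, locationId, page, star in retryList:
    getComment(geoId, locationId, page, star)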
The original plan was to implement everything with Scrapy, but when fetching the review data the Scrapy version never achieved real concurrency, making the crawl slow (roughly 10,000 reviews per hour), and I could not find a fix at the time. Switching to the requests library with gevent coroutines brought a large speedup (3,285,780 review records in about 7 hours). I will optimize the Scrapy version once I know the framework better.
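For reference, Scrapy's concurrency is normally tuned through settings rather than spider code; these are standard Scrapy settings, though whether they resolve the slowdown seen here is untested:

# settings.py: standard Scrapy concurrency knobs (values are illustrative)
CONCURRENT_REQUESTS = 32            # parallel requests overall
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # parallel requests per domain
DOWNLOAD_DELAY = 0                  # no artificial delay between requests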
Beyond that, there are further directions in which this could be optimized: