
Scraping Ctrip Attractions and Saving to Excel


Hey everyone, the May Day holiday is almost here. Have you decided where to go yet? If not, this one's for you: today we'll scrape attraction data from Ctrip so you can pick a destination with ease.

A good day starts with packet capture. Click the last request in the list, check that its data matches what the page shows, then copy the request URL from the Headers tab. First, import the libraries we'll need:

    import requests
    import json
    from openpyxl import workbook

Next, send the request. This API is a POST and expects the JSON payload (built below) as the request body. Since the response is JSON rather than HTML, use .json() instead of .text:

    response = requests.post(self.url, headers=self.headers, data=payload).json()
    print(response)
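For anyone new to JSON responses: `.json()` simply deserializes the response body with the standard `json` module. A tiny offline illustration (the sample string below is hand-written to mimic the response shape, not real API output):

```python
import json

# hand-written sample mimicking the API's response shape (not real data)
raw = '{"attractionList": [{"card": {"poiName": "世界之窗", "commentScore": 4.7}}]}'

data = json.loads(raw)  # this is essentially what response.json() does
card = data['attractionList'][0]['card']
print(card['poiName'], card['commentScore'])  # 世界之窗 4.7
```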

Then comes parsing. Since it's JSON, the approach is much the same as in the previous posts, so I won't dwell on it:

    result_list = response['attractionList']
    for result in result_list:
        city = result['card']['districtName']      # city
        place = result['card']['poiName']          # attraction name
        status = result['card']['openStatus']      # open status
        score = result['card']['commentScore']     # rating
        tickets = result['card']['priceTypeDesc']  # ticket info
        distance = result['card']['distanceStr']   # distance from city center
        url = result['card']['detailUrl']          # detail page URL
        print(city)
        print(place)
        print(status)
        print(score)
        print(tickets)
        print(distance)
        print(url)
        print('===' * 30)
        # print(f'景区{place}的信息收集完毕!!!')
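Not every attraction card is guaranteed to carry every field, and a single missing key would crash the loop above with a KeyError. A defensive sketch using `dict.get` with defaults (the field names come from the real response; the helper name `parse_card` and the defaults are my own):

```python
def parse_card(result):
    # pull the nested 'card' dict, falling back to empty values for missing keys
    card = result.get('card', {})
    return {
        'city': card.get('districtName', ''),
        'place': card.get('poiName', ''),
        'status': card.get('openStatus', ''),
        'score': card.get('commentScore', 0),
        'tickets': card.get('priceTypeDesc', ''),
        'distance': card.get('distanceStr', ''),
        'url': card.get('detailUrl', ''),
    }

# sample record with priceTypeDesc deliberately absent
sample = {'card': {'districtName': '深圳', 'poiName': '世界之窗', 'commentScore': 4.7}}
row = parse_card(sample)
print(row['tickets'])  # empty string instead of a KeyError
```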

Once parsing is done, the results need to be saved. Here I chose Excel:

    self.wb = workbook.Workbook()  # create the Excel workbook
    self.ws = self.wb.active       # activate the current sheet
    # add the header row
    self.ws.append(['城市', '景区', '状态', '评分', '门票', '距离市中心', '详情链接'])

We also need a save method:

    def save(self, city, place, status, score, tickets, distance, url):
        # append one row of data to the sheet
        mylist = [city, place, status, score, tickets, distance, url]
        self.ws.append(mylist)
        # write the workbook to disk
        self.wb.save('携程.xlsx')
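One design note: save() rewrites the whole workbook to disk after every single row, which means 2000 full file writes for 2000 rows. A sketch of an alternative that appends rows in memory and saves once at the end (the class and method names `ExcelSink`, `collect`, and `flush` are mine, not from the original code):

```python
from openpyxl import workbook

class ExcelSink:
    def __init__(self, path):
        self.path = path
        self.wb = workbook.Workbook()
        self.ws = self.wb.active
        # same header row as in the article
        self.ws.append(['城市', '景区', '状态', '评分', '门票', '距离市中心', '详情链接'])

    def collect(self, row):
        # append in memory only; no disk I/O here
        self.ws.append(row)

    def flush(self):
        # one write at the very end of the crawl
        self.wb.save(self.path)

sink = ExcelSink('携程.xlsx')
sink.collect(['深圳', '世界之窗', '开园中', 4.7, '收费', '约10km', 'https://example.com'])
print(sink.ws.max_row)  # 2: header row plus one data row
```

Call `sink.flush()` once after the page loop finishes instead of saving inside the row loop.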

That's only one page of data, so next let's set up pagination:

    for i in range(1, 101):
        print('当前下载第{}页'.format(i))
        payload = json.dumps({
            "index": f'{i}',
            "count": 20,
            "sortType": 1,
            "isShowAggregation": True,
            "districtId": self.CityId,  # city ID
            "scene": "DISTRICT",
            "pageId": "214062",
            "traceId": "f33070fa-82a6-6d22-2d18-164f0af07734",
            "extension": [
                {"name": "osVersion", "value": "10.3.1"},
                {"name": "deviceType", "value": "ios"}
            ],
            "filter": {"filterItems": []},
            "crnVersion": "2020-09-01 22:00:45",
            "isInitialState": True,
            "head": {
                "cid": "09031047314318028828",
                "ctok": "",
                "cver": "1.0",
                "lang": "01",
                "sid": "8888",
                "syscode": "09",
                "auth": "",
                "xsid": "",
                "extension": []
            }
        })

Pagination is controlled through the request payload. I don't know exactly how many pages there are, so I set it to 100. The payload also contains a city ID (districtId), which is the key to scraping other cities: pass in Beijing's ID and you get Beijing's attractions, and likewise for any other city. So far I only know the IDs for Beijing, Shanghai, Guangzhou, and Shenzhen: 1, 2, 152, and 26 respectively. If you want data for another city, name it in the comments and I'll find its ID for you.
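To avoid remembering raw IDs, the four known mappings can be wrapped in a small lookup. A sketch (the dict holds exactly the IDs listed above; any other city would need its ID found first):

```python
# the four city IDs mentioned above
CITY_IDS = {'北京': 1, '上海': 2, '广州': 152, '深圳': 26}

def district_id(city_name):
    # fail loudly for cities whose ID we haven't found yet
    if city_name not in CITY_IDS:
        raise KeyError(f'unknown city: {city_name}; look up its districtId first')
    return CITY_IDS[city_name]

print(district_id('深圳'))  # 26
```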

The full code:

    # -*- encoding: utf-8 -*-
    import requests
    import json
    from openpyxl import workbook

    """
    City IDs: {上海: 2, 北京: 1, 广州: 152, 深圳: 26}
    """

    class XcSpider(object):
        def __init__(self):
            self.CityId = input('请输入城市ID:')
            self.url = "https://m.ctrip.com/restapi/soa2/18254/json/getAttractionList?_fxpcqlniredt=09031047314318028828&x-traceID=09031047314318028828-1646054807738-9064633"
            self.headers = {
                'content-type': 'application/json',
                'origin': 'https://m.ctrip.com',
                'referer': 'https://m.ctrip.com/webapp/you/gspoi/sight/2.html?seo=0&allianceid=4897&sid=155952&isHideNavBar=YES&from=https%3A%2F%2Fm.ctrip.com%2Fwebapp%2Fyou%2Fgsdestination%2Fplace%2F2.html%3Fseo%3D0%26ishideheader%3Dtrue%26secondwakeup%3Dtrue%26dpclickjump%3Dtrue%26allianceid%3D4897%26sid%3D155952%26ouid%3Dindex%26from%3Dhttps%253A%252F%252Fm.ctrip.com%252Fhtml5%252F',
                'accept-language': 'zh-CN,zh;q=0.9',
                'cookie': 'ibulanguage=CN; ibulocale=zh_cn; cookiePricesDisplayed=CNY; _gcl_au=1.1.2001712708.1646054591; _RF1=223.104.63.214; _RGUID=0731b0f7-45b5-4666-9828-888744fb269f; _RSG=cPKj5TFinS0VQo.4T8YeW9; _RDG=2868710522b1702c43085468305d1ce8b8; _bfaStatusPVSend=1; MKT_CKID=1646054594542.yi12k.1t3u; MKT_CKID_LMT=1646054594543; _ga=GA1.2.333705235.1646054595; _gid=GA1.2.1046662294.1646054595; appFloatCnt=2; nfes_isSupportWebP=1; GUID=09031047314318028828; nfes_isSupportWebP=1; MKT_Pagesource=H5; _bfs=1.4; _jzqco=%7C%7C%7C%7C1646054602232%7C1.1650478479.1646054594536.1646054655182.1646054672431.1646054655182.1646054672431.0.0.0.4.4; __zpspc=9.2.1646054672.1646054672.1%232%7Cwww.baidu.com%7C%7C%7C%25E6%2590%25BA%25E7%25A8%258B%7C%23; _bfi=p1%3D100101991%26p2%3D100101991%26v1%3D5%26v2%3D4; _bfaStatus=success; mktDpLinkSource=ullink; librauuid=MTPpuP1M6AmQCSUc; ibu_h5_lang=en; ibu_h5_local=en-us; _pd=%7B%22r%22%3A12%2C%22d%22%3A259%2C%22_d%22%3A247%2C%22p%22%3A260%2C%22_p%22%3A1%2C%22o%22%3A263%2C%22_o%22%3A3%2C%22s%22%3A263%2C%22_s%22%3A0%7D; Union=OUID=&AllianceID=4897&SID=155952&SourceID=&AppID=&OpenID=&exmktID=&createtime=1646054807&Expires=1646659606764; MKT_OrderClick=ASID=4897155952&AID=4897&CSID=155952&OUID=&CT=1646054806768&CURL=https%3A%2F%2Fm.ctrip.com%2Fwebapp%2Fyou%2Fgspoi%2Fsight%2F2.html%3Fseo%3D0%26allianceid%3D4897%26sid%3D155952%26isHideNavBar%3DYES%26from%3Dhttps%253A%252F%252Fm.ctrip.com%252Fwebapp%252Fyou%252Fgsdestination%252Fplace%252F2.html%253Fseo%253D0%2526ishideheader%253Dtrue%2526secondwakeup%253Dtrue%2526dpclickjump%253Dtrue%2526allianceid%253D4897%2526sid%253D155952%2526ouid%253Dindex%2526from%253Dhttps%25253A%25252F%25252Fm.ctrip.com%25252Fhtml5%25252F&VAL={"h5_vid":"1646054589723.2rr0y3"}; _bfa=1.1646054589723.2rr0y3.1.1646054589723.1646054806818.1.10.214062'
            }
            self.wb = workbook.Workbook()  # create the Excel workbook
            self.ws = self.wb.active       # activate the current sheet
            # add the header row
            self.ws.append(['城市', '景区', '状态', '评分', '门票', '距离市中心', '详情链接'])

        def get_data(self):
            for i in range(1, 101):
                print('当前下载第{}页'.format(i))
                payload = json.dumps({
                    "index": f'{i}',
                    "count": 20,
                    "sortType": 1,
                    "isShowAggregation": True,
                    "districtId": self.CityId,  # city ID
                    "scene": "DISTRICT",
                    "pageId": "214062",
                    "traceId": "f33070fa-82a6-6d22-2d18-164f0af07734",
                    "extension": [
                        {"name": "osVersion", "value": "10.3.1"},
                        {"name": "deviceType", "value": "ios"}
                    ],
                    "filter": {"filterItems": []},
                    "crnVersion": "2020-09-01 22:00:45",
                    "isInitialState": True,
                    "head": {
                        "cid": "09031047314318028828",
                        "ctok": "",
                        "cver": "1.0",
                        "lang": "01",
                        "sid": "8888",
                        "syscode": "09",
                        "auth": "",
                        "xsid": "",
                        "extension": []
                    }
                })
                response = requests.post(self.url, headers=self.headers, data=payload).json()
                # print(response)
                self.parse(response)

        def parse(self, response):
            result_list = response['attractionList']
            for result in result_list:
                city = result['card']['districtName']      # city
                place = result['card']['poiName']          # attraction name
                status = result['card']['openStatus']      # open status
                score = result['card']['commentScore']     # rating
                tickets = result['card']['priceTypeDesc']  # ticket info
                distance = result['card']['distanceStr']   # distance from city center
                url = result['card']['detailUrl']          # detail page URL
                print(city)
                print(place)
                print(status)
                print(score)
                print(tickets)
                print(distance)
                print(url)
                print('===' * 30)
                # print(f'景区{place}的信息收集完毕!!!')
                self.save(city, place, status, score, tickets, distance, url)

        def save(self, city, place, status, score, tickets, distance, url):
            # append one row of data to the sheet
            mylist = [city, place, status, score, tickets, distance, url]
            self.ws.append(mylist)
            # write the workbook to disk
            self.wb.save('携程.xlsx')

    if __name__ == '__main__':
        x = XcSpider()
        x.get_data()

This is the Shenzhen data I scraped: each page holds 20 attractions, so 100 pages comes to exactly 2000 rows. OK, that's it for this post; more fun and interesting crawlers are on the way.
