
Scraping Ctrip Travel Notes with Python


Introduction

This article presents Python crawler code for scraping travel-note articles from the Ctrip travel site. With it, you can extract each note's title, body text, trip length (days), time of travel, travel companions, and activities.

Importing the required libraries

First, import the required libraries: re, BeautifulSoup, bag, easygui, webbrowser, and time. re handles regular-expression matching, BeautifulSoup parses the HTML pages, bag (a third-party helper library) creates the session, easygui displays error dialogs, webbrowser opens the CAPTCHA verification page, and time throttles requests.

import re
from bs4 import BeautifulSoup
import bag
import easygui
import webbrowser
import time

Create a session and set the request headers and cookies

session = bag.session.create_session()
session.headers['Referer'] = r'https://you.ctrip.com/travels/'
session.get(r'https://you.ctrip.com/TravelSite/Home/IndexTravelListHtml?p=2&Idea=0&Type=100&Plate=0')
session.cookies[''] = r"你的cookies"  # paste your own cookies here
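
Since `bag` is a small third-party helper that may not be installed, the same session setup can be sketched with the standard `requests` library. This is a hedged equivalent, not the author's code; it assumes `bag.session.create_session()` behaves like a thin wrapper around `requests.Session`, and the cookie name/value are placeholders:

```python
import requests

# Assumed equivalent of bag.session.create_session()
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'  # a plausible default, adjust as needed
session.headers['Referer'] = 'https://you.ctrip.com/travels/'
# Cookies copied from a logged-in browser session would be set like this
# ('cookie_name'/'cookie_value' are placeholders, not real Ctrip cookie names):
session.cookies.set('cookie_name', 'cookie_value')
```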

Define the main() function, which calls catch() and prints the result

def main():
    resp = catch(r'https://you.ctrip.com/travels/qinhuangdao132/4011655.html')
    print(resp)

Define the show_error_message() function, which displays the error and offers a 验证 (verify) button that opens Ctrip's CAPTCHA page

def show_error_message(error):
    choices = ["验证"]
    button_box_choice = easygui.buttonbox(error, choices=choices)
    if button_box_choice == "验证":
        webbrowser.open("https://verify.ctrip.com/static/ctripVerify.html?returnUrl=https%3A%2F%2Fyou.ctrip.com%2Ftravelsite%2Ftravels%2Fshanhaiguan120556%2F3700092.html&bgref=l2j65wlGQDYtmBZjKEoy5w%3D%3D")

Define the catch() function, which parses a travel-note page and extracts the relevant fields

def catch(url) -> list:
    result_ = []
    try:
        resp = session.get(url)
        resp.encoding = 'utf8'
        resp.close()
        time.sleep(2)
        html = BeautifulSoup(resp.text, 'lxml')
        # ... parsing code omitted here; see the full code below
        return result_
    except Exception as e:
        show_error_message(e)
        time.sleep(10)
        return catch(url)  # retry after the CAPTCHA has been solved
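
To illustrate what the regular expressions in the full catch() implementation extract, here is a minimal, self-contained sketch run against a hand-written HTML fragment. The fragment is invented for illustration; real Ctrip pages are larger, but the patterns are the same ones used in the full code:

```python
import re

# Invented fragment mimicking the ctd_content block of a travel-note page
sample = '''<div class="ctd_content">
<span><i class="days"></i>天数:3 天</span>
<span><i class="times"></i>时间:5 月</span>
<p>第一天我们到达了秦皇岛。</p>
<p><b>海边</b>风景很好。</p>
</div>'''

days = re.compile(r'<span><i class="days"></i>天数:(.*?)</span>', re.S)
times = re.compile(r'<span><i class="times"></i>时间:(.*?)</span>', re.S)

print(''.join(re.findall(days, sample)).replace(' ', ''))   # 3天
print(''.join(re.findall(times, sample)).replace(' ', ''))  # 5月
# Paragraph text with inner tags stripped, as in catch():
paras = [re.sub(r'<.*?>', '', p) for p in re.findall(r'<p>(.*?)</p>', sample) if p]
print(paras)
```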

Full code

#!/usr/bin/env python3
# coding:utf-8
import re
from bs4 import BeautifulSoup
import bag
import easygui
import webbrowser
import time

session = bag.session.create_session()
session.headers['Referer'] = r'https://you.ctrip.com/travels/'
session.get(r'https://you.ctrip.com/TravelSite/Home/IndexTravelListHtml?p=2&Idea=0&Type=100&Plate=0')
session.cookies[''] = r"你的cookies"  # paste your own cookies here


# noinspection PyBroadException
def main():
    resp = catch(r'https://you.ctrip.com/travels/qinhuangdao132/4011655.html')
    print(resp)


def show_error_message(error):
    # Show the error and offer a button that opens Ctrip's CAPTCHA page
    choices = ["验证"]
    button_box_choice = easygui.buttonbox(error, choices=choices)
    if button_box_choice == "验证":
        webbrowser.open(
            "https://verify.ctrip.com/static/ctripVerify.html?returnUrl=https%3A%2F%2Fyou.ctrip.com%2Ftravelsite%2Ftravels%2Fshanhaiguan120556%2F3700092.html&bgref=l2j65wlGQDYtmBZjKEoy5w%3D%3D")


def catch(url) -> list:
    result_ = []
    try:
        resp = session.get(url)
        resp.encoding = 'utf8'
        resp.close()
        time.sleep(2)
        # print(resp.text)
        html = BeautifulSoup(resp.text, 'lxml')
        # Title: try the regex first, then fall back to the page header block
        title = re.findall(r'<h1 class="title1">(.*?)</h1>', resp.text)
        if len(title) == 0:
            title = html.find_all('div', class_="ctd_head_left")[0].h2.text
        soup = html.find_all('div', class_="ctd_content")
        # Metadata fields: days, time, companions, activities
        days = re.compile(r'<span><i class="days"></i>天数:(.*?)</span>', re.S)
        times = re.compile(r'<span><i class="times"></i>时间:(.*?)</span>', re.S)
        whos = re.compile(r'<span><i class="whos"></i>和谁:(.*?)</span>', re.S)
        plays = re.compile(r'<span><i class="plays"></i>玩法:(.*?)</span>', re.S)
        content = re.compile(r'<h3>.*?发表于.*?</h3>(.*?)</div>]', re.S)
        # Collect paragraph text, stripping any remaining tags
        mid = []
        for info in re.findall(r'<p>(.*?)</p>', str(re.findall(content, str(soup)))):
            if info == '':
                pass
            else:
                mid.append(re.sub(r'<.*?>', '', info))
        if len(mid) == 0:
            for info in re.findall(r'<p>(.*?)</p>', str(soup)):
                if info == '':
                    pass
                else:
                    mid.append(re.sub(r'<.*?>', '', info))
        result_.append([
            ''.join(title).strip(),
            '\n'.join(mid),
            ''.join(re.findall(days, str(soup))).replace(' ', ''),
            ''.join(re.findall(times, str(soup))).replace(' ', ''),
            ''.join(re.findall(whos, str(soup))),
            ''.join(re.findall(plays, str(soup))),
            url
        ])
        return result_
    except Exception as e:
        show_error_message(e)
        time.sleep(10)
        return catch(url)  # retry after the CAPTCHA has been solved


if __name__ == '__main__':
    main()
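
The list returned by catch() can also be persisted instead of just printed, for example to a CSV file. This is a sketch with an invented example record; the field order matches how result_ is assembled in catch():

```python
import csv

# Invented example record in the same field order as result_ in catch()
record = ['秦皇岛游记', '第一天……', '3天', '5月', '和朋友', '自由行',
          'https://you.ctrip.com/travels/qinhuangdao132/4011655.html']

# utf-8-sig makes the Chinese text open cleanly in Excel
with open('travels.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'content', 'days', 'time', 'who', 'plays', 'url'])
    writer.writerow(record)
```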

Run result

Running the script prints a list of records, each of the form [title, content, days, time, who, plays, url].

Conclusion

If you found this tutorial helpful, please like the post and follow my CSDN account. I will keep bringing you more interesting and practical tutorials and resources. Thanks for your support!
