
[Web Scraping] An XPath Example


The assignment asks us to scrape a website with XPath and save the results as a CSV file.
The code below is for reference only.

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat official account: 萌狼蓝天
# Written: 2022/10/5
import pandas as pd
import requests
import lxml.html

csv_data = pd.DataFrame(columns=["No.", "Title", "Link", "Author", "Clicks", "Replies", "Last Updated"])
# Fetch the page source
headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; Tablet PC 2.0; wbx 1.0.0; wbxapp 1.0.0; Zoom 3.6.0)",
    "X-Amzn-Trace-Id": "Root=1-628b672d-4d6de7f34d15a77960784504"
}
code = requests.get("http://bbs.tianya.cn/list-no02-1.shtml", headers=headers).content.decode("utf-8")
print("------------------------------- page source fetched -------------------------------")
# print(code)
selector = lxml.html.fromstring(code)
print("------------------------------- key section extracted -------------------------------")
lists = selector.xpath('//div[@class="mt5"]/table')
print("------------------------------- individual rows -------------------------------")
print(len(lists))
rows = []
for i in lists:
    x = 0
    for j in range(2, 9):
        for c in range(1, 11):
            x += 1
            # Use a path relative to the current table (no leading "//");
            # an absolute path would re-match the entire document on every table.
            base = 'tbody[' + str(j) + ']/tr[' + str(c) + ']/'
            hit = i.xpath(base + 'td[1]/a/text()')
            if not hit:
                continue  # this tbody has fewer rows than expected
            title = hit[0].replace("\t", "").replace("\r", "").replace("\n", "")
            link = i.xpath(base + 'td[1]/a')[0].attrib['href'].replace("\t", "")
            author = i.xpath(base + 'td[2]/a/text()')[0].replace("\t", "")
            click = i.xpath(base + 'td[3]/text()')[0].replace("\t", "")
            reply = i.xpath(base + 'td[4]/text()')[0].replace("\t", "")
            reply_time = i.xpath(base + 'td[5]/text()')[0].replace("\t", "")
            rows.append({"No.": x, "Title": title, "Link": 'http://bbs.tianya.cn/' + link,
                         "Author": author, "Clicks": click, "Replies": reply,
                         "Last Updated": reply_time})
            print(title, link, author)
# DataFrame.append was removed in pandas 2.0; collect the rows first and concat once.
csv_data = pd.concat([csv_data, pd.DataFrame(rows)], ignore_index=True)
print(csv_data)
csv_data.to_csv("result.csv", index=False)
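
The hard-coded `tbody[j]/tr[c]` indices above break whenever a page has a different number of rows. A more robust pattern is to iterate the `<tr>` elements directly and read each cell by position. The sketch below demonstrates this on a small inline HTML fragment that mimics the forum's list structure (the fragment, its links, and its cell values are made up for illustration, not taken from the real site):

```python
import lxml.html
import pandas as pd

# Hypothetical fragment shaped like the forum's listing: one table,
# each row = title link / author / clicks / replies / last-updated.
html = """
<div class="mt5"><table><tbody>
<tr><td><a href="/post-1.shtml">First post</a></td><td><a>alice</a></td>
    <td>10</td><td>2</td><td>10-05 12:00</td></tr>
<tr><td><a href="/post-2.shtml">Second post</a></td><td><a>bob</a></td>
    <td>7</td><td>0</td><td>10-05 13:00</td></tr>
</tbody></table></div>
"""
doc = lxml.html.fromstring(html)

rows = []
# Iterate every <tr> under the table instead of hard-coding tbody/tr indices,
# so row counts can vary from page to page without raising IndexError.
for tr in doc.xpath('//div[@class="mt5"]/table//tr'):
    cells = tr.xpath('td')
    if len(cells) < 5:
        continue  # skip header or separator rows
    a = cells[0].find('a')
    rows.append({
        "Title": a.text_content().strip(),
        "Link": a.get('href'),
        "Author": cells[1].text_content().strip(),
        "Clicks": cells[2].text_content().strip(),
        "Replies": cells[3].text_content().strip(),
        "Last Updated": cells[4].text_content().strip(),
    })

df = pd.DataFrame(rows)
print(df)
```

The same `df.to_csv("result.csv", index=False)` call then works unchanged, and the scraper degrades gracefully when a `tbody` is shorter than expected.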

Previous posts

[Web Scraping] Simple scraper examples (three approaches) using requests, urllib, bs4, and re
