
Python crawler: scraping a whole novel with XPath (a beginner's crawler tutorial)

Fetching the chapter list with lxml.etree and XPath:
import time

import requests
from lxml import etree

'''
Approach:
1. Decide which novel to scrape and find its entry URL.
2. Scrape the chapter links and build every chapter-detail URL by joining them onto the index URL.
3. Scrape the book title.
4. Scrape each chapter's title and body text.
5. Write the chapters in order into a single .txt file.
'''

# Request headers
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
url = 'http://www.biquge.info/84_84283/'

def get_html(url):
    # Fetch the page
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    # Parse it and return the element tree
    return etree.HTML(html.text)

# Collect the chapter links from the table of contents
def get_list(url):
    soup = get_html(url)
    # Find every chapter's (relative) href
    list_box = soup.xpath('//*[@id="list"]/dl/dd/a/@href')
    book_lists = []
    for i in list_box:
        # Join each relative href onto the index URL
        book_lists.append(url + i)
    return book_lists

# Get the book title
def get_book_title(url):
    soup = get_html(url)
    book_title = soup.xpath('//*[@id="info"]/h1/text()')
    # xpath() returns a list: take the first match instead of str()-ing
    # the whole list, which would produce a filename like "['title']"
    return book_title[0].strip() if book_title else 'novel'

# Get a chapter page's title
def get_title(url):
    soup = get_html(url)
    title = soup.xpath('//*[@id="wrapper"]/div[4]/div/div[2]/h1/text()')
    # Join the text nodes and strip whitespace so no stray markup
    # debris ends up in the file
    return ''.join(title).strip()

# Get a chapter page's body text
def get_novel_content(url):
    soup = get_html(url)
    # The body is a series of text nodes inside #content
    content = soup.xpath('//*[@id="content"]/text()')
    return content

# Save the whole book locally
def save_novel(url):
    book_lists = get_list(url)
    book_title = get_book_title(url)
    num = 1
    # 'w' so a re-run overwrites the file instead of appending duplicates
    with open(book_title + '.txt', 'w', encoding='utf-8') as f:
        for list_url in book_lists:
            f.write(get_title(list_url) + '\n')
            for c in get_novel_content(list_url):
                f.write(c + '\n')
            time.sleep(2)  # be polite to the server between requests
            print('*** Chapter {} downloaded ***'.format(num))
            num += 1
        # No explicit f.close() needed: the with-block closes the file

if __name__ == '__main__':
    save_novel(url)
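A note on step 2 of the approach: plain string concatenation (`url + i`) only works while the site emits hrefs that are relative to the index page. A sketch of a more robust way to build the chapter URLs, using the standard library's `urljoin` (the helper name `build_chapter_urls` is my own, not from the original post):

```python
from urllib.parse import urljoin

base = 'http://www.biquge.info/84_84283/'

def build_chapter_urls(base_url, hrefs):
    # urljoin resolves relative paths, './x.html' and absolute '/path'
    # hrefs correctly, all of which plain concatenation can get wrong
    return [urljoin(base_url, h) for h in hrefs]

print(build_chapter_urls(base, ['123.html', '/84_84283/124.html']))
```

Both hrefs resolve to full URLs under the site root, so the scraper keeps working even if the site switches between relative and absolute links.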
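The title-cleanup issue the original code ran into comes from how `text()` behaves: an XPath `text()` query returns a list of text nodes, not a single string, and elements like `<br/>` split the body into several nodes. A minimal offline sketch (the markup below is a made-up stand-in for a chapter page, not the real site's HTML):

```python
from lxml import etree

# Hypothetical stand-in for a chapter page, for illustration only
html = '''
<div id="wrapper"><h1> Chapter 1: The Beginning </h1></div>
<div id="content">Line one.<br/>Line two.</div>
'''

tree = etree.HTML(html)
# text() matches come back as a LIST of text nodes
title_nodes = tree.xpath('//div[@id="wrapper"]/h1/text()')
title = ''.join(title_nodes).strip()
content = tree.xpath('//div[@id="content"]/text()')
print(title)    # Chapter 1: The Beginning
print(content)  # ['Line one.', 'Line two.']
```

This is why the scraper joins and strips the title nodes before writing, and why the body is written out node by node.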

 

Reference: a video on Bilibili, plus this WeChat article: https://mp.weixin.qq.com/s?__biz=MzIxOTcyMDM4OQ==&mid=2247483927&idx=1&sn=d4c9fcb6becc3e1d26a8d8385d8c2b99&chksm=97d7bdbda0a034ab3faf0f30ed50a1e35a0a9edcceb9b2ae9a0a6c7e4efd72a64cde07df439f&token=1524452913&lang=zh_CN#rd

A more elegant implementation, pleasant to read and with a clear approach:

https://blog.csdn.net/sinat_34937826/article/details/105562463?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-10.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-10.nonecase

 
