First, make sure the requests and beautifulsoup4 libraries are installed. If they are not, run the following command in cmd.
pip install requests beautifulsoup4
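
To verify the install, a quick sanity check (a minimal sketch; both packages expose a standard __version__ attribute):

import requests
import bs4

# Print the installed versions to confirm both imports work
print(requests.__version__, bs4.__version__)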
The full implementation is as follows.
import os
import re
import requests
from bs4 import BeautifulSoup

# Novel homepage URL
base_url = "https://www.bqka.cc"
novel_url = base_url + "/book/3315/"  # change 3315 to scrape a different novel

# A browser-like User-Agent; some sites reject the default requests UA
headers = {"User-Agent": "Mozilla/5.0"}

# Fetch the novel homepage
response = requests.get(novel_url, headers=headers, timeout=10)
response.encoding = 'utf-8'
html_content = response.text

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find the div that holds the chapter list
listmain_div = soup.find('div', class_='listmain')
if listmain_div:
    # Collect all chapter links
    chapter_links = listmain_div.find_all('a')

    # Extract chapter titles and URLs
    chapters = []
    for link in chapter_links:
        title = link.text.strip()
        href = link.get('href', '')

        # Skip javascript pseudo-links (e.g. the in-page "next" button)
        if href and not href.startswith("javascript"):
            full_url = base_url + href
            chapters.append({"title": title, "url": full_url})

    # Create the directory that will hold the chapter files
    if not os.path.exists("novel"):
        os.makedirs("novel")

    # Fetch each chapter and save it to a txt file
    for chapter in chapters:
        chapter_title = chapter['title']
        chapter_url = chapter['url']

        # Fetch the chapter page
        chapter_response = requests.get(chapter_url, headers=headers, timeout=10)
        chapter_response.encoding = 'utf-8'
        chapter_html_content = chapter_response.text

        # Parse the chapter page
        chapter_soup = BeautifulSoup(chapter_html_content, 'html.parser')

        # Find the div that holds the chapter body
        chapter_content_div = chapter_soup.find('div', id='chaptercontent')
        if chapter_content_div:
            # Extract the chapter text
            chapter_content = chapter_content_div.get_text(separator="\n", strip=True)

            # Strip characters that are illegal in file names before saving
            safe_title = re.sub(r'[\\/:*?"<>|]', '_', chapter_title)
            file_name = f"novel/{safe_title}.txt"
            with open(file_name, 'w', encoding='utf-8') as file:
                file.write(chapter_title + "\n\n")
                file.write(chapter_content)

            print(f"Saved chapter: {chapter_title}")
        else:
            print(f"Chapter body not found: {chapter_title}")
else:
    print("Chapter list not found")
When run, the script prints a confirmation line for each saved chapter and writes the files into the novel directory.
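
Once every chapter is on disk, the per-chapter files can be merged into a single text file. A minimal sketch, assuming the novel directory created above (the output name novel_full.txt is arbitrary); note that sorted() orders files by name, which may not match reading order unless the titles sort naturally:

import os

# Concatenate every chapter file in novel/ into one file
with open("novel_full.txt", "w", encoding="utf-8") as out:
    for name in sorted(os.listdir("novel")):
        if name.endswith(".txt"):
            with open(os.path.join("novel", name), encoding="utf-8") as f:
                out.write(f.read() + "\n\n")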