3.2.2 BeautifulSoup + Requests
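A web crawler, also called a web spider, is a bot that automatically browses the World Wide Web, typically to build a web index. In practice a crawler is an automated program or script that visits web pages and extracts information from them, collecting data from many different sites, pages, and resources. Crawlers are a key component of search engines, data mining, content aggregation, and other information-retrieval tasks.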
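The browse-and-collect loop described above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library, not the method of any particular framework; FAKE_WEB, LinkParser, and crawl are hypothetical names, and the in-memory dictionary stands in for real HTTP requests so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://site.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://site.test/b": '<a href="/c">C</a>',
    "http://site.test/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns the URLs in visit order."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # unknown page: skip it
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


order = crawl("http://site.test/", FAKE_WEB.get)
print(order)
```

The fetch argument is injected on purpose: replacing FAKE_WEB.get with a function that performs a real HTTP request turns the same loop into a working, if very basic, crawler.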
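A crawler framework is a tool or software framework for developing web crawlers. It provides ready-made tools and features that simplify crawler development and speed up data collection. The following is a summary of some common Java and Python crawler frameworks.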
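WebMagic is an open-source Java crawler framework that supports annotations and design patterns, which simplifies the implementation of crawl tasks. Official site: Introduction · WebMagic Documents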
- import us.codecraft.webmagic.Page;
- import us.codecraft.webmagic.Site;
- import us.codecraft.webmagic.Spider;
- import us.codecraft.webmagic.processor.PageProcessor;
- public class MySpider implements PageProcessor {
-     // Site settings: retry count 3, sleep time 1000 ms
-     private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
-     @Override
-     public void process(Page page) {
-         // Crawler logic: extract page content, etc.
-     }
-     @Override
-     public Site getSite() {
-         return site;
-     }
-     public static void main(String[] args) {
-         Spider.create(new MySpider())
-             .addUrl("http://www.example.com")
-             .run();
-     }
- }
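The retry count and sleep time configured on the Site object above correspond to a common crawling pattern: retry a failed request a limited number of times and pause between requests. The sketch below shows that pattern in plain Python. It is an illustration under assumptions, not WebMagic's implementation; polite_fetch_all, flaky_fetch, and their parameters are hypothetical names, and the exact semantics of WebMagic's settings may differ.

```python
import time


def polite_fetch_all(urls, fetch, retry_times=3, sleep_seconds=1.0, sleep=time.sleep):
    """Fetch each URL, retrying on failure and pausing after every request.

    fetch is any callable that returns page content or raises OSError.
    sleep is injectable so the pauses can be observed without waiting.
    """
    results = {}
    for url in urls:
        for attempt in range(retry_times + 1):  # first try plus the retries
            try:
                results[url] = fetch(url)
                break
            except OSError:
                if attempt == retry_times:
                    results[url] = None  # give up after the last retry
            finally:
                sleep(sleep_seconds)  # runs after every attempt, success or not
    return results


# Hypothetical fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"


pauses = []
pages = polite_fetch_all(["http://www.example.com"], flaky_fetch, sleep=pauses.append)
print(pages)
print("pauses:", pauses)
```

Passing sleep=pauses.append records the delays instead of actually waiting, which keeps the demonstration fast; leaving the default in place makes the function really sleep between requests.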

Jsoup is a Java library for parsing HTML documents that provides a jQuery-like API. Official site: jsoup: Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
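- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- import org.jsoup.nodes.Element;
- import org.jsoup.select.Elements;
- import java.io.IOException;
- public class JsoupExample {
-     public static void main(String[] args) {
-         String url = "http://www.example.com";
-         try {
-             // Fetch the page and parse it into a DOM
-             Document document = Jsoup.connect(url).get();
-             // Crawler logic: for example, print the absolute URL of every link
-             Elements links = document.select("a[href]");
-             for (Element link : links) {
-                 System.out.println(link.attr("abs:href"));
-             }
-         } catch (IOException e) {
-             e.printStackTrace();
-         }
-     }
- }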

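Apache HttpClient is a Java library for sending HTTP requests that can be used to write simple web crawlers. Below is a simple crawler implemented with HttpClient. Note that the example uses the 4.x API (the org.apache.http packages), whereas the linked documentation is for version 5.x. Official site: Overview (Apache HttpClient 5.2.3 API)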
- import org.apache.http.HttpEntity;
- import org.apache.http.client.methods.CloseableHttpResponse;
- import org.apache.http.client.methods.HttpGet;
- import org.apache.http.impl.client.CloseableHttpClient;
- import org.apache.http.impl.client.HttpClients;
- import org.apache.http.util.EntityUtils;
- import java.io.IOException;
- public class SimpleHttpClientCrawler {
-     public static void main(String[] args) {
-         // URL to crawl
-         String url = "http://www.example.com";
-         // Create the HTTP GET request
-         HttpGet httpGet = new HttpGet(url);
-         // try-with-resources closes the client and the response automatically;
-         // the plain HttpClient interface has no close() method, so CloseableHttpClient is used
-         try (CloseableHttpClient httpClient = HttpClients.createDefault();
-              CloseableHttpResponse response = httpClient.execute(httpGet)) {
-             // Get the response entity
-             HttpEntity entity = response.getEntity();
-             if (entity != null) {
-                 // Convert the response entity to a string
-                 String content = EntityUtils.toString(entity);
-                 System.out.println(content);
-             }
-         } catch (IOException e) {
-             e.printStackTrace();
-         }
-     }
- }

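Crawler4j is an open-source Java library that provides a simple interface for crawling web pages. It can be used to build a multi-threaded web crawler. Official site: GitHub - yasserg/crawler4j: Open Source Web Crawler for Java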
- import edu.uci.ics.crawler4j.crawler.CrawlConfig;
- import edu.uci.ics.crawler4j.crawler.CrawlController;
- import edu.uci.ics.crawler4j.fetcher.PageFetcher;
- import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
- import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
- public class Controller {
-     public static void main(String[] args) throws Exception {
-         String crawlStorageFolder = "/data/crawl/root";
-         int numberOfCrawlers = 7;
-         CrawlConfig config = new CrawlConfig();
-         config.setCrawlStorageFolder(crawlStorageFolder);
-         // Instantiate the controller for this crawl.
-         PageFetcher pageFetcher = new PageFetcher(config);
-         RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
-         RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
-         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
-         // For each crawl, you need to add some seed URLs. These are the first
-         // URLs that are fetched; the crawler then follows the links
-         // found in these pages.
-         controller.addSeed("https://www.ics.uci.edu/~lopes/");
-         controller.addSeed("https://www.ics.uci.edu/~welling/");
-         controller.addSeed("https://www.ics.uci.edu/");
-         // The factory that creates crawler instances. MyCrawler is assumed to be
-         // your own WebCrawler subclass, defined elsewhere.
-         CrawlController.WebCrawlerFactory<MyCrawler> factory = MyCrawler::new;
-         // Start the crawl. This is a blocking operation, meaning that your code
-         // will reach the line after this only when crawling is finished.
-         controller.start(factory, numberOfCrawlers);
-     }
- }
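Two ideas in the controller above, honoring robots.txt rules and running several crawler workers in parallel, can be illustrated with Python's standard library. This is a rough sketch under assumptions, not a port of Crawler4j: fetch_stub, crawl_concurrently, and ROBOTS_LINES are hypothetical names, the robots rules are parsed from memory rather than downloaded, and one rule set is applied to every URL, whereas a real crawler would keep a separate robots.txt per host.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from memory instead of fetched over HTTP.
ROBOTS_LINES = ["User-agent: *", "Disallow: /private/"]


def fetch_stub(url):
    """Hypothetical stand-in for a real HTTP request."""
    return f"content of {url}"


def crawl_concurrently(urls, fetch, robots_lines, number_of_crawlers=7):
    """Fetch the URLs permitted by robots_lines on a pool of worker threads."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    allowed = [url for url in urls if robots.can_fetch("*", url)]
    # Leaving the with-block waits for all workers, so this call blocks
    # until every allowed page has been fetched.
    with ThreadPoolExecutor(max_workers=number_of_crawlers) as pool:
        bodies = list(pool.map(fetch, allowed))  # map preserves input order
    return dict(zip(allowed, bodies))


seeds = [
    "https://www.ics.uci.edu/~lopes/",
    "https://www.ics.uci.edu/~welling/",
    "https://www.ics.uci.edu/private/report.html",  # hypothetical disallowed URL
]
results = crawl_concurrently(seeds, fetch_stub, ROBOTS_LINES)
for url, body in results.items():
    print(url, "->", body)
```

Threads suit this workload because fetching is dominated by waiting on the network; the worker count plays the same role as numberOfCrawlers in the Java example.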

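HtmlUnit is a Java library that simulates browser behavior and can be used to crawl dynamic web pages. It models HTML documents and provides an API that lets you load pages, fill in forms, click links, and so on, just as you would in a "normal" browser. It has fairly good JavaScript support (which is constantly improving) and can work even with quite complex AJAX libraries, simulating Chrome, Firefox, or Internet Explorer depending on the configuration used. Official site: HtmlUnit – Welcome to HtmlUnit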
- import com.gargoylesoftware.htmlunit.BrowserVersion;
- import com.gargoylesoftware.htmlunit.WebClient;
- import com.gargoylesoftware.htmlunit.html.HtmlPage;
- public class HtmlUnitExample {
-     public static void main(String[] args) {
-         // try-with-resources closes the simulated browser when done
-         try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
-             // Open a page that is rendered with JavaScript
-             HtmlPage page = webClient.getPage("http://www.example.com");
-             // Get the page title
-             String title = page.getTitleText();
-             System.out.println("Page Title: " + title);
-         } catch (Exception e) {
-             e.printStackTrace();
-         }
-     }
- }

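Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and Edge. Its main uses are browser compatibility testing (checking that an application works well across different browsers and operating systems) and functional testing (building regression tests that verify software functionality and user requirements). It supports automatically recording actions and generating test scripts in languages such as .NET, Java, and Perl.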
- import org.openqa.selenium.WebDriver;
- import org.openqa.selenium.chrome.ChromeDriver;
- public class SeleniumExample {
-     public static void main(String[] args) {
-         // Set the path to the ChromeDriver executable
-         System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver");
-         // Create the ChromeDriver instance
-         WebDriver driver = new ChromeDriver();
-         try {
-             // Open a page that is rendered with JavaScript
-             driver.get("http://www.example.com");
-             // Get the page title
-             String title = driver.getTitle();
-             System.out.println("Page Title: " + title);
-         } finally {
-             // Close the browser and end the WebDriver session
-             driver.quit();
-         }
-     }
- }

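Scrapy is a powerful and flexible open-source crawler framework for quickly building crawlers and data extraction tools. It offers rule-based crawling, can be extended for distributed crawling (for example with the scrapy-redis extension), and has good documentation and an active community. Official site: GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python.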
- import scrapy
- class MySpider(scrapy.Spider):
-     name = 'myspider'
-     start_urls = ['http://www.example.com']
-     def parse(self, response):
-         # Crawler logic: extract page content, etc.
-         pass
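BeautifulSoup is an HTML parsing library, and Requests is a library for sending HTTP requests. The two are often used together to parse web pages and extract data with little effort. Official site: Beautiful Soup 4.12.0 文档 — Beautiful Soup 4.12.0 documentation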
- import requests
- from bs4 import BeautifulSoup
- url = 'http://www.example.com'
- # A timeout keeps the request from hanging indefinitely
- response = requests.get(url, timeout=10)
- if response.status_code == 200:
-     soup = BeautifulSoup(response.text, 'html.parser')
-     # Crawler logic: extract page content, etc.
- else:
-     print(f"Request failed, status code: {response.status_code}")
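To make the parsing step concrete without installing anything, the sketch below does a small part of what BeautifulSoup does, using only Python's built-in html.parser module (the same parser backend selected by the 'html.parser' argument above). It is an illustration, not BeautifulSoup's API: TitleAndLinks and SAMPLE_HTML are hypothetical names, and the HTML is an inline sample rather than a downloaded page.

```python
from html.parser import HTMLParser

# Hypothetical sample page, used in place of response.text.
SAMPLE_HTML = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <h1>Example Domain</h1>
    <p>See <a href="https://www.iana.org/domains/example">More information</a>
       or <a href="/about">About</a>.</p>
  </body>
</html>
"""


class TitleAndLinks(HTMLParser):
    """Extracts the <title> text and every (href, anchor text) pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = TitleAndLinks()
parser.feed(SAMPLE_HTML)
print("Title:", parser.title)
for href, text in parser.links:
    print(href, "|", text)
```

The event-driven callbacks show why a library such as BeautifulSoup is convenient: it builds a navigable tree from these events, so the same data can be reached with a short query instead of hand-written state tracking.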
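Selenium can also be driven from Python through its official bindings:
- from selenium import webdriver
- url = 'http://www.example.com'
- driver = webdriver.Chrome()
- try:
-     driver.get(url)
-     # Crawler logic: extract page content, etc.
- finally:
-     # Always close the browser, even if the crawl logic raises
-     driver.quit()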
PyQuery is a jQuery-like library for parsing HTML documents. Its concise API makes HTML parsing in Python more convenient. Official site: pyquery · PyPI
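- from pyquery import PyQuery as pq
- url = 'http://www.example.com'
- # Pass the address with the url keyword so that PyQuery downloads the page;
- # a positional string would be parsed as markup instead
- doc = pq(url=url)
- # Crawler logic: extract page content, etc.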
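PySpider is a powerful distributed crawler framework written in Python that focuses on providing a simple, flexible, powerful, and fast crawling service. It supports distributed deployment and is highly extensible and customizable. Official site: Introduction - pyspider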
- from pyspider.libs.base_handler import *
- class Handler(BaseHandler):
-     crawl_config = {
-         'headers': {
-             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
-         }
-     }
-     # Re-run the entry point once a day
-     @every(minutes=24 * 60)
-     def on_start(self):
-         self.crawl('https://movie.douban.com/top250', callback=self.index_page)
-     # Treat a fetched index page as fresh for 10 days
-     @config(age=10 * 24 * 60 * 60)
-     def index_page(self, response):
-         # .items() yields PyQuery objects, so each item can be queried with a selector
-         for each in response.doc('div.item').items():
-             self.crawl(each('div.hd a').attr.href, callback=self.detail_page)
-         # Follow the "next page" link only when one exists (it is absent on the last page)
-         next_url = response.doc('.next a').attr.href
-         if next_url:
-             self.crawl(next_url, callback=self.index_page)
-     @config(priority=2)
-     def detail_page(self, response):
-         return {
-             "url": response.url,
-             "title": response.doc('h1 span').text(),
-             "rating": response.doc('strong.ll.rating_num').text(),
-             "cover": response.doc('img[rel="v:image"]').attr.src,
-         }

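Portia is an open-source visual scraping tool for extracting structured data from websites. Developed by Scrapinghub, it aims to simplify and speed up web data extraction without requiring complex code. Official site: Getting Started — Portia 2.0.8 documentation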
- pip install portia
- # After installation, start it directly
- portia
This starts a local web service and provides a web page for carrying out data extraction visually.
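Newspaper is a Python library for extracting article content. It helps developers pull useful information, such as the title, authors, and body text, from news sites and other online articles. It is designed to be easy to use and efficient, and to work across a wide range of news sites and article layouts. Official site: GitHub - codelucas/newspaper: newspaper3k is a news, full-text, and article metadata extraction in Python 3.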
- pip install newspaper3k
- from newspaper import Article
- # URL of the article to process
- article_url = 'https://www.example.com/article'
- # Create the Article object and download the article content
- article = Article(article_url)
- article.download()
- # Parse the article content
- article.parse()
- # Print the article information
- print("Title:", article.title)
- print("Authors:", article.authors)
- print("Publish Date:", article.publish_date)
- print("\nArticle Content:\n", article.text)

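Crawley can crawl website content at high speed, supports both relational and non-relational databases, and can export data as JSON, XML, and other formats. It offers powerful and flexible content extraction: CSS selectors and XPath expressions can be used to pull the required information from web pages, with PyQuery and lxml doing the parsing. Official site: Crawley’s Documentation — crawley v0.1.0 documentation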
- from crawley.crawlers import BaseCrawler
- from crawley.scrapers import BaseScraper
- from crawley.extractors import XPathExtractor
- from models import *
- class pypiScraper(BaseScraper):
-     # specify the URLs that can be scraped by this class
-     matching_urls = ["%"]
-     def scrape(self, response):
-         # get the HTML table
-         table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
-         # for rows 1 to n-1
-         for tr in table[1:-1]:
-             # obtain the cells of interest inside each row
-             td_updated = tr[0]
-             td_package = tr[1]
-             package_link = td_package[0]
-             td_description = tr[2]
-             # store the data in the Packages table
-             Package(updated=td_updated.text, package=package_link.text, description=td_description.text)
- class pypiCrawler(BaseCrawler):
-     # add your starting URLs here
-     start_urls = ["http://pypi.python.org/pypi"]
-     # add your scraper classes here
-     scrapers = [pypiScraper]
-     # specify your maximum crawling depth level
-     max_depth = 0
-     # select your favourite HTML parsing tool
-     extractor = XPathExtractor
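The table-scraping step above (locate a table with a path expression, skip the first and last rows, read three cells per row) can be tried offline with Python's standard library. The following is a sketch under assumptions rather than Crawley code: SAMPLE_TABLE and its contents are made-up sample data, and xml.etree.ElementTree supports only a limited XPath subset and requires well-formed markup, whereas lxml, which the text mentions, copes with real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a downloaded page.
SAMPLE_TABLE = """
<html><body>
<table id="packages">
  <tr><th>Updated</th><th>Package</th><th>Description</th></tr>
  <tr><td>2024-01-01</td><td><a href="/p/alpha">alpha</a></td><td>First package</td></tr>
  <tr><td>2024-01-02</td><td><a href="/p/beta">beta</a></td><td>Second package</td></tr>
  <tr><td colspan="3">footer</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE_TABLE.strip())
# Locate the table by attribute instead of by a brittle absolute path.
table = root.find(".//table[@id='packages']")

packages = []
# Rows 1 to n-1: skip the header row and the footer row.
for tr in table.findall("tr")[1:-1]:
    cells = tr.findall("td")
    packages.append({
        "updated": cells[0].text,
        "package": cells[1].find("a").text,  # the package name is the link text
        "description": cells[2].text,
    })

for package in packages:
    print(package)
```

Selecting the table by its id attribute is a deliberate change from the absolute path in the original example: a path such as /html/body/div[5]/... breaks as soon as the page layout shifts, while an attribute-based query usually survives.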

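Grab is a Python framework for writing web crawlers. It provides a powerful and flexible set of tools that make fetching and processing web pages easier. Grab is designed to simplify common crawling tasks while staying flexible enough to handle a variety of site structures. Official site: http://docs.grablib.org/en/latest/#grab-spider-user-manual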
- from grab import Grab
- # Create a Grab instance
- g = Grab()
- # URL to fetch
- url = 'https://www.example.com'
- g.go(url)
- # Print the content of the fetched page
- print("Content of", url)
- print(g.response.body)
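python-goose is a lightweight article extraction library for pulling article content out of web pages. It analyzes the page with techniques similar to natural language processing to extract the title, authors, body text, and other information. The example below imports goose3, the Python 3 fork of the library. Official site: GitHub - grangier/python-goose: Html Content / Article Extractor, web scrapping lib in Python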
- from goose3 import Goose
- # Create a Goose instance
- g = Goose()
- # URL of the article to extract
- url = 'https://www.example.com/article'
- article = g.extract(url)
- # Print the extracted information
- print("Title:", article.title)
- print("Authors:", article.authors)
- print("Publish Date:", article.publish_date)
- print("\nArticle Content:\n", article.cleaned_text)
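cola, which its repository describes as a high-level distributed crawling framework, is presented here as another option for extracting article content. It is said to use machine learning techniques and a configurable rule engine that can adapt to different site structures, with the goal of high accuracy and high availability. Official site: GitHub - qinxuye/cola: A high-level distributed crawling framework.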
- # Note: check the ArticleExtractor import and get_article call against the cola version you install
- from cola.extractors import ArticleExtractor
- # URL of the article to extract
- url = 'https://www.example.com/article'
- # Use ArticleExtractor to extract the article information
- article = ArticleExtractor().get_article(url)
- # Print the extracted information
- print("Title:", article.title)
- print("Authors:", article.authors)
- print("Publish Date:", article.publish_date)
- print("\nArticle Content:\n", article.text)
