My last post was about the interview topic of AOP; after so many days offline, it's time for some hands-on technical material. My company recently picked up a crawling requirement, so I went through a number of open-source frameworks and in the end chose a Chinese-maintained one, and it has held up well: extraction, persistence, and storage are all covered by one customized setup. Here is the framework comparison I put together.

The demo below is simple. The extraction requirements: fetch the news list on a schedule, plus the title and body from each second-level page.

A survey of crawler frameworks

Survey background: since we need a crawler component to scrape web pages and paginated news data, I surveyed the candidates, weighing each component's features, technical barrier to entry, and other factors to arrive at the one that best fits the project.
| Requirement | webmagic | crawler4j | heritrix3 | nutch | spiderman2 |
|---|---|---|---|---|---|
| Crawl a given page's data | √ | √ | √ | √ | √ |
| Crawl paginated news data | √ | √ | √ | √ | √ |
| Custom storage of crawled page data or files | files and databases | files and databases | jobs store crawled data as WARC files by default; files and databases also supported | not in 1.x; 2.x delegates storage to Gora, which can back onto HBase, Cassandra, MySQL, etc. | files and databases |
| Scheduled crawling | √ | √ | √ | √ | × |
| Distributed crawling | √ | √ | √ | √ | √ |
| Visual UI (1) / configurable (2) / neither (0) | (2) annotation-based configuration | (2) can be configured via Spring integration | (1) web UI for configuring crawl jobs | (2) script-based crawl configuration | (0) configuration in code only |
| Repository | https://github.com/code4craft/webmagic | | | https://github.com/apache/nutch | |
| Popularity: stars (s) and watchers (w) | s:7589 w:803 | s:3372 w:309 | s:1385 w:174 | s:1869 w:245 | s:1377 w:528 |
| Stability | stable | stable | stable | stable | fairly stable |
| User manual and developer docs | complete | poor: no published API docs, only a few detailed source examples | complete and detailed | complete, current, and detailed | rather sparse |
| Community and ecosystem | fairly good | average | good | good | fairly poor |
| Barrier to entry and learning cost | low | low | moderate: it has its own web console driven by Crawler commands, which takes some learning, though it is a Java open-source framework | high: requires writing scripts, and install and operation happen on servers, so shell familiarity is needed | low |
| Assessment | a vertical, full-stack, modular crawler, well suited to scraping a specific domain; ships download, scheduling, persistence, and page-processing modules | many crawler projects build on it; quite extensible, but fairly basic, with a weak ecosystem | rich docs and resources, mature framework, suits large crawl projects; relatively steep learning curve | an Apache open-source crawler; fetching, parsing, and storage are only one part of what it does | clean, easy-to-use architecture; relatively weak ecosystem |
In summary:

The lightweight webmagic framework looks like the most suitable choice.

Reasons: low barrier to entry; simple, easy to use, and quick to pick up; maintained by Chinese developers; detailed documentation; supports full-stack crawler development.

Now let's wire Spring Boot and webmagic together.

The technical baseline: a Maven build, with Spring Data JPA as the ORM.

The pom dependencies:
```xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-selenium</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>
</dependencies>
```
Project structure:

Module overview:
- processor: fetches pages and runs the extraction flow
- pipeline: saves the extracted data
- task: sets up the scheduled job that crawls the site periodically
- entity: the entity classes
- dao: the persistence layer
- utils: utility classes
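The utils module is not listed in full here; the only helper the code below calls is `UUIDUtil.uuid()`, used to assign entity ids. A minimal sketch of what it presumably does (the implementation is my assumption; only the name comes from the code below):

```java
import java.util.UUID;

// Minimal stand-in for the article's UUIDUtil helper (implementation assumed):
// a random UUID with the dashes stripped, suitable for a String primary key.
public class UUIDUtil {

    public static String uuid() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}
```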
This is only a simple example, so the code follows directly.
YangGuangPageContent.class
```java
package com.longcloud.springboot.webmagic.entity;

import java.util.Date;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

/**
 * News content entity.
 * @author 常青
 */
@Entity
@Table(name = "yang_guang_page_content")
public class YangGuangPageContent {

    // news content id
    @Id
    private String id;

    // article body
    private String content;

    // author
    private String author;

    // news type shown on the list page
    private String type;

    // place of publication
    private String address;

    // title
    private String title;

    // attention status of the news item
    private String status;

    // publication time
    @Column(name = "publish_time")
    private String publishTime;

    // time the item was crawled
    @Column(name = "created_time")
    private Date createdTime;

    // crawler that created the row
    @Column(name = "created_by")
    private String createdBy;

    // URL of the article body linked from the list
    @Column(name = "content_url")
    private String contentUrl;

    // time the row was last updated
    @Column(name = "updated_time")
    private Date updatedTime;

    // crawler that last updated the row
    @Column(name = "updated_by")
    private String updatedBy;

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getPublishTime() {
        return publishTime;
    }

    public void setPublishTime(String publishTime) {
        this.publishTime = publishTime;
    }

    public Date getCreatedTime() {
        return createdTime;
    }

    public void setCreatedTime(Date createdTime) {
        this.createdTime = createdTime;
    }

    public String getCreatedBy() {
        return createdBy;
    }

    public void setCreatedBy(String createdBy) {
        this.createdBy = createdBy;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getContentUrl() {
        return contentUrl;
    }

    public void setContentUrl(String contentUrl) {
        this.contentUrl = contentUrl;
    }

    public Date getUpdatedTime() {
        return updatedTime;
    }

    public void setUpdatedTime(Date updatedTime) {
        this.updatedTime = updatedTime;
    }

    public String getUpdatedBy() {
        return updatedBy;
    }

    public void setUpdatedBy(String updatedBy) {
        this.updatedBy = updatedBy;
    }
}
```
dao:
```java
package com.longcloud.springboot.webmagic.dao;

import java.util.Date;

import javax.transaction.Transactional;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.stereotype.Repository;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

@Repository
// the entity's @Id is a String, so the repository's ID type parameter must be String too
public interface YangGuangPageContentDao extends JpaRepository<YangGuangPageContent, String> {

    // look up an article body row by its URL
    YangGuangPageContent findByContentUrl(String url);

    // update a subset of fields; note that JPQL uses entity property names, not column names
    @Transactional
    @Modifying(clearAutomatically = true)
    @Query("update YangGuangPageContent set content = ?1, updatedTime = ?2, updatedBy = ?3 where contentUrl = ?4")
    int updateContent(String content, Date updatedTime,
            String updatedBy, String contentUrl);
}
```
The extraction logic:
Extracting the news list: YangGuangPageProcessor.class
```java
package com.longcloud.springboot.webmagic.processor;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;
import com.longcloud.springboot.webmagic.utils.UUIDUtil;
import com.longcloud.springboot.webmagic.vo.YangGuangVo;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

@Component
public class YangGuangPageProcessor implements PageProcessor {

    private static Logger logger = LoggerFactory.getLogger(YangGuangPageProcessor.class);

    // base of the site; the list items' body links are relative to this
    public static final String URL = "http://58.210.114.86/bbs/";

    // first page of the news list
    public static final String BASE_URL = "http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=1";

    // crawl settings: encoding, crawl interval, retries, timeout, etc. (see the official docs);
    // note that setDomain expects a host name, not a full URL
    private Site site = Site.me()
            .setDomain("58.210.114.86")
            .setSleepTime(1000)
            .setRetryTimes(30)
            .setCharset("utf-8")
            .setTimeOut(5000);
            //.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // the current page number is whatever follows "page=" in the URL
        String[] pages = page.getUrl().toString().split("page=");
        long size = Long.parseLong(pages[1]);
        if (size <= 2) {

            YangGuangVo yangGuangVo = new YangGuangVo();
            // all list rows on the current page
            List<Selectable> list = page.getHtml().xpath("//div[@class='bm_c']/form/table/tbody").nodes();

            if (list != null && list.size() > 0) {
                List<YangGuangPageContent> yangGuangPages = new ArrayList<YangGuangPageContent>();

                for (int i = 0; i < list.size(); i++) {
                    Selectable s = list.get(i);

                    // body URL, type, status, title, author, address, publish time
                    String contentUrl = s.xpath("//tr/td[@class='icn']/a/@href").toString();
                    String type = s.xpath("//tr/th[@class='common']/em[1]/a/text()").toString();
                    String status = s.xpath("//th[@class='common']/img[1]/@alt").toString();
                    String title = s.xpath("//th[@class='common']/a[@class='s xst']/text()").toString();
                    String author = s.xpath("//td[@class='by']/cite/a/text()").toString();
                    String address = s.xpath("//th[@class='common']/em[2]/text()").toString();
                    String publishTime = s.xpath("//td[@class='by']/em/span/span/@title").toString();
                    // rows marked "new" use a different th class; fall back to it
                    if (StringUtils.isEmpty(type)) {
                        type = s.xpath("//tr/th[@class='new']/em[1]/a/text()").toString();
                    }
                    if (StringUtils.isEmpty(status)) {
                        status = s.xpath("//th[@class='new']/img[1]/@alt").toString();
                    }
                    if (StringUtils.isEmpty(title)) {
                        title = s.xpath("//th[@class='new']/a[@class='s xst']/text()").toString();
                    }
                    if (StringUtils.isEmpty(address)) {
                        address = s.xpath("//th[@class='new']/em[2]/text()").toString();
                    }
                    if (StringUtils.isNotEmpty(contentUrl)) {
                        YangGuangPageContent yangGuangPage = new YangGuangPageContent();
                        yangGuangPage.setId(UUIDUtil.uuid());
                        yangGuangPage.setContentUrl(URL + contentUrl);
                        yangGuangPage.setCreatedBy("system");
                        yangGuangPage.setCreatedTime(new Date());
                        yangGuangPage.setType(type);
                        yangGuangPage.setStatus(status);
                        yangGuangPage.setTitle(title);
                        yangGuangPage.setAuthor(author);
                        yangGuangPage.setAddress(address);
                        yangGuangPage.setPublishTime(publishTime);

                        logger.info(String.format("the item's body URL is: [%s]", contentUrl));

                        yangGuangPages.add(yangGuangPage);
                    }
                }
                yangGuangVo.setPageList(yangGuangPages);
            }
            page.putField("yangGuang", yangGuangVo);
        }
        page.addTargetRequests(doListUrl());
    }

    /* Standalone run, without Spring:
    public static void main(String[] args) {
        Spider spider = Spider.create(new YangGuangPageProcessor());
        spider.addUrl(BASE_URL);
        spider.thread(5);
        spider.setExitWhenComplete(true);
        spider.start();
        spider.stop();
    }*/

    // queue the follow-up list pages (only page 2 in this demo)
    public List<String> doListUrl() {
        List<String> list = new ArrayList<String>();
        for (int i = 2; i < 3; i++) {
            list.add("http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=" + i);
        }
        return list;
    }
}
```
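Two small pieces of logic in this processor are easy to get wrong: deriving the page number from the `page=` query parameter, and generating the follow-up list URLs in `doListUrl()`. They can be exercised standalone; the `PageUrls` class and its method names are mine, only the URL and the page range come from the article:

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrls {

    static final String LIST_URL =
            "http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=";

    // Extract the page number from a list URL, mirroring split("page=") in process().
    public static long pageNumber(String url) {
        String[] parts = url.split("page=");
        return Long.parseLong(parts[1]);
    }

    // Build the follow-up list-page URLs, mirroring doListUrl() (pages 2..2 in the article).
    public static List<String> followUpPages(int from, int toExclusive) {
        List<String> list = new ArrayList<String>();
        for (int i = from; i < toExclusive; i++) {
            list.add(LIST_URL + i);
        }
        return list;
    }
}
```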
Saving the news list:
YangGuangPagePipeline.class
```java
package com.longcloud.springboot.webmagic.pipeline;

import java.util.ArrayList;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.dao.YangGuangPageContentDao;
import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;
import com.longcloud.springboot.webmagic.processor.YangGuangPageContentProcessor;
import com.longcloud.springboot.webmagic.vo.YangGuangVo;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class YangGuangPagePipeline implements Pipeline {

    @Autowired
    private YangGuangPageContentDao yangGuangContentDao;

    @Autowired
    private YangGuangPageContentPipeline yangGuangPageContentPipeline;

    private Logger logger = LoggerFactory.getLogger(YangGuangPagePipeline.class);

    @Override
    public void process(ResultItems resultItems, Task task) {
        YangGuangVo yangGuangVo = (YangGuangVo) resultItems.get("yangGuang");

        if (yangGuangVo != null) {
            List<YangGuangPageContent> list = new ArrayList<>();
            if (yangGuangVo.getPageList() != null && yangGuangVo.getPageList().size() > 0) {
                list = yangGuangContentDao.save(yangGuangVo.getPageList());
            }
            if (list.size() > 0) {
                for (YangGuangPageContent yangGuangPage : yangGuangVo.getPageList()) {
                    logger.info("starting body crawl");
                    // crawl one level deeper here to fetch each item's second-level page
                    Spider spider = Spider.create(new YangGuangPageContentProcessor());
                    spider.addUrl(yangGuangPage.getContentUrl());
                    logger.info("body URL: " + yangGuangPage.getContentUrl());
                    spider.addPipeline(yangGuangPageContentPipeline)
                            .addPipeline(new YangGuangFilePipline());
                    spider.thread(1);
                    spider.setExitWhenComplete(true);
                    spider.start();
                    spider.stop();
                    logger.info("body crawl finished");
                }
            }
        }
    }
}
```
Extracting the body of each news list item:
YangGuangPageContentProcessor.class
```java
package com.longcloud.springboot.webmagic.processor;

import java.util.Date;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class YangGuangPageContentProcessor implements PageProcessor {

    private static Logger logger = LoggerFactory.getLogger(YangGuangPageContentProcessor.class);

    public static final String URL = "http://58.210.114.86/bbs/";

    // crawl settings: encoding, crawl interval, retries, timeout, etc. (see the official docs);
    // note that setDomain expects a host name, not a full URL
    private Site site = Site.me()
            .setDomain("58.210.114.86")
            .setSleepTime(1000)
            .setRetryTimes(30)
            .setCharset("utf-8")
            .setTimeOut(5000);
            //.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

    @Override
    public void process(Page page) {
        // pull the body fields out of the detail page
        YangGuangPageContent yangGuangPageContent = new YangGuangPageContent();
        String content = page.getHtml().xpath("//div[@id='postlist']/div/table/tbody/tr/td[2]").toString();
        //div[@id='JIATHIS_CODE_HTML4']/div/table/tbody/tr/td/text() also matches the body text
        logger.info(content);
        yangGuangPageContent.setContentUrl(page.getUrl().toString());
        yangGuangPageContent.setContent(content);
        yangGuangPageContent.setUpdatedBy("system");
        yangGuangPageContent.setUpdatedTime(new Date());
        page.putField("yangGuangPageContent", yangGuangPageContent);
    }

    @Override
    public Site getSite() {
        return site;
    }
}
```
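The `xpath(...)` selectors above are webmagic's Xsoup dialect of XPath. The same selection idea can be tried with the JDK's built-in XPath engine on a small well-formed snippet; the markup and selector here are illustrative stand-ins, not the forum's real structure:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {

    // Evaluate an XPath expression against a well-formed XML string
    // and return the text of the first match.
    public static String extract(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id='postlist'><table><tr><td>hi</td><td>body text</td></tr></table></div>";
        // second td of the row, like the td[2] step in the article's selector
        System.out.println(extract(xml, "//div[@id='postlist']//td[2]"));
    }
}
```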
Saving the article body:
YangGuangPageContentPipeline.class
```java
package com.longcloud.springboot.webmagic.pipeline;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.dao.YangGuangPageContentDao;
import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class YangGuangPageContentPipeline implements Pipeline {

    @Autowired
    private YangGuangPageContentDao yangGuangContentDao;

    private static Logger logger = LoggerFactory.getLogger(YangGuangPageContentPipeline.class);

    @Override
    public void process(ResultItems resultItems, Task task) {
        YangGuangPageContent yangGuangPageContent = (YangGuangPageContent) resultItems.get("yangGuangPageContent");
        if (yangGuangPageContent != null && yangGuangPageContent.getContentUrl() != null) {
            YangGuangPageContent dbYangGuangPageContent = yangGuangContentDao.findByContentUrl(yangGuangPageContent.getContentUrl());
            // fill in the body of the row the list crawl already created
            if (dbYangGuangPageContent != null) {
                logger.info(yangGuangPageContent.getContent());
                yangGuangContentDao.updateContent(yangGuangPageContent.getContent(),
                        yangGuangPageContent.getUpdatedTime(),
                        yangGuangPageContent.getUpdatedBy(),
                        dbYangGuangPageContent.getContentUrl());
            }
        } else {
            logger.info("no content for this list item");
        }
    }
}
```
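The interplay between the two pipelines (the list crawl inserts rows, the body crawl then fills in `content` keyed by `contentUrl`) can be sketched with a plain map standing in for the JPA repository; `ContentStore` and its method names are stand-ins of mine, not Spring Data API:

```java
import java.util.HashMap;
import java.util.Map;

public class ContentStore {

    // contentUrl -> content; a stand-in for the yang_guang_page_content table
    private final Map<String, String> rows = new HashMap<String, String>();

    // list crawl: insert a row that has no body yet
    public void saveListRow(String contentUrl) {
        rows.putIfAbsent(contentUrl, null);
    }

    // detail crawl: only update rows the list crawl already created,
    // mirroring the findByContentUrl + updateContent pair in the pipeline
    public boolean updateContent(String contentUrl, String content) {
        if (!rows.containsKey(contentUrl)) {
            return false; // unknown URL: nothing to update
        }
        rows.put(contentUrl, content);
        return true;
    }

    public String contentOf(String contentUrl) {
        return rows.get(contentUrl);
    }
}
```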
The scheduled crawl task:
SpingBootWebmagicJob.class
```java
package com.longcloud.springboot.webmagic.task;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.dao.YangGuangPageContentDao;
import com.longcloud.springboot.webmagic.pipeline.YangGuangPagePipeline;
import com.longcloud.springboot.webmagic.processor.YangGuangPageProcessor;

import us.codecraft.webmagic.Spider;

@Component
@EnableScheduling
public class SpingBootWebmagicJob {

    private Logger logger = LoggerFactory.getLogger(SpingBootWebmagicJob.class);

    public static final String BASE_URL = "http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=1";

    @Autowired
    private YangGuangPageContentDao yangGuangContentDao;

    @Autowired
    private YangGuangPagePipeline yangGuangPagePipeline;

    @Scheduled(cron = "${webmagic.job.cron}")
    //@PostConstruct would run the crawl once at application startup instead
    public void job() {
        long startTime, endTime;
        logger.info("crawl started");
        startTime = System.currentTimeMillis();
        logger.info("crawling: " + BASE_URL);
        try {
            // clear out the previous run before re-crawling
            yangGuangContentDao.deleteAll();
            Spider spider = Spider.create(new YangGuangPageProcessor());
            spider.addUrl(BASE_URL);
            spider.addPipeline(yangGuangPagePipeline);
            // .addPipeline(new YangGuangFilePipline());
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
            spider.stop();
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
        endTime = System.currentTimeMillis();
        logger.info("crawl finished");
        logger.info("the crawl took about " + ((endTime - startTime) / 1000) + " seconds; results saved to the database.");
    }
}
```
Don't forget the application configuration:

```properties
server.port=8085
server.context-path=/
#database
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/scrapy-webmagic?useSSL=false&useUnicode=yes&characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=webmagic123
#connection pool
spring.datasource.hikari.maximum-pool-size=20
spring.datasource.hikari.minimum-idle=5
#JPA
spring.jpa.database-platform=org.hibernate.dialect.MySQL5InnoDBDialect
spring.jpa.show-sql=true
#cron: crawl once a day at 1 AM; Spring's @Scheduled takes a six-field
#expression (second minute hour day-of-month month day-of-week)
webmagic.job.cron=0 0 1 * * ?
```
And with that, a scheduled news crawler is complete. Keep following me for more!