
Integrating WebMagic with Spring Boot: A Web Crawler with Real-Time Database Persistence

My previous post was about AOP as an interview topic; after so many days away, I owe you some real technical content. My company recently needed crawling functionality, so I went through a number of open-source frameworks and ultimately chose a domestically maintained one, which has worked out well: with some customization it covers the whole pipeline, from extraction through persistence to storage. Here is the comparison of the frameworks I looked at.

The simple demo below has these extraction requirements: fetch the news list on a schedule, then pull the title, body text, and other details from each second-level page.

A Survey of Crawler Frameworks

Survey overview: since we need a crawler component to scrape web pages and paginated news data, I evaluated the candidate frameworks below, weighing each one's features, technical barrier to entry, and other factors to find the best fit for the project's needs.

The functional requirements compared were: crawling specified page data, crawling paginated news data, custom storage of crawled content or files, scheduled crawling, and support for distributed crawling. On storage, the frameworks differ as follows:

| Framework | Custom storage of crawled page content or files |
| --- | --- |
| webmagic | Supports saving to files and to a database |
| crawler4j | Supports saving to files and to a database |
| heritrix3 | Job output is stored as WARC files by default |
| nutch | Not supported in 1.x; 2.x moved storage into Gora, which can persist to several databases such as HBase, Cassandra, and MySQL |
| spiderman2 | Supports saving to files and to a database |

| Non-functional criteria | webmagic | crawler4j | heritrix3 | nutch | spiderman2 |
| --- | --- | --- | --- | --- | --- |
| Configuration (web UI = 1, config-driven = 2, neither = 0) | 2: annotation-based configuration | 2: configurable via Spring integration | 1: web UI for configuring crawl jobs | 2: crawls configured through scripts | 0: configured by editing code |
| Project URL | https://github.com/code4craft/webmagic | https://github.com/yasserg/crawler4j | https://github.com/internetarchive/heritrix3 | https://github.com/apache/nutch | https://gitee.com/l-weiwei/Spiderman2 |
| Popularity: stars (s) and watchers (w) | s: 7589, w: 803 | s: 3372, w: 309 | s: 1385, w: 174 | s: 1869, w: 245 | s: 1377, w: 528 |
| Stability | Stable | Stable | Stable | Stable | Fairly stable |
| User manual and developer docs | Complete | Poor: no public API docs, only a few detailed source examples | Complete: detailed user manual and developer docs | Complete: user manual and developer docs are both current and detailed | Relatively lacking |
| Community ecosystem | Relatively good | Average | Good | Good | Relatively poor |
| Barrier to entry / learning cost | Low | Low | Average: it has its own web console driven by Crawler commands, which takes some learning, but it is an open-source Java framework | High: crawls are scripted, and installing and running it means working on the server and knowing the relevant shell commands | Low |
| Assessment | A vertical, full-stack, modular crawler, well suited to scraping a specific domain; ships with download, scheduling, persistence, and page-processing modules | Many crawler projects are built by extending it, so extensibility is high, but it is fairly bare-bones with a weak ecosystem | Rich documentation and a mature framework, suited to large crawl projects; relatively steep learning curve | Apache's open-source crawler, in which fetching, parsing, and storage are only one part of a larger system | Clean, easy-to-use architecture; relatively weak ecosystem |

To sum up: the lightweight WebMagic framework looks like the most suitable choice.

Why WebMagic: low barrier to entry; simple, easy to use, and quick to pick up; maintained by developers in China; detailed documentation; and support for full-stack crawler development.

Now let's integrate WebMagic with Spring Boot.

The technical baseline: Maven for the build, Spring Data JPA as the ORM.

Add the POM dependencies:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-selenium</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>
</dependencies>
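One thing this post never shows is the Spring Boot entry point; here is a minimal sketch, assuming the package name used by the rest of the code (the class name is my own):

package com.longcloud.springboot.webmagic;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Hypothetical entry point, not part of the original post.
// @EnableScheduling already sits on the job class further below, so it is not repeated here.
@SpringBootApplication
public class SpringbootWebmagicApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringbootWebmagicApplication.class, args);
    }
}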

Project structure (a sketch of the source tree follows the module list):

# Module overview

processor - fetches pages and runs the extraction flow

pipeline - saves what the processors extract

task - defines the scheduled job that crawls the site periodically

entity - the entity classes

dao - data persistence

utils - utility classes
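Combining those modules with the package names that appear in the code, the source tree presumably looks like this (the vo package holding YangGuangVo is inferred from the imports):

src/main/java/com/longcloud/springboot/webmagic/
├── dao/        YangGuangPageContentDao
├── entity/     YangGuangPageContent
├── pipeline/   YangGuangPagePipeline, YangGuangPageContentPipeline, YangGuangFilePipline
├── processor/  YangGuangPageProcessor, YangGuangPageContentProcessor
├── task/       SpingBootWebmagicJob
├── utils/      UUIDUtil
└── vo/         YangGuangVo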

This is only meant as a simple example, so I'll paste the code directly.

YangGuangPageContent.class

package com.longcloud.springboot.webmagic.entity;

import java.util.Date;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

/**
 * A news item.
 * @author 常青
 */
@Entity
@Table(name = "yang_guang_page_content")
public class YangGuangPageContent {

    // Primary key of the news item
    @Id
    private String id;
    // Body text
    private String content;
    // Author
    private String author;
    // News type shown in the list
    private String type;
    // Place of publication
    private String address;
    // Title
    private String title;
    // Attention status shown in the list
    private String status;
    // Publication time
    @Column(name = "publish_time")
    private String publishTime;
    // When the item was crawled
    @Column(name = "created_time")
    private Date createdTime;
    // Who crawled it
    @Column(name = "created_by")
    private String createdBy;
    // URL of the body page the list row points to
    @Column(name = "content_url")
    private String contentUrl;
    // When the body was last updated
    @Column(name = "updated_time")
    private Date updatedTime;
    // Who last updated it
    @Column(name = "updated_by")
    private String updatedBy;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }
    public String getPublishTime() { return publishTime; }
    public void setPublishTime(String publishTime) { this.publishTime = publishTime; }
    public Date getCreatedTime() { return createdTime; }
    public void setCreatedTime(Date createdTime) { this.createdTime = createdTime; }
    public String getCreatedBy() { return createdBy; }
    public void setCreatedBy(String createdBy) { this.createdBy = createdBy; }
    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }
    public String getContentUrl() { return contentUrl; }
    public void setContentUrl(String contentUrl) { this.contentUrl = contentUrl; }
    public Date getUpdatedTime() { return updatedTime; }
    public void setUpdatedTime(Date updatedTime) { this.updatedTime = updatedTime; }
    public String getUpdatedBy() { return updatedBy; }
    public void setUpdatedBy(String updatedBy) { this.updatedBy = updatedBy; }
}

The DAO:

package com.longcloud.springboot.webmagic.dao;

import java.util.Date;

import javax.transaction.Transactional;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.stereotype.Repository;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

@Repository
// The entity's @Id is a String, so the repository's ID type must be String, not Long
public interface YangGuangPageContentDao extends JpaRepository<YangGuangPageContent, String> {

    // Look up a body record by its URL
    YangGuangPageContent findByContentUrl(String url);

    // Update selected fields. JPQL must use entity property names
    // (updatedTime, contentUrl), not column names (updated_time, content_url)
    @Transactional
    @Modifying(clearAutomatically = true)
    @Query("update YangGuangPageContent set content = ?1, updatedTime = ?2, updatedBy = ?3 where contentUrl = ?4")
    int updateContent(String content, Date updatedTime, String updatedBy, String contentUrl);
}
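As a side note, the same update can be written with named parameters, which read better than positional ones; a sketch using Spring Data's @Param (the interface name and parameter names here are illustrative, not from the original):

package com.longcloud.springboot.webmagic.dao;

import java.util.Date;

import javax.transaction.Transactional;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

// Hypothetical variant of the repository above, using named parameters.
public interface YangGuangPageContentDaoNamed extends JpaRepository<YangGuangPageContent, String> {

    @Transactional
    @Modifying(clearAutomatically = true)
    @Query("update YangGuangPageContent p set p.content = :content, p.updatedTime = :updatedTime,"
            + " p.updatedBy = :updatedBy where p.contentUrl = :contentUrl")
    int updateContent(@Param("content") String content,
                      @Param("updatedTime") Date updatedTime,
                      @Param("updatedBy") String updatedBy,
                      @Param("contentUrl") String contentUrl);
}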

Extraction logic.

Extracting the news list: YangGuangPageProcessor.class

package com.longcloud.springboot.webmagic.processor;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;
import com.longcloud.springboot.webmagic.utils.UUIDUtil;
import com.longcloud.springboot.webmagic.vo.YangGuangVo;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

@Component
public class YangGuangPageProcessor implements PageProcessor {

    private static Logger logger = LoggerFactory.getLogger(YangGuangPageProcessor.class);

    // In regex strings, \\ escapes Java's backslash and \. escapes the regex dot
    // Main domain
    public static final String URL = "http://58.210.114.86/bbs/";
    public static final String BASE_URL = "http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=1";
    public static final String PAGE_URL = "http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=1";

    // Crawl settings: encoding, crawl interval, retry count, timeout, etc. See the official docs for details.
    private Site site = Site.me()
            .setDomain(BASE_URL)
            .setSleepTime(1000)
            .setRetryTimes(30)
            .setCharset("utf-8")
            .setTimeOut(5000);
            //.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        String[] pages = page.getUrl().toString().split("page=");
        Long size = Long.valueOf(pages[1]);
        if (size != null && size <= 2) {
            YangGuangVo yangGuangVo = new YangGuangVo();
            // Select every list-row container on the page
            List<Selectable> list = page.getHtml().xpath("//div[@class='bm_c']/form/table/tbody").nodes();
            // Walk all the rows on the current page
            if (list != null && list.size() > 0) {
                List<YangGuangPageContent> yangGuangPages = new ArrayList<YangGuangPageContent>();
                for (int i = 0; i < list.size(); i++) {
                    Selectable s = list.get(i);
                    // Body URL, address, and the other list fields
                    String contentUrl = s.xpath("//tr/td[@class='icn']/a/@href").toString();
                    String type = s.xpath("//tr/th[@class='common']/em[1]/a/text()").toString();
                    String status = s.xpath("//th[@class='common']/img[1]/@alt").toString();
                    String title = s.xpath("//th[@class='common']/a[@class='s xst']/text()").toString();
                    String author = s.xpath("//td[@class='by']/cite/a/text()").toString();
                    String address = s.xpath("//th[@class='common']/em[2]/text()").toString();
                    String publishTime = s.xpath("//td[@class='by']/em/span/span/@title").toString();
                    // Rows marked "new" use a different CSS class; fall back to it when the "common" selectors match nothing
                    if (StringUtils.isEmpty(type)) {
                        type = s.xpath("//tr/th[@class='new']/em[1]/a/text()").toString();
                    }
                    if (StringUtils.isEmpty(status)) {
                        status = s.xpath("//th[@class='new']/img[1]/@alt").toString();
                    }
                    if (StringUtils.isEmpty(title)) {
                        title = s.xpath("//th[@class='new']/a[@class='s xst']/text()").toString();
                    }
                    if (StringUtils.isEmpty(address)) {
                        address = s.xpath("//th[@class='new']/em[2]/text()").toString();
                    }
                    if (StringUtils.isNotEmpty(contentUrl)) {
                        YangGuangPageContent yangGuangPage = new YangGuangPageContent();
                        yangGuangPage.setId(UUIDUtil.uuid());
                        yangGuangPage.setContentUrl(URL + contentUrl);
                        yangGuangPage.setCreatedBy("system");
                        yangGuangPage.setCreatedTime(new Date());
                        yangGuangPage.setType(type);
                        yangGuangPage.setStatus(status);
                        yangGuangPage.setTitle(title);
                        yangGuangPage.setAuthor(author);
                        yangGuangPage.setAddress(address);
                        yangGuangPage.setPublishTime(publishTime);
                        logger.info(String.format("Body URL for this row: [%s]", contentUrl));
                        yangGuangPages.add(yangGuangPage);
                    }
                }
                yangGuangVo.setPageList(yangGuangPages);
            }
            page.putField("yangGuang", yangGuangVo);
            //page.putField("yangGuangHtml", page.getHtml());
        }
        page.addTargetRequests(doListUrl());
    }

    /*public static void main(String[] args) {
        Spider spider = Spider.create(new YangGuangPageProcessor());
        spider.addUrl(BASE_URL);
        spider.addPipeline();
        spider.thread(5);
        spider.setExitWhenComplete(true);
        spider.start();
        spider.stop();
    }*/

    // Queue the remaining list pages (page 2 only, in this demo)
    public List<String> doListUrl() {
        List<String> list = new ArrayList<String>();
        for (int i = 2; i < 3; i++) {
            list.add("http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=" + i);
        }
        return list;
    }
}
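The processor imports a YangGuangVo wrapper and a UUIDUtil helper that the post never shows. Below are minimal sketches consistent with how they are used; the class names come from the imports above, but the bodies are assumptions:

package com.longcloud.springboot.webmagic.vo;

import java.util.List;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

// Assumed implementation: a simple wrapper around the extracted list rows.
public class YangGuangVo {

    private List<YangGuangPageContent> pageList;

    public List<YangGuangPageContent> getPageList() { return pageList; }
    public void setPageList(List<YangGuangPageContent> pageList) { this.pageList = pageList; }
}

package com.longcloud.springboot.webmagic.utils;

import java.util.UUID;

// Assumed implementation: a 32-character UUID suitable for the String primary key.
public class UUIDUtil {

    public static String uuid() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}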

Saving the news list:

YangGuangPagePipeline.class

package com.longcloud.springboot.webmagic.pipeline;

import java.util.ArrayList;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.dao.YangGuangPageContentDao;
import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;
import com.longcloud.springboot.webmagic.processor.YangGuangPageContentProcessor;
import com.longcloud.springboot.webmagic.vo.YangGuangVo;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class YangGuangPagePipeline implements Pipeline {

    @Autowired
    private YangGuangPageContentDao yangGuangContentDao;
    @Autowired
    private YangGuangPageContentPipeline yangGuangPageContentPipeline;

    private Logger logger = LoggerFactory.getLogger(YangGuangPagePipeline.class);

    @Override
    public void process(ResultItems resultItems, Task task) {
        YangGuangVo yangGuangVo = (YangGuangVo) resultItems.get("yangGuang");
        if (yangGuangVo != null) {
            System.out.println(yangGuangVo);
            List<YangGuangPageContent> list = new ArrayList<>();
            if (yangGuangVo.getPageList() != null && yangGuangVo.getPageList().size() > 0) {
                list = yangGuangContentDao.save(yangGuangVo.getPageList());
            }
            if (list.size() > 0) {
                for (YangGuangPageContent yangGuangPage : yangGuangVo.getPageList()) {
                    logger.info("Starting body crawl");
                    // Crawl one level deeper here to fetch each item's second-level (body) page
                    Spider spider = Spider.create(new YangGuangPageContentProcessor());
                    spider.addUrl(yangGuangPage.getContentUrl());
                    logger.info("Body URL being crawled: " + yangGuangPage.getContentUrl());
                    spider.addPipeline(yangGuangPageContentPipeline)
                          .addPipeline(new YangGuangFilePipline());
                    spider.thread(1);
                    spider.setExitWhenComplete(true);
                    spider.start();
                    spider.stop();
                    logger.info("Body crawl finished");
                }
            }
        }
    }
}
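The YangGuangFilePipline added above never appears in the original post either; here is a minimal sketch of what such a file-dumping pipeline might look like. The output directory, file naming, and format are all assumptions:

package com.longcloud.springboot.webmagic.pipeline;

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

// Assumed implementation: dumps each crawled body to a text file as a crude backup.
public class YangGuangFilePipline implements Pipeline {

    private static final Path DIR = Paths.get("data/yangguang"); // assumed output directory

    @Override
    public void process(ResultItems resultItems, Task task) {
        YangGuangPageContent content = resultItems.get("yangGuangPageContent");
        if (content == null || content.getContent() == null) {
            return;
        }
        try {
            Files.createDirectories(DIR);
            Path file = DIR.resolve(System.currentTimeMillis() + ".txt");
            try (PrintWriter writer = new PrintWriter(
                    Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
                writer.println(content.getContentUrl());
                writer.println(content.getContent());
            }
        } catch (IOException e) {
            throw new RuntimeException("failed to write crawl result to file", e);
        }
    }
}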

Extracting the body of each list item:

YangGuangPageContentProcessor.class

package com.longcloud.springboot.webmagic.processor;

import java.util.Date;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class YangGuangPageContentProcessor implements PageProcessor {

    private static Logger logger = LoggerFactory.getLogger(YangGuangPageContentProcessor.class);

    public static final String URL = "http://58.210.114.86/bbs/";

    // Crawl settings: encoding, crawl interval, retry count, timeout, etc. See the official docs for details.
    private Site site = Site.me()
            .setDomain(URL)
            .setSleepTime(1000)
            .setRetryTimes(30)
            .setCharset("utf-8")
            .setTimeOut(5000);
            //.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

    @Override
    public void process(Page page) {
        // Pull the body fields out of the page
        YangGuangPageContent yangGuangPageContent = new YangGuangPageContent();
        String content = page.getHtml().xpath("//div[@id='postlist']/div/table/tbody/tr/td[2]").toString();
        // //div[@id='JIATHIS_CODE_HTML4']/div/table/tbody/tr/td/text() selects the body text
        System.out.println(content);
        yangGuangPageContent.setContentUrl(page.getUrl().toString());
        yangGuangPageContent.setContent(content);
        yangGuangPageContent.setUpdatedBy("system");
        yangGuangPageContent.setUpdatedTime(new Date());
        page.putField("yangGuangPageContent", yangGuangPageContent);
        //page.putField("yangGuangHtml", page.getHtml());
    }

    @Override
    public Site getSite() {
        return site;
    }
}
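To debug XPath expressions in isolation, this processor can be run on its own with WebMagic's built-in ConsolePipeline, which prints extracted fields to stdout instead of saving them. A throwaway sketch (the runner class name and the thread-detail URL are assumptions):

package com.longcloud.springboot.webmagic.processor;

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

// Hypothetical one-off runner for XPath debugging; not part of the original post.
public class YangGuangContentDebugRunner {

    public static void main(String[] args) {
        Spider.create(new YangGuangPageContentProcessor())
              // assumed example of a thread-detail URL on the target forum
              .addUrl("http://58.210.114.86/bbs/forum.php?mod=viewthread&tid=1")
              .addPipeline(new ConsolePipeline())
              .thread(1)
              .run();
    }
}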

Saving the body content:

YangGuangPageContentPipeline.class

package com.longcloud.springboot.webmagic.pipeline;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.dao.YangGuangPageContentDao;
import com.longcloud.springboot.webmagic.entity.YangGuangPageContent;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class YangGuangPageContentPipeline implements Pipeline {

    @Autowired
    private YangGuangPageContentDao yangGuangContentDao;

    private static Logger logger = LoggerFactory.getLogger(YangGuangPageContentPipeline.class);

    @Override
    public void process(ResultItems resultItems, Task task) {
        YangGuangPageContent yangGuangPageContent = (YangGuangPageContent) resultItems.get("yangGuangPageContent");
        if (yangGuangPageContent != null && yangGuangPageContent.getContentUrl() != null) {
            YangGuangPageContent dbYangGuangPageContent = yangGuangContentDao.findByContentUrl(yangGuangPageContent.getContentUrl());
            // Fill in the body of the list record saved earlier
            if (dbYangGuangPageContent != null) {
                logger.info(yangGuangPageContent.getContent());
                yangGuangContentDao.updateContent(yangGuangPageContent.getContent(),
                        yangGuangPageContent.getUpdatedTime(),
                        yangGuangPageContent.getUpdatedBy(),
                        dbYangGuangPageContent.getContentUrl());
            }
        } else {
            logger.info("No body content for this list item");
        }
    }
}

The scheduled crawl task:

SpingBootWebmagicJob.class

package com.longcloud.springboot.webmagic.task;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import com.longcloud.springboot.webmagic.dao.YangGuangPageContentDao;
import com.longcloud.springboot.webmagic.pipeline.YangGuangPagePipeline;
import com.longcloud.springboot.webmagic.processor.YangGuangPageProcessor;

import us.codecraft.webmagic.Spider;

@Component
@EnableScheduling
public class SpingBootWebmagicJob {

    private Logger logger = LoggerFactory.getLogger(SpingBootWebmagicJob.class);

    public static final String BASE_URL = "http://58.210.114.86/bbs/forum.php?mod=forumdisplay&fid=2&page=1";

    @Autowired
    private YangGuangPageContentDao yangGuangContentDao;
    @Autowired
    YangGuangPagePipeline yangGuangPagePipeline;

    @Scheduled(cron = "${webmagic.job.cron}")
    // Swap in @PostConstruct to run once at application startup instead
    public void job() {
        long startTime, endTime;
        System.out.println("[crawl started]");
        startTime = System.currentTimeMillis();
        logger.info("Crawling: " + BASE_URL);
        try {
            yangGuangContentDao.deleteAll();
            Spider spider = Spider.create(new YangGuangPageProcessor());
            spider.addUrl(BASE_URL);
            spider.addPipeline(yangGuangPagePipeline);
            // .addPipeline(new YangGuangFilePipline());
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
            spider.stop();
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
        endTime = System.currentTimeMillis();
        System.out.println("[crawl finished]");
        System.out.println("The crawl took about " + ((endTime - startTime) / 1000) + " seconds; results have been saved to the database.");
    }
}
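The commented-out @PostConstruct hint above means you can also trigger one crawl as soon as the application starts, in addition to the cron schedule. A minimal sketch of that variant, added inside SpingBootWebmagicJob (the method name is an assumption):

import javax.annotation.PostConstruct;

// Assumed variant: fires a single crawl once the Spring context is up.
@PostConstruct
public void crawlOnStartup() {
    job();
}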

Don't forget the application configuration:

server.port=8085
server.context-path=/
# database
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/scrapy-webmagic?useSSL=false&useUnicode=yes&characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=webmagic123
# connection pool
spring.datasource.hikari.maximum-pool-size=20
spring.datasource.hikari.minimum-idle=5
# JPA
spring.jpa.database-platform=org.hibernate.dialect.MySQL5InnoDBDialect
spring.jpa.show-sql=true
# cron: crawl once a day at 1 a.m. (Spring cron takes six fields: second minute hour day month weekday)
webmagic.job.cron=0 0 1 * * ?
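For reference, a couple of other schedules in the same six-field Spring cron format:

# every hour, on the hour
webmagic.job.cron=0 0 * * * ?
# every 30 minutes
webmagic.job.cron=0 0/30 * * * ?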

And that's a complete scheduled news crawler. Thanks for reading, and stay tuned!
