WebMagic, Part 1: Integrating WebMagic with Spring Boot to Crawl Multiple Sites on a Schedule

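My manager recently asked me to maintain an old crawler. One look at the code and I lost all desire to touch it, so I decided to write a new crawler from scratch. Java has far fewer crawler frameworks than Python, and in the end I picked WebMagic as the development framework.

I am writing this article because I ran into quite a few headaches during actual development and want to keep a record of them.

WebMagic documentation (Chinese): Introduction · WebMagic Documents

The best way to learn something from scratch is to read the documentation on the official site, or to read an article where someone more experienced walks through that documentation.

I am using Spring Boot here, with the dependencies managed by Maven: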

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>

The Java code below is copied from the official site:

    package com.example.testspringboot.processor;

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class GithubRepoPageProcessor implements PageProcessor {

        // Retry a failed download up to 3 times and sleep 100 ms between requests.
        private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

        @Override
        public void process(Page page) {
            // Queue every repository link found on the page for crawling.
            page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
            // Extract the fields to keep from the current page.
            page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
            page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
            if (page.getResultItems().get("name") == null) {
                // skip this page
                page.setSkip(true);
            }
            page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // Start from one profile page and crawl with 5 threads.
            Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
        }
    }
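To see what the two URL expressions in process() actually match, here is a minimal sketch that runs the same patterns through plain java.util.regex. It does not use WebMagic's own selector classes, and the class and method names (GithubUrlPatterns, repoLinks, author) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GithubUrlPatterns {
    // The same two expressions that appear in process() above.
    static final Pattern REPO_LINK = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Collects every substring that looks like https://github.com/<owner>/<repo>.
    static List<String> repoLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = REPO_LINK.matcher(text);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // Returns the owner segment of a repository URL, or null when the URL has no second segment.
    static String author(String url) {
        Matcher matcher = AUTHOR.matcher(url);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repoLinks("see https://github.com/code4craft/webmagic and https://github.com/code4craft"));
        System.out.println(author("https://github.com/code4craft/webmagic"));
        System.out.println(author("https://github.com/code4craft"));
    }
}
```

Only the first URL in the sample text counts as a repository link, because the pattern requires both an owner and a repository segment. The author pattern likewise returns null for the bare profile URL. Since \w does not match a hyphen, owner or repository names that contain one are not matched in full by these expressions.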

Run the main method and off we go!

    09:22:53.972 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
    javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
        at sun.security.ssl.HandshakeContext.<init>(HandshakeContext.java:171)
        at sun.security.ssl.ClientHandshakeContext.<init>(ClientHandshakeContext.java:101)
        at sun.security.ssl.TransportContext.kickstart(TransportContext.java:238)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
        at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
        at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
        at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
        at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    09:22:54.081 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.

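An error on the very first step did not make me happy. A quick search showed that the cause is the TLS settings in newer JDK builds, which disable the older protocol versions by default.

My current JDK version is 1.8.0_291: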

    C:\Users\MI>java -version
    java version "1.8.0_291"
    Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
    Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
    C:\Users\MI>
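To check which protocol versions the running JVM actually enables, a small sketch using only the standard JSSE API can help. The class name TlsProtocolCheck and its methods are made up for this example; the SSLContext calls are standard JDK API.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

class TlsProtocolCheck {
    // Protocols the default SSLContext enables for new connections.
    static List<String> enabledProtocols() {
        try {
            return Arrays.asList(SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    // Protocols the default SSLContext knows about, whether or not they are enabled.
    static List<String> supportedProtocols() {
        try {
            return Arrays.asList(SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("enabled      = " + enabledProtocols());
        System.out.println("supported    = " + supportedProtocols());
    }
}
```

On a JDK whose jdk.tls.disabledAlgorithms setting lists TLSv1 and TLSv1.1, I would expect those two to be missing from the enabled list while possibly still appearing in the supported list.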

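Locate the JDK installation directory and edit its java.security configuration file (on a JDK 8 installation this is typically jre/lib/security/java.security).

In that file, find the following entry: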

jdk.tls.disabledAlgorithms=SSLv3, TLSv1, TLSv1.1, RC4, DES, MD5withRSA, \
    DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
    include jdk.disabled.namedCurves

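Remove SSLv3, TLSv1, and TLSv1.1 from that list, which leaves:

jdk.tls.disabledAlgorithms=RC4, DES, MD5withRSA, \
    DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL, \
    include jdk.disabled.namedCurves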

What are these protocols for? This article explains them clearly: 一文讲清SSL协议_sslv3-CSDN博客 (a CSDN blog post explaining the SSL protocol).
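The edit to jdk.tls.disabledAlgorithms described above can also be expressed in code, which makes it easy to see exactly what changes. The sketch below reads the current value through java.security.Security and computes the relaxed value; the class DisabledAlgorithms and the helper without are hypothetical names. It only prints the result. Applying it at runtime with Security.setProperty is an alternative to editing the file, but as far as I know it only takes effect if it runs before the JVM opens its first TLS connection, and re-enabling these old protocols weakens security either way.

```java
import java.security.Security;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class DisabledAlgorithms {
    // The three entries the article removes from jdk.tls.disabledAlgorithms.
    static final Set<String> TO_ENABLE = new HashSet<>(Arrays.asList("SSLv3", "TLSv1", "TLSv1.1"));

    // Returns the comma-separated property value with the given entries dropped.
    static String without(String value, Set<String> names) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(entry -> !entry.isEmpty() && !names.contains(entry))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        String current = Security.getProperty("jdk.tls.disabledAlgorithms");
        System.out.println("current: " + current);
        if (current != null) {
            // To apply instead of print: Security.setProperty("jdk.tls.disabledAlgorithms", relaxed)
            String relaxed = without(current, TO_ENABLE);
            System.out.println("relaxed: " + relaxed);
        }
    }
}
```

Entries are compared as whole tokens, so removing TLSv1 does not accidentally touch TLSv1.1 or any other entry that merely starts with the same text.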

Run it again.

That change alone was not enough, so here is the fix up front: upgrade WebMagic to 0.10.0.

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.10.0</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.10.0</version>
    </dependency>

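The reason for the upgrade: the default HttpClient in 0.7.3 only uses TLSv1 for its requests, so any site that supports only TLS 1.2 fails. See: Https下无法抓取只支持TLS1.2的站点 (HTTPS sites that only support TLS 1.2 cannot be crawled) · Issue #701 · code4craft/webmagic · GitHub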
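To illustrate what "only uses TLSv1" versus "uses TLS 1.2" means at the JSSE level, here is a minimal sketch that restricts a client-side SSLEngine to a single protocol version. It is plain JDK code and not how WebMagic or HttpClient configure their connections internally; the class name Tls12Only and the method tls12Engine are made up for this example.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

class Tls12Only {
    // Builds a client-mode engine that will offer TLSv1.2 and nothing else.
    static SSLEngine tls12Engine() {
        try {
            SSLContext context = SSLContext.getInstance("TLSv1.2");
            // null arguments select the default key managers, trust managers, and random source.
            context.init(null, null, null);
            SSLEngine engine = context.createSSLEngine();
            engine.setUseClientMode(true);
            engine.setEnabledProtocols(new String[] {"TLSv1.2"});
            return engine;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("TLSv1.2 is not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tls12Engine().getEnabledProtocols()));
    }
}
```

A client pinned to TLSv1 in the same way would be rejected by a server that accepts only TLS 1.2, which matches the failure described in the issue.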

WebMagic releases: Releases · code4craft/webmagic · GitHub

Run it again, and this time it works.

The next part starts the actual crawling work.
