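An earlier article, "Configuring WebMagic in Eclipse", covers setting up WebMagic in Eclipse and is offered for reference only; WebMagic can also be set up by adding it as a dependency.
1. Because of anti-crawling mechanisms, the target server will block a crawler that does not disguise itself, so the crawler has to present itself as a browser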
private Site site = Site.me().setRetryTimes(3).setSleepTime(10000).setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");
2. Pick a JD phone brand page, for example the Apple page
public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";
3. Configure the program entry point: use 6 threads and store the results as JSON files in a specified location
public static void main(String[] args) {
JDSpider jDS = new JDSpider();
Spider spider = Spider.create(jDS);
spider.addUrl(starts);
// directory where the JSON data is stored
spider.addPipeline(new JsonFilePipeline("C:\\Users\\xxx\\Desktop\\test"));
spider.thread(6);
spider.run();
}
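The path passed to JsonFilePipeline in addPipeline is a storage folder that you create yourself
4. Some information on a phone page is dynamic, such as the price and the total number of positive reviews, and changes constantly. This data is therefore served as JSON from separate URLs rather than being part of the static page that was crawled, so a function is needed that fetches the JSON data behind a given URL
// Fetch JSON data from a URL
public String loadJson(String url) {
    StringBuilder json = new StringBuilder();
    try {
        URL urlObject = new URL(url);
        URLConnection uc = urlObject.openConnection();
        // The response is decoded as GBK
        BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"));
        String inputLine = null;
        while ((inputLine = in.readLine()) != null) {
            json.append(inputLine);
        }
        in.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return json.toString();
}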
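The loadJson function above opens its own connection, so the User-Agent configured on Site does not apply to it, and the reader is not closed if an exception is thrown. The following is a minimal sketch of a variant, not the article's code: the class name JsonLoader, the userAgent parameter, the timeout values, and the choice to rethrow failures as UncheckedIOException are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper class; the name is not from the article.
class JsonLoader {

    // Reads the stream line by line and concatenates the lines
    // (line breaks are dropped, as in the original loadJson).
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even when reading fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Fetches the URL as GBK text. Timeouts and the User-Agent header are assumptions.
    static String loadJson(String url, String userAgent) {
        try {
            URLConnection uc = URI.create(url).toURL().openConnection();
            uc.setConnectTimeout(5_000);   // milliseconds, assumed value
            uc.setReadTimeout(10_000);     // milliseconds, assumed value
            uc.setRequestProperty("User-Agent", userAgent);
            return readAll(uc.getInputStream(), Charset.forName("GBK"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Splitting the stream reading from the network call keeps the reading logic checkable without any network access.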
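5. For Apple phones (other brands work the same way), a single list page shows at most a fixed number of phones of that brand, but there are many such pages, so their URLs must follow some pattern. Find the pattern to obtain the URLs of all list pages (each list page contains multiple phones)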
for (int j = 0; j < 11; j++) {
String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
page.addTargetRequest(item);
}
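The pattern in the loop above is that for the j-th list page (counting from 0) the page parameter is 2j + 1 and the s parameter is 60j + 1. As a small sketch, the same arithmetic can be pulled into a helper; the class and method names here are hypothetical, and the page count of 11 comes from the loop above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; only the URL and the arithmetic come from the article.
class ListPageUrls {

    // List URL from the article, up to and including "page=".
    static final String PREFIX =
        "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"
        + "&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=";

    // URL of the j-th list page (0-based): page = 2j + 1, s = 60j + 1.
    static String listPageUrl(int j) {
        return PREFIX + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
    }

    // URLs of the first pageCount list pages, in order.
    static List<String> allListPageUrls(int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int j = 0; j < pageCount; j++) {
            urls.add(listPageUrl(j));
        }
        return urls;
    }
}
```

For example, j = 0 gives page=1 and s=1, and j = 10 gives page=21 and s=601.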
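6. Next, obtain the URLs of all phone pages by using XPath to find every phone page link on a list page. First check whether the current URL is a matching list page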
if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"))
Then collect all the phones
// Get all phone pages
List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
for (int j = 0; j < items.size(); j++) {
String itemsNew = "https:" + items.get(j);
page.addTargetRequest(itemsNew);
}
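The loop above prepends "https:" because the extracted href values are protocol-relative (they start with "//"). The sketch below shows one way to do the same step a little more defensively, skipping blank entries and duplicates. The class and method names are hypothetical, and treating non-"//" values as already absolute is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of the article's code.
class ItemLinks {

    // Turns a protocol-relative href such as "//item.jd.com/1.html" into an https URL.
    // Anything that does not start with "//" is returned unchanged (assumed already absolute).
    static String toAbsolute(String href) {
        String h = href.trim();
        return h.startsWith("//") ? "https:" + h : h;
    }

    // Normalizes a list of hrefs, dropping null/blank entries and duplicates
    // while preserving the original order.
    static List<String> normalize(List<String> hrefs) {
        Set<String> unique = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href != null && !href.trim().isEmpty()) {
                unique.add(toAbsolute(href));
            }
        }
        return new ArrayList<>(unique);
    }
}
```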
7. Next, extract the required phone attributes, such as the phone name and URL below
// Phone name
String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
page.putField("name", name);
// Get the phone ID from the URL (stored under the field name "url")
String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
page.putField("url", id);
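The two replace calls above assume the URL has exactly the form https://item.jd.com/<id>.html. As a hedged alternative, a regular expression can extract the numeric ID and report the case where the URL does not match. The class name SkuIds and the method name fromItemUrl are hypothetical, and the assumption that IDs consist only of digits is based on the example URLs in this article.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the article's code.
class SkuIds {

    // Matches item URLs of the form https://item.jd.com/<digits>.html
    // (digits-only IDs are an assumption based on the article's examples).
    private static final Pattern ITEM_URL =
        Pattern.compile("^https://item\\.jd\\.com/(\\d+)\\.html");

    // Returns the numeric ID, or an empty Optional if the URL is not an item page.
    static Optional<String> fromItemUrl(String url) {
        Matcher m = ITEM_URL.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Unlike the replace-based version, a trailing query string is ignored rather than ending up inside the ID.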
For dynamic data such as the price and the positive-review rate, the JSON data has to be fetched and parsed separately
(1) Example price URL:
https://p.3.cn/prices/mgets?skuIds=J_10026711061553
Pattern extracted from it:
String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
Here id is the ID obtained in the previous step. Write a JavaBean class that matches the structure of the price response and use it to parse the price. The JavaBean class is as follows
public class Price {
    private String op;
    private String m;
    private String cbf;
    private String id;
    private String p;

    public String getOp() { return op; }
    public void setOp(String op) { this.op = op; }

    public String getM() { return m; }
    public void setM(String m) { this.m = m; }

    public String getCbf() { return cbf; }
    public void setCbf(String cbf) { this.cbf = cbf; }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getP() { return p; }
    public void setP(String p) { this.p = p; }
}
Parsing the price JSON and storing the price
String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
page.putField("price", prices.get(0).getP());
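Note that prices.get(0) throws if the endpoint returns an empty array. The sketch below illustrates the shape of the data and a defensive lookup without any JSON library: it swaps fastjson for a plain regular expression, which is fragile and only meant for illustration. The class and method names are hypothetical, and the field names are inferred from the Price bean above; the article does not show an actual response.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a regex stand-in for real JSON parsing, for illustration only.
class PriceField {

    // Matches the first "p":"..." pair. The opening quote before p keeps the
    // pattern from matching the "op" field.
    private static final Pattern P_FIELD =
        Pattern.compile("\"p\"\\s*:\\s*\"([^\"]*)\"");

    // Builds the price URL from a product ID, following the article's pattern.
    static String priceUrl(String id) {
        return "https://p.3.cn/prices/mgets?skuIds=J_" + id;
    }

    // Returns the first "p" value in the text, or empty if there is none
    // (for example when the response is an empty array).
    static Optional<String> firstPrice(String json) {
        Matcher m = P_FIELD.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

With the fastjson version in the article, the equivalent guard would be to check prices.isEmpty() before calling prices.get(0).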
(2) Example URL for the total number of positive reviews:
https://club.jd.com/comment/productCommentSummaries.action?referenceIds=10026711061553
JavaBean classes
public class CommentsCount {
    private long SkuId;
    private long ProductId;
    private int ShowCount;
    private String ShowCountStr;
    private String CommentCountStr;
    private int CommentCount;
    private int AverageScore;
    private String DefaultGoodCountStr;
    private int DefaultGoodCount;
    private String GoodCountStr;
    private int GoodCount;
    private int AfterCount;
    private int OneYear;
    private String AfterCountStr;
    private int VideoCount;
    private String VideoCountStr;
    private double GoodRate;
    private int GoodRateShow;
    private int GoodRateStyle;
    private String GeneralCountStr;
    private int GeneralCount;
    private double GeneralRate;
    private int GeneralRateShow;
    private int GeneralRateStyle;
    private String PoorCountStr;
    private int PoorCount;
    private int SensitiveBook;
    private double PoorRate;
    private int PoorRateShow;
    private int PoorRateStyle;

    public void setSkuId(long SkuId) { this.SkuId = SkuId; }
    public long getSkuId() { return SkuId; }
    public void setProductId(long ProductId) { this.ProductId = ProductId; }
    public long getProductId() { return ProductId; }
    public void setShowCount(int ShowCount) { this.ShowCount = ShowCount; }
    public int getShowCount() { return ShowCount; }
    public void setShowCountStr(String ShowCountStr) { this.ShowCountStr = ShowCountStr; }
    public String getShowCountStr() { return ShowCountStr; }
    public void setCommentCountStr(String CommentCountStr) { this.CommentCountStr = CommentCountStr; }
    public String getCommentCountStr() { return CommentCountStr; }
    public void setCommentCount(int CommentCount) { this.CommentCount = CommentCount; }
    public int getCommentCount() { return CommentCount; }
    public void setAverageScore(int AverageScore) { this.AverageScore = AverageScore; }
    public int getAverageScore() { return AverageScore; }
    public void setDefaultGoodCountStr(String DefaultGoodCountStr) { this.DefaultGoodCountStr = DefaultGoodCountStr; }
    public String getDefaultGoodCountStr() { return DefaultGoodCountStr; }
    public void setDefaultGoodCount(int DefaultGoodCount) { this.DefaultGoodCount = DefaultGoodCount; }
    public int getDefaultGoodCount() { return DefaultGoodCount; }
    public void setGoodCountStr(String GoodCountStr) { this.GoodCountStr = GoodCountStr; }
    public String getGoodCountStr() { return GoodCountStr; }
    public void setGoodCount(int GoodCount) { this.GoodCount = GoodCount; }
    public int getGoodCount() { return GoodCount; }
    public void setAfterCount(int AfterCount) { this.AfterCount = AfterCount; }
    public int getAfterCount() { return AfterCount; }
    public void setOneYear(int OneYear) { this.OneYear = OneYear; }
    public int getOneYear() { return OneYear; }
    public void setAfterCountStr(String AfterCountStr) { this.AfterCountStr = AfterCountStr; }
    public String getAfterCountStr() { return AfterCountStr; }
    public void setVideoCount(int VideoCount) { this.VideoCount = VideoCount; }
    public int getVideoCount() { return VideoCount; }
    public void setVideoCountStr(String VideoCountStr) { this.VideoCountStr = VideoCountStr; }
    public String getVideoCountStr() { return VideoCountStr; }
    public void setGoodRate(double GoodRate) { this.GoodRate = GoodRate; }
    public double getGoodRate() { return GoodRate; }
    public void setGoodRateShow(int GoodRateShow) { this.GoodRateShow = GoodRateShow; }
    public int getGoodRateShow() { return GoodRateShow; }
    public void setGoodRateStyle(int GoodRateStyle) { this.GoodRateStyle = GoodRateStyle; }
    public int getGoodRateStyle() { return GoodRateStyle; }
    public void setGeneralCountStr(String GeneralCountStr) { this.GeneralCountStr = GeneralCountStr; }
    public String getGeneralCountStr() { return GeneralCountStr; }
    public void setGeneralCount(int GeneralCount) { this.GeneralCount = GeneralCount; }
    public int getGeneralCount() { return GeneralCount; }
    public void setGeneralRate(double GeneralRate) { this.GeneralRate = GeneralRate; }
    public double getGeneralRate() { return GeneralRate; }
    public void setGeneralRateShow(int GeneralRateShow) { this.GeneralRateShow = GeneralRateShow; }
    public int getGeneralRateShow() { return GeneralRateShow; }
    public void setGeneralRateStyle(int GeneralRateStyle) { this.GeneralRateStyle = GeneralRateStyle; }
    public int getGeneralRateStyle() { return GeneralRateStyle; }
    public void setPoorCountStr(String PoorCountStr) { this.PoorCountStr = PoorCountStr; }
    public String getPoorCountStr() { return PoorCountStr; }
    public void setPoorCount(int PoorCount) { this.PoorCount = PoorCount; }
    public int getPoorCount() { return PoorCount; }
    public void setSensitiveBook(int SensitiveBook) { this.SensitiveBook = SensitiveBook; }
    public int getSensitiveBook() { return SensitiveBook; }
    public void setPoorRate(double PoorRate) { this.PoorRate = PoorRate; }
    public double getPoorRate() { return PoorRate; }
    public void setPoorRateShow(int PoorRateShow) { this.PoorRateShow = PoorRateShow; }
    public int getPoorRateShow() { return PoorRateShow; }
    public void setPoorRateStyle(int PoorRateStyle) { this.PoorRateStyle = PoorRateStyle; }
    public int getPoorRateStyle() { return PoorRateStyle; }
}
import java.util.List;
public class JsonRootBean {
private List<CommentsCount> CommentsCount;
public void setCommentsCount(List<CommentsCount> CommentsCount) {
this.CommentsCount = CommentsCount;
}
public List<CommentsCount> getCommentsCount() {
return CommentsCount;
}
}
Parsing the review-summary JSON and storing the values
String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
List<CommentsCount> count = jrt.getCommentsCount();
page.putField("goodRateShow", count.get(0).getGoodRateShow());
page.putField("comment", count.get(0).getCommentCountStr());
At this point a simple crawler is basically complete, and all the crawled data is stored in the folder you created.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class JDSpider implements PageProcessor {
    // Browser disguise
    private Site site = Site.me().setRetryTimes(3).setSleepTime(10000).setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");

    // Start page for Apple phones
    public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";

    public static void main(String[] args) {
        JDSpider jDS = new JDSpider();
        Spider spider = Spider.create(jDS);
        spider.addUrl(starts);
        // directory where the JSON data is stored; 346 records in total
        spider.addPipeline(new JsonFilePipeline("C:\\Users\\zxf\\Desktop\\testnew"));
        spider.thread(6);
        spider.run();
    }

    // Fetch JSON data from a URL
    public String loadJson(String url) {
        StringBuilder json = new StringBuilder();
        try {
            URL urlObject = new URL(url);
            URLConnection uc = urlObject.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"));
            String inputLine = null;
            while ((inputLine = in.readLine()) != null) {
                json.append(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return json.toString();
    }

    @Override
    public void process(Page page) {
        // List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
        // Queue every list page: page = 2j + 1, s = 60j + 1
        for (int j = 0; j < 11; j++) {
            String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
            page.addTargetRequest(item);
        }
        String urlone = page.getRequest().getUrl();
        if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E")) {
            // Get all phone pages
            List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
            for (int j = 0; j < items.size(); j++) {
                String itemsNew = "https:" + items.get(j);
                page.addTargetRequest(itemsNew);
            }
            // Skip pipeline output for list pages
            page.setSkip(true);
        }
        String urltwo = page.getRequest().getUrl();
        // Process each phone page
        if (urltwo.startsWith("https://item.jd.com")) {
            // Product name
            String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
            page.putField("name", name);
            // Get the phone ID from the URL (stored under the field name "url")
            String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
            page.putField("url", id);
            // Price link: https://p.3.cn/prices/mgets?skuIds=J_10026711061553
            // Review link: https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100014348492&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
            // https://club.jd.com/comment/productCommentSummaries.action?referenceIds=100014348492
            // Product review summary link
            String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
            JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
            List<CommentsCount> count = jrt.getCommentsCount();
            page.putField("goodRateShow", count.get(0).getGoodRateShow());
            page.putField("comment", count.get(0).getCommentCountStr());
            String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
            List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
            page.putField("price", prices.get(0).getP());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
The three JavaBean classes above are also required; the complete code has already been given above.