Crawler, Part 1: Scraping Phone Information from JD.com with WebMagic

I. Eclipse + WebMagic Setup

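An earlier article, "Configuring WebMagic in Eclipse", explains how to set up WebMagic in Eclipse and is offered for reference only. WebMagic can also be set up by adding it to the project as a dependency.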

II. Crawler Steps

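1. JD.com has anti-crawling measures, so a crawler that does not disguise itself will be blocked by the server. The crawler therefore presents itself as a browser by sending a browser User-Agent; the same Site object also sets 3 retries and a 10-second (10000 ms) sleep between requests.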

private Site site = Site.me().setRetryTimes(3).setSleepTime(10000).setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");

2. Pick a brand listing page for phones on JD.com, for example the Apple page.

public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";

3. Set up the program entry point: run 6 threads and save the results as JSON files in a chosen directory.

public static void main(String[] args) {
	JDSpider jDS = new JDSpider();
	Spider spider = Spider.create(jDS);
	spider.addUrl(starts);
	// Directory the JSON files are written to
	spider.addPipeline(new JsonFilePipeline("C:\\Users\\xxx\\Desktop\\test"));
	spider.thread(6);
	spider.run();
}

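The path passed to JsonFilePipeline in addPipeline is a folder you create yourself to hold the output.

4. Some of the information on a product page is dynamic: the price, the total number of positive reviews and so on keep changing. These values are therefore served from separate JSON endpoints rather than contained in the static HTML the crawler downloads, so we need a function that fetches a URL and returns its JSON body as a string.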

// Fetch a URL and return the response body (JSON) as a string
public String loadJson(String url) {
	StringBuilder json = new StringBuilder();
	try {
		URL urlObject = new URL(url);
		URLConnection uc = urlObject.openConnection();
		// try-with-resources closes the reader even if reading fails
		try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"))) {
			String inputLine;
			while ((inputLine = in.readLine()) != null) {
				json.append(inputLine);
			}
		}
	} catch (Exception e) {
		// On failure the stack trace is printed and an empty string is returned
		e.printStackTrace();
	}
	return json.toString();
}
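
The function above relies on URLConnection defaults. As a rough sketch only, the variant below adds timeouts, a User-Agent header and a status check. The class name JsonFetcher, the helper readAll, the timeout values and the choice to return an empty string on a non-200 response are all assumptions made for illustration and are not part of the original program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;

// Hypothetical helper class; not part of the article's program.
final class JsonFetcher {
    private JsonFetcher() {}

    /** Reads the whole stream as text in the given charset and closes it. Line breaks are dropped. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    /** Fetches url and returns the body decoded as GBK, or "" when the status is not 200. */
    static String loadJson(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5_000);   // assumed value, in milliseconds
            conn.setReadTimeout(10_000);     // assumed value, in milliseconds
            conn.setRequestProperty("User-Agent", userAgent);
            try {
                if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                    return "";
                }
                return readAll(conn.getInputStream(), Charset.forName("GBK"));
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Unlike the original, this sketch rethrows I/O failures as UncheckedIOException instead of printing them, so the caller has to decide how to handle a failed request.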

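5. A listing page for Apple phones (other brands work the same way) can show only a fixed maximum number of products, and the brand spans many such pages. The page URLs must therefore follow some pattern; once it is found, the URLs of all listing pages (each listing page holds several phones) can be generated. In the loop below, page takes the odd values 1, 3, ..., 21 and s starts at 1 and increases by 60 each time.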

for (int j = 0; j < 11; j++) {
	String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
	page.addTargetRequest(item);
}
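
To make the pattern easier to check, here is a small self-contained sketch of the same arithmetic. The class name ListPageUrls and its method names are made up for illustration; the URL prefix, including the pvid value, is copied unchanged from the loop above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the article's program.
final class ListPageUrls {
    // Prefix copied from the article's loop, pvid included as it appears there.
    static final String PREFIX =
            "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"
            + "&pvid=17eff9be0146499b85c1e66a0f0e7ea9";

    private ListPageUrls() {}

    /** URL of the listing page with zero-based index j: page = 2j + 1, s = 60j + 1. */
    static String listPageUrl(int j) {
        return PREFIX + "&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
    }

    /** URLs for the indexes 0 .. count - 1. */
    static List<String> listPageUrls(int count) {
        List<String> urls = new ArrayList<>(count);
        for (int j = 0; j < count; j++) {
            urls.add(listPageUrl(j));
        }
        return urls;
    }
}
```

With count = 11, as in the article, the last URL ends in page=21 and s=601.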

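6. Next, collect the URL of every product page by using XPath to find the product links on each listing page. First check that the current URL is a listing page: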

if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"))

Then collect all the phones:

// Collect the links to all product pages; each href lacks a scheme, so "https:" is prepended
List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
for (int j = 0; j < items.size(); j++) {
	String itemsNew = "https:" + items.get(j);
	page.addTargetRequest(itemsNew);
}
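
Prepending "https:" works as long as every href starts with "//". As a hedged alternative, the sketch below resolves each href against the URL of the page it came from using java.net.URI, which also copes with hrefs that are already absolute. The class name ItemLinks and the method toAbsolute are hypothetical, and skipping blank hrefs is an assumption of this sketch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the article's program.
final class ItemLinks {
    private ItemLinks() {}

    /** Resolves each href against pageUrl; null or blank hrefs are skipped. */
    static List<String> toAbsolute(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue;
            }
            // A scheme-less href such as //item.jd.com/1.html inherits the scheme of the base URL.
            absolute.add(base.resolve(href.trim()).toString());
        }
        return absolute;
    }
}
```

Note that URI.resolve throws IllegalArgumentException for an href containing characters that are illegal in a URI, so real pages may need extra cleaning first.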

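7. Next, extract the phone attributes you need. The code below gets the phone's name and its id (the id is taken from the product URL and stored under the key "url"):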

// Phone name
String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
page.putField("name", name);
// Product id taken from the item URL (stored under the key "url")
String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
page.putField("url", id);
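
The two replace calls assume the URL is exactly https://item.jd.com/<id>.html. If a query string or fragment follows, it ends up in the id. A possible, more defensive sketch uses a regular expression; the class name SkuIds and the method fromItemUrl are hypothetical, and the assumption that ids consist only of digits is based solely on the example URLs in this article.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the article's program.
final class SkuIds {
    // Assumes numeric ids, as in the article's example URLs.
    private static final Pattern ITEM_URL =
            Pattern.compile("^https://item\\.jd\\.com/(\\d+)\\.html");

    private SkuIds() {}

    /** Returns the numeric id from an item URL, or empty if the URL has another shape. */
    static Optional<String> fromItemUrl(String url) {
        if (url == null) {
            return Optional.empty();
        }
        Matcher m = ITEM_URL.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning Optional makes the caller handle URLs that are not product pages instead of passing a malformed id on to the price and review endpoints.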

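Dynamic fields such as the price and the positive-review rate have to be fetched from their own endpoints and parsed from the JSON they return.
(1) Example price URL: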

https://p.3.cn/prices/mgets?skuIds=J_10026711061553

The pattern extracted from it:

String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;

Here id is the id obtained in the previous step. A JavaBean class that mirrors the structure of the price response is written to parse it:

public class Price {
	private String op;
	private String m;
	private String cbf;
	private String id;
	private String p;
	public String getOp() {
		return op;
	}
	public void setOp(String op) {
		this.op = op;
	}
	public String getM() {
		return m;
	}
	public void setM(String m) {
		this.m = m;
	}
	public String getCbf() {
		return cbf;
	}
	public void setCbf(String cbf) {
		this.cbf = cbf;
	}
	public String getId() {
		return id;
	}
	public void setId(String id) {
		this.id = id;
	}
	public String getP() {
		return p;
	}
	public void setP(String p) {
		this.p = p;
	}
}


Fetching and parsing the price JSON:

String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
page.putField("price", prices.get(0).getP());
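
The snippet above stores the p field as a raw string and calls prices.get(0) without checking the list, so a failed request (loadJson returns an empty string) may surface as an exception here. As one possible way to validate the value before storing it, the sketch below converts the string to a BigDecimal. The class name PriceValues and the method parsePrice are hypothetical, and treating a negative value as "no price" is purely an assumption of this sketch, not something the article states.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical helper class; not part of the article's program.
final class PriceValues {
    private PriceValues() {}

    /** Parses a price string; empty for null, blank, non-numeric or negative input. */
    static Optional<BigDecimal> parsePrice(String p) {
        if (p == null || p.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            BigDecimal value = new BigDecimal(p.trim());
            // Assumption: a negative number is not a real price.
            return value.signum() >= 0 ? Optional.of(value) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

A caller would first check that the parsed list is neither null nor empty, then pass prices.get(0).getP() to parsePrice and store the field only when a value is present.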

(2) Example URL for the review summary (total number of positive reviews):

https://club.jd.com/comment/productCommentSummaries.action?referenceIds=10026711061553

The JavaBean class for one summary entry:

public class CommentsCount {
    private long SkuId;
    private long ProductId;
    private int ShowCount;
    private String ShowCountStr;
    private String CommentCountStr;
    private int CommentCount;
    private int AverageScore;
    private String DefaultGoodCountStr;
    private int DefaultGoodCount;
    private String GoodCountStr;
    private int GoodCount;
    private int AfterCount;
    private int OneYear;
    private String AfterCountStr;
    private int VideoCount;
    private String VideoCountStr;
    private double GoodRate;
    private int GoodRateShow;
    private int GoodRateStyle;
    private String GeneralCountStr;
    private int GeneralCount;
    private double GeneralRate;
    private int GeneralRateShow;
    private int GeneralRateStyle;
    private String PoorCountStr;
    private int PoorCount;
    private int SensitiveBook;
    private double PoorRate;
    private int PoorRateShow;
    private int PoorRateStyle;
    public void setSkuId(long SkuId) {
         this.SkuId = SkuId;
     }
     public long getSkuId() {
         return SkuId;
     }

    public void setProductId(long ProductId) {
         this.ProductId = ProductId;
     }
     public long getProductId() {
         return ProductId;
     }

    public void setShowCount(int ShowCount) {
         this.ShowCount = ShowCount;
     }
     public int getShowCount() {
         return ShowCount;
     }

    public void setShowCountStr(String ShowCountStr) {
         this.ShowCountStr = ShowCountStr;
     }
     public String getShowCountStr() {
         return ShowCountStr;
     }

    public void setCommentCountStr(String CommentCountStr) {
         this.CommentCountStr = CommentCountStr;
     }
     public String getCommentCountStr() {
         return CommentCountStr;
     }

    public void setCommentCount(int CommentCount) {
         this.CommentCount = CommentCount;
     }
     public int getCommentCount() {
         return CommentCount;
     }

    public void setAverageScore(int AverageScore) {
         this.AverageScore = AverageScore;
     }
     public int getAverageScore() {
         return AverageScore;
     }

    public void setDefaultGoodCountStr(String DefaultGoodCountStr) {
         this.DefaultGoodCountStr = DefaultGoodCountStr;
     }
     public String getDefaultGoodCountStr() {
         return DefaultGoodCountStr;
     }

    public void setDefaultGoodCount(int DefaultGoodCount) {
         this.DefaultGoodCount = DefaultGoodCount;
     }
     public int getDefaultGoodCount() {
         return DefaultGoodCount;
     }

    public void setGoodCountStr(String GoodCountStr) {
         this.GoodCountStr = GoodCountStr;
     }
     public String getGoodCountStr() {
         return GoodCountStr;
     }

    public void setGoodCount(int GoodCount) {
         this.GoodCount = GoodCount;
     }
     public int getGoodCount() {
         return GoodCount;
     }

    public void setAfterCount(int AfterCount) {
         this.AfterCount = AfterCount;
     }
     public int getAfterCount() {
         return AfterCount;
     }

    public void setOneYear(int OneYear) {
         this.OneYear = OneYear;
     }
     public int getOneYear() {
         return OneYear;
     }

    public void setAfterCountStr(String AfterCountStr) {
         this.AfterCountStr = AfterCountStr;
     }
     public String getAfterCountStr() {
         return AfterCountStr;
     }

    public void setVideoCount(int VideoCount) {
         this.VideoCount = VideoCount;
     }
     public int getVideoCount() {
         return VideoCount;
     }

    public void setVideoCountStr(String VideoCountStr) {
         this.VideoCountStr = VideoCountStr;
     }
     public String getVideoCountStr() {
         return VideoCountStr;
     }

    public void setGoodRate(double GoodRate) {
         this.GoodRate = GoodRate;
     }
     public double getGoodRate() {
         return GoodRate;
     }

    public void setGoodRateShow(int GoodRateShow) {
         this.GoodRateShow = GoodRateShow;
     }
     public int getGoodRateShow() {
         return GoodRateShow;
     }

    public void setGoodRateStyle(int GoodRateStyle) {
         this.GoodRateStyle = GoodRateStyle;
     }
     public int getGoodRateStyle() {
         return GoodRateStyle;
     }

    public void setGeneralCountStr(String GeneralCountStr) {
         this.GeneralCountStr = GeneralCountStr;
     }
     public String getGeneralCountStr() {
         return GeneralCountStr;
     }

    public void setGeneralCount(int GeneralCount) {
         this.GeneralCount = GeneralCount;
     }
     public int getGeneralCount() {
         return GeneralCount;
     }

    public void setGeneralRate(double GeneralRate) {
         this.GeneralRate = GeneralRate;
     }
     public double getGeneralRate() {
         return GeneralRate;
     }

    public void setGeneralRateShow(int GeneralRateShow) {
         this.GeneralRateShow = GeneralRateShow;
     }
     public int getGeneralRateShow() {
         return GeneralRateShow;
     }

    public void setGeneralRateStyle(int GeneralRateStyle) {
         this.GeneralRateStyle = GeneralRateStyle;
     }
     public int getGeneralRateStyle() {
         return GeneralRateStyle;
     }

    public void setPoorCountStr(String PoorCountStr) {
         this.PoorCountStr = PoorCountStr;
     }
     public String getPoorCountStr() {
         return PoorCountStr;
     }

    public void setPoorCount(int PoorCount) {
         this.PoorCount = PoorCount;
     }
     public int getPoorCount() {
         return PoorCount;
     }

    public void setSensitiveBook(int SensitiveBook) {
         this.SensitiveBook = SensitiveBook;
     }
     public int getSensitiveBook() {
         return SensitiveBook;
     }

    public void setPoorRate(double PoorRate) {
         this.PoorRate = PoorRate;
     }
     public double getPoorRate() {
         return PoorRate;
     }

    public void setPoorRateShow(int PoorRateShow) {
         this.PoorRateShow = PoorRateShow;
     }
     public int getPoorRateShow() {
         return PoorRateShow;
     }

    public void setPoorRateStyle(int PoorRateStyle) {
         this.PoorRateStyle = PoorRateStyle;
     }
     public int getPoorRateStyle() {
         return PoorRateStyle;
     }
}

The response wraps the summary entries in a CommentsCount list, so a root JavaBean is needed as well:

import java.util.List;

public class JsonRootBean {
    private List<CommentsCount> CommentsCount;

    public void setCommentsCount(List<CommentsCount> CommentsCount) {
        this.CommentsCount = CommentsCount;
    }

    public List<CommentsCount> getCommentsCount() {
        return CommentsCount;
    }
}

Fetching and parsing the review summary JSON:

String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
List<CommentsCount> count = jrt.getCommentsCount();
page.putField("goodRateShow", count.get(0).getGoodRateShow());
page.putField("comment", count.get(0).getCommentCountStr());

At this point a simple crawler is essentially complete, and all the scraped data is saved in the folder you created.

III. Full Source Code

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
public class JDSpider implements PageProcessor {
	// Browser disguise: browser User-Agent, 3 retries, 10 s sleep between requests
	private Site site = Site.me().setRetryTimes(3).setSleepTime(10000).setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");
	// Start page: the Apple phone listing
	public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";

	public static void main(String[] args) {
		JDSpider jDS = new JDSpider();
		Spider spider = Spider.create(jDS);
		spider.addUrl(starts);
		// Directory the JSON files are written to; 346 records in total
		spider.addPipeline(new JsonFilePipeline("C:\\Users\\zxf\\Desktop\\testnew"));
		spider.thread(6);
		spider.run();
	}
	
	// Fetch a URL and return the response body (JSON) as a string
	public String loadJson(String url) {
		StringBuilder json = new StringBuilder();
		try {
			URL urlObject = new URL(url);
			URLConnection uc = urlObject.openConnection();
			// try-with-resources closes the reader even if reading fails
			try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"))) {
				String inputLine;
				while ((inputLine = in.readLine()) != null) {
					json.append(inputLine);
				}
			}
		} catch (Exception e) {
			// On failure the stack trace is printed and an empty string is returned
			e.printStackTrace();
		}
		return json.toString();
	}

	@Override
	public void process(Page page) {
		// Queue the 11 listing pages. This runs for every processed page;
		// WebMagic's default scheduler drops URLs it has already seen.
		for (int j = 0; j < 11; j++) {
			String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
			page.addTargetRequest(item);
		}
		String urlone = page.getRequest().getUrl();
		if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E")) {
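			// Collect the links to all product pages; each href lacks a scheme, so "https:" is prepended
			List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
			for (int j = 0; j < items.size(); j++) {
				String itemsNew = "https:" + items.get(j);
				page.addTargetRequest(itemsNew);
			}
			// Listing pages carry no fields of their own, so skip the pipeline for them
			page.setSkip(true);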
		}
		String urltwo = page.getRequest().getUrl();
		// Product pages: extract the fields
		if (urltwo.startsWith("https://item.jd.com")) {
			// Product name
			String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
			page.putField("name", name);
			// Product id taken from the item URL (stored under the key "url")
			String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
			page.putField("url", id);
			// Price endpoint: https://p.3.cn/prices/mgets?skuIds=J_10026711061553
			// Review list endpoint: https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100014348492&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
			// Review summary endpoint: https://club.jd.com/comment/productCommentSummaries.action?referenceIds=100014348492
			// Fetch and parse the review summary
			String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
			JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
			List<CommentsCount> count = jrt.getCommentsCount();
			page.putField("goodRateShow", count.get(0).getGoodRateShow());
			page.putField("comment", count.get(0).getCommentCountStr());

			String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
			List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
			page.putField("price", prices.get(0).getP());
		}
	}

	@Override
	public Site getSite() {
		// TODO Auto-generated method stub
		return site;
	}
}

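The program also needs the three JavaBean classes given above (Price, CommentsCount and JsonRootBean). With those, the complete code has been listed.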
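
The process method above decides what to do with a page by testing URL prefixes in two separate if statements. As an illustrative sketch only, that routing can be made explicit with a small classifier; the class name PageRouter, the enum PageKind and the method classify are hypothetical names, while the two prefixes are the ones used in the program.

```java
// Hypothetical helper class; not part of the article's program.
final class PageRouter {
    enum PageKind { LIST, ITEM, OTHER }

    // Prefixes taken from the startsWith checks in process().
    static final String LIST_PREFIX =
            "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E";
    static final String ITEM_PREFIX = "https://item.jd.com";

    private PageRouter() {}

    /** Classifies a request URL as a listing page, a product page, or neither. */
    static PageKind classify(String url) {
        if (url == null) {
            return PageKind.OTHER;
        }
        if (url.startsWith(LIST_PREFIX)) {
            return PageKind.LIST;
        }
        if (url.startsWith(ITEM_PREFIX)) {
            return PageKind.ITEM;
        }
        return PageKind.OTHER;
    }
}
```

Inside process, a switch on classify(page.getRequest().getUrl()) could then hold the listing-page branch and the product-page branch, which keeps each branch's logic in one place.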
