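When we configure Nutch to crawl http://yangshangchuan.iteye.com, every page it fetches contains only the message "您的访问请求被拒绝 ......" ("Your access request has been denied ......"). This is the simplest anti-crawler strategy: the site merely reads the User-Agent HTTP request header to decide whether the client is a person (a browser) or a crawler. To get around this restriction, we only need to configure Nutch to simulate a web browser.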
nutch-default.xml contains five configuration properties related to User-Agent:
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are combined into the User-Agent string:
this.userAgent = getAgentString(conf.get("http.agent.name"),
                                conf.get("http.agent.version"),
                                conf.get("http.agent.description"),
                                conf.get("http.agent.url"),
                                conf.get("http.agent.email"));

private static String getAgentString(String agentName,
                                     String agentVersion,
                                     String agentDesc,
                                     String agentURL,
                                     String agentEmail) {

  if ( (agentName == null) || (agentName.trim().length() == 0) ) {
    // TODO : NUTCH-258
    if (LOGGER.isErrorEnabled()) {
      LOGGER.error("No User-Agent string set (http.agent.name)!");
    }
  }

  StringBuffer buf= new StringBuffer();

  buf.append(agentName);
  if (agentVersion != null) {
    buf.append("/");
    buf.append(agentVersion);
  }
  if ( ((agentDesc != null) && (agentDesc.length() != 0))
    || ((agentEmail != null) && (agentEmail.length() != 0))
    || ((agentURL != null) && (agentURL.length() != 0)) ) {
    buf.append(" (");

    if ((agentDesc != null) && (agentDesc.length() != 0)) {
      buf.append(agentDesc);
      if ( (agentURL != null) || (agentEmail != null) )
        buf.append("; ");
    }

    if ((agentURL != null) && (agentURL.length() != 0)) {
      buf.append(agentURL);
      if (agentEmail != null)
        buf.append("; ");
    }

    if ((agentEmail != null) && (agentEmail.length() != 0))
      buf.append(agentEmail);

    buf.append(")");
  }
  return buf.toString();
}
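To make the joining rule concrete, here is a minimal sketch that restates it in simplified form: name, then "/" plus version, then the non-empty optional parts in parentheses separated by "; ". The class name AgentStringSketch, the method name buildAgentString, and the sample values (MyCrawler, example.com) are hypothetical and not part of Nutch. Note one deliberate difference: the original method only checks the later fields for null before appending a separator, so its output can differ from this sketch when a field is an empty string.

```java
import java.util.ArrayList;
import java.util.List;

public class AgentStringSketch {

    /**
     * Simplified restatement of the joining rule (not the Nutch source):
     * name[/version][ (description; url; email)]
     */
    static String buildAgentString(String name, String version,
                                   String description, String url, String email) {
        StringBuilder buf = new StringBuilder(name);
        if (version != null) {
            buf.append('/').append(version);
        }
        // Collect only the optional parts that are actually set.
        List<String> extras = new ArrayList<>();
        for (String part : new String[] {description, url, email}) {
            if (part != null && !part.isEmpty()) {
                extras.add(part);
            }
        }
        if (!extras.isEmpty()) {
            buf.append(" (").append(String.join("; ", extras)).append(')');
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Only name and version set: prints "MyCrawler/1.0"
        System.out.println(buildAgentString("MyCrawler", "1.0", "", "", ""));
        // All five set (hypothetical values)
        System.out.println(buildAgentString("MyCrawler", "1.0", "test crawler",
                "http://example.com/bot", "bot at example dot com"));
    }
}
```

The key point for what follows is that, when description, URL, and email are all empty, the resulting string is simply the name and the version joined by a slash.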
The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java sets the User-Agent request header. The userAgent returned by http.getUserAgent() here is the same userAgent that was built in HttpBase.java:
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}
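As a rough illustration of where this header line ends up, the sketch below assembles a complete raw request in the same append style. Only the User-Agent handling mirrors the excerpt above; the class RequestHeaderSketch, the method buildRequest, the request line, and the Host header are assumptions added for illustration and are not taken from HttpResponse.java.

```java
public class RequestHeaderSketch {

    /**
     * Builds a minimal raw HTTP request. The request line and Host header
     * are illustrative assumptions; only the User-Agent branch follows
     * the logic quoted from HttpResponse.java.
     */
    static String buildRequest(String host, String path, String userAgent) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        if (userAgent != null && !userAgent.isEmpty()) {
            // Header is sent only when a user agent is configured.
            reqStr.append("User-Agent: ");
            reqStr.append(userAgent);
            reqStr.append("\r\n");
        }
        reqStr.append("\r\n"); // blank line terminates the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("example.com", "/", "MyCrawler/1.0"));
    }
}
```

Whatever string the configuration produces is sent verbatim after "User-Agent: ", which is why the server cannot tell it apart from a browser that sends the same string.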
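From the analysis above, adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Because the agent string is built as http.agent.name + "/" + http.agent.version, each browser's User-Agent string is split at one of its slashes: the part before it goes into http.agent.name and the part after it into http.agent.version.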
1. Simulating Firefox:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>20100101 Firefox/27.0</value>
</property>
2. Simulating IE:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>6.0)</value>
</property>
3. Simulating Chrome:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>537.36</value>
</property>
4. Simulating Safari:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>534.57.2</value>
</property>
5. Simulating Opera:

<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>19.0.1326.59</value>
</property>
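As a quick sanity check, the sketch below joins each name/version pair from the five configurations with a "/" and prints the User-Agent string that should result. It assumes http.agent.description, http.agent.url, and http.agent.email are left empty, so no parenthesized suffix is appended. The class BrowserAgentCheck and its methods are hypothetical helpers, not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserAgentCheck {

    /** Joins the two properties as name + "/" + version (other fields assumed empty). */
    static String join(String agentName, String agentVersion) {
        return agentName + "/" + agentVersion;
    }

    /** Maps each browser to the User-Agent string its configuration yields. */
    static Map<String, String> browserAgents() {
        Map<String, String> agents = new LinkedHashMap<>();
        agents.put("Firefox", join(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko",
            "20100101 Firefox/27.0"));
        agents.put("IE", join(
            "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident",
            "6.0)"));
        agents.put("Chrome", join(
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari",
            "537.36"));
        agents.put("Safari", join(
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari",
            "534.57.2"));
        agents.put("Opera", join(
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR",
            "19.0.1326.59"));
        return agents;
    }

    public static void main(String[] args) {
        browserAgents().forEach((browser, ua) -> System.out.println(browser + ": " + ua));
    }
}
```

For example, the Firefox pair yields Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0, which can be compared against what a real browser reports.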
Postscript: ways to check your User-Agent:
1. http://www.useragentstring.com
2. http://www.enhanceie.com/ua.aspx