Early on a weekend morning I received an alert email. My first guess was that the site was under attack, or else that it was a cache/log/memory problem. A quick look at access.log showed that during that period a wave of bots (bot: a computer program that performs a particular task again and again, many times over) had been visiting my site:
http://ltx71.com
http://mj12bot.com
http://www.bing.com/bingbot.htm
http://ahrefs.com/robot/
http://yandex.com/bots
A quick web search revealed that many webmasters have run into the same problem: short bursts of intensive bot traffic create spikes that leave the server unable to serve other clients. From the analysis in that article, there are several ways to block these web bots.
1. robots.txt
Many web crawlers fetch robots.txt first, as shown below:
"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" |
Many bot operators also explain what to do if you do not want to be crawled. Take MJ12bot as an example:
How can I block MJ12bot?

MJ12bot adheres to the robots.txt standard. If you want to prevent your website from being crawled by the bot, add the following text to your robots.txt:

User-agent: MJ12bot
Disallow: /

Please do not waste your time trying to block the bot via IP in .htaccess - we do not use any consecutive IP blocks, so your efforts will be in vain. Also please make sure the bot can actually retrieve robots.txt itself - if it can't, then it will assume (this is the industry practice) that it's okay to crawl your site. If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: bot@majestic12.co.uk. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.

How can I slow down MJ12bot?

You can easily slow down the bot by adding the following to your robots.txt file:

User-Agent: MJ12bot
Crawl-Delay: 5

Crawl-Delay should be an integer and it signifies the number of seconds to wait between requests. MJ12bot will delay up to 20 seconds between requests to your site - note, however, that while it is unlikely, it is still possible your site may have been crawled by multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. This Crawl-Delay parameter will also be honoured if it was set for the * wildcard. If our bot detects that you used Crawl-Delay for any other bot, then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
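The FAQ's point about the bot being able to retrieve robots.txt is easy to sanity-check from your own machine: request the file while sending that bot's user-agent string. A minimal sketch (example.com is a placeholder for your own domain; this only confirms that your server serves the file to that user-agent string, not that the bot's network can reach it):

curl -I -A "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" http://example.com/robots.txt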
We can therefore write a robots.txt like the following:
User-agent: YisouSpider
Disallow: /

User-agent: EasouSpider
Disallow: /

User-agent: EtaoSpider
Disallow: /

User-agent: MJ12bot
Disallow: /
In addition, given that many bots also probe paths such as:
/wp-login.php
/trackback/
/?replytocom=
…
and many WordPress sites genuinely use these paths, how can we adjust things without breaking functionality?
robots.txt before:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-includes
Disallow: /?s=

robots.txt after:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-*
Allow: /wp-content/uploads/
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /comments
Disallow: /wp-includes
Disallow: /*/trackback
Disallow: /*?replytocom*
Disallow: /?p=*&preview=true
Disallow: /?s=
That said, plenty of crawlers simply ignore robots.txt. Take this one, for example: it never fetched robots.txt at all before crawling.
"10.70.8.30, 163.172.65.40" - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" |
When that happens, we have to fall back on the other methods.
2. .htaccess
The idea is to use URL rewriting: whenever a request is found to come from one of these user agents, deny it access. The article by "~吉尔伽美什" covers many uses of .htaccess, including its section "5. Blocking users by IP". A sketch of a user-agent-based rule set is shown below.
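Here is a minimal .htaccess sketch (assuming mod_rewrite is enabled; the user-agent names are taken from the logs above and the list is only an example, not exhaustive):

RewriteEngine On
# Case-insensitive match on offending user-agent substrings
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|ltx71) [NC]
# Refuse matching requests with 403 Forbidden
RewriteRule ^ - [F,L]

Unlike robots.txt, this is enforced by the server itself, so it works even against crawlers that ignore robots.txt.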
3. Denying access by IP
You can deny access from certain IPs in the Apache configuration file httpd.conf:
<Directory "/var/www/html"> Order allow,deny Allow from all Deny from 5.9.26.210 Deny from 162.243.213.131 </Directory> |
However, these IPs are often not fixed, which makes this approach inconvenient; in addition, any change to httpd.conf requires an Apache restart to take effect. Modifying .htaccess instead is therefore recommended.
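The same access-control directives can also live in a per-directory .htaccess file, where they take effect without a restart (this assumes the server config permits it via AllowOverride Limit). A minimal sketch using the sample IPs from above:

# Apache 2.2-style syntax, matching the httpd.conf example above
Order allow,deny
Allow from all
Deny from 5.9.26.210
Deny from 162.243.213.131

On Apache 2.4 and later, the equivalent mechanism is the Require directive, e.g. "Require not ip 5.9.26.210" inside a <RequireAll> block.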