So many bots, so little time

The number of bots crawling my server is getting out of hand. A quick survey of the log files showed that about two thirds of all requests come from bots. Many are legitimate (the nice folks at Bloglines or the billionaires at Google), but a lot are at least suspicious, if not known to be evil.
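For anyone who wants to run a similar survey, here is a minimal sketch in Python. It assumes the access log is in Apache's combined log format (user agent as the last quoted field), and the list of hint substrings is my own rough guess rather than anything authoritative. Bots that pretend to be a regular browser will slip through, so treat the result as a lower bound.

```python
import re
from collections import Counter

# Combined log format: the user agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Assumed hints: substrings that often appear in crawler user agents.
# Extend this with whatever shows up in your own logs.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")


def user_agent(line):
    """Return the user agent string of a log line, or '' if none is found."""
    match = UA_PATTERN.search(line)
    return match.group(1) if match else ""


def is_bot(agent):
    """Heuristic check: does the user agent contain one of the hints?"""
    lowered = agent.lower()
    return any(hint in lowered for hint in BOT_HINTS)


def survey(lines):
    """Count all requests and tally the ones that look like bot traffic."""
    total = 0
    bots = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        agent = user_agent(line)
        if is_bot(agent):
            bots[agent] += 1
    return total, bots


def report(path):
    """Print the bot share and the ten busiest bot user agents for a log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        total, bots = survey(handle)
    bot_requests = sum(bots.values())
    share = bot_requests / total if total else 0.0
    print(f"{bot_requests} of {total} requests ({share:.0%}) look like bots")
    for agent, count in bots.most_common(10):
        print(f"{count:8d}  {agent}")
```

Calling report("access.log") (the file name is just a placeholder) prints the share and the busiest bot user agents, which also makes a handy list of names to look up.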

Googling for the bot name (if one is given in the User-Agent header, which Apache exposes as HTTP_USER_AGENT) leads to plenty of discussion threads. They list many crawlers that you don’t want visiting your site (email harvesters, image harvesters, spam bots, etc.) and many others of unknown purpose, which in this day and age means you most likely want to block them too. Very interesting is this three part thread over at WebmasterWorld, which discusses a few of the bots and, more importantly, good ways to get rid of them, especially those that ignore your robots.txt. There are many similar threads elsewhere.

I followed the consensus and decided to be a little more aggressive - a lengthy list of bots now simply gets a 403 Forbidden response from the Apache server. mod_rewrite is your friend.
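To give an idea of what such a block looks like, here is a small Python sketch that generates the mod_rewrite directives from a list of user agent names. The names in the list are placeholders for illustration only, not my actual blocklist, and the output is meant for the server config or an .htaccess file, assuming mod_rewrite is enabled.

```python
import re

# Placeholder blocklist for illustration; replace these with the
# user agent names you actually want to turn away.
BLOCKED_AGENTS = ["EmailSiphon", "WebZIP", "ExampleImageGrabber"]


def rewrite_rules(agents):
    """Build mod_rewrite directives that answer the listed agents with 403."""
    if not agents:
        return ""
    lines = ["RewriteEngine On"]
    for index, name in enumerate(agents):
        # Every condition except the last is chained with OR;
        # NC makes the match case-insensitive.
        flags = "[NC,OR]" if index < len(agents) - 1 else "[NC]"
        # Escape regex metacharacters so the name is matched literally.
        pattern = re.escape(name)
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {pattern} {flags}")
    # "-" means no substitution; the F flag returns 403 Forbidden.
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines)


def print_rules():
    """Print the directives for the placeholder blocklist."""
    print(rewrite_rules(BLOCKED_AGENTS))
```

The patterns are unanchored, so a name matches anywhere in the user agent string. That keeps the list short, but it also means a very short or generic name can block more than intended, so it is worth checking the logs after adding an entry.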

Since I started blocking most of the bots I have noticed two good side effects: on the one hand less clutter in the log files, on the other hand less traffic, which means better response times for the people actually looking at my blogs (I had one bot pulling about 50MB worth of images from the site over and over again).


1 Comment so far

  1. [...] was just a couple of weeks ago that I posted about the number of undesirable bots crawling my site. Over the last few days suddenly the number of genuine bots has just exploded. [...]
