Google and friends crawling like hell

The Linux portal I manage receives 3000 to 4000 page views a day, as reported by several analytics tools that rely on JavaScript, with no noscript fallback. This means crawling bots are completely excluded from those stats, since they usually fetch pages with Perl scripts (or similar) that never run the JavaScript.

Yesterday I installed mod_defensible for apache2 on my private server. It's a simple module that looks visitors up in a blacklist of spammers, malicious bots & company and automatically gives them a 403 error (access denied). So I decided to put together a little report about the accesses on my server.

Using apache2 I created a new log that records only the information I needed (IP and Unix timestamp) and leaves out the noise (no CSS/image/JS requests, only .html and .php pages). Then I wrote a quick bash daemon that parses those logs and loads the data into my Postgres DB every hour.
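
The counting step can be sketched in a few lines of Python. This is not the bash daemon described above, just an illustration, and it assumes each log line looks like "<ip> <unix_timestamp>" separated by whitespace (the exact layout of the real log is not shown here). The function name and the sample timestamps are made up for the example.

```python
from collections import Counter

def count_hits_per_ip(lines):
    """Count requests per IP from lines shaped like '<ip> <unix_timestamp>'.

    The two-field layout is an assumption made for this sketch.
    """
    hits = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        hits[parts[0]] += 1
    return hits

# Hypothetical sample data; the IPs are from the post, the timestamps are invented.
sample = [
    "66.249.70.132 1230000000",
    "74.6.17.151 1230000005",
    "66.249.70.132 1230000009",
    "",
    "garbage line here",
]

top = count_hits_per_ip(sample).most_common(2)
print(top)
```

Sorting by count with most_common() is what produces a "top crawlers" ranking like the one further down.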

The results are amazing.

In 24 hours I got a total of 20500 accesses to linuxfeed, and 9847 of those were made by 66.249.70.132. Know this guy? Yes, it’s Google… OK, I’ll repeat it in case you didn’t get it: 20500 accesses, 9847 by Google.

That’s incredible, but it doesn’t end there. My leaderboard continues with 969 accesses from 74.6.17.151 (Yahoo), 903 from 66.249.71.237 (Google again) and finally 340 from 67.195.37.89 (Yahoo again). The total traffic I received from ‘good’ spiders is 12059 page views. That is more than half of my whole site traffic. That is more than half of my server load.
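
For anyone who wants to check the arithmetic, here is a tiny Python sketch that adds up the four spider counts quoted above and compares them with the 24-hour total. All numbers come from this post; the variable names are only illustrative.

```python
# Figures quoted in the post for one 24-hour window.
total_hits = 20500
spider_hits = {
    "66.249.70.132 (google)": 9847,
    "74.6.17.151 (yahoo)": 969,
    "66.249.71.237 (google)": 903,
    "67.195.37.89 (yahoo)": 340,
}

spider_total = sum(spider_hits.values())
spider_share = spider_total / total_hits

print(spider_total, f"{spider_share:.1%}")  # prints: 12059 58.8%
```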

Then I focused on the stats for the blocked bad guys. Mod_defensible did a great job, stopping 6000+ accesses from blacklisted IPs. In particular I noticed this guy (200.35.148.96) who made 5024 requests to my webserver. He received 5024 403 errors; perhaps this bot should have been coded in a better way…

I just want to point out that I didn’t write up these stats to criticize Google or other search engines. If I want to be found on the net, I have to accept that they crawl me a lot. Google lets you decrease the crawl rate from Webmaster Tools (and I did it as soon as I saw these stats!) and robots.txt lets you keep crawlers away. So at any moment you can decide to become unreachable for everyone in the months to come.
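
To give an idea of what a robots.txt rule looks like and how a well-behaved crawler reads it, here is a small sketch using Python's standard urllib.robotparser. The rules and paths below are hypothetical, not the ones on my server. "Slurp" is the user-agent token of Yahoo's crawler, and Crawl-delay is a non-standard directive that not every crawler honors, so treat it as a request rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
robots_txt = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Slurp may fetch normal pages, but is asked to wait between requests.
slurp_allowed = parser.can_fetch("Slurp", "/index.html")
slurp_delay = parser.crawl_delay("Slurp")

# Every other bot is told to stay out of /private/.
other_allowed = parser.can_fetch("SomeBot", "/private/page.php")

print(slurp_allowed, slurp_delay, other_allowed)
```

Of course this only works with crawlers that bother to read robots.txt; the blacklisted bots above clearly belong to a different category.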

It’s just that I could not even imagine that their traffic might be more (a lot more) than that of ‘real visitors’, and I wanted to share my surprise… this web is perl-ed a lot!