Forum Discussion

Mike_62629
Nimbostratus
Jul 16, 2008

Rate limiting Search Spiders

We're currently having problems with web spiders hammering our web servers, using up the available sessions in our application and consuming a large amount of our bandwidth. We're interested in rate-limiting them.

I found what appeared to be a very relevant iRule at http://devcentral.f5.com/Default.aspx?tabid=109 (third place winner), but when I try to load it up in the iRule editor it complains. It complains, I believe, because HTTP headers are not available from within CLIENT_ACCEPTED and CLIENT_CLOSED logic. That makes sense, because CLIENT_ACCEPTED and CLIENT_CLOSED are associated with building and tearing down TCP connections (I believe), so no HTTP data (headers or request URIs) has been transferred at those points.

Does anyone have any suggestions on how to accomplish this or something similar?
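
The User-Agent header only becomes readable once a request has been parsed, i.e. inside the HTTP_REQUEST event, not at connection setup or teardown. A minimal sketch of checking it there, with an illustrative token and log line rather than anything taken from the contest iRule:

    when HTTP_REQUEST {
        # Header data exists only after the request has been parsed,
        # which is why CLIENT_ACCEPTED and CLIENT_CLOSED cannot see it.
        set ua [string tolower [HTTP::header "User-Agent"]]
        if { $ua contains "googlebot" } {
            log local0. "Crawler request from [IP::client_addr]: $ua"
        }
    }
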
  • @Colin, I am on version 10.2.1. I would be very interested in some working examples.

    Personally, I think I would rather just have the iRule itself contain the list.

    And of course performance is important.

    Thanks

  • I think you can tell Google and Bing to crawl your sites at a slower rate:

    http://www.bing.com/community/site_blogs/b/webmaster/archive/2009/08/10/crawl-delay-and-the-bing-crawler-msnbot.aspx

    In the robots.txt file, within the generic user agent section, add the Crawl-delay directive as shown in the example below:

    User-agent: *
    Crawl-delay: 1

    http://googlewebmastercentral.blogspot.com/2008/12/more-control-of-googlebots-crawl-rate.html

    "We've upgraded the crawl rate setting in Webmaster Tools so that webmasters experiencing problems with Googlebot can now provide us more specific information. Crawl rate for your site determines the time used by Googlebot to crawl your site on each visit."

    If those options don't work for you, it might be better to assign a rate class to search engine spiders rather than sending back a 503. It should add less overhead on LTM and provide faster overall crawl times. Of course, I'm not an SEO expert, so this is something you might want to research before using.

    You could use a list of spider user-agents like this to identify spiders:

    http://www.useragentstring.com/pages/Crawlerlist/

    You could either check the User-Agent header with a switch statement or put the header tokens in a data group and use the class command to do the lookup. Once you identify a spider, you can assign a rate class (a rough sketch of both approaches is at the end of this thread):

    http://devcentral.f5.com/wiki/iRules.switch.ashx
    http://devcentral.f5.com/wiki/iRules.class.ashx
    http://devcentral.f5.com/wiki/iRules.rateclass.ashx

    Aaron

  • Hi there, we are a SaaS company and therefore don't have control over all of our communities' submissions to search engines. We also allow each of our communities to have its own robots.txt file, and those robots.txt files are generated and served up by code rather than sitting in the root of a typical web server.

    That is why I need to control crawler speed from the F5.

    Searching this site, the beginning of this thread seems to be exactly what I am after. Now I am just hoping someone has a working example and can help me.

    Thanks for your input.
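
Following up on Aaron's reply, here is a rough, untested sketch of both approaches. The rate class name (spider_rate) and the data group name (spider_agents) are placeholders I have assumed; the rate class is an ordinary LTM rate shaping class and the data group is a string class of crawler User-Agent tokens, both created outside the iRule.

Switch-based lookup:

    when HTTP_REQUEST {
        # Match a few common crawler tokens in the User-Agent header.
        # This token list is illustrative; extend it from the crawler
        # list linked above.
        switch -glob [string tolower [HTTP::header "User-Agent"]] {
            "*googlebot*" -
            "*bingbot*" -
            "*msnbot*" -
            "*slurp*" -
            "*yandex*" {
                # Throttle the connection with the pre-defined rate class
                rateclass spider_rate
            }
        }
    }

Data group lookup (v10.x class syntax):

    when HTTP_REQUEST {
        # spider_agents is assumed to be a string data group whose
        # entries are substrings of crawler User-Agent values.
        if { [class match [string tolower [HTTP::header "User-Agent"]] contains spider_agents] } {
            rateclass spider_rate
        }
    }

Either way, the User-Agent check happens in HTTP_REQUEST, where the header is actually available. Clients can spoof the header, so treat this as best-effort throttling rather than access control.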