Forum Discussion
Mike_62629
Nimbostratus
Jul 16, 2008
Rate limiting Search Spiders
We're currently having problems with web spiders hammering our web servers, using up the available sessions in our application and a large amount of our bandwidth. We're interested in rate-limiting them.
I found what appeared to be a very relevant iRule at http://devcentral.f5.com/Default.aspx?tabid=109 (third place winner), but when I try to load it in the iRule editor it complains. It complains, I believe, because HTTP headers are not available from within CLIENT_ACCEPTED and CLIENT_CLOSED logic. That makes sense, since CLIENT_ACCEPTED and CLIENT_CLOSED are associated with building and tearing down TCP connections (I believe), so no data (headers/request URIs) has been transferred at that point.
Does anyone have any suggestions on how to accomplish this or something similar?
13 Replies
- hoolio
Cirrostratus
If you changed the CLIENT_ACCEPTED event to HTTP_REQUEST and CLIENT_CLOSED to HTTP_RESPONSE, you'd get per-HTTP-request throttling instead of per-TCP-connection throttling (Click here).
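For what it's worth, a rough, untested sketch of that event swap might look like the following. It assumes the same Crawlers data group used later in this thread; the pending-request cap and the 503 text are made-up placeholders:

when RULE_INIT {
    # Tracks how many requests are currently in flight per crawler User-Agent.
    array set ::pending_by_agent { }
    set ::max_pending 5
}

when HTTP_REQUEST {
    set counted 0
    set ua [string tolower [HTTP::header "User-Agent"]]
    if { [matchclass $ua contains $::Crawlers] } {
        if { ![info exists ::pending_by_agent($ua)] } {
            set ::pending_by_agent($ua) 0
        }
        if { $::pending_by_agent($ua) >= $::max_pending } {
            # Too many requests already pending for this crawler.
            HTTP::respond 503 content "Server busy, please retry later."
            return
        }
        incr ::pending_by_agent($ua)
        set counted 1
    }
}

when HTTP_RESPONSE {
    # HTTP_RESPONSE only fires for responses coming back from the servers,
    # so this only undoes increments made for requests that reached a server.
    if { [info exists counted] && $counted } {
        incr ::pending_by_agent($ua) -1
    }
}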
Aaron
- Mike_62629
Nimbostratus
I was thinking about that; it would limit to N pending requests.
I was worried about how to respond when I want to reject the request. It's easy at the TCP level: just reject the connection. At the request level I have to think about how to respond without negatively impacting our page rank.
I'm currently working on a rule that would rate limit to N requests per M seconds, and it has the same problem. How do I tell MSN Search to buzz off without pissing them off too much? Though this won't be a big problem in the future, as we've added Crawl-delay to our robots.txt for future crawling.
- Mike_62629
Nimbostratus
Here's where I am with my rate limiting iRule, though I haven't even checked to see if it'll parse yet:

when RULE_INIT {
    array set ::active_crawlers { }
    set ::min_interval 1
    set ::rate_limit_message "You've been rate limited for sending more than 1 request every $::min_interval seconds."
}

when HTTP_REQUEST {
    set user_agent [string tolower [HTTP::header "User-Agent"]]
    if { [matchclass $user_agent contains $::Crawlers] } {
        # Throttle crawlers.
        set curr_time [clock seconds]
        if { [info exists ::active_crawlers($user_agent)] } {
            if { $::active_crawlers($user_agent) < $curr_time } {
                set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
            } else {
                # Block it somehow.
                HTTP::respond 500 content $::rate_limit_message
            }
        } else {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
        }
    }
}

- hoolio
Cirrostratus
You can still use reject to send a TCP reset (even from an HTTP_ event). I haven't delved into search engine optimization, but technically it would be appropriate to send a 503 response back.
Google seems to handle the 503 as you'd hope:
http://googlewebmastercentral.blogspot.com/2006/08/all-about-googlebot.html
If my site is down for maintenance, how can I tell Googlebot to come back later rather than to index the "down for maintenance" page?
You should configure your server to return a status of 503 (network unavailable) rather than 200 (successful). That lets Googlebot know to try the pages again later.
What should I do if Googlebot is crawling my site too much?
You can contact us -- we'll work with you to make sure we don't overwhelm your server's bandwidth. We're experimenting with a feature in our webmaster tools for you to provide input on your crawl rate, and have gotten great feedback so far, so we hope to offer it to everyone soon.
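If you do go the HTTP route rather than a raw reset, the HTTP::respond 500 in the draft above could be swapped for a 503 along these lines (untested sketch; the Retry-After value and Content-Type are placeholders):

    # Ask well-behaved bots to back off and retry later.
    HTTP::respond 503 content $::rate_limit_message \
        "Retry-After" "120" "Content-Type" "text/plain"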
Aaron
- Mike_62629
Nimbostratus
Here's the rate limiting approach:

when RULE_INIT {
    array set ::active_crawlers { }
    set ::min_interval 1
}

when HTTP_REQUEST {
    set user_agent [string tolower [HTTP::header "User-Agent"]]
    # Logic only relevant for crawler user agents
    if { [matchclass $user_agent contains $::Crawlers] } {
        # Throttle crawlers.
        set curr_time [clock seconds]
        if { [info exists ::active_crawlers($user_agent)] } {
            if { $::active_crawlers($user_agent) < $curr_time } {
                set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
            } else {
                reject
            }
        } else {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
        }
    }
}

- hoolio
Cirrostratus
That looks like a reasonable solution. I did some searching on MSN's search site (http://blogs.msdn.com/livesearch/), but couldn't find any indication of how the MSN bot handles 503 responses.
Aaron
- Mike_62629
Nimbostratus
Thanks for the help.
- aneilsingh_5064
Nimbostratus
Hi,
I am not a developer, so bear with me and my limited understanding.
Is there a final confirmed solution for Microsoft bots? I am having a problem with Bingbot and would love to limit the rate.
I do not see anything in the example Mike put up that specifies MSN.
Thanks
- Colin_Walker_12
Historic F5 Account
The reason you don't see his code explicitly stating anything about MSN is because he is using a class to store the bots that he wants to rate limit. The line that says:
if { [matchclass $user_agent contains $::Crawlers] } {
is doing a lookup in a data group (class) named Crawlers and is limiting any request with a User-Agent matching those strings. That being said, this is a very old approach to this solution. If you are in need of a bot rate-limiting rule and are on a version newer than 10.0, there are more efficient ways to go about this.
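As a rough illustration of the kind of newer approach that is possible on 10.x (an untested sketch, not from the original post): the session table can replace the global array, and class match replaces matchclass. The data group name crawlers_class, the one-second window, and the 503 details below are assumptions:

when HTTP_REQUEST {
    set ua [string tolower [HTTP::header "User-Agent"]]
    # "class match" is the v10+ replacement for matchclass; crawlers_class
    # is an assumed data group holding the bot User-Agent substrings.
    if { [class match $ua contains crawlers_class] } {
        # -notouch keeps the lookup from resetting the entry's timer.
        if { [table lookup -notouch -subtable crawler_throttle $ua] ne "" } {
            # This crawler has already made a request in the last second.
            HTTP::respond 503 content "Throttled" "Retry-After" "1"
            return
        }
        # Record the request; the entry expires on its own after one second,
        # so there is no global array to maintain or clean up.
        table set -subtable crawler_throttle $ua 1 1
    }
}

Global variables also cause CMP demotion on 10.x, which is one reason the table command is generally preferred there.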
Colin
- aneilsingh_5064
Nimbostratus
@Colin, I am on version 10.2.1. I would be very interested in some working examples.
Personally, I think I would rather just have the actual iRule contain the list. Not sure why you use the class.
And of course performance is important.
Thanks
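On the class-versus-inline-list question above: the list can certainly live in the iRule itself, for example with a switch on the User-Agent (untested sketch; the bot strings are only examples). The usual reason for a data group is that the list can then be edited without touching the iRule:

when HTTP_REQUEST {
    set ua [string tolower [HTTP::header "User-Agent"]]
    # Hard-coded bot list kept in the iRule instead of a data group.
    switch -glob $ua {
        "*msnbot*" -
        "*bingbot*" -
        "*googlebot*" -
        "*slurp*" {
            # ...rate-limiting logic from the earlier examples goes here...
        }
    }
}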