Forum Discussion
Mike_62629
Nimbostratus
Jul 16, 2008
Rate limiting Search Spiders
We're currently having problems with web spiders hammering our web servers, using up the available sessions in our application and a large amount of our bandwidth. We're interested in rate-limiting them.
I found what appeared to be a very relevant iRule at http://devcentral.f5.com/Default.aspx?tabid=109 (third place winner), but when I try to load it in the iRule editor it complains. It complains, I believe, because HTTP headers are not available from within CLIENT_ACCEPTED and CLIENT_CLOSED logic. That makes sense, since CLIENT_ACCEPTED and CLIENT_CLOSED are associated with building and tearing down TCP connections (I believe), so no data (headers/request URIs) has been transferred at that point.
Does anyone have any suggestions on how to accomplish this or something similar?
13 Replies
- hoolio
Cirrostratus
If you changed the CLIENT_ACCEPTED event to HTTP_REQUEST and CLIENT_CLOSED to HTTP_RESPONSE, you'd get per-HTTP-request throttling instead of per-TCP-connection throttling (Click here).
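For what it's worth, a rough, untested sketch of that event swap might look like the following. It assumes the same Crawlers data group used later in this thread; the pending-request cap and the 503 text are made-up placeholders:

when RULE_INIT {
    # Tracks how many requests are currently in flight per crawler User-Agent.
    array set ::pending_by_agent { }
    set ::max_pending 5
}

when HTTP_REQUEST {
    set counted 0
    set ua [string tolower [HTTP::header "User-Agent"]]
    if { [matchclass $ua contains $::Crawlers] } {
        if { ![info exists ::pending_by_agent($ua)] } {
            set ::pending_by_agent($ua) 0
        }
        if { $::pending_by_agent($ua) >= $::max_pending } {
            # Too many requests already pending for this crawler.
            HTTP::respond 503 content "Server busy, please retry later."
            return
        }
        incr ::pending_by_agent($ua)
        set counted 1
    }
}

when HTTP_RESPONSE {
    # HTTP_RESPONSE only fires for responses coming back from the servers,
    # so this only undoes increments made for requests that reached a server.
    if { [info exists counted] && $counted } {
        incr ::pending_by_agent($ua) -1
    }
}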
Aaron
- Mike_62629
Nimbostratus
I was thinking about that; it would limit to N pending requests.
I was worried about how to respond when I want to reject the request. It's easy at the TCP level: just reject the connection. At the request level I have to think about how to respond without negatively impacting our page rank.
I'm currently working on a rule that would rate limit to N requests per M seconds, and it has the same problem. How do I tell MSN Search to buzz off without pissing them off too much? Though this won't be a big problem in the future, as we've added Crawl-delay to our robots.txt for future crawling.
- Mike_62629
Nimbostratus
Here's where I am with my rate limiting iRule, though I haven't even checked to see if it'll parse yet:

when RULE_INIT {
    array set ::active_crawlers { }
    set ::min_interval 1
    set ::rate_limit_message "You've been rate limited for sending more than 1 request every $::min_interval seconds."
}

when HTTP_REQUEST {
    set user_agent [string tolower [HTTP::header "User-Agent"]]
    if { [matchclass $user_agent contains $::Crawlers] } {
        # Throttle crawlers.
        set curr_time [clock seconds]
        if { [info exists ::active_crawlers($user_agent)] } {
            if { $::active_crawlers($user_agent) < $curr_time } {
                set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
            } else {
                # Block it somehow.
                HTTP::respond 500 content $::rate_limit_message
            }
        } else {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
        }
    }
}

- hoolio
Cirrostratus
You can still use reject to send a TCP reset (even from an HTTP_ event). I haven't delved into search engine optimization, but technically it would be appropriate to send a 503 response back.
Google seems to handle the 503 as you'd hope:
http://googlewebmastercentral.blogspot.com/2006/08/all-about-googlebot.html
If my site is down for maintenance, how can I tell Googlebot to come back later rather than to index the "down for maintenance" page?
You should configure your server to return a status of 503 (network unavailable) rather than 200 (successful). That lets Googlebot know to try the pages again later.
What should I do if Googlebot is crawling my site too much?
You can contact us -- we'll work with you to make sure we don't overwhelm your server's bandwidth. We're experimenting with a feature in our webmaster tools for you to provide input on your crawl rate, and have gotten great feedback so far, so we hope to offer it to everyone soon.
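If you do go the HTTP route rather than a raw reset, the HTTP::respond 500 in the draft above could be swapped for a 503 along these lines (untested sketch; the Retry-After value and Content-Type are placeholders):

    # Ask well-behaved bots to back off and retry later.
    HTTP::respond 503 content $::rate_limit_message \
        "Retry-After" "120" "Content-Type" "text/plain"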
Aaron
- Mike_62629
Nimbostratus
Here's the rate limiting approach:

when RULE_INIT {
    array set ::active_crawlers { }
    set ::min_interval 1
}

when HTTP_REQUEST {
    set user_agent [string tolower [HTTP::header "User-Agent"]]
    # Logic only relevant for crawler user agents
    if { [matchclass $user_agent contains $::Crawlers] } {
        # Throttle crawlers.
        set curr_time [clock seconds]
        if { [info exists ::active_crawlers($user_agent)] } {
            if { $::active_crawlers($user_agent) < $curr_time } {
                set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
            } else {
                reject
            }
        } else {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
        }
    }
}

- hoolio
Cirrostratus
That looks like a reasonable solution. I did some searching on MSN's search site (http://blogs.msdn.com/livesearch/), but couldn't find any indication of how the MSN bot handles 503 responses.
Aaron
- Mike_62629
Nimbostratus
Thanks for the help.
- aneilsingh_5064
Nimbostratus
Hi,
I am not a developer, so bear with me and my limited understanding.
Is there a final confirmed solution for Microsoft bots? I am having a problem with Bingbot and would love to limit the rate.
I do not see anything in the example Mike put up that specifies MSN.
Thanks
- Colin_Walker_12
Historic F5 Account
The reason you don't see his code explicitly stating anything about MSN is because he is using a class to store the bots that he wants to rate limit. The line that says:
if { [matchclass $user_agent contains $::Crawlers] } {
is doing a lookup in a data group (class) named Crawlers and is limiting any request with a User-Agent matching those strings. That being said, this is a very old approach to this solution. If you are in need of a bot rate-limiting rule and are on a version newer than 10.0, there are more efficient ways to go about this.
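As a rough illustration of the kind of newer approach that is possible on 10.x (an untested sketch, not from the original post): the session table can replace the global array, and class match replaces matchclass. The data group name crawlers_class, the one-second window, and the 503 details below are assumptions:

when HTTP_REQUEST {
    set ua [string tolower [HTTP::header "User-Agent"]]
    # "class match" is the v10+ replacement for matchclass; crawlers_class
    # is an assumed data group holding the bot User-Agent substrings.
    if { [class match $ua contains crawlers_class] } {
        # -notouch keeps the lookup from resetting the entry's timer.
        if { [table lookup -notouch -subtable crawler_throttle $ua] ne "" } {
            # This crawler has already made a request in the last second.
            HTTP::respond 503 content "Throttled" "Retry-After" "1"
            return
        }
        # Record the request; the entry expires on its own after one second,
        # so there is no global array to maintain or clean up.
        table set -subtable crawler_throttle $ua 1 1
    }
}

Global variables also cause CMP demotion on 10.x, which is one reason the table command is generally preferred there.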
Colin
- aneilsingh_5064
Nimbostratus
@Colin, I am on version 10.2.1. I would be very interested in some working examples.
Personally, I think I would rather just have the actual iRule contain the list. Not sure why you use the class.
And of course performance is important.
Thanks
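On the class-versus-inline-list question above: the list can certainly live in the iRule itself, for example with a switch on the User-Agent (untested sketch; the bot strings are only examples). The usual reason for a data group is that the list can then be edited without touching the iRule:

when HTTP_REQUEST {
    set ua [string tolower [HTTP::header "User-Agent"]]
    # Hard-coded bot list kept in the iRule instead of a data group.
    switch -glob $ua {
        "*msnbot*" -
        "*bingbot*" -
        "*googlebot*" -
        "*slurp*" {
            # ...rate-limiting logic from the earlier examples goes here...
        }
    }
}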