Forum Discussion
Mike_62629 (Nimbostratus)
Jul 16, 2008
Rate limiting Search Spiders
We're currently having problems with web spiders beating up our web servers: they use up the available sessions in our application and consume a large share of our bandwidth. We're interested in rate limiting them.
I found what appeared to be a very relevant iRule at http://devcentral.f5.com/Default.aspx?tabid=109 (the third-place winner), but when I try to load it in the iRule editor it complains. I believe it complains because HTTP headers are not available from within CLIENT_ACCEPTED and CLIENT_CLOSED logic. That makes sense, because CLIENT_ACCEPTED and CLIENT_CLOSED are associated with setting up and tearing down TCP connections (I believe), so no HTTP data (headers or request URIs) has been transferred at that point.
Does anyone have any suggestions on how to accomplish this or something similar?
- hooleylist (Cirrostratus): If you changed the CLIENT_ACCEPTED event to HTTP_REQUEST and CLIENT_CLOSED to HTTP_RESPONSE, you'd get per-HTTP-request throttling instead of per-TCP-connection throttling.
- Mike_62629 (Nimbostratus): I was thinking about that; it would limit us to N pending requests.
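For reference, a minimal sketch of that event swap, counting requests instead of connections (the counter name and the limit of 100 are illustrative, not taken from the contest iRule):

when RULE_INIT {
   # illustrative values, adjust to taste
   set ::pending_requests 0
   set ::max_pending 100
}
when HTTP_REQUEST {
   # refuse the request if too many are already in flight
   if { $::pending_requests >= $::max_pending } {
      HTTP::respond 503 content "Server busy"
      return
   }
   incr ::pending_requests
}
when HTTP_RESPONSE {
   # a server response completes one outstanding request
   if { $::pending_requests > 0 } {
      incr ::pending_requests -1
   }
}

That caps concurrency rather than request rate, which is why the approach below tracks time per crawler instead.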
- Mike_62629 (Nimbostratus): Here's where I am with my rate limiting iRule, though I haven't even checked to see if it'll parse yet:
when RULE_INIT {
   array set ::active_crawlers { }
   set ::min_interval 1
   set ::rate_limit_message "You've been rate limited for sending more than 1 request every $::min_interval seconds."
}
when HTTP_REQUEST {
   set user_agent [string tolower [HTTP::header "User-Agent"]]
   if { [matchclass $user_agent contains $::Crawlers] } {
      # Throttle crawlers.
      set curr_time [clock seconds]
      if { [info exists ::active_crawlers($user_agent)] } {
         if { $::active_crawlers($user_agent) < $curr_time } {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
         } else {
            # block it somehow
            HTTP::respond 500 content $::rate_limit_message
         }
      } else {
         set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
      }
   }
}
- hooleylist (Cirrostratus): You can still use reject to send a TCP reset (even from an HTTP_ event). I haven't delved into search engine optimization, but technically it would be appropriate to send a 503 response back.
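For example, instead of the reject or the 500, something along these lines would send the crawler a 503 with a Retry-After hint (the message text and retry value are just placeholders):

   # tell the crawler to back off and when it may retry (values are placeholders)
   HTTP::respond 503 content $::rate_limit_message "Retry-After" "1" "Content-Type" "text/plain"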
- Mike_62629 (Nimbostratus): Here's the rate limiting approach:
when RULE_INIT {
   array set ::active_crawlers { }
   set ::min_interval 1
}
when HTTP_REQUEST {
   set user_agent [string tolower [HTTP::header "User-Agent"]]
   # Logic only relevant for crawler user agents
   if { [matchclass $user_agent contains $::Crawlers] } {
      # Throttle crawlers.
      set curr_time [clock seconds]
      if { [info exists ::active_crawlers($user_agent)] } {
         if { $::active_crawlers($user_agent) < $curr_time } {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
         } else {
            reject
         }
      } else {
         set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
      }
   }
}
- hooleylist (Cirrostratus): That looks like a reasonable solution. I did some searching on MSN's search site (http://blogs.msdn.com/livesearch/), but couldn't find any indication of how the MSN bot handles 503 responses.
- Mike_62629 (Nimbostratus): Thanks for the help.
- aneilsingh_5064 (Nimbostratus): Hi,
- Colin_Walker_12 (Historic F5 Account): The reason you don't see his code explicitly mention MSN is that he is using a class to store the bots he wants to rate limit. The line that says [matchclass $user_agent contains $::Crawlers] checks the User-Agent against that class, so the MSN bot is covered as long as its user agent string is an entry in the class.
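For illustration, the Crawlers class (data group) just holds lowercase substrings of the bot user agents to throttle; the entries below are only examples, not a complete list:

class Crawlers {
   "msnbot"
   "googlebot"
   "slurp"
}

Since the iRule lowercases the User-Agent header before the matchclass check, the class entries need to be lowercase too.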
- aneilsingh_5064 (Nimbostratus): @Colin, I am on version 10.2.1. I would be very interested in some working examples.
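For reference, on 10.x the same idea can be sketched with static:: variables, the class command, and the session table instead of global variables. This is only a sketch under assumptions: the data group is assumed to be named crawlers and to contain lowercase user agent substrings, and the 503 message and Retry-After value are placeholders.

when RULE_INIT {
   # minimum number of seconds between requests from any one crawler user agent
   set static::min_interval 1
}
when HTTP_REQUEST {
   set user_agent [string tolower [HTTP::header "User-Agent"]]
   # "crawlers" is an assumed string data group holding bot user agent substrings
   if { [class match $user_agent contains crawlers] } {
      if { [table lookup -notouch "crawler:$user_agent"] ne "" } {
         # an entry still exists, so this crawler has requested within the last interval
         HTTP::respond 503 content "Rate limited, please slow down." "Retry-After" "1"
      } else {
         # remember this crawler; the table entry times out after min_interval seconds
         table set "crawler:$user_agent" 1 $static::min_interval
      }
   }
}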