mtobkes_64700
May 12, 2010

Rate-Limiting Crawlers

Hi, I found this iRule here that limits requests to 1 request per n seconds. I would like to know how I can allow n requests per second instead, e.g. 5 requests per second.

when RULE_INIT {
    array set ::active_crawlers { }
    set ::min_interval 1
    set ::rate_limit_message "You've been rate limited for sending more than 1 request every $::min_interval seconds."
}

when HTTP_REQUEST {
    set user_agent [string tolower [HTTP::header "User-Agent"]]
    if { [matchclass $user_agent contains $::Crawlers] } {
        # Throttle crawlers.
        set curr_time [clock seconds]
        if { [info exists ::active_crawlers($user_agent)] } {
            if { [ $::active_crawlers($user_agent) < $curr_time ] } {
                set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
            } else {
                # block it somehow
                HTTP::respond 503 content $::rate_limit_message
            }
        } else {
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
        }
    }
}

Thanks,
myles
  • For cleaner, more accurate rate limiting, check out the table command article series, which covers this in depth:

    http://devcentral.f5.com/Tutorials/TechTips/tabid/63/articleType/ArticleView/articleId/2391/categoryId/96/v101--iRules-rate-limiting-with-the-table-command.aspx
  • Thanks for the link. However, I'm only running v9.4.7. Can you tell me what options that leaves me?

    Thanks again,
    myles
  • Check this version of the DNS Flood Protection rule; the bones of the rate limiting are there:

    http://devcentral.f5.com/wiki/default.aspx/iRules/DNS_Flood_Protection_v2.html
  • I've modified the iRule I found to limit crawlers. I want to allow ::max_req_count requests per ::min_interval, but I am getting a TCL error in my logs. I was wondering if someone can help me figure out what the problem is. The error I'm getting is:

    TCL error: googlebot_rate-limit_vb5 HTTP_REQUEST - invalid command name ::active_crawlersmozilla/4.0 compatible msie 7.0 windows nt 5.1 gtb6.4 .net clr 1.1.4322 .net clr 2.0.50727 .net clr 3.0.4506.2152 .net clr 3.5.30729 while executing ::active_crawlers$user_agent $curr_time

    when RULE_INIT {
        array set ::active_crawlers { }
        # min_interval is the minimum amount of seconds
        set ::min_interval 10
        # max_req_count variable is the maximum amount of request per min_interval
        set ::max_req_count 3
        set ::rate_limit_message "You've been rate limited for sending more than $::max_req_count request every $::min_interval seconds."
    }

    when HTTP_REQUEST {
        set user_agent [string tolower [HTTP::header "User-Agent"]]
        # remove below log when we go to production
        log local0. "user agent is $user_agent"
        if { [matchclass $user_agent contains $::Crawlers] } {
            # Throttle crawlers.
            # remove below log when we go to production
            log local0. "user agent matches $user_agent"
            set curr_time [clock seconds]
            if { [info exists ::active_crawlers($user_agent)] } {
                # remove below log when we go to production
                log local0. "passed active Crawlers"
                if { [ ::active_crawlers($user_agent) < $curr_time ] } {
                    set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
                    set reqcount 1
                    # remove below log when we go to production
                    log local0. "passed set active crawlers"
                } else {
                    if { [$reqcount > $::max_req_count] } {
                        # allow 10 request then block
                        HTTP::respond 503 content $::rate_limit_message
                        # log when crawler hits more than 10 requests and block it
                        log local0. "Rate Limit Has Reached $::max_req_count Requests Per $min_interval for $user_agent"
                    } else {
                        # reqcount keeps track of request
                        set reqcount [expr {$reqcount + 1}]
                    }
                }
            } else {
                set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
                set reqcount 1
            }
        }
    }

    Thanks
    m
  • Hi myles,

    Try changing this line:

    if { [ ::active_crawlers($user_agent) < $curr_time ] } {

    to

    if { $::active_crawlers($user_agent) < $curr_time } {

    Aaron
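
To illustrate why the change above matters: inside an `if` condition, square brackets trigger command substitution, so Tcl tries to run the first word between them as a command instead of comparing values. The following is a minimal plain-Tcl sketch of that behavior, assuming it is run in a standalone `tclsh` rather than on a BIG-IP; the array name mirrors the iRule, but the timestamps are made-up values.

```tcl
# Plain Tcl sketch (standalone tclsh, not an iRule); all values are hypothetical.
array set active_crawlers {}
set user_agent "googlebot"
set active_crawlers($user_agent) 100   ;# hypothetical expiry timestamp
set curr_time 200                      ;# hypothetical current time

# Broken form: the [ ] inside the condition is command substitution, so Tcl
# looks for a command named "active_crawlers(googlebot)" and raises an
# "invalid command name" error, which catch stores in err.
if { [catch { if { [ active_crawlers($user_agent) < $curr_time ] } { } } err] } {
    puts "broken form failed: $err"
}

# Working form: no brackets, and $ dereferences the array element, so the
# condition is evaluated as a numeric comparison.
if { $active_crawlers($user_agent) < $curr_time } {
    puts "interval has passed"
}
```

Note that adding the `$` while keeping the brackets would still fail, because Tcl would then try to run the stored timestamp itself as a command.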
  • Thanks Aaron. I changed the line; however, I now get this TCL error in my logs:

    TCL error: googlebot_rate-limit_vb5 HTTP_REQUEST - invalid command name 1273758282 while executing $::active_crawlers$user_agent $curr_time

  • Do you still have the parentheses around $user_agent and the less-than sign in this line?

    if { $::active_crawlers($user_agent) < $curr_time } {

    Can you post a current copy of the iRule and the exact error message from /var/log/ltm?

    Thanks, Aaron
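
For reference, here is one way the "n requests per interval" goal from this thread could be sketched for a v9.x system, where the table command is not available. This is an untested sketch rather than a confirmed fix. It assumes the `Crawlers` class from the original rule exists, and it introduces a second global array, `::crawler_counts`, which is a hypothetical name not used anywhere in the thread. It keeps the counter in a global array because a plain local variable such as `reqcount` is scoped to a single connection and so would not be shared between requests arriving on different connections. It also drops the square brackets from the comparisons and uses `$::min_interval` consistently.

```tcl
when RULE_INIT {
    # Expiry time of the current window, keyed by User-Agent (as in the thread).
    array set ::active_crawlers { }
    # Requests seen in the current window, keyed by User-Agent.
    # ::crawler_counts is a hypothetical name introduced for this sketch.
    array set ::crawler_counts { }
    # Window length in seconds and requests allowed per window.
    set ::min_interval 10
    set ::max_req_count 3
    set ::rate_limit_message "You've been rate limited for sending more than $::max_req_count requests every $::min_interval seconds."
}

when HTTP_REQUEST {
    set user_agent [string tolower [HTTP::header "User-Agent"]]
    if { [matchclass $user_agent contains $::Crawlers] } {
        set curr_time [clock seconds]
        if { [info exists ::active_crawlers($user_agent)] && $::active_crawlers($user_agent) >= $curr_time } {
            # Still inside the current window: count this request,
            # then block it if the count exceeds the limit.
            incr ::crawler_counts($user_agent)
            if { $::crawler_counts($user_agent) > $::max_req_count } {
                HTTP::respond 503 content $::rate_limit_message
            }
        } else {
            # First request from this User-Agent, or the window has expired:
            # start a new window and count this request as the first one.
            set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}]
            set ::crawler_counts($user_agent) 1
        }
    }
}
```

As in the original rule, the limit is keyed on the User-Agent string, so every client sending the same string shares one budget, and `clock seconds` gives only one-second resolution, so the window boundaries are approximate. On v10.1 and later, the table command approach linked in the first reply is likely the cleaner option.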