Forum Discussion

Kenny_Lussier_5
Jun 13, 2011

Tracking triggers in an iRule

Hi All,

I have the following iRule which checks a data group to see if the server is marked as online or offline. If it is marked as online, traffic passes as normal; if it is offline, it sends back a 503. If the proxy is online but the back-end pool is unavailable (detected via LB_FAILED), it sends back a 502. The pool is actually a single node, so there is no need for LB::reselect (which I don't think would work anyway). I have a TCP profile assigned to the virtual server that sets max SYN retry to 1 so that LB_FAILED fires immediately.

This has worked fairly well so far, except that LB_FAILED is being triggered intermittently, and I don't know why. One request will get a 502, while another request, received within milliseconds, goes through. If I were using a built-in health check, there would be logging on member up/down and failures to select pools. But since I am doing passive checking, there isn't much info that I can find. Is there a way to see what is causing the failures from the LTM's point of view?

Thanks,

Kenny

when RULE_INIT {

   log local0.info "proxystatushttp v1.0 $static::tcl_platform(os) $static::tcl_platform(osVersion)"

   set static::DEBUG 0

   set static::offlineFlag "offline"
   set static::proxyStatus proxystatus

   if { $static::DEBUG } { log local0.debug "$static::proxyStatus:\n[class get $static::proxyStatus]" }

   set static::privateNetworkAddresses private_net
   set static::externalMonitoringAddresses external_monitoring_addresses

   if { $static::DEBUG } { log local0.debug "$static::privateNetworkAddresses:\n[class get $static::privateNetworkAddresses]" }
   if { $static::DEBUG } { log local0.debug "$static::externalMonitoringAddresses:\n[class get $static::externalMonitoringAddresses]" }

}

when HTTP_REQUEST {

   # If the proxy is flagged offline in the data group, return a 503 to everyone
   # except the monitoring and private-network addresses.
   if { [class lookup $static::offlineFlag $static::proxyStatus] } {

      if { (not [class match [IP::client_addr] equals $static::externalMonitoringAddresses]) &&
           (not [class match [IP::client_addr] equals $static::privateNetworkAddresses]) } {

         set response "ForbiddenNOTICE: Service unavailable at this time."

         HTTP::respond 503 content $response noserver "Connection" "close" "Content-Length" [string length $response]

         if { $static::DEBUG } { log local0.debug "Sent HTTP Status Code 503 due to proxy status offline to [IP::client_addr]" }

         log -noname local0. "[virtual name] MyIP=[IP::local_addr] SrcIP=[IP::client_addr] - - \[[clock format [clock seconds] -format "%d/%b/%Y:%H:%M:%S %z"]\] - \"[HTTP::method] [HTTP::uri] HTTP/[HTTP::version]\" 503 [HTTP::payload length]"

         return

      } else {

         if { $static::DEBUG } { log local0.debug "Processing HTTP request with proxy status offline from [IP::client_addr]" }

      }

   }

}

when LB_FAILED {

   # Pool member selection or connection failed: return a 502 to the client.
   set response "Server ErrorNOTICE: Site has experienced an error."

   HTTP::respond 502 content $response noserver "Connection" "close"

   log -noname local0. "[virtual name] MyIP=[IP::local_addr] SrcIP=[IP::client_addr] - - \[[clock format [clock seconds] -format "%d/%b/%Y:%H:%M:%S %z"]\] - \"[HTTP::method] [HTTP::uri] HTTP/[HTTP::version]\" 502 [HTTP::payload length]"

}
  • Hi Kenny,

     

    You might need to apply a TCP profile where you can fine-tune the "Maximum Syn Retransmissions" setting. Here is a link that describes the various settings behind it:

     

     

    http://devcentral.f5.com/wiki/default.aspx/iRules/LB_FAILED.html

     

     

    I hope this helps

     

    Bhattman

     

  • Hi Kenny,

     

     

    I don't think there is anything special you're doing in the iRule that would trigger a load-balancing failure and cause the LB_FAILED event to run. You can check the LB_FAILED wiki page for details on when this event is triggered:

     

     

    http://devcentral.f5.com/wiki/default.aspx/iRules/lb_failed

     

     

    If this happens frequently, you could try capturing a tcpdump of the clientside and serverside traffic to see if the pool member is in fact not responding to LTM SYNs. For details on using tcpdump, check SOL411:

     

     

    sol411: Overview of packet tracing with the tcpdump utility

     

    http://support.f5.com/content/kb/en-us/solutions/public/0000/400/sol411.html

     

     

    Aaron
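
    To get a little more of the LTM's point of view while you wait on a capture, one option (sketch only) is to have LB_FAILED itself log what it was attempting when it fired. The assumption here is that LB::server addr/port are populated when a member had actually been selected and come back empty when selection itself failed, which is a useful data point either way. Something like this could be folded into the existing LB_FAILED block:

    when LB_FAILED {
       # Log which pool member TMM was trying to reach, plus the client and request,
       # so the intermittent 502s can be lined up against a tcpdump afterwards.
       log local0. "LB_FAILED vs=[virtual name] client=[IP::client_addr]:[TCP::client_port] member=[LB::server addr]:[LB::server port] uri=[HTTP::uri]"
    }
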
  • Thanks for the pointers. Tracking connections is a little tough, since there are thousands of connections to the front end, and the pool/node is a load balancer. Finding the one SYN that isn't ACK'd is like finding a needle in a needle stack :-)

     

     

    One thing to note is that I replaced some old Linux servers running Apache using mod_proxy with the LTM. We never had this issue until we went to the LTM. I am trying to figure out which of the hundreds of differences is causing the problem, and if there is a way to adjust the LTM so that it doesn't behave this way. I suppose increasing the SYN retry is an option. Would using an LB::reselect work if there is only one node in a pool?

     

     

    Thanks,

     

    Kenny

     

  • Colin_Walker_12
    Historic F5 Account
    I don't think you'd need an LB::reselect; if you just want to try again to the same server, you could use HTTP::retry.

     

     

    Colin
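
    For reference, a rough sketch of the HTTP::retry pattern Colin mentions, assuming the request headers are saved in HTTP_REQUEST so they can be replayed when LB_FAILED fires. The variable name request_headers is just illustrative, and HTTP::request captures headers only, so requests with payloads would need extra handling:

    when HTTP_REQUEST {
       # Keep a copy of the request headers so they can be replayed if the serverside connection fails
       set request_headers [HTTP::request]
    }

    when LB_FAILED {
       # Replay the original request instead of immediately returning a 502.
       # A production version would probably also count retries so a persistently
       # down server doesn't loop; this is just the bare pattern.
       if { [info exists request_headers] } {
          HTTP::retry $request_headers
       }
    }
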
  • With a default TCP profile, TMM tries 5 times over 45 seconds to establish a TCP connection. If that's not enough attempts, you could increase the "Maximum Syn Retransmissions" option in the TCP profile.

     

     

    However, it would probably be more effective to try to capture the failure happening in a tcpdump so you can see exactly what's failing. I realize that's not easy when the virtual server is in production, but you might be able to create a test VS with a custom SNAT pool and point a test client (or front-end server) at it to isolate the traffic.

     

     

    Aaron
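
    If a separate test VS isn't practical, a rougher alternative (sketch only) is to flag a single known test client in the iRule and log just its serverside activity; 192.0.2.10 below is purely a placeholder address:

    when CLIENT_ACCEPTED {
       # Mark connections from the placeholder test client so later events log only those
       set test_client [IP::addr [IP::client_addr] equals 192.0.2.10]
    }

    when SERVER_CONNECTED {
       # Successful serverside connections for the test client; failed attempts still surface via LB_FAILED
       if { $test_client } { log local0. "TEST client [IP::client_addr]:[TCP::client_port] connected serverside to [LB::server addr]:[LB::server port]" }
    }
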
  • Problem Solved!! The problem isn't connections being refused, it's connections being closed on the back end. Our Tomcat servers have a timeout of 60 seconds. The LTM has a 300-second TCP timeout. So, if the client connects, sends a request, gets a response, and does not properly close the connection, it stays open for 300 seconds as far as the LTM is concerned. However, the Tomcat server kills the thread servicing that connection after 60 idle seconds. That makes the LTM think that the pool member has failed, triggering LB_FAILED the next time it tries to use that connection. I solved it with this:

     
    when RULE_INIT {

       log local0.info "keepalivetimeout v0.1  $static::tcl_platform(os) $static::tcl_platform(osVersion)"

       set static::keepalivetimeoutDEBUG 0

       set static::keepAliveTimeout [class lookup "keepAliveTimeout" httpdefaults]

    }

    when HTTP_REQUEST {

       # (re)set the TCP idle timeout for the current connection to the profile default
       IP::idle_timeout [PROFILE::tcp idle_timeout]

       if { $static::keepalivetimeoutDEBUG } { log local0.debug "[IP::client_addr]:[TCP::client_port] TCP idle_timeout set to [IP::idle_timeout]" }

    }

    when HTTP_RESPONSE {

       # (re)set the TCP idle timeout for the current connection to the keep-alive value until the next request arrives
       IP::idle_timeout $static::keepAliveTimeout

       if { $static::keepalivetimeoutDEBUG } { log local0.debug "[IP::client_addr]:[TCP::client_port] TCP idle_timeout set to [IP::idle_timeout]" }

    }

    httpdefaults is a data group with several variables, now with one called keepAliveTimeout, which I have set to 15 seconds. When HTTP_REQUEST is triggered, the timeout is set to 300 seconds (the profile default). When HTTP_RESPONSE is triggered, the timeout is set to 15 seconds. If another request comes in on the same connection, the timeout is reset to 300 and the socket is re-used. If not, it is torn down. (A small defensive tweak for the class lookup is sketched after this post.)

    Thanks,

    Kenny
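
    The defensive tweak mentioned above, as a sketch: class lookup should return an empty string when the key isn't in the data group, so RULE_INIT can fall back to a sane default rather than handing IP::idle_timeout an empty value. The 15-second fallback is just the value Kenny is already using:

    when RULE_INIT {
       # Same lookup as in the rule above, with a fallback if keepAliveTimeout is missing from httpdefaults
       set static::keepAliveTimeout [class lookup "keepAliveTimeout" httpdefaults]
       if { $static::keepAliveTimeout eq "" } {
          set static::keepAliveTimeout 15
       }
    }
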
  • Nice work in figuring out what was happening.

     

     

    However, I'm not sure what you're trying to do with the iRule. TMM should automatically reset the TCP idle timeout anytime a packet is received. Trying to do this manually for each HTTP request or response seems redundant.

     

     

    Also, from your description of the issue, the problem isn't that LTM is not resetting its timeout; it's that the TMM and server timeouts are mismatched. Couldn't you just update the idle timeout on the clientside (and serverside) TCP profile(s) to be slightly lower than the servers' timeout, to force TMM to close the connections before the servers do?

     

     

    Aaron
  • Hoolio,

     

     

    The problem is that when a request comes in, it can take a minute, maybe two, for the backend Tomcat to process the request, do what it needs to (I'm being intentionally vague about what our application does), and send a response. If I set the TCP idle timeout to 15 seconds in the TCP profile, then the connection can get closed before the response is sent (TCP doesn't know what state HTTP is in). If I set the timeout to 300 seconds on the request, then longer processing times are covered.

    However, if the client doesn't close the connection, because they use something like releaseConnection() instead of closeConnection() in their client code, the connection stays open but idle. Tomcat cleans up idle threads by killing them off. This is configurable, and our application closes them after 60 seconds of HTTP idle time. Tomcat, unlike the LTM, is aware that an HTTP response has been sent and the transaction is complete, and it starts the clock when the response is sent. By setting the timeout to 15 seconds every time the HTTP_RESPONSE event is triggered, the LTM becomes aware of the HTTP state, not just the TCP state.

     

     

    Thanks,

     

    Kenny