Forum Discussion
Sarnoth_81568
Nimbostratus
Aug 01, 2008
LB::reselect problem
Let me give you some background before I get into the problem.
We have a web application that uses session replication. A node can fail in such a way that it always returns 5xx error pages. We would like BigIP to route people to the same node as long as it is up, but if it is down (returning 5xx) it should retry their request on the other node.
To accomplish this I worked from the example in the documentation for HTTP::retry. My irule is at the bottom of the post. The problem is that the logs show it selecting the other node for a retry on a 5xx error, but in reality it tries to connect to the same node twice. If anyone has an idea about how I can make this work I would appreciate it very much.
BigIP log message:
Jul 31 15:26:10 tmm tmm[934]: 01220002:6: Rule methcheck_retry_on_error : Sticky to 1573653002.25635.0000
Jul 31 15:26:10 tmm tmm[934]: 01220002:6: Rule methcheck_retry_on_error : Trying node 10.10.204.93:9060
Jul 31 15:26:10 tmm tmm[934]: 01220002:6: Rule methcheck_retry_on_error : 5xx error caught: retry 1 out of 2
Jul 31 15:26:10 tmm tmm[934]: 01220002:6: Rule methcheck_retry_on_error : Removing cookie
Jul 31 15:26:10 tmm tmm[934]: 01220002:6: Rule methcheck_retry_on_error : Trying node 10.10.204.93:9062
It looks like it is trying port 9060 once and port 9062 once, but in reality two requests come in on port 9060 and none come in on 9062. The second request does not have the BIGipServer cookie and has two X-Forwarded-For headers, as expected.
irule:
when CLIENT_ACCEPTED {
    set retries 0
}

when HTTP_REQUEST {
    if { $retries >= 1 } {
        HTTP::cookie remove "BIGipServermethcheck"
        log "Removing cookie"
    }
    if { [HTTP::cookie exists "BIGipServermethcheck"] } {
        log "Sticky to [HTTP::cookie value "BIGipServermethcheck"]"
    }
    set request [HTTP::request]
}

when LB_SELECTED {
    if { $retries >= 1 } {
        LB::mode rr
        LB::reselect
    }
    log "Trying node [LB::server addr]:[LB::server port]"
}

when HTTP_RESPONSE {
    if { [HTTP::status] starts_with "5" } {
        incr retries
        log "5xx error caught: retry $retries out of [active_members [LB::server pool]]"
        if { $retries < [active_members [LB::server pool]] } {
            HTTP::retry $request
        }
    }
}
11 Replies
- Nicolas_Menant
Employee
Hi,
Check the version-specific notes for LB::reselect; there is a known problem:
Version Specific Notes
LTM 9.2: When load balancing fails to a selected member, the iRule LB::reselect command repeatedly attempts to connect to the unavailable member, rather than selecting the next available pool member. See AskF5 SOL8188 (CR84102).
LTM 9.3, 9.4, & 9.3.1: When load balancing fails to a selected member, the iRule LB::reselect command repeatedly attempts to connect to the unavailable member, rather than selecting the next available pool member. See AskF5 SOL8188 (CR85186).
Maybe you can try doing an LB::detach to close the connection to the server and then do the HTTP::retry.
- Can anyone shed any light on why this may not be working:
when HTTP_RESPONSE {
    if { [HTTP::status] > 500 } {
        set failure 1
    }
}

when LB_SELECTED {
    if { $failure > 0 } {
        LB::detach
        LB::mode rr
        LB::reselect pool [LB::server pool]
    }
}
When I do this I am getting a "The connection to the server was reset while the page was loading" error, as if it doesn't like either server.
One thing I am noticing in my logs: our basic health check is getting a 500 on every other check:
- - - [21/Apr/2009:14:29:49 -0700] "GET /" 200 108 "-" "-"
- - - [21/Apr/2009:14:29:50 -0700] "GET /" 500 3328 "-" "-"
However, the page I am going to is not yielding a 500.
- dennypayne
Employee
Have you verified with HTTPWatch or LiveHTTPHeaders that no 500's are returning? Maybe just some element of the page is returning a 500?
Plus with that rule, a 500 won't trigger the failure variable to be set, only a 501 or higher. And once $failure is set to 1, it never gets set back, so every subsequent LB_SELECTED event will trigger the detach and reselect (so it will loop). You probably need to set $failure back to 0 once you're in the detach/reselect logic.
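For example, something along these lines might behave better (a rough, untested sketch):

when CLIENT_ACCEPTED {
    # initialize once per TCP connection so LB_SELECTED can always read it
    set failure 0
}

when HTTP_RESPONSE {
    # use >= 500 so a plain 500 also trips the flag
    if { [HTTP::status] >= 500 } {
        set failure 1
    }
}

when LB_SELECTED {
    if { $failure > 0 } {
        LB::detach
        LB::mode rr
        LB::reselect pool [LB::server pool]
        # reset so later selections on this connection don't loop
        set failure 0
    }
}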
Denny
- Thanks for the reply, and for catching those problems. I verified that no 500/503 errors are being returned, using LiveHTTPHeaders (I was also checking the Apache log).
I changed it a bit, and now am checking LTM logs:
Apr 21 16:16:30 tmm tmm[1564]: 01220001:3: TCL error: redirect_on_bad_status - can't read "failure": no such variable while executing "if { $failure == 1 } { LB::detach LB::mode rr LB::reselect pool [LB::server pool] set failure 0 }"
Is this not a "global" variable?
I thought it would make sense to just nest the when LB_SELECTED { inside of the when HTTP_RESPONSE {, but that did not seem to work; it wouldn't let me save the iRule that way (command is not valid in the current scope).
Here is what I changed it to:
when HTTP_RESPONSE {
    if { [HTTP::status] == 503 } {
        set ::failure 1
    }
}

when LB_SELECTED {
    if { $::failure == 1 } {
        LB::detach
        LB::mode rr
        LB::reselect pool [LB::server pool]
        set ::failure 0
    }
}
I am testing out Tomcat session replication, and I want to make it so that when a Tomcat server is restarting (returning a 503) the customer doesn't see any errors and just gets tossed onto another server with no loss of data.
BTW, I have seen variables set as ::variable (assuming that is global); I tried that as well and it didn't work...
Apr 21 16:23:47 tmm tmm[1564]: 01220001:3: TCL error: redirect_on_bad_status - can't read "::failure": no such variable while executing "if { $::failure == 1 } { LB::detach LB::mode rr LB::reselect pool [LB::server pool] set ::failure 0 }"
- dennypayne
Employee
Thinking about this again, I think the problem is that the LB_SELECTED event happens before HTTP_RESPONSE, so this approach won't work. I think you need something more along the lines of passive health monitoring. You probably don't need all the timing logic in that post though. So something like this:

when HTTP_RESPONSE {
    if { [HTTP::status] eq "503" } {
        LB::down
        LB::reselect pool [LB::server pool]
    }
}
That way everything gets done in the same event.
Denny
- Thanks. That is sort of what I was thinking, but when I tried it last time, it didn't let me:
error: line 4: [command is not valid in current event context HTTP_RESPONSE] [LB::reselect pool [LB::server pool]]
- hoolio
Cirrostratus
You would want to use a local variable ($failure) not a global variable ($::failure) as a global variable would get updated across all TCP connections. Also, you would need to initialise the variable in all paths through the iRule or check to see if it's set before trying to use it. You can check to see if a variable is defined using [info exists var_name]. Keep in mind that LB_SELECTED will occur before HTTP_RESPONSE on the first request on a new TCP connection.
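Putting those pieces together, the variable handling might look something like this (an untested sketch):

when HTTP_RESPONSE {
    # local variable, scoped to this TCP connection
    if { [HTTP::status] == 503 } {
        set failure 1
    }
}

when LB_SELECTED {
    # guard with info exists: on the first request of a new connection,
    # LB_SELECTED fires before HTTP_RESPONSE, so failure may be unset
    if { [info exists failure] && $failure == 1 } {
        LB::detach
        LB::mode rr
        LB::reselect pool [LB::server pool]
        set failure 0
    }
}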
Even after fixing those issues, your current iRule wouldn't do much to fix the current request that received a 503 as you're not forcing a retry of the request from HTTP_RESPONSE.
With that said, what are you actually trying to accomplish? Are you concerned that the application will return a 503 and more requests will be sent to it before the monitor marks it down? Do you want to mark the pool member down if a 503 is seen in the response? Do you want to have the current request reload balanced to a new pool member if a 503 is seen in the response? Is the application stateful with the sessions only existing on one server?
If you mark the server down after a single 503, and your application is stateful with the sessions existing only on that server, the client could be reload balanced, but they'd have to re-establish their application session on the new server.
If you do want to retry the request after a 503 response, you could use HTTP::retry in HTTP_RESPONSE. LB::reselect isn't available in HTTP_RESPONSE, just LB_FAILED and LB_SELECTED. And an HTTP 50x response (or any acceptance of the TCP connection) will not trigger LB_FAILED. The downside to HTTP::retry is that you need to save the request headers (and data) for every request in order to retry the request.
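A rough sketch of that approach (headers only; POST data would also need to be collected, as noted below):

when CLIENT_ACCEPTED {
    set retries 0
}

when HTTP_REQUEST {
    # save every request so it can be replayed (a memory cost per request)
    set request [HTTP::request]
}

when HTTP_RESPONSE {
    # cap retries at the number of active members to avoid looping
    if { [HTTP::status] >= 500 && $retries < [active_members [LB::server pool]] } {
        incr retries
        # mark the member down, then replay the saved request;
        # the retry is load balanced again and should avoid the down member
        LB::down
        HTTP::retry $request
    }
}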
Aaron
- Switching servers is OK. The session info is replicated between cluster members, so no retry is necessary.
I am not sure: if we mark the pool member down, how will it get turned back on?
I am sort of inclined to just leave the pool member up, let this check pass the user onto another server, and let the Nagios content checks handle whatever needs to get done on the server side (restart Tomcat or whatever).
- hoolio
Cirrostratus
A successful response to a monitor probe would mark the pool member back up. If the application isn't stateful and you're very concerned about failed responses, you could configure a monitor with an interval of 5 seconds and a timeout of 6 seconds. Then set the pool's action on service down to reselect. Anyone have concerns/suggestions on this idea?
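On versions where tmsh is available, that might look something like this (monitor and pool names are placeholders):

# aggressive monitor: 5 second interval, 6 second timeout
tmsh create ltm monitor http aggressive_http interval 5 timeout 6
# reselect a new member for existing connections when the member goes down
tmsh modify ltm pool my_pool monitor aggressive_http service-down-action reselect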
Else, if you want to retry the request to a new server it looks like the alternative is to manually save every request and use HTTP::retry and LB::down in HTTP_RESPONSE when a response code you don't like is seen. To handle POST requests, you would need to collect every request using HTTP::collect. I could see it getting very resource intensive to use this approach.
Aaron
- dennypayne
Employee
If session state is replicated and no retry is necessary, what about just this:

when HTTP_RESPONSE {
    if { [HTTP::status] eq "503" } {
        LB::down
    }
}
in combination with a reverse health monitor that watches for 503's as well? Basically the same thing as before minus the reselect, which will effectively get done anyway.
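A reverse monitor marks the member down when the receive string matches, so the monitor side might look something like this in tmsh (untested; the name is a placeholder):

# a match on "503" in the response marks the member DOWN (reverse logic)
tmsh create ltm monitor http detect_503 send "GET / HTTP/1.0\r\n\r\n" recv "503" reverse enabled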
Denny