Here's my story. I have several hundred long-lived client TCP connections to two Tibco RVD servers. (Don't ask why we're not just using RVD in a multicast arrangement. Long story.) Clients can connect to either RVD server; it doesn't matter which. The connections are made at the beginning of the day and not terminated until the evening. If one of the servers goes down, I'd like to reconnect its client connections to the other server, preferably without the client ever knowing it was disconnected.

So the thought was to insert an F5 between the clients and the servers. The F5 would act as the "server": a TCP connection would exist between the client and the F5, and the F5 would maintain a separate TCP connection between itself and the "real" RVD servers. OneConnect sounds similar (i.e. the F5 acts as a TCP proxy) but seems to be used for pairing new client connections with existing backend connections. That part is cool.

What I really need is to detect backend server failure, through a health check, an iRule, or whatever, and force the clients connected to the failed backend server to reconnect to the other server, seamlessly if possible. I could use some help figuring this out. Thanks.

Dave

Fault Tolerant long-lived TCP connections

13 Replies

The_Bhattman
Nimbostratus
Mar 07, 2010
Hi David,

Sounds like you need to set up a typical VIP and pool scenario. Have you looked at the configuration guides on ask.f5.com?

Bhattman
David_Bradley_2
Nimbostratus
Mar 07, 2010
Thanks. I did set up a typical VIP with a typical pool, but this isn't a typical scenario, AFAIK. I need to make the whole thing fault-tolerant. I've confirmed that client connections round-robin between the two RVD servers, as I knew they would. However, if I kill one of the RVD servers, the clients attached to that RVD immediately terminate as well. What I want to accomplish is to detect, or intercept, the SERVER-side failure before the CLIENT-side TCP connection is closed and somehow get the client connected on another SERVER-side connection. I'm not sure if this is possible. If not, I'd like to know what my options are, if any, for high availability of long-standing TCP sessions. I'm hoping there are options.

The VIP is set up with SNAT, and OneConnect is set to use the standard oneconnect profile. I'm using the round-robin algorithm. I implemented an iRule with a stub for each event and a log local0. entry in each stub, just to see what gets called when. When I kill the server, I see the client terminate before I see any LTM messages. The first message I see is the health check message telling me my server died, but by then it's too late. The client has already terminated.
hooleylist
Cirrostratus
Mar 07, 2010
Hi David,

It sounds like you want to reselect a new pool member if the serverside connection is closed before the clientside connection is.

As you found, a health monitor and the 'action on service down' pool setting would happen too late to have any reliable effect on a single TCP connection.

Ideally, you'd be able to use an iRule which calls LB::reselect from the SERVER_CLOSED event. However, I'm not sure this is supported or would work. The wiki page for the SERVER_CLOSED event doesn't show much hope, as the LB::reselect command isn't listed there. A quick test shows the syntax parser doesn't allow it. And when you bypass the parser, it doesn't seem to work. Here is an apparently non-working example:

when CLIENT_ACCEPTED {
    # Try to reselect a server
    set reselect 1
    log local0. "[IP::client_addr]:[TCP::client_port]: New connection to [IP::local_addr]:[TCP::local_port]"
}
when LB_SELECTED {
    log local0. "[IP::client_addr]:[TCP::client_port]: Selected server info: [LB::server]"
}
when SERVER_CONNECTED {
    log local0. "[IP::client_addr]:[TCP::client_port]: Connected server info: [IP::server_addr]:[TCP::server_port]"
}
when SERVER_CLOSED {
    log local0. "[IP::client_addr]:[TCP::client_port]: Server connection closed"
    if {$reselect} {
        log local0. "[IP::client_addr]:[TCP::client_port]: Trying reselect"
        # Hide the command in a variable so the syntax parser doesn't reject it
        set lb_cmd "LB::reselect"
        eval $lb_cmd
    }
}
when CLIENT_CLOSED {
    # Do not try to reselect a server if the client closed the connection first
    set reselect 0
    log local0. "[IP::client_addr]:[TCP::client_port]: Client connection closed"
}

Assuming this won't work, the best I can think of are ways to reduce the chance that LTM will close an idle connection. These configuration options are related to idle timeouts on the TCP and SNAT profiles:

SOL7606: Overview of BIG-IP LTM idle session timeouts

https://support.f5.com/kb/en-us/solutions/public/7000/600/sol7606.html

Aaron
David_Bradley_2
Nimbostratus
Mar 08, 2010
Thanks Aaron,

I'm not feeling warm and fuzzy. Can we load balance actual TCP traffic? Or just new connections? If the answer is "just new connections", then we're going to have to return our two brand-new 3900 series load balancers because they're not going to do anything for us. The sales rep. claimed this was easy stuff. I can write a C/C++ program to do what we want, which is to open a TCP listener on one side and sockets to two RVD servers on the other side, and round-robin data between the two servers. We chose F5, instead of doing that, because we've got experience with them (i.e. we have a trust relationship from other projects we've used F5 on), and because of F5's HA model (i.e. the F5 isn't a single point of failure). What can I do? I'd like to review all possible solutions before giving up on F5. Thanks.
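For comparison, the standalone proxy described above needs a step that picks a backend and falls over to the other one when the first is unreachable. A minimal Python sketch of only that selection step follows. It is not a full proxy and not anything the F5 provides; the function name, the `connect` parameter, and any host names or ports passed in are hypothetical.

```python
import socket


def connect_with_failover(backends, start=0, timeout=2.0,
                          connect=socket.create_connection):
    """Try each backend once, round-robin from `start`.

    `backends` is a list of (host, port) tuples. Returns (index, socket)
    for the first backend that accepts the connection. Raises
    ConnectionError if none of them do.
    """
    last_error = None
    for offset in range(len(backends)):
        index = (start + offset) % len(backends)
        try:
            return index, connect(backends[index], timeout=timeout)
        except OSError as exc:
            # Backend is down or unreachable: remember why, try the next one.
            last_error = exc
    raise ConnectionError(f"no backend reachable: {last_error}")
```

A proxy built on this would call it once per new client (advancing `start` each time to round-robin) and again with the other index when an established backend socket fails, while keeping the client socket open. Whether the RVD protocol tolerates that mid-session switch is a separate question.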

Dave
hooleylist
Cirrostratus
Mar 08, 2010
You might get warm and fuzzy from an F5 salesperson. I'm just providing the best technical suggestions I can think of based on the requirements you've described. If you're considering returning hardware because a salesperson said the F5 kit could do something and it doesn't seem to be working, I'd suggest you get in touch with someone at F5 to get an official F5 response. I'm not an F5 employee and DevCentral isn't a place to get official F5 responses. I don't have any vested interest in you keeping the gear, and the possibility of you returning the hardware won't make me come up with a better solution than what I've already suggested.

If you do figure out a better solution, could you reply back with what you figure out so we can reference it in the future?

Thanks, Aaron
David_Bradley_2
Nimbostratus
Mar 09, 2010
Sure. I'm not coming down on you guys. You do great work and I've learned a ton from you in the past about iRules and iControl. I'm frustrated at the sales rep for saying it could be done and at myself for not being involved in the conversation in the first place. No, actually I'm mildly furious. I'll hack away at it. Maybe I can figure something out. If I do, I'll post here. Thanks for your help.

Dave
spark_86682
Historic F5 Account
Mar 09, 2010
I definitely echo what hoolio said.

That done, it sounds like you're trying to do some variation of "message-based load balancing", which might be a good phrase to look for in the F5 documentation. I don't have a good grip on how to configure it, or the level of support for it in released versions, but that might give you a good starting point.

Now, to expound a bit on the problem, is "Tibco RVD" a request/response protocol, or is it something else? If it is request/response, then suppose that a client sends a request, the server goes down, the BIG-IP acts as you intend and the client is transparently connected to a different server. Now the new server is waiting for a request that the client has already sent. Will the protocol handle this? If not, it may still be possible for the BIG-IP to solve this problem, but it will likely be a very intensive iRules-based solution if any can be found. I'm just trying to consider all the aspects of the problem, so we can try to come to a solution.
hooleylist
Cirrostratus
Mar 09, 2010
Thanks for the suggestions, Spark. I did a quick search on AskF5 for "message-based" and didn't see anything related. Do you have any more info on this?

David, I wonder if you could use a new feature Citizen_elah mentioned, where you can get access to the TCP headers from an iRule:

http://devcentral.f5.com/Default.aspx?tabid=53&forumid=5&postid=1168441&view=topic

If this is necessary for your applications, please open a case and request iRules access to the TCP header fields.

If you could look for server responses with a FIN or RST, then you might be able to intercept the server response and reselect a new pool member. If you do contact F5 Support asking about this feature, I wouldn't try to ask them to provide an iRule which will do this. You might find someone willing to help, but you're more likely to start a debate on what F5 Support supports for iRules. This might be a red herring, but could be worthwhile looking into.
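To make the FIN/RST check concrete: in a standard TCP header the flag bits live in byte 13, with FIN as 0x01 and RST as 0x04. The Python sketch below shows only that bit test on raw header bytes. It is not an iRule, and it assumes something hands you the raw header, which is exactly the feature that would have to be requested; the function names are hypothetical.

```python
FIN = 0x01  # sender has finished sending
RST = 0x04  # connection reset


def tcp_flags(tcp_header: bytes) -> int:
    """Return the flags byte (CWR, ECE, URG, ACK, PSH, RST, SYN, FIN)."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    return tcp_header[13]


def server_is_closing(tcp_header: bytes) -> bool:
    """True if the segment carries FIN or RST."""
    return bool(tcp_flags(tcp_header) & (FIN | RST))
```

Spotting the flag is the easy half. The open question in this thread is whether the LTM lets an iRule act on it before the client side is torn down.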

Regardless, I think your best bet would be to contact your F5 sales rep and see if they can put you in touch with an F5 engineer or consultant who could provide you with official feedback on your requirements.

Aaron
David_Bradley_2
Nimbostratus
Mar 09, 2010
Thanks for the suggestions. I *think* (translation: hope) that the RVD client isn't expecting anything other than a simple TCP ACK back from its request to the server. I need to look at tcpdump and figure this out today. Also, I found out where the packet length field in the RVD header is, so I should be able to TCP::collect until I've gotten a full packet. From there I think I'd need to select a server, open a connection to it (or let the LB pick one), and send the data there. I'm not sure how to do that yet, but I'm wondering if I can get clues from this:

http://devcentral.f5.com/wiki/default.aspx/iRules/LDAPProxy.html
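The "collect until I've gotten a full packet" step can be sketched independently of the iRule API. The Python below buffers bytes and splits off complete messages using a length field. The layout is purely an assumption for illustration (a 4-byte big-endian length at offset 0 that counts the whole message, header included); the real RVD header position and size are not given here and would need to be substituted.

```python
import struct

# ASSUMED layout, not the real RVD header: adjust both to match tcpdump.
LENGTH_OFFSET = 0  # hypothetical offset of the length field
LENGTH_SIZE = 4    # hypothetical: 4-byte big-endian, counts the whole message


def split_messages(buffer: bytes):
    """Return (complete_messages, leftover_bytes) for the collected data."""
    messages = []
    pos = 0
    header_end = LENGTH_OFFSET + LENGTH_SIZE
    while len(buffer) - pos >= header_end:
        (msg_len,) = struct.unpack_from("!I", buffer, pos + LENGTH_OFFSET)
        if msg_len < header_end:
            raise ValueError("length field is smaller than the header itself")
        if len(buffer) - pos < msg_len:
            break  # message incomplete: keep collecting
        messages.append(buffer[pos:pos + msg_len])
        pos += msg_len
    return messages, buffer[pos:]
```

Each time more data arrives, append it to the leftover bytes and call `split_messages` again; only the complete messages would be forwarded to whichever server is selected.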

What do you think?

Dave
David_Bradley_2
Nimbostratus
Mar 10, 2010
Actually, I think I see what you're saying now. Since these are long-lived TCP connections, I can't just simply let the load balancer pick a new target, since the load balancer is out of the loop at that point. So the RVD payload doesn't really matter. The LDAP example I was looking at doesn't apply because it's dealing with a new TCP connection, so it's able to change pools, etc. I'll work with the F5 guys and see what they say about possible TCP header logic being added to iRules. Thanks very much.