BIG-IP connection mirroring in public cloud doesn't work, but why?
Summary
BIG-IP connection mirroring is not supported in public cloud environments. Cloud Failover Extension (CFE) supports failover between BIG-IP devices, and persistence mirroring will work, but connection mirroring will not. This article discusses a customer case where we explored what happens if we try.
Background
I don't see it often anymore, but some F5 customers use connection mirroring to provide High Availability (HA) for applications whose network connections are long-lived, such as telnet or FTP. Typically, connection mirroring is not required for short-lived connections like HTTP and UDP. Because modern applications tend to be stateless, and their network connections are often resilient to network-layer failures (e.g., HTTP), I am rarely asked about connection mirroring in public cloud.
Note: persistence mirroring is different from connection mirroring. Persistence mirroring will work in public cloud. If you're unsure of the difference, I found this explanation fairly helpful.
Despite connection mirroring being unsupported in public cloud, I had a customer ask me if they could test connection mirroring in AWS so they could explain to their management the reasons behind things like cloud architectures, failover planning, and expectations of support.
Connection Mirroring not supported in public cloud
Let me be clear, connection mirroring is not supported in public cloud. Do not plan to use this in public cloud. This article is intended to satisfy your curiosity and answer "why not", rather than "how to".
If you deploy F5 BIG-IP VE in public cloud and configure connection mirroring following the typical setup guidelines, you might still see mirrored connections on your standby device with the command show sys connection type mirror. You may wonder to yourself, "what would happen if I enable connection mirroring on a Virtual Server, and attempt a failover?"
Why is connection mirroring not supported in public cloud?
Naturally, my customer and I were curious to see if we could even test connection mirroring to watch it fail in public cloud.
When we read this FAQ from the Cloud Failover Extension (CFE) documentation, the support stance made sense. There are many reasons that connection mirroring will not work in a public cloud. In rough order from biggest to smallest show-stopper, in my opinion, here are some (but not all) of the reasons:
- Failover time
- When we rely on API calls for HA in cloud, failover will typically take longer than a network connection timeout will allow.
- As an alternative to API calls, we can also use cloud-based load balancers for HA in cloud, but again, the failover time is too long. So, even if connection mirroring itself worked fine, your actual outcome (e.g., your application maintaining state for users) will likely be failure, since the API response and the associated cloud networking changes take too long.
- NAT and SNAT
- When failing over in public cloud, you often edit destination NATs, for example by re-associating an AWS EIP with a different internal IP address. If this is the case, connection mirroring will not work.
- Often we source NAT (SNAT) in cloud but do not have "floating" IP addresses for the server-side self-IP. If the SNAT address changes at failover, I expect connection mirroring to fail.
- Cloud provider level of control
- API calls to the cloud provider have no SLA. They typically get authenticated with a ticket and are queued, so it would be almost impossible to test (and guarantee) performance of API calls to AWS, Azure, or GCP.
- Other components may track state of connections. The security groups you've set up in cloud, any DDoS or security/network policies you have from your cloud provider, any other firewalls that you traverse - all of these may not handle your connection state as you expect.
To summarize, there's much to cloud networking that we as users cannot see or control.
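To make the failover-time argument concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the three phases, the `FailoverBudget` and `connection_survives` names, and the numbers are hypothetical, not measurements from any cloud provider. It only shows that a mirrored connection is useful when the whole failover finishes before the client gives up.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Hypothetical breakdown of one cloud failover, in seconds."""
    detection_s: float    # standby notices that the active device is gone
    api_call_s: float     # cloud API call is authenticated, queued, accepted
    propagation_s: float  # cloud network actually applies the change

    def total(self) -> float:
        return self.detection_s + self.api_call_s + self.propagation_s


def connection_survives(budget: FailoverBudget, client_tolerance_s: float) -> bool:
    """True only if traffic reaches the new active device before the
    client or application stops waiting."""
    return budget.total() < client_tolerance_s


# Illustrative numbers only (assumptions, not benchmarks).
budget = FailoverBudget(detection_s=3.0, api_call_s=2.0, propagation_s=5.0)
print(budget.total())
print(connection_survives(budget, client_tolerance_s=5.0))
```

Only the first term is under the BIG-IP administrator's control. The other two belong to the cloud provider, which is why the result cannot be guaranteed.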
My customer's test failures, and what we learned
As engineers, sometimes our instinct is to push back: I know it's unsupported and I understand why, but can I test it myself? Naturally, my customer and I wanted to push further, and we had read Jeff Giroux's article about HA in cloud, which made us curious. My customer was using AWS route-based failover with CFE, which typically completes in only a few seconds.
First, we had to enable connection mirroring on the VE image they had used, which came straight from our CloudFormation Templates (CFT), specifically this one. To set this up and test it, we had to:
- Enable connection mirroring over a VLAN (Devices > Device (Self) > Mirroring tab).
- Allow TCP/1029-1055 between the appropriate ENIs in the AWS security group.
- Allow TCP/1029-1055 in the Self IP port lockdown (Network > Self IPs > Port Lockdown).
- Perform a failover in AWS with a connection that was mirrored.
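The security group step above can be sketched in Python. This is a hedged illustration, not the procedure we followed (we used the AWS console). The helper name `mirroring_ingress_rule` and the security group ID are hypothetical. The dictionary follows the `IpPermissions` shape accepted by the EC2 `authorize_security_group_ingress` call, and the port range is the one quoted in the steps.

```python
MIRROR_PORT_RANGE = (1029, 1055)  # TCP range quoted in the steps above


def mirroring_ingress_rule(peer_group_id: str) -> dict:
    """Build one EC2 IpPermissions entry that allows the mirroring ports
    from members of the given security group (hypothetical helper)."""
    from_port, to_port = MIRROR_PORT_RANGE
    return {
        "IpProtocol": "tcp",
        "FromPort": from_port,
        "ToPort": to_port,
        "UserIdGroupPairs": [{"GroupId": peer_group_id}],
    }


# Placeholder group ID, an assumption for illustration only.
rule = mirroring_ingress_rule("sg-0123456789abcdef0")
print(rule)

# With boto3 installed and credentials configured, the rule could be applied as:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0", IpPermissions=[rule])
```

Referencing the peer's security group rather than a CIDR keeps the rule valid if the self IPs change.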
After setting this up and creating a Virtual Server with the "Connection Mirroring" checkbox checked in the "Advanced" configuration (the default is unchecked), our test was not successful. I tested with an SSH session using PuTTY, but my SSH session was dropped at the time of device failover.
My client was an Ubuntu VM in the external VLAN of the BIG-IP. The SSH session was established to a VIP whose IP was from an alien range (that is, outside the VPC CIDR), and I used AWS route-based failover to move this IP address between the Active and Standby BIG-IPs, which were in different AZs. The backend pool member was another Ubuntu VM to which my SSH connection was proxied.
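For readers unfamiliar with route-based failover, the sketch below shows the kind of API call involved: repointing the VIP's route at the newly active device's network interface. It is an assumption about the general shape of the operation, not CFE's actual implementation. `replace_route` and its three parameters are part of the EC2 API as exposed by boto3. The `RecordingEc2Client` stub, the `move_vip_route` helper, and all IDs and addresses are hypothetical so that the example runs without AWS access.

```python
class RecordingEc2Client:
    """Stand-in for boto3.client("ec2") so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def replace_route(self, **kwargs):
        # A real client would send this request to AWS; here we only record it.
        self.calls.append(kwargs)
        return {}


def move_vip_route(ec2, route_table_id: str, vip_cidr: str, active_eni_id: str):
    """Point the VIP's route at the newly active BIG-IP's network interface."""
    return ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=vip_cidr,
        NetworkInterfaceId=active_eni_id,
    )


# Placeholder IDs and a documentation-range address (assumptions).
ec2 = RecordingEc2Client()
move_vip_route(ec2, "rtb-0example", "192.0.2.10/32", "eni-0newactive")
print(ec2.calls)
```

Until that call is accepted and the route change takes effect, client packets still go to the old device, no matter how complete the mirrored connection table on the new one is.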
So, even without destination NAT, without any third-party security devices, and with the bare minimum of failover time, connection mirroring did not work for us in public cloud.
I have heard anecdotal stories of connection mirroring half-working sometimes in public cloud. It has never worked for me, but I did hear from a customer who said that only some of their EPIC EHR connections dropped when they did similar testing. Because some connections still dropped, they did not manage to test connection mirroring in any reliable way.
I'll leave it there for now, but if you have other test scenarios you can think of, let me know in the comments!
Conclusion
Do not plan to use connection mirroring in public cloud. It is unsupported. But I hope this article has helped you think through some of the architectures and implications of planning for HA in public cloud. Thanks for reading!
Related articles
K84303332: Overview of connection and persistence mirroring (13.x - 16.x)
- shsingh (Employee)
Thanks for the writeup MichaelOLeary.
I will add that there are some other things to think about when looking to use connection-mirroring (TL;DR most times it's unnecessary):
- SSH and RDP type services do have the ability to set keep-alives
- in SSH for example you can set the following in your ~/.ssh/config
    # Send up to 10 keep-alive messages, one every 60 seconds
    ServerAliveInterval = 60
    ServerAliveCountMax = 10
- Ensuring that your fastL4 virtual servers have the ability to pass flows in flight (e.g. the loose-initiation and loose-close settings)
- non-HTTP and Standard Virtual Servers that need connection mirroring definitely come with a caveat for the type of app and protocol (databases for example)
Having said all that, if you do have a fastL4 wildcard routing-type Virtual Server, *most* protocols tend to be fine unless the connection is in the middle of a transaction (e.g. a database write).
I've helped customers deploy BIG-IP in Carrier Grade NAT scenarios (which are similar in some respects to a cloud-based environment) to be able to "seamlessly" fail devices or reboot so that subscribers are generally unaware their Internet is down: https://www.youtube.com/watch?v=hsb0OtqO_AM&list=PL5jC9WagzrjExq85JuWQHSUm9PegO3JmR&index=16
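As a worked example of the SSH keep-alive settings mentioned in this comment, the short Python sketch below computes roughly how long the client waits on an unresponsive server before disconnecting. The function name `ssh_disconnect_after` is hypothetical. The arithmetic follows the ssh_config documentation, which describes the timeout as approximately the interval multiplied by the count.

```python
def ssh_disconnect_after(server_alive_interval: int, server_alive_count_max: int) -> int:
    """Approximate seconds an SSH client tolerates an unresponsive server:
    one keep-alive every `server_alive_interval` seconds, giving up after
    `server_alive_count_max` of them go unanswered."""
    return server_alive_interval * server_alive_count_max


# Values from the comment above: ServerAliveInterval 60, ServerAliveCountMax 10
print(ssh_disconnect_after(60, 10))  # 600
```

With those values the client tolerates roughly ten minutes of silence, far longer than the failover times discussed in the article. These keep-alives do not preserve the connection state on the BIG-IP itself.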
- MichaelOLeary (Employee)
Thanks a lot for the comments and the link to the video, shsingh. Like you said, ensuring your fastL4 virtual servers can pass flows in flight with the loose-initiation and loose-close settings is a good practice. I should have written that in the article. Also, like you said, SSH, RDP, and other protocols can have configurable keep-alives.
So if you're reading these comments and want to dive in further on your particular issue, leave a comment and/or reach out to us. Thanks for reading!