Forum Discussion
HA Cluster behavior on AWS
- Sep 20, 2022
Let's break the problem down into two items: the upgrade and CFE.
Upgrading
When you upgrade an existing deployment, the software is installed into a new slot on the HDD and the configuration is imported. After the system(s) are rebooted, the HA iApp may need to be reinstalled and the iApp configuration reapplied to BOTH systems. A telltale sign that the HA iApp configuration has not been applied to both systems is that, when the secondary device attempts to go active, nothing listing tg_active appears in /var/log/ltm. If you do see the tg_active scripts firing on the system attempting to move from standby to active but the mapped configuration objects do not move, then either the instance does not have the IAM permissions, it does not have access to the EC2 API (normally via eth0; the route command and its route metrics will show the exact interface), or the secondary elastic IPs (public IPs) were not allowed to be remapped when the system was deployed.
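As a rough illustration of the log check described above, here is a minimal Python sketch that filters log lines for the tg_active marker. The helper name `find_tg_active_lines` and the sample lines are made up for illustration; real /var/log/ltm entries will look different.

```python
from typing import Iterable, List


def find_tg_active_lines(log_lines: Iterable[str], marker: str = "tg_active") -> List[str]:
    """Return the log lines that mention the failover marker."""
    return [line.rstrip("\n") for line in log_lines if marker in line]


# Illustrative lines only; these are NOT real BIG-IP log entries.
sample = [
    "Sep 20 10:00:01 bigip2 notice example: device going standby\n",
    "Sep 20 10:05:42 bigip2 notice example: running tg_active script\n",
]

hits = find_tg_active_lines(sample)
print(f"{len(hits)} line(s) mention tg_active")
```

On a device you would pass an open file handle for /var/log/ltm instead of the sample list. No hits after a failover attempt points at the iApp configuration not being applied on that unit.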
Migrating to CFE
Migrating from the HA iApp to CFE requires you to remove the HA iApp and then install CFE. The migration to CFE depends on proper IAM roles and on access to the EC2 API and the S3 API endpoint (you can use VPC endpoints for these if necessary). A peer of mine who works with the integration lays out the steps as follows:
- Gather the EIPs as defined in the HA iApp and map them to the static elastic IP definitions: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html#define-the-failover-addresses-in-aws
- Gather the routes as defined in the HA iApp and map them to the static route definitions: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html#define-the-routes-in-aws
- Uninstall the HA iApp.
- Start fresh installation of CFE: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/installation.html
- Configure by running through https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html.
- The example there already uses the static config (with EIP mappings), so it should be really close to the old iApp.
- Just remember to tag the network interfaces; they are the only thing that needs to be tagged with the static config.
- The hardest part might be getting the IAM role right when migrating, as we have a much more granular role example now, an S3 bucket has been added, etc.
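To make the "gather EIPs and routes, then map them to static definitions" steps more concrete, here is a hedged Python sketch that assembles a static CFE declaration as a dictionary. The helper `build_cfe_declaration` is hypothetical. The key names (`failoverAddresses`, `addressGroupDefinitions`, `routeGroupDefinitions`, and so on) are recalled from the static example in the AWS user guide linked above and should be verified there before use. Every address, label, and route table ID is a placeholder.

```python
import json
from typing import Dict, List


def build_cfe_declaration(
    label: str,
    eip_to_vips: Dict[str, List[str]],
    route_table_id: str,
    route_range: str,
    next_hops: List[str],
) -> dict:
    """Assemble a static-style declaration from values gathered from the HA iApp.

    Key names are assumptions based on the linked user guide; check them there.
    """
    tags = {"f5_cloud_failover_label": label}
    return {
        "class": "Cloud_Failover",
        "environment": "aws",
        "externalStorage": {"scopingTags": tags},
        "failoverAddresses": {
            "enabled": True,
            "scopingTags": tags,
            "addressGroupDefinitions": [
                {"type": "elasticIpAddress", "scopingAddress": eip, "vipAddresses": vips}
                for eip, vips in eip_to_vips.items()
            ],
        },
        "failoverRoutes": {
            "enabled": True,
            "routeGroupDefinitions": [
                {
                    "scopingName": route_table_id,
                    "scopingAddressRanges": [{"range": route_range}],
                    "defaultNextHopAddresses": {"discoveryType": "static", "items": next_hops},
                }
            ],
        },
    }


# Placeholder values standing in for what you gathered from the HA iApp.
declaration = build_cfe_declaration(
    label="mydeployment",
    eip_to_vips={"203.0.113.10": ["10.0.12.101", "10.0.22.101"]},
    route_table_id="rtb-EXAMPLE",
    route_range="0.0.0.0/0",
    next_hops=["10.0.13.11", "10.0.23.11"],
)
print(json.dumps(declaration, indent=2))
```

Building the declaration from the gathered values keeps the old iApp mappings and the new CFE config side by side, which makes it easier to spot an EIP or route that was missed.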
Upgrade or Deploy New
I cannot answer that for you, as there are many nuances to the architecture and each option carries some level of work.
A parallel deployment allows you to build out the new stack and its operational aspects, and then cut over the DNS records of the virtual IP addresses. It carries with it the work of having to migrate the virtual server configurations.
An upgrade without the installation of CFE is just a standard upgrade followed by reapplying the HA iApp config.
An upgrade after the installation of CFE will be similar to the HA iApp case (install the package post-upgrade, then apply the config).
I would separate the migration from the HA iApp to CFE from the OS upgrade if you are not performing a parallel deployment. Why? Failing over in the cloud requires access to the proper roles and API endpoints. Trying to upgrade the OS and the failover tooling at the same time can lead to a large amount of work. Some aspects of the cloud failover tooling depend on the cloud provider, so if you have to troubleshoot both an upgrade AND the iApp migration, a change window can quickly become too small.
In any upgrade scenario, you should always take a backup of each BIG-IP. Additionally, you have the option to take a snapshot of the VM disk when it is powered off.
Hi,
You mention that you configured the devices from scratch. I am inferring this means that you did not use our v1 CFTs or v2 CFTs.
My recommendation is that you go back and use the CFTs or other automation such as the Terraform module. Manual setups are error prone, as you need to create the security groups, rules, routes, and IAM roles on your own. The automation tools address these for you and can save you a lot of time figuring out what is wrong.
Assuming that your security groups, route tables, and network ACLs are correct, that you have S3 API and EC2 API access and DNS enabled in the VPC, and that you have the proper routing set on the BIG-IPs, it will come down to IAM. Based on some of the other problems you mention, these items will also need your attention. The IAM role is documented in Amazon Web Services: High Availability F5 BIG-IP Virtual Edition.
| Symptom | Likely cause |
| --- | --- |
| Both devices active | Improper security group rules (refer to documentation), routing issue |
| Device cannot fail over | IAM issue, EC2 access issue, S3 access issue, DNS issue |
| Peer device offline (when it is actually not) | Improper security group rules (refer to documentation), routing issue |
While testing this setup I saw that if the active device goes offline for any reason, the peer device does nothing, even though the CMI logs on the standby unit say that the peer device is unreachable. Is that normal? Should the standby device go active once it realizes the peer is unreachable? - No, this is not normal. There is something very wrong with your setup. The devices should take over "as normal", except that failover is an API-driven event, so it is much slower than one expects in a data center (AWS does not support GARP, so remapping must be done at the API layer).
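To show what "remapping at the API layer" can look like, here is a minimal Python sketch of re-associating an Elastic IP with the newly active device's network interface. This is not how the HA iApp or CFE is implemented. It only shows the kind of EC2 call involved, using the parameter names of the boto3 EC2 client's `associate_address` method. The helper `remap_elastic_ip`, the `FakeEc2Client` stand-in, and all IDs and addresses are hypothetical.

```python
from typing import Any, Dict, List


def remap_elastic_ip(ec2_client: Any, allocation_id: str, eni_id: str, private_ip: str) -> Dict[str, Any]:
    """Re-associate an Elastic IP with an interface on the newly active device."""
    return ec2_client.associate_address(
        AllocationId=allocation_id,
        NetworkInterfaceId=eni_id,
        PrivateIpAddress=private_ip,
        # Without this, moving an EIP that is already associated elsewhere fails.
        AllowReassociation=True,
    )


class FakeEc2Client:
    """Stand-in that records calls so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def associate_address(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"AssociationId": "eipassoc-EXAMPLE"}


fake = FakeEc2Client()
result = remap_elastic_ip(fake, "eipalloc-EXAMPLE", "eni-EXAMPLE", "10.0.1.101")
print(result["AssociationId"])
```

Against a real account you would pass `boto3.client("ec2")` instead of the fake. The call then needs network access to the EC2 API and an IAM role that allows the association, which are the same dependencies that break failover when they are missing.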
Although there is a Sync-Failover device group configured on both devices, each device reports itself as "Active". Does this happen in setups built with AWS CloudFormation or Terraform? - No. The systems will build and cluster correctly. I suspect that the systems can communicate on port 443 but not on UDP 1026.
Devices can sync objects I added, including pools, iRules, nodes, virtual servers, etc. Also, the status indicator in the top left corner says the devices are "In Sync". However, when I looked at "Device Manager > Devices", each device sees the other device as offline. Why? - Your security group either does not have the correct ports open or you have not set up the correct routes for the peers to communicate. The cause can vary based on whether this is a single-NIC, two-NIC, or three-NIC deployment, which peer addresses you used to peer, and the routes configured. See the comment above.
When I use the GUI for failover, it takes around 1 minute and 20 seconds. But if I trigger failover with a "curl" command, it only takes 5 seconds. Is that normal? - Above you say that failover does not happen; here you say it does. When you manually fail over, does the failover persist, or does it revert to both systems thinking they are active? See the comments above about security groups and routing. I suspect you have a scenario where you attempt to fail over and the device bounces back to active.
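One quick way to narrow down a security group or routing problem between the peers is a plain TCP probe. The sketch below uses a hypothetical helper, `tcp_port_open`, to check whether a TCP port accepts connections. It can confirm the port 443 path mentioned above, but it cannot verify UDP 1026: UDP is connectionless, so a simple connect test proves nothing there, and you still need to review the security group rules. The local listener exists only so the sketch can check itself.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-check against a throwaway local listener instead of a real peer.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
local_port = listener.getsockname()[1]

open_result = tcp_port_open("127.0.0.1", local_port)
listener.close()
closed_result = tcp_port_open("127.0.0.1", local_port, timeout=1.0)

print(f"listener up: {open_result}, listener closed: {closed_result}")
```

In practice you would run something like `tcp_port_open("<peer self IP>", 443)` from each unit toward the address you used for peering, which also tells you whether the route you expect is actually in use.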
I have successfully deployed a fully functional HA pair of devices on AWS and am very happy with that. Thank you for pointing me in the right direction.
One question still remains. I started this journey with a problem. A customer already has a couple of device pairs working on AWS, and they built these clusters with the old iApp (f5.aws_advanced_ha.v1.3.0rc1 and f5.aws_advanced_ha.v1.4.0rc5) that was provided earlier. However, as you probably know, the iApp is deprecated and CFE has replaced it. A month ago, when we tried to upgrade these devices to recent BIG-IP versions, we had a couple of problems with the failover function. I'm not a cloud expert who can figure out those problems on my own, so we rolled the upgrade back.
What is the best way to upgrade an existing HA pair that was built with the iApp to a CFE-supported HA cluster?
- Tear down the old HA pair and build a new pair in the same VPC?
- Build another HA pair (with CFE support) and move/adapt the configuration to the new infrastructure?
- Or might changing some parts of the AWS and F5 setup help us adjust the old deployment to the new one?