Forum Discussion

Torijori_Yamamada
Sep 14, 2022
Solved

HA Cluster behavior on AWS

Hello,

I've managed to run two F5 instances in different Availability Zones (in the same VPC) from scratch, and they can sync the configuration objects I created. The CFE configuration took some time to figure out, but it is working now. I have a couple of questions about HA clusters on AWS. Can you help me understand?

- While testing this setup, I saw that if the active device goes offline for any reason, the peer device does nothing. The CMI logs on the standby unit even say that the peer device is unreachable. Is that normal? Shouldn't the standby device take action and go active once it realizes the peer is unreachable?

- The devices can sync the objects I added, including pools, iRules, nodes, virtual servers, etc. The status indicator in the top left corner also says the devices are "In Sync". However, when I look at "Device Management > Devices", each device sees the other device as offline. Why?

- Although there is a Sync-Failover device group configured on both devices, each device reports itself as "Active". Does this happen in setups guided by the AWS CloudFormation templates or Terraform?

- When I use the GUI to fail over, it takes around 1 minute and 20 seconds, but if I trigger failover with a "curl" command it only takes about 5 seconds. Is that normal?

6 Replies

  • Hi,

You mention that you configured the devices from scratch. I am inferring this means that you did not use our v1 CFTs or v2 CFTs.

My recommendation is to go back and use the CFTs or other automation such as the Terraform module. Manual setups are error prone because you need to create the security groups, rules, routes, and IAM roles on your own. The automation tools handle these for you and can save you a lot of time figuring out what is wrong.

Assuming that your security groups, route tables, and network ACLs are correct, you have S3 API and EC2 API access, DNS is enabled in the VPC, and you have the proper routing set on the BIG-IPs, it will come down to IAM. Based on some of the other problems you mention, these items will also need your attention. The IAM role is documented in Amazon Web Services: High Availability F5 BIG-IP Virtual Edition.


    Symptom: Both devices active. Action: check for improper security group rules (refer to the documentation) or a routing issue.
    Symptom: Device cannot fail over. Action: check for an IAM issue, EC2 API access issue, S3 access issue, or DNS issue.
    Symptom: Peer device shown offline (when it is actually up). Action: check for improper security group rules (refer to the documentation) or a routing issue.



    "While testing this setup, I saw that if the active device goes offline for any reason, the peer device does nothing. The CMI logs on the standby unit even say that the peer device is unreachable. Is that normal? Shouldn't the standby device take action and go active once it realizes the peer is unreachable?" - No, this is not normal. There is something very wrong with your setup. The devices should take over as normal, except that failover is an API-driven event, so it is much slower than one expects in a data center (AWS does not support GARP, so remapping must be done at the API layer).
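    For context, "remapping at the API layer" boils down to an EC2 call that moves the Elastic IP to the new active unit's interface. CFE and the HA iApp make this call for you; purely as an illustration, a minimal boto3 sketch (all IDs are hypothetical):

    ```python
    import boto3

    ec2 = boto3.client("ec2")

    # Re-point an Elastic IP at the new active unit's ENI. AllowReassociation
    # lets the EIP move even though it is still attached to the old peer.
    ec2.associate_address(
        AllocationId="eipalloc-0123456789abcdef0",   # hypothetical EIP allocation
        NetworkInterfaceId="eni-0123456789abcdef0",  # new active unit's ENI
        AllowReassociation=True,
    )
    ```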

    "Although there is a Sync-Failover device group configured on both devices, each device reports itself as "Active". Does this happen in setups guided by the AWS CloudFormation templates or Terraform?" - No. The systems will build and cluster correctly. I suspect that the systems can communicate on port 443 but not UDP 1026.

    "The devices can sync the objects I added... However, when I look at "Device Management > Devices", each device sees the other device as offline. Why?" - Your security group either does not have the correct ports open, or you have not set up the correct routes for the peers to communicate. The cause can vary based on whether this is a single-NIC, two-NIC, or three-NIC deployment, which peer addresses you used for peering, and the routes configured. See the comments above.
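    If you want to sanity-check the security group quickly, here is a rough boto3 sketch (my own, not an F5 tool) that verifies the inbound rules the peers need between their self IPs: TCP 443 for config sync and UDP 1026 for network failover. The group ID is a placeholder:

    ```python
    import boto3

    SG_ID = "sg-0123456789abcdef0"  # hypothetical: the BIG-IP security group

    ec2 = boto3.client("ec2")
    sg = ec2.describe_security_groups(GroupIds=[SG_ID])["SecurityGroups"][0]

    required = {("tcp", 443), ("udp", 1026)}
    found = set()
    for rule in sg["IpPermissions"]:
        proto = rule.get("IpProtocol")
        if proto == "-1":  # rule allows all traffic
            found |= required
            continue
        lo, hi = rule.get("FromPort"), rule.get("ToPort")
        for want_proto, want_port in required:
            if proto == want_proto and lo is not None and lo <= want_port <= hi:
                found.add((want_proto, want_port))

    for proto, port in sorted(required - found):
        print(f"missing inbound rule: {proto}/{port}")
    ```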

    "When I use the GUI to fail over, it takes around 1 minute and 20 seconds, but if I trigger failover with a "curl" command it only takes about 5 seconds. Is that normal?" - Above you say that failover does not happen; here you say it does. When you manually fail over, does the failover persist, or does it revert to both systems thinking they are active? See the comments above about security groups and routing. I suspect you have a scenario where you attempt to fail over and the device bounces back to active.
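    For reference, the "curl" failover you describe should map to the iControl REST sys/failover endpoint. A Python equivalent of "tmsh run sys failover standby" might look like the sketch below (host and credentials are placeholders; confirm the endpoint against your version's documentation):

    ```python
    import requests

    BIGIP = "https://10.0.1.245"   # hypothetical management address
    AUTH = ("admin", "admin-password")

    # Force the current active unit to standby so the peer takes over.
    resp = requests.post(
        f"{BIGIP}/mgmt/tm/sys/failover",
        auth=AUTH,
        json={"command": "run", "standby": True},
        verify=False,              # lab only; use proper certificates in production
    )
    resp.raise_for_status()
    ```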


     

    • Torijori_Yamamada

      Hello again,

      Thank you for your answer. I'll look into the guided installation to eliminate possible mistakes. Perhaps I can do it the right way this time.


    • Torijori_Yamamada

      I have successfully deployed a fully functional HA pair on AWS and am very happy with that. Thank you for pointing me in the right direction.

      One question still remains. I started this journey with a problem. A customer already has a couple of device pairs working on AWS, and they built these clusters with the help of the old iApp (f5.aws_advanced_ha.v1.3.0rc1 and f5.aws_advanced_ha.v1.4.0rc5) that was provided earlier. However, as you probably know, the iApp is deprecated and CFE has replaced it. A month ago, when we tried to upgrade these devices to a recent BIG-IP version, we had a couple of problems with the failover function. I'm not enough of a cloud expert to figure out those problems on my own, so we rolled the upgrade back.

      What is the best way to upgrade an existing HA pair that was built with the iApp to a CFE-supported HA cluster?
      - Tear down the old HA pair and build a new pair on top of the same VPC?
      - Build another HA pair (with CFE support) and move/adapt the configuration to the new infrastructure?
      - Or, perhaps, changing some parts of the AWS and F5 setup might help us adjust the old to the new?


  • Let's break the problem down into two items: the upgrade and CFE.

    Upgrading

    When you upgrade an existing deployment, software is installed into a new slot on the disk and the configuration is imported. After the systems are rebooted, the HA iApp may need to be reinstalled and the iApp configuration applied to BOTH systems. A telltale sign of the HA iApp configuration not being applied to both systems is that, when the secondary device attempts to go active, you do not see anything in /var/log/ltm listing tg_active. If you do see the tg_active scripts firing on the system attempting to move from standby to active, but the mapped configuration objects do not move, then either the instance does not have the IAM permissions, does not have access to the EC2 API (normally via eth0; the exact interface is exposed via the route command and the route metrics), or the secondary elastic IPs (public IPs) were not allowed to be remapped when the system was deployed.
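    If you are unsure whether the instance has the IAM permissions and EC2 API access described above, a quick check from the BIG-IP (or any instance with the same role) can narrow it down. A hedged sketch using boto3's DryRun mechanism; the resource IDs are placeholders:

    ```python
    import boto3
    import botocore.exceptions

    ec2 = boto3.client("ec2")

    # 1) Basic reachability: a cheap read call fails fast if there is no
    #    route to the EC2 API endpoint.
    ec2.describe_regions()

    # 2) Permission check: DryRun asks AWS to evaluate the call without
    #    executing it.
    try:
        ec2.associate_address(
            AllocationId="eipalloc-0123456789abcdef0",   # hypothetical EIP
            NetworkInterfaceId="eni-0123456789abcdef0",  # hypothetical ENI
            DryRun=True,
        )
    except botocore.exceptions.ClientError as err:
        code = err.response["Error"]["Code"]
        # "DryRunOperation" means the role permits the remap;
        # "UnauthorizedOperation" means it is missing ec2:AssociateAddress.
        print(code)
    ```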

    Migrating to CFE

    Migrating from the HA iApp to CFE requires you to remove the HA iApp and then install CFE. The migration to CFE depends on proper IAM roles, access to the EC2 API, and the S3 API endpoint (you can use VPC endpoints for these if necessary). A peer of mine who works with the integration lays out the steps as follows:

    1. Gather the EIPs defined in the HA iApp into the static elastic IP definitions: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html#define-the-failover-addresses-in-aws
    2. Gather the routes defined in the HA iApp into the static route definitions: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html#define-the-routes-in-aws
    3. Uninstall the HA iApp.
    4. Start a fresh installation of CFE: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/installation.html
    5. Configure by running through https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html (a sketch of the tagging and declare calls follows this list).
      1. i.e. the example even uses the static config (with EIP mappings), so it should be really close to the old iApp.
      2. Just remember to tag the network interfaces; that's the only thing that needs to be tagged with the static config.
      3. The hardest part might be migrating/getting the IAM role right, as we have a much more granular role example now, an S3 bucket added, etc.
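
    To make steps 4-5 concrete, here is a rough sketch of tagging an ENI and posting a minimal tag-scoped CFE declaration over iControl REST. The declaration shape follows the AWS quickstart linked above, but treat the exact field names as something to confirm against the docs for your CFE version; all addresses, IDs, and the tag value are placeholders:

    ```python
    import boto3
    import requests

    # Tag the BIG-IP ENI so CFE can discover it (step 5.2 above).
    ec2 = boto3.client("ec2")
    ec2.create_tags(
        Resources=["eni-0123456789abcdef0"],  # hypothetical BIG-IP ENI
        Tags=[{"Key": "f5_cloud_failover_label", "Value": "mydeployment"}],
    )

    # Post a minimal declaration to the CFE declare endpoint.
    BIGIP = "https://10.0.1.245"              # hypothetical management address
    AUTH = ("admin", "admin-password")
    declaration = {
        "class": "Cloud_Failover",
        "environment": "aws",
        "externalStorage": {"scopingTags": {"f5_cloud_failover_label": "mydeployment"}},
        "failoverAddresses": {"scopingTags": {"f5_cloud_failover_label": "mydeployment"}},
    }
    resp = requests.post(
        f"{BIGIP}/mgmt/shared/cloud-failover/declare",
        auth=AUTH,
        json=declaration,
        verify=False,                         # lab only
    )
    print(resp.status_code, resp.json())
    ```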


    Upgrade or Deploy New

    I cannot answer that for you, as there are many nuances to the architecture and each option carries some level of work.

    A parallel deployment allows you to build out the new stack and its operational aspects, and then cut over the DNS records of the virtual IP addresses. It carries the work of having to migrate the virtual server configurations.
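    The DNS cutover itself can be as small as an UPSERT against the zone. Purely illustrative (the zone ID, record name, and address are hypothetical):

    ```python
    import boto3

    r53 = boto3.client("route53")
    r53.change_resource_record_sets(
        HostedZoneId="Z0123456789ABCDEFGHIJ",  # hypothetical hosted zone
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "TTL": 60,
                    # EIP of the new stack's virtual server
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }]
        },
    )
    ```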

    An upgrade without the installation of CFE is just a standard upgrade followed by reapplying the HA iApp config.

    An upgrade after the installation of CFE will be similar to the HA iApp flow (install the package post-upgrade, then apply the config).

    I would separate the migration from the HA iApp to CFE from the OS upgrade if you are not performing a parallel deployment. Why? Failing over in the cloud requires access to the proper roles and API endpoints. Trying to upgrade the OS and the failover tooling at the same time can lead to a large amount of work. With the cloud failover tooling there are aspects that differ depending on the cloud provider, so if one has to troubleshoot both an upgrade AND the iApp migration, a change window can become small.

    In any upgrade scenario you should always take a backup of each BIG-IP. Additionally, you have the option to take a snapshot of the VM disk while it is powered off.
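    For the backup itself, the REST equivalent of "tmsh save sys ucs" is handy to script against both units before the change window (placeholders as before; confirm against your version's iControl REST docs):

    ```python
    import requests

    BIGIP = "https://10.0.1.245"   # hypothetical management address
    AUTH = ("admin", "admin-password")

    # Save a UCS archive to /var/local/ucs on the unit.
    resp = requests.post(
        f"{BIGIP}/mgmt/tm/sys/ucs",
        auth=AUTH,
        json={"command": "save", "name": "pre-upgrade.ucs"},
        verify=False,              # lab only
    )
    resp.raise_for_status()
    ```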


    • Torijori_Yamamada

      Thanks for your time and your very thorough answer. To familiarize myself with the issues that could pop up during the upgrade, I'm going to build an iApp-deployed cluster and try to upgrade it to CFE.

    • Torijori_Yamamada

      Hello again

      At last, I've successfully deployed a pair of F5 devices on AWS and gathered them into an HA group with the help of the iApp. I ran dozens of tests on them.

      After that, I updated the HA configuration so they both use CFE. They worked well with CFE. At the end of my tests, I tried reverting from CFE back to the iApp and succeeded. While testing this environment, I saw that both devices can fail over perfectly even when one of them is configured with the iApp and the other with CFE.

      Now I believe I'm ready to manage the whole operation of upgrading an F5 cluster currently running with the iApp. I have even prepared a document to share with the other parties who will participate in the upgrade work. The document contains the steps required to upgrade an HA cluster from the iApp to CFE, and the steps needed to revert back.

      Thank you for your answer. It really helped me.