Configure HA Groups on BIG-IP
Last week we talked about how HA Groups work on BIG-IP, and this week we'll look at how to configure HA Groups on BIG-IP. To recap, an HA group is a configuration object you create and assign to a traffic group for devices in a device group. An HA group defines health criteria for a resource (such as an application server pool) that the traffic group uses. With an HA group, the BIG-IP system can decide whether to keep a traffic group active on its current device or fail over the traffic group to another device when resources such as pool members fall below a certain level.

First, some prerequisites:

Basic Setup: Each BIG-IP (v13) is licensed, provisioned and configured to run BIG-IP LTM
HA Configuration: All BIG-IP devices are members of a sync-failover device group and synced
Each BIG-IP has a unique virtual server with a unique server pool assigned to it
All virtual addresses are associated with traffic-group-1

To the BIG-IP GUI! First, go to System > High Availability > HA Group List and click the Create button. The first step is to name the group. Give it a descriptive name that indicates the object type, the device it pertains to, and the traffic group it pertains to. In this case, we'll call it 'ha_group_deviceA_tg1.'

Next, we'll click Add in the Pools area under Health Conditions and add the pool for BIG-IP A (which we've already created) to the HA Group. We then move on to the minimum member count: the number of pool members that need to be up for traffic-group-1 to remain active on BIG-IP A. In this case, we want 3 out of 4 members to be up. If that number falls below 3, the BIG-IP will automatically fail the traffic group over to another device in the device group.

Next is HA Score, and this is the sufficient threshold: the number of up pool members you want to represent a full health score. In this case, we'll choose 4, so if 4 pool members are up, the pool is considered to have a full health score; if fewer than 4 members are up, the health score is lower. We'll give it a default weight of 10, since 10 represents the full HA score for BIG-IP A. In other words, all 4 members need to be up in order for BIG-IP to give BIG-IP A an HA score of 10. Then we click Add. We'll see a summary of the health conditions we just specified, including the minimum member count and sufficient member count. Then click Create HA Group.

Next, go to Device Management > Traffic Groups and click on traffic-group-1. Now we'll associate this new HA Group with traffic-group-1: go to the HA Group setting and select the new HA Group from the drop-down list, then set the Failover Method to 'Device with the Best HA Score.' Click Save.

Now we do the same thing for BIG-IP B. Again, go to System > High Availability > HA Group List and click the Create button. Give it a descriptive name, click Add in the Pools area, and select the pool you've already created for BIG-IP B. For our scenario, we'll again require a minimum of 3 members to be up for traffic-group-1 to remain active on BIG-IP B. This minimum does not have to match the other HA Group, but it does in this example. Again, set a default weight of 10 in the HA Score for all pool members. Click Add and then Create HA Group for BIG-IP B. Then go to Device Management > Traffic Groups, click traffic-group-1, choose BIG-IP B's HA Group, and select the same failover method as on BIG-IP A: 'Device with the Best HA Score.' Click Save.
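For reference, here is the same configuration sketched in tmsh rather than the GUI. This is a minimal sketch, assuming a pool named pool_deviceA already exists on BIG-IP A; HA group attribute names can vary between TMOS versions, so verify the syntax against your release before relying on it:

    # Create an HA group that watches pool_deviceA:
    #   threshold 3 -- minimum up members for the pool to count toward the score
    #   weight 10   -- the score this pool contributes when fully healthy
    tmsh create sys ha-group ha_group_deviceA_tg1 \
        pools add { pool_deviceA { threshold 3 weight 10 } }

    # Attach the HA group to traffic-group-1; with an HA group attached,
    # the traffic group fails over based on HA score
    # ("Device with the Best HA Score" in the GUI)
    tmsh modify cm traffic-group traffic-group-1 ha-group ha_group_deviceA_tg1

The same two commands, with that device's pool name substituted, would be repeated on BIG-IP B and BIG-IP C.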
Lastly, you would create another HA Group on BIG-IP C, as we've done on BIG-IP A and BIG-IP B. Once that's done, you'll have the same setup as this: as you can see, BIG-IP A has lost another pool member, causing traffic-group-1 to fail over, and the BIG-IP software has chosen BIG-IP C as the next active device to host the traffic group because BIG-IP C has the highest HA Score based on the health of its pool. Thanks to our TechPubs group for the basis of this article, and check out a video demo here. ps
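If you want to watch that decision happen, both the score and the traffic group's current host can be inspected from the shell. A hedged sketch; these show commands exist in recent TMOS versions, but the output format varies by release:

    # HA score and per-condition contributions on this device
    tmsh show sys ha-group

    # Which device currently hosts each traffic group
    tmsh show cm traffic-group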

ARP/MAC Tables Not Updating on Core Switches After F5 LTM Failover (GARP Issue?)

We have two F5 LTM 5250v appliances, each configured with 2 vCMP instances, in an HA pair (Active/Standby). Each 5250v has a 10G uplink to two core switches (Cisco Nexus 7010), configured as an LACP port-channel on the F5 side and a port-channel/vPC on the Nexus side:

Port-Channel127/vPC127 = F5ADC01
Port-Channel128/vPC128 = F5ADC02

When I look at the MAC address tables on both 7K1 and 7K2, I can see all the individual F5 MACs for each VLAN we have configured on the vCMP instances. We are having an issue during automatic or manual failover where the MAC addresses for the virtual servers are not being updated. If F5ADC01 is Active and we force it to Standby, it immediately changes to Standby and F5ADC02 immediately takes over the Active role. However, the ARP tables on the Nexus 7K core switches do not get updated, so all the virtual servers continue to resolve to the MAC address of F5ADC01.

We have multiple partitions on each vCMP instance, with several VLANs associated with each partition. Each partition has a single route domain that the VLANs are allocated to. For traffic to virtual servers, we are using AutoMap to SNAT to the floating self IP and Auto Last Hop so return traffic passes through the correct source VLAN. We are not using MAC masquerading.

The ARP timeout on the Nexus 7Ks is 1500 seconds (the default), so it takes 25 minutes after a failover for a full network recovery: eventually the ARP entries age out for all virtual servers and get refreshed with the correct MAC address. Obviously this is not acceptable. I found a SOL article that talks about when GARPs can be missed after failover: SOL7332: Gratuitous ARPs may be lost after a BIG-IP failover event. We have confirmed the upstream core switches are not dropping any GARPs. As a test, I went in and manually disabled all virtual servers and then re-enabled them, and all MACs updated immediately.

I have opened a support case with F5, and we have yet to determine where the issue lies. Does anybody have any ideas what the issue might be? If I need to provide more information about our configuration, let me know. We are pretty new to the F5 platform; we recently migrated from the Cisco ACE30 platform, where failover worked perfectly with a similar cabling setup (two port-channels to two separate Catalyst 6509 switches with an ACE30 module in each switch). After ACE failover, the MAC tables/ARP caches immediately updated. Thank You!
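Two hedged suggestions for narrowing this down. First, capture during a failover to confirm the gratuitous ARPs are actually leaving the newly active unit and reaching the switch. Second, MAC masquerading (which this setup is not using) avoids the problem entirely, because the floating objects keep the same MAC across failovers, so upstream ARP caches never go stale. Interface, VLAN, and address values below are placeholders:

    # On the newly active BIG-IP: watch for gratuitous ARPs during failover
    # (replace 'external' with the VLAN carrying the virtual addresses)
    tcpdump -ni external arp

    # On the Nexus 7K: check which MAC a virtual address currently resolves to
    # show ip arp 192.0.2.10
    # show mac address-table address <mac-of-active-unit>

    # Possible mitigation: a MAC masquerade address on the traffic group.
    # The MAC below is a placeholder; use a locally administered address.
    tmsh modify cm traffic-group traffic-group-1 mac 02:01:23:45:67:89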

High Availability Groups on BIG-IP

High availability of applications is critical to an organization's survival. On BIG-IP, HA Groups is a feature that allows BIG-IP to fail over automatically based not on the health of the BIG-IP system itself, but on the health of external resources within a traffic group. These external resources include the health and availability of pool members, trunk links, VIPRION cluster members, or a combination of all three. This is the only failover trigger based on resources outside of the BIG-IP.

An HA group is a configuration object you create and assign to a traffic group for devices in a device group. An HA group defines health criteria for a resource (such as an application server pool) that the traffic group uses. With an HA group, the BIG-IP system can decide whether to keep a traffic group active on its current device or fail over the traffic group to another device when resources such as pool members fall below a certain level.

In this scenario, there are three BIG-IP devices (A, B, C), and each device has two traffic groups on it. As you can see, on BIG-IP A traffic-group-1 is active; on BIG-IP B traffic-group-2 is active; and on BIG-IP C both traffic groups are in a standby state. Attached to traffic-group-1 on BIG-IP A is an HA group which specifies that a minimum of 3 pool members out of 4 must be up for traffic-group-1 to remain active on BIG-IP A. Similarly, on BIG-IP B the traffic group needs a minimum of 3 out of 4 pool members up to stay active on BIG-IP B.

On BIG-IP A, if fewer than 3 members of traffic-group-1's pool are up, the traffic group will fail over. So let's say that 2 pool members go down on BIG-IP A. Traffic-group-1 responds by failing over to the device with the healthiest pool, which in this case is BIG-IP C. Now we see that traffic-group-1 is active on BIG-IP C.

Achieving the ultimate 'five nines' of web site availability (around 5 minutes of downtime a year) has been a goal of many organizations since the beginning of the internet era. There are several ways to accomplish this, but a few principles apply: eliminate single points of failure by adding redundancy, so that if one component fails, the entire system still works; have reliable crossover to the duplicate systems so they are ready when needed; and have the ability to detect failures as they occur so proper action can be taken. If the first two are in place, hopefully you never see a failure. But if you do, HA Groups can help. ps

Related:
Lightboard Lessons: BIG-IP Basic Nomenclature
Lightboard Lessons: Device Services Clustering
HA Groups Overview
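Since pools, trunks, and VIPRION cluster members can all feed the score, a single HA group can mix them. A hedged tmsh sketch with placeholder object names; the trunks stanza in particular should be verified against your TMOS version:

    # Pool health and trunk link health both contribute to the HA score;
    # active-bonus adds 10 to whichever device currently holds the traffic group
    tmsh create sys ha-group ha_group_mixed \
        active-bonus 10 \
        pools add { app_pool { threshold 3 weight 10 } } \
        trunks add { uplink_trunk { threshold 1 weight 10 } }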

F5 Gratuitous-ARP issue when failover

Hi. Last night we upgraded an F5 from v11.5.4 to v12.1.3. When we failed over from the old unit (v11.5.4) to the newly upgraded unit (v12.1.3), some IPs experienced more request timeouts than the rest (we pinged the IP of each VS, about 20 IPs, during the failover). From my understanding, F5 sends gratuitous ARPs to neighboring devices when it becomes active. Is it possible that a GARP was dropped, so those IPs experienced longer downtime because neighbors kept using the old ARP entry? Or is it because the neighboring device did not use some of the GARPs from the F5? Or is there any other possibility that would cause the neighbor not to learn the new ARP as expected? Thank you
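One quick check for a case like this: if stale entries are the cause, flushing them on the upstream device restores traffic immediately instead of waiting out the ARP timeout, which confirms the diagnosis. A hedged sketch for Cisco gear; exact commands vary by platform and OS version, and the address is a placeholder:

    ! On the upstream switch/router: inspect and clear a stale entry for one VIP
    show ip arp 192.0.2.10
    clear ip arp 192.0.2.10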

Active was down, Standby took over, then Active went up, conflict happened.

Hello, I have an issue with my Active/Standby F5 devices. The Active node (F5_A) lost its network connection, and the Standby node (F5_B) took over as Active. After 10 minutes, F5_A came back online, and then I had Active/Active devices. Everything failed because of this, and I had to force F5_B to Standby to get back online. Why did this conflict happen?

This article describes the setup we have right now, except that we use network failover because the units are in different locations: http://itadminguide.com/configure-high-availability-activestandby-of-big-ip-f5-ltms/

Auto failback is disabled on both devices. I saw these logs when F5_A came back online; I am not sure about the expected behavior once it returned:

Sep 7 22:39:12 f5_B notice sod[7345]: 010c007e:5: Not receiving status updates from peer device /Common/f5_A (10.41.253.44) (Disconnected).
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/only_4751.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/prefer_4751.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/prefer_MDR.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/traffic-group-1.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0019:5: Active
Sep 7 22:49:10 f5_B notice sod[7345]: 010c007f:5: Receiving status updates from peer device /Common/f5_A (10.41.253.44) (Online).
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/traffic-group-1 established.
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/traffic-group-1 established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/traffic-group-1 established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/traffic-group-1 established.
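This pattern, where each unit stops hearing the other's heartbeats and goes active, then both remain active until one is forced down, is the classic split-brain symptom. One common hardening step is to configure more than one unicast failover address per device (for example, the HA VLAN self IP plus the management IP) so a single link loss cannot sever the heartbeat path. A hedged sketch; the device name and IPs are placeholders, 1026 is the usual network failover port, and the exact unicast-address syntax should be verified against your version:

    # Give this device two unicast failover addresses
    tmsh modify cm device f5_A.example.com unicast-address {
        { ip 10.70.1.236 port 1026 }
        { ip 10.41.253.44 port 1026 }
    }
    tmsh save sys config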

High-availability configuration produces a status of "ONLINE (STANDBY), In Sync"

Problem: a high-availability configuration produces a status of "ONLINE (STANDBY), In Sync" on both the primary and standby units.

Models: F5 1600
BIG-IP Version: BIG-IP 11.5.0 Build 7.0.265 Hotfix HF7

Steps used to configure high availability:

1. Connect a network cable on port 1.3 of each F5 1600
2. Create a dedicated VLAN for high availability on each F5 1600
3. Configure an IP address for the high-availability VLAN on each F5 1600
4. Ensure that both F5 1600 units can ping each other from the high-availability VLAN
5. On each F5 1600, navigate to "Device Management" -> "Devices" -> "Device List" and select the F5 1600 system labelled as "self"
6. On each F5 1600, navigate to "Device Connectivity" -> "ConfigSync" and select the IP address assigned to the high-availability VLAN
7. On each F5 1600, navigate to "Device Connectivity" -> "Network Failover" and add the IP address assigned to the high-availability VLAN to the failover unicast configuration
8. Force the standby unit offline
9. On the active unit, navigate to "Device" -> "Peer List", click "Add", and add the standby unit to the high-availability configuration. At this point, the primary F5 unit has a status of "ONLINE (ACTIVE), In Sync", and the standby unit has a status of "FORCED (OFFLINE), In Sync"
10. On the primary unit, navigate to "Device Management" -> "Device Groups" to create a device group

At this point, both units have a status of "ONLINE (STANDBY), In Sync". Any ideas as to why this is happening? My goal is to have high availability configured in an ACTIVE/STANDBY pair.
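A few things worth checking from the shell when both units sit in STANDBY, particularly whether the forced-offline state from step 8 was ever released. A hedged sketch; commands are from the 11.x tmsh namespace, so verify against your build:

    # Failover state of this unit
    tmsh show cm failover-status

    # Which device (if any) is active for each traffic group
    tmsh show cm traffic-group

    # Release a forced offline/standby state so this unit can
    # take part in the active/standby election again
    tmsh run sys failover online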

FAILED failover: Virtual servers were disabled on standby unit.

We experienced a failed failover the other day where the standby F5 was unable to take over. I would like to ask for any help preventing this in the future and, if possible, further investigating what may have gone wrong.

Setup: LB_1 and LB_2 are virtual editions running on a Xen cluster, along with about 40 virtual servers providing many services. The main services that were affected: a server farm of 14 web servers (dev, production, mail, CDN) load balanced by LB_1 and LB_2, and a small cluster of three user-authentication servers load balanced by LB_1 and LB_2.

LB_2 was acting as the active unit; LB_1 was in standby mode. Both units were "In Sync". (Some configurations were changed and synced from LB_2 to LB_1 about a week prior to this failure. We did not actually log in to LB_1 and check any status at that time.) The web service monitors hit the web servers' status pages every 10 seconds to test online conditions. The login service monitor uses the default TCP check to test online conditions.

Timeline of the failed failover:

- LB_1, in standby mode, recorded a monitor failure when checking the web servers on May 15, 2017 @ 3:13 am. This message went unnoticed until the day of the failover event. From LB_1, the web server nodes were apparently considered online (blue square), but the web virtual servers (ports 25, 80, 443, etc.) were considered unreachable until the failover on June 23. Logs from the servers do not show attempts from LB_1 to access the web servers' online status report page; we assume LB_1 did not try, or failed to try, hitting the servers' status pages.
- LB_1, in standby mode, recorded a monitor failure when checking the user-authentication servers on May 21, 2017 @ 5:56 am. The user-authentication nodes were apparently considered online, but the user-authentication services were considered unreachable until the failover on June 23. We could not confirm whether LB_1 was attempting to test online conditions from the server side.
- LB_2 experienced a network heartbeat daemon failure on June 23, 2017 @ 5:16 am and failed over its responsibilities to LB_1 at 5:16:24.880 am.
- LB_1 shows updating a few ASM configurations, but no warnings or offline events were reported for the web and user-authentication services. LB_2 and LB_1 both reported configurations "In Sync" at all times during this event.
- Communication to the web and user-authentication services was effectively blackholed, and the virtual servers providing those services were listed as disabled, with red diamond indicators, on the now-active LB_1.
- Service was manually failed back to LB_2 at 9:40 am and all services were available.
- LB_1 remained in the same state for several hours while some investigation took place. We tried disabling and re-enabling the virtual nodes that were marked offline with red diamonds. The services did not return, and LB_1 did not log a new offline message for the nodes using the monitor.
- When LB_1 was rebooted at 21:40 pm, all services were discovered and normal offline checks were taking place.

Note: this post appears similar to our situation: Node marked offline in standby.

It was a nuclear FAIL. The units reported "In Sync" but they were not in sync, and the standby unit was unprepared to take up the active unit's responsibilities. We cannot find any log explaining why LB_1 failed to conduct subsequent health checks on various servers, and we cannot determine from the logs why new health checks did not occur when LB_2 failed over to LB_1. We want to prevent something like this from happening again. Here are two questions for the board:

1. Can you share advice to help prevent a recurrence?
2. Can you share any advice to further investigate?
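One thing that can be verified directly for question 2: monitor state is evaluated independently on each device and is not carried by config sync, so the standby's own view of pool health is worth capturing the moment a silent divergence like this is suspected. A hedged sketch with a placeholder pool name:

    # On the standby unit: its own view of the pool and member status
    tmsh show ltm pool web_pool members

    # Monitor-related log entries around the time of the original failure
    grep -i monitor /var/log/ltm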

11.4.0 Peer device "disconnected" but syncing

Hello all. Last week I upgraded a pair of BIG-IP 6900s from 10.2.2 to 11.4.0, following the recommended procedure for active-standby configurations. Once both nodes were upgraded, I noticed that on the "Device Group" screen each node saw its peer as "Disconnected" (red dot). However, sync was working, and so did failover: I forced the active node to standby, and the other one became active immediately. I tried resetting the device trust, but the situation was the same.

I attach a screenshot of the Device Group screen (I have shadowed the hostnames for privacy). This is from one of the nodes; the other one shows the equivalent. I wonder if any of you have encountered a similar issue, and whether you know how I could solve it. I guess maybe it is a silly parameter I forgot to configure, but I can't manage to figure out which one, so any help would be appreciated. If you need further information, please let me know. Thanks in advance.
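A Disconnected device status combined with working sync usually points at the device-to-device status channel rather than at config sync itself. A hedged sketch for checking it; the peer IP is a placeholder, and TCP 4353 is commonly the CMI/config-sync port, but verify the port for your version:

    # Device objects and their configured ConfigSync/failover addresses
    tmsh list cm device

    # Sync status as this unit sees it
    tmsh show cm sync-status

    # Confirm the CMI channel is reachable at the peer's ConfigSync address
    nc -vz 10.0.0.2 4353

    # Look for connection errors in the logs
    grep -i cmi /var/log/ltm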

BIG-IP Sync-Failover - Sync Failed

Hi. In a project we're running a device group in Sync-Failover mode with the Manual Sync type. After a change on the active unit, when trying to sync from the active unit to the device group, the sync failed with the information below:

Sync Summary
Status: Sync Failed
Summary: A validation error occurred while syncing to a remote device
Details: Sync error on 2nd-unit: Load failed from 1st-unit 01070110:3: Node address 'node' is referenced by a member of pool 'pool'.
Recommended action: Review the error message and determine corrective action on the device

We're totally sure that nothing was changed manually on the 2nd node, and both nodes were in sync before the change on the 1st node. The Last Sync Type field for both nodes shows Manual Full Node. I couldn't find anything on this case; is it safe to just manipulate the configuration on the 2nd node and then sync from the 2nd node to the device group? Many thanks in advance!
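Before changing anything on the 2nd node, it can help to find out what actually fails validation there, since the error is raised while the target device tries to load the pushed configuration. A hedged sketch; 'node' is the placeholder name from the error message:

    # On the 2nd unit: validate the stored configuration without applying it
    tmsh load sys config verify

    # Find every pool that references the node named in the error
    tmsh list ltm pool one-line | grep "node"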

Score bonus & failover condition

Hello guys, a question regarding the behavior of failover in 11.4.1. Two devices, in active/standby, with a trunk. Active unit bonus score: +10. Bonus per available trunk interface: +10.

Initial setup (F5A active):
F5A: 20 (2 interfaces) + 10 (active) = 30
F5B: 20 (2 interfaces) + 0 = 20

With 1 interface shut on F5A:
F5A: 10 (1 interface) + 10 (active) = 20
F5B: 20 (2 interfaces) + 0 = 20
=> F5A still active, OK.

With 2 interfaces shut on F5A:
F5A: 0 (0 interfaces) + 10 (active) = 10
F5B: 20 (2 interfaces) + 0 = 20
=> failover to F5B.

With 2 interfaces OK again on F5A:
F5A: 20 (2 interfaces) + 0 = 20
F5B: 20 (2 interfaces) + 10 (active) = 30
=> F5B still active, OK.

Until now, everything is logical.

With 1 interface shut on F5B:
F5A: 20 (2 interfaces) + 0 = 20
F5B: 10 (1 interface) + 10 (active) = 20
=> failover to F5A!

I don't understand why we have a failover event in this last case. Any explanation?
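One way to investigate the last case is to watch the failover daemon's reasoning while reproducing the interface shut; sod logs each score-driven transition (the same kind of "Leaving Standby for Active" messages shown in the earlier post), and the live score breakdown can be compared side by side on both units. A hedged sketch:

    # Watch failover daemon (sod) decisions live while shutting trunk members
    tail -f /var/log/ltm | grep -i sod

    # Current HA score and per-trunk contributions on each unit
    tmsh show sys ha-group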