ARP/MAC Tables Not Updating on Core Switches After F5 LTM Failover (GARP Issue?)
We have two F5 LTM 5250v appliances, each running two vCMP instances, in an HA pair (Active/Standby). Each F5 5250v has a 10G uplink to two core switches (Cisco Nexus 7010), configured as an LACP port-channel on the F5 side and a Port-Channel/vPC on the Nexus side:

* Port-Channel127/vPC127 = F5ADC01
* Port-Channel128/vPC128 = F5ADC02

When I look at the MAC address tables on both 7K1 and 7K2, I can see all the individual F5 MACs for each VLAN we have configured on the F5 vCMP instances.

We are having an issue during automatic or manual failover where the MAC addresses for the virtual servers are not being updated. If F5ADC01 is Active and we force it to Standby, it immediately changes to Standby and F5ADC02 immediately takes over the Active role. However, the ARP tables on the Nexus 7K core switches do not get updated, so all the virtual servers keep the MAC address associated with F5ADC01.

We have multiple partitions on each vCMP instance, with several VLANs associated with each partition. Each partition has a single route domain that its VLANs are allocated to. For traffic to virtual servers we use SNAT Auto Map to the floating self-IP, with Auto Last Hop so return traffic passes back out the correct source VLAN. We are not using MAC masquerading.

The ARP timeout on the Nexus 7Ks is 1500 seconds (the default), so it takes about 25 minutes after a failover for full network recovery: eventually the ARP entries for all virtual servers age out and get refreshed with the correct MAC address. Obviously this is not acceptable.

I found an SOL article that talks about when GARPs can be missed after failover: SOL7332: Gratuitous ARPs may be lost after a BIG-IP failover event. We have confirmed the upstream core switches are not dropping any GARPs. As a test, I manually disabled and then re-enabled all virtual servers, and all MACs updated immediately.

I have opened a support case with F5 and we have yet to determine where the issue lies. Does anybody have any ideas what the issue might be? If I need to provide more information about our configuration, let me know.

We are pretty new to the F5 platform. We recently migrated from the Cisco ACE30 platform, where failover worked perfectly with a similar cabling setup (two port-channels to two separate Catalyst 6509 switches with an ACE30 module in each switch). After ACE failover, the MAC tables/ARP caches updated immediately. Thank you!
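For reference, here is roughly what we are considering as a workaround (placeholder names; the MAC masquerade address is an example value, and this is what we have gathered from the docs rather than anything F5 support has confirmed for our case):

    # Set a MAC masquerade address on the floating traffic group so the L2 address
    # stays the same whichever unit is active and the cores never need to re-learn
    # ARP after failover. traffic-group-1 and the MAC value are placeholders.
    tmsh modify /cm traffic-group traffic-group-1 mac 02:01:d7:93:35:58

    # Today's manual workaround, scripted: toggling each virtual server made the
    # newly active unit re-announce it (virtual server name is a placeholder).
    tmsh modify /ltm virtual vs_app1_https disabled
    tmsh modify /ltm virtual vs_app1_https enabled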
F5 Gratuitous-ARP issue when failover

Hi,

Last night we upgraded an F5 from v11.5.4 to v12.1.3. When we failed over from the old unit (v11.5.4) to the newly upgraded unit (v12.1.3), some IPs saw more request timeouts than the rest (we pinged the IP of each VS, about 20 IPs, during the failover).

From my understanding, the F5 sends gratuitous ARP to its neighbours when it becomes active. Is it possible that some of those GARPs were dropped, so those IPs experienced longer downtime because the neighbours were still using the old ARP entries? Or is it that the neighbouring device did not accept some of the GARPs from the F5? Are there any other possibilities that would keep the neighbouring device from learning the new ARP entries as expected?

Thank you
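One way I could check this on the next test failover (the VLAN name below is a placeholder for whichever VLAN the affected self-IPs and virtual servers live on) is to capture ARP on the newly active unit and compare it with what arrives upstream:

    # Run on the BIG-IP during a test failover; "external" is a placeholder VLAN name.
    tcpdump -nei external arp
    # Gratuitous ARPs appear as ARP packets in which the sender IP equals the
    # target IP -- there should be one per floating self-IP and virtual address.
    # Comparing what the F5 sends with what the neighbour receives should show
    # whether GARPs are generated but dropped, or not generated at all.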
Active was down, Standby took over, then Active went up, conflict happened.

Hello, I have an issue with my Active/Standby F5 devices. The active node (F5_A) lost its network connection, and the standby node (F5_B) took over as active. After 10 minutes, F5_A came back online, and I ended up with Active/Active devices. Everything failed because of this, and I had to force F5_B to standby to get things back online. Why did this conflict happen?

This article describes the setup we have, except that we use network failover because the units are in different locations: http://itadminguide.com/configure-high-availability-activestandby-of-big-ip-f5-ltms/

Auto failback is disabled on both devices. I saw the logs below when F5_A came back online. I am not sure what the expected behaviour is once it returns.

Sep 7 22:39:12 f5_B notice sod[7345]: 010c007e:5: Not receiving status updates from peer device /Common/f5_A (10.41.253.44) (Disconnected).
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/only_4751.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/prefer_4751.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/prefer_MDR.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0053:5: Active for traffic group /Common/traffic-group-1.
Sep 7 22:39:12 f5_B notice sod[7345]: 010c0019:5: Active
Sep 7 22:49:10 f5_B notice sod[7345]: 010c007f:5: Receiving status updates from peer device /Common/f5_A (10.41.253.44) (Online).
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/only_4751 established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/prefer_4751 established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/prefer_MDR established.
Sep 7 22:49:10 f5_B notice tmm3[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32770 for traffic-group /Common/traffic-group-1 established.
Sep 7 22:49:10 f5_B notice tmm1[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32768 for traffic-group /Common/traffic-group-1 established.
Sep 7 22:49:10 f5_B notice tmm2[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32771 for traffic-group /Common/traffic-group-1 established.
Sep 7 22:49:10 f5_B notice tmm[21172]: 01340001:5: HA Connection with peer 10.70.1.236:32769 for traffic-group /Common/traffic-group-1 established.
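Some output I have since been collecting from both units (device names as in the logs above; command availability may vary by version). The last command is what we used to recover from the Active/Active state:

    tmsh show sys failover                      # which state each unit believes it is in
    tmsh show cm failover-status                # peer connectivity as seen by the failover daemon
    tmsh list cm device f5_A unicast-address    # addresses/ports the failover heartbeats use
    tmsh list cm device f5_B unicast-address
    # Recovery step we used when both units went active:
    tmsh run sys failover standby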
High-availability configuration produces a status of "ONLINE (STANDBY), In Sync"

Problem: the high-availability configuration produces a status of "ONLINE (STANDBY), In Sync" on both the primary and standby F5 units.

Models: F5 1600
BIG-IP Version: BIG-IP 11.5.0 Build 7.0.265 Hotfix HF7

Steps used to configure high availability:

1. Connect a network cable on port 1.3 of each F5 1600.
2. Create a dedicated VLAN for high availability on each F5 1600.
3. Configure an IP address for the high-availability VLAN on each F5 1600.
4. Ensure that both F5 1600 units can ping each other from the high-availability VLAN.
5. On each F5 1600, navigate to "Device Management" -> "Devices" -> "Device List" and select the F5 1600 system labelled as "self".
6. On each F5 1600, navigate to "Device Connectivity" -> "ConfigSync" and select the IP address assigned to the high-availability VLAN.
7. On each F5 1600, navigate to "Device Connectivity" -> "Network Failover" and add the IP address assigned to the high-availability VLAN to the failover unicast configuration.
8. Force the standby unit offline.
9. On the active unit, navigate to "Device" -> "Peer List", click "Add", and add the standby unit to the high-availability configuration. At this point, the primary F5 unit has a status of "ONLINE (ACTIVE), In Sync", and the standby unit has a status of "FORCED (OFFLINE), In Sync".
10. On the primary unit, navigate to "Device Management" -> "Device Groups" and create a device group.

At this point, both units have a status of "ONLINE (STANDBY), In Sync". Any ideas as to why this is happening? My goal is to have high availability configured as an ACTIVE/STANDBY pair.
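I can run the commands below on both units and post the output if that helps (the property-style list syntax at the end is from memory, so apologies if it needs adjusting):

    tmsh show sys failover            # active/standby state on each unit
    tmsh show cm sync-status
    tmsh list cm device-group         # device group type and members as configured
    tmsh list cm traffic-group traffic-group-1 all-properties
    # Confirm that floating objects are actually assigned to traffic-group-1:
    tmsh list net self traffic-group
    tmsh list ltm virtual-address traffic-group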
FAILED failover: Virtual servers were disabled on standby unit.

We experienced a failed failover the other day where the standby F5 was unable to take over. I would like to ask for any help preventing this in the future and, if possible, further investigating what may have gone wrong.

Setup: LB_1 and LB_2 are Virtual Editions running on a Xen cluster, along with about 40 virtual servers providing many services. The main services affected were:

* a server farm of 14 web servers (dev, production, mail, CDN) load balanced by LB_1 and LB_2, and
* a small cluster of three user authentication servers load balanced by LB_1 and LB_2.

LB_2 was acting as the active unit and LB_1 was in standby mode. Both units were "In Sync". (Some configurations were changed and synced from LB_2 to LB_1 about a week prior to this failure. We did not actually log in to LB_1 and check any status at that time.)

The web service monitors hit the web servers' status pages every 10 seconds to test online conditions. The login service monitor uses the default TCP check to test online conditions.

Timeline of the failed failover:

* LB_1, in standby mode, recorded a monitor failure when checking the web servers on May 15, 2017 @ 3:13 am. This message went unnoticed until the day of the failover event. From LB_1, the web server nodes were apparently considered online (blue square), but the web virtual servers (ports 25, 80, 443, etc.) were considered unreachable until the failover on June 23. Logs from the servers do not show attempts from LB_1 to access the web servers' online status report page. We assume LB_1 did not try, or failed to try, hitting the servers' status pages.
* LB_1, in standby mode, recorded a monitor failure when checking the user authentication servers on May 21, 2017 @ 5:56 am. The user authentication nodes were apparently considered online, but the user authentication services were considered unreachable until the failover on June 23. We could not confirm from the server side whether LB_1 was attempting to test online conditions.
* LB_2 experienced a network heartbeat daemon failure on June 23, 2017 @ 5:16 am and failed over its responsibilities to LB_1 at 5:16:24.880 am.
* LB_1 shows updating a few ASM configurations, but no warnings or updated offline events were reported for the web and user authentication services. LB_2 and LB_1 both reported configurations "In Sync" at all times during this event.
* Communication to the web and user authentication services was effectively blackholed, and the virtual servers providing those services were listed as disabled, with red diamond indicators, on the now-active LB_1.
* Service was manually failed back to LB_2 at 9:40 am and all services were available again.
* LB_1 remained in the same state for several hours while some investigation took place. We tried disabling and re-enabling the virtual nodes that were marked offline with red diamonds. The services did not return, and LB_1 did not log a new offline message for the nodes using the monitor.
* When LB_1 was rebooted at 21:40, all services were discovered and normal offline checks were taking place.

Note: this post appears similar to our situation - Node marked offline in standby.

It was a nuclear FAIL. The units reported "In Sync" but they were not in sync, and the standby unit was unprepared to take up the active unit's responsibilities. We cannot find any log explaining why LB_1 failed to conduct subsequent health checks on the various servers, and we cannot determine from the logs why new health checks did not occur when LB_2 failed over to LB_1. We want to prevent something like this from happening again.

Here are two questions for the board:

* Can you share advice to help prevent a recurrence?
* Can you share any advice to further investigate?
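For the second question, these are the sorts of things we can still pull from LB_1 (pool and virtual server names below are placeholders); if anyone knows better places to look, please say so:

    # What the standby currently thinks of the members it monitors:
    tmsh show ltm pool web_pool members
    tmsh show ltm virtual vs_web_https
    # Monitor up/down transitions recorded locally:
    grep -i "monitor status" /var/log/ltm
    # A lighter-weight test than a full reboot might have been restarting the
    # monitoring daemon, though we did not try it at the time:
    bigstart restart bigd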
11.4.0 Peer device "disconnected" but syncing

Hello all,

Last week I upgraded a pair of BIG-IP 6900s from 10.2.2 to 11.4.0, following the recommended procedure for active-standby configurations. Once both nodes were upgraded, I noticed that on the "Device Group" screen each node saw its peer as "Disconnected" (red dot). However, the sync was working, and so was failover: I tried forcing the active node to standby and the other one became active immediately. I tried resetting the device trust, but the situation was the same.

I attach a screenshot of the Device Group screen (I have shadowed the hostnames for privacy). This is from one of the nodes; the other one shows the equivalent.

I wonder if any of you have encountered a similar issue, and whether you know how I could solve it. I guess maybe it is a silly parameter I forgot to configure, but I can't manage to figure out which one, so any help would be appreciated. If you need further information, please let me know. Thanks in advance.
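Output I can collect from both units if it helps (the port number is from memory, so treat that part as an assumption):

    tmsh show cm sync-status
    tmsh show sys failover
    tmsh list cm trust-domain
    tmsh list cm device               # compare configsync-ip and unicast-address on each unit
    # Device-to-device (CMI) communication runs over TCP 4353, if memory serves,
    # so it is worth confirming the peers actually hold a connection on that port:
    netstat -an | grep 4353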
BIG-IP Sync-Failover - Sync Failed

Hi,

In a project we're running a device group in Sync-Failover mode with the Manual Sync type. After a change on the active unit, trying to sync from the active unit to the device group, the sync failed with the information below:

Sync Summary
  Status: Sync Failed
  Summary: A validation error occurred while syncing to a remote device
  Details: Sync error on 2nd-unit: Load failed from 1st-unit
           01070110:3: Node address 'node' is referenced by a member of pool 'pool'.
  Recommended action: Review the error message and determine corrective action on the device

We're totally sure that nothing has been changed manually on the 2nd unit, and both nodes were in sync before the change on the 1st unit. The Last Sync Type field for both nodes shows Manual Full Node. I couldn't find anything on this case; is it safe to just manipulate the configuration on the 2nd node and then sync from the 2nd node to the device group? Many thanks in advance!
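Before touching anything on the 2nd unit, I was planning to run the checks below on it (a sketch only; the object names to look for would be whatever the error actually references):

    # Validate the running configuration on the unit that rejects the sync,
    # without loading anything from the peer:
    tmsh load sys config verify
    tmsh show cm sync-status
    # See whether the node and pool named in the error exist there as expected:
    tmsh list ltm node
    tmsh list ltm pool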
Score bonus & failover condition

Hello guys,

A question regarding the behaviour of failover in 11.4.1. Two devices, in active/standby, with a trunk.

Active unit bonus score: +10
Bonus per available trunk interface: +10

Initial setup (F5A active):
  F5A: 20 (2 int) + 10 (active) = 30
  F5B: 20 (2 int) + 0 = 20

With 1 interface shut on F5A:
  F5A: 10 (1 int) + 10 (active) = 20
  F5B: 20 (2 int) + 0 = 20
  => F5A still active, OK.

With 2 interfaces shut on F5A:
  F5A: 0 (0 int) + 10 (active) = 10
  F5B: 20 (2 int) + 0 = 20
  => failover to F5B.

With 2 interfaces OK again on F5A:
  F5A: 20 (2 int) + 0 = 20
  F5B: 20 (2 int) + 10 (active) = 30
  => F5B still active, OK.

Up to this point, everything is logical.

With 1 interface shut on F5B:
  F5A: 20 (2 int) + 0 = 20
  F5B: 10 (1 int) + 10 (active) = 20
  => failover to F5A!

I don't understand why we get a failover event in this last case. Any explanation?
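For completeness, the HA group is defined roughly like this (create syntax reproduced from memory and names are placeholders, so please double-check before relying on it):

    # Weight 20 on a 2-member trunk gives +10 per up interface; active bonus is 10.
    tmsh create sys ha-group ha_group_1 active-bonus 10 trunks add { uplink_trunk { threshold 1 weight 20 } }
    # Reviewing what is actually configured on each unit:
    tmsh list sys ha-group all-properties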
using tmsh commands in tcl script

Hey all, I am new to the community as well as to F5 technologies in general. I'll give a synopsis of what I am looking to accomplish with a tmsh script.

At my company we do site switches where we fail over applications depending on whether work is being done at one DC or another, and at times we fail over close to 70 applications. My question is: if I was going to use, for instance, these tmsh commands in a script, how would I go about adding them to it?

modify gtm pool members modify { :https { disabled } }
modify gtm pool members modify { :https { enabled } }

Also, I created a test.tcl file in my home directory and executed (Active)(tmos) run cli script, but it couldn't find the file. How would I execute it? Thanks
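From what I've pieced together so far (and I'd appreciate corrections): tmsh only runs cli script objects that exist in the configuration, not loose .tcl files in a home directory, so the script has to be created or imported first (for example, tmsh create cli script failover_apps opens an editor, or a file containing the stanza below can be pulled in with tmsh load sys config merge file /var/tmp/failover_apps.tcl) and then executed with tmsh run cli script failover_apps. A rough sketch, with placeholder pool and member names where my commands above had them blanked out:

    cli script failover_apps {
        proc script::init { } {
        }
        proc script::run { } {
            # Placeholder GTM pool/member names -- substitute the real ones.
            # Braces are escaped so TCL passes them through to tmsh literally.
            tmsh::modify /gtm pool app1_pool members modify \{ dc1-server:https \{ disabled \} \}
            tmsh::modify /gtm pool app1_pool members modify \{ dc2-server:https \{ enabled \} \}
        }
        proc script::help { } {
        }
        proc script::tabc { } {
        }
    }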
BIG-IP : configure 2-node pool as simple failover

F5 BIG-IP Virtual Edition v11.4.1 (Build 635.0) LTM on ESXi.

I have a RESTful service deployed on two servers (with no other sites/services). I've configured BIG-IP as follows:

* a single VIP dedicated to the service
* a single pool dedicated to the service
* two nodes, one for each server
* one health monitor which determines the health of each node

I need to configure this cluster as simple failover, where traffic is sent only to the primary. So, if node 1 is primary and it fails the health monitor, node 2 is promoted to primary and handles all traffic. How do I configure BIG-IP for this?
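A hedged sketch of one common way to do this with priority group activation (pool name, addresses and ports below are placeholders): members in the highest priority group receive all traffic, and the lower group is only used when fewer than min-active-members remain available in the higher group.

    # Activate the lower priority group when fewer than 1 member of the higher
    # group is available, i.e. only when the primary is down:
    tmsh modify ltm pool rest_pool min-active-members 1
    tmsh modify ltm pool rest_pool members modify { 10.1.1.11:443 { priority-group 10 } 10.1.1.12:443 { priority-group 5 } }

In the GUI this corresponds to setting "Priority Group Activation: Less than 1 Available Member" on the pool and assigning a Priority Group to each member.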