Forum Discussion
ARP/MAC Tables Not Updating on Core Switches After F5 LTM Failover (GARP Issue?)
We have two F5 LTM 5250v appliances, each running 2 vCMP instances, configured in an HA pair (Active/Standby). Each F5 5250v has 10G uplinks to two core switches (Cisco Nexus 7010), configured as an LACP port-channel on the F5 side and a port-channel/vPC on the Nexus side.
Port-Channel127/vPC127 = F5ADC01
Port-Channel128/vPC128 = F5ADC02
When I look at the MAC address tables on both 7K1 and 7K2, I can see all the individual F5 MACs for each VLAN we have configured on the F5 vCMP instances.
We are having an issue during automatic or manual failover where the MAC addresses for the virtual-servers are not being updated. If F5ADC01 is Active and we force it Standby, it immediately changes to Standby and F5ADC02 immediately takes over the Active role. However, the ARP tables on the Nexus 7K Core switches do not get updated so all the virtual-servers continue to have the MAC address associated with F5ADC01.
We have multiple partitions on each vCMP instance, with several VLANs associated with each partition. Each partition has only a single route-domain, to which its VLANs are allocated. For traffic to virtual-servers, we are using SNAT Automap (translating to the floating self IP) and Auto Last Hop so return traffic passes through the correct source VLAN. We are not using MAC masquerading.
The ARP timeout on the Nexus 7Ks is 1500 seconds (the default), so it takes 25 minutes after a failover for a full network recovery: eventually the ARP entries for all virtual-servers age out and get refreshed with the correct MAC address. Obviously this is not acceptable.
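As a stopgap while the root cause is investigated, the ARP timeout can be shortened on the affected SVIs so stale entries age out faster after a failover. A rough NX-OS sketch (the SVI name and timeout value are placeholders; verify the syntax for your NX-OS release):

```shell
# NX-OS stopgap sketch: shorten the ARP timeout on the affected SVIs
# so stale entries age out sooner after a failover.
# (SVI and timeout value are placeholders -- verify for your release.)
show ip arp summary
configure terminal
  interface Vlan100        # placeholder SVI carrying virtual-server subnets
    ip arp timeout 300     # seconds; the default is 1500
```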
I found an SOL article describing when GARPs can be missed after failover: SOL7332: Gratuitous ARPs may be lost after a BIG-IP failover event. We have confirmed the upstream core switches are not dropping any GARPs. As a test, I manually disabled and then re-enabled all virtual-servers, and all MACs updated immediately.
I have opened a support case with F5 and we have yet to determine where the issue lies. Does anybody have any ideas what the issue might be? If I need to provide more information about our configuration, let me know.
We are pretty new to the F5 platform. We recently migrated from the Cisco ACE30 platform. Failover on the ACE platform worked perfectly. Similar cabling setup (two port-channels to two separate Catalyst 6509 switches with an ACE30 module in each switch). After ACE failover, the MAC tables/ARP caches immediately updated.
Thank You!
- BRUCE_A_NOLAN_1 (Nimbostratus)
We ran into a similar issue: several virtual servers on different VLANs that did not have self IPs or floating IPs configured. In our situation, allocating self IPs/floating IPs for those VLANs was not an option, because we were migrating specific virtual servers (DNS virtual servers) from one data center to another, not the networks themselves, and did not want to readdress them.
Our solution was to allocate self IPs/floating IPs from 198.51.100.0/24, a subnet reserved as "TEST-NET-2" for use in documentation and examples. It should never appear in production routing.
On failover, a GARP is issued for these addresses. Since there are no virtual servers, SNATs, or anything else routable or listening on them, no response is required. But the GARP for those addresses is enough to shift the switch's forwarding over to the standby unit's port-channel.
Failover for the virtual servers is instant.
MAC Masquerade is enabled for the traffic group.
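For reference, creating the dummy self IPs looks roughly like this in tmsh (a sketch only; the VLAN name, traffic-group name, and addresses are placeholders):

```shell
# Hypothetical tmsh sketch of the dummy self-IP approach: a
# non-floating and a floating self IP from TEST-NET-2 on a VLAN that
# otherwise has none, so a GARP is emitted for that VLAN on failover.
# (VLAN, traffic-group, and addresses are placeholders.)
tmsh create net self dummy_vlan100 address 198.51.100.1/24 \
    vlan vlan100 traffic-group traffic-group-local-only
tmsh create net self dummy_vlan100_float address 198.51.100.2/24 \
    vlan vlan100 traffic-group traffic-group-1
tmsh save sys config
```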
- Ron_Peters_2122 (Altostratus)
We have finally resolved this issue and, as promised, here is what it was. We confirmed 100% with a tcpdump on the F5s that they were sending gratuitous ARPs out their 10G interfaces for all virtual-addresses after a failover event.
We opened a TAC case with Cisco and found that there is a hardware rate-limiter on the particular F1 card (a very old card) these F5s terminate into. The rate limit for class rl-4, which ARP is assigned to, was set to 100 packets per second. This is far too low for the amount of ARP traffic the F5 generates, and we had millions of ARP drops on this card.
We analyzed the pcap file to find the rate at which the F5 transmitted these GARPs and adjusted the rate limit on the rl-4 class to 3000 packets per second. We performed failover tests, and the MAC addresses on both 7Ks updated immediately for all virtual-addresses.
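For anyone hitting the same counters, the relevant NX-OS commands look roughly like this (the module number is a placeholder, and 3000 pps is simply the value that covered our measured GARP burst rate; verify class names and syntax for your platform and line card):

```shell
# Inspect the hardware rate-limiter counters on the module the F5s
# terminate into (module number is a placeholder):
show hardware rate-limiter module 1

# Raise the layer-3 glean limit to cover the GARP burst rate measured
# in the pcap:
configure terminal
  hardware rate-limiter layer-3 glean 3000
```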
Thanks for all the input you guys provided.
- portoalegre (Nimbostratus)
My problem is finally fixed! I increased the IP glean rate limit from 100 to 5000 pps on each Cisco 7700 switch. When I manually forced the primary LTM into standby, the newly promoted LTM sent out around 2000 gratuitous ARPs, and this time the burst was seen across OTV on the other aggregation Nexus 7700 switches, so all virtual servers now carry a single F5 LTM MAC address. This change made my failover work, so the problem was that IP glean has a default limit on how many packets per second it can process (100 by default, which isn't a lot IMO). NB: once you configure IP glean on the 7700s, the config propagates to the OTV and AGG VDCs, etc.
- Ron_Peters_2122 (Altostratus)
The particular issue we were having was the hardware rate limiter on older F1 cards. I found the following two links, which may be helpful. When I worked with TAC, I was able to prove via the packet captures that the F5s were in fact sending the GARPs: since they were directly attached to the 7Ks, and the ARP tables were not updating while the GARPs were visible on a SPAN destination port on the 7K, it was an obvious issue with the 7Ks.
Perhaps it is an issue with how ARP works over OTV:
While the links below are for the 7000 series, they should be applicable to the L3 glean class on the 77xx as well:
In any case, be persistent with Cisco and escalate if need be. Good Luck!
- portoalegre (Nimbostratus)
I'm having the same problem with a pair of LTMs across data centres using 4 x Nexus 7710 switches across OTV. I have failed over to the standby LTM and back twice to test failover, and most of my VSs are unavailable. When I run "sh ip arp X.X.X.X" for a VS, the MAC address on the switch now connected to the new standby (the demoted primary) is wrong. I have to clear the ARP entry for every failed VS on the switch to get things working. So one DC works and the demoted DC doesn't work fully. Frustrating!
I know the F5 is sending gratuitous ARPs; I can see them in my packet capture. I logged a Cisco TAC case but they haven't been very helpful so far. The ticket I logged with F5 suggests MAC masquerading, which I'm not too confident about, and it is a large production change for me with vPCs, OTV, etc.
The only limit I could see (running 7710s with N77-SUP2E) is a rate limiter for glean packets, which is only 100! So I guess these glean packets include gratuitous ARPs where the MAC has changed, or where the switch cannot resolve ARP? Glean may not be relevant, but there are drops as shown below, which is the only thing I could find so far. Please be aware this infrastructure was working fine previously over a pair of 6500 switches, across L2 DWDM between the DCs.
Any suggestions would be helpful.
Module: 1
Rate-limiter PG Multiplier: 1.00
R-L Class    Config    Allowed       Dropped      Total
L3 glean     100       172479882     7398242      179878124
Port group with configuration same as default configuration Eth1/1-2 Eth1/3-4 Eth1/5-6 Eth1/7-8 Eth1/9-10 Eth1/11-12 Eth1/13-14 Eth1/15-16 Eth1/17-18 Eth1/19-20 Eth1/21-22 Eth1/23-24
- Ron_Peters_2122 (Altostratus)
Thank you all for the responses. I do have a TAC case open, and we are going to investigate why the Nexus 7Ks are dropping the GARP traffic. We will be coordinating another maintenance window to perform an ELAM capture to determine what is transpiring on the 7K side. But regardless of what we find, it does look like MAC masquerading may be the way to go, as it would also have the added benefit of improving failover speed.
I will respond again once I know more from Cisco TAC as to what the cause of the dropped GARP traffic is for those that may be curious.
- tatmotiv (Cirrostratus)
Same here. I was also sceptical regarding MAC masquerading, but after experiencing the above-mentioned ARP problems in the Nexus infrastructure (which were somehow related to FabricPath, if I remember correctly), I converted all (60+) traffic-groups on my devices to MAC masquerading and have had no bad experiences. We are also connecting to Nexus switches using vPC (albeit N5Ks as spines, cross-connected via N7Ks as hubs), and we are also using vCMP (on Viprion though, not 5250).
Hi Ron,
I've to second Tatmotiv's opinion to enable MAC masquerading on your Traffic-Groups.
We're successfully running this setup on vPC-enabled Nexus devices. Our switch CAM tables detect the failover event immediately, on the very first received GARP packet, and perform a single MAC flap.
Cheers, Kai
- tatmotiv (Cirrostratus)
"...that means that shared MAC would only be learned on one Port-Channel interface."
Right. Or, to put it 100% correctly: it should only be learned on one vPC at a time.
"When failover occurs, you're saying the new Active will start sending packets including the shared MAC and the switch CAM tables should update to reflect the MAC now being on a new Port-Channel interface (thus no need for GARPs)?"
Right. I think it will actually still send GARPs, but those are not needed to update the ARP tables on neighboring devices; the ARP entries can be left unchanged. Thus, the network is less error-prone to the N7K not learning the ARP update (regardless of the reason that happens...).
- Ron_Peters_2122 (Altostratus)
Thank you for the "tmsh load sys config" tip - we are relatively new to F5 and this will save us quite a bit of time.
Regarding MAC masquerading: each F5 LTM has dual uplinks to each Nexus switch in a SEPARATE vPC/port-channel (Po127 = F51 and Po128 = F52). If the standby unit doesn't send packets using the shared MAC, and we are already having issues with GARPs, that means the shared MAC would only be learned on one port-channel interface. When failover occurs, you're saying the new active will start sending packets with the shared MAC as source, and the switch CAM tables should update to reflect the MAC now being on a new port-channel interface (thus no need for GARPs)? Just want to make sure I'm understanding this 100% correctly.
I'm also going to be opening a TAC case with Cisco this morning to discuss this GARP issue as well to see if there is anything that can be done on the Nexus side to resolve this (or if it is indeed a bug).
Thanks for your help. I'll respond with what I find out.
- tatmotiv (Cirrostratus)
What we experienced was a bug in nx-os. The dropped packets also could not be seen in any counters...
MAC masquerading should not cause MAC flapping in normal operation, since the standby unit will not send any packets with the "floating" MAC as source. The MAC masquerade address is only used for floating objects, and those are only active on the active system for the corresponding traffic-group. You are right, though, that you can see traffic from the standby machine's "original" MAC address in normal operation (which is indeed also used for the floating objects when in active mode). This is most probably health-check traffic, which originates from a non-floating self IP. That traffic will not change when enabling MAC masquerading: both machines will continue to send health checks from their non-floating self IPs, and those packets will retain the original (non-masquerade) MAC address.
BTW, to provoke additional GARPs for all floating objects, you could also do a "tmsh load sys config", which achieves the same thing without having to disable/re-enable all objects. But don't forget to make sure the config is saved first ;-)
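The sequence is simply:

```shell
# Save first so the load doesn't roll back unsaved changes, then
# re-load the running config, which re-announces (GARPs) all
# floating objects.
tmsh save sys config
tmsh load sys config
```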
- Ron_Peters_2122 (Altostratus)
From everything I've been able to look at thus far, I see no indications that the 7Ks are dropping the ARP traffic. The only thing we have in place that would possibly filter ARP traffic is the control-plane policing (CoPP) policies we have in place on the Admin VDC. When I look at the statistics that are defined for ARP, there are no drops (see below). I'll be running the tcpdump on the Standby tonight before I failover our DEVQA vCMP instance and that will for sure tell me if the F5 is sending GARPs after I force the Active to Standby.
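The capture plan is roughly the following (on a BIG-IP, interface 0.0 captures on all interfaces; the filename is arbitrary):

```shell
# BIG-IP capture sketch: -e prints source MACs so GARPs are easy to
# spot, -w writes a pcap for later analysis; 0.0 = all interfaces.
tcpdump -ni 0.0 -e -w /var/tmp/failover-garp.pcap arp
```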
I had some concerns with MAC masquerading - I wasn't 100% sure we could do that without causing MAC confusion/flapping on the 7Ks. Since each F5 has 10G uplinks to each 7K in a port-channel (a port-channel/vPC on the 7K side), it seemed to me like that could cause MAC flapping if the standby also advertised its MACs for the virtual-servers while in standby. It presently does this, as I can see the different individual MACs for each VLAN on each port-channel interface.
I will also point out that after a failover, I have a script I run via CLI to disable and then re-enable ARP on all virtual-servers. I just paste it all in. When I do this, all MACs on the 7Ks refresh, so I believe I'm replicating the GARP flooding, but only as fast as the console commands are processed.
class-map copp-system-p-class-normal (match-any)
  match access-group name copp-system-p-acl-mac-dot1x
  match protocol arp
  set cos 1
  police cir 680 kbps bc 250 ms
    conform action: transmit
    violate action: drop
  module 2:
    conformed 3681791362 bytes,
      5-min offered rate 207 bytes/sec
      peak rate 12986775 bytes/sec at Thu Jan 14 22:24:13 2016
    violated 0 bytes, 5-min violate rate 0 bytes/sec
      peak rate 0 bytes/sec
  module 9:
    conformed 250641810380 bytes,
      5-min offered rate 19456 bytes/sec
      peak rate 320515191 bytes/sec at Thu Jan 14 22:29:13 2016
    violated 0 bytes, 5-min violate rate 0 bytes/sec
      peak rate 710 bytes/sec at Thu Jan 14 22:29:13 2016
  module 10:
    conformed 88017602822 bytes,
      5-min offered rate 6214 bytes/sec
      peak rate 153161524 bytes/sec at Thu Jan 14 22:29:13 2016
    violated 0 bytes, 5-min violate rate 0 bytes/sec
      peak rate 0 bytes/sec
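For reference, the bounce script is essentially a loop like the following (a rough sketch; the tmsh output parsing is approximate, and toggling ARP briefly interrupts each virtual-address, so run it only in a maintenance window):

```shell
# Rough sketch: toggle ARP on every virtual-address to force fresh
# announcements. The awk field assumes "ltm virtual-address <name> {"
# output from tmsh one-line mode -- verify on your version first.
for va in $(tmsh list ltm virtual-address one-line | awk '{print $3}'); do
    tmsh modify ltm virtual-address "$va" arp disabled
    tmsh modify ltm virtual-address "$va" arp enabled
done
```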
- tatmotiv (Cirrostratus)
Can you really be sure there is no packet loss in the N7Ks? We experienced a similar situation, and it turned out the N7Ks were indeed dropping packets, but the drops could not be seen in any counters. It took quite some effort with sniffers and taps to prove this.
However, as a workaround you should consider enabling MAC masquerading on the corresponding traffic-group(s) of the BIG-IP. That way, gratuitous ARPs are no longer needed on failover.
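Enabling a masquerade MAC on a traffic-group is a one-liner in tmsh (the locally administered address below is only an example; pick one unique per traffic-group):

```shell
# Assign a masquerade MAC to the traffic-group. Use a locally
# administered address (second hex digit 2, 6, A, or E) that is
# unique per traffic-group; the one below is just an example.
tmsh modify cm traffic-group traffic-group-1 mac 02:01:23:45:67:01
tmsh save sys config
```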