ARP/MAC Tables Not Updating on Core Switches After F5 LTM Failover (GARP Issue?)
We have two F5 LTM 5250v appliances, configured with 2 vCMP instances each, in an HA pair (Active/Standby). Each F5 5250v has 10G uplinks to the two core switches (Cisco Nexus 7010), bundled as an LACP port-channel on the F5 side and a Port-Channel/vPC on the Nexus side.
Port-Channel127/vPC127 = F5ADC01
Port-Channel128/vPC128 = F5ADC02
When I look at the MAC address tables on both 7K1 and 7K2, I can see all the individual F5 MACs for each VLAN we have configured on the F5 vCMP instances.
We are having an issue during both automatic and manual failover where the MAC addresses for the virtual-servers are not updated. If F5ADC01 is Active and we force it to Standby, it immediately goes Standby and F5ADC02 immediately takes over the Active role. However, the ARP tables on the Nexus 7K core switches do not get updated, so all the virtual-servers continue to resolve to the MAC address of F5ADC01.
We have multiple partitions on each vCMP instance, with several VLANs associated with each partition. Each partition has only a single route-domain, to which its VLANs are allocated. For traffic to virtual-servers, we are using Auto-Map to SNAT to the floating Self-IP, plus Auto Last Hop so return traffic passes back out the correct source VLAN. We are not using MAC masquerading.
The ARP timeout on the Nexus 7Ks is 1500 seconds (the default), so it takes 25 minutes after a failover for full network recovery. Eventually the ARP entries for all virtual-servers age out and get refreshed with the correct MAC address. Obviously this is not acceptable.
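(For reference, the ARP aging knob on the Nexus side is configured per SVI. A minimal NX-OS sketch, with an illustrative VLAN number - lowering it would only shrink the stale window, not fix the missing GARPs:)

show ip arp summary

interface Vlan100
  ip arp timeout 300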
I found an SOL article that talks about when GARPs can be missed after failover: SOL7332: Gratuitous ARPs may be lost after a BIG-IP failover event. We have confirmed the upstream core switches are not dropping any GARPs. As a test, I manually disabled and then re-enabled all virtual-servers, and all MACs updated immediately.
I have opened a support case with F5 and we have yet to determine where the issue lies. Does anybody have any ideas what the issue might be? If I need to provide more information about our configuration, let me know.
We are pretty new to the F5 platform. We recently migrated from the Cisco ACE30 platform. Failover on the ACE platform worked perfectly. Similar cabling setup (two port-channels to two separate Catalyst 6509 switches with an ACE30 module in each switch). After ACE failover, the MAC tables/ARP caches immediately updated.
Thank You!
- Ron_Peters_2122 (Altostratus)
Thanks for the reply.
Yes, for all virtual-servers, the Traffic-Group is set to floating and ARP is enabled. Below is a sample virtual-server config. All of them are set the same:
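(Representative sketch only - the partition, address, and mask below are placeholders, not our production values. The two settings in question live on the virtual-address object:)

ltm virtual-address /Partition1/10.1.20.50 {
    address 10.1.20.50
    arp enabled
    mask 255.255.255.255
    traffic-group /Common/traffic-group-1
}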
- ekaleido (Cirrus)
Navigate to Local Traffic->Virtual Servers->Virtual Address List and check the following:
Is Traffic Group set to (floating)? Is the ARP enabled checkbox checked?
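If the CLI is quicker, something like this should dump both settings for every virtual-address at once (run it per partition, e.g. after a cd /Partition1, if you use more than /Common):

tmsh list ltm virtual-address arp traffic-group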
- ekaleido (Cirrus)
Out of curiosity, can you force a failover while having the following tcpdump running on the box that will become active?
tcpdump -ni 0.0:nnn -s0 -vvv arp
Or optionally to capture to a pcap...
tcpdump -ni 0.0:nnn -s0 -vvv arp -w /shared/tmp/garp.pcap
- Ron_Peters_2122 (Altostratus)
Unfortunately I'm not able to do this at the moment as these boxes are up and running in production. We performed a failover test last night during a scheduled maintenance window in our DEVQA environment. We are, however, SPANing all network traffic (including ARP) on our network, and I'm looking into the capture file to see what was sent during that window.
What version are you running?
We've seen hints of this in some of our failovers after upgrading to 12.1.0/12.1.1.
Also have you heard anything back from the case? I'm curious to see how things go for you guys.
Cheers,
- Ron_Peters_2122 (Altostratus)
We are running Version 11.5.3 HF2.
We have not heard anything definitive back from support yet. They e-mailed me and linked me the following SOL article: SOL11880: BIG-IP objects may not send gratuitous ARP requests during failover https://support.f5.com/kb/en-us/solutions/public/11000/800/sol11880.html
However, we do not feel this applies. We have multiple partitions on each vCMP instance. Each partition has only one default route-domain, with multiple VLANs allocated to it. Every VLAN has 2 Self-IPs and 1 floating IP address. All virtual-servers share the same subnet as their designated VLAN/floating IP. We are using Auto-Map for all virtual-servers instead of SNAT pools, and Auto Last Hop so return traffic passes back through the original source VLAN instead of using the single default route tied to the route-domain.
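(As a sketch of how that combination looks on a single virtual-server - hypothetical partition, name, and address, not our actual config:)

ltm virtual /Partition1/vs_app_https {
    destination /Partition1/10.1.20.50:443
    ip-protocol tcp
    auto-lasthop enabled
    source-address-translation {
        type automap
    }
}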
Note, the F5s are not utilized as the default gateway by the nodes. They only send return traffic through the F5s for traffic entering through the virtual-server. Each VLAN has an SVI on both upstream Nexus switches and we are utilizing HSRP with a virtual-address. The HSRP virtual-address is used as the default gateway by the nodes.
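(Roughly like this on each 7K, with invented addresses - the nodes point at the HSRP VIP, never at an F5 self-IP:)

feature hsrp
interface Vlan100
  ip address 10.1.20.2/24
  hsrp 100
    priority 110
    ip 10.1.20.1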
We have another maintenance window scheduled for this Wednesday evening to perform another manual failover on our DEVQA vCMP instance, where I will set up a tcpdump on the unit that will become Active to capture all ARP traffic. This is to verify whether the unit actually sends GARPs after failover. Support also linked us the following SOL article, but we have monitored the switches, checked the logs/statistics, and confirmed they are not dropping any ARP traffic: SOL7332: Gratuitous ARPs may be lost after a BIG-IP failover event https://support.f5.com/kb/en-us/solutions/public/7000/300/sol7332.html
We have verified each of these points and confirmed that the upstream 7Ks are not dropping any ARP traffic. The ARP timeout is set to the default aging period, which on this platform is 25 minutes (1500 seconds).
I will respond again after we conclude the failover test tomorrow night.
Thank you for the responses thus far.
- tatmotiv (Cirrostratus)
Can you really be sure that there is no packet loss in the N7Ks? We have experienced a similar situation and it turned out that the N7Ks indeed dropped packets, but those could not be seen in any counters. It took quite some effort with sniffers and taps to prove this.
However, as a workaround you should consider enabling MAC masquerading on the corresponding traffic-group(s) of the BIG-IP. That way, there is no longer a need to send gratuitous ARPs on failover.
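For reference, it is configured per traffic-group from tmsh, something like the following - the MAC shown is just an example, pick a locally-administered address of your own:

tmsh modify cm traffic-group traffic-group-1 mac 02:01:23:45:67:89
tmsh save sys config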
- Ron_Peters_2122 (Altostratus)
From everything I've been able to look at thus far, I see no indication that the 7Ks are dropping ARP traffic. The only things we have in place that could possibly filter ARP traffic are the control-plane policing (CoPP) policies on the Admin VDC. When I look at the statistics for the class matching ARP, there are no drops (see below). I'll be running the tcpdump on the Standby tonight before I fail over our DEVQA vCMP instance, and that will tell us for sure whether the F5 sends GARPs after I force the Active to Standby.
I had some concerns with MAC masquerading - I wasn't 100% sure we could do that without causing MAC confusion/flapping on the 7Ks. Since each F5 has 10G uplinks to both 7Ks bundled into a single port-channel (a port-channel/vPC on the 7K side), it seemed to me like that could cause MAC flapping if the Standby also advertised its MACs for the virtual-servers while in Standby. It presently does this, as I can see the different individual MACs for each VLAN on each port-channel interface.
I will also point out that after a failover, I have a script I run via the CLI that disables and then re-enables all virtual-servers (and their virtual-address ARP). I just paste it all in. When I do this, all MACs on the 7Ks refresh, so I believe I'm replicating the GARP flooding, just only as fast as the console commands are processed. (A rough sketch of the idea is below, after the CoPP output.)
class-map copp-system-p-class-normal (match-any)
  match access-group name copp-system-p-acl-mac-dot1x
  match protocol arp
  set cos 1
  police cir 680 kbps bc 250 ms
    conform action: transmit
    violate action: drop
  module 2:
    conformed 3681791362 bytes,
      5-min offered rate 207 bytes/sec
      peak rate 12986775 bytes/sec at Thu Jan 14 22:24:13 2016
    violated 0 bytes,
      5-min violate rate 0 bytes/sec
      peak rate 0 bytes/sec
  module 9:
    conformed 250641810380 bytes,
      5-min offered rate 19456 bytes/sec
      peak rate 320515191 bytes/sec at Thu Jan 14 22:29:13 2016
    violated 0 bytes,
      5-min violate rate 0 bytes/sec
      peak rate 710 bytes/sec at Thu Jan 14 22:29:13 2016
  module 10:
    conformed 88017602822 bytes,
      5-min offered rate 6214 bytes/sec
      peak rate 153161524 bytes/sec at Thu Jan 14 22:29:13 2016
    violated 0 bytes,
      5-min violate rate 0 bytes/sec
      peak rate 0 bytes/sec
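(For completeness, the disable/re-enable pass mentioned above boils down to something like this - a rough sketch, repeated per partition on a multi-partition box:)

tmsh modify ltm virtual all disabled
tmsh modify ltm virtual all enabled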
- tatmotiv (Cirrostratus)
What we experienced was a bug in NX-OS. The dropped packets could not be seen in any counters there either...
MAC masquerading should not cause MAC flapping in normal operation, since the standby unit will not send any packets with the "floating" MAC as source. The masquerade address is only used for floating objects, and those are only active on the active system for the corresponding traffic-group. You are right, though, that you can see traffic from the standby machine's "original" MAC address in normal operation (the same MAC that is also used for the floating objects when the unit is active). This is most probably health-check traffic, which originates from a non-floating self IP, so it will not change when MAC masquerading is enabled. Both machines will continue to send health checks from their non-floating self IPs, and those packets will retain the original (non-masquerade) MAC address.
BTW, to provoke fresh GARPs for all floating objects, you could also do a "tmsh load sys config", which achieves the same thing without having to disable/re-enable all objects. But don't forget to make sure the config is saved first ;-)
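In other words, the safe sequence is:

tmsh save sys config
tmsh load sys config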
- Ron_Peters_2122 (Altostratus)
Thank you for the "tmsh load sys config" - we are relatively new to F5 and this will save us quite a bit of time.
Regarding MAC masquerading: each F5 LTM has dual uplinks to the Nexus switches in a SEPARATE vPC/Port-Channel (Po127 = F51 and Po128 = F52). If the Standby unit doesn't send packets using the shared MAC, and we are already having issues with GARPs, that means the shared MAC would only be learned on one Port-Channel interface. When failover occurs, you're saying the new Active will start sending packets sourced from the shared MAC, and the switch CAM tables should update to reflect the MAC now living on the other Port-Channel interface (thus no need for GARPs)? Just want to make sure I'm understanding this 100% correctly.
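(One way I plan to verify that during the next test window, assuming a masquerade MAC like the example given earlier: watch which port-channel the shared MAC is learned on before and after failover, on each 7K:

show mac address-table address 0201.2345.6789

It should jump from Po127 to Po128, or vice versa, as soon as the new Active sources its first frame.)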
I'm also going to open a TAC case with Cisco this morning to discuss this GARP issue and see if there is anything that can be done on the Nexus side to resolve it (or whether it is indeed a bug).
Thanks for your help. I'll respond with what I find out.