Forum Discussion

Dazzla_20011
Nov 26, 2010

Help: Major network outage involving F5 LTM

Hi,

 

 

I'm really hoping someone can help me. Last Friday we had a major problem which affected access to all our core systems. The initial problem was a bug in the Cisco Nexus NX-OS which caused loop guard to block the VLANs on a port-channel and then unblock them.

 

 

The 3 VLANs used by the F5 (real, virtual servers and heartbeat) between our two LTMs became blocked for a few microseconds.

 

 

2010 Nov 19 14:31:43 GR_Core2 %STP-2-LOOPGUARD_BLOCK: Loop guard blocking port port-channel1 on VLAN0205.
2010 Nov 19 14:31:43 GR_Core2 %STP-2-LOOPGUARD_UNBLOCK: Loop guard unblocking port port-channel1 on VLAN00205

 

 

We have two LTMs in an active/standby pair: active in data centre 1, standby in data centre 2.

 

 

When we came to investigate why users couldn't access the systems, we found the servers couldn't reach their default gateway, which is a floating self IP on the F5 LTM. To solve the problem I pressed Update on the F5 self IP used as the default gateway, and suddenly the servers could reach their gateway and access to the systems was restored. I'm interested to know what that would have done. I suspect it sent out a gratuitous ARP?
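
For my own understanding, the rough Python/scapy sketch below is what I think a gratuitous ARP from the gateway would look like (the IP, MAC and interface are made-up placeholders, not our actual config):

    # Minimal sketch of a gratuitous ARP using scapy. This is not what the LTM
    # runs internally; it just illustrates the frame I suspect it sent.
    from scapy.all import ARP, Ether, sendp

    GW_IP  = "10.1.205.1"          # hypothetical floating self IP (the servers' DG)
    GW_MAC = "00:01:d7:aa:bb:cc"   # hypothetical MAC of the active LTM

    # A gratuitous ARP is an unsolicited ARP reply where sender IP == target IP,
    # broadcast to the VLAN so every host and switch refreshes its ARP/CAM entry.
    garp = Ether(src=GW_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=2,                      # ARP reply ("is-at")
        hwsrc=GW_MAC, psrc=GW_IP,
        hwdst="ff:ff:ff:ff:ff:ff", pdst=GW_IP,
    )
    sendp(garp, iface="eth1")      # placeholder interface name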

 

 

Having checked the logs, I can see that the standby LTM became active. The LTM also reported address conflicts for some of the IPs which are used for the virtual servers.

 

 

Any help in determining the cause will be very much appreciated. We are new to the F5 world, so troubleshooting is difficult as we are used to Cisco products, and our support company isn't being very helpful.

 

 

One thing I have noticed is that we are not using MAC masquerade.

 

 

Many Thanks

 

Darren
  • We use a dedicated VLAN for the network failover. During the night when the backup runs I'm seeing the inter-site links between the data centres hit 90% at times; I would have thought this would impact all VLANs. I need to look into policing this backup traffic.

     

    I totally agree that spanning VLANs across both data centres is not good practice.

     

    The problem is that this network requires layer 2 across both data centres for vMotion, CSS boxes, F5 boxes, MS clustering, etc.

     

     

    In terms of load balancing for the websites we host, the traffic comes in via the public IP, is translated to the F5 virtual server, and is then load balanced across the web servers in our external DMZ.

     

     

    I was asked whether we can use the F5 devices to monitor and load balance requests sent from the web servers in our external DMZ to the application servers in our internal DMZ. Basically, if a server or service fails within the internal DMZ we want the F5 to mark it as down; currently we have to fall back on manual processes when a server or service fails.
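
    As a rough illustration of the kind of check I want the LTM to do for us (this is just a hand-rolled sketch with made-up hostnames and a made-up health URI, not F5 configuration), today we effectively do something like this by hand:

        # Sketch of an HTTP health probe against the internal-DMZ application
        # servers; an LTM monitor would do the equivalent and mark pool members
        # down automatically. Hostnames, port and URI are placeholders.
        import http.client

        APP_SERVERS = ["app1.internal.example", "app2.internal.example"]

        def is_up(host, port=8080, uri="/healthcheck", timeout=5):
            """Return True if the server answers the probe with HTTP 200."""
            try:
                conn = http.client.HTTPConnection(host, port, timeout=timeout)
                conn.request("GET", uri)
                ok = conn.getresponse().status == 200
                conn.close()
                return ok
            except OSError:
                return False

        for host in APP_SERVERS:
            print(host, "UP" if is_up(host) else "DOWN")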

     

    The external and internal DMZs are on separate networks, so traffic has to be routed via two sets of firewalls for them to communicate with each other; no bridging is taking place. I am using source NAT to get around the routing issue, as the external DMZ servers do not have a route back to the internal DMZ application servers via the F5. The firewall rules permit the F5 to reach the internal DMZ servers, and this is working perfectly well with the active LTM.

    The problem is with the standby LTM, which sits in our other data centre. Its default gateway points to the DC1 firewall (the same as the active LTM), but the routes back to it from the internal DMZ point to DC2, so we have asymmetric routing. My plan is to NAT the traffic of the standby LTM to an address from DC1 to get around this. My first plan was to have a different default route on the standby LTM, but I was told I can't do this as the routes on the LTMs are synced between each other.

     

     

    Not sure anyone will have a clue what's going on without a diagram. I will try to upload one when I return to work.

     

     

    Thanks very much for everyone's comments.

     

  • One of the worst outages I've ever seen involved a network heartbeat VLAN between two DCs that was severed. The BIG-IPs were split, as were ten or so other devices that rely on GARP for failover. They all did what they were configured to do: they went active and all started sending GARPs out to every VLAN. What's worse is that only the failover VLAN was taken down, while all of the other VLANs were still connected. The ARP storm was quite a spectacle to behold!

     

     

    My personal angle is that network failover is fine, but you need to know the ramifications very well and think out all of the failure scenarios and their behaviours. That way you'll at least know what to expect outage-wise, and it'll shorten your recovery times substantially. When it bites, layer 2 is merciless...

     

     

    -Matt
  • Hamish
    All very true... When configuring network failover I make it mandatory to have a non-core network path between the two boxes (and avoid 9.4, because it will only use ONE path for network failover traffic; with 10.x you can configure multiple connections).

     

     

    Non-core network means, at its simplest, an optical SFP in each F5 and a piece of dark fibre. Or you can use a pair of dedicated heartbeat switches with dark fibre between them, etc.
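
    If you want to prove the dedicated path actually carries traffic end to end before you rely on it for failover, a throwaway UDP ping like the sketch below is enough (the address and port are placeholders, and this is not the BIG-IP failover protocol, just a path check). Run the listener on one box and the sender on the other:

        # Trivial UDP heartbeat to verify the dedicated failover path passes
        # traffic. Run with the argument "listen" on one unit and with no
        # argument on the other.
        import socket, sys, time

        HB_PORT = 9999                 # placeholder port
        PEER_IP = "192.168.255.2"      # placeholder peer self IP on the heartbeat VLAN

        def listen():
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.bind(("", HB_PORT))
            while True:
                data, addr = s.recvfrom(64)
                print("heartbeat from", addr[0], data.decode())

        def send():
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            while True:
                s.sendto(b"hb", (PEER_IP, HB_PORT))
                time.sleep(1)

        if __name__ == "__main__":
            listen() if sys.argv[1:] == ["listen"] else send()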

     

     

    H
  • We use dark fibre to interconnect our data centres. We have some spare, so I will use a pair just for the failover.

     

     

    I should have thought of this ages ago.

     

     

    Many Thanks

     

    Darren
  • Yes, dedicate ports, fibre and whatever else you can to avoid any issues.

     

     

    Post a drawing if you can... too many questions to keep going back and forth.

     

     

    Thanks!
  • Paul_Szabo_9016 (Historic F5 Account)
    "What's worse is that only the failover VLAN was taken down, while all of the other VLANs were still connected. The ARP storm was quite a spectacle to behold!"

     

    "When configuring network failover I make it mandatory to have a non-core network path between the two boxes (and avoid 9.4, because it will only use ONE path for network failover traffic; with 10.x you can configure multiple connections)."

     

     

    Absolutely, it is highly important to have a redundant path for network failover traffic! That's why the multiple heartbeat paths feature was added (it actually started in v9.6.1).