Forum Discussion

Dazzla_20011
Nov 26, 2010

Help: Major network outage involving F5 LTM

Hi,

 

 

I'm really hoping someone can help me. Last Friday we had a major problem which affected access to all our core systems. The initial problem was a bug in the Cisco Nexus NX-OS which caused loop guard to block the VLANs on a port-channel and then unblock them.

 

 

The 3 VLANs used by the F5 (real, virtual servers and heartbeat) between our two LTMs became blocked for a few microseconds.

 

 

2010 Nov 19 14:31:43 GR_Core2 %STP-2-LOOPGUARD_BLOCK: Loop guard blocking port port-channel1 on VLAN0205.
2010 Nov 19 14:31:43 GR_Core2 %STP-2-LOOPGUARD_UNBLOCK: Loop guard unblocking port port-channel1 on VLAN00205

 

 

We have two LTMs in an active/standby pair: active in data centre 1, standby in data centre 2.

 

 

When we came to investigate why users couldn't access the systems, we found the servers couldn't reach their default gateway, which is a floating self IP on the F5 LTM. To solve the problem I pressed Update on the F5 self IP used as the default gateway, and suddenly the servers could reach their gateway and access to the systems was restored. I'm interested to know what that would have done. I suspect it sent out a gratuitous ARP?
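
For my own understanding, the rough Python/scapy sketch below is what I think a gratuitous ARP from the gateway would look like (the IP, MAC and interface are made-up placeholders, not our actual config):

    # Minimal sketch of a gratuitous ARP using scapy. This is not what the LTM
    # runs internally; it just illustrates the frame I suspect it sent.
    from scapy.all import ARP, Ether, sendp

    GW_IP  = "10.1.205.1"          # hypothetical floating self IP (the servers' DG)
    GW_MAC = "00:01:d7:aa:bb:cc"   # hypothetical MAC of the active LTM

    # A gratuitous ARP is an unsolicited ARP reply where sender IP == target IP,
    # broadcast to the VLAN so every host and switch refreshes its ARP/CAM entry.
    garp = Ether(src=GW_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=2,                      # ARP reply ("is-at")
        hwsrc=GW_MAC, psrc=GW_IP,
        hwdst="ff:ff:ff:ff:ff:ff", pdst=GW_IP,
    )
    sendp(garp, iface="eth1")      # placeholder interface name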

 

 

Having checked the logs, I can see that the standby LTM became active. The LTM also reported address conflicts for some of the IPs which are used for the virtual servers.

 

 

Any help in determining the cause will be very much appreciated. We are new to the F5 world, so troubleshooting is difficult as we are used to Cisco products, and our support company isn't being very helpful.

 

 

One thing I have noticed is that we are not using MAC masquerade.

 

 

Many Thanks

 

Darren
  • We use a dedicated VLAN for the network failover. During the night when the backup runs I'm seeing the inter-site links between the data centres hit 90% at times; I would have thought this would impact all VLANs. I need to look into policing this backup traffic.

     

    I totally agree that spanning VLANs across both data centres is not good practice.

     

    The problem is that this network requires layer 2 across both data centres for vMotion, CSS boxes, F5 boxes, MS clustering, etc.

     

     

    In terms of load balancing for the websites we host, the traffic comes in via the public IP, is translated to the F5 virtual server, and is then load balanced across the web servers in our external DMZ.

     

     

    I was asked whether we can use the F5 devices to monitor and load balance requests sent from the web servers in our external DMZ to the application servers in our internal DMZ. Basically, if a server or service fails within the internal DMZ we want the F5 to mark it as down; currently we have to fall back on manual processes when a server or service fails.
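
    As a rough illustration of the kind of check I want the LTM to do for us (this is just a hand-rolled sketch with made-up hostnames and a made-up health URI, not F5 configuration), today we effectively do something like this by hand:

        # Sketch of an HTTP health probe against the internal-DMZ application
        # servers; an LTM monitor would do the equivalent and mark pool members
        # down automatically. Hostnames, port and URI are placeholders.
        import http.client

        APP_SERVERS = ["app1.internal.example", "app2.internal.example"]

        def is_up(host, port=8080, uri="/healthcheck", timeout=5):
            """Return True if the server answers the probe with HTTP 200."""
            try:
                conn = http.client.HTTPConnection(host, port, timeout=timeout)
                conn.request("GET", uri)
                ok = conn.getresponse().status == 200
                conn.close()
                return ok
            except OSError:
                return False

        for host in APP_SERVERS:
            print(host, "UP" if is_up(host) else "DOWN")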

     

    The external and internal DMZs are on separate networks, so traffic has to be routed via two sets of firewalls for them to communicate with each other; no bridging is taking place. I am using source NAT to get around the routing issue, as the external DMZ servers do not have a route back to the internal DMZ application servers via the F5. The firewall rules permit the F5 to reach the internal DMZ servers, and this is working perfectly well with the active LTM.

    The problem is with the standby LTM, which sits in our other data centre. Its default gateway points to the DC1 firewall (the same as the active LTM), but the routes back to it from the internal DMZ point to DC2, so we have asymmetric routing. My plan is to NAT the traffic of the standby LTM to an address from DC1 to get around this. My first plan was to have a different default route on the standby LTM, but I was told I can't do this as the routes on the LTMs are synced between each other.

     

     

    Not sure anyone will have a clue what's going on without a diagram. I will try to upload one when I return to work.

     

     

    Thanks very much for everyone's comments.

     

  • One of the worst outages I've ever seen involved a network heartbeat VLAN between two DCs that was severed. The BIG-IPs were split, as were ten or so other devices that rely on GARP for failover. They all did what they were configured to do: they went active and all started sending GARPs out to every VLAN. What's worse is that only the failover VLAN was taken down, while all of the other VLANs were still connected. The ARP storm was quite a spectacle to behold!

     

     

    My personal angle is that network failover is fine, but you need to know the ramifications very well and think out all of the failure scenarios and their behaviours. That way you'll at least know what to expect outage-wise, and it'll shorten your recovery times substantially. When it bites, layer 2 is merciless...

     

     

    -Matt
  • Hamish
    All very true... When configuring network failover I make it mandatory to have a non-core network path between the two boxes (and avoid 9.4, because it will only use ONE path for network failover traffic; with 10.x you can configure multiple connections).

     

     

    Non-core network means, at its simplest, an optical SFP in each F5 and a piece of dark fibre. Or you can use a pair of dedicated heartbeat switches with dark fibre between them, etc.
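
    If you want to prove the dedicated path actually carries traffic end to end before you rely on it for failover, a throwaway UDP ping like the sketch below is enough (the address and port are placeholders, and this is not the BIG-IP failover protocol, just a path check). Run the listener on one box and the sender on the other:

        # Trivial UDP heartbeat to verify the dedicated failover path passes
        # traffic. Run with the argument "listen" on one unit and with no
        # argument on the other.
        import socket, sys, time

        HB_PORT = 9999                 # placeholder port
        PEER_IP = "192.168.255.2"      # placeholder peer self IP on the heartbeat VLAN

        def listen():
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.bind(("", HB_PORT))
            while True:
                data, addr = s.recvfrom(64)
                print("heartbeat from", addr[0], data.decode())

        def send():
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            while True:
                s.sendto(b"hb", (PEER_IP, HB_PORT))
                time.sleep(1)

        if __name__ == "__main__":
            listen() if sys.argv[1:] == ["listen"] else send()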

     

     

    H
  • We use dark fibre to interconnect our data centres. We have some spare, so I will use a pair just for the failover.

     

     

    I should have thought of this ages ago.

     

     

    Many Thanks

     

    Darren
  • Yes, dedicate ports, fibre and whatever else you can to avoid any issues.

     

     

    Post a drawing if you can... too many questions to keep going back and forth.

     

     

    Thanks!
  • Paul_Szabo_9016 (Historic F5 Account)
    "What's worse is that only the failover VLAN was taken down, while all of the other VLANs were still connected. The ARP storm was quite a spectacle to behold!"

     

    "When configuring network failover I make it mandatory to have a non-core network path between the two boxes (and avoid 9.4, because it will only use ONE path for network failover traffic; with 10.x you can configure multiple connections)."

     

     

    Absolutely, it is highly important to have a redundant path for network failover traffic! That's why the multiple heartbeat paths feature was added (it actually started in v9.6.1).