Forum Discussion

f5mkuDefault
Feb 26, 2021

Web services become inaccessible on failover

Hi experts,

 

I would like some help with an issue we have been trying to resolve for several months. Even F5 TAC has been unable to find the root cause so far.

 

We have an F5 BIG-IP running LTM and ASM. Each virtual server has a WAF policy assigned.

Our big problem is that whenever we fail over from the active unit to the standby unit, all websites become inaccessible, sometimes for more than an hour. Some come back in 30 minutes, most within an hour, and some only after nearly 2 hours. We see the same symptom when we fail back to the original active unit.

 

Can anyone advise what issue we could be facing here?

We are running version 12.1.3.3 Build 0.0.1 Point Release 3.

 

Thanks a lot in advance,

9 Replies

  • There is a lot to unpack here - does the use of ASM make a difference? Do you have MAC masquerading configured? Are the ARP tables being updated after GARP? Do pools go down? Do you see client-side traffic hitting the newly active BIG-IP?

    • f5mkuDefault (Cirrus)

      Hi Pete, during some incidents removing the ASM policy resolves the issue. For example, we fail over and web pages do not load; we then remove the ASM policy and the site loads immediately. For other websites, however, this does not help. Yes, the ARP tables are updated and point to the self IP of the newly active unit. The pools never went down, and we see clients hitting the newly active BIG-IP.

       

      In our most recent troubleshooting, F5 support found that client traffic makes it all the way to the real server. However, the return traffic between the F5 and the firewall does not get through: the F5 keeps sending but gets no reply from the firewall. The firewall team tells us there is no issue on the firewall.

  • I would first run tracepath to see whether traffic even reaches the self IP of the box, and to confirm whether it is going to the active box or the standby box.

    Then look at the connection table (show sys connection) on both boxes to check where the traffic is landing.

    Then capture packets to see what is happening.

    • f5mkuDefault (Cirrus)

      Hi Jaikumar, we have already checked these but were not able to find the root cause. We asked for an RMA because we suspect a resource issue, but F5 refused, and after months we still have no clarity.

  • Have you checked the ARP tables on the firewall? It is common for firewalls to drop GARP packets because of the risk of ARP cache poisoning attacks. That is, the request may be coming through the BIG-IP, through the firewall, and to the server; the response then gets back to the firewall, which sends it to the standby BIG-IP. Recovery then depends on the firewall's ARP cache entry timing out (which explains the time variance) before the firewall sends an ARP request and learns the MAC address of the correct BIG-IP. Worth taking a look (or configuring MAC masquerading on the traffic group, which is the best solution).
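
    To illustrate why stale ARP entries would produce different recovery times per site, here is a minimal sketch of the timeout reasoning. It is a simplified model, not a description of any particular firewall: the function name, the 2-hour timeout, and the entry ages are all hypothetical values chosen for illustration.

```python
def recovery_times(entry_ages, arp_timeout, garp_accepted=False):
    """Seconds after failover until each stale ARP entry is refreshed.

    entry_ages:    age in seconds of each cached entry at the moment of failover
    arp_timeout:   ARP cache timeout in seconds (hypothetical value)
    garp_accepted: True if the device honours the gratuitous ARP, in which
                   case every entry is updated immediately
    """
    if garp_accepted:
        return [0 for _ in entry_ages]
    # An ignored GARP means each entry survives until its own timer expires.
    return [max(arp_timeout - age, 0) for age in entry_ages]


if __name__ == "__main__":
    timeout = 2 * 3600              # assumed 2-hour ARP timeout
    ages = [5400, 3600, 600]        # assumed ages of three cached entries
    for age, wait in zip(ages, recovery_times(ages, timeout)):
        print(f"entry aged {age // 60} min recovers after {wait // 60} min")
```

    With these assumed numbers the three entries recover after 30, 60, and 110 minutes, which is the kind of spread described in the original post. If the GARP were accepted, or if MAC masquerading kept the MAC address unchanged across failover, the wait would be zero for every entry.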

  • Just an update on this: F5 is currently attributing the problem to the firewall. No closure yet.

    • eey0re (Cirrostratus)

      This does sound like a firewall problem. For example, when a failover occurs, existing TCP connections are not recognised by the newly active appliance (unless connection mirroring is enabled for the virtual server). This results in a large number of TCP RSTs being sent to all the servers and clients. I've seen a "nextgen" firewall see the large number of RSTs from the BIG-IP and treat it as a port scan.
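
      As a rough illustration of how a threshold-based detector could misread a failover, the sketch below counts RST packets per source and flags any source above a limit. This is an assumption about the general technique only, not the logic of any specific firewall product; the function name, addresses, and threshold are hypothetical.

```python
from collections import Counter


def flag_rst_sources(packets, threshold):
    """Return the source IPs that sent at least `threshold` RST packets.

    packets:   iterable of (src_ip, tcp_flags) tuples, where tcp_flags is a
               string of flag letters such as "S", "RA" or "R"
    threshold: RST count at which a source is treated as suspicious
               (hypothetical value)
    """
    rst_counts = Counter(src for src, flags in packets if "R" in flags)
    return {src for src, count in rst_counts.items() if count >= threshold}


if __name__ == "__main__":
    # Assumed traffic seen just after failover: the newly active BIG-IP
    # (10.0.0.5 here) resets connections it has no state for.
    packets = [("10.0.0.5", "R")] * 500
    packets += [("10.0.0.9", "S")] * 3 + [("10.0.0.9", "RA")]
    print(flag_rst_sources(packets, threshold=100))
```

      In this toy example only the BIG-IP address crosses the threshold, so a device using this kind of rule would block the very host that needs to pass return traffic. The security and DDoS logs on the firewall during a failover would show whether anything like this is happening.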

    • Nikoolayy1 (MVP)

      I agree with eey0re: you could test with F5 connection mirroring and MAC masquerading, and during a failover the firewall team needs to check the security and DDoS logs.