I would like some help with an issue we have been trying to resolve for several months. Even F5 TAC has been unable to find the root cause so far.
We have an F5 BIG-IP running LTM and ASM. Each virtual server has a WAF policy assigned.
Our big problem is that whenever we fail over from the active unit to the standby unit, all websites become inaccessible for a long time. Some come back within 30 minutes, most within an hour, and some only after nearly 2 hours. We see the same symptom when we fail back to the original active unit.
Can anyone advise what issue we could be facing here?
We are using version 126.96.36.199 Build 0.0.1 Point release 3
Thanks a lot in advance.
There is a lot to unpack here - does the use of ASM make a difference? Do you have MAC masquerading configured? Are the ARP tables being updated after GARP? Do pools go down? Do you see client-side traffic hitting the newly active BIG-IP?
Hi Pete, in some incidents removing ASM resolves the issue. For example, we fail over, web pages stop loading, we remove the ASM policy, and the site loads immediately. For other websites, however, this does not help. Yes, the ARP tables are updated and point to the self IP of the newly active unit. The pools never went down, and we do see clients hitting the newly active BIG-IP.
In our most recent troubleshooting, F5 support found that client traffic makes it all the way to the real server. However, the return traffic between the F5 and the firewall keeps bouncing: the F5 keeps sending but gets no reply from the firewall. The firewall team tells us there is no issue on the firewall.
I would first run tracepath to see whether traffic even reaches the self IP of the box, to confirm whether it is going to the active box or the standby box.
Then look at the connection table (show sys connection) on both boxes to check where the traffic is landing.
Then check the packets to see what is happening...
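For the packet check, the main question is which unit's MAC address the return traffic from the servers is addressed to. Below is a minimal Python sketch of that tally. It assumes you have already exported frames from a capture into simple records; the field names (`src_ip`, `dst_mac`), the function name and the role labels are all made up for illustration and are not part of any F5 tool.

```python
from collections import Counter

def tally_return_frames(frames, server_ips, mac_roles):
    """Count frames coming back from the servers, grouped by the role
    (e.g. "active" or "standby") of the destination MAC address.

    frames     -- iterable of dicts with "src_ip" and "dst_mac" keys (hypothetical format)
    server_ips -- set of real-server IP addresses
    mac_roles  -- dict mapping lower-case MAC address -> role label
    """
    counts = Counter()
    for frame in frames:
        if frame["src_ip"] in server_ips:
            role = mac_roles.get(frame["dst_mac"].lower(), "unknown")
            counts[role] += 1
    return counts
```

If most return frames are counted under the standby unit's MAC after a failover, the device forwarding them is still using a stale ARP entry. Note that this check only makes sense when the two units have different MAC addresses, i.e. without MAC masquerading.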
Hi Jaikumar, we have already checked these parts but were not able to find the root cause. We asked for an RMA because we suspect a resource issue, but F5 refuses to accept it, and after months we still don't have clarity.
Have you checked the ARP tables on the firewall? It is common for firewalls to drop GARP packets because of the risk of ARP cache poisoning attacks. That is, the request may be coming through the BIG-IP, through the firewall and on to the server, but when the response gets back to the firewall, the firewall sends it to the standby BIG-IP. Recovery then depends on the firewall's ARP cache entry timing out (which explains the time variance) before the firewall sends an ARP request and learns the MAC address of the correct BIG-IP. Worth taking a look (or configure MAC masquerading on the traffic group, which is the best solution).
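To illustrate why a stale ARP cache would produce recovery times that vary per address, here is a toy Python model. It is only a sketch under simplifying assumptions: entries expire a fixed time after they were learned and are never refreshed by traffic, and the class, function names, addresses and timeout value are all invented for the example. It does not describe any particular firewall.

```python
from dataclasses import dataclass, field

@dataclass
class ArpCache:
    """Toy ARP cache (hypothetical model, not a vendor implementation)."""
    timeout_s: int
    accept_garp: bool
    entries: dict = field(default_factory=dict)  # ip -> (mac, learned_at)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def on_garp(self, ip, mac, now):
        # A cache that distrusts gratuitous ARP simply ignores it.
        if self.accept_garp:
            self.learn(ip, mac, now)

    def lookup(self, ip, now, resolver):
        entry = self.entries.get(ip)
        if entry is None or now - entry[1] >= self.timeout_s:
            # Entry missing or expired: send a real ARP request.
            mac = resolver(ip)
            self.learn(ip, mac, now)
            return mac
        return entry[0]

def seconds_until_recovered(cache, ip, new_mac, failover_at,
                            step_s=60, limit_s=4 * 3600):
    """Seconds after failover until the cache resolves ip to new_mac."""
    cache.on_garp(ip, new_mac, failover_at)  # new active unit announces itself
    t = failover_at
    while t <= failover_at + limit_s:
        if cache.lookup(ip, t, lambda _ip: new_mac) == new_mac:
            return t - failover_at
        t += step_s
    return None

if __name__ == "__main__":
    VIP = "192.0.2.10"
    OLD_MAC, NEW_MAC = "00:00:5e:00:53:01", "00:00:5e:00:53:02"

    strict = ArpCache(timeout_s=3600, accept_garp=False)
    strict.learn(VIP, OLD_MAC, now=0)
    print(seconds_until_recovered(strict, VIP, NEW_MAC, failover_at=1200))   # 2400

    lenient = ArpCache(timeout_s=3600, accept_garp=True)
    lenient.learn(VIP, OLD_MAC, now=0)
    print(seconds_until_recovered(lenient, VIP, NEW_MAC, failover_at=1200))  # 0
```

In this model the outage for each address lasts however long its entry had left before expiry, so addresses learned at different times recover at different times. With MAC masquerading the MAC address does not change on failover, so the stale entry stays correct.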
This does sound like a firewall problem. For example, when a failover occurs, the existing TCP connections are not recognised by the newly active appliance (unless connection mirroring is enabled for the VS). This results in a large number of TCP RSTs to all the servers and clients. I've seen a "nextgen" firewall see the large number of RSTs from the BIG-IP and think it's a port scan.
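As a rough illustration of how a burst of RSTs could trip a rate-based heuristic, here is a small Python sketch of a sliding-window counter. This is purely an assumption about how such a check might work; the function name, window and threshold are invented, and real firewalls use their own (usually undocumented) detection logic.

```python
from collections import deque

def rst_burst_detected(rst_times, window_s, threshold):
    """Return True if more than `threshold` RSTs from one source fall
    inside any sliding window of `window_s` seconds.

    rst_times -- timestamps (seconds) of RST packets seen from one source
    """
    window = deque()
    for t in sorted(rst_times):
        window.append(t)
        # Drop timestamps that have fallen out of the window.
        while t - window[0] >= window_s:
            window.popleft()
        if len(window) > threshold:
            return True
    return False
```

Resets spread over normal operation stay under the threshold, while the resets sent right after a failover all land in the same window. If the firewall logs show the BIG-IP being flagged or blocked at the time of failover, that would support this theory.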