Forum Discussion

mkyrc
Aug 23, 2023

Pool members not stable after failover

Hi,
Our setup:
- two vCMP guests in HA (VIPRION with two blades)
- ~10 partitions
- simple configuration with LTM and AFM; nodes are directly connected to the F5 device (the F5 device is the default gateway for the nodes)
- software 16.1.3.3, later upgraded to 16.1.4
^^ the same setup runs in two data centers.

We are seeing odd behaviour in the first data center only:
- second F5 guest active: the pool member monitors (http and https) respond without problems and everything is stable. This holds on both F5 devices in the HA pair.
- after failover (first F5 guest active): pool member responses are not stable (the https monitor is unstable; the http monitor remains stable). Sometimes all pool members are down, and then the virtual server goes down.
^^ It looks like a problem on the node side, but it isn't, because everything is stable when the second F5 device is active.

This issue affects almost all partitions. We checked:
- physical interfaces: everything is stable, no errors on ports or ether-channels (trunks)
- ARP records: everything looks correct, no MAC flapping
- spanning tree: stable in the environment
- routing: correct; default gateway on the node side: correct; subnet mask: correct on the nodes and on both F5 devices. Floating addresses work correctly (including ARP in the network)
- logs on the F5 devices: nothing related to this behaviour

I don't know what else related to this issue we can check.

The configuration of all F5 devices (2x dc1, 2x dc2, two independent HA pairs) is the same (configured with automation), and the software version is the same (we upgraded to 16.1.4 two days ago). It looks like something is "blocked" on the first F5 device in dc1 (neither a reboot nor the upgrade solved the issue).

Do you have any idea what else to check?

2 Replies

  • Hi mkyrc,

    Try to capture the https monitor traffic on vCMP guest 1, the one that sees this flapping.

    Take a pcap while the flapping occurs and look at the time interval between requests from the BIG-IP self IP address to the server nodes, especially for https. This will help determine whether the BIG-IP itself has issues with monitor traffic.

    Also try increasing the monitor interval to 10 seconds instead of 5, and the timeout to 31 seconds instead of 16.

    Then see how it behaves. Maybe there is an issue on the server side that affects only the first BIG-IP, which then sees some delay from the servers.

    Anyway, test that; it will be a good starting point.
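
    A minimal sketch of the two checks suggested above, assuming the monitor request timestamps (in seconds) have already been extracted from the pcap by some other means. The function names are hypothetical. The 5/16 and 10/31 pairs mentioned above are consistent with the common "three intervals plus one second" rule for monitor timeouts, which is what `monitor_timeout` encodes.

```python
from typing import List, Tuple


def monitor_timeout(interval: int, probes: int = 3) -> int:
    """Timeout following the 'probes * interval + 1' rule of thumb."""
    return probes * interval + 1


def find_gaps(
    timestamps: List[float], interval: float, tolerance: float = 1.0
) -> List[Tuple[float, float, float]]:
    """Return (previous, current, delta) for every pair of consecutive
    monitor requests that are further apart than interval + tolerance.

    `timestamps` must be sorted in ascending order.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps
```

    If `find_gaps` reports nothing while the members still flap, the BIG-IP is sending its probes on schedule and the delay is more likely in the responses or somewhere along the path.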

    • mkyrc

      Hi Mohamed_Ahmed_Kansoh,
      I did all of this, but without success. I discussed it with the application team and I think we have moved a bit forward.

      Situation:
      1. Both F5 devices monitor the directly connected servers (pool members). This is FLOW1. There is no problem here, regardless of which device is active.
      2. I found out that the server establishes a new connection (FLOW2) to another system located in a different subnet. Our server uses the F5 devices (their floating IP address) as its default gateway. I think the issue is related to this FLOW2 connection.

      Cause of the problem:
      - when f5-02 is the active device, FLOW2 is routed correctly (virtual server 0.0.0.0/0, FastL4). No problem here.
      - when f5-01 is active, FLOW2 goes to f5-01 (because the floating IP, i.e. the default gateway, has moved there), and the servers have trouble reaching the other system located in a different subnet in the data center. The connection is interrupted very often.

      There is no problem with moving the floating IPs between f5-01 and f5-02, and no problem with ARP records (the neighbors learned the correct MAC addresses). The F5 devices accept FLOW2 traffic (an AFM rule allows all traffic sourced from the server IP addresses to 'any'), and return traffic is also routed correctly (to the floating address on the 'outside' of the F5 devices).

      During the next maintenance window, I will check:
      - tcpdump in front of and behind f5-01, to see whether the packets pass through correctly
      - prepare a logging AFM rule
      - CPU usage on f5-01
      - that there is no network issue (packet drops on the physical layer; checked last time)
      - that /var/log/ltm shows no issues (checked last time)
      - what else can I check? Any ideas?
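
      For the "tcpdump in front of and behind f5-01" item, one possible way to compare the two captures is sketched below. It assumes the forwarding virtual server does not translate addresses or ports (no SNAT), so the same tuple should appear on both sides. It also assumes the relevant fields have already been extracted from each pcap. `PacketKey` and `missing_on_egress` are hypothetical names used only for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class PacketKey(NamedTuple):
    """Fields that should be identical on both sides when nothing is translated."""
    src: str
    dst: str
    sport: int
    dport: int
    seq: int


def missing_on_egress(
    ingress: Iterable[PacketKey], egress: Iterable[PacketKey]
) -> List[PacketKey]:
    """Return packets seen in front of the device but not behind it.

    Duplicates (for example retransmissions) are matched one-to-one, so two
    copies on ingress and one on egress yields one missing packet.
    """
    remaining = Counter(egress)
    missing = []
    for pkt in ingress:
        if remaining[pkt] > 0:
            remaining[pkt] -= 1
        else:
            missing.append(pkt)
    return missing
```

      A non-empty result while f5-01 is active, with an empty one while f5-02 is active, would suggest that f5-01 itself is dropping FLOW2 packets rather than the network around it.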