LTM VE virtual server intermittently unreachable
Hello,
We run an HA cluster of two LTMs on version 13.1.1.5. Until recently the LTMs were deployed as vCMP guests on a VIPRION 2400 platform.
In early 2022 we migrated the LTMs from the VIPRION to LTM VE units running on VMware ESX. We did not encounter any issues during the migration (we basically used the RMA process to replace the units one by one with LTM VEs).
Basic setups (VIP + TCP port) are in place for some applications on this cluster: a virtual server with SNAT towards the pool members, plus MAC masquerading, i.e. a "fake" MAC address shared by the cluster. To make this work we set "Promiscuous Mode", "Forged Transmits" and "MAC Address Changes" to "Accept" on the VMware port groups.
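For reference, the setup roughly boils down to something like the sketch below (shown here via iControl REST with Python requests; the management address, credentials, object names, VIP and masquerade MAC are placeholders, not our real config, and the traffic-group "mac" property is how I understand MAC masquerade is set, so correct me if that is off):

```python
# Rough sketch of the kind of setup described above, via iControl REST.
# Host, credentials, addresses and object names are placeholders.
import requests

BIGIP = "https://192.0.2.10"        # placeholder management address
s = requests.Session()
s.auth = ("admin", "admin")          # placeholder credentials
s.verify = False                     # lab only

# Pool with the real servers as members, simple TCP monitor
s.post(f"{BIGIP}/mgmt/tm/ltm/pool", json={
    "name": "app1_pool",
    "monitor": "tcp",
    "members": [{"name": "10.10.10.11:443"}, {"name": "10.10.10.12:443"}],
}).raise_for_status()

# Virtual server = VIP + TCP port, SNAT automap towards the pool members
s.post(f"{BIGIP}/mgmt/tm/ltm/virtual", json={
    "name": "app1_vs",
    "destination": "10.10.20.50:443",
    "ipProtocol": "tcp",
    "pool": "app1_pool",
    "sourceAddressTranslation": {"type": "automap"},
}).raise_for_status()

# MAC masquerade is configured on the floating traffic group (property name
# "mac" as far as I know; verify with "tmsh list cm traffic-group")
s.patch(f"{BIGIP}/mgmt/tm/cm/traffic-group/traffic-group-1", json={
    "mac": "02:01:23:45:67:89",      # placeholder masquerade MAC
}).raise_for_status()
```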
VMware runs on an HP blade enclosure, where normally 1 blade = 1 VMware host. During setup we asked the VMware team to always keep the two units of the HA cluster on different VMware hosts (for redundancy, we thought).
Since the replacement we have been getting complaints, for some setups, that during the night (a period of low traffic) access to the virtual server (VIP + TCP port) is lost for a short period. When we check the LTM logs, however, the pool members stay reachable (no up/down events) and there are no failover messages either. So the HA cluster remains stable and pool member monitoring keeps working.
Further investigation at the network level showed that during the nights when the issue is seen, the MAC addresses of the active unit as well as those of the standby unit were being learned via the same uplink, even though we thought the units were on different VMware hosts. Long story short: after a while the VMware team confirmed that even when servers are on different VMware hosts (blades), they can still use the same uplinks to the network. Each blade provides 4 uplinks to the virtual switch, and the virtual switch in turn has 4 uplinks towards the Virtual Connect modules. VMware picks the uplink in a round-robin fashion. As a consequence, even when the two VMs are on different blades, there is a 25% (1 in 4) chance that they end up using the same uplink.
When that happens, we lose connectivity during periods with little traffic. The problem does not occur when the units are using different uplinks.
We suspect this has to do with the aging timer of the virtual server's MAC address, which is the mac-masquerade address. Our theory is that at the VMware level the mac-masquerade address ages out and is no longer known, while at the network level (Cisco switches/routers) the mac-masquerade address is still present. So you have to wait until the router ARPs again before the MAC address is learned along the path and traffic flows once more.
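To pin this down we are thinking of running a small probe like the sketch below from a machine in the same VLAN when an outage is reported: it just ARPs for the VIP and reports which MAC answers, so we can see whether the mac-masquerade address stops being resolved along the affected path (the VIP, expected MAC and interface name are placeholders):

```python
# Quick diagnostic sketch using scapy (run as root from a host in the same
# VLAN as the VIP); VIP, expected MAC and interface are placeholders.
from scapy.all import Ether, ARP, srp

VIP = "10.10.20.50"                  # placeholder virtual server address
EXPECTED_MAC = "02:01:23:45:67:89"   # placeholder mac-masquerade address
IFACE = "eth0"                       # interface towards the LTM VLAN

# Broadcast an ARP request for the VIP and see which MAC (if any) replies
answered, _ = srp(
    Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=VIP),
    iface=IFACE, timeout=2, verbose=False,
)

if not answered:
    print(f"no ARP reply for {VIP} - looks like the MAC is not reachable on this path")
for _, reply in answered:
    mac = reply[ARP].hwsrc
    ok = "matches masquerade MAC" if mac.lower() == EXPECTED_MAC.lower() else "unexpected MAC"
    print(f"{VIP} is at {mac} ({ok})")
```

Of course the probe itself refreshes the MAC tables along its own path, so we would only run it at the moment an outage is actually reported, not continuously.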
Does anybody have similar experiences? Does anybody have more info on how mac-masquerade addresses are learned at the virtual switch level, and how long they remain cached there?