we use a HA cluster of 2 LTM's running on version 126.96.36.199 . Up til some weeks ago the LTM were deployed on vcmp guests on a viprion 2400 platform .
Early 2022 we migrated our LTM's from viprion towards LTM VE units running on vmware ESX . During migrations we didn't encounter any issue. (we basically used RMA process for replacing the units 1by1 with LTM VE's)
basic setups (VIP + TCP port) is done for some applications on this cluster . Where we use a virtual server , together with SNAT for pointing to pool members . And we use mac-masquerading also for creating fake mac-addresses. For this purpose we put "promiscious mode" - "'forged transmit" - "mac address changes" to "accept" on vmware .
Vmware is running on HP blade enclosure . Where normally 1 blade = 1 vmware host . During setup we asked vmware team to always have both units in HA cluster , running on different vmware hosts. (redudancy we thought)
Since replacement we are getting for some setups complaints that during the night (period of low traffic) , access to the virtual server (VIP + TCP port) is lost for a short period . When we check out ltm logs , we see however that pool members are still reachable as there are no UP/Down events. Neither do we see "failover" messages in logs . So clearly the HA cluster remains stable & pool member monitoring keeps working.
Further investigation at network level was done & we noticed that during the nights were the issue is seen , all mac addresses from active unit but also from standby unit . Even if we thought they are on different vmware hosts . Long story short , after a while vmware team confirmed us that even if your have server on different vmware hosts (blades) these still can use the same uplinks to network. Each blade has 4 uplinks to virtual server , and virtual server than has 4 uplinks to virtual connect modules . Vmware uses round robin for determining which uplink to use . Consequence , even when you are on different blade you have a 25% possibility of using same uplinks.
When this occurs , we see that during periods with little traffic we loose connectivity . This doesn't occur when we are using different uplinks.
We are suspecting this has to do with aging timer of mac-address of virtual server. Which is using the mac-masquerade address . We're suspecting that at vmware level the mac address (mac masquerade) is not known anymore , while at network level (cisco switches/routers) we are still seeing the mac-masquerade address . Thus you need to wait untill a ARP is done ar router level in order to get mac-addresses known again .
Does anybody has similar experiences ? Is there anybody who has more info about how mac-masquerading addresses are learned at virtual switch level & eventually how long they will be cached there ?
There are some bugs related to MAC Masquerade and VE. Check them out:
articles are linked to X520 & SRIOV. I've checked with our vmware team and we don't use this on our infrastructure.
The weird point is also that most of the time it is all working fine without any issue . Only at nighttime , when there is little traffic , then we see occurences of the issue . Virtual server becomes unreachable for clients trying to reach it . It also for a short period and then becomes available again . If i read the above articles i would expect it not to work at all with mac-masquerading . This is not the case for us , it works fine most of the time .