Hamish,
Much appreciate your patience with an obvious n00b. I'm an old-school architect; dirt simple... this one vexes me. The last time I saw something similar, it turned out to be a secondary firewall that was interpreting traffic between parts of the distributed application as a DDoS attack and stealthing the destination ports. But that was back when they still let you put Ethereal/Wireshark on your laptop.
Requested the tcpdump, waiting on that... will answer that one probably Monday when the folks with access to the F5 return.
I'm not sure what you mean by VS; the VIP itself responds to pings, and the traceroute goes just to the VIP. Does this rule out the F5?
The customer uses SNAT. The F5 acts as a proxy, and the pool member servers always see sessions on a given VLAN coming from the same IP address (which I believe is the F5’s “self-IP” address on that VLAN).
As for n-path, NAT'ed VSs, BGP routing from non-directly attached VLANs, OTV MAC address caching, asymmetric routing back through the F5... they are not using any of those. There is no routing taking place at all: everything is at layer 2.
When a traceroute succeeds, it shows a single hop, indicating that no routing is taking place. All VIPs and servers are on the same IP subnet and VLAN. No iRules are in use for this distributed application at all.
The F5 configuration has not changed since well before the problem began. However, along with some scheduled network and server upgrades, the customer took the opportunity to upgrade the F5 code versions as well as the F5 box itself.
All F5 upgrades have been rolled back, to no effect. Unfortunately, a great deal of the hardware and networking environment was changed all at the same time, rather than one change at a time with a smoke test after each.
The layer 2 path between the F5 and the servers is significantly different from what it was before the problem began.
The F5 guru provided me with the F5 configuration, as below. Relevant bits have been sanitised (customer's privacy & anonymity come first), including sample node definitions, the health monitor definition used by the 'problem' pool, as well as all pool and virtual server definitions.
Thanks!
Robert
-------------------------------------------------------
node x.x.x.1 {
   screen hostname1
}
node x.x.x.0 {
   screen hostname0
}
monitor eMyApp_PROBLEMPOOL_http {
   defaults from http
   dest *:x00x
   recv "Test Passed"
   send "GET //PROBLEMPOOL/health.html\r\n"
}
pool myapptst.customer.com-x00x {
   lb method member least conn
   monitor all eMyApp_PROBLEMPOOL_http
   member x.x.x.1:x00x
   member x.x.x.0:x00x
}
virtual myapptst.customer.com-80 {
   destination x.x.x.3:http
   snat automap
   ip protocol tcp
   profile http tcp
   pool myappstg.customer.com-x00x
   vlans eMyApp-x.x.x.8-25 enable
}
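
For anyone who wants to exercise the health check by hand from a host on the server VLAN, here is a minimal Python sketch of roughly what the monitor above does: open a TCP connection to a pool member, send the monitor's send string, and look for the recv string in whatever comes back. The function names are my own invention, and the real monitor's interval and timeout handling is not modelled; this is only a plain substring check.

```python
import socket


def response_matches(response: bytes, recv_string: str) -> bool:
    """True if the monitor's receive string appears anywhere in the response."""
    return recv_string.encode("ascii") in response


def http_monitor_check(host: str, port: int, send_string: str,
                       recv_string: str, timeout: float = 5.0) -> bool:
    """Send the monitor request to one pool member and look for recv_string.

    Returns False on connection failure or timeout, which is roughly how
    a monitor would treat the member (marked down).
    """
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(send_string.encode("ascii"))
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
                # Stop as soon as the expected text has arrived.
                if response_matches(b"".join(chunks), recv_string):
                    return True
    except OSError:
        # Refused, unreachable, reset, or timed out.
        return False
    return response_matches(b"".join(chunks), recv_string)
```

Calling http_monitor_check(member_ip, member_port, "GET //PROBLEMPOOL/health.html\r\n", "Test Passed") against each pool member in turn (substituting the real address and port, which are sanitised above) should return True whenever the member would pass the monitor. An intermittent False from a host on the same VLAN would point at the layer 2 path or the server rather than the F5.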