Timeout on request to F5 pool

Question

Folks,
My customer has several distributed application components. Each component is scaled for HA, and has a pool set up for it. Recently, network requests from any component to one particular pool have started timing out, seemingly at random. The hosts in the pool are accessible directly.
The pool in question has a DNS entry which resolves fine, but if I do a traceroute -n to either the URL or the IP address of the pool, I get all * * *, which indicates no route to destination, or packet loss.
I say seemingly at random, because it happens to individual hosts trying to contact a pool. The pool might be inaccessible for minutes or hours from a particular host, then suddenly be accessible again. When that happens, the same pool becomes inaccessible from a different host that could formerly access it.
Where do I go from here? My customer had another F5 and swapped it out... problem did not go away. DNS resolution works fine all around.
Thanks for being patient.
Robert
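For anyone trying to pin down when a given host loses the pool, a timestamped TCP connect probe may be more telling than traceroute, since it exercises the same port the application uses. This is only a minimal sketch: the function names `probe`, `log_line`, and `watch`, the 3-second timeout, and the 10-second interval are all assumptions made for illustration, not anything from the customer's setup.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and unreachable networks
        return False


def log_line(host, port, ok, now=None):
    """Format one timestamped result line; now=None means the current time."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{stamp} {host}:{port} {'ok' if ok else 'FAIL'}"


def watch(host, port, rounds=360, interval=10.0):
    """Probe repeatedly and print one line per attempt (hypothetical defaults)."""
    for _ in range(rounds):
        print(log_line(host, port, probe(host, port)))
        time.sleep(interval)

# Example (not run here): watch("<address of the pool's VIP>", 80)
```

Running the same loop from two or more client hosts at once would show whether the outage really moves from one host to another, as described above.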

hamish · Answer

Hmm... You mean the IP of the VS? Sorry, you're mixing the terminology around a bit, which makes it a bit ambiguous trying to work things out.

Are you using SNAT? n-path? NAT'ed VSs? Any iRules?

What does the rest of the network look like? I've seen issues like this in various setups, but it was always external to the BigIP... e.g. BGP routing from non-directly attached VLANs, or problems with OTV MAC address caching... Check your routing, in particular that you're not trying to asymmetrically route back through the BigIP. (It won't work by default, although IIRC you can change a setting in the DB to enable that on some versions. Aaron may remember which ones.)

H

hamish · Answer

Oh. And when you do have a host that can't access your VS... what do you see when you tcpdump on the LTM looking for traffic to/from the client? SYN? No SYN-ACK? SYN/SYN-ACK/ACK but no traffic?

H
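The cases listed above can be turned into a small decision helper for reading a capture. This is a rough sketch only: the labels "SYN", "SYN-ACK", "ACK", and "DATA" are hypothetical tags you would assign by hand from the tcpdump output, and the suggested next steps are general TCP reasoning, not a diagnosis of this particular setup.

```python
def classify_handshake(packets_seen):
    """Say how far one client-to-VS flow got, given the packet kinds observed.

    packets_seen: iterable of hand-assigned labels, any of
    "SYN", "SYN-ACK", "ACK", "DATA" (hypothetical labels for this sketch).
    """
    seen = set(packets_seen)
    if "SYN" not in seen:
        return "no SYN: the request never reached this capture point"
    if "SYN-ACK" not in seen:
        return "SYN but no SYN-ACK: the VS did not answer"
    if "ACK" not in seen:
        return "SYN/SYN-ACK but no ACK: the reply may not be reaching the client"
    if "DATA" not in seen:
        return "SYN/SYN-ACK/ACK but no traffic: look at the server side"
    return "handshake and data seen: flow looks healthy at this capture point"


print(classify_handshake(["SYN", "SYN-ACK", "ACK"]))
```

Each answer only narrows down where to capture next; it says nothing about the cause on its own.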

plumtree_72679 · Answer

Hamish,
Will get answers on these questions from the network gurus. Many thanks for clarifying the right questions to ask.
Thanks,
Robert

plumtree_72679 · Answer

Hamish,
Much appreciate your patience with an obvious n00b. I'm an old school architect; dirt simple... this one vexes me. Last time I saw something similar, it turned out to be a secondary firewall that was interpreting traffic between parts of the distributed application as a DDoS attack and stealthed the destination ports. But that was when they still let you put Ethereal/Wireshark on your laptop.
Requested the tcpdump, waiting on that... will answer that one probably Monday when the folks with access to the F5 return.
I'm not sure what you mean by VS; the VIP itself responds to pings, and the traceroute goes just to the VIP. Does this rule out the F5?
The customer uses SNAT. The F5 acts as a proxy, and the pool member servers always see sessions on a given VLAN coming from the same IP address (which I believe is the F5's "self-IP" address on that VLAN).
As for n-path, NAT'ed VSs, BGP routing from non-directly attached VLANs, OTV MAC address caching, asymmetrically routing back through the F5... they are not using any of those. There is no routing taking place at all: everything is at layer 2.
When a traceroute returns good, the traceroute shows a single hop, indicating that no routing is taking place. All VIPs and servers are on the same IP subnet and VLAN. No iRules are in use for this distributed application at all.
The F5 configuration has not changed since well before the problem began. However, along with some scheduled network and server upgrades, the customer took the opportunity to upgrade the F5 code versions as well as the F5 box itself.
All F5 upgrades have been rolled back, to no effect. Unfortunately a great deal of the hardware and networking environment has changed, all at the same time, rather than one by one with a smoke test after each change.
The layer 2 path between the F5 and the servers is significantly different from what it was before the problem began.
The F5 guru provided me with the F5 configuration, as below. Relevant bits have been sanitised (customer's privacy & anonymity come first), including sample node definitions, the health monitor definition used by the 'problem' pool, as well as all pool and virtual server definitions.
Thanks!
Robert
-------------------------------------------------------
node x.x.x.1 {
   screen hostname1
}
node x.x.x.0 {
   screen hostname0
}
monitor eMyApp_PROBLEMPOOL_http {
   defaults from http
   dest *:x00x
   recv "Test Passed"
   send "GET //PROBLEMPOOL/health.html
"
}
pool myapptst.customer.com-x00x {
   lb method member least conn
   monitor all eMyApp_PROBLEMPOOL_http
   member x.x.x.1:x00x
   member x.x.x.0:x00x
}
virtual myapptst.customer.com-80 {
   destination x.x.x.3:http
   snat automap
   ip protocol tcp
   profile http tcp
   pool myappstg.customer.com-x00x
   vlans eMyApp-x.x.x.8-25 enable
}

hamish · Answer

VS == Virtual Server... The config line 'virtual myapptst.customer.com-80' defines a VS with IP x.x.x.3 and port 80 (HTTP). It uses SNAT, and looks fine.

If it all looks OK between the VS and the client, then it becomes a little more difficult to debug, because the SNAT hides which client a server-side connection belongs to. The way I usually try to debug this varies... But you could try a custom iRule that SNATs to a dedicated IP address on a 1:1 basis (1 client to a dedicated IP). Then tcpdump for the client-VS traffic and the SNAT-pool member traffic simultaneously.

Do you have a complicated layer 2 network? Is it possible you're having spanning-tree problems?

H
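To make the two simultaneous captures concrete, here is a small sketch that builds one capture filter for each side of the proxy. It assumes a dedicated SNAT address has already been set up for the test client as suggested above; the function name `capture_filters` and every address and port in the example are placeholders (documentation-range IPs), not values from the posted config.

```python
def capture_filters(client_ip, vs_ip, vs_port, snat_ip, members):
    """Build pcap filter strings for both sides of a SNAT'ed connection.

    members: list of (ip, port) tuples for the pool members.
    Returns (client_side_filter, server_side_filter).
    """
    # Client side: traffic between the test client and the virtual server.
    client_side = f"host {client_ip} and host {vs_ip} and tcp port {vs_port}"
    # Server side: traffic between the dedicated SNAT address and any member.
    member_terms = " or ".join(
        f"(host {ip} and tcp port {port})" for ip, port in members
    )
    server_side = f"host {snat_ip} and ({member_terms})"
    return client_side, server_side


# Placeholder values for illustration only.
front, back = capture_filters(
    "192.0.2.10", "192.0.2.3", 80, "192.0.2.50",
    [("192.0.2.21", 8008), ("192.0.2.22", 8008)],
)
print(front)
print(back)
```

Each string can be passed, quoted, as the filter expression of a tcpdump run in its own session. Because the SNAT address is dedicated to one client, anything matching the second filter belongs to that client's connections.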
