Forum Discussion

Plumtree_72679
Jan 21, 2012

Timeout on request to F5 pool




My customer has several distributed application components. Each component is scaled for HA, and has a pool set up for it. Recently, network requests from any component to one particular pool have started timing out, seemingly at random. The hosts in the pool are accessible directly.



The pool in question has a DNS entry that resolves fine, but if I run traceroute -n to either the hostname or the IP address of the pool, I get all * * *, which indicates no route to the destination or packet loss.



I say seemingly at random, because it happens to individual hosts trying to contact a pool. The pool might be inaccessible for minutes or hours from a particular host, then suddenly be accessible again. When that happens, the same pool becomes inaccessible from a different host that could formerly access it.



Where do I go from here? My customer had another F5 and swapped it out... problem did not go away. DNS resolution works fine all around.



Thanks for being patient.





7 Replies

  • Hamish:
    Hmm... You mean the IP of the VS? Sorry, you're mixing the terminology around a bit, which makes it ambiguous trying to work out what's going on.



    Are you using SNAT? n-path? NAT'ed VS's? And iRules?



    What does the rest of the network look like? I've seen issues like this in various setups, but it was always external to the BigIP... e.g. BGP routing from non-directly attached VLANs, or problems with OTV MAC address caching... Check your routing, in particular that you're not trying to asymmetrically route back through the BigIP. (It won't work by default, although IIRC you can change a setting in the DB to enable that on some versions. Aaron may remember which ones.)






  • Hamish:
    Oh, and when you do have a host that can't access your VS... what do you see when you tcpdump on the LTM looking for traffic to/from the client? SYN? No SYN-ACK? SYN/SYN-ACK/ACK but no traffic?
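    For reference, a capture along the lines Hamish describes might look like this. The VLAN/interface names and addresses below are placeholders, not values from this thread; and because SNAT automap rewrites the source to the self-IP, the client side and the pool-member side have to be captured separately:

```shell
# Client side of the conversation on the BIG-IP: look for the SYN /
# SYN-ACK / ACK handshake (or its absence). Substitute your own
# interface name, client IP, and virtual server IP.
tcpdump -ni client_vlan host CLIENT_IP and host VS_IP and tcp port 80

# Server side: with SNAT automap the pool members see the BIG-IP's
# self-IP as the source, so filter on that and the pool member.
tcpdump -ni server_vlan host SELF_IP and host POOL_MEMBER_IP
```

    Comparing the two captures shows whether the client's SYN ever reaches the LTM, and whether the LTM's server-side connection ever completes.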



  • Hamish,



    Will get answers on these questions from the network gurus. Many thanks for clarifying the right questions to ask.



  • Hamish,



    Much appreciate your patience with an obvious n00b. I'm an old-school architect; I keep things dirt simple... this one vexes me. Last time I saw something similar, it turned out to be a secondary firewall that was interpreting traffic between parts of the distributed application as a DDoS attack and stealthing the destination ports. But that was back when they still let you put Ethereal/Wireshark on your laptop.



    Requested the tcpdump, waiting on that... will answer that one probably Monday when the folks with access to the F5 return.



    I'm not sure what you mean by VS; the VIP itself responds to pings, and the traceroute goes just to the VIP. Does this rule out the F5?



    The customer uses SNAT. The F5 acts as a proxy, and the pool member servers always see sessions on a given VLAN coming from the same IP address (which I believe is the F5's "self IP" address on that VLAN).



    As for n-path, NAT'ed VSs, BGP routing from non-directly attached VLANs, OTV MAC address caching, asymmetric routing back through the F5... they are not using any of those. There is no routing taking place at all: everything is at layer 2.



    When a traceroute succeeds, it shows a single hop, indicating that no routing is taking place. All VIPs and servers are on the same IP subnet and VLAN. No iRules are in use for this distributed application at all.



    The F5 configuration has not changed since well before the problem began. However, along with some scheduled network and server upgrades, the customer took the opportunity to upgrade the F5 code versions as well as the F5 box itself.



    All F5 upgrades have been rolled back, to no effect. Unfortunately a great deal of the hardware and networking environment has changed, all at the same time, rather than one by one and then smoke-tested.



    The layer 2 path between the F5 and the servers is significantly different from what it was before the problem began.



    The F5 guru provided me with the F5 configuration, as below. Relevant bits have been sanitised (customer's privacy & anonymity come first), including sample node definitions, the health monitor definition used by the 'problem' pool, as well as all pool and virtual server definitions.



    node x.x.x.1 {
       screen hostname1
    }

    node x.x.x.0 {
       screen hostname0
    }

    monitor eMyApp_PROBLEMPOOL_http {
       defaults from http
       dest *:x00x
       recv "Test Passed"
       send "GET //PROBLEMPOOL/health.html\r\n"
    }

    pool {
       lb method member least conn
       monitor all eMyApp_PROBLEMPOOL_http
       member x.x.x.1:x00x
       member x.x.x.0:x00x
    }

    virtual {
       destination x.x.x.3:http
       snat automap
       ip protocol tcp
       profile http tcp
       vlans eMyApp-x.x.x.8-25 enable
    }

  • Hamish:
    VS == Virtual Server... The config line 'virtual' defines a VS with IP x.x.x.3 and port 80 (HTTP). Using SNAT, looks fine.



    If it all looks OK between the VS and the client, then it becomes a little more difficult to debug. The SNAT makes it difficult... The way I usually try to debug this varies... But you could try a custom iRule that SNATs to a dedicated IP address on a 1:1 basis (one client to one dedicated IP). Then tcpdump for the client-VS traffic and the SNAT-poolmember traffic simultaneously.
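    A minimal sketch of the kind of debug iRule Hamish is describing. The client address and dedicated SNAT address here are hypothetical placeholders; apply something like this to the affected VS only while debugging:

```tcl
# Hypothetical debug iRule: give one client under test its own SNAT
# address so its server-side flows stand out in a tcpdump capture.
# 10.0.0.50 (client under test) and 10.0.0.99 (dedicated SNAT IP,
# which must be routable on the server VLAN) are placeholders.
when CLIENT_ACCEPTED {
    if { [IP::client_addr] equals "10.0.0.50" } {
        snat 10.0.0.99
    }
}
```

    With the 1:1 mapping in place, filtering the server-side capture on 10.0.0.99 isolates exactly the flows belonging to the problem client.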



    Do you have a complicated layer 2 network? Is it possible you're having spanning-tree problems?



  • Plumtree: it's a little bit hard to ferret this out from the information here, but on the surface it sounds like you may be running into PAWS-dropped SYNs (see RFC 1323), which exhibit the same behavior and are really hard to track down. SNAT scenarios are exactly where you'd run into them, too. If you all are still running into this and you've got a capture, hit my inbox and we'll take this off-forum.



    The basic idea here is that you've got the same 4-tuple (src_ip:src_port, dst_ip:dst_port) being used for these flows, but the timestamp slid backwards. The stack expects the TS option to always increase, "monotonically", in RFC speak, over the course of a flow. If it sees the same 4-tuple with a timestamp in the past it'll do a silent discard and show the same behavior. Here's a quote from the RFC:



    "The basic idea is that a segment can be discarded as an old duplicate if it is received with a timestamp SEG.TSval less than some timestamp recently received on this connection."
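    The rule in that quote can be sketched as a tiny check. This is a simplification: a real stack tracks TS.Recent per connection and compares timestamps modulo 2^32 to handle wraparound, which this sketch ignores.

```python
def paws_reject(seg_tsval: int, ts_recent: int) -> bool:
    """Simplified RFC 1323 PAWS test: a segment is discarded as an old
    duplicate if its timestamp value (SEG.TSval) is less than the most
    recent timestamp seen on this connection (TS.Recent). Real stacks
    do this comparison modulo 2**32; this sketch does not."""
    return seg_tsval < ts_recent

# A SYN reusing a 4-tuple whose timestamp slid backwards (e.g. a SNAT
# reusing a source port on behalf of a host with an earlier clock) is
# silently dropped:
print(paws_reject(seg_tsval=1000, ts_recent=5000))  # True  -> silent discard
print(paws_reject(seg_tsval=6000, ts_recent=5000))  # False -> accepted
```

    The "silent" part is what makes this so painful to diagnose: the receiver sends nothing back, so from the client's side it looks exactly like the random timeouts described above.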



    And that discard is the killer, because it's silent and you'll see bizarro stalls that last for minutes. One way to rule this in or out is define specific SNAT addresses for the hosts having the problem and see if it goes away.



    PS: I suspect that this PAWS problem can be worse in VMware environments because clock skew is so common. It'll affect those timestamps and potentially increase the chances you'll hit this with automap snats.
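    If the pool members or clients are Linux (an assumption, since the thread doesn't say), two quick ways to test this theory from the host side:

```shell
# Look for segments the kernel has dropped for timestamp/PAWS reasons
# (the exact counter wording varies by kernel version).
netstat -s | grep -i timestamp

# Diagnostic only, not a fix: disabling TCP timestamps disables PAWS.
# If the stalls stop afterwards, PAWS was likely the culprit.
# Re-enable it once you're done testing.
sysctl -w net.ipv4.tcp_timestamps=0
```

    Both require root; the sysctl change should be reverted after the test, since timestamps also drive RTT measurement.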



    • fcaminos_181193:
      Hello Matt, I think I have run into the problem you describe. Can you help, or give your thoughts? I have a new post. Thank you.