Forum Discussion

sriramgd_111845
Oct 17, 2011

F5 and a pool of webservers - what settings/algorithm to use

We have a pool with a set of IIS-based webserver nodes behind our F5 LTM (BIG-IP 9.4.5 Build 1049.10 Final).

 

 

The F5 terminates SSL. The VIP uses a customized profile based on the http profile.

 

 

We use the least connections algorithm to distribute the HTTP requests from users' browsers to the webservers.
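
For reference, least connections simply hands each new request to whichever pool member currently has the fewest open connections; a minimal Python sketch of the pick (member names and counts are invented, and this is not how LTM tracks them internally):

    # Least-connections pick: the member with the fewest open connections wins,
    # regardless of how many requests each member is actually executing.
    def pick_least_connections(members):
        return min(members, key=lambda m: m["connections"])

    members = [
        {"name": "web1", "connections": 325},
        {"name": "web2", "connections": 324},
        {"name": "web3", "connections": 326},
    ]
    print(pick_least_connections(members)["name"])  # -> web2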

 

 

We thought that this would be enough to protect individual webservers from 'high requests executing' problems.

 

 

We have ~20 concurrent requests executing on a webserver during normal hours.

 

 

Looking at the F5, there are ~325 active connections for each webserver in the pool. This also matches the number I see when I run netstat on the webserver. There are ~325 connections to the client browsers (we don't use SNAT), and ~94 more connections to the F5 itself (all in TIME_WAIT).

 

 

Before we saw this, we were under the impression that these connections correspond to the number of requests the webserver is actually processing at a time, but it seems we understood that incorrectly.

 

 

So when a webserver goes bad for any reason and the number of concurrent requests executing on it climbs to, say, 100, the F5 still appears to balance on the connection counts it tracks (~325 per member?) and keeps sending requests to that webserver as usual. Its threads pile up quickly, and eventually we have to take the webserver out of the pool.

 

 

Ideally, we would like the F5 to stop sending requests based on the number of concurrent requests on the webserver, since the situation would self-correct (the slowness is due to some external resource, e.g. cache/DB, which eventually frees up as long as we don't add too many threads).

 

 

We thought of the following options:

 

1. Use the Observed algorithm, so that the F5 can route by both open connections and speed (which would definitely get slower on the affected webserver).

 

2. Use a dynamic ratio with a custom WMI monitor for the number of requests executing on each webserver (a rough sketch of the idea follows below).

 

3. Reduce some TCP timeout setting on the F5 so that the number of connections matches the number of requests actually executing. I am guessing this would come at a cost.

 

Anything else?
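
Roughly, option 2 would boil down to turning "requests currently executing" into a per-member weight. A hedged Python sketch of just that mapping (get_requests_executing is a stand-in for whatever collector we would use, e.g. a WMI query of the ASP.NET "Requests Executing" counter; names and numbers are invented):

    # Option 2 in miniature: convert "requests executing" into a dynamic
    # ratio (weight). get_requests_executing() is a placeholder for a real
    # collector such as a WMI query of the IIS/ASP.NET counters.
    def get_requests_executing(server):
        samples = {"web1": 20, "web2": 22, "web3": 100}  # web3 is struggling
        return samples[server]

    def dynamic_ratio(server, ceiling=120):
        # Fewer executing requests -> higher weight; a saturated server
        # tends toward a weight of 1 and stops attracting new traffic.
        busy = min(get_requests_executing(server), ceiling - 1)
        return ceiling - busy

    for s in ("web1", "web2", "web3"):
        print(s, dynamic_ratio(s))  # web1 100, web2 98, web3 20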

 

 

Option 2 is not something we want to do right now, due to the work involved and the dependency on the webserver's WMI.

 

 

Any advice on this would be appreciated.

 

 

Thanks,

 

Sriram

 

 

  • Always a few ways to tackle a problem like this, and it really comes down to the variables in your environment.

     

     

    Before we get into it, what is your "Action on Service Down" setting on the Pool? And what Monitor(s) are you using?
  • Thanks for taking an interest.

     

     

    Action on Service Down -> None

     

    Health Monitor -> Active -> http

     

    Availability -> All Health Monitor(s)

     

     

    The http monitor configuration (under the Advanced dropdown in the web UI) is:

    Interval: 5 seconds
    Timeout: 16 seconds
    Manual Resume: No
    Send String: GET /
    Receive String:
    User Name:
    Password:
    Reverse: No
    Transparent: No
    Alias Address: * All Addresses
    Alias Service Port: * All Ports

     

  • Hi sriramgd,

     

     

    Your situation is tricky. Since I am not aware of your exact circumstances, the most I can give you is a suggestion to alter your configuration.

     

     

    First, if you are using IIS, then the thread counts should be directly tied to the application pool used by the website. If you have, say, 2 servers, I would suggest "duplicating" the website on the same server, with a different application pool used for the second instance (it can reference the same code as the original website, which keeps the administrative overhead of the change down). Then add the duplicate websites into the pool to effectively double the number of pool members.

     

     

    Website.Server.Pool:

     

    10.10.10.10:80 - Server 1 - Original Website

     

    10.10.10.10:81 - Server 1 - Duplicate Website

     

    20.20.20.20:80 - Server 2 - Original Website

     

    20.20.20.20:81 - Server 2 - Duplicate Website

     

     

    Then use Priority Activation Groups to separate the Originals from the Duplicates:

     

     

    Website.Server.Pool:

     

    10.10.10.10:80 - Server 1 - Original Website - Priority Activation Group 2

     

    20.20.20.20:80 - Server 2 - Original Website - Priority Activation Group 2

     

    10.10.10.10:81 - Server 1 - Duplicate Website - Priority Activation Group 1

     

    20.20.20.20:81 - Server 2 - Duplicate Website - Priority Activation Group 1

     

     

    Note: the higher the Priority Activation Group number, the higher the preference.

     

     

    Then set the Priority Activation Group setting to "Less than ... 2" Members Available. This will make sure that at least 2 of the 4 are always in service.

     

     

    Then set the Connection Limit on each member to a level your servers can recover from.

     

     

    Website.Server.Pool:

     

    10.10.10.10:80 - Server 1 - Original Website - Priority Activation Group 2 - Connection Limit 100?

     

    20.20.20.20:80 - Server 2 - Original Website - Priority Activation Group 2 - Connection Limit 100?

     

    10.10.10.10:81 - Server 1 - Duplicate Website - Priority Activation Group 1 - Connection Limit 100?

     

    20.20.20.20:81 - Server 2 - Duplicate Website - Priority Activation Group 1 - Connection Limit 100?

     

     

    If they reach that limit they will be "removed from service" and one of the duplicates will become available to keep the application up (without overloading it). As soon as they process enough to fall below the connection limit, the original websites would become available again and the duplicates would go back into a standby mode (after they finished processing what was sent to them during the overflow).
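
    To make the overflow behaviour concrete, here is a small Python sketch of the selection idea (addresses, priorities and the limit of 100 come from the example above; the logic is a simplification for illustration, not how LTM implements it):

        # Conceptual sketch of the overflow behaviour described above.
        MEMBERS = [
            {"addr": "10.10.10.10:80", "priority": 2, "limit": 100, "conns": 0},
            {"addr": "20.20.20.20:80", "priority": 2, "limit": 100, "conns": 0},
            {"addr": "10.10.10.10:81", "priority": 1, "limit": 100, "conns": 0},
            {"addr": "20.20.20.20:81", "priority": 1, "limit": 100, "conns": 0},
        ]
        MIN_ACTIVE = 2  # the "Less than ... 2 Members Available" setting

        def available(member):
            return member["conns"] < member["limit"]

        def active_members():
            # Walk priority groups from highest to lowest, pulling in the next
            # group whenever fewer than MIN_ACTIVE members are still available.
            chosen = []
            for prio in sorted({m["priority"] for m in MEMBERS}, reverse=True):
                if len([m for m in chosen if available(m)]) >= MIN_ACTIVE:
                    break
                chosen += [m for m in MEMBERS if m["priority"] == prio]
            return [m for m in chosen if available(m)]

        # While both :80 members are under their limit, only they take traffic;
        # once one of them hits 100 connections, the :81 group is activated too.
        print([m["addr"] for m in active_members()])
        MEMBERS[0]["conns"] = 100
        print([m["addr"] for m in active_members()])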

     

     

    This is just a suggestion that you will need to test and tweak to fit your situation, but I think it might help you resolve your overload issue.

     

     

    Hope this helps.

     

  • There are a few things to consider - to be honest I don't think you need to mess with the TCP settings on the BigIP here. It's much more likely related to your actual workload, config, and monitoring setup - at least that's my gut feeling.

     

     

    1) If request concurrency is an issue you may want to reconsider your monitoring strategy - remember that the monitor requests themselves can exacerbate situations where the server is struggling to keep up, especially in I/O-bound situations like this. One thing I'd suggest is putting an explicit close header in the monitor's send string, e.g. GET / HTTP/1.1\r\nHost: yoursite\r\nConnection: Close\r\n\r\n (there is a rough sketch of the resulting probe after this list), which will hopefully help free up a few threads for requests. I don't know if this will help you, but it's definitely worth doing. Also, pull the monitor interval back to something less aggressive and it'll help free things up a bit.

     

     

    2) Consider using oneconnect with a /32 mask. This will help with your top-end thread counts.

     

    3) Consider using passive monitors, with a fallback (requires v10 or higher though).

     

    4) Pool level connection limits, as suggested above.
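
    For what it's worth, here is a rough Python sketch of what the probe from 1) amounts to once an explicit close header is in place (the host name and the expected status line are placeholders, not your real values):

        import socket

        # Send string with an explicit close so the server releases the thread
        # (and the connection) as soon as it has answered the probe.
        SEND_STRING = b"GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: Close\r\n\r\n"

        def probe(host="www.example.com", port=80, timeout=16):
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(SEND_STRING)
                status_line = s.recv(1024).split(b"\r\n", 1)[0]
            # A healthy member answers with something like "HTTP/1.1 200 OK".
            return status_line.startswith(b"HTTP/1.1 200")

        if __name__ == "__main__":
            print(probe())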

     

     

    Here's the trick with least connections in a situation like this: you've described a scenario where each web server has roughly the same number of connections, and all (or most) of them are in some I/O bound connection state. In this situation, which server should the algorithm choose from a connection count standpoint, if they're all essentially the same? It sounds like this isn't the ideal LB method for this use case. Maybe look into fastest, observed, etc.
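
    To make the contrast concrete, a "fastest"-style pick keys off measured response times rather than connection counts, so a degraded member stops winning even when every connection count looks identical. A toy Python sketch with invented numbers (not LTM's actual metric):

        def pick_fastest(members):
            # Pick by observed response time instead of by open connections.
            return min(members, key=lambda m: m["avg_response_ms"])

        members = [
            {"name": "web1", "connections": 325, "avg_response_ms": 180},
            {"name": "web2", "connections": 325, "avg_response_ms": 175},
            {"name": "web3", "connections": 325, "avg_response_ms": 2400},  # degraded
        ]
        print(pick_fastest(members)["name"])  # -> web2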

     

     

    At the end of the day though, this is a problem that will probably need to be solved in the architecture itself, as opposed to on the BigIP. BigIP will help you deal with the situation extremely intelligently, but it can't ultimately solve it. You point out that adding more threads just exacerbates the bottleneck at the DB or the cache, so adding servers won't help either. Ultimately you'll need to figure out how to architect for this:

     

     

    -- Move static stuff off of the back end (think about enabling caching on BigIP). This frees up some resources.

     

    -- Fix the I/O constraints behind the servers, if at all possible (read-only DB caches, memory stashing ala memcached, etc.)

     

    -- Move to a non-blocking async I/O model if possible. This means a thread isn't held blocked while waiting for the response to be written back out. In a situation like this a queue would be helpful.
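
    As a tiny illustration of that model, here is a Python asyncio sketch where requests wait in a queue and a small, fixed set of workers drains it, so a slow dependency causes queueing instead of a pile-up of blocked threads (purely illustrative):

        import asyncio

        async def handle(item):
            await asyncio.sleep(0.1)  # stand-in for a slow DB/cache call
            return f"done {item}"

        async def worker(queue, results):
            while True:
                item = await queue.get()
                results.append(await handle(item))
                queue.task_done()

        async def main():
            queue, results = asyncio.Queue(), []
            workers = [asyncio.create_task(worker(queue, results)) for _ in range(5)]
            for i in range(20):
                queue.put_nowait(i)
            await queue.join()  # wait until every queued request is handled
            for w in workers:
                w.cancel()
            print(len(results), "requests handled by 5 workers")

        asyncio.run(main())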

     

     

    Just some suggestions, which may or may not make sense or be feasible in your environment :)

     

     

    --Matt Cauthorn