
Forum Discussion

RyanFeiock_2247
Sep 30, 2015

Action On Service Down - expected results

I came across this article about using BIG-IP for maintenance:

https://support.f5.com/kb/en-us/solutions/public/13000/300/sol13310.html

I would like to work it into our code deployment process. I have an automated deployment workflow where I can insert the "modify ltm node" commands and take a server out of rotation while I apply the updated code. This process "works", but during my load testing, if I run an application deployment I get about a 4% failure rate, meaning that out of the 254 tests that ran during the load test, around 10 failed. The most common error I get is "An existing connection was forcibly closed by the remote host".
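Roughly, the step I am inserting looks like this (just a sketch; the node name and the deploy script are placeholders, and the commands are assumed to run from the BIG-IP bash shell):

    # take the node out of rotation (Forced Offline)
    tmsh modify ltm node NodeName state user-down
    # give in-flight connections time to bleed off
    sleep 30
    # apply the updated code (placeholder for the real deploy step)
    ./deploy_updated_code.sh
    # return the node to service
    tmsh modify ltm node NodeName state user-up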

 

So far I have tried the following:

  • Set the node state to "Forced Offline" using the following command:

    • modify ltm node NodeName state user-down

  • Set the node state to "Disabled" using the following command:

    • modify ltm node NodeName session user-disabled

  • Tried both of the above using the following "Action On Service Down" settings:

    • None
    • Reject
    • Reselect

The best combination I have found so far is setting the node state to Forced Offline with Action On Service Down set to Reject, but as I mentioned above, it still results in about a 4% failure rate.

 

So my question is: is this 4% failure rate something my users should simply expect, or should the system be able to do this with zero errors? Also, is my approach a recommended use of the BIG-IP, or is there a better way to approach this?

 

7 Replies

  • Disabling a node means that the node can only handle persistent (requests for which there's a persistence table entry) and active (TCP) connections. Forcing the node down means that it can only handle active TCP connections. Neither option will "kick" an active TCP connection from the node. So your errors are very likely coming from the fact that you're simply turning off the server before active TCP connections have had a chance to bleed off. For HTTP traffic this shouldn't normally be very long. You might want to consider adding a looping check to make sure there are no active connections before moving on to turning off the server.
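    A minimal version of that looping check might look like the following (a sketch only: it assumes the command runs from the BIG-IP bash shell, NodeName is a placeholder, and the field-fmt output of "tmsh show ltm node" is assumed to expose a serverside.cur-conns counter, which is worth verifying on your version):

        # poll until the node reports zero active server-side connections
        while true; do
            conns=$(tmsh show ltm node NodeName field-fmt | awk '/serverside\.cur-conns/ {print $2; exit}')
            [ "${conns:-0}" -eq 0 ] && break
            sleep 5
        done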

     

  • Thanks for the response. I had considered that and did indeed add a wait to my process (I should have mentioned that before). While it is running, I have the Pool Statistics window open, which shows the current connection count for the node. That value will sit at 0 for quite a while (10-30 seconds) before any changes occur on the server. I assume that if it is showing 0, the connections have successfully bled off, correct?

     

  • Correct. If there are no active TCP connections and the node is forced down, you're safe to turn it off.

     

  • Gotcha. So given the timing of the errors, I think they occur while a node is being forced offline (as opposed to during some other deployment activity). I am fairly certain of this because I can run a load test to generate load on the servers and then manually force nodes offline one at a time via the F5 Management UI. Most of the time there are no errors, but around 4% of the time I encounter them.

     

    BTW, load in this scenario is about 2100 connections spread across 4 pools: 2 pools on the web tier and 2 on the app tier. Each pool has 3 nodes (physical servers), so there are 3 web servers and 3 app servers. Are you able to force a node offline under similar load without errors?

     

  • I'm not sure I understand the question. If there's an active TCP connection between the F5 and a server, and you remove that server before those connections have bled off, you're going to get an error regardless of the Disabled/Forced Offline setting. These settings simply control whether new TCP connections can be established.

     

  • I was under the impression that I could issue a Force Offline command to the BIG-IP and it would redirect traffic to the other nodes in the pool gracefully, without errors. In my experience with Reject, this is not the case: most connections are redirected gracefully, but a small number are killed and cause errors for the users. I was asking whether this is expected behavior or whether I am doing something wrong on my end.

     

  • So Reject is definitely causing your "forcibly closed by remote host" errors. If you set Action On Service Down to None and then wait for connections to bleed off, you should get zero errors.
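
    Put together, that gentler sequence might look like the following sketch (assuming the pool-level tmsh attribute for Action On Service Down is service-down-action, where the GUI's Reject corresponds to reset; the pool and node names are placeholders):

        # make sure the pool does not reset connections when a member goes down
        tmsh modify ltm pool WebPool service-down-action none
        # stop sending new connections to the node
        tmsh modify ltm node NodeName session user-disabled
        # wait for serverside.cur-conns to reach 0 (see the polling loop above),
        # deploy the updated code, then re-enable the node:
        tmsh modify ltm node NodeName session user-enabled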