Forum Discussion
mahnsc
Nimbostratus
Jan 13, 2009
Selectively Disabling Keep-Alives
This is probably going to be another odd-ball post but there is a good reason for it.
We have a site with a recently discovered bug that sends our app servers into full garbage collection mode for very, very long periods of time when specific customer conditions are met. While reproducing the problem and investigating further, we've learned that after a period of 5 minutes (and yes, people really do wait that long on this site for a response), the BIG-IP issues a connection reset and the browser then retransmits the POST. Five minutes later there is another reset followed, in some cases, by another retransmit. This appears to be standard browser behavior.
I don't want to disable keep-alives wholesale--I am wondering if there is a way to disable keep-alives on POSTs using an iRule. But I'm more concerned with whether this is something I shouldn't even think about attempting.
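For reference, the kind of iRule I have in mind would look something like the following. This is just an untested sketch (the flag name is purely illustrative): it flags POST requests and marks the matching response with Connection: close so the browser doesn't try to reuse the connection.

when HTTP_REQUEST {
    # Remember whether this request was a POST (illustrative flag name)
    set close_after_post [expr {[HTTP::method] eq "POST"}]
}
when HTTP_RESPONSE {
    if { $close_after_post } {
        # Ask the client not to keep the connection alive after this response
        HTTP::header replace Connection close
    }
}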
11 Replies
- hoolio
Cirrostratus
I thought most browsers would show 'page cannot be displayed' or some other error when they receive a RST. I don't think there is an automatic retry.
What criteria would you want to use for disabling keep-alives? I would guess this wouldn't work as expected, because the client would establish a new TCP connection to make a new request after receiving a RST on a past request.
Can you explain more on what you think the problem is and how you'd like to try to fix it?
Thanks,
Aaron
- mahnsc
Nimbostratus
So far I've only been able to find two links that describe the same symptom; however, I'm seeing it not only in IE6 but also in IE7 and Firefox 3.0.5:
http://www.experts-exchange.com/Software/Internet_Email/Web_Browsers/Q_20915496.html
http://www.coderanch.com/t/68471/BEAWeblogic/Resends-same-request-after-minutes
- hoolio
Cirrostratus
What web/app servers are you using? Are you load balancing the client to web connections as well as the web to app connections using two separate VIPs?
Is the client retrying the request or is the application automatically retrying the request? The second link you provided suggests the app is resending the request (not the browser). They suggest a solution for WebLogic:
By default, WLIOTimeoutSecs for the WebLogic Plugin (which you are using for IPlanet) is configured to 300 seconds. After 300 seconds (5 minutes) the plugin will interpret the request as being "hung" and try to send another. This is what is causing your multiple requests.
Your options are...
1) Up the WLIOTimeoutSecs value.
2) Change the paradigm you are using so that a request doesn't take 5 minutes. This is generally a bad user experience anyways. For example, you could execute the logic asynchronously and return immediately to the user with a message to either check back later for results or email them with a link to the results.
I'd suggest you try to determine exactly what the failure is before trying to solve it with an iRule or changes to the application. You can use a browser plugin like HttpFox for Firefox or Fiddler for IE to see what the client is sending, and you can use an iRule to log the request/response headers. It might also help to enable debug logging on the application and check the logs there as well. If the systems are in production, you might want to enable this logging only during a maintenance window.
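Something along these lines would dump the headers on each request and response (a rough sketch only, and you'd want to scope it to a test virtual server):

when HTTP_REQUEST {
    # Log the request line and each request header
    log local0. "[IP::client_addr]: [HTTP::method] [HTTP::uri]"
    foreach name [HTTP::header names] {
        log local0. "[IP::client_addr]: $name: [HTTP::header value $name]"
    }
}
when HTTP_RESPONSE {
    # Log the status code and each response header
    log local0. "[IP::client_addr]: status [HTTP::status]"
    foreach name [HTTP::header names] {
        log local0. "[IP::client_addr]: $name: [HTTP::header value $name]"
    }
}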
Aaron
- mahnsc
Nimbostratus
BigIP >> Two Apache Servers >> mod_jk >> 5 JBoss Application Servers
We did use HttpFox and HttpWatch under Firefox and Internet Explorer respectively, and the retransmit doesn't show up in either utility. However, the tcpdump between the browser and the BigIP shows a retransmit, and the mod_jk logs show that a second POST to the same resource was performed even though I did not refresh the browser or resubmit the transaction manually. HttpFox shows the initial POST but then nothing until 10 minutes later, when the connection is reset the second time.
An iRule logging request and response headers isn't going to help because there is no response; the app server is hung. The request headers for the first POST are:
POST /xxxx/xx/createDocLink?templateId=6005F8056AA0EB1CE0409E0AE8125ECE&visibility=Everyone HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer:
Cookie: JSESSIONID=
Content-Type: multipart/form-data; boundary=---------------------------41184676334
Content-Length: 282
In my trace, this POST occurs at 16:33:34.322991. At 16:38:34.790165, the BigIP issues the connection reset. At 16:38:34.951729, the browser sends the POST again. At 16:43:35.189957, the BigIP issues another connection reset. **At no time is a response other than a connection reset ever received.**
The Request Headers for the retransmitted POST are:
POST /xxxx/xx/createDocLink?templateId=6005F8056AA0EB1CE0409E0AE8125ECE&visibility=Everyone HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer:
Cookie:
Content-Type: multipart/form-data; boundary=---------------------------41184676334
Content-Length: 282
Thanks for taking the time to think about this!
- mahnsc
Nimbostratus
If this helps at all, here is what I've gleaned from the RFC for HTTP/1.1:
8.2.4 Client Behavior if Server Prematurely Closes Connection
If an HTTP/1.1 client sends a request which includes a request body, but which does not include an Expect request-header field with the "100-continue" expectation, and if the client is not directly connected to an HTTP/1.1 origin server, and if the client sees the connection close before receiving any status from the server, the client SHOULD retry the request. If the client does retry this request, it MAY use the following "binary exponential backoff" algorithm to be assured of obtaining a reliable response:
1. Initiate a new connection to the server
2. Transmit the request-headers
3. Initialize a variable R to the estimated round-trip time to the server (e.g., based on the time it took to establish the connection), or to a constant value of 5 seconds if the round-trip time is not available.
4. Compute T = R * (2**N), where N is the number of previous retries of this request.
5. Wait either for an error response from the server, or for T seconds (whichever comes first)
6. If no error response is received, after T seconds transmit the body of the request.
7. If client sees that the connection is closed prematurely, repeat from step 1 until the request is accepted, an error response is received, or the user becomes impatient and terminates the retry process.
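As a quick illustration of the arithmetic in step 4 (a throwaway Tcl snippet, assuming the client fell back to the constant R = 5 seconds because no round-trip estimate was available):

# Illustrative only: how long the client waits before resending the
# request body on each successive retry, per RFC 2616 section 8.2.4
set R 5
foreach N {0 1 2 3} {
    set T [expr {$R * int(pow(2, $N))}]
    puts "retry $N: wait up to $T seconds before transmitting the body"
}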
What I'm trying to figure out is whether there is a way to prevent the client from retrying the request for POST methods.
- mahnsc
Nimbostratus
So the subject of this post should probably be renamed to "Selectively Disabling Retransmissions".
- hoolio
Cirrostratus
Interesting... I hadn't read that part of the RFC before. I'm surprised that neither HttpFox nor Fiddler would show the retry. Are you running tcpdump on the client or on the LTM? Do you see the retried request on the client VLAN? If you run a capture on the client with a tool like Wireshark, do you actually see the request being sent from the client?
Another thing to check is if you have the pool's 'Action on Service Down' set to Reselect. I'm not sure whether this action would be taken if the pool member was marked down by a monitor or if it's also used when the pool member takes longer than the idle timeout to respond. If the latter was true and you had Reselect enabled, LTM would resend the request to a new pool member.
I still don't remember ever seeing a browser automatically retry a request after receiving a RST. If that is indeed what is happening here, is the problem then that the server takes more than the LTM's idle timeout (default of 5 minutes) to start sending the response, so the client retries the request and eats up server resources? If so, can you test by extending the idle timeout on the client TCP profile to a bit longer than the time you expect the server to take to respond? Tweaking the idle timeout might be more ideal than blocking the client's subsequent retries. If modifying the TCP profile fixes the problem, you could consider using an iRule to dynamically extend the idle timeout for these specific types of requests. You can do this using IP::idle_timeout: http://devcentral.f5.com/wiki/default.aspx/iRules/ip__idle_timeout

when HTTP_REQUEST {
    if {$some_condition == 1}{
        log local0. "original timeout: [IP::idle_timeout]"
        IP::idle_timeout 1801
        log local0. "updated timeout: [IP::idle_timeout]"
    }
}
when SERVER_CONNECTED {
    log local0. "original timeout: [IP::idle_timeout]"
    if {$update_serverside_idle_timeout}{
        IP::idle_timeout 1802
        log local0. "updated timeout: [IP::idle_timeout]"
    }
}
Aaron
- mahnsc
Nimbostratus
tcpdump was run on the LTM. We were sniffing traffic between "everything" and the LTM as well as traffic between the LTM and the web servers. "Everything" is in quotes because we were testing against an identically configured stack rather than our production site. (Same LTM, different web and app servers, although the servers in the lab are a bit beefier than production. We have a swing environment set up as two identical silos of servers: identical numbers of web servers, app servers, database servers, etc., and we use shell scripts on the LTM to detach and attach pools to point traffic at a particular silo, but that's another long story.)
At the same time we ran the trace, the mod_jk log level on the web servers was set to 'debug'. While my colleague was running the tcpdump, I was watching HttpFox on my local machine as well as tailing the logs on the web servers, so I neglected to also run Wireshark locally.
On our test bed, when the problem occurs, the transaction never completes. The nature of this particular bug places so much data into the JVM's memory that the JVM may very well perform full garbage collections forever. In production, given the volume of traffic we have, this bug ultimately results in the web servers hitting MaxClients and no longer being able to service any kind of request. The weird thing about this particular problem, which is outside the scope of this thread but is in scope for a ticket with JBoss, is that even though only 1 of the 6 app servers is in full GC mode, threads on all application servers quickly pile up until we're out of threads on all app servers, the web servers are at MaxClients, and users are down. It's a nasty bug to say the least!
- mahnsc
Nimbostratus
Oh, one other thing. Since we're seeing the browser retransmit these POSTs after a period of time, we're thinking that these retransmissions are hastening the demise of the site. Not the root cause, but if you have several hundred users retransmitting every 5 minutes, we run out of threads faster than we can assemble the right people on a call to deal with it.
- hoolio
Cirrostratus
Do you have any way to differentiate between the initial request and the second request? I suppose you could track POST requests by client IP, and maybe user-agent plus URI, using the session table. You could remove the session table entry for that client if/when the response is received. If the client already has an unanswered request pending and sends another one from the same IP with the same user-agent to the same URI, you could send back a 503 or some other response. You would want to set a timeout when adding the session table entry so that the client would be able to make another request after the timeout expires.
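A very rough sketch of that idea, assuming a version with the table command (older versions would use the session command instead) and treating the key format, the 600-second timeout, and the 503 text as placeholders:

when HTTP_REQUEST {
    if { [HTTP::method] eq "POST" } {
        # Hypothetical key: client IP + user-agent + URI
        set pending_key "[IP::client_addr]|[HTTP::header User-Agent]|[HTTP::uri]"
        if { [table lookup $pending_key] ne "" } {
            # Same client already has this POST outstanding; reject the retry
            HTTP::respond 503 content "Your request is still being processed."
            return
        }
        # Remember the pending POST; expire it so the client isn't locked out forever
        table set $pending_key pending 600
    }
}
when HTTP_RESPONSE {
    if { [info exists pending_key] } {
        # The server finally answered; allow future POSTs from this client
        table delete $pending_key
        unset pending_key
    }
}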
Otherwise, would extending the idle timeout on earlier requests help give the server more time to answer them and prevent the client from retrying over and over again?
Aaron