Forum Discussion
Runaway Connection Counts on Virtual Server
We have a BIG-IP 1500 running 9.4.3 which handles about 100 VIPs. The LTM does SSL offload for the back-end servers, which are mostly Tomcat/HTTP. Throughput is around 60 Mbps and average SSL TPS is about 125. The single busiest VIP accounts for 90 percent of the traffic.
Over the past 6 weeks, I have watched the number of active connections on this VIP climb steadily; it is now almost 50K. I can display the connections using the "bigpipe conn server 10.30.10.135:443 show" command like so:
[doxenhandler@sjc1-bigip-01:Active] netops b conn server 10.30.10.135:443 show | head
172.16.64.202:64355 <-> 10.30.10.135:https <-> any6:any tcp
172.16.85.187:4433 <-> 10.30.10.135:https <-> any6:any tcp
172.16.87.193:14042 <-> 10.30.10.135:https <-> 10.30.124.23:10011 tcp
I've done some research, and it appears that the connections displaying "any6:any" in the third column are waiting for the LTM to make a load-balancing decision. Using "grep -c" I determined that 48527 connections are in this state, while only 1297 have a complete connection to the back end (those with an IP address in the third column).
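For reference, the counting was just a grep against that same output, along the lines of:
[doxenhandler@sjc1-bigip-01:Active] netops b conn server 10.30.10.135:443 show | grep -c "any6:any"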
The back-end servers are not overloaded, and resources (memory, cpu) on the LTM are moderately utilized.
Has anybody seen connection counts climb like this? Any thoughts on where to look for potential performance bottlenecks?
11 Replies
- What_Lies_Bene1
Cirrostratus
Does bigtop show the same connection counts? Anything changed recently?
Any profile timeouts set to indefinite?
Could you post a suitably 'redacted' Virtual Server and associated profile and health monitor output?
- danielo303_1961
Nimbostratus
Thanks for taking a look! Here are my responses:
bigpipe virtual shows 52K connections:
[doxenhandler@sjc1-bigip-01:Active] netops b virtual Acme_Enterprises show
VIRTUAL ADDRESS 10.30.10.135 UNIT 1
| ARP enable
| (cur, max, limit, tot) = (52858, 58433, 0, 1.584G)
| (pkts,bits) in = (70.66G, 518.8T), out = (47.41G, 93.60T)
bigtop shows fewer connections, apparently just the established ones:
| bits since | bits in prior | current
| Feb 26 14:36:13 | 4 seconds | time
BIG-IP ACTIVE |---In----Out---Conn-|---In----Out---Conn-| 11:39:30
sjc1-bigip-01.sjc1.ca 998.0T 252.9T 6.094G 327.3M 71.15M 1309
Anything changed recently: (a) we tried adding a third interface to the LTM about 6 weeks ago. It created asymmetric routing, which broke things, so we shut the interface down. (b) The number of clients connecting to this VIP has increased about 40% over the last couple of months.
Profile timeouts set to indefinite: no
virtual server
[doxenhandler@sjc1-bigip-01:Active] netops b virtual Acme_Enterprises list
virtual Acme_Enterprises {
   snatpool CIQInfrastructure_SNAT_POOL
   pool Acme_Enterprises
   destination 10.30.10.135:https
   ip protocol tcp
   vlans loadbalance enable
   rules
      Strip_Incoming_X-Forwarded_For
      Acme_Enterprises_iRule
   profiles
      Redirect_HTTPS_2_HTTP
      sslprofile.acme.com
      tcp
}
pool configuration
[doxenhandler@sjc1-bigip-01:Active] netops b pool Acme_Enterprises list
pool Acme_Enterprises {
   monitor all Uploader_Health
   members
      10.30.124.10:10011
         session disable
      10.30.124.22:10011
      10.30.124.23:10011
}
monitor
[doxenhandler@sjc1-bigip-01:Active] netops b monitor Uploader_Health list
monitor Uploader_Health {
   defaults from http
   recv "HTTP/1.1 200 OK"
   send "GET /Uploader/status HTTP/1.0\n\n"
}
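In case it's useful, the request that monitor sends can be reproduced by hand from the LTM against one of the pool members with something roughly like:
curl -i http://10.30.124.23:10011/Uploader/status
(the monitor just looks for the "HTTP/1.1 200 OK" status line in the response).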
iRule
[doxenhandler@sjc1-bigip-01:Active] netops b rule Acme_Enterprises_iRule list
rule Acme_Enterprises_iRule {
    when HTTP_REQUEST {
        switch -glob [HTTP::uri] {
            "/datatest*" {
                pool Acme_Datatest
                return
            }
            default {
                if { ([HTTP::uri] starts_with "/cpm") or !([HTTP::method] equals "POST") } {
                    discard
                }
                pool Acme_Enterprises
                return
            }
        }
    }
}
- What_Lies_Bene1
Cirrostratus
Hmmm. This article suggests those entries relate to OneConnect; is that possible? I don't see anything obvious in your configuration. http://support.f5.com/kb/en-us/solu...r=25785594
I personally suspect your failed change is responsible for the issue and a reboot/restart would fix it.
On a related note, the asymmetric routing issue you previously had could probably be overcome if you disable VLAN Keyed Connections.
- danielo303_1961
Nimbostratus
Steve,
Appreciate your suggestions. Rebooting came to my mind as well, will see if I can get my manager to agree to it!
Will check out the asymmetric routing fix as well.
Best,
~Daniel
- What_Lies_Bene1
Cirrostratus
You're welcome. Please do post back on whatever fixes the issue.
If you need more help around the asymmetric routing, create a new post and I'll provide more details.
- danielo303_1961
Nimbostratus
Update on the situation:
I scheduled a maintenance outage to reboot the LTM, but got pulled into another issue: the standby was not properly synced. I tried failing over last night but had to fail back before I could reboot the primary. The good news is that I fixed the config-sync problem, so I hope to schedule a new maintenance window, and I'm more confident that failover will work and I'll be able to reboot the primary.
The connections did come down during the maintenance last night, but they quickly climbed back up over 40K, which still seems quite high. It may just be the increase in traffic: these are mostly mobile devices uploading data, so the sessions are quite brief (1-3 seconds). It may just be taking longer for them to time out of the connection table.
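If it is a timeout thing, I figure the place to check is the idle timeout on the tcp profile this VIP uses; listing the default profile should show it (300 seconds out of the box, if I remember right), something like:
[doxenhandler@sjc1-bigip-01:Active] netops b profile tcp tcp list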
Will try to post another update after I get this rebooted.
Cheers!
-Daniel
- What_Lies_Bene1
Cirrostratus
Thanks for the update, Daniel. If the reboot doesn't fix the issue, lowering idle timers might, but let's see. Cheers
- danielo303_1961
Nimbostratus
Rebooted both members of the failover pair about a week ago and the connections are still climbing. Currently the LTM has ~105K connections, of which this one VIP accounts for 103K. Not sure how it's even possible to go above 65,000, but we are. We are considering creating a second VIP pointing to the same pool and doing DNS round-robin on the front end to distribute these high connection counts.
Other than the scary connection counts, everything seems to be working okay now.
- What_Lies_Bene1
Cirrostratus
I'd definitely recommend you open a case with F5 Support; although the impact is low, this really shouldn't be happening.
- danielo303_1961
Nimbostratus
Wanted to provide some follow-up on this issue. I was able to bring the active connections down by tweaking the TCP profile used by this VIP. I created a new tcp profile, "tcp_timeout", and lowered the idle timeout to 120 seconds from the default of 300. After applying this profile, my active connections dropped from around 120K to about 30K. TMM memory usage also dropped from 400M to 200M, which is a good thing, too.
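For anyone hitting the same thing, the new profile is nothing fancy; listed out it looks roughly like this (typed from memory, so treat the exact output as approximate):
[doxenhandler@sjc1-bigip-01:Active] netops b profile tcp tcp_timeout list
profile tcp tcp_timeout {
   defaults from tcp
   idle timeout 120
}
I then swapped it in for the default tcp profile in the VIP's profile list.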
I probably need to investigate further why so many idle connections were hanging around, but at least I've reduced the impact on my LTM without affecting the application.