Forum Discussion
Brad_10289
Nimbostratus
Apr 05, 2012
PROBLEM: Pool Member Won't Work Through Big IP LTM
Hi all, I was wondering if anyone might have some insight into a strange issue I'm seeing in our environment. I have had zero success finding a solution or a related post about it.
PROBLEM: Client can seemingly connect to Pool Member on HTTP (Port 80) via Virtual Server, but Pool Member will not honor GET request. Other Pool Member behaves perfectly, as do all other Pool Members in other Pools/Virtual Servers. However, clients are able to connect directly to the actual IP of the offending Pool Member over HTTP.
BACKGROUND: Offending Pool Member is running on Windows 2003 server as a guest OS under VMWare ESX. HTTP server software/version is unknown by me at this point. Big IP version is 10.x.x.
TROUBLESHOOTING: Created new Virtual Server with both Pool Members for isolated testing so as not to affect production environment. Confirmed using WGET under Windows Client that it is unable to connect to offending Member (eventually produces Read Error, Server Reset Connection error). Confirmed WGET to other Member works flawlessly. Confirmed using direct IP of offending Member works. All tests 100% consistent in result.
Logged into console via SSH and confirmed able to ping real IP of offending member. Likewise, had network administrator ping F5 and traceroute to F5 from offending member.
Confirmed able to TELNET to port 80 to real IP and working Member and successfully able to simulate GET request and receive HTML in response. When attempting to connect to offending member via TELNET through Virtual Server, connection is made but there is no response to GET request and connection eventually closes on its own.
From F5 shell via SSH, was able to successfully make GET requests via TELNET on port 80 of offending member via Virtual Server IP (as well as other servers/IPs).
This to me is suggesting there must be something going on between F5 and the client for this one specific pool member, since F5 can seemingly connect to offending member via virtual IP. NAT/SNAT are enabled.
Also tried to delete and re-add pool member to no avail.
Any input/advice greatly appreciated. Thank you.
ETA: Although this should go without mentioning given the information in bold above, I neglected to mention that the built-in default HTTP monitor does recognize the offending member is up and active. I may go ahead and try a modified monitor that looks for a specific response, but as mentioned, I can communicate with the server via TELNET on an SSH connection, so I don't believe there's an issue between F5 and the member.
13 Replies
- hoolio
Cirrostratus
Hi Brad,
I would have guessed that you don't have SNAT enabled and the server's default gateway is pointing to something other than LTM. But apparently you have SNAT enabled. Can you use tcpdump to check what's happening on the client and serverside connections? You can check SOL411 for details on tcpdump. If you need help analyzing the output you can open a case with F5 Support.
sol411: Overview of packet tracing with the tcpdump utility
http://support.f5.com/kb/en-us/solutions/public/0000/400/sol411.html
tcpdump -ni 0.0 -s0 host 1.1.1.1 or host 2.2.2.2
(where 1.1.1.1 and 2.2.2.2 are the client and server addresses)
Aaron
- Brad_10289
Nimbostratus
Thank you, Aaron. I will try the tcpdump. Already opened a call with F5 but figured it couldn't hurt to try here too. At least I'll be ready now if they ask for the dump.
If anything, at least you alleviated my concerns that I'm an idiot or missing something obvious here.
- Brad_10289
Nimbostratus
I ran the dump twice, once against the member that responds properly (and it is in production so the dump may include activity from other clients) and once against the bad member.
.99 = my client PC
.142 & .144 = self IPs for F5
.147 = virtual server
.16 = good member
.153 = bad member
GOOD MEMBER
09:57:22.933063 IP XXX.XXX.XXX.142.37907 > XXX.XXX.XXX.16.http: S 3393865625:3393865625(0) win 5840 out slot1/tmm1 lis=
09:57:22.933674 IP XXX.XXX.XXX.16.http > XXX.XXX.XXX.142.37907: S 4287057984:4287057984(0) ack 3393865626 win 16384 in slot1/tmm1 lis=
09:57:22.933991 IP XXX.XXX.XXX.142.37907 > XXX.XXX.XXX.16.http: . ack 1 win 46 out slot1/tmm1 lis=
09:57:22.934074 IP XXX.XXX.XXX.142.37907 > XXX.XXX.XXX.16.http: P 1:8(7) ack 1 win 46 out slot1/tmm1 lis=
09:57:23.099200 IP XXX.XXX.XXX.16.http > XXX.XXX.XXX.142.37907: . ack 8 win 65528 in slot1/tmm1 lis=
09:57:23.604580 IP XXX.XXX.XXX.99.51251 > XXX.XXX.XXX.147.http: S 870178715:870178715(0) win 8192 in slot1/tmm1 lis=
09:57:23.604624 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51251: S 1152696155:1152696155(0) ack 870178716 win 4380 out slot1/tmm1 lis=NOS_Test
09:57:23.604887 IP XXX.XXX.XXX.99.51251 > XXX.XXX.XXX.147.http: . ack 1 win 256 in slot1/tmm1 lis=NOS_Test
09:57:23.608355 IP XXX.XXX.XXX.99.51251 > XXX.XXX.XXX.147.http: P 1:106(105) ack 1 win 256 in slot1/tmm1 lis=NOS_Test
09:57:23.608397 IP XXX.XXX.XXX.144.51251 > XXX.XXX.XXX.16.http: S 3589583094:3589583094(0) win 4380 out slot1/tmm1 lis=NOS_Test
09:57:23.609020 IP XXX.XXX.XXX.16.http > XXX.XXX.XXX.144.51251: S 1971215146:1971215146(0) ack 3589583095 win 16384 in slot1/tmm1 lis=NOS_Test
09:57:23.609033 IP XXX.XXX.XXX.144.51251 > XXX.XXX.XXX.16.http: . ack 1 win 4380 out slot1/tmm1 lis=NOS_Test
09:57:23.609046 IP XXX.XXX.XXX.144.51251 > XXX.XXX.XXX.16.http: P 1:106(105) ack 1 win 4380 out slot1/tmm1 lis=NOS_Test
09:57:23.613151 IP XXX.XXX.XXX.16.http > XXX.XXX.XXX.144.51251: P 1:530(529) ack 106 win 65430 in slot1/tmm1 lis=NOS_Test
09:57:23.613194 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51251: P 1:530(529) ack 106 win 4485 out slot1/tmm1 lis=NOS_Test
09:57:23.712861 IP XXX.XXX.XXX.144.51251 > XXX.XXX.XXX.16.http: . ack 530 win 4909 out slot1/tmm1 lis=NOS_Test
09:57:23.746385 IP XXX.XXX.XXX.99.51251 > XXX.XXX.XXX.147.http: R 106:106(0) ack 530 win 0 in slot1/tmm1 lis=NOS_Test
09:57:23.746419 IP XXX.XXX.XXX.144.51251 > XXX.XXX.XXX.16.http: R 106:106(0) ack 530 win 4909 out slot1/tmm1 lis=NOS_Test
BAD MEMBER
09:59:05.150720 IP XXX.XXX.XXX.99.51298 > XXX.XXX.XXX.147.http: S 261892195:261892195(0) win 8192 in slot1/tmm0 lis=
09:59:05.150744 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51298: S 225654915:225654915(0) ack 261892196 win 4380 out slot1/tmm0 lis=NOS_Test
09:59:05.151035 IP XXX.XXX.XXX.99.51298 > XXX.XXX.XXX.147.http: . ack 1 win 256 in slot1/tmm0 lis=NOS_Test
09:59:05.152322 IP XXX.XXX.XXX.99.51298 > XXX.XXX.XXX.147.http: P 1:106(105) ack 1 win 256 in slot1/tmm0 lis=NOS_Test
09:59:05.152362 arp who-has XXX.XXX.XXX.153 tell XXX.XXX.XXX.142 out slot1/tmm0 lis=
09:59:05.153029 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2929947028:2929947028(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:05.152940 arp reply XXX.XXX.XXX.153 is-at 00:50:56:84:27:44 in slot1/tmm1 lis=
09:59:05.251401 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51298: . ack 106 win 4485 out slot1/tmm0 lis=NOS_Test
09:59:08.151357 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2929947028:2929947028(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:11.351549 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2929947028:2929947028(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:14.551543 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2929947028:2929947028(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:17.751434 IP XXX.XXX.XXX.144.28570 > XXX.XXX.XXX.153.http: S 1018696290:1018696290(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:20.751355 IP XXX.XXX.XXX.144.28570 > XXX.XXX.XXX.153.http: S 1018696290:1018696290(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:22.400040 IP XXX.XXX.XXX.144.ssh > XXX.XXX.XXX.99.50902: P 3398111607:3398111675(68) ack 157644937 win 80 out slot1/tmm0 lis=
09:59:22.400513 IP XXX.XXX.XXX.99.50902 > XXX.XXX.XXX.144.ssh: P 1:37(36) ack 68 win 252 in slot1/tmm0 lis=
09:59:22.400870 IP XXX.XXX.XXX.144.ssh > XXX.XXX.XXX.99.50902: . ack 37 win 80 out slot1/tmm0 lis=
09:59:23.951673 IP XXX.XXX.XXX.144.28570 > XXX.XXX.XXX.153.http: S 1018696290:1018696290(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:27.151393 IP XXX.XXX.XXX.144.28570 > XXX.XXX.XXX.153.http: S 1018696290:1018696290(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:30.351698 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:33.351467 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:36.551418 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:39.751838 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:42.951662 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51298: R 1:1(0) ack 106 win 4485 out slot1/tmm0 lis=NOS_Test
09:59:43.992057 IP XXX.XXX.XXX.99.51319 > XXX.XXX.XXX.147.http: S 1826829106:1826829106(0) win 8192 in slot1/tmm1 lis=
09:59:43.992086 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51319: S 2526585337:2526585337(0) ack 1826829107 win 4380 out slot1/tmm1 lis=NOS_Test
09:59:43.992258 IP XXX.XXX.XXX.99.51319 > XXX.XXX.XXX.147.http: . ack 1 win 256 in slot1/tmm1 lis=NOS_Test
09:59:43.995470 IP XXX.XXX.XXX.99.51319 > XXX.XXX.XXX.147.http: P 1:106(105) ack 1 win 256 in slot1/tmm1 lis=NOS_Test
09:59:43.995533 arp who-has XXX.XXX.XXX.153 tell XXX.XXX.XXX.142 out slot1/tmm1 lis=
09:59:43.996152 arp reply XXX.XXX.XXX.153 is-at 00:50:56:84:27:44 in slot1/tmm1 lis=
09:59:43.996169 IP XXX.XXX.XXX.144.51319 > XXX.XXX.XXX.153.http: S 3086797946:3086797946(0) win 4380 out slot1/tmm1 lis=NOS_Test
09:59:44.094894 IP XXX.XXX.XXX.147.http > XXX.XXX.XXX.99.51319: . ack 106 win 4485 out slot1/tmm1 lis=NOS_Test
09:59:44.097300 IP XXX.XXX.XXX.99.51319 > XXX.XXX.XXX.147.http: R 106:106(0) ack 1 win 0 in slot1/tmm1 lis=NOS_Test
- Brad_10289
Nimbostratus
Sorry to keep responding to myself, but now I'm just going insane. While I thought I had been able to simulate the GET request via telnet from the F5 to the Virtual Host, I tried again today and found I couldn't, apparently for either member.
Then I realized I can just use the GET command (I didn't realize yesterday that it was case sensitive). Now if I ssh into the F5 and run GET against the Virtual Host, it works fine when only the good member is active in the pool. I can also GET the good server directly. But here's where it gets weird...
I can only occasionally GET from the F5 to the bad member, both through the virtual host and via its server IP. In other words, it sometimes works for a little while, but often it simply returns:
500 Server closed connection without sending any data back
And sometimes I can GET directly from the bad server IP but not through the Virtual Server. Also, I believe that even when I can't GET from the bad server IP directly, I can still successfully WGET the server directly from my client.
It's getting very hard to even decide how to troubleshoot with this being so inconsistent.
- hoolio
Cirrostratus
You can use curl to generate a valid HTTP request from the LTM command line. You can test direct to the pool members as well as to the virtual server.
curl -v 'http://1.1.1.1/path/to/file.ext?param1=value1'
If you want to make a request for the root object on the web app, you can just use:
curl -v 'http://1.1.1.1/'
It would be simpler to troubleshoot if you create a test virtual server and then tcpdump on the client and server IPs. You can create a test pool as well and only enable one server at a time to help isolate the issue.
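Because the failure described in this thread comes and goes, a repeated, timestamped connection test can make it easier to line up failures with a tcpdump capture. The following is only a minimal sketch in Python, not something posted in the thread: the function names `probe` and `probe_repeatedly` are made up for illustration, it checks only whether a TCP connection can be opened (it does not send a GET), and it assumes a Python 3 interpreter is available on whichever host you test from.

```python
import socket
import time


def probe(host, port=80, timeout=3.0):
    """Attempt a single TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No SYN-ACK arrived within the timeout (matches the retransmitted SYNs case).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError as exc:
        return "error: %s" % exc


def probe_repeatedly(host, port=80, count=5, interval=1.0, timeout=3.0):
    """Run probe() several times and return (HH:MM:SS, outcome) pairs."""
    results = []
    for i in range(count):
        results.append((time.strftime("%H:%M:%S"), probe(host, port, timeout)))
        if i < count - 1:
            time.sleep(interval)
    return results
```

Running `probe_repeatedly` once against a pool member's own IP and once against the virtual server address, from the same host, gives two lists of timestamps that can be compared side by side.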
Aaron
- Brad_10289
Nimbostratus
Thanks, Aaron. They are in a test virtual server/pool so I can experiment freely without affecting production.
I've now learned two things. First, periodically the F5 cannot reach the bad server directly, but at other times it can very easily, with no apparent reason for the difference. Even when the F5 can't access the server by IP, other clients such as my PC have no problem. Second, it appears that when going through the Virtual Host, it's the F5 that opens the HTTP connection to the server. That means I was incorrectly assuming my traffic was reaching the bad server simply because a connection was established from my client; I'm just opening a connection with the F5. It seems the F5 then has trouble communicating with the server and eventually times out. I was able to confirm this by disabling all members in the pool: I got the exact same response, only the connection was closed instantly instead of timing out.
In any case, the results of the curl tests:
VIRTUAL HOST - ONLY BAD MEMBER ACTIVE
* About to connect() to XXX.XXX.XXX.147 port 80
* Trying XXX.XXX.XXX.147... connected
* Connected to XXX.XXX.XXX.147 (XXX.XXX.XXX.147) port 80
> GET / HTTP/1.1
> User-Agent: curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: XXX.XXX.XXX.147
> Accept: */*
>
* Empty reply from server
* Connection 0 to host XXX.XXX.XXX.147 left intact
curl: (52) Empty reply from server
* Closing connection 0
VIRTUAL HOST - ONLY GOOD MEMBER ACTIVE
* About to connect() to XXX.XXX.XXX.147 port 80
* Trying XXX.XXX.XXX.147... connected
* Connected to XXX.XXX.XXX.147 (XXX.XXX.XXX.147) port 80
> GET / HTTP/1.1
> User-Agent: curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: XXX.XXX.XXX.147
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "0-87-4da6e91b"
< Content-Type: text/html
< Last-Modified: Thu, 14 Apr 2011 12:31:23 GMT
< Server: Oracle-Application-Server-10g/10.1.2.0.2 Oracle-HTTP-Server OracleAS-Web-Cache-10g/10.1.2.3.0 (G;max-age=0+0;age=0;ecid=3337039370360,0)
< Content-Length: 135
< Date: Thu, 05 Apr 2012 13:47:43 GMT
< Accept-Ranges: bytes
* Connection 0 to host XXX.XXX.XXX.147 left intact
* Closing connection 0
DIRECT TO BAD MEMBER
* About to connect() to XXX.XXX.XXX.153 port 80
* Trying XXX.XXX.XXX.153...
(never times out)
later attempt with no known changes:
* About to connect() to XXX.XXX.XXX.153 port 80
* Trying XXX.XXX.XXX.153... connected
* Connected to XXX.XXX.XXX.153 (XXX.XXX.XXX.153) port 80
> GET / HTTP/1.1
> User-Agent: curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: XXX.XXX.XXX.153
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "0-87-4bf15078"
< Content-Type: text/html
< Last-Modified: Mon, 17 May 2010 14:19:36 GMT
< Server: Oracle-Application-Server-10g/10.1.2.0.2 Oracle-HTTP-Server OracleAS-Web-Cache-10g/10.1.2.3.0 (G;max-age=0+0;age=0;ecid=4178853203493,0)
< Content-Length: 135
< Date: Thu, 05 Apr 2012 15:33:45 GMT
< Accept-Ranges: bytes
(I then attempted to connect to bad member through virtual server at same time I was able to connect directly and still unable to do so)
DIRECT TO GOOD MEMBER
* About to connect() to XXX.XXX.XXX.16 port 80
* Trying XXX.XXX.XXX.16... connected
* Connected to XXX.XXX.XXX.16 (XXX.XXX.XXX.16) port 80
> GET / HTTP/1.1
> User-Agent: curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
> Host: XXX.XXX.XXX.16
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "0-87-4da6e91b"
< Content-Type: text/html
< Last-Modified: Thu, 14 Apr 2011 12:31:23 GMT
< Server: Oracle-Application-Server-10g/10.1.2.0.2 Oracle-HTTP-Server OracleAS-Web-Cache-10g/10.1.2.3.0 (G;max-age=0+0;age=0;ecid=3852435304296,0)
< Content-Length: 135
< Date: Thu, 05 Apr 2012 15:35:45 GMT
< Accept-Ranges: bytes
- hoolio
Cirrostratus
When you add a TCP profile to a virtual server, TMM will complete a three-way handshake with the client regardless of the state of the pool, so what you're seeing is expected.
Your first tcpdump shows the server not responding to TMM's connection attempts:
09:59:23.951673 IP XXX.XXX.XXX.144.28570 > XXX.XXX.XXX.153.http: S 1018696290:1018696290(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:27.151393 IP XXX.XXX.XXX.144.28570 > XXX.XXX.XXX.153.http: S 1018696290:1018696290(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:30.351698 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:33.351467 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:36.551418 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
09:59:39.751838 IP XXX.XXX.XXX.144.51298 > XXX.XXX.XXX.153.http: S 2956356628:2956356628(0) win 4380 out slot1/tmm0 lis=NOS_Test
You could run a tcpdump on the server using Wireshark or whatever network monitoring tool you might have on it. That would allow you to check if the SYN packets are making it to the server. Does the server have multiple NICs? If so, do you see the responses going out a different interface?
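For longer captures, picking out the unanswered SYNs by eye gets tedious. As a rough sketch only (not part of the original reply), the Python below scans tcpdump text output in the same format as the dumps posted above and lists each flow that sent a bare SYN but never received a SYN-ACK. The function name `unanswered_syns` is hypothetical, and the parsing assumes the exact line layout shown in this thread (`time IP src > dst: flags ...`), which may differ in other tcpdump versions.

```python
import re

# Matches lines such as:
# 09:59:05.153029 IP A.B.C.144.51298 > A.B.C.153.http: S 29:29(0) win 4380 out ...
LINE_RE = re.compile(r"^\S+ IP (\S+) > (\S+): (\S+) (.*)$")


def unanswered_syns(lines):
    """Return {(src, dst): syn_count} for flows whose SYNs got no SYN-ACK."""
    syn_counts = {}
    answered = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ARP and other non-IP lines are skipped
        src, dst, flags, rest = match.groups()
        if flags != "S":
            continue
        if " ack " in " " + rest:
            # SYN-ACK: this is the reply to a SYN sent in the other direction.
            answered.add((dst, src))
        else:
            syn_counts[(src, dst)] = syn_counts.get((src, dst), 0) + 1
    return {flow: n for flow, n in syn_counts.items() if flow not in answered}
```

A flow with a count above 1 means the same SYN was retransmitted, which is the pattern in the bad-member capture. Running the same check on a capture taken on the server itself would show whether those SYNs arrive at all.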
Aaron
- Brad_10289
Nimbostratus
There should only be one NIC on the server. I assume/say that because it is a virtual machine, plus the fact that it seems to work from every other client. Right now the F5 seems to successfully execute the curl against the server IP so I'm hoping that will fail soon so I can try seeing how other things react. For example, a network admin installed IIS on a different port so we can see if it's the Oracle aspect that's somehow breaking down.
But I just don't get it. Even when the curl doesn't work, the HTTP health monitor seems to have no issue passing. The only thing I haven't done is modify it so it looks for a specific response from the server; it's just doing the default GET / with a blank response string.
I mean, despite the communication being off/on between the F5 and the server in SSH, the fact is it generally does work. So why does going through the virtual server fail 100% of the time? What's different?
- Brad_10289
Nimbostratus
I am pleased to note we have made significant progress in identifying the issue. It appears that every server on the same vlan as the F5 self IPs and virtual servers (management is on a different vlan) will not work as expected when reached through the virtual server but will work as expected when reached directly.
What we did was systematically go through every server on the affected vlan that had a specific process we could telnet to in order to verify functionality. After several failures with no successes, we did the same thing on other vlans and were 100% successful with no failures. We then relocated one of the virtual servers from the affected vlan to another and it worked fine.
We were clouded on the issue because we thought we had servers (both physical and virtual) on the affected vlan that were working, but upon closer examination, we witnessed that only pool members on other vlans were fielding traffic properly.
My colleague wants to go ahead to do further testing with wireshark to see what incoming F5 packets look like on affected servers (although I don't believe the packets are reaching the affected servers at all) before we look into it further.
We have only one active interface and it's in the untagged column of the VLAN configuration, but I'm wondering if specifying the VLAN in the upper section where internal is named is somehow contributing to the issue.
- Brad_10289
Nimbostratus
In addition, we have learned that hosts on the same vlan as the F5 cannot ping the active F5 self IP, the floating IP, or any of the virtual servers. So the F5 can ping a server on the same vlan, but the server cannot ping the F5. We have confirmed that packets coming from the F5 are being received by the server, but the server is unable to send packets back.
At this point, we're thinking the issue may be that because both the floating IP and the primary self IP have the same MAC address, it's breaking ARP requests within the same vlan.
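One way to look into the shared-MAC idea from a host on the affected vlan is to group that host's ARP table by MAC address and see which IPs map to the same one. The sketch below is an illustration only: the helper names are hypothetical, and it assumes output in the Linux `arp -an` style (`? (IP) at MAC [ether] on IFACE`), so the pattern would need adjusting for the Windows `arp -a` format used on the servers in this thread. Several IPs sharing one MAC is not proof of a fault on its own; the thread does not establish whether this is the actual cause.

```python
import re
from collections import defaultdict

# Assumed input format (Linux "arp -an"):
# ? (10.0.0.144) at 00:01:d7:aa:bb:cc [ether] on eth0
ARP_RE = re.compile(r"\((\d+\.\d+\.\d+\.\d+)\) at ([0-9A-Fa-f:]{17})")


def ips_by_mac(arp_output):
    """Map each MAC address (lowercased) to the list of IPs that resolve to it."""
    table = defaultdict(list)
    for ip, mac in ARP_RE.findall(arp_output):
        table[mac.lower()].append(ip)
    return dict(table)


def shared_macs(arp_output):
    """Return only the MAC addresses that more than one IP resolves to."""
    return {mac: ips for mac, ips in ips_by_mac(arp_output).items() if len(ips) > 1}
```

Entries shown as incomplete in the ARP table do not match the pattern and are skipped, which is useful here: an address that never resolves will simply be missing from the result.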
