Forum Discussion
High Packet Drop and connection failure
We have a pair of LTM 1600s (named LTM1 & LTM2) and a pair of Cisco 2960s (2960-1 and 2960-2); the detailed connections are as below:
LTM1 internal-trunk = interface 1.3 + 1.4
LTM1 internal-trunk (LACP Enabled, LACP Mode=Active, LACP Timeout = Long, Link Selection Policy = Auto, Frame Distribution Hash=Src/Dst IP)
LTM1 fibre = interface 2.1 + 2.2
LTM1 VLAN External (Tag=10, Untagged Interface=1.1)
LTM1 VLAN Internal (Tag=4093, Untagged Interface=internal-trunk)
LTM1 VLAN pri-failover (tag=4092, Untagged Interface=Fibre)
LTM1 interface 1.1 -> uplink cisco
LTM1 internal-trunk -> 2960-1 port channel 3
LTM1 Fibre -> LTM2 Fibre
LTM2 with exactly the same configuration
2960-1 port channel 5 -> 2960-2 port channel 5
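For reference, the LTM side of that layout corresponds roughly to the tmsh sketch below (illustrative only: v11-style tmsh syntax is assumed, property names can differ by version, and the Src/Dst IP frame-distribution hash is simply left at its trunk setting rather than spelled out here):
(tmos)# create net trunk internal-trunk interfaces add { 1.3 1.4 } lacp enabled lacp-mode active lacp-timeout long link-select-policy auto
(tmos)# create net trunk fibre interfaces add { 2.1 2.2 }
(tmos)# create net vlan external tag 10 interfaces add { 1.1 { untagged } }
(tmos)# create net vlan internal tag 4093 interfaces add { internal-trunk { untagged } }
(tmos)# create net vlan pri-failover tag 4092 interfaces add { fibre { untagged } }
So the external VLAN rides untagged on 1.1, the internal VLAN rides untagged on the LACP trunk to the 2960, and the failover VLAN rides untagged on the direct fibre trunk between the two LTMs.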
Please find below the relevant sections of the show run output:
2960-1# show run
Building configuration...
Current configuration : 6188 bytes
!
version 12.2
hostname 2960-1
no ip source-route
!
no ip domain-lookup
vtp domain f5-private
vtp mode transparent
!
!
spanning-tree mode pvst
spanning-tree extend system-id
!
port-channel load-balance src-dst-ip
!
vlan internal allocation policy ascending
!
vlan 4093
name f5-private-vlan
!
!
!
interface Port-channel3
switchport access vlan 4093
switchport mode access
no keepalive
flowcontrol receive desired
!
interface Port-channel5
switchport access vlan 4093
switchport mode access
!
interface GigabitEthernet1/0/1
switchport access vlan 4093
switchport mode access
no keepalive
flowcontrol receive desired
no cdp enable
no cdp tlv server-location
no cdp tlv app
spanning-tree portfast disable
channel-group 3 mode active
!
interface GigabitEthernet1/0/2
switchport access vlan 4093
switchport mode access
no keepalive
flowcontrol receive desired
no cdp enable
no cdp tlv server-location
no cdp tlv app
spanning-tree portfast disable
channel-group 3 mode active
!
interface GigabitEthernet1/0/3
switchport access vlan 4093
switchport mode access
spanning-tree portfast disable
channel-group 5 mode desirable non-silent
!
interface GigabitEthernet1/0/4
switchport access vlan 4093
switchport mode access
spanning-tree portfast disable
channel-group 5 mode desirable non-silent
!
interface Vlan1
no ip address
shutdown
!
interface Vlan4093
ip address 192.168.1.1 255.255.255.0
!
ip sla enable reaction-alerts
no cdp run
!
end
2960-2 has exactly the same configuration. The detailed situation is that there seems to be a high connection failure rate from the external subnet to the virtual server. I have done a flood ping from 2960-1 to LTM1 (and vice versa) without any problem, but I have observed around 10% packet drop when I ping from LTM1 to LTM2 using either the internal or the external IP. I get the same result (10% packet drop) when I ping from any host sitting in the internal subnet of the LTM to LTM1/LTM2 using either the internal or external IP. But I get 0% packet drop when I ping from a host to 2960-1/2960-2 or vice versa. Is this caused by a mis-configuration? How can I troubleshoot this?
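For reference, the ping tests above were run roughly like this (the addresses are placeholders for the relevant self IPs / node IPs, not the real ones):
# from the BIG-IP bash shell (flood ping requires root)
ping -f -c 1000 <ltm2-internal-self-ip>
ping -c 100 -s 1000 <node-internal-ip>
# from the 2960 exec prompt (one-line extended ping with repeat/size)
ping <ltm1-internal-self-ip> repeat 1000 size 1500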
16 Replies
- nitass
Employee
do you know whether it affects application traffic (i.e. client to virtual server)? if yes, have you captured packets while the problem is happening?
sol10191: Troubleshooting packet drops
http://support.f5.com/kb/en-us/solutions/public/10000/100/sol10191.html - frankcheong_304
Nimbostratus
Yes, it is affecting traffic: some connections going from the external host to the virtual server fail. I have captured the traffic and am analyzing it now, but the traffic volume is huge, so it is quite difficult and takes a lot of time. Therefore, I would like to start from the basics, which is to first make sure the Cisco switch and the LTM configuration are right.
In addition, when I do a show interface I see that the Cisco etherchannel side has a lot of packet drops (over 10k in one week). Does that mean there must be some kind of problem with the connection between these two units? - marco_octavian_
Nimbostratus
The 2960-1 to LTM tests pretty much verify the hardware/cabling is good. Let's dig deeper for an app or utilization issue.
1) When you issue a "sh int" on the Cisco switch, is the negotiated line speed/duplex correct?
2) Have you polled the bigip interfaces with a mgmt/snmp tool to check utilization? Or what is the utilization on the Cisco port side?
3) The 1600 is only rated at 1 Gbps. Marketing aside, the advertised rate for ANY vendor means total throughput, i.e. 1 Gbps of traffic processed inbound and outbound combined. Let's verify #2.
4) I'm not worried about the drops if it is a busy network. Any multicast traffic such as LLDP, CDP, STP, etc. will be dropped unless the box is configured for it, which it isn't.
5) When looking at the captures... are there a lot of resets? Look for the longest delta times and get a feel for what is going on.
6) Have you added any new iRules lately?
7) Are there errors on the ports (Cisco or BIG-IP)?
8) Why do you need this command on the switch - "flowcontrol receive desired"? You really need to know your environment if you start messing with flow control, especially on gig links. I always leave everything to auto on both ends.
9) What LB algorithm are you using on the pool?
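For 1 and 7, something along these lines on the 2960 will pull the negotiated speed/duplex, the error/drop counters and the channel state in one pass (interface names taken from your config, adjust as needed):
show interfaces GigabitEthernet1/0/1 | include duplex|errors|drops
show interfaces GigabitEthernet1/0/2 | include duplex|errors|drops
show interfaces Port-channel3 counters errors
show etherchannel summary
show lacp neighbor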
I would take a capture but only on the most problematic VS and attempt to capture what is happening on the external and internal at the same time. Just run two tcpdumps, one for each vlan. - frankcheong_304
Nimbostratus
Thx Marco for your quick reply. I also think the direct ping between the LTM and the 2960 shows that at least the network cabling is good and the issue is something else. See my answers below:
1. The duplex and line speed are 1000/full on both ends, which match.
2. From what I can see, the network utilization is not high. I don't have an MRTG graph, just a quick check using sh int and the F5 dashboard.
3. See 2.
4. I have already turned off CDP on the ports in order to reduce the packet drops, and CDP is not useful on ports connecting to the F5 anyway. The reason for turning off CDP is that I would like to reduce the number of packet drops down to zero, which is required by our management team. Is there any other configuration on the Cisco end that I can enable/disable to further reduce drops caused by unused features or traffic?
5. Still scratching my head and pulling my hair over the capture, because it is tens of megabytes. Will start with TCP resets. Any other hints for me?
6. No, not actually using any iRules.
7. There are no errors on the Cisco etherchannel, just packet drops. On the BIG-IP, how can I obtain the interface errors?
8. Actually I have been fighting the packet drops for a long time and found an article saying that flow control does not work very well between Cisco and BIG-IP with the auto setting. Anyway, I can try switching it back to auto if that would really help.
9. The LB algorithm is simply round robin, but I think it is not related, because I get packet drops even when simply pinging the node directly from the external subnet.
I have done a capture on both interfaces (as well as on the node and the external host), so I am now checking four captures of tens of megabytes each. The size is huge because the connection failures are quite intermittent: I can establish a connection with the node very often, but not always. That's why I had to capture traffic for quite a while to ensure there would be at least one failed connection within the capture. - marco_octavian_
Nimbostratus
4) Take a capture and put it into Wireshark, then filter for all broadcast and multicast. From there we can get an idea of any harmful drops. Bursty traffic could overflow the buffers, but you stated earlier that direct pings were fine. Did you do extended pings as well, with a larger payload?
5) Look for TCP resets and long delta times in Wireshark. Isolate that conversation and compare the external and internal captures to see if a particular server is taking a long time to respond. You could also use AVR, Fiddler or HttpWatch to hit the most problematic VS and attempt to locate a particular URL/URI combo that is taking a long time to complete. (See the filter sketch at the end of this reply.)
7) In the GUI, Network -> Interfaces, or from the CLI:
[root@ltm1:Active:Disconnected] config tmsh show /net interface
-----------------------------------------------------------------
Net::Interface
Name Status Bits Bits Pkts Pkts Drops Errs Media
In Out In Out
-----------------------------------------------------------------
1.1 up 118.1M 8.8K 137.9K 19 0 0 none
1.2 up 62.8M 10.5K 23.0K 24 0 0 none
1.3 up 90.7M 19.1M 80.8K 57.0K 0 0 none
mgmt up 9.6M 7.3M 8.2K 5.6K 0 0 100TX-FD
[root@ltm1:Active:Disconnected] config
9) a) If a particular server gets busy, RR will keep hammering it. I use Observed Member as my best practice. It will monitor server response times (amongst other things) to determine when to back off of a server, giving it time to catch up.
b) When you say "pinging to the node", what node? LTM VS? If so, that ping does NOT make it to the pool member. If you are having trouble here, then I tend to look back at the network/cabling again.
10) Is the specific VS a 3-tier app? Are there database calls being made by the web or middleware tier that the webserver/client would need to wait on?
11) If you can find a time when you failed to make a connection to the VS (node in your terms?) and you have front and back captures, you will most likely have caught the main portion of your problem, or at least the biggest clue. I would keep a continuous ping going as well to see what happens at the same time a failed attempt occurs.
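If it helps, these are the Wireshark/tshark display filters I would start with on your captures (a sketch only: use -Y on current tshark or -R on older builds, and the IG-bit field name may vary slightly between Wireshark versions):
# broadcast/multicast destinations only (group bit set in the destination MAC)
tshark -r /var/tmp/external.pcap -Y "eth.dst.ig == 1"
# every TCP reset in the trace
tshark -r /var/tmp/external.pcap -Y "tcp.flags.reset == 1"
# segments arriving after a long gap within their conversation
tshark -r /var/tmp/external.pcap -o tcp.calculate_timestamps:TRUE -Y "tcp.time_delta > 1"
- frankcheong_304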
Nimbostratus
4. I have done a flood ping with the results below:
No packet drop:-
2960-1 <> LTM1
2960-2 <> LTM2
Around 10% packet drop:-
LTM1 <> LTM2
LTM1 <> Any Node sitting under 2960-1
LTM2 <> Any Node sitting under 2960-2
Normal pings with packet sizes 1000 and 10000 to and from anywhere don't show any problem at all.
To give a more concrete picture, please find below a brief network diagram
2960-3 ------------------------- 2960-4
| |
LTM-1 ============= LTM-2
|| ||
2960-1 ============ 2960-2
| |
| --------------------------------
| |
NODE-1.............................
Connection List
2960-3 <> LTM-1 Normal single UTP
2960-4 <> LTM-2 Normal single UTP
LTM-1 <> LTM-2 Fibre dual link
LTM-1 <> 2960-1, LTM-2 <> 2960-2 Dual UTP with etherchannel (LACP src/dst IP)
2960-1 <> 2960-2 Dual UTP with etherchannel (PAGP)
All nodes connect to both 2960-1 and 2960-2 with active-standby bonding, and both LTMs are also running in active-standby mode.
Can I ask a simple question: does the above configuration work? Anything else I have to check? I can get rid of LACP and PAgP if the dual links are not really necessary (between the LTM and the 2960 as well as between the two 2960s), because I had overlooked that the throughput of the 1600 is barely 1 Gbps, so dual links don't seem really necessary, right? What about VTP? Should I run VTP in transparent mode? One last question is about STP: in the LTM STP interface I found an STP setting with Bridge Priority 61440. Does it need to match on the Cisco side? How can I check and fix it if needed?
5. Will double check. Recently we found a lot of SMTP connection failures, so will try to focus on port 25.
7. Drop rate is high but errors are zero.
9. a) I wonder if this problem is even related to the VS, because simply pinging from LTM-1 to LTM-2 also has the problem.
b) But pinging from the Cisco to the directly connected F5 or node doesn't yield any problem, so I guess it is not a cabling problem. Is my guess correct?
10. Up to now we have only found a lot of SMTP connection problems, and the SMTP server is rather powerful, so I guess it is not related.
11. Will try to check out the tcpdump with Wireshark, but it is really difficult. - nitass
Employee
since smtp connection is affected, i think it may be easier to investigate it (e.g. tcpdump on bigip and see if packet arrives at bigip or whether packet is dropped by bigip).
tcpdump -nni 0.0:nnn -s0 -w /var/tmp/output.pcap '(host x.x.x.x or host y.y.y.y or host z.z.z.z) and port 25'
x.x.x.x is virtual server ip
y.y.y.y/z.z.z.z is pool member ip
just my 2 cents. - marco_octavian_
Nimbostratus
See notes inline and below:
LTM-1 <> LTM-2 Fibre dual link -->> Does this fiber directly connect the two units or are they going through a switch? (HA subnet?)
In regard to these tests:
Around 10% packet drop:- --> very high but there may be something we are missing here.
LTM1 <> LTM2 --> Is this a direct fiber cable between each unit or are they going through the switch?
LTM1 <> Any Node sitting under 2960-1 --> Are you routing to the node or is the LTM directly on the same subnet (single leg or 802.1q)?
LTM2 <> Any Node sitting under 2960-2 -> Is this ping taking place over an HA subnet or internal vlan subnet from self-ip to self-ip?
There is no need to get rid of the channel configuration. This gives you physical layer redundancy. Temporarily, though, it wouldn't hurt to just pull one cable from each channel to simplify the infrastructure. Single links temporarily. VTP shouldn't have anything to do with this, although I have never been a big fan of it and always left it in transparent mode, but that's another discussion. The default for STP is passthrough, although I usually disable it. Don't let my so-called best practices sway you, everyone has an opinion on it.
Leave the bridge ID alone. It's just set high to keep it from becoming root. It shouldn't "match" anything unless you really know what you are doing. Your other assumptions are correct (9, 10).
As Nitass stated and as said before, let's focus on the problematic VS. In this case you mentioned the smtp vs. Take front and back dumps simultaneously and attempt to catch/find a "broken" connection. From there, we can analyze further.
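A minimal way to take the front and back dumps at the same time from the BIG-IP bash shell would be something like this (VLAN names and addresses are placeholders; substitute your actual external/internal VLAN names, the smtp virtual server address and the pool members):
tcpdump -nni external -s0 -w /var/tmp/smtp-ext.pcap 'port 25 and host <smtp-vs-ip>' &
tcpdump -nni internal -s0 -w /var/tmp/smtp-int.pcap 'port 25 and (host <member1-ip> or host <member2-ip>)' &
# reproduce a failed connection, then stop both captures
kill %1 %2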
Also, have you tried failing the LTMs over in case one is acting up? - frankcheong_304
Nimbostratus
LTM-1 <> LTM-2 Fibre link directly attached without going over any SAN switch.
Ping from LTM-1 to LTM-2 actually goes through the path LTM-1 (LACP) -> 2960-1 (PAgP) -> 2960-2 (LACP) -> LTM-2.
Ping from LTM-1 to any node sitting under 2960-1 actually uses the same F5 subnet (the subnet the internal interface belongs to).
Ping from LTM-2 to any node sitting under 2960-2 actually uses the same F5 subnet (the subnet the internal interface belongs to).
Have performed tcpdump on the F5 internal interface, the F5 external interface, the node as well as the client, and found that there are actually quite a lot of TCP RSTs sent from the client. It is really strange, because to perform this test I have written one simple telnet script, something like:
telnet smtpserver 25 << EOF
QUIT
EOF
Any reason why the tcpdump shows us the TCP connection being reset by the client with this simple script? It seems the TCP RST is being sent from the network layer instead, but I really don't know why.
We have tried to fail over from LTM-1 to LTM-2, but have not performed the tcp capture nor done any ping test afterwards. Would that give more information?
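In the meantime, to put a number on the failure rate I can run a rough loop like the sketch below from the test client (assumptions: bash with /dev/tcp support, coreutils timeout, and smtpserver being the same hostname as in the telnet test above):
#!/bin/bash
# count how many of 100 TCP connects to port 25 fail or hang
HOST=smtpserver
FAIL=0
for i in $(seq 1 100); do
  # open the connection, send QUIT, allow at most 2 seconds
  if ! timeout 2 bash -c "exec 3<>/dev/tcp/$HOST/25 && echo QUIT >&3" 2>/dev/null; then
    FAIL=$((FAIL+1))
  fi
done
echo "$FAIL of 100 connection attempts failed"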