Forum Discussion

frankcheong_304
Aug 02, 2013

High Packet Drop and connection failure

We have a pair of LTM 1600s (LTM1 & LTM2) and a pair of Cisco 2960s (2960-1, 2960-2). The detailed connections are as below:

LTM1 internal-trunk = interfaces 1.3 + 1.4
LTM1 internal-trunk (LACP Enabled, LACP Mode = Active, LACP Timeout = Long, Link Selection Policy = Auto, Frame Distribution Hash = Src/Dst IP)
LTM1 Fibre = interfaces 2.1 + 2.2
LTM1 VLAN External (Tag=10, Untagged Interface=1.1)
LTM1 VLAN Internal (Tag=4093, Untagged Interface=internal-trunk)
LTM1 VLAN pri-failover (Tag=4092, Untagged Interface=Fibre)

LTM1 interface 1.1 -> uplink Cisco
LTM1 internal-trunk -> 2960-1 port-channel 3
LTM1 Fibre -> LTM2 Fibre
LTM2 has exactly the same configuration.

2960-1 port-channel 5 -> 2960-2 port-channel 5
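
In tmsh terms (v11.x syntax; the object names simply mirror the description above, and the frame distribution hash / link selection policy are left at the values listed), that trunk and internal VLAN would be built roughly like this:

# LACP trunk over interfaces 1.3 and 1.4 (active mode, long timeout)
tmsh create net trunk internal-trunk interfaces add { 1.3 1.4 } lacp enabled lacp-mode active lacp-timeout long
# internal VLAN (ID 4093) carried untagged over the trunk
tmsh create net vlan Internal tag 4093 interfaces add { internal-trunk { untagged } }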

 

Please find below the relevant portion of the show run output:

 

2960-1# show run

Building configuration...

Current configuration : 6188 bytes
!
version 12.2
hostname 2960-1
no ip source-route
!
no ip domain-lookup
vtp domain f5-private
vtp mode transparent
!
spanning-tree mode pvst
spanning-tree extend system-id
!
port-channel load-balance src-dst-ip
!
vlan internal allocation policy ascending
!
vlan 4093
 name f5-private-vlan
!
interface Port-channel3
 switchport access vlan 4093
 switchport mode access
 no keepalive
 flowcontrol receive desired
!
interface Port-channel5
 switchport access vlan 4093
 switchport mode access
!
interface GigabitEthernet1/0/1
 switchport access vlan 4093
 switchport mode access
 no keepalive
 flowcontrol receive desired
 no cdp enable
 no cdp tlv server-location
 no cdp tlv app
 spanning-tree portfast disable
 channel-group 3 mode active
!
interface GigabitEthernet1/0/2
 switchport access vlan 4093
 switchport mode access
 no keepalive
 flowcontrol receive desired
 no cdp enable
 no cdp tlv server-location
 no cdp tlv app
 spanning-tree portfast disable
 channel-group 3 mode active
!
interface GigabitEthernet1/0/3
 switchport access vlan 4093
 switchport mode access
 spanning-tree portfast disable
 channel-group 5 mode desirable non-silent
!
interface GigabitEthernet1/0/4
 switchport access vlan 4093
 switchport mode access
 spanning-tree portfast disable
 channel-group 5 mode desirable non-silent
!
interface Vlan1
 no ip address
 shutdown
!
interface Vlan4093
 ip address 192.168.1.1 255.255.255.0
!
ip sla enable reaction-alerts
no cdp run
!
end

 

 

 

2960-2 has exactly the same configuration. The problem is that there seems to be a high connection failure rate from the external subnet to the virtual server. A flood ping from 2960-1 to LTM1 (and vice versa) shows no loss, but I see around 10% packet drop when I ping from LTM1 to LTM2 using either the internal or the external IP. I get the same result (roughly 10% packet drop) when I ping from any host on the LTM internal subnet to LTM1/LTM2, again using either the internal or external IP. However, there is zero packet loss when I ping from those hosts to 2960-1/2960-2 or vice versa. Is this caused by a misconfiguration? How can I troubleshoot it?
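
One quick data point worth collecting on the LTM side, before going to packet captures, is the error/drop counters on the trunk and its member interfaces, for example (v11.x tmsh syntax):

# trunk-level counters (drops, errors, collisions) for the LACP trunk
tmsh show net trunk internal-trunk
# per-interface counters for the trunk members and the fibre pair
tmsh show net interface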

 

 

 

 

  • These types of problems are always like going down the rabbit hole, especially over a medium like this. It does look like we are starting to narrow it down.

    The LTM1 <> LTM2 ping loss is worth looking into, but not yet. I would be more concerned about losing pings from LTM1/LTM2 to any node under the 2960. I want you to find a node that is actually plugged into 2960-1. This is important: we don't want that traffic going across the PAgP link. I then want you to perform two ping tests, one with 1,000 pings and one with 10,000 pings, issued from the LTM to the internal/inside node that is physically plugged into 2960-1 (see the sketch at the end of this reply). Please report back.

    If you are getting TCP RSTs from the client, then you need to dig further into this. This is the clue we have been looking for. Please verify the RST is from the client first. Make sure there isn't one from the server beforehand, and also verify what communication took place right before the RST. Sample a couple of RSTs to see if it is the same call causing this. Narrow things down using the tcp.port or udp.port filter in Wireshark (an example capture filter is included in the sketch below).

    The client is probably sending TCP RSTs because you're not doing anything and you're ending the session (QUIT command). Sounds like normal behavior to me.

    BTW, what version are you running again? 10.2?
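
    A rough sketch of the two ping runs from the LTM's bash prompt (the node address is a placeholder; -i 0.2 just shortens the interval so the runs finish faster):

    ping -c 1000 -i 0.2 <node-plugged-into-2960-1>
    ping -c 10000 -i 0.2 <node-plugged-into-2960-1>

    The summary at the end of each run gives the packet-loss percentage. And when you get to the RSTs, a capture can be pre-filtered down to just reset packets, for example:

    tcpdump -nni 0.0 'host <virtual-server-ip> and port <service-port> and tcp[tcpflags] & tcp-rst != 0'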
  • "The client is probably sending TCP RSTs because you're not doing anything and you're ending the session (QUIT command). Sounds like normal behavior to me." It does send a FIN here. Since QUIT is sent right after connect, I do not think the client should send an RST.

    Just my 2 cents.
  • Nitass, in my experience of going through app traces for years, I have seen that at least half of all apps don't end sessions cleanly. You don't get the four-way TCP closure very often, not like you would expect. I see resets being very common and it's often messy for both sides of the connection.

    If this particular conversation of the trace could be posted here, that would help. You could export it to text and substitute the IP addresses with anonymous ones (a quick way to do that from the shell is sketched below).
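
    For example, something along these lines from the LTM shell (the file name and address prefix here are just placeholders):

    tcpdump -nnr /var/tmp/trace.pcap > /var/tmp/trace.txt
    sed -i 's/192\.0\.2\./10\.0\.0\./g' /var/tmp/trace.txt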
  • "In my experience of going through app traces for years, I have seen that at least half of all apps don't end sessions cleanly. You don't get the four-way TCP closure very often, not like you would expect. I see resets being very common and it's often messy for both sides of the connection." I see, thanks 🙂

    "If this particular conversation of the trace could be posted here, that would help. You could export it to text and substitute the IP addresses with anonymous ones." This is what I tested here.

     virtual server
    
    [root@ve11a:Active:Changes Pending] config  tmsh list ltm virtual bar
    ltm virtual bar {
        destination 172.28.20.111:25
        ip-protocol tcp
        mask 255.255.255.255
        pool foo
        profiles {
            tcp { }
        }
        source 0.0.0.0/0
        source-address-translation {
            type automap
        }
        vs-index 6
    }
    
     client
    
    [root@centos17 ~] telnet 172.28.20.111 25
    Trying 172.28.20.111...
    Connected to 172.28.20.111 (172.28.20.111).
    Escape character is '^]'.
    220 ESMTP
    quit
    221 
    Connection closed by foreign host.
    
     packet trace
    
    [root@ve11a:Active:Changes Pending] config  tcpdump -nni 0.0 host 172.28.20.111 and port 25
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on 0.0, link-type EN10MB (Ethernet), capture size 96 bytes
    23:36:36.489354 IP 172.28.20.17.55215 > 172.28.20.111.25: S 2510838242:2510838242(0) win 5840 
    23:36:36.489397 IP 172.28.20.111.25 > 172.28.20.17.55215: S 485573986:485573986(0) ack 2510838243 win 4380 
    23:36:36.490487 IP 172.28.20.17.55215 > 172.28.20.111.25: . ack 1 win 5840 
    23:36:36.874546 IP 172.28.20.111.25 > 172.28.20.17.55215: P 1:24(23) ack 1 win 4380 
    23:36:36.876531 IP 172.28.20.17.55215 > 172.28.20.111.25: . ack 24 win 5840 
    23:36:37.921334 IP 172.28.20.17.55215 > 172.28.20.111.25: P 1:7(6) ack 24 win 5840 
    23:36:37.921360 IP 172.28.20.111.25 > 172.28.20.17.55215: . ack 7 win 4386 
    23:36:38.111502 IP 172.28.20.111.25 > 172.28.20.17.55215: P 24:41(17) ack 7 win 4386 
    23:36:38.111515 IP 172.28.20.111.25 > 172.28.20.17.55215: F 41:41(0) ack 7 win 4386 
    23:36:38.112569 IP 172.28.20.17.55215 > 172.28.20.111.25: . ack 41 win 5840 
    23:36:38.112573 IP 172.28.20.17.55215 > 172.28.20.111.25: F 7:7(0) ack 42 win 5840 
    23:36:38.112586 IP 172.28.20.111.25 > 172.28.20.17.55215: . ack 8 win 4386 
    
  • Frank,

    I think we are at a stalemate here. Yes, retransmissions are normal. Resets are common but not a good thing (most of the time). Nitass is right also; you shouldn't get a reset, of course. I just don't want you to get hung up on it. It's just a clue (symptom) to uncovering the issue.

    Items:

    1) These simple tests may not be enough; they are not the same as your production traffic. Just sending a hello and/or a quit isn't much. Where's the MAIL FROM and RCPT TO? SMTP doesn't really have that many commands, but it's the ones you're missing in your tests that could be causing the problem. Perhaps there is an authentication issue (if required)? Perhaps a username isn't being recognized? Maybe an unsupported command is coming across? (A fuller test session is sketched at the end of this reply.)

    2) I don't know why telnet causes the resets for you. It is probably due to the buffer and the EOF all being pushed down the pipe. I know EOF is just to let the shell know the input has ended, but it all gets put into the buffer and it makes a difference in the trace. I'm wondering if the QUIT is getting there before the 220 comes back. It works itself out, but it makes the order of the packets look a bit strange. I have attached two Wireshark images (taken from LTM tcpdumps). The first, smtp_without_eof.png, is a simple connect and QUIT without using EOF. The second, smtpeof.png, is when I used EOF like you did. Notice how it throws off the decodes in Wireshark. I get weird results sometimes when replicating HTTP monitors using telnet. There's no reason to put more cycles into this at this time.

    3) Did you compare your telnet session to a legitimate connection using tcpdump/Wireshark? Let's focus on getting the right captures at this point. Just take captures from the front and back and then focus on the resets or long delta times for starters (see the capture sketch at the end of this reply). Let's see whether the same types of connections/users are causing the resets or whether it is load. You can also open a case and they will help you parse through the captures.

    4) Did you test both/all of your SMTP servers directly, or were you hitting the VIP every time? Are you using persistence? I also assume you are running these tests/scripts from the LTM.

    When you reply back regarding 3, we can examine it further.
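
    Regarding 1), a slightly fuller manual SMTP exchange would be more representative than a bare QUIT; something along these lines (the VIP, host names and mail addresses are placeholders, and a line with a single "." ends the DATA section):

    telnet <virtual-server-ip> 25
    EHLO client.example.com
    MAIL FROM:<test@example.com>
    RCPT TO:<user@example.com>
    DATA
    Subject: test via the vip
    .
    QUIT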
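
    For 3), both captures can be taken straight from the LTM at the same time, one on the client-side VLAN and one on the server-side VLAN; roughly (the VLAN, port and file names below are assumptions, adjust to your setup):

    tcpdump -nni External -s0 -w /var/tmp/front.pcap host <virtual-server-ip> and port 25
    tcpdump -nni Internal -s0 -w /var/tmp/back.pcap port 25

    Then open the files in Wireshark and start from the resets (display filter tcp.flags.reset == 1) and from the long delta times.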

     

  • Finally, we found what happened. It was because the pool did not have session persistence set. We changed the persistence profile to Source Address and the issue is solved. Thanks everyone for the help.
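
    The equivalent change in tmsh (v11.x syntax; the virtual server name is a placeholder) is roughly:

    tmsh modify ltm virtual <virtual-server-name> persist replace-all-with { source_addr }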