Forum Discussion

brad_11440
Nimbostratus
Apr 07, 2012

GTM Upgrade FAIL - DC Links & ICMP Monitors

We upgraded an HA LTM pair from 10.2.0HF2 to 10.2.3HF1 without issue. Then we attempted to upgrade a single GTM along the same code train. However, once the GTM was on 10.2.3HF1, it would mark down the WideIPs associated with the LTM HA pair after the 90-second timeout. The only monitor in play was the default bigip monitor on the LTM HA pair. Rolling back to 10.2.0 fixed the problem.

We tried re-running bigip_add on all three sides and validating communication with iqdump, although I'm not sure how to tell whether the iqdump messages are valid. Regardless, that didn't fix it.

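For anyone checking the same thing, the validation looks roughly like this, run from the GTM (the LTM self IP is a placeholder). A healthy session streams XML iQuery messages continuously; an immediate SSL error, or no output at all, points at a certificate/trust problem.

    # Run from the GTM against each LTM self IP
    iqdump <ltm-self-ip>
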
The thing is, VS discovery is not enabled due to NAT. So I'm wondering whether a bug relating to GTM bigip monitors and NAT was introduced somewhere after 10.2.0HF2 and has not been corrected. Does anyone have any ideas?

UPDATE: See my second post in this thread. DC link and gateway ICMP monitors are to blame, but I still have no idea why. On 10.2.0 they are not sending traffic on two of the three links, and when we upgraded to 10.2.3, the third stopped sending out pings as well...?!

5 Replies

  • We are fairly close in code levels, so I find this an interesting post, though ours are not exactly the same - our GTMs run 10.2.2HF1. I did recently upgrade one LTM pair from 10.2.0 to 10.2.3HF1, but I didn't have any trouble with LTM/GTM communication.

    You definitely did the right thing trying bigip_add and iqdump. If you don't get an SSL handshake error and you do get XML output, then communication should be OK. Another thing to verify after executing bigip_add is that the LTM certificate is in the GTM's Trusted Server Certificates list, and that the GTM is in the LTM's Trusted Device Certificates list. When I've encountered handshake errors in the past, I cleared out the Trusted Device Certificates list on both LTMs, and both the Trusted Device Certificates and Trusted Server Certificates lists on the GTM. Then I executed bigip_add against the GTM on both LTMs, and against the LTMs on the GTM.

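    In other words, something like the following, where the self IPs are placeholders for your environment (commands from memory, so verify them on your version):

        # On each LTM: re-exchange certificates with the GTM
        bigip_add <gtm-self-ip>

        # On the GTM: re-exchange certificates with each LTM
        bigip_add <ltm1-self-ip>
        bigip_add <ltm2-self-ip>

        # Then confirm from the GTM that iQuery is flowing again
        iqdump <ltm1-self-ip>
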
    I have encountered one other problem, but it is not likely something you will see. While troubleshooting LTM/GTM communication after an upgrade one time, I performed a network trace between the two units. Looking at the handshake, I noticed that the Subject Name on the certificate presented by the LTM referenced the name of an application. Searching the filesystem, I found SSL certificates on the LTM in a weird path - I suspect left over from a v4 -> v9 upgrade at some point. Once I removed those certificates, the LTM chose the right certificate, and that fixed the SSL communication.

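    If you ever need to hunt for strays like that, something along these lines does the job (the certificate path is illustrative):

        # Find certificate files lurking outside the expected config paths
        find / -name "*.crt" 2>/dev/null

        # Inspect the Subject on anything suspicious
        openssl x509 -noout -subject -in /path/to/suspect.crt
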
    I don't use NAT, so I'm not clear on exactly how that fits into the picture.

    One other thought... I find the documentation about which files are retained when booting from one volume to another extremely lacking. It's possible that the SSL certificate used by the LTM or the GTM was not retained on the new boot volume after the upgrade. However, if you ran bigip_add, I think that would have fixed that particular problem.

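    A quick spot-check after booting the new volume, assuming the usual 10.x locations for the iQuery trust stores and device certificate (verify the paths on your build):

        # Count the entries in the trust stores (one per trusted peer)
        grep -c "BEGIN CERTIFICATE" /config/gtm/server.crt
        grep -c "BEGIN CERTIFICATE" /config/big3d/client.crt

        # Check the local device certificate's subject and expiry
        openssl x509 -noout -subject -dates -in /config/httpd/conf/ssl.crt/server.crt
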
  • Thanks for the response. I came here yesterday to post an update but got sidetracked.

    I did more digging into the logs and found out that it WASN'T the bigip monitors that caused the issue. My first post was written the morning after the maintenance window, on very little sleep, before I had a chance to look at the logs and the configuration.

    The problem is with the Data Center Links that are defined and being monitored. There are three links, pinging the ISP routers, using the default gateway_icmp monitor. We can ping the routers from the CLI just fine. However, on 10.2.0HF2, only one of the links is actually up. The other two are down, and reviewing the gtm logs directly after boot, they never actually come up - I can see their state go from blue -> red. Looking at the monitor statistics in the GUI, they are flatlined: all zeroes for input/output packets on the two that are down. Tcpdumps also confirm that the monitors aren't even attempting to send any traffic.

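    For reference, this is the sort of capture we ran (the VLAN name is a placeholder; x.x.x.3 is one of the ISP routers). A healthy monitor should show an echo request going out every interval:

        # Watch for outbound gateway_icmp probes toward a link's router
        tcpdump -ni <external-vlan> icmp and host x.x.x.3

        # Compare against a manual ping sourced from the same unit
        ping -c 3 x.x.x.3
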
    Now, when we upgraded to 10.2.3HF1, the same thing happened to the third link: blue -> red directly after boot. I didn't have time to troubleshoot as we were already past the maintenance window and I was very tired, but if I had to guess, it wasn't even trying to send traffic either. Once all three links were down, the GTM brought down everything defined for that data center, which ultimately forced us to roll back.

    Any ideas what the problem is here? I'm thinking of creating custom monitors to see if the default is just acting up, but really, that's a stab in the dark.

    Apr 7 04:53:32 local/dc1-lb-03 alert gtmd[3754]: 011ae0f2:1: Monitor instance gateway_icmp x.x.x.3:0 UNKNOWN_MONITOR_STATE --> DOWN from gtmd (no reply from big3d: timed out)
    Apr 7 04:53:32 local/dc1-lb-03 alert gtmd[3754]: 011a2003:1: SNMP_TRAP: Link dc-link-02 (ip=x.x.x.3) state change blue --> red (Link dc-link-02: Monitor gateway_icmp from gtmd : no reply from big3d: timed out)
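    If I do go the custom-monitor route, the plan would just be a clone of the default with its own timings. On 10.2 I'd build it in the GUI under Global Traffic > Monitors, but the v11-style tmsh form would look something like this (the name and timings are arbitrary examples):

        # Illustrative clone of the default gateway_icmp monitor
        tmsh create gtm monitor gateway-icmp custom_gw_icmp defaults-from gateway_icmp interval 10 timeout 40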
  • So no monitor traffic, eh? That's a tough one to diagnose without knowing more about your environment. A couple of things come to mind:

    * Long monitor interval: In my experience, it takes a few minutes for an LTM/GTM to initialize its network interfaces. If you have a long monitor interval, it's possible that the gtmd process attempted the monitor ping before the interfaces were initialized, and you simply took the tcpdump before the next interval came around. If the interval is large, perhaps try dropping it to something like 10 seconds temporarily before running tcpdump. It seems very unlikely that TMM would have a flaw so major that it sends no monitor traffic at all.

    * VLAN tagging: I've seen upgrades (admittedly, in the v4 -> v9 days) move a VLAN from untagged to tagged when our switches weren't configured for it. That caused a complete network outage until we moved the interfaces back to untagged.

    * Interface speed/duplex changes: perhaps double-check these interface properties, or, even more basic, verify that the interfaces weren't disabled during the upgrade.

    Those are the things I would be checking first; a few commands that might help are sketched below.
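    Something like this - tmsh from memory, so double-check the syntax on your version:

        # Interface state, speed, and duplex after the upgrade
        tmsh show net interface

        # VLAN tagging - compare tagged/untagged against the switch side
        tmsh list net vlan

        # Trunk status, if you use one
        tmsh show net trunk
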
  • There are only two interfaces, configured as a "trunk", so if it were an interface problem I would expect the one working monitor (of three total) to go down too, along with the bigip monitors going to the LTM pair within the data center. Again: on 10.2.3, all three data center links went down, and immediately after booting back to 10.2.0, the one working monitor was fine again while the other two stayed down, with zero traffic trying to go out. I agree it seems unlikely, but there have been weirder bugs across all vendors in the networking space, and I have seen them first hand.

    I appreciate the recommendations, but I don't think it's any of those things. I've opened a case with F5 to work on it and will update this post if we ever come to a resolution.
  • I was wondering if this ever got a resolution. I have a similar situation, but with an upgrade from v10.2 to v11.3. I have two GTMs monitoring two DCs; one GTM's monitor is not talking to a DC, while the other GTM can talk to it fine. I can see the ICMP hitting the monitored IP, but the GTM still marks the DC as down.
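    One thing I still need to rule out is the reply path from big3d back to the GTM, since the monitor verdict travels over iQuery rather than the ICMP itself (the peer self IP and VLAN below are placeholders):

        # Is iQuery (TCP 4353) healthy between this GTM and the big3d peer?
        tcpdump -ni <vlan> host <peer-self-ip> and port 4353

        # Does iqdump get a message stream back from that peer?
        iqdump <peer-self-ip>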