Forum Discussion

JN_AU
Jun 16, 2023

Two virtual servers go down after an upgrade

Hi Everyone,

I'm wondering if anyone has seen this behaviour before. After I perform an upgrade, two virtual servers out of about 10 go down. The applications using these VSs stop working; from the looks of it, their outbound traffic to the internet stops.

The health monitor in use is the default HTTPS monitor. I also created a custom monitor that GETs a file from the backend pool members; it marks the nodes as up, but the apps still don't work. The backend servers are running IIS 10.0.

Some versions of the F5 software work, but most do not. The working versions are 14.1.2.8, 14.1.4, and 15.1.3.1; every other version I've tried fails. As soon as the upgraded device is made active (in the HA pair), the VSs go down. Packet captures don't clearly show the issue, but they do show a pause of 75 seconds or more in the response from the pool members. That pause isn't there when one of the working versions is active, so I don't think this is an issue with the pool members.

The traffic passes through two sets of Checkpoint firewalls, and is NATed each time on these firewalls on the way to the internet.

Could anyone provide information as to why this would work with some versions of the BIG-IP software, and not others?

Thanks,

  • Hi All,

    This looks to have possibly been this bug: https://my.f5.com/manage/s/article/K85805058

    The actual issue was on pool members behind the F5 - they used the F5s as a gateway to get to the internet using an IP forwarding VS. When the issue occurred these pool members were unable to get to the internet. It looks like the standard HTTPS health checks failed because the pool members were timing out trying to load internet content.

    Further examination of the packet captures suggested the traffic was possibly asymmetric (based on the MAC addresses observed).

    The fix was to create a new FastL4 profile and make sure 'loose init' and 'loose close' were enabled. This profile was then used on the ip forwarding VS, and it looks like this has solved the issue.

    14.1.5.6 was installed and is so far working fine.
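
The FastL4 change described above could be sketched in tmsh roughly as follows. This is a minimal sketch and not the poster's actual commands: the helper name `apply_loose_fastl4`, the profile name, and the virtual server name are all hypothetical, and it assumes the `loose-initialization` and `loose-close` attributes of `ltm profile fastl4` correspond to the 'loose init' and 'loose close' settings mentioned in the post. Check the attribute names against the tmsh reference for your version before running anything.

```shell
# Hypothetical helper: create a FastL4 profile with loose initiation and
# loose close enabled, then use it on an IP forwarding virtual server.
# $1 = new profile name, $2 = existing virtual server name (both placeholders).
apply_loose_fastl4() {
    local profile="$1" virtual="$2"
    # Inherit everything else from the default fastL4 profile.
    tmsh create ltm profile fastl4 "$profile" defaults-from fastL4 \
        loose-initialization enabled loose-close enabled &&
    # Replace the profile list on the virtual server with the new profile.
    tmsh modify ltm virtual "$virtual" profiles replace-all-with \{ "$profile" \} &&
    # Persist the running configuration.
    tmsh save sys config
}

# Example call from the BIG-IP bash prompt (names are made up):
#   apply_loose_fastl4 fastl4_loose vs_ip_forward
```

Loose initiation lets the BIG-IP create a connection entry from a packet other than a TCP SYN, which is why it is commonly paired with asymmetric paths; it also relaxes state checking, so it is usually limited to forwarding virtual servers that need it.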

  • Hi All,

    Thanks for the replies.

    Paulius:

    The health monitor and everything works fine when using the working versions - 14.1.4 or 15.1.3.1. It's only when a different version is made active that the issues start. Nothing else changes in the environment, the only difference is the F5 upgraded version (no routing/firewall changes).

    I compared the configs between versions: 14.1.4 against 14.1.5.4, and 15.1.3.1 against 15.1.4.1 (which I also tested and found not working).

    Comparing 14.1.4 (working) to 14.1.5.4 (non-working), the only difference I could see in the config was under the virtual addresses for the VIPs, they had 'icmp-echo enabled' in 14.1.5.4 (non-working version).

    Comparing 15.1.3.1 (working) to 15.1.4.1 (non-working) I couldn't see any differences in the config.

    Ben_Novak:

    I uploaded QKViews to iHealth but they didn't shed any light on the issue unfortunately. I've had a case open for quite some time with F5 for this issue. No errors come up in the logs when this issue happens that I've found so far.

    Unfortunately, every 14.x version I've tried above 14.1.4 has not worked, and the same goes for anything above 15.1.3.1 in the 15.x branch. I also can't leave these F5s running 15.1.3.1 as they are 2000s, and only up to 15.1.2 is officially supported.

    Mohamed_Ahmed_Kansoh:

    Thanks for letting me know about the bug tracker. I searched it for the versions listed but unfortunately couldn't find anything related to the issues I'm facing.

    • JN_AU, if you are not able to share your configuration, the only other thing I can recommend is a last resort: perform the code upgrade without copying the configuration over to the new installation, then configure the F5 from scratch and see whether it works after that. Beyond that, this might be something you have to wait for F5 to respond on. To rebuild quickly, gather all the configuration through the CLI and then load it using the following command:

      load sys config from-terminal merge

  • In addition to what Paulius suggested, I'd start by uploading a qkview to iHealth; it may be able to identify any issues.

    With code upgrades, the ciphers also get updated. This can cause some server connectivity issues.

    My first stop would be the ltm logs to see if there are any iRule errors, cipher errors, or health monitor timeouts.
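
A log check along these lines could be sketched as a small shell helper. This is only a sketch under assumptions: the function name `scan_ltm_log` is hypothetical, and the keyword list is a guess at strings that tend to appear near iRule, cipher, and monitor problems, not an official F5 message catalogue. It defaults to `/var/log/ltm`, the LTM log mentioned above.

```shell
# Hypothetical helper: print LTM log lines that mention iRule (TCL) errors,
# cipher or handshake problems, or monitor status changes.
# $1 = log file to scan (defaults to /var/log/ltm).
scan_ltm_log() {
    local logfile="${1:-/var/log/ltm}"
    # Case-insensitive match on a few assumed keywords; adjust as needed.
    grep -iE 'TCL error|cipher|handshake|monitor status' "$logfile"
}

# Example call from the BIG-IP bash prompt:
#   scan_ltm_log /var/log/ltm | tail -n 50
```

Running it right after the upgraded unit goes active keeps the output focused on the failover window.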

    I also suggest getting up to the latest 15.1.x train so you have all the latest hotfixes.

    • Hi JN_AU,
      in addition to what Ben_Novak and Paulius said,
      use the F5 Bug Tracker to check for known bugs before proceeding with your upgrade; you may have hit one.
      This is the URL:
      https://my.f5.com/manage/s/bug-tracker

      A brief note on the Bug Tracker: enter your target version, then filter on severity (Blocking, Critical, High) and on product (TMOS, LTM).
      Based on your design and scenario, I think you may find something related to your issue.

      >> Note: running a bug scrub against the version you plan to upgrade to is crucial before any upgrade.

  • JN_AU, I would check the following:

    1. Is the health monitor making it to the destination? You can verify this with either tcpdump or Wireshark on the destination device.
    2. Are the firewalls between the F5 and the destination blocking any communication from the F5s? For health monitors you would be looking for the F5 self-IPs of the active and standby units.
    3. Is the same routing in place for the new code version F5 and the old code version F5?
    4. Have you compared the old and new configurations of the F5s to see what is different? If you don't have a BIG-IQ, I have found the easiest way is to generate an SCF before the code upgrade and another after it, then diff the two files. The following URL should help with this.

    https://my.f5.com/manage/s/article/K13408
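
The SCF comparison in step 4 could be sketched as below. This is a minimal sketch with assumptions: the helper name `scf_diff` and the file names are hypothetical, and the `tmsh save sys config file` commands in the comments should be checked against K13408 for the options your version supports. Both files need to be copied to one place first, since they are generated on different software versions.

```shell
# Hypothetical helper: show a unified diff between two single configuration
# files (SCFs). The SCFs could be generated before and after the upgrade with:
#   tmsh save sys config file pre_upgrade.scf no-passphrase
#   tmsh save sys config file post_upgrade.scf no-passphrase
# (file names are placeholders).
# $1 = SCF from the old version, $2 = SCF from the new version.
scf_diff() {
    # diff exits with status 1 when the files differ; that is not an error.
    diff -u "$1" "$2"
}

# Example call (paths are made up):
#   scf_diff pre_upgrade.scf post_upgrade.scf | less
```

Lines prefixed with `+` exist only in the post-upgrade file, which is how a change such as an added `icmp-echo enabled` on a virtual address would show up.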