Problems Overcome During a Major LTM Software/Hardware Upgrade

I recently completed a successful major LTM hardware and software migration which accomplished two high-level goals:

· Software upgrade from v9.3.1HF8 to v10.1.0HF1
· Hardware platform migration from 6400 to 6900
 
I encountered several problems during the migration event that would have stopped me in my tracks had I not (in most cases) already encountered them during my testing. This is a list of those issues and what I did to address them. While I may not have all the documentation about these problems or even fully understand all the details, the bottom line is that these fixes worked. My hope is that someone else will benefit from this list when it counts the most (and you know what I mean).
 
 
Problem #1 – Unable to Access the Configuration Utility (admin GUI)
The first issue I had to resolve was apparent immediately after the upgrade finished. When I tried to access the Configuration utility, I was denied:
 
Access forbidden!
You don't have permission to access the requested object.
Error 403
 
I happened to find the resolution in SOL7448: Restricting access to the Configuration utility by source IP address. The SOL refers to bigpipe commands, which is what I used initially:
 
bigpipe httpd allow all add
bigpipe save
 
Since then, I’ve worked out the corresponding TMSH commands, since TMSH is F5’s long-term direction for managing the system:
 
tmsh modify sys httpd allow replace-all-with {all}
tmsh save /sys config
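 
Once access is restored, you may want to tighten the allow list back down rather than leaving it open to all. A minimal sketch, assuming a hypothetical management subnet of 192.168.10.0/24:
 
tmsh modify sys httpd allow replace-all-with { 192.168.10.0/255.255.255.0 }
tmsh save /sys config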
 
 
Problem #2 – Incompatible Profile
I encountered the second issue after the upgraded configuration was loaded for the first time:
 
[root@bigip2:INOPERATIVE] config # BIGpipe unknown operation error: 01070752:3: Virtual server vs_0_0_0_0_22 (forwarding type) has an incompatible profile.
 
By reviewing the /config/bigip.conf file, I found that my forwarding virtual servers had a TCP profile applied:
 
virtual vs_0_0_0_0_22 {
 destination any:22
 ip forward
 ip protocol tcp
 translate service disable
 profile custom_tcp
}
 
Apparently v9 did not care about this, but v10 would not load until I manually removed these TCP profile references from all of my forwarding virtual servers.
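 
For reference, this is what the same forwarding virtual server looked like once the fix was applied; the only change is that the profile line is gone:
 
virtual vs_0_0_0_0_22 {
 destination any:22
 ip forward
 ip protocol tcp
 translate service disable
}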
 
 
Problem #3 – BIGpipe parsing error
Next, I encountered another problem while attempting to load the configuration for the first time:
 
BIGpipe parsing error (/config/bigip.conf Line 6870): 012e0022:3: The requested value (x.x.x.x:3d-nfsd {) is invalid (show | <pool member list> | none) [add | delete]) for 'members' in 'pool'
 
While examining this error, I noticed that the port number had been translated into a service name – “3d-nfsd”. Fortunately, during my initial v10 research I had come across SOL11293 - The default /etc/services file in BIG-IP version 10.1.0 contains service names that may cause a configuration load failure. I had already added a step to my upgrade process to stop the LTM from translating port numbers into service names, but it was not scheduled until after the configuration had been successfully loaded on the new hardware. Instead, I had to move this step up in the overall process flow:
 
bigpipe cli service number
b save
 
The corresponding TMSH commands are:
 
tmsh modify cli global-settings service number
tmsh save /sys config
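 
If you want to confirm which numeric port a translated name actually maps to before correcting the configuration, you can look it up in the services file on the unit itself; a quick check using the name from the error above:
 
grep 3d-nfsd /etc/services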
 
 
Problem #4 – Command is not valid in current event context
This was the final error we encountered when trying to load the upgraded configuration for the first time:
 
BIGpipe rule creation error: 01070151:3: Rule [www.mycompany.com] error: line 28: [command is not valid in current event context (HTTP_RESPONSE)] [HTTP::host]
 
While reviewing the iRule, it was obvious that we had a statement that didn’t make any sense, since there is no Host header in an HTTP response. Apparently it didn’t bother v9, but v10 didn’t like it:
 
when HTTP_RESPONSE {
 switch -glob [string tolower [HTTP::host]] {
    <do some stuff>
 }
}
 
We simply removed that event from the iRule.
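 
If the response-side logic had actually been needed, one option (a sketch, not what we deployed; the hostname pattern and log message are made up) would be to capture the Host header during HTTP_REQUEST, where it is valid, and reference the saved value in HTTP_RESPONSE:
 
when HTTP_REQUEST {
 # HTTP::host is only valid in request-side events, so save it here
 set host [string tolower [HTTP::host]]
}
when HTTP_RESPONSE {
 # use the value captured during the request instead of calling HTTP::host
 switch -glob $host {
    "*.mycompany.com" { log local0. "response for $host" }
 }
}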
 
 
Problem #5 – Failed Log Rotation
After I finished my first migration, I found myself in a situation where none of the logs in the /var/log directory were being rotated. The /var/log/secure log file held the best clue about the underlying issue:
 
warning crond[7634]: Deprecated pam_stack module called from service "crond"
 
I had to open a case with F5, who found that the PAM crond configuration file (/config/bigip/auth/pam.d/crond) had been pulled from the old unit:
 
#
# The PAM configuration file for the cron daemon
#
#
auth    sufficient      pam_rootok.so
auth    required        pam_stack.so service=system-auth
auth    required        pam_env.so
account required        pam_stack.so service=system-auth
session required        pam_limits.so
#session        optional        pam_krb5.so
 
I had to update the file from a clean unit (which I was fortunate enough to have at my disposal):
 
#
# The PAM configuration file for the cron daemon
#
#
auth       sufficient pam_rootok.so
auth       required   pam_env.so
auth       include    system-auth
account    required   pam_access.so
account    sufficient pam_permit.so
account    include    system-auth
session    required   pam_loginuid.so
session    include    system-auth
 
and restart crond:
 
bigstart restart crond
 
or in the v10 world:
 
tmsh restart sys service crond
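 
After restarting crond, it’s worth confirming that the warning has stopped appearing; a simple check against the same log file referenced above:
 
tail -f /var/log/secure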


Problem #6 – LTM/GTM SSL Communication Failure

This particular issue is the sole reason that my most recent migration took 10 hours instead of four. Even if you do have a GTM, you are not likely to encounter it, since it was a result of our own configuration, but I thought I’d include it because it isn’t something you’ll see documented by F5. One of the steps in my migration plan was to validate successful LTM/GTM communication with iqdump. When I got to this point in the migration, I found that iqdump was failing in both directions due to SSL certificate verification, despite my having installed the new Trusted Server Certificate on the GTM and Trusted Device Certificates on both the LTM and GTM. After several hours of troubleshooting, I decided to take a tcpdump capture to see if I could gain any insight from what was happening on the wire. I didn’t catch it at first, but when I looked at the trace again later I noticed that the hostname on the certificate the LTM was presenting was not correct. It was a very small detail that could easily have been missed, but it was the key to identifying the root cause.
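 
For anyone who needs to do the same kind of on-the-wire verification, here is a rough sketch of the commands involved; the interface, capture file, and address are placeholders, and iQuery runs over TCP port 4353:
 
# capture the iQuery (big3d) traffic while running iqdump from the other unit
tcpdump -ni 0.0 -s0 -w /var/tmp/iquery.pcap port 4353
 
# check which certificate the LTM is actually presenting on the iQuery port
openssl s_client -connect <ltm-self-ip>:4353 </dev/null | openssl x509 -noout -subject -issuer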
 
Having dealt with Device Certificates in the past, I knew that the Device Certificate file was /config/httpd/conf/ssl.crt/server.crt. When I looked in that directory on the filesystem, I found a number of certificates (and, correspondingly, private keys in /config/httpd/conf/ssl.key) that should not have been there. I also found that these certificates and keys had been pulled from the configuration on the old hardware. So I removed the extraneous certificates and keys from those directories and restarted the httpd service (“bigstart restart httpd”, or in the v10 world, “tmsh restart sys service httpd”). After I did that, the LTM presented the correct Device Certificate and LTM/GTM communication was restored. I'm still not sure to this day how those certificates got there in the first place...
Published May 27, 2010
Version 1.0
  • j_pedley_46776:
    Great work! Another thing to watch out for with iRules is data group naming. In 9.4.x I have data groups referenced as ::dg_name. This works fine in 9.4 but fails in 10.2; the fix is to add the $. You can also do this pre-migration, but be warned: if your data group name has a hyphen in it, 9.4 will truncate the variable, and since it no longer matches a valid data group, the iRule will abort. Clients really don't like to see TCP resets...
  • Aaron:
    "j_pedley: [For data groups the] fix is to add the $."
    The better fix is to remove $:: or :: from any data group names to preserve CMP compatibility.
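 
As a hedged illustration of the data group change described in these comments (the data group name and match logic are hypothetical), a v9-style reference and a CMP-friendly v10 equivalent might look like this:
 
when HTTP_REQUEST {
 # v9.4-style reference using the global variable syntax; in v10 this
 # demotes the virtual server from CMP
 if { [matchclass [HTTP::uri] contains $::blocked_uris] } {
    reject
 }
}
 
when HTTP_REQUEST {
 # v10-style reference using the class command and a bare data group name;
 # no $:: or :: prefix, so CMP compatibility is preserved
 if { [class match [HTTP::uri] contains blocked_uris] } {
    reject
 }
}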