Problems Overcome During a Major LTM Software/Hardware Upgrade
I recently completed a successful major LTM hardware and software migration which accomplished two high-level goals:

· Software upgrade from v9.3.1HF8 to v10.1.0HF1
· Hardware platform migration from 6400 to 6900

I encountered several problems during the migration event that would have stopped me in my tracks had I not (in most cases) encountered them already during my testing. This is a list of those issues and what I did to address them. While I may not have all the documentation about these problems or even fully understand all the details, the bottom line is that the fixes worked. My hope is that someone else will benefit from this list when it counts the most (and you know what I mean).

Problem #1 – Unable to Access the Configuration Utility (admin GUI)

The first issue I had to resolve was apparent immediately after the upgrade finished. When I tried to access the Configuration utility, I was denied:

    Access forbidden!
    You don't have permission to access the requested object.
    Error 403

I happened to find the resolution in SOL7448: Restricting access to the Configuration utility by source IP address. The SOL refers to bigpipe commands, which is what I used initially:

    bigpipe httpd allow all add
    bigpipe save

Since then, I've worked out the corresponding TMSH commands, since tmsh is F5's long-term direction for managing the system:

    tmsh modify sys httpd allow replace-all-with {all}
    tmsh save /sys config

Problem #2 – Incompatible Profile

I encountered the second issue after the upgraded configuration was loaded for the first time:

    [root@bigip2:INOPERATIVE] config #
    BIGpipe unknown operation error:
    01070752:3: Virtual server vs_0_0_0_0_22 (forwarding type) has an incompatible profile.

By reviewing the /config/bigip.conf file, I found that my forwarding virtual servers had a TCP profile applied:

    virtual vs_0_0_0_0_22 {
       destination any:22
       ip forward
       ip protocol tcp
       translate service disable
       profile custom_tcp
    }

Apparently v9 did not care about this, but v10 would not load until I manually removed these TCP profile references from all of my forwarding virtual servers.
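If you have a lot of forwarding virtual servers, it can be quicker to find the offending objects before the first load with a simple grep from the bash shell. This is only a sketch; the amount of surrounding context you need will depend on your configuration:

    grep -B 2 -A 6 "ip forward" /config/bigip.conf

After the profile reference is removed, the corrected stanza is simply the same definition minus the profile line:

    virtual vs_0_0_0_0_22 {
       destination any:22
       ip forward
       ip protocol tcp
       translate service disable
    }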
Problem #3 – BIGpipe parsing error

Then I encountered a second problem while attempting to load the configuration for the first time:

    BIGpipe parsing error (/config/bigip.conf Line 6870):
    012e0022:3: The requested value (x.x.x.x:3d-nfsd {) is invalid (show | <pool member list> | none) [add | delete]) for 'members' in 'pool'

While examining this error, I noticed that the port number had been translated into a service name – "3d-nfsd". Fortunately, during my initial v10 research I had come across SOL11293 - The default /etc/services file in BIG-IP version 10.1.0 contains service names that may cause a configuration load failure. While I had added a step in my upgrade process to prevent the LTM from translating port numbers into service names, it was not scheduled until after the configuration had been successfully loaded on the new hardware. Instead, I had to move this step up in the overall process flow:

    bigpipe cli service number
    b save

The corresponding TMSH commands are:

    tmsh modify cli global-settings service number
    tmsh save /sys config

Problem #4 – Command is not valid in current event context

This was the final error we encountered when trying to load the upgraded configuration for the first time:

    BIGpipe rule creation error:
    01070151:3: Rule [www.mycompany.com] error: line 28: [command is not valid in current event context (HTTP_RESPONSE)] [HTTP::host]

While reviewing the iRule, it was obvious that we had a statement which didn't make any sense, since there is no Host header in an HTTP response. Apparently it didn't bother v9, but v10 didn't like it:

    when HTTP_RESPONSE {
       switch -glob [string tolower [HTTP::host]] {
          <do some stuff>
       }
    }

We simply removed that event from the iRule.

Problem #5: Failed Log Rotation

After I finished my first migration, I found myself in a situation where none of the logs in the /var/log directory were being rotated. The /var/log/secure log file held the best clue about the underlying issue:

    warning crond[7634]: Deprecated pam_stack module called from service "crond"

I had to open a case with F5, who found that the PAM crond configuration file (/config/bigip/auth/pam.d/crond) had been pulled from the old unit:

    #
    # The PAM configuration file for the cron daemon
    #
    #
    auth       sufficient pam_rootok.so
    auth       required   pam_stack.so service=system-auth
    auth       required   pam_env.so
    account    required   pam_stack.so service=system-auth
    session    required   pam_limits.so
    #session   optional   pam_krb5.so

I had to update the file from a clean unit (which I was fortunate enough to have at my disposal):

    #
    # The PAM configuration file for the cron daemon
    #
    #
    auth       sufficient pam_rootok.so
    auth       required   pam_env.so
    auth       include    system-auth
    account    required   pam_access.so
    account    sufficient pam_permit.so
    account    include    system-auth
    session    required   pam_loginuid.so
    session    include    system-auth

and restart crond:

    bigstart restart crond

or in the v10 world:

    tmsh restart sys service crond
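It may also be worth checking whether any other PAM service files were carried over from the old unit and still reference the deprecated pam_stack module, and confirming that crond came back up after the restart. A quick sketch from the bash shell, using the same pam.d path mentioned above:

    grep -l pam_stack /config/bigip/auth/pam.d/*
    bigstart status crond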
Problem #6: LTM/GTM SSL Communication Failure

This particular issue is the sole reason that my most recent migration took 10 hours instead of four. Even if you do have a GTM, you are not likely to encounter it, since it was a result of our own configuration. But I thought I'd include it, since it isn't something you'll see documented by F5.

One of the steps in my migration plan was to validate successful LTM/GTM communication with iqdump. When I got to this point in the migration process, I found that iqdump was failing in both directions because of SSL certificate verification, despite having installed the new Trusted Server Certificate on the GTM and Trusted Device Certificates on both the LTM and GTM.

After several hours of troubleshooting, I decided to perform a tcpdump to see if I could gain any insight based on what was happening on the wire. I didn't notice it at first, but when I looked at the trace again later I saw that the hostname on the certificate the LTM was presenting was not correct. It was a very small detail that could easily have been missed, but it was the key to identifying the root cause.

Having dealt with Device Certificates in the past, I knew that the Device Certificate file was /config/httpd/conf/ssl.crt/server.crt. When I looked in that directory on the filesystem, I found a number of certificates (and, correspondingly, private keys in /config/httpd/conf/ssl.key) that should not have been there. These certificates and keys had also been pulled from the configuration on the old hardware. So I removed the extraneous certificates and keys from these directories and restarted the httpd service ("bigstart restart httpd" or, in the v10 world, "tmsh restart sys service httpd"). After I did that, the LTM presented the correct Device Certificate and LTM/GTM communication was restored. I'm still not sure to this day how those certificates got there in the first place...
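If you run into something similar, two quick checks can save hours: inspect the device certificate on disk, and look at the certificate actually being presented on the wire. This is only a sketch; the server.crt path is the device certificate location mentioned above, the peer address is a hypothetical placeholder, and 4353 is assumed to be the standard iQuery port:

    openssl x509 -in /config/httpd/conf/ssl.crt/server.crt -noout -subject -dates
    ls -l /config/httpd/conf/ssl.crt/ /config/httpd/conf/ssl.key/
    echo | openssl s_client -connect <peer_address>:4353 2>/dev/null | openssl x509 -noout -subject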
Tech Tip: BIG-IP as Upgrade Facilitator

If you have machines behind your BIG-IP that are not load balanced, you still get a ton of benefits from their location. Virtualization is a big one, as is the number of metrics that are available in the BIG-IP. This Tech Tip focuses on a simple way to get more out of your BIG-IP by simply putting servers behind it. Yep, we said that. Just put the servers behind the BIG-IP and get more.

The idea is simple. Upgrades cost downtime – generally a not insignificant amount of downtime. But what if you could reduce that downtime to zero? What if we said the chances were minuscule that even one customer would be affected? Well, you can, if you have extra hardware. Assuming you've got your server to be upgraded behind a BIG-IP, here are some simple steps to upgrade (a rough command-line sketch of the pool and virtual server steps follows at the end of this Tech Tip):

1. If you've got another box that you can use for your upgrade, place it behind your BIG-IP.
2. Give the new box any old open IP address and hostname.
3. Install everything the machine needs.
4. Place the machine in a pool of its own – we'll call it testPool.
5. Create a new Virtual Node (we'll call this testVIP) and assign testPool to be the default pool.
6. Copy all of the relevant data from your current (production) server.
7. Give your testers rights to testVIP, and run acceptance testing.
8. Copy all of the relevant data again to pick up the changes made in production since testing started.
9. Add your new server to the current production pool.
10. Change the state on the old server to "only active connections allowed". The old server will stop accepting new connections and persistent connections will be broken; in short, all new traffic will be routed to the new node. Because we did not allow persistent connections to stay with the old server, when they immediately try to reconnect, the BIG-IP will direct them to the new server.
11. After all connections have ended with the old server, remove it from the current production pool.
12. Decommission your old server in whatever manner is normal for your company.

The benefit here is the virtualization of BIG-IP. The Virtual Node address of your production server never changed, so users saw no issues. You did not have to take down the existing server, change a bunch of IP settings on the new server, and then bring it online. You just add the new server to the pool and change the state of the old server. Regular end users will not experience any downtime in this process, and persistent connections will reset and attach to the new server. You look like a hero for such a smooth upgrade, and your boss will get congratulations on "pulling it off". Everyone is happy, and the extra work involved is simply a setting change in your BIG-IP.
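This Tech Tip predates tmsh, but on v10 and later the pool and virtual server steps (4, 5, 9, 10, and 11 above) might look roughly like the following. This is only a sketch, not a definitive procedure: the pool name productionPool, the addresses, and port 80 are hypothetical placeholders, and your monitor and profile choices will differ.

    # 4. put the new server (10.0.0.20) in a pool of its own
    tmsh create ltm pool testPool members add { 10.0.0.20:80 }
    # 5. create a test virtual server with testPool as its default pool
    tmsh create ltm virtual testVIP destination 192.0.2.100:80 ip-protocol tcp profiles add { tcp } pool testPool
    # 9. add the new server to the production pool alongside the old one (10.0.0.10)
    tmsh modify ltm pool productionPool members add { 10.0.0.20:80 }
    # 10. force the old member offline so only its active connections are allowed to finish
    tmsh modify ltm pool productionPool members modify { 10.0.0.10:80 { state user-down session user-disabled } }
    # 11. once connections have drained, remove the old member and save
    tmsh modify ltm pool productionPool members delete { 10.0.0.10:80 }
    tmsh save /sys config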