Forum Discussion

ab_f5_admin_263
Jun 27, 2017

FAILED failover: Virtual servers were disabled on standby unit.

We experienced a failed failover the other day where the standby F5 was unable to take over. I would like to ask for any help in preventing this in the future and, if possible, in further investigating what may have gone wrong.

 

Setup:

 

  1. F5_1 and F5_2 are virtual editions running on a Xen cluster, along with about 40 virtual servers providing many services.
  2. The main services affected: a server farm of 14 web servers (dev, production, mail, CDN) load balanced by F5_1 & F5_2, and a small cluster of three user authentication servers load balanced by F5_1 & F5_2.
  3. LB_2 was acting as the active unit.
  4. LB_1 was in standby mode.
  5. Both units were “In Sync”. (Some configurations were changed and synced from LB_2 to LB_1 about a week prior to this failure. We did not actually log in to LB_1 and check any status at that time.)
  6. The web service monitors hit the web servers’ status pages every 10 seconds to test online conditions.
  7. The login service monitor uses the default tcp check to test online conditions.

Timeline of the failed failover.

 

  1. LB_1, in standby mode, recorded a monitor failure when checking the web servers May 15, 2017 @ 3:13 am. This message went unnoticed until the day of the failover event.
  2. From LB_1, the web server nodes were apparently considered online (blue square), but the web virtual servers (ports 25, 80, 443, etc.) were considered unreachable until the failover June 23. Logs from the servers do not show attempts from LB_1 to access the web servers’ online status report page. We assume LB_1 did not try, or failed when trying, to hit the servers’ status pages.
  3. LB_1, in standby mode, recorded a monitor failure when checking the user authentication servers May 21, 2017 @ 5:56 am.
  4. The user authentication nodes were apparently considered online, but the user authentication services were considered unreachable until the failover June 23. We could not confirm from the server side whether LB_1 was attempting to test online conditions.
  5. LB_2 experienced a heartbeat failure for a daemon (bd) June 23, 2017 @ 5:16 am.
  6. LB_2 failed over responsibilities to LB_1 at 5:16:24.880 am.
  7. LB_1's logs show a few ASM configuration updates, but no warnings or new offline events reported for the web and user authentication services.
  8. LB_2 and LB_1 both reported configurations “In Sync” at all times during this event.
  9. Communication to the web and user authentication services was effectively “blackholed”, and the virtual servers providing those services were listed as disabled, with red diamond indicators, on the now-active LB_1.
  10. Service was manually failed back to LB_2 at 9:40 am, and all services were available.
  11. LB_1 remained in the same state for several hours while some investigation took place.
  12. We tried disabling and re-enabling the virtual nodes that were marked offline with red diamonds. The services did not return, and LB_1 did not log a new offline message for the nodes using the monitor.
  13. When LB_1 was rebooted at 21:40, all services were discovered and normal health checks resumed.

Note: This post appears similar to our situation - Node marked offline in standby

 

It was a nuclear FAIL.

 

The units reported "In Sync", but they were not actually in sync, and the standby unit was unprepared to take over the active unit's responsibilities.

 

  • We cannot find any log explaining why LB_1 failed to conduct subsequent health checks on the various servers.
  • We cannot determine from the logs why new health checks did not occur when LB_2 failed over to LB_1.

We want to prevent something like this from happening again.

 

Here are two questions for the board:

 

Can you share advice to help prevent a recurrence?

 

Can you share any advice to further investigate?

 

  • These are very basic logs read from a tar that was made from each virtual F5's /var/log directory just after the manual fail back to F5_2.

    MacBook-Pro:lb-logs auser$ zgrep -r sod lb1/var/log/ 
    lb1/var/log//audit:Jun 23 09:40:03 new-f5lb-1 info syscalld[7414]: 01070417:6: AUDIT - user adminuser - RAW: Request to run /usr/bin/cmd_sod go standby GUI.
    lb1/var/log//liveinstall.log:info: RPM: sod-11.6.1-0.0.317 
    lb1/var/log//liveinstall.log.0:info: sod/i686 -- S=11.6.0-0.0.401, H=11.6.0-5.0.429 
    lb1/var/log//liveinstall.log.0:info: sod/i686 -- will be upgraded 
    lb1/var/log//liveinstall.log.0:info: RPM: sod-11.6.0-5.0.429 
    lb1/var/log//liveinstall.log.1:info: RPM: sod-11.6.0-0.0.401 
    lb1/var/log//ltm:Jun 23 05:16:24 new-f5lb-1 notice sod[8865]: 010c007f:5: Receiving status updates from peer device /Common/f5lb-2.domain.com (123.123.123.43) (Offline).
    lb1/var/log//ltm:Jun 23 05:16:24 new-f5lb-1 notice sod[8865]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
    lb1/var/log//ltm:Jun 23 05:16:24 new-f5lb-1 notice sod[8865]: 010c0053:5: Active for traffic group /Common/traffic-group-1.
    lb1/var/log//ltm:Jun 23 05:16:24 new-f5lb-1 notice sod[8865]: 010c0019:5: Active
    lb1/var/log//ltm:Jun 23 05:17:24 new-f5lb-1 notice sod[8865]: 010c007f:5: Receiving status updates from peer device /Common/f5lb-2.domain.com (123.123.123.43) (Online).
    lb1/var/log//ltm:Jun 23 09:40:03 new-f5lb-1 notice sod[8865]: 010c0044:5: Command:  go standby GUI.
    lb1/var/log//ltm:Jun 23 09:40:03 new-f5lb-1 notice sod[8865]: 010c0052:5: Standby for traffic group /Common/traffic-group-1.
    lb1/var/log//ltm:Jun 23 09:40:03 new-f5lb-1 notice sod[8865]: 010c0018:5: Standby
    lb1/var/log//rpmpkgs:sod-11.6.1-0.0.317.i686.rpm
    lb1/var/log//rpmpkgs.1.gz:sod-11.6.1-0.0.317.i686.rpm
    
    
    MacBook-Pro:lb-logs auser$ zgrep -r sod lb2/var/log/ 
    lb2/var/log//liveinstall.log:info: RPM: sod-11.6.1-0.0.317 
    lb2/var/log//liveinstall.log.0:info: sod/i686 -- S=11.6.0-0.0.401, H=11.6.0-5.0.429 
    lb2/var/log//liveinstall.log.0:info: sod/i686 -- will be upgraded 
    lb2/var/log//liveinstall.log.0:info: RPM: sod-11.6.0-5.0.429 
    lb2/var/log//liveinstall.log.1:info: RPM: sod-11.6.0-0.0.401 
    lb2/var/log//ltm:Jun 23 05:16:14 new-f5lb-2 warning sod[6662]: 01140029:4: HA daemon_heartbeat bd fails action is restart.
    lb2/var/log//ltm:Jun 23 05:16:24 new-f5lb-2 warning sod[6662]: 01140029:4: HA proc_running bd fails action is go offline and down links.
    lb2/var/log//ltm:Jun 23 05:16:24 new-f5lb-2 notice sod[6662]: 010c0050:5: Sod requests links down.
    lb2/var/log//ltm:Jun 23 05:16:24 new-f5lb-2 notice sod[6662]: 010c0054:5: Offline for traffic group /Common/traffic-group-1.
    lb2/var/log//ltm:Jun 23 05:16:24 new-f5lb-2 notice sod[6662]: 010c003e:5: Offline
    lb2/var/log//ltm:Jun 23 05:16:34 new-f5lb-2 notice sod[6662]: 010c006c:5: proc stat: [0] 24764 (bigd) ... lots of numbers ... .
    lb2/var/log//ltm:Jun 23 05:17:14 new-f5lb-2 notice sod[6662]: 010c006c:5: proc stat: [0] 24838 (bd) ... lots of numbers ... .
    lb2/var/log//ltm:Jun 23 05:17:24 new-f5lb-2 notice sod[6662]: 01140030:5: HA proc_running bd is now responding.
    lb2/var/log//ltm:Jun 23 05:17:24 new-f5lb-2 notice sod[6662]: 010c006d:5: Leaving Offline for Standby for dbvar is redundant.
    lb2/var/log//ltm:Jun 23 05:17:24 new-f5lb-2 notice sod[6662]: 010c0052:5: Standby for traffic group /Common/traffic-group-1.
    lb2/var/log//ltm:Jun 23 05:17:24 new-f5lb-2 notice sod[6662]: 010c0018:5: Standby
    lb2/var/log//ltm:Jun 23 05:17:24 new-f5lb-2 notice sod[6662]: 010c0049:5: Tmm ready - links up.
    lb2/var/log//ltm:Jun 23 09:40:03 new-f5lb-2 notice sod[6662]: 010c006d:5: Leaving Standby for Active (best load): NextActive:.
    lb2/var/log//ltm:Jun 23 09:40:03 new-f5lb-2 notice sod[6662]: 010c0053:5: Active for traffic group /Common/traffic-group-1.
    lb2/var/log//ltm:Jun 23 09:40:03 new-f5lb-2 notice sod[6662]: 010c0019:5: Active
    lb2/var/log//ltm.5.gz:Jun 18 04:22:34 new-f5lb-2 notice sod[6662]: 010c007e:5: Not receiving status updates from peer device /Common/f5lb-1.domain.com (123.123.123.42) (Disconnected).
    lb2/var/log//ltm.5.gz:Jun 18 04:22:36 new-f5lb-2 notice sod[6662]: 010c007f:5: Receiving status updates from peer device /Common/f5lb-1.domain.com (123.123.123.42) (Online).
    lb2/var/log//rpmpkgs:sod-11.6.1-0.0.317.i686.rpm
    lb2/var/log//rpmpkgs.1.gz:sod-11.6.1-0.0.317.i686.rpm
    
  • I don't have definitive answers here, but maybe some info to help along the way...

     

    Were all the pools offline on lb1, or just the ones you mention while others were still green/online? BIG-IP has built-in SNMP trap alerts for pools/pool members marked offline/online; were those received from lb1?

     

    BIG-IP does not continuously log the offline status of VS/pools. The down events are logged only once, at the time they occur. Monitors should have continued checking every 10 secs even after the pools were down, so it seems there was some issue (obviously) with lb1 internally or with its ability to reach the servers.

     

    Regarding the "In sync" status: that refers to the configuration, not the status of VS/pools. Each BIG-IP runs monitors independently, and the resulting status is not synced/shared with the other BIG-IP. The offline status of VS/pools will not stop a BIG-IP from going active. In other words, lb1 going active even though VS/pools were offline is expected behavior. HA groups and/or VLAN failsafe features may have altered your experience here.

     

    Have you opened a case with F5 support to get any insight they may have?

     

  • Thank you. My replies:

     

    Were all the pools offline on lb1 or just the ones you mention while others were still green/online?

     

    I did find one or two VS that remained green. One is a sort of placeholder VS, which was set up to eventually take over for a virtual server on a totally separate standalone F5. The other is the testing pool. I do believe that all the other enabled VS were in the red diamond state. We did intentionally force down several VS during the other configuration work I mentioned, and those were reported correctly during the incident. The majority of the VS that were "enabled" on F5_2 were the ones found with red diamonds on F5_1 during the incident.

     

    Re: "built-in SNMP trap alerts"

     

    Can you share which log to search for this alert? We downloaded the tar of existing logs at the time LB_2 was restored as the active unit.

     

    Re: "does not continuously log offline status"

     

    Thank you. That clears up this part for me. There is no reason to expect it to continually log the offline status, even when failing over.

     

    Re: "Monitors should have continued checking every 10 secs ..."

     

    That is what we expected, for sure, so the servers ought to have been noticed and come back online. I think this is the most distressing thing to solve: the original offline condition was triggered so long ago and never recovered, and we are at a loss to understand or explain it. It is most frustrating, especially when explaining it to the boss.

     

    Re: it seems there was some issue (obviously) with lb1 internally or its ability to reach the servers.

     

    While testing with LB_1 in the broken condition, I could curl the web servers' test pages from the LB_1 ssh command line and got the expected result. So we are all the more confused about what apparently caused the F5 to stop sending the monitor tests.

     

    "Regarding the "In sync" status,"

     

    Thank you again for clearing up the misunderstanding. I assumed that the status would be shared as part of "syncing". Thus, I think we are facing an unexplained failure of LB_1 to continue, or successfully accomplish, the web server monitoring.

     

    Re: ...F5 Support ...

     

    No. The system is up, so we will likely just move on. We already have upgrade time scheduled for a few weeks away anyway.

     

    I think we must devise a script to capture the VS statuses on F5_1 and F5_2. The script will simply compare them and alert us if the statuses differ at any time. Even an hourly check should be fine.
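    A minimal sketch of such a check, assuming key-based ssh access from a monitoring host. The hostnames, admin user, mail address, and the awk field positions for parsing tmsh output are all assumptions to verify against your 11.6 units; the comparison itself only needs two sorted "name state" files.

```shell
# Sketch only: compare virtual-server availability between the two units
# and report any mismatch.

# Collection step (commented out so the comparison can be tried offline).
# Reduces an assumed "tmsh show ltm virtual field-fmt" dump to sorted
# "name state" pairs -- adjust the awk fields to your actual output:
#   ssh admin@f5lb-1 'tmsh show ltm virtual field-fmt' \
#     | awk '/^ltm virtual /{vs=$3} /availability-state/{print vs, $2}' \
#     | sort > /tmp/vs_lb1
#   (repeat for f5lb-2 > /tmp/vs_lb2)

# Given two sorted "name state" files, print each VS whose state differs.
compare_vs_status() {
  join "$1" "$2" | awk '$2 != $3 { print $1 ": " $2 " vs " $3 }'
}

# Alerting could then be as simple as mailing any non-empty output, e.g.:
#   mismatches=$(compare_vs_status /tmp/vs_lb1 /tmp/vs_lb2)
#   [ -n "$mismatches" ] && echo "$mismatches" | mail -s "VS status mismatch" ops@example.com
```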

     

  • With this new information, thanks to you, I have just now compared the testing-pool screenshot from the downtime against today's live and running LB_1 (standby) network map, to check the testing pool that was reporting online during the incident.

     

    During the downtime, both test nodes were shown green. Today, one has the red diamond, with a failure time of 2017/06/13. This status is the same on F5_2 and F5_1.

     

    This is the correct status, for the second node in that pool is down, not online.

     

    This makes me all the more unable to trust the downtime report that both nodes in the testing pool were online.

     

    I may never know what caused this inconsistent monitoring and reporting.

     

  • Re: can you share the log to search for this alert?

     

    Changes in pool member state due to monitoring are logged by syslog in /var/log/ltm. The SNMP traps are sent by another process on BIG-IP named alertd. Alertd has built-in trap events for pool members marked up or down by monitors (and many other events), which it sends to the SNMP trap destination defined on the BIG-IP. Alertd can also send email alerts directly from the BIG-IP. See this reference material on askf5.com:

     

    External Monitoring Implementations

     

    Configuring Email Notifications

     

    Configuring Custom SNMP Traps

     

    Testing SNMP Traps

     

    Pre-configured Alerts for SNMP Traps

     

    Running a Command or Script Based on Syslog Message
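    On the log side, the monitor-driven state changes can be pulled out of the saved tar with a small helper. This is a sketch: the 'monitor status' phrase matches the usual up/down messages in /var/log/ltm, but verify the exact wording against your version's log message reference.

```shell
# Pull every monitor-driven up/down line out of saved ltm logs (works on the
# tar'd copies, e.g. lb1/var/log/ltm*, compressed rotations included).
grep_monitor_events() {
  zgrep -h 'monitor status' "$@" 2>/dev/null | sort
}

# Usage against the extracted tar:
#   grep_monitor_events lb1/var/log/ltm*
```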

     

    I agree that if the curl requests returned the expected results on lb1, then some internal failure stopped the health monitors from properly performing their checks.

     

  • Thank you again for the help.

     

    I will look for ways to configure future alerts according to these documents.