ab_f5_admin_263
Jun 27, 2017
FAILED failover: Virtual servers were disabled on standby unit.
We experienced a failed failover the other day where the standby F5 was unable to take over. I would like to ask for help preventing this in the future and, if possible, with further investigating what may have gone wrong.
Setup:
- F5_1 and F5_2 (referred to below as LB_1 and LB_2) are virtual editions running on a Xen cluster along with about 40 virtual machines providing many services.
- The main services that were affected: a server farm of 14 web servers (dev, production, mail, CDN) load balanced by LB_1 & LB_2, and a small cluster of three user authentication servers load balanced by LB_1 & LB_2.
- LB_2 was acting as the active unit.
- LB_1 was in standby mode.
- Both units were “In Sync”. (Some configurations were changed and synced from LB_2 to LB_1 about a week prior to this failure. We did not actually log in to LB_1 and check any status at that time.)
- The web service monitors hit the web servers’ status pages every 10 seconds to test online conditions.
- The login service monitor uses the default TCP check to test online conditions.
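For reference, the monitor assignments and current member status described above can be inspected from tmsh on each unit. A minimal sketch, assuming placeholder object names (web_pool, login_pool are illustrative, not our real configuration):

```shell
# Show which monitor is attached to each pool
# (pool names here are placeholders, not the real objects)
tmsh list ltm pool web_pool monitor
tmsh list ltm pool login_pool monitor

# Show the current availability of pool members as bigd sees them
tmsh show ltm pool web_pool members
```

Running this on the standby as well as the active unit would have shown the discrepancy described below much earlier.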
Timeline of the failed failover.
- LB_1, in standby mode, recorded a monitor failure when checking the web servers on May 15, 2017 @ 3:13 am. This message went unnoticed until the day of the failover event.
- From LB_1, the web server nodes were apparently considered online (blue square), but the web virtual servers (ports 25, 80, 443, etc.) were considered unreachable until the failover on June 23. Logs from the servers do not show attempts from LB_1 to access the web servers’ online status report page. We assume LB_1 either did not try, or failed in trying, to hit the servers’ status pages.
- LB_1, in standby mode, recorded a monitor failure when checking the user authentication servers on May 21, 2017 @ 5:56 am.
- The user authentication nodes were apparently considered online, but the user authentication services were considered unreachable until the failover on June 23. We could not confirm from the server side whether LB_1 was attempting to test online conditions.
- LB_2 experienced a network heartbeat daemon failure on June 23, 2017 @ 5:16 am.
- LB_2 failed over responsibilities to LB_1 at 5:16:24.880 am.
- LB_1’s logs show it updating a few ASM configurations, but no warnings or new offline events were reported for the web and user authentication services.
- LB_2 and LB_1 both reported configurations “In Sync” at all times during this event.
- Communication to the web and user authentication services was effectively “blackholed”, and the virtual servers providing those services were listed as disabled with red diamond indicators on the now-active LB_1.
- Service was manually failed back to LB_2 at 9:40 am, and all services became available.
- LB_1 remained in the same state for several hours while some investigation took place.
- We tried disabling and re-enabling the nodes that were marked offline with red diamonds. The services did not return, and LB_1 did not log a new offline message for the nodes using the monitor.
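For anyone reproducing this step, the disable/re-enable attempt above corresponds roughly to the following tmsh commands. A sketch only; the node name is a placeholder, not one of our real objects:

```shell
# Force a node offline and bring it back, which should normally
# prompt the monitor to re-evaluate it (node name is illustrative)
tmsh modify ltm node web_node_01 state user-down
tmsh modify ltm node web_node_01 state user-up

# Check how the monitor sees the node afterwards
tmsh show ltm node web_node_01
```

In our case this cycle produced no new monitor log entries at all, which is part of what makes the failure look like a stuck monitoring daemon rather than a genuinely unreachable backend.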
- When LB_1 was rebooted at 21:40, all services were rediscovered and normal monitor checks were taking place again.
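The full reboot worked here; a lighter-weight option we have since read about is restarting just bigd, the BIG-IP health-monitor daemon, which owns the checks that appeared stuck. A hedged sketch, to be validated in a maintenance window rather than taken as a fix:

```shell
# Restart only the health-monitor daemon instead of rebooting the
# whole unit (confirm the impact in your own environment first)
bigstart restart bigd

# Watch the LTM log to see whether monitor activity resumes
tail -f /var/log/ltm
```

If restarting bigd alone recovers monitoring, that would also narrow the root cause to that daemon.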
Note: This post appears similar to our situation - Node marked offline in standby
It was a nuclear FAIL.
The units reported "In Sync", but they clearly were not, and the standby unit was unprepared to take over the active unit's responsibilities.
- We cannot find any log explaining why LB_1 failed to conduct subsequent health checks on various servers.
- We cannot determine from the logs why new health checks did not occur when LB_2 failed over to LB_1.
We want to prevent something like this from happening again.
Here are 2 questions for the board:
Can you share advice to help prevent a recurrence?
Can you share any advice to further investigate?