Knowledge sharing: Investigating/troubleshooting crash and failover events
1. Many crash events generate a core file, but for some no core file is generated, and F5 may need to provide an engineering hotfix just to log what causes the crash.
2. A useful note is that the core file name may include the name of the process that generated it. The "user.log" file is also helpful, and the bug tracker (https://support.f5.com/csp/bug-tracker?sf189923893=1) and the "Fixes and Known Issues" sections of the release notes can be checked for any issues with that process. Also check the kern and ltm logs, since the system may have been overutilized before the issue; I have explained such checks in the article below:
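As a rough illustration of the core file note above, here is a minimal Python sketch that groups core files by the process that probably produced them. It assumes the file name starts with the process name followed by a dot; the file names and the /var/core path in the example are hypothetical and only illustrate the idea, so check the names on your own system.

```python
from collections import Counter
from pathlib import Path


def process_from_core(core_path: str) -> str:
    """Return the text before the first dot of the file name.

    Assumption: the core file name begins with the process name.
    """
    return Path(core_path).name.split(".", 1)[0]


def count_cores_by_process(core_paths):
    """Count how many core files each (assumed) process name has."""
    return Counter(process_from_core(p) for p in core_paths)


# Hypothetical file names, used only for illustration.
cores = [
    "/var/core/tmm.0.bld0.0.11.core.gz",
    "/var/core/tmm.1.bld0.0.11.core.gz",
    "/var/core/mcpd.bld0.0.11.core.gz",
]
for process, count in count_cores_by_process(cores).items():
    print(process, count)
```

The process name found this way can then be used as the search term in the bug tracker and release notes.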
3. For a tmm crash, the "tmm" logs are informative: some issues leave error messages that show what went wrong even without waiting for the TAC to open the core file, which can take a lot of time. For example, in older versions there was an issue when changing the SSL profile settings while using SNI. The audit log also helps here, since many bugs are triggered by a config change and the bug can then be found more easily; before version 11.6, however, audit logging had to be enabled first, and that was really bad (https://support.f5.com/csp/article/K16304).
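To show what reviewing the logs around a crash can look like, here is a minimal Python sketch that keeps only the lines stamped in the minutes before the crash time. It assumes each line starts with a classic syslog timestamp ("Mon DD HH:MM:SS", no year); the sample lines, host name, and process IDs are hypothetical and not real log output.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' timestamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def lines_before_crash(lines, crash_time, minutes=10):
    """Return lines stamped within `minutes` before crash_time (inclusive)."""
    start = crash_time - timedelta(minutes=minutes)
    selected = []
    for line in lines:
        stamp = parse_stamp(line, crash_time.year)
        if stamp is not None and start <= stamp <= crash_time:
            selected.append(line)
    return selected


# Hypothetical log lines, used only for illustration.
sample = [
    "Mar  3 09:50:00 bigip1 tmm[1234]: hypothetical early message",
    "Mar  3 10:12:41 bigip1 tmm[1234]: hypothetical message shortly before the crash",
    "Mar  3 10:20:00 bigip1 tmm[5678]: hypothetical message after the restart",
]
crash = datetime(2024, 3, 3, 10, 15, 0)
for line in lines_before_crash(sample, crash):
    print(line)
```

The same window can be applied to the audit log to see which config changes were made just before the crash.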
Examples:
4. The crash can also be caused by running commands that return so many records that they overwhelm memory, such as "tcpdump" without filters, "show sys connection" without filters, or reviewing persistence records from the CLI or GUI without filters.
Examples:
- https://cdn.f5.com/product/bugtracker/ID827293.html
- https://support.f5.com/csp/article/K20234023
- https://support.f5.com/csp/article/K15246
- https://support.f5.com/csp/article/K44284472
- https://cdn.f5.com/product/bugtracker/ID490537.html
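One way to avoid running an unfiltered capture by accident is to build the command through a small helper that refuses an empty filter and always caps the packet count. The Python sketch below is only an illustration: the helper name, the default limit of 1000 packets, and the VLAN name "internal" are assumptions, while "-n", "-i", "-c" and "-w" are standard tcpdump options.

```python
import shlex


def build_tcpdump(interface, capture_filter, max_packets=1000, out_file=None):
    """Build a tcpdump argument list that always has a filter and a packet cap."""
    if not capture_filter.strip():
        raise ValueError("refusing to build an unfiltered capture")
    if max_packets <= 0:
        raise ValueError("max_packets must be positive")
    cmd = ["tcpdump", "-n", "-i", interface, "-c", str(max_packets)]
    if out_file:
        cmd += ["-w", out_file]  # write packets to a file instead of the terminal
    cmd += shlex.split(capture_filter)
    return cmd


# "internal" is a hypothetical VLAN name; the filter is an example only.
print(shlex.join(build_tcpdump("internal", "host 10.0.0.1 and port 443")))
```

The helper only builds the command line; it does not run anything, so the result can be reviewed before it is pasted into the CLI.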
5. A strange issue/bug is when the device is constantly in an offline state. In that case check the HA table: sometimes this is expected, as with VLAN or gateway fail-safe, but sometimes something gets stuck and the process or the whole device needs a reboot:
- https://support.f5.com/csp/article/K20060182
- https://support.f5.com/csp/article/K15367
- https://support.f5.com/csp/article/K13297
6. Another good note: when rebooting F5 VIPRION blade systems, do a full reboot of all blades with the clsh command/tool. For virtual editions, always check whether the reboot was caused by, for example, a VMware ESXi issue or someone simply shutting down/rebooting the F5 VM. For vCMP, always check the vCMP host ltm and tmm logs as well when vCMP guests are crashing or failing over. If a vCMP guest is stuck, it is best to redeploy it from the vCMP host, as this will clear many errors.
- JRahmAdmin
good stuff, Nikoolayy1