big3d timeouts
We are experiencing intermittent big3d timeout errors from our GTM sync group. It seems that the GTM whose gtmd is selected to poll is okay, but the other GTMs in our sync group will report a big3d timeout:
GTM2.ABC.LOCAL alert gtmd[12345]: 011a6006:1: SNMP_TRAP: VS /Common/website_vs (ip:port=192.168.1.100:443) (Server /Common/LTM1.ABC.LOCAL) state change green --> red ( Monitor /Common/bigip from /Common/GTM1.ABC.LOCAL : no reply from big3d: timed out)
GTM3.ABC.LOCAL alert gtmd[12345]: 011a6006:1: SNMP_TRAP: VS /Common/website_vs (ip:port=192.168.1.100:443) (Server /Common/LTM1.ABC.LOCAL) state change green --> red ( Monitor /Common/bigip from /Common/GTM1.ABC.LOCAL : no reply from big3d: timed out)
GTM4.ABC.LOCAL alert gtmd[12345]: 011a6006:1: SNMP_TRAP: VS /Common/website_vs (ip:port=192.168.1.100:443) (Server /Common/LTM1.ABC.LOCAL) state change green --> red ( Monitor /Common/bigip from /Common/GTM1.ABC.LOCAL : no reply from big3d: timed out)
The VS will remain in a down state on those GTMs until we trigger some kind of change, like adding or removing a GTM generic host. But that change might in turn trigger a big3d timeout on another VS. There doesn't seem to be any pattern to which VSs end up with big3d timeouts, and it's not the same set of GTMs every time either.
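For anyone wanting to check for a pattern the same way, a quick tally of the traps out of /var/log/gtm on each GTM is something like this (the patterns assume the trap format shown above and our /Common/GTM* naming):

# timeouts per virtual server
grep 'no reply from big3d: timed out' /var/log/gtm | grep -o 'VS /Common/[^ ]*' | sort | uniq -c | sort -rn

# timeouts per probing GTM
grep 'no reply from big3d: timed out' /var/log/gtm | grep -o 'from /Common/GTM[^ ]*' | sort | uniq -c | sort -rn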
Our LTMs have virtual server discovery enabled (no delete), and about 1500 total VSs are discovered across 6 LTM pairs. The LTMs have Service Check disabled in their iQuery settings because we do not want the LTMs performing monitoring for the ~400 generic hosts.
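For reference, this is roughly how those per-server settings look in tmsh on our GTMs (server names are ours; I believe the relevant properties are virtual-server-discovery and iq-allow-service-check, but double-check against your version's docs):

tmsh list gtm server /Common/LTM1.ABC.LOCAL virtual-server-discovery iq-allow-service-check

# discovery without deletion, service check disabled
tmsh modify gtm server /Common/LTM1.ABC.LOCAL virtual-server-discovery enabled-no-delete iq-allow-service-check no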
I looked at https://my.f5.com/manage/s/article/K52381445 to try to track down the reason for the big3d timeouts. The iQuery mesh looks good and I can see packets incrementing between all BIG-IP devices. I enabled gtm.debugprobelogging, set log.gtm.level to debug on all the GTMs, and reproduced the big3d timeouts. That generated many 'probe to' and 'probe from' messages, and I could see references to various VSs in the events, but I found no reference to the name or IP address of the VS that timed out. I'm not sure debug logging is working as expected, because I also searched for a VS that remained healthy for the 15 minutes I was running debugging and found no reference to it either.
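This is roughly what I ran, in case I missed a step (db variable names as above; I believe the log.gtm.level default is notice, so confirm your current values before reverting):

tmsh show gtm iquery                      # confirm iQuery mesh/connection state
tmsh modify sys db gtm.debugprobelogging value enable
tmsh modify sys db log.gtm.level value debug

# ...reproduce the timeout, then search for the affected VS by name and IP...
grep -iE 'website_vs|192.168.1.100' /var/log/gtm

# revert when done
tmsh modify sys db gtm.debugprobelogging value disable
tmsh modify sys db log.gtm.level value notice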
Are there any recommendations for tracking down the underlying cause of the timeouts? I tried capturing the iQuery traffic to see what is happening on the wire, but I was unable to decrypt it; I'm guessing the mutual TLS authentication is preventing me from doing this. We are rather hesitant to start poking at configuration settings without getting a better idea of the issue.
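For context on the capture attempt: iQuery runs over TCP 4353 and is TLS-encrypted with certificate-based authentication between the devices, so a capture along these lines (the peer address is a placeholder) only shows connection behavior, not probe contents:

tcpdump -nni 0.0 host 10.0.0.10 and port 4353

I also understand the iqdump utility on a GTM can open its own authenticated session to a remote big3d and print the XML it returns (iqdump 10.0.0.10), though I haven't gone down that path yet.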
F5 support indicated the cause was related to bug IDs 1046785 and 1128369.
The first bug matched exactly what we were experiencing. After we upgraded our integration environment to 16.1.3.4, which contains the fix, I confirmed we were no longer seeing unexpected big3d timeouts. Our prod environment wasn't scheduled to be upgraded for a few more weeks, so in the meantime we applied the workaround of increasing Max Synchronous Monitor Requests (MSMR) from the default of 20 to 200, per F5's recommendation. This appeared to fix the issue as well. We have since upgraded our prod environment to 16.1.3.4, but we decided to leave MSMR at 200 due to the relatively large number of VSs in our environment.
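If you apply the same workaround, a simple sanity check like this against the running version and the current GTM log is enough to confirm the timeouts have stopped recurring:

tmsh show sys version
grep -c 'no reply from big3d: timed out' /var/log/gtm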
The latter bug ID is not public, but F5 support said it is related to monitoring several LTMs with the same bigip monitor. This can result in large bursts of probes being sent to a big3d at the same time, which can cause mcpd's queue to fill and block, leading big3d to drop the query. The recommended fix is to apply unique bigip monitors to the LTM clusters with slightly staggered intervals (e.g. 30 s, 31 s, and so on), as sketched below. We decided not to pursue this, since addressing the first bug appears to have fixed our big3d timeout issue.
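For completeness, my understanding is that the staggered-monitor approach would look roughly like this in tmsh (monitor and server names are illustrative, and the default bigip interval/timeout of 30/90 is assumed; we did not actually deploy this):

tmsh create gtm monitor bigip bigip_ltm1 defaults-from bigip interval 30 timeout 90
tmsh create gtm monitor bigip bigip_ltm2 defaults-from bigip interval 31 timeout 93
tmsh modify gtm server /Common/LTM1.ABC.LOCAL monitor bigip_ltm1
tmsh modify gtm server /Common/LTM2.ABC.LOCAL monitor bigip_ltm2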