Forum Discussion

Chris_Phillips
Mar 14, 2017

Monitoring ephemeral pool members & nodes

Hi, I'm looking to use the dynamic FQDN resolution for nodes / pool members, but have a few queries...

 

If an Ephemeral pool member goes down, does that failure trigger anything on the DNS / Node side? An immediate re-resolution in case the result has changed, etc.? Resolution appears to happen wholly at the node level, suggesting that the health of the pool member is irrelevant, that membership is always limited to the IPs returned periodically by the DNS lookup attached to the node, and that "DNS monitoring" merely means a resolution occurred.

 

The documentation about "auto populate" is confusing me. What is the real-life difference between Enabled and Disabled? I see that when Disabled, the very first result is always used; however, it still creates the Ephemeral node, it's still done periodically, etc., so the only meaningful difference seems to arise if more than one A record is returned at a time. There's reference to Enabled removing members that are no longer being returned, but isn't that already implicitly true for Disabled? If that one result changes, then the pool member will change accordingly. Is it really any more meaningful than "Disabled = ignore additional results, Enabled = create nodes for all answers"?

 

What would it mean for a node to auto populate, but a pool member using that node NOT to be set to do that? I see we only get a single pool member, but multiple nodes... but what is the consequence of this? Is there a scenario in which this would be useful?

 

How does the node resolution interval work with the DNS Cache option in the system settings? The resolution does use these settings, right? Would it make sense to set the resolution interval low, ~5 seconds, and enable caching on the LTM, meaning that the name would be resolved almost as soon as the TTL expires on the record and it falls out of the cache? Could this be seen as a realistic best practice, or are there dragons hiding around here? Setting the node resolution to an arbitrary one-hour interval, as per the default, seems very dangerous to me.
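For reference, this is the sort of configuration I have in mind (a sketch only; the node name, pool name, FQDN and intervals are all just examples):

    # FQDN node with a short refresh interval instead of the default 3600 seconds
    tmsh create ltm node app_node fqdn { name app.example.com autopopulate enabled interval 5 down-interval 5 }

    # Pool member created against that FQDN
    tmsh create ltm pool app_pool members add { app.example.com:80 { fqdn { autopopulate enabled name app.example.com } } } monitor http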

 

Thanks!

 

  • If an Ephemeral pool member goes down, does that failure trigger anything on the DNS / Node side? No, a new DNS request is only made after the TTL expires, or after bigd is restarted.
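    For example, the only way I know of to force an immediate re-resolution is to restart bigd (a sketch; note that restarting bigd briefly disrupts health monitoring, so use it with care):

        # Force fresh DNS lookups for all FQDN nodes by restarting bigd
        bigstart restart bigd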

     

    What is the real life difference between Enabled and Disabled? Disabled means the system will only use one IP from the DNS response. Enabled means the system will use all IPs in the response, and will also delete ephemeral nodes that are no longer present in the response.
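    As a sketch (the node name and FQDN are just examples), the setting lives in the node's fqdn block:

        # Enabled: every A record in the response becomes an ephemeral node
        tmsh create ltm node web_node fqdn { name www.example.com autopopulate enabled }

        # Disabled: only the first A record in the response is used
        tmsh create ltm node web_node fqdn { name www.example.com autopopulate disabled }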

     

    What would it mean for a node to auto populate, but a pool member using that node to NOT be set to do that? The auto populate option is only for nodes, as the FQDN only applies to nodes.
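    If it helps, you can see the effect of auto populate directly (a sketch; web_pool is an example name):

        # Ephemeral nodes created from the DNS answers appear alongside the FQDN node
        tmsh list ltm node

        # Ephemeral pool members appear in the pool's member list
        tmsh show ltm pool web_pool members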

     

    How does the node resolution interval work with the DNS Cache option in the system settings? The bigd process, which is responsible for monitoring, is also responsible for the DNS queries for this functionality. It will cache the response until the TTL expires or the process is restarted.
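    For what it's worth, the lookups go to the DNS servers configured at the system level, which you can check and adjust with something like this (a sketch; the server address is an example):

        # Show the system DNS servers used for FQDN node resolution
        tmsh list sys dns

        # Add a resolver if needed
        tmsh modify sys dns name-servers add { 192.0.2.53 }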

     

    See this solution for more information:

     

    https://support.f5.com/csp/article/K47726919

     

  • Last night I deployed a config which did use auto populate. I came back to it a few hours later and found the following...

     

    1) An unrelated change caused all network connectivity to fail, so all pool members went down, including those in this new pool.

     

    2) An hour later the change was rolled back (by UCS load) and connectivity came back up. At THIS point, the ephemeral pool member (only one A record was coming back anyway) was deleted.

     

    In the logs I see bigd restarts etc., and three messages saying this ephemeral pool member "was not found", plus one saying the parent node wasn't found either. What does that actually mean?

     

    3) From that point on, nothing was resolved. The refresh interval was 300 seconds and the down interval 5 seconds, yet 5 hours elapsed with that pool completely empty, despite the network being fine again.

     

    4) To nudge the code, I changed the refresh interval on the node from 300 to 301 seconds, and the pool immediately repopulated with the same member as before.
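    For reference, the tmsh equivalent of that nudge would be something like this (the node name is an example):

        # Bumping the interval appeared to force the node to re-resolve
        tmsh modify ltm node app_node fqdn { interval 301 }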

     

    So clearly the node was deleted based on some form of logic, but only once network connectivity was restored; and once deleted, nothing ever changed again... Can this behaviour be explained?

     

    I've since recreated the node and pool with auto populate disabled in case it makes a difference, but I can't see how the observed behaviour could be by design, and therefore how it could depend on that feature being turned on or off in the first place.

     

  • If the functionality does not work as per the documentation, you have a valid reason to open an F5 support case. They can analyze your data (qkview, full logs, etc.) and provide an explanation. That would be difficult to do here via DevCentral, as I don't have access to your data.