Forum Discussion
BIG-IP : iControl : LocalLBDataGroupFile.set_local_path() : swap large file under load
F5 BIG-IP v11.4.1 (Build 635.0) LTM on ESXi
I have a .NET C# app that uses iControl to perform the following sequence:
• transfer data-file to staging location on BIG-IP device
• recache data-group with contents of this data-file
The specific iControl API used for the recache is :
LocalLBDataGroupFile.set_local_path()
This operation has been 100% consistently successful in non-prod environments with very low traffic to perform data-group-file updates of up to 2M records.
Non-prod config : single stand-alone device ( no HA pair ).
The operation has also been 100% consistently successful in prod environments with high traffic to update a non-live data-group-file up to 300K records.
Prod config : HA pair consisting of 2-node sync-failover device-group with auto-sync disabled.
NOTE: By "live data-group-file" I mean an enabled virtual-server has an assigned iRule that references the data-group ( performs matches against data-group maps ). By "non-live data-group-file" I mean that the data-group exists but either is not referenced by any iRules, or iRules that reference it currently are not assigned to any enabled virtual-server.
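To make the live / non-live distinction concrete, here is a minimal sketch of it as a predicate, written in Python purely for illustration. The `IRule` and `VirtualServer` classes and the `is_live` function are hypothetical names of mine, not iControl types; the inputs would have to be collected from the BIG-IP configuration by whatever means you prefer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IRule:
    name: str
    data_groups: set[str]  # names of data-groups the iRule references


@dataclass
class VirtualServer:
    name: str
    enabled: bool
    irules: list[str] = field(default_factory=list)  # assigned iRule names


def is_live(data_group: str,
            irules: list[IRule],
            virtual_servers: list[VirtualServer]) -> bool:
    """True if an enabled virtual-server has an iRule that references data_group."""
    referencing = {r.name for r in irules if data_group in r.data_groups}
    return any(
        vs.enabled and referencing.intersection(vs.irules)
        for vs in virtual_servers
    )
```

Under this definition a data-group referenced only by unassigned iRules, or only by iRules on disabled virtual-servers, counts as non-live.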
Here is where the problem occurs :
When the operation is run in prod environments with high traffic (40-60% baseline cpu utilization, 200-400 Mbps baseline throughput) to update a live data-group-file ( 100K+ records ) the iRule fails.
Exactly how the iRule fails is unknown and currently is under investigation by F5 Support, however here are some data-points :
• the file-transfer and data-group recache iControl calls return success to the C# caller
• requests that the iRule normally conditionally rewrites to various backend pools no longer arrive at those servers
• BIG-IP logs contain zero errors related to either the iControl operation or the iRule
• public client requests that should be processed by the iRule display generic Akamai 500 error-pages
NOTE: I have a test that removes Akamai from the equation, but have not yet had an opportunity to run it.
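Since the iControl calls report success while traffic silently fails, a caller-side check after each update may help detect the bad state sooner. The following is a rough Python sketch, not what my app currently does: `failed_probes` and the injected `send` callable are hypothetical, and `send` is assumed to issue one request (ideally straight at the virtual-server, bypassing Akamai) and return the HTTP status code.

```python
from __future__ import annotations

from typing import Callable, Iterable


def failed_probes(probes: Iterable[tuple[str, int]],
                  send: Callable[[str], int]) -> list[str]:
    """Return the probe URLs whose status differs from the expected status.

    probes: (url, expected_status) pairs for requests the iRule should route.
    send:   hypothetical callable that performs the request and returns the
            HTTP status code; any exception it raises counts as a failure.
    """
    failures = []
    for url, expected_status in probes:
        try:
            status = send(url)
        except Exception:
            status = -1  # transport error: treat as a failed probe
        if status != expected_status:
            failures.append(url)
    return failures
```

An empty result means every probe came back with the expected status; a non-empty result right after a recache would be a signal to roll back or fail over.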
My understanding is that LocalLBDataGroupFile.set_local_path() was redesigned and recoded for 11.4 and was lab-tested up to 1M records. However, I wonder whether any testing was performed in an environment with significant load?
Through trial-and-error I discovered the following workaround (for an HA pair only) :
• create a/b pair of data-groups and corresponding set of a/b iRules that are identical except that "a" iRule references "a" data-group, and "b" iRule references "b" data-group
• on active node, initially configure virtual-server to use "a" iRule
• use C# application to update "b" data-group-file
( NOTE: possibly this could also be accomplished via the admin browser, but above 100K records the time-lags and potential impact on prod operations become concerning. )
• if you now swap in the "b" iRule on the virtual-server ( effectively swapping in the "b" data-group ), the iRule begins to behave strangely ( requests are swallowed and never routed to the backend pool, although no errors appear in the LTM logs )
• however, the following "trick" seems to work :
- sync active to standby
- promote standby to active
- on new active, swap-in "b" iRule to the virtual-server
- reboot new standby
- sync new active to new standby
Somehow the sync operation "cures" the issues induced by swapping the live iRule to point to a just-updated data-group.
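For anyone who wants to script the workaround, the ordering of the steps can be sketched as below. This is only a Python illustration of the sequence described above: `other_slot` and `plan_swap` are hypothetical helpers, the steps are plain descriptions rather than real iControl calls, and I am assuming the reboot and the final sync are two separate steps.

```python
def other_slot(slot: str) -> str:
    """Return the idle slot of the a/b pair."""
    if slot not in ("a", "b"):
        raise ValueError(f"unknown slot: {slot!r}")
    return "b" if slot == "a" else "a"


def plan_swap(active_slot: str) -> list[str]:
    """Ordered steps to move the virtual-server from active_slot to the other slot."""
    target = other_slot(active_slot)
    return [
        f"update data-group-file '{target}' on the active node",
        "sync active to standby",
        "promote standby to active",
        f"on new active, assign iRule '{target}' to the virtual-server",
        "reboot new standby",
        "sync new active to new standby",
    ]
```

The point of the sketch is that the iRule swap happens only after the sync and failover, never directly on the node that just received the updated data-group-file.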
So in summary it seems that for a high-load environment attempting to swap new contents into a live data-group somehow induces a failure-case for iRule lookups against that data-group.
The failure symptoms are identical both for the technique of re-caching the live data-group with new contents ( iControl API LocalLBDataGroupFile.set_local_path() ) and for the iRule a/b swap technique.
However, an active-to-standby sync operation seems to "cure" whatever bad-state the data-group has been put into.
Can anyone provide insights as to why swapping-in new contents to a large data-group-file associated with an iRule assigned to a VIP under heavy load would cause iRule data-group lookup failures ?
- samstep (Cirrocumulus)
I suggest you raise a support call with F5 if you experience any "site down" type failures. I have seen TMM crash, dump core, and restart under heavy load with some iControl calls before, and the impact varies from version to version.
You need to examine the LTM logs from the time the VIP went down. There should be plenty of log messages indicating the reason for the service interruption.