Forum Discussion
BIG-IP : iControl : LocalLBDataGroupFile.set_local_path() : swap large file under load
F5 BIG-IP v11.4.1 (Build 635.0) LTM on ESXi
I have a .NET C# app that uses iControl to perform the following sequence :
• transfer data-file to staging location on BIG-IP device
• recache data-group with contents of this data-file
The specific iControl API used for the recache is :
LocalLBDataGroupFile.set_local_path()
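For readers who want to picture the two-step sequence, here is a minimal sketch in Python (not the .NET code of the actual app). The `client` object and its `upload_chunk` and `recache` methods are hypothetical stand-ins for whatever wrapper performs the file transfer and the `LocalLBDataGroupFile.set_local_path()` call; the chunk size is an arbitrary assumption, and `RecordingClient` is only an in-memory fake so the sketch can run without a device.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 64 * 1024  # assumed chunk size, not taken from the post


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[bytes, bool, bool]]:
    """Yield (chunk, is_first, is_last) so the receiver can reassemble the file."""
    if not data:
        yield b"", True, True
        return
    total = len(data)
    for offset in range(0, total, chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield chunk, offset == 0, offset + chunk_size >= total


def update_data_group(client, data: bytes, staging_path: str, data_group: str) -> None:
    """Step 1: stage the file on the device. Step 2: recache the data-group from it."""
    for chunk, is_first, is_last in iter_chunks(data):
        # Hypothetical wrapper around the file-transfer call
        client.upload_chunk(staging_path, chunk, is_first, is_last)
    # Hypothetical wrapper around LocalLBDataGroupFile.set_local_path()
    client.recache(data_group, staging_path)


class RecordingClient:
    """In-memory fake that records calls instead of talking to a BIG-IP."""

    def __init__(self):
        self.calls = []

    def upload_chunk(self, path, chunk, is_first, is_last):
        self.calls.append(("upload", path, len(chunk), is_first, is_last))

    def recache(self, data_group, path):
        self.calls.append(("recache", data_group, path))
```

The point of the sketch is only the ordering: the recache is issued once, after the last chunk has been staged.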
This operation has been 100% consistently successful in non-prod environments with very low traffic to perform data-group-file updates of up to 2M records.
Non-prod config : single stand-alone device ( no HA pair ).
The operation has also been 100% consistently successful in prod environments with high traffic to update a non-live data-group-file up to 300K records.
Prod config : HA pair consisting of 2-node sync-failover device-group with auto-sync disabled.
NOTE: By "live data-group-file" I mean an enabled virtual-server has an assigned iRule that references the data-group ( performs matches against data-group maps ). By "non-live data-group-file" I mean that the data-group exists but either is not referenced by any iRules, or iRules that reference it currently are not assigned to any enabled virtual-server.
Here is where the problem occurs :
When the operation is run in prod environments with high traffic (40-60% baseline cpu utilization, 200-400 Mbps baseline throughput) to update a live data-group-file ( 100K+ records ) the iRule fails.
Exactly how the iRule fails is unknown and is currently under investigation by F5 Support. However, here are some data points :
• the file-transfer and data-group recache iControl calls return success to the C# caller
• requests that the iRule normally conditionally rewrites to various backend pools no longer arrive at those servers
• BIG-IP logs contain zero errors related to either the iControl operation or the iRule
• public client requests that should be processed by the iRule display generic Akamai 500 error-pages
NOTE: I have a test that removes Akamai from the equation, but have not yet had an opportunity to run it.
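Since the iControl calls report success and the logs stay clean, the failure is only visible from the client side. A small post-update probe sent directly to the virtual server (bypassing Akamai) is one way to catch it. The sketch below is an assumption-laden illustration in Python: the `fetch_status` callable, the probe paths, and the rule "any 5xx or missing response counts as a failure" are all hypothetical and not part of my actual setup.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# fetch_status(path) returns an HTTP status code, or None if no response arrived.
FetchStatus = Callable[[str], Optional[int]]


def probe_after_update(
    fetch_status: FetchStatus, paths: Iterable[str]
) -> List[Tuple[str, Optional[int]]]:
    """Return the (path, status) pairs that look like failures."""
    failures = []
    for path in paths:
        status = fetch_status(path)
        # Assumed failure rule: no response at all, or any 5xx status
        if status is None or status >= 500:
            failures.append((path, status))
    return failures
```

In practice `fetch_status` would wrap an HTTP request aimed at the virtual server's address; it is left injectable here so the check can be exercised without a live device.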
My understanding is that LocalLBDataGroupFile.set_local_path() was re-designed/coded for 11.4 and was lab-tested up to 1M records. However, I wonder if any testing was performed in an environment with significant load ?
Through trial-and-error I discovered the following workaround (for an HA pair only) :
• create a/b pair of data-groups and corresponding set of a/b iRules that are identical except that "a" iRule references "a" data-group, and "b" iRule references "b" data-group
• on active node, initially configure virtual-server to use "a" iRule
• use the C# application to update "b" data-group-file
( NOTE: possibly this could also be accomplished via the admin browser, but above 100K records the time-lags and potential impact on prod operations become concerning. )
• if the "b" iRule is now swapped in on the virtual-server ( effectively swapping-in "b" data-group ), the iRule begins to behave strangely ( requests are swallowed and never routed to the backend pool, although no errors are present in the LTM logs )
• however, the following "trick" seems to work :
- sync active to standby
- promote standby to active
- on new active, swap-in "b" iRule to the virtual-server
- reboot new standby
- sync new active to new standby
Somehow the sync operation "cures" the issues induced by swapping the live iRule to point to a just-updated data-group.
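To make the ordering of the workaround explicit, here is a small Python sketch that only builds the step list for a given starting slot. It does not talk to a BIG-IP; the function names and the "a"/"b" slot labels are hypothetical, and the step order simply follows the list above.

```python
from typing import List


def other_slot(active_slot: str) -> str:
    """Return the slot that is not currently referenced by the virtual-server."""
    if active_slot not in ("a", "b"):
        raise ValueError("active_slot must be 'a' or 'b'")
    return "b" if active_slot == "a" else "a"


def ab_swap_plan(active_slot: str) -> List[str]:
    """Ordered steps for swapping to the other data-group / iRule pair."""
    target = other_slot(active_slot)
    return [
        f"update data-group '{target}' via iControl",
        "sync active to standby",
        "promote standby to active",
        f"on new active, assign iRule '{target}' to the virtual-server",
        "reboot new standby",
        "sync new active to new standby",
    ]


if __name__ == "__main__":
    for number, step in enumerate(ab_swap_plan("a"), start=1):
        print(number, step)
```

The key detail is that the updated data-group is never put behind live traffic until after the first sync and the failover.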
So in summary, it seems that in a high-load environment, attempting to swap new contents into a live data-group somehow induces a failure-case for iRule lookups against that data-group.
The failure symptoms are identical both for the technique of re-caching the live data-group with new contents ( iControl API LocalLBDataGroupFile.set_local_path() ), and for the iRule a/b swap technique.
However, an active-to-standby sync operation seems to "cure" whatever bad-state the data-group has been put into.
Can anyone provide insights as to why swapping-in new contents to a large data-group-file associated with an iRule assigned to a VIP under heavy load would cause iRule data-group lookup failures ?
1 Reply
- samstep
Cirrocumulus
I suggest you raise a support call with F5 if you experience any "site down" type failures. I have seen TMM crash, dump core and restart under heavy load with some iControl calls before, and the impact varies from version to version.
You need to examine the LTM logs at the time the VIP came down - there should be plenty of log messages informing about the reason for the service interruption.