Forum Discussion

BaltoStar_12467
Nov 02, 2014

BIG-IP : iControl : LocalLBDataGroupFile.set_local_path() : swap large file under load

F5 BIG-IP v11.4.1 (Build 635.0) LTM on ESXi

I have a .NET C# app that uses iControl to perform the following sequence :

• transfer data-file to staging location on BIG-IP device

• recache data-group with contents of this data-file

The specific iControl API used for the recache is :

LocalLBDataGroupFile.set_local_path()

In non-prod environments with very low traffic, this operation has succeeded 100% of the time for data-group-file updates of up to 2M records.

Non-prod config : single stand-alone device ( no HA pair ).

In prod environments with high traffic, the operation has also succeeded 100% of the time when updating a non-live data-group-file of up to 300K records.

Prod config : HA pair consisting of 2-node sync-failover device-group with auto-sync disabled.

NOTE: By "live data-group-file" I mean an enabled virtual-server has an assigned iRule that references the data-group ( performs matches against data-group maps ). By "non-live data-group-file" I mean that the data-group exists but either is not referenced by any iRules, or iRules that reference it currently are not assigned to any enabled virtual-server.
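
To make the live / non-live distinction concrete, here is a minimal Python sketch that classifies a data-group from a simplified in-memory model of the configuration. The `IRule` and `VirtualServer` classes and their fields are hypothetical stand-ins for illustration, not iControl types; a real check would have to populate them from the device.

```python
from dataclasses import dataclass, field


@dataclass
class IRule:
    name: str
    # Names of the data-groups this iRule performs matches against (hypothetical field).
    data_groups: set = field(default_factory=set)


@dataclass
class VirtualServer:
    name: str
    enabled: bool
    # Names of the iRules assigned to this virtual-server (hypothetical field).
    irules: list = field(default_factory=list)


def is_live(data_group, virtual_servers, irules):
    """Return True if an enabled virtual-server has an assigned iRule
    that references data_group; otherwise the data-group is non-live."""
    rules_by_name = {rule.name: rule for rule in irules}
    for vs in virtual_servers:
        if not vs.enabled:
            continue  # iRules on disabled virtual-servers do not count
        for rule_name in vs.irules:
            rule = rules_by_name.get(rule_name)
            if rule is not None and data_group in rule.data_groups:
                return True
    return False
```

Under this model, a data-group referenced only by an iRule on a disabled virtual-server, or not referenced at all, is reported as non-live.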

Here is where the problem occurs :

When the operation is run in prod environments with high traffic (40-60% baseline cpu utilization, 200-400 Mbps baseline throughput) to update a live data-group-file ( 100K+ records ) the iRule fails.

Exactly how the iRule fails is unknown and currently is under investigation by F5 Support, however here are some data-points :

• the file-transfer and data-group recache iControl calls return success to the C# caller

• requests that the iRule normally conditionally rewrites to various backend pools no longer arrive at those servers

• BIG-IP logs contain zero errors related to either the iControl operation or the iRule

• public client requests that should be processed by the iRule display generic Akamai 500 error-pages

NOTE: I have a test that removes Akamai from the equation, but have not yet had an opportunity to run it.

My understanding is that LocalLBDataGroupFile.set_local_path() was re-designed/coded for 11.4 and was lab-tested up to 1M records. However, I wonder if any testing was performed in an environment with significant load ?

Through trial-and-error I discovered the following workaround (for an HA pair only) :

• create a/b pair of data-groups and corresponding set of a/b iRules that are identical except that "a" iRule references "a" data-group, and "b" iRule references "b" data-group

• on active node, initially configure virtual-server to use "a" iRule

• use C# application to update "b" data-group-file

( NOTE: possibly this could also be accomplished via the admin browser, but above 100K records the time-lags and potential impact on prod operations become concerning. )

• if I now swap in the "b" iRule on the virtual-server ( effectively swapping in the "b" data-group ), the iRule begins to behave strangely ( requests are swallowed and never routed to a backend pool, although no errors appear in the LTM logs )

• however, the following "trick" seems to work :

  • sync active to standby
  • promote standby to active
  • on new active, swap-in "b" iRule to the virtual-server
  • reboot new standby
  • sync new active to new standby
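
The a/b workaround above can be sketched as a small planning function. This is only an illustration in Python (the actual application is C#): the names `rule_a`, `dg_b` and so on, and the step strings, are hypothetical. The function only decides which slot to update and lists the steps in the order described; it does not talk to a BIG-IP.

```python
# Hypothetical names for the a/b pair of iRules and data-groups.
AB_PAIRS = {
    "a": {"irule": "rule_a", "data_group": "dg_a"},
    "b": {"irule": "rule_b", "data_group": "dg_b"},
}


def inactive_slot(current_irule):
    """Return the slot ("a" or "b") that is NOT currently assigned to the virtual-server."""
    for slot, names in AB_PAIRS.items():
        if names["irule"] == current_irule:
            return "b" if slot == "a" else "a"
    raise ValueError(f"unknown iRule: {current_irule!r}")


def plan_swap(current_irule):
    """List the workaround steps, in order, for swapping to the inactive slot."""
    target = AB_PAIRS[inactive_slot(current_irule)]
    return [
        f"update data-group-file for {target['data_group']}",
        "sync active to standby",
        "promote standby to active",
        f"on new active, assign {target['irule']} to the virtual-server",
        # The last two items are read here as two separate steps.
        "reboot new standby",
        "sync new active to new standby",
    ]
```

For example, `plan_swap("rule_a")` targets the "b" pair, so the update is always made to the data-group that is not referenced by the currently assigned iRule.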

Somehow the sync operation "cures" the issues induced by swapping the live iRule to point to a just-updated data-group.

So in summary, it seems that in a high-load environment, attempting to swap new contents into a live data-group somehow induces a failure-case for iRule lookups against that data-group.

The failure symptoms are identical both for the technique of re-caching the live data-group with new contents ( iControl API LocalLBDataGroupFile.set_local_path() ), and for the iRule a/b swap technique.

However, an active-to-standby sync operation seems to "cure" whatever bad-state the data-group has been put into.

Can anyone provide insights as to why swapping-in new contents to a large data-group-file associated with an iRule assigned to a VIP under heavy load would cause iRule data-group lookup failures ?

  • I suggest you raise a support call with F5 if you experience any "site down" type failures. I have seen TMM crashing, dumping core and restarting under heavy load with some iControl calls before, and the impact varies version-to-version.

     

    You need to examine the LTM logs at the time the VIP came down - there should be plenty of log messages informing about the reason for the service interruption.