Forum Discussion

mhite_60883
Apr 15, 2013

Scaling iControl, race conditions, and other pain points

Hello.

 

I have written a reporting tool which connects to a cluster leader, determines all the cluster members, and spawns a thread for each load balancer in the cluster. Each thread collects partition, pool, virtual server, and pool member statistics from a load balancer. The final results are merged into a cluster-wide roll-up report. You can see an example of what one of the pool reports looks like in the attached screenshot. Many of these reports are generated -- one for each pool in a cluster. (The screenshot is an example from a lab H/A pair, and not my 12-member cluster.)
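A minimal sketch of the thread-per-load-balancer collection and roll-up described above. It is not the actual tool: `collect_one` is a hypothetical callable standing in for whatever iControl calls gather statistics from a single device, and the per-pool counters are assumed to be simple numeric values that can be summed across cluster members.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def collect_cluster_stats(members, collect_one):
    """Run collect_one(host) in one thread per cluster member and sum the results.

    collect_one is a hypothetical callable returning
    {pool_name: {stat_name: numeric_value}} for a single load balancer.
    """
    merged = {}
    # One worker per member, mirroring the thread-per-load-balancer design.
    with ThreadPoolExecutor(max_workers=max(1, len(members))) as executor:
        for per_device in executor.map(collect_one, members):
            for pool, stats in per_device.items():
                # Counter.update() adds values for matching stat names.
                merged.setdefault(pool, Counter()).update(stats)
    return {pool: dict(counter) for pool, counter in merged.items()}
```

Summing is only one possible merge rule; statistics such as availability state would need a different way of being combined.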

 

 

I am running into several issues scaling this approach out on one of our largest clusters -- a 12-member active/active cluster. It is a fairly active cluster, with many hosts leaving and joining pools every few minutes. My first approach was to connect to a load balancer, set the active folder to "/", and set a recursive query state. I would then retrieve ALL POOLS from every folder/partition in a single API call, and then look up all pool members in the pools returned by that call. In my case, there are hundreds of pools and thousands of pool members. I then iterate through a number of pool-member-specific API calls. This introduces a race condition, however: some pool members are removed from the load balancers while the report is running, so member-related API calls fail with "node not found" errors. With this approach, there's not much that can be done at that point except rerun the report.

 

 

Once the race condition became obvious, I rewrote the tool to crawl the load balancer on a partition-by-partition basis, so that the race condition is at least isolated to a partition-wide view. When a race condition is hit and the "node not found" exception is raised, I can then either retry the report generation for that partition or just move on to the next partition. This is my current approach.
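The retry-or-move-on logic for a single partition could be sketched roughly as follows. The names here are assumptions for illustration: `MemberNotFound` stands in for whatever "node not found" fault the API client actually raises, and `collect_partition` is a hypothetical callable that gathers all statistics for one partition.

```python
class MemberNotFound(Exception):
    """Hypothetical stand-in for the 'node not found' fault from the API client."""


def crawl_partitions(partitions, collect_partition, max_attempts=3):
    """Collect each partition separately; retry on a race, skip after max_attempts.

    collect_partition is a hypothetical callable that returns the statistics
    for one partition, or raises MemberNotFound if a member vanishes mid-crawl.
    """
    report, skipped = {}, []
    for partition in partitions:
        for _attempt in range(max_attempts):
            try:
                report[partition] = collect_partition(partition)
                break  # this partition succeeded; go on to the next one
            except MemberNotFound:
                continue  # a member disappeared; retry this partition only
        else:
            # Every attempt hit the race: record the partition and move on.
            skipped.append(partition)
    return report, skipped
```

Returning the skipped partitions alongside the report lets the caller decide whether to rerun just those partitions later or to flag them as missing in the roll-up.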

 

 

However, I am finding the partition-by-partition crawl to be EXTREMELY SLOW. I also find that iControl becomes mostly unresponsive to other programs that need to use it (e.g. node registration, other reporting processes, etc.).

Is iControl a bad approach? If I tried this with SNMP, would we expect it to be significantly faster for gathering pool member, pool, and related statistics? Just looking for some high-level advice from the pros who have tackled this before.

 

Thanks,

 

-M

 

3 Replies

  • Patrick_Chang_7
    Historic F5 Account
    Unfortunately, the iControl process on the F5 side is pretty much single-threaded (it can only process one request at a time). If you have a process that ties it up for long periods, it will become unresponsive to other processes that are trying to issue iControl commands. In general, it is better to use SNMP to gather statistics and to use iControl only for actual configuration changes/queries. In older versions of code (pre-v10.2.2) we had some major inefficiencies in the way we processed SNMP requests that made SNMP pretty unusable for gathering statistics on large numbers of objects at one time. If you are running a pre-v10.2.2 TMOS version, you can request an engineering HF that fixes this. Prior to v11.2.0 there were statistics available through iControl that were not available via SNMP. Since v11.2.0, one can create custom MIB entries that make it possible to grab anything via SNMP that could be gotten through the command line. The process to do this is described here: http://support.f5.com/kb/en-us/solutions/public/13000/500/sol13596.html?sr=28857189
  • Thanks, Patrick. I appreciate the response. I will try your suggestion of splitting some of the queries out into SNMP. Do you expect the situation to be different with the introduction of the REST API in 11.4? I would guess that the lower overhead will allow iControl to process requests more quickly, so the effective capacity will ultimately be higher?

    -M
  • Patrick_Chang_7
    Historic F5 Account
    The REST API should improve things, but it would have to be tested to be sure, and I have seen no performance testing done on it yet. In addition, the first iteration of the REST API is geared towards making configuration changes and will probably not be able to collect all the statistics you want.