Troubleshooting high CPU utilisation on BIG-IP systems

Introduction

This is not really a step-by-step troubleshooting guide.

What I'm sharing here is the result of reverse engineering the kind of knowledge that helped me succeed in troubleshooting CPU issues while I worked for the Engineering Services department at F5.

Here's what I'll cover, in order, mixing what we should know with where to look for the problem:

Know what HyperThreading (HT) is
Know how HT is used within F5
Find out if your box supports HyperThreading (HT)
Know the difference between Forwarding plane (TMM) vs Control plane (Linux) CPU consumption
Confirm if the problem is TMM or another daemon
Where to look further when TMM CPU is high
What if it's a control plane daemon?
Learn how to interpret graphs
High CPU in non-HT boxes
High CPU in HT+ boxes
Use scripts when necessary to collect real time data

1. Know what HyperThreading (HT) is

A physical core, as the name implies, is a physical CPU core connected to the motherboard's socket
A physical CPU core has several execution units (modules) capable of performing different tasks
e.g. one for basic integer maths, another for more advanced maths, others for loading and storing data from/to memory, etc.
HT exposes 2 or more logical CPU cores per physical core, so that execution units not being utilised by process A can be used by process B if needed
When 2 programs want to use the same part of the physical core, it's inevitable that one of them will have to wait
The Operating System (OS) scheduler decides which process gets execution priority in this case
This is when 2 (or more) actual physical cores would perform better, as this limitation is not present
i.e. 2 physical cores would be able to concurrently perform tasks using their own execution units
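To see which logical cores share a physical core on a Linux system, the kernel's sysfs topology files can be read directly. The following is a minimal sketch, assuming the standard Linux path /sys/devices/system/cpu/cpuN/topology/thread_siblings_list is available; the function name thread_siblings is made up for illustration.

```shell
# thread_siblings [SYSFS_CPU_DIR]
# Prints, for each logical CPU, the logical CPUs that share its physical core.
# The body runs in a subshell "( ... )" so its variables do not leak.
thread_siblings() (
  root="${1:-/sys/devices/system/cpu}"
  for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
    [ -r "$f" ] || continue          # skip if the file is missing
    cpu="${f#"$root"/}"              # drop the leading directory
    cpu="${cpu%%/*}"                 # keep only "cpuN"
    printf '%s: %s\n' "$cpu" "$(cat "$f")"
  done
)
```

Calling thread_siblings with no arguments reads the live system. A line whose sibling list holds more than one CPU means that physical core is running more than one hardware thread.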

2. Know how HT is used within F5

Before BIG-IP v11.5.0 on systems with HyperThreading (HT) Technology, we would have:
1 TMM per logical core
Each logical core processes both data plane (TMM) and control plane (Linux) tasks
From v11.5.0 onwards (affects only processors with HT Technology):
Data plane (TMM) resides on even-numbered cores (0, 2, 4, etc.)
Control plane (Linux) resides on odd-numbered cores (1, 3, 5, etc.)
When TMM reaches 80% of actual CPU utilisation, control plane tasks on the odd-numbered cores are limited to 20% of CPU capacity, leaving the remainder to the overloaded forwarding plane (TMM).
The vCMP host must also be running v11.5.0 or newer in order for guests to use HTSplit technology.
We can disable it manually by issuing the following command:

3. Find out if your box supports HyperThreading (HT)

The hardware boxes listed with HT+ in K14358 all support HyperThreading technology.

Here's how to check the number of cores in a given BIG-IP box (this is a VIPRION C2200 chassis with a 2250 blade installed):

The above box is able to run 2 threads per physical core (Thread(s) per core) with a total of 10 physical cores (Core(s) per socket) and a total of 20 (logical) cores (CPU(s)).

Here's the same output from a 3900 series box that does not support HT:

The above box is able to run 1 thread per physical core (Thread(s) per core) with a total of 4 physical cores (Core(s) per socket) and a total of 4 cores (CPU(s)).
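The same reading of the Thread(s) per core, Core(s) per socket and CPU(s) fields can be sketched as a small filter over lscpu output. This is an illustration only: the function name ht_summary is hypothetical, and it assumes the usual "Field: value" layout printed by lscpu.

```shell
# ht_summary: reads lscpu output on stdin and summarises the core layout.
ht_summary() {
  awk -F': *' '
    /^ *Thread\(s\) per core/ { threads = $2 + 0 }
    /^ *Core\(s\) per socket/ { cores   = $2 + 0 }
    /^ *Socket\(s\)/          { sockets = $2 + 0 }
    /^ *CPU\(s\)/             { cpus    = $2 + 0 }
    END {
      if (!sockets) sockets = 1   # assume one socket if the field is absent
      printf "physical cores: %d, logical cores: %d, HT: %s\n",
             cores * sockets, cpus, (threads > 1 ? "yes" : "no")
    }'
}
```

On a live system it would be used as lscpu | ht_summary. More than 1 thread per core is what marks a box as HT-capable.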

4. Know the difference between Forwarding plane (TMM) vs Control plane (Linux) CPU consumption

4.1 Confirming if it's TMM or Linux

BIG-IP's forwarding plane is TMM.
TMM is a daemon/process within Linux space.
If tmm CPU usage is high, then we know high CPU utilisation is a forwarding plane issue.
The other daemons are part of BIG-IP's control plane (e.g. bigd - monitoring daemon).
In this example, both tmm (102.3%) and bigd (51.8%) are high:

If TMM CPU utilisation is high, we will need to troubleshoot CPU usage of internal TMM components.
For other daemons, there are different places to look.
For example, for bigd (monitoring daemon), we need to check BIG-IP's monitors.
AskF5 has a nice how-to guide here.
Here's a list of BIG-IP daemons.
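To pick out busy daemons such as tmm and bigd quickly, top's batch output can be filtered by its %CPU column. The sketch below assumes the default top column header (containing %CPU and COMMAND); the function name busy_procs and the default threshold of 50 are arbitrary choices for illustration.

```shell
# busy_procs [THRESHOLD]
# Reads "top -bn 1" output on stdin and prints "COMMAND %CPU" for every
# process whose %CPU is at or above THRESHOLD (default 50).
busy_procs() {
  awk -v t="${1:-50}" '
    # Locate the %CPU and COMMAND columns from the header line.
    !hdr && /%CPU/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "%CPU")    c = i
        if ($i == "COMMAND") m = i
      }
      hdr = 1
      next
    }
    # After the header, report rows at or above the threshold.
    hdr && NF >= m && ($c + 0) >= t { print $m, $c }
  '
}
```

For example, top -bn 1 | busy_procs 50 would list only processes at 50% CPU or more. If tmm is in that list, the problem is in the forwarding plane; anything else points to the control plane.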

4.2 TMM CPU utilisation or forwarding plane CPU utilisation

Check tmsh show ltm virtual <virtual server name> to confirm if there is a particular virtual server eating up tmm CPU cycles:

Check iRules

Check tmsh show sys tmm-info to see the breakdown of TMM CPU utilisation per tmm:

4.3 Linux CPU utilisation or control plane CPU utilisation

For anything apart from TMM, top output is your best friend for confirming which daemon is the culprit.
tmsh show sys proc-info is another command we can use to gather process-specific CPU information.
Here I'm checking information for bigd (the monitoring daemon):

5. Learn how to interpret graphs

5.1 High CPU in non-HT boxes

The graph below is just an example taken from a 3900 box, which doesn't have HT split
Because graphs are generated from average CPU utilisation, we can assume that actual CPU utilisation is very high at times
Because there is no HT split, the CPU utilisation below can be due either to TMM or to some other Linux daemon
We can confirm using the top command
In the graph below it was due to both tmm and bigd
To confirm normal usage, we always try to match CPU with other numbers in the graphs (e.g. active connections, etc.)

Note: this is a graph as seen in a qkview (generated by clicking on System > Support), which takes a snapshot of the system. It can then be uploaded to iHealth and is mostly used to share snapshots of BIG-IP systems with F5 support. However, the graph here is used for illustrative purposes, to understand CPU utilisation as seen in graphs.

5.2 High CPU in HT+ boxes

This other graph was taken from a 4200 series box, which has HT split enabled
Notice that CPU cores 0, 2, 4 and 6 (tmm/data plane) show CPU at about 60%
Cores 1, 3, 5 and 7 (control plane) show very minimal CPU utilisation with some spikes
Spikes can be due to the AVR/ASM daemons described in K16469 and K15606
Or because TMM has reached 80% of CPU utilisation and is now using the control plane's cores
This is an example of mostly normal/regular CPU utilisation
When HT is enabled and TMM cores use less than 80% of CPU, the control-plane cores remain mostly 'quiet'.
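The even/odd reading of an HT split graph can be sketched as a small filter. It assumes a made-up input format of one "core-number utilisation-percent" pair per line (not the output of any BIG-IP command), and the function name ht_split_summary is hypothetical. The 80% check is applied to the average of the even cores, which is a simplification.

```shell
# ht_split_summary: reads "CORE PERCENT" lines on stdin (hypothetical format)
# and averages even-numbered (data plane) and odd-numbered (control plane) cores.
ht_split_summary() {
  awk '
    NF >= 2 && $1 ~ /^[0-9]+$/ {
      if ($1 % 2 == 0) { dsum += $2; dn++ }   # even core: TMM / data plane
      else             { csum += $2; cn++ }   # odd core: Linux / control plane
    }
    END {
      if (dn) d = dsum / dn
      if (cn) c = csum / cn
      printf "data plane (even cores) avg: %.1f%%\n", d
      printf "control plane (odd cores) avg: %.1f%%\n", c
      if (d >= 80) print "TMM at or above 80%: control plane tasks may be limited"
    }'
}
```

With even cores at about 60% and odd cores nearly idle, as in the 4200 graph, the summary shows a busy but healthy data plane and a quiet control plane.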

6. Use scripts when necessary to collect real time data

Sometimes just looking at the graphs and command output is not enough to determine why CPU is high.

Here's an example of a script to collect real-time TMM/Linux CPU stats on BIG-IP every 60 seconds and copy output to /var/log/cpu-average.log

top command output is also copied to /var/log/top-output.log:

Output should be similar to this:

The number after "Counter64" is the percentage value representing how busy each CPU core is.

For example, TMM0.0 and TMM0.1 are both at 1% of capacity.
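When the log grows large, the busy cores can be pulled out by filtering on the value after "Counter64". The sketch below assumes each line has the form "NAME = Counter64: VALUE" (the usual net-snmp style); the function name counters_over and the default threshold of 80 are assumptions for illustration.

```shell
# counters_over [THRESHOLD]
# Reads lines of the form "NAME = Counter64: VALUE" on stdin and prints
# "NAME VALUE" for every entry at or above THRESHOLD (default 80).
counters_over() {
  awk -v t="${1:-80}" -F' = Counter64: ' '
    NF == 2 && ($2 + 0) >= t { print $1, $2 }
  '
}
```

For example, counters_over 80 < /var/log/cpu-average.log would print only the entries at 80% or more, and lines that do not match the assumed format are ignored.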

We can add H to the top command (e.g. top -Hcbn 1) in the script above to show the individual threads of a process, including TMM threads.
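As a minimal sketch of this kind of periodic collection (not the script referred to above, and covering only the top part), a small shell function can append timestamped command output to a log file at a fixed interval. The function name collect_snapshots and its argument order are assumptions for illustration.

```shell
# collect_snapshots LOGFILE INTERVAL_SECONDS COUNT COMMAND [ARGS...]
# Runs COMMAND COUNT times, INTERVAL_SECONDS apart, appending each run
# to LOGFILE under a timestamp header. Runs in a subshell "( ... )".
collect_snapshots() (
  logfile="$1"; interval="$2"; count="$3"
  shift 3
  i=0
  while [ "$i" -lt "$count" ]; do
    {
      echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
      "$@"
    } >> "$logfile" 2>&1
    i=$((i + 1))
    # Sleep between samples, but not after the last one.
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
)
```

For example, collect_snapshots /var/log/top-output.log 60 10 top -Hcbn 1 would record ten per-thread top snapshots, one minute apart. Bounding the number of samples keeps the log from growing indefinitely if the collection is forgotten.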

When opening a support case with F5, it may be useful to include the full tmctl tables, as they contain the raw data behind almost everything we can find on a BIG-IP system.

The below is an example of a script that collects all tmctl information every 5 seconds:

Apart from knowing where to look, understanding the CPU usage pattern when it comes to our own organisation's production traffic is really important. It enables us to compare, for example, the number of active connections with a spike in CPU in the graphs to understand if the spike is related to a sudden and sharp increase in traffic.

Published Jun 03, 2020

Version 1.0