on 03-Jun-2020 02:58
This is not really a step-by-step troubleshooting guide.
What I'm sharing here is the result of reverse engineering the kind of knowledge that helped me succeed at troubleshooting CPU issues during the time I worked for the Engineering Services department at F5.
Here's what I'll cover sequentially with a mix of what we should know and where to find the problem:
The hardware boxes listed with HT+ in K14358 all support HyperThreading technology.
Here's how to check the number of cores in a given BIG-IP box (this is a VIPRION C2200 chassis with a 2250 blade installed):
The above box is able to run 2 threads per physical core (Thread(s) per core) with a total of 10 physical cores (Core(s) per socket) and a total of 20 (logical) cores (CPU(s)).
Here's the same output from a 3900 series box that does not support HT:
The above box is able to run 1 thread per physical core (Thread(s) per core) with a total of 4 physical cores (Core(s) per socket) and a total of 4 cores (CPU(s)).
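The relationship between the lscpu fields quoted above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_lscpu` and `core_summary` are made up for this example, the sample text reuses the numbers from the two boxes described above, and the `Socket(s)` value of 1 is an assumption (it is what makes 10 cores x 2 threads equal 20 CPUs).

```python
def parse_lscpu(text):
    """Turn lscpu-style 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'Key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def core_summary(fields):
    """Derive physical/logical core counts from the parsed lscpu fields."""
    threads = int(fields["Thread(s) per core"])
    cores_per_socket = int(fields["Core(s) per socket"])
    sockets = int(fields.get("Socket(s)", 1))  # assume one socket if absent
    physical = cores_per_socket * sockets
    return {
        "physical_cores": physical,
        "logical_cpus": physical * threads,
        "hyperthreading": threads > 1,
    }


# Sample values taken from the text above; Socket(s) = 1 is an assumption.
HT_SAMPLE = """\
CPU(s):                20
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             1
"""

NON_HT_SAMPLE = """\
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
"""

if __name__ == "__main__":
    for name, sample in (("HT box", HT_SAMPLE), ("non-HT box", NON_HT_SAMPLE)):
        print(name, core_summary(parse_lscpu(sample)))
```

If the computed `logical_cpus` does not match the `CPU(s)` line on a real box, the socket count or the HT setting is probably not what we assumed.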
Note: this is a graph as seen in a qkview (clicking on System > Support), which takes a snapshot of the system. The qkview can then be uploaded to iHealth and is mostly used to share snapshots of BIG-IP systems with F5 support. However, the graph here is used for illustrative purposes, to understand CPU utilisation as seen in graphs.
Sometimes just looking at the graphs and command output is not enough to determine why CPU is high.
Here's an example of a script that collects real-time TMM/Linux CPU stats on BIG-IP every 60 seconds and copies the output to /var/log/cpu-average.log.
The output of the top command is also copied to /var/log/top-output.log:
Output should be similar to this:
The number after "Counter64" is the percentage value representing how busy each CPU core is.
For example, TMM0.0 and TMM0.1 are both at 1% of capacity.
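To pick out busy cores from a long log, the number after "Counter64" can be extracted programmatically. The sketch below is a rough illustration only: the exact line layout (`<name> = Counter64: <value>`), the sample lines, the 80% threshold and the function names are all assumptions made for this example, not the real output of the script.

```python
import re

# Assumed layout: "<name> = Counter64: <value>"; adjust to the real output.
COUNTER64 = re.compile(r"^(?P<name>.+?)\s*=\s*Counter64:\s*(?P<value>\d+)\s*$")


def parse_cpu_ratios(text):
    """Map each counter name to the integer found after 'Counter64:'."""
    usage = {}
    for line in text.splitlines():
        match = COUNTER64.match(line.strip())
        if match:
            usage[match.group("name")] = int(match.group("value"))
    return usage


def busy_cores(usage, threshold=80):
    """Return the names whose value is at or above the threshold (percent)."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


# Hypothetical sample lines, for illustration only.
SAMPLE = """\
TMM0.0 = Counter64: 1
TMM0.1 = Counter64: 1
TMM0.2 = Counter64: 93
"""

if __name__ == "__main__":
    usage = parse_cpu_ratios(SAMPLE)
    print(usage)
    print("busy:", busy_cores(usage))
```

The threshold is arbitrary; what counts as "busy" depends on the baseline we normally see on the box.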
We can add H to the top command options (e.g. top -Hcbn 1) in the script above to show the individual threads of a process, including TMM threads.
When opening a support case with F5, it may be useful to include the full tmctl table, as it contains nearly all of the raw data available on a BIG-IP system.
Below is an example of a script that collects all tmctl information every 5 seconds:
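A minimal sketch of this kind of periodic collector, written in Python rather than as a shell loop, could look like the following. Treat it as an assumption-laden illustration: the function names are invented for this example, and the command and log path shown in the usage note are placeholders to adapt.

```python
import subprocess
import time
from datetime import datetime


def snapshot(command, log_path):
    """Run `command` once and append its timestamped output to `log_path`."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error messages in the same log
        universal_newlines=True,
    )
    with open(log_path, "a") as log:
        log.write("==== {} ====\n".format(datetime.now().isoformat()))
        log.write(result.stdout)
    return result.returncode


def collect(command, log_path, interval_seconds, iterations=None):
    """Take a snapshot every `interval_seconds`.

    Runs forever when `iterations` is None, otherwise stops after that
    many snapshots. Returns the number of snapshots taken.
    """
    count = 0
    while iterations is None or count < iterations:
        snapshot(command, log_path)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
    return count
```

For instance, `collect(["tmctl", "-a"], "/var/log/tmctl-output.log", 5)` would dump all tables every 5 seconds until interrupted, assuming `tmctl -a` is the command we want and that the (hypothetical) log path has enough free space. The same function with an interval of 60 and `["top", "-cbn", "1"]` covers the top collection described earlier. Full tmctl dumps grow quickly, so it is worth bounding `iterations` or rotating the log.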
Apart from knowing where to look, understanding the CPU usage pattern when it comes to our own organisation's production traffic is really important. It enables us to compare, for example, the number of active connections with a spike in CPU in the graphs to understand if the spike is related to a sudden and sharp increase in traffic.
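One simple way to make that comparison less subjective is to compute a correlation coefficient between the two series over the same time window. The sketch below uses a plain Pearson correlation; the sample numbers are invented for illustration, and how we export connection counts and CPU percentages from the graphs is left open.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (>= 2 samples)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples taken at the same timestamps.
    active_connections = [1200, 1250, 1300, 5200, 5400, 1350]
    cpu_percent = [12, 13, 12, 71, 75, 14]
    print("correlation:", round(pearson(active_connections, cpu_percent), 3))
```

A value close to 1 suggests the CPU spike tracks the traffic increase; a value near 0 suggests looking elsewhere (for example, a process or feature consuming CPU independently of load). Correlation alone does not prove the cause, so it is a hint for where to look next rather than a conclusion.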
Finally, a good article on how to troubleshoot high CPU on a BIG-IP system.