Knowledge sharing: High CPU/Memory/Swap investigation/troubleshooting

I will share some basic knowledge about troubleshooting and resolving high data plane or control plane CPU.
There is already a great article on this, so check it first:

If CPUs 0, 2, 4 (the even-numbered ones) are high, it is a data plane (TMM) issue; if CPUs 1, 3, 5 (the odd-numbered ones) are high, it is a control plane CPU issue. Please read:
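
As a rough illustration of the even/odd split described above, the sketch below averages per-CPU load by plane. The input format ("cpu-index busy-percent" pairs on stdin) and the function name summarize_planes are assumptions made for this example; how you collect the per-CPU numbers is up to you.

```shell
# Average per-CPU busy percentages by plane.
# Assumption (from the article): even CPUs = data plane (TMM), odd CPUs = control plane.
# Input (hypothetical format): one "cpu-index busy-percent" pair per line on stdin.
summarize_planes() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 2 {
            if ($1 % 2 == 0) { d += $2; dn++ } else { c += $2; cn++ }
        }
        END {
            if (dn) printf "data plane avg: %.1f%%\n", d / dn
            if (cn) printf "control plane avg: %.1f%%\n", c / cn
        }'
}

# Usage:
#   printf '0 90\n1 10\n2 80\n3 20\n' | summarize_planes
```

A large gap between the two averages tells you which of the two troubleshooting paths in this article to follow first.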

If the control plane CPU is constantly high, run the Linux "top" command to see which process is causing the issue, then restart that process and check for known bugs on Google, AskF5, the bug tracker, the release notes and iHealth. Be careful, as restarting critical processes may have some impact. If the process is not critical, like bigd, just restart it; I have seen bugs where bigd or snmpd leak memory and need a restart.

If the control plane CPU jumps from time to time, the issue is harder to catch. If you see a pattern, check the logs, cron jobs and any REST API scripts that run at the same time as the CPU jumps. For example, the F5 ASM datasyncd may cause periodic jumps, as mentioned in https://support.f5.com/csp/article/K02827102, or the ASM policy builder may be enabled and learning too many things, as mentioned in https://support.f5.com/csp/article/K58571155.

Also, if top shows many "tmsh" processes, there is probably a REST API script that does not close its connection correctly, which leaves many tmsh sessions hanging and causes high CPU and memory (in this case configure the tmsh timeout, as it is not configured by default: https://support.f5.com/csp/article/K9908). To catch an issue that occurs at random times, you may need to follow the article below or run top with arguments such as "top -b -n 10 -d 10 >> /var/tmp/top.txt", which runs top in batch mode 10 times at 10-second intervals:
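
A minimal sketch of both checks from the paragraph above: capturing top snapshots to a file and counting tmsh processes. The function names capture_top and count_tmsh are made up for this example. The idle-timeout command in the comment is what I believe K9908 describes, so treat it as an assumption and verify it against the article for your version.

```shell
# Append batch-mode top snapshots to a file.
# $1 = iterations, $2 = delay in seconds, $3 = output file
capture_top() {
    top -b -n "$1" -d "$2" >> "$3"
}

# Count processes whose command name is exactly "tmsh".
# Reads one command name per line on stdin and prints the count.
count_tmsh() {
    grep -c -x 'tmsh' || true   # grep exits 1 on zero matches; still print 0
}

# Usage (not run here):
#   capture_top 10 10 /var/tmp/top.txt
#   ps -eo comm= | count_tmsh
#
# Assumed tmsh timeout setting (check K9908 before using):
#   tmsh modify cli global-settings idle-timeout 10
```

If the count keeps growing between runs, that matches the hanging-session pattern described above.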

If the TMM CPU is high during peak working hours, your system may be overutilized, and you may need to increase the number of cores for virtual systems (if the license allows it) or vCMP (if there are free cores), or buy another device. Log messages in /var/log/kern for the idle enforcer, or "clock advanced" messages in /var/log/ltm, may also indicate TMM CPU issues:
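
To see whether the "clock advanced" messages cluster around peak hours, something like the sketch below can help. It assumes the log uses the usual syslog timestamp layout ("Mon DD HH:MM:SS ..."), and the function name clock_advanced_by_hour is invented for this example.

```shell
# Count "clock advanced" messages per hour in a syslog-style file.
# $1 = log file, for example /var/log/ltm
# Assumption: lines start with "Mon DD HH:MM:SS".
clock_advanced_by_hour() {
    awk '
        tolower($0) ~ /clock advanced/ {
            split($3, t, ":")                      # t[1] = hour
            n[$1 " " $2 " " t[1] ":00"]++
        }
        END { for (k in n) print n[k], k }
    ' "$1" | sort -k2
}

# Usage (not run here):
#   clock_advanced_by_hour /var/log/ltm
```

Each output line is a count followed by the month, day and hour bucket, so busy hours stand out at a glance.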

You may try some small optimizations like:

For memory issues, don't forget that the "top" command shows memory per Linux process (the control plane view), while "show sys memory" shows the memory of the F5 TMM subsystems. For example, a bad iRule can cause the "tcl" subsystem memory to go high. Logs in /var/log/ltm for the memory sweeper are also a good indication: https://support.f5.com/csp/article/K13302777 and https://support.f5.com/csp/article/K15740. A DDoS may also cause high memory, so be careful. For control plane memory, if you see many tmsh sessions open in top, check your REST API scripts and automations and configure the tmsh timeout, as I have seen this too many times to count. Memory for vCMP is increased by adding more cores if needed, and for Virtual Editions it is much easier.

Note: A high Other Used memory usage on the BIG-IP system Dashboard may not indicate an issue, as the Linux kernel allocates memory to buffers and disk caching that can be released as needed.
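
The arithmetic behind that note can be sketched on any Linux host by subtracting free memory, buffers and page cache from the total in /proc/meminfo. This is only an illustration of the idea, not necessarily how the Dashboard computes "Other Used", and the function name used_excluding_cache_kb is made up.

```shell
# Print memory in use (kB) after excluding free memory, buffers and page cache.
# $1 = meminfo-format file, defaults to /proc/meminfo
used_excluding_cache_kb() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        END { print total - free - buffers - cached }
    ' "${1:-/proc/meminfo}"
}

# Usage (not run here):
#   used_excluding_cache_kb
```

If this number is much lower than the raw "used" figure, most of the difference is cache that the kernel can give back.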

Example high control plane memory:

Examples for high tmm data plane memory:

  • K02620345
  • K13889
  • K09336400
  • K15245
  • ID633402.html
  • K44385170

For swap issues, you can now enable top to show the process causing the issue, or just upload a qkview to iHealth and check from there:
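
As an alternative to the top view, per-process swap can also be read straight from /proc. This is a minimal sketch that assumes the running kernel exposes the VmSwap field in /proc/<pid>/status; the function name swap_by_process is invented for the example.

```shell
# List processes by swap usage (kB), largest first.
# $1 = proc root, defaults to /proc
# Assumption: the kernel reports VmSwap in /proc/<pid>/status.
swap_by_process() (
    root="${1:-/proc}"
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue
        awk '
            $1 == "Name:"   { name = $2 }
            $1 == "VmSwap:" { swap = $2 }
            END { if (swap > 0) print swap, name }
        ' "$f" 2>/dev/null            # a process may exit while we scan
    done | sort -rn
)

# Usage (not run here):
#   swap_by_process | head
```

Processes without a VmSwap line (kernel threads, for example) or with zero swap are skipped.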

Also don't forget to check the hard disk, as it can cause high CPU if the logs can't be written because of a full or faulty hard drive:
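
A quick way to spot a full filesystem is to filter the output of "df -P" by usage. The sketch below assumes the standard POSIX df column layout; the 90% default threshold and the function name full_filesystems are arbitrary choices for this example.

```shell
# Print mount points whose usage is at or above a threshold.
# $1 = threshold in percent (default 90); reads `df -P` output on stdin.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                       # skip the header line
            pct = $5
            sub(/%$/, "", pct)
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }'
}

# Usage (not run here):
#   df -P | full_filesystems 90
```

This only covers the "full" case; a faulty drive needs the hardware logs or iHealth.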

Updated Jul 19, 2022
Version 2.0

2 Comments

  • Nikoolayy1 - these posts you made several months ago are now being promoted to CrowdSRC articles. Thanks for all that you do in, and for, the DevCentral community.

    🚀

  • Nice summary, thanks a lot Nikoolayy1 .

    Monitoring might be a reason for constantly high CPU.

    The way tcp_half_open was implemented could cause a high CPU baseline, as could a huge number of HTTPS monitors.

    Regarding ICMP, TCP, HTTP and HTTPS monitors, we can now use In-TMM monitoring as described here:

    K11323537: Configuring In-TMM monitoring