Enterprise trends have encouraged, and continue to encourage, more and more use of virtualization to deliver applications and workloads. From a research study F5 published back in 2009, a full 14 years before this article, on Trends in Enterprise Virtualization Technologies:
A great deal has changed in 14 years, and the growth and modernization of applications have exploded as mobile devices, tablets, and SaaS offerings have forced enterprises to rapidly expand and scale their environments. Public cloud solutions emerged and now play a significant role in architectures both small and global. It is important to optimize and squeeze out as much performance as possible from these installations, but the nuances of hardware and virtualization options can often be confusing and sometimes unintuitive. Infrastructure teams are already burdened with making technology and architecture decisions, and performance tuning is typically a luxury they cannot afford.
This article hopes to provide some simple and quick optimizations for BIG-IP Virtual Edition deployments in private clouds. We will focus on three areas: Hardware, Firmware, and Software.
Thanks to the following people for their contributions to this article:
Ryan Howard, for his detailed and vast understanding of the topic and gracious assistance.
Buu Lam, for his meticulous review and continued guidance.
Aubrey King, for his wisdom on NUMA and many hours in a performance lab.
Hardware is normally the most difficult of the three areas to optimize, as availability, existing vendors, existing standards, and so on make specific platforms or accessory cards a challenge for “quick optimizations”. That being said, the easiest thing you can do is maximize the memory:
Populate all the available slots with the largest and fastest memory supported by the platform. Memory is a commodity that all the virtual machines share, and the more that processes can work from allocated memory rather than disk, the better systems, including BIG-IP, will perform.
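As a quick sanity check, a Linux hypervisor host will report its installed DIMMs via dmidecode (root required) or at least its total memory via /proc/meminfo. A minimal sketch, assuming a Linux host with dmidecode installed:

```shell
# Quick inventory of DIMM slots on a Linux host (requires root and dmidecode).
# Empty slots report "No Module Installed" - those are candidates for more RAM.
if command -v dmidecode >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    dmidecode -t memory | grep -E 'Locator:|Size:|Speed:'
else
    # Fallback: total memory visible to the kernel
    grep MemTotal /proc/meminfo
fi
```

Comparing the populated slots and speeds against the platform's memory specification shows how much headroom remains.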
Firmware is the first place you can make some easy optimizations that will increase performance in your hypervisor, and they can be implemented with reasonable speed. Some of the verbiage here might be platform specific (Dell), but you are likely to find the equivalent on other platforms. The general theme is to fix and maximize CPU speed and performance, streamline access to compute and PCIe resources, and remove anything that scales back or lowers performance. Additionally, you want to maximize cooling and then leave everything in place. Autoscaling performance and then trying to scale cooling up and down works well for home computing, but in an enterprise situation you want everything maximized and left alone:
This will almost certainly be enabled already if your hypervisor is installed, but if this is a new deployment you want to enable it. Hardware virtualization support (Intel VT-x or AMD-V) changes the way the CPU functions by introducing two modes: root and non-root. These are orthogonal to the privilege rings and real/protected modes you might be familiar with. The hypervisor runs in root mode and the guests run in non-root mode. Code running in non-root mode mostly executes the same way it would in root mode, but without the same freedoms, which allows guest code to run at near native speed.
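You can verify whether the host CPU exposes these extensions from a Linux shell; a minimal sketch, assuming a Linux host where the CPU flag names are vmx (Intel VT-x) and svm (AMD-V):

```shell
# Check whether the host CPU advertises hardware virtualization extensions:
# vmx = Intel VT-x, svm = AMD-V. If neither appears, the feature is likely
# disabled in firmware (or you are running inside a guest that hides it).
virt_flag=$(grep -m1 -oE 'vmx|svm' /proc/cpuinfo || true)
if [ -n "$virt_flag" ]; then
    echo "virtualization extensions present: $virt_flag"
else
    echo "no vmx/svm flag visible - check the firmware setting"
fi
```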
SR-IOV is a specification that allows a single Peripheral Component Interconnect Express (PCIe) physical device under a single root port to appear as multiple separate physical devices to the hypervisor or the guest operating system. SR-IOV uses physical and virtual functions to manage global functions for SR-IOV devices, giving the hypervisor greatly increased performance and speed for the device. These are typically network cards that allow multi-gigabit speeds for hypervisor guests.
Even if you do not have SR-IOV devices it doesn’t hurt anything to have this on in the firmware. If the system doesn’t detect eligible devices during startup, it is essentially unused.
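One way to confirm whether any devices actually negotiated SR-IOV after enabling it is to look for the sriov_totalvfs attribute in sysfs; a minimal sketch, assuming a Linux hypervisor host:

```shell
# Look for PCIe devices that expose SR-IOV virtual functions. On a host with
# SR-IOV enabled in firmware, eligible NICs publish sriov_totalvfs in sysfs.
found=0
for dev in /sys/bus/pci/devices/*/sriov_totalvfs; do
    [ -e "$dev" ] || continue
    found=1
    echo "$(dirname "$dev") supports $(cat "$dev") virtual functions"
done
[ "$found" -eq 1 ] || echo "no SR-IOV capable devices detected"
```

If nothing is listed even though the firmware option is on, the host simply has no eligible devices, which is the harmless case described above.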
Turbo normally refers to Intel Turbo Boost Technology, which controls whether the processor is allowed to transition to a higher frequency than its rated speed when it has power headroom and is within temperature limits. AMD systems work in a similar way, but the technology is referred to as AMD Turbo Core or AMD Core Performance Boost.
On Dell hardware platforms this can typically be left alone as it defaults to enabled.
Hyperthreading (Clustered Multithreading on older AMD processors, Simultaneous Multithreading on AMD Ryzen) is a technology that allows more than one thread of execution to run on each CPU core. More threads equate to more work done in parallel, which is typically an improvement for multi-threaded applications. However, because of the way BIG-IP manages its processes to disaggregate and re-aggregate network traffic across TMM instances, hyperthreading can work against the system performance-wise.
In some cases, disabling hyperthreading can boost BIG-IP VE performance by as much as 175%, although most deployments will see a more modest improvement.
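To check whether hyperthreading/SMT is currently active on a Linux host before deciding to disable it in firmware, a minimal sketch (the sysfs path is only present on reasonably recent kernels, so lscpu is used as a fallback):

```shell
# Report whether simultaneous multithreading (hyperthreading) is active.
# /sys/devices/system/cpu/smt/active is 1 when SMT is on; the lscpu
# "Thread(s) per core" line gives the same answer on older kernels.
if [ -r /sys/devices/system/cpu/smt/active ]; then
    smt=$(cat /sys/devices/system/cpu/smt/active)
    echo "SMT active: $smt (for BIG-IP VE, consider disabling in firmware)"
else
    lscpu 2>/dev/null | grep -i 'thread(s) per core' \
        || echo "SMT state not exposed on this kernel"
fi
```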
Speed Step Technology allows the system to dynamically adjust processor voltage and core frequency, which can result in decreased power consumption and heat production. This is undesirable, obviously, because we don’t want the system to determine that the processor should be ‘stepped down’ for whatever reason.
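On a Linux hypervisor host, the kernel's frequency scaling governor reflects this behavior; a minimal sketch for inspecting it (the cpupower command mentioned in the comment is an assumption and may not be installed):

```shell
# Inspect the CPU frequency scaling governor. On a tuned hypervisor host you
# generally want "performance" rather than "powersave" or "ondemand".
gov_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$gov_file" ]; then
    echo "current governor: $(cat "$gov_file")"
    # To switch (as root, if cpupower is installed):
    #   cpupower frequency-set -g performance
else
    echo "cpufreq not exposed (common inside guests and containers)"
fi
```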
Power Management or C-states are power modes the CPU can be placed in to save energy, usually when it has been idle for a length of time. In enterprise deployments, you don't want compute resources slowly ramping up from a sleep state, or to endure the time it takes the processor and hypervisor to wake resources just to load balance a packet. There are far better methods to scale out performance dynamically, and doing it at the hypervisor platform level is not as beneficial as it would seem.
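You can see which C-states the kernel knows about via the cpuidle sysfs interface; a minimal sketch, assuming a Linux host:

```shell
# List the idle (C-) states the kernel knows about, plus whether each is
# enabled (disable file: 0 = enabled). Deeper states save power but add
# wake-up latency, which is exactly what we are trying to avoid here.
if [ -d /sys/devices/system/cpu/cpu0/cpuidle ]; then
    for st in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
        [ -d "$st" ] || continue
        printf '%s: %s (disabled=%s)\n' "$(basename "$st")" \
            "$(cat "$st/name")" "$(cat "$st/disable" 2>/dev/null || echo '?')"
    done
else
    echo "cpuidle not exposed (common inside guests and containers)"
fi
```

With C-states disabled in firmware, you would expect only the shallowest state (often POLL or C1) to appear here.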
This setting will vary across BIOS types, but you are looking for the Power and Performance topic and the items that control CPU Power and Performance Policy. You want this set to Performance, which sets a policy that prioritizes performance even at the expense of energy efficiency. There is usually a Workload Profile that can be set, and you want that set to I/O Sensitive or I/O Throughput depending on your platform. This disables the processor-utilization-driven power management features that impact the performance of the links between I/O and memory.
Because all the dynamic scaling features are being disabled and all of the performance settings are being maximized, you will want to fix the fans to performance and disable any scaling features there as well. Noise should never be a concern, and fans cycling up and down while heat and processor frequencies fluctuate is simply not an optimal way to run the hypervisor platform. Therefore, maximize the cooling as much as possible and leave it fixed.
There are a lot of settings available within the hypervisor to tune and maximize performance, which is a complete article topic in itself. However, one quick win is coordinating the number of physical cores with the Non-Uniform Memory Access (NUMA) nodes. This is a Red Hat specific virtualization optimization, but it's worth exploring an equivalent construct if your platform differs.
Historically, all memory on AMD64 and Intel 64 systems was equally accessible by all CPUs. Known as Uniform Memory Access (UMA), this meant access times were the same no matter which CPU performed the operation.
This behavior is no longer the case with recent AMD64 and Intel 64 processors. In Non-Uniform Memory Access (NUMA), system memory is divided across NUMA nodes, which correspond to sockets or to a particular set of CPUs that have identical access latency to the local subset of system memory.
In general, you want to assign the guest physical cores from the CPU on the same NUMA node as the PCIe slot the NIC occupies. It is also advisable to pin the CPUs. This ensures that the compute and memory resources are configured as optimally as possible in virtualization, so that the guest system carries as little overhead as possible. After speaking with some architects about this: your packet performance can drop significantly if you cross NUMA boundaries, in some cases almost a 350% reduction. CPU and RAM load will remain stable, which can add to the confusion about why this is happening, or lead to erroneous assumptions that the software is just not 'performant'. For those reasons, and because the point of a BIG-IP largely sits in moving packets efficiently, it is well worth spending some time in this area and getting it right.
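A minimal sketch of how you might map this on a Linux/KVM host: list the NUMA nodes, find each NIC's node, then pin the guest accordingly. The interface names are discovered from sysfs, and the guest name 'bigip-ve' in the virsh comments is hypothetical:

```shell
# Map NUMA topology: which nodes exist, and which node each NIC's PCIe
# device sits on. Pin guest vCPUs to cores from the NIC's node to avoid
# cross-node packet paths.
echo "NUMA nodes:"
ls -d /sys/devices/system/node/node* 2>/dev/null || echo "  none exposed"
for nic in /sys/class/net/*/device/numa_node; do
    [ -r "$nic" ] || continue
    iface=$(echo "$nic" | cut -d/ -f5)
    echo "NIC $iface is on NUMA node $(cat "$nic")"  # -1 = no affinity reported
done
# Example pinning with libvirt (guest name 'bigip-ve' is hypothetical):
#   virsh vcpupin bigip-ve 0 2
#   virsh numatune bigip-ve --nodeset 0 --mode strict
```

Tools such as numactl --hardware and lstopo give a richer view of the same topology if they are available on your platform.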
There are some tools and resources to help you in the references section of this article that you may want to review when optimizing this specific configuration.
If you want a little more depth and exploration, the Business Development Engineers wrote a very detailed article titled "Exploring BIG-IP VE capabilities on Dell PowerEdge R650 Servers". Even if your platform of choice isn't a Dell PowerEdge R650, it is an excellent review of common features and their impact, and a worthwhile read for any additional optimizations past the quick and easy.