Exploring BIG-IP VE capabilities on HPE ProLiant DL365 Gen10 Plus servers - Part 2 of 2
This is part two of a two-part article; the first part is available at the following link:
Exploring BIG-IP VE capabilities on HPE ProLiant DL365 Gen10 Plus servers - Part 1 of 2
The Optimizations We Selected
After careful evaluation of the delivered HPE ProLiant DL365 Gen10 Plus server and its installed ESXi hypervisor, and after completing initial benchmark tests, we made the following changes:
Processor Adjustments
The first change was to the AMD EPYC™ server processor's default out-of-the-box behavior. By default, SMT (simultaneous multithreading) is enabled, which allows each physical core (PCPU) to appear as two “logical” cores; the 32 physical cores of this processor therefore present to the operating system as 64 cores, frequently referred to as 64 threads. We chose to disable SMT so that virtual cores (VCPUs) were guaranteed to map to PCPUs. With SMT enabled, core enumeration traditionally interleaves the PCPUs with their corresponding logical cores: cores appearing as 0, 2, 4, 6, and so on are physical cores, whereas 1, 3, 5, 7, and so on are the associated logical (multithreaded) cores. The benefit of SMT, allowing complex and disparate workloads to share one physical core simultaneously, adds little value in a focused, single-use-case deployment of the server that seeks optimal performance of the F5 software.
Without disabling multithreading, ensuring that separate PCPUs are used would require careful core assignment so that only physical cores were leveraged (pinned). The simplest approach, in our opinion, was to disable SMT through the HPE ProLiant server BIOS, reached via the ancillary Integrated Lights-Out (iLO) server management software, which is accessible out-of-band at its own IP address. The first step is to shut down the server and access the System Utilities by pressing the F9 key as the server first powers up. The initial screen will appear.
Next, enter the Processors screen by following the path highlighted in the red rectangle of the next image and simply disable the AMD SMT option with one mouse click and then accept the confirmation dialog screen. After a system save and restart, simultaneous multithreading will be disabled.
Upon restart, now that SMT is turned off at the server BIOS level, one should also confirm that it is reported as disabled from the perspective of the ESXi operating system. Using the vSphere interface, via vCenter or the local vSphere HTML on-box GUI, the following screenshot shows where to verify after the reboot that simultaneous multithreading, referred to as Hyperthreading by vSphere, is now reported as unavailable.
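In addition to the GUI check, the same state can typically be confirmed over SSH to the ESXi host with a single command; the exact output fields can vary by ESXi release, so treat this as a minimal sketch:
# Report CPU package, core, and thread counts plus hyperthreading state
esxcli hardware cpu global get
With SMT disabled in the BIOS, the hyperthreading fields should report it as inactive and the thread count should match the 32 physical cores.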
In keeping with best practice for deploying a BIG-IP, we recommend allocating 2 GB of memory for each of the 8 cores, with 1 TMM (Traffic Management Microkernel) per VCPU, up to 8 TMMs. As such, 8 of the 32 physical cores are assigned to the BIG-IP VE, by default cores 0 through 7.
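For reference, this sizing corresponds to entries along the following lines in the virtual machine's VMX file; the values shown here are illustrative, with the memory size expressed in MB:
numvcpus = "8"
memSize = "16384"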
Historically, an additional optimization aimed at harnessing maximum processing power has been proposed through CPU core affinity adjustments; a small boost may potentially be achieved by bypassing core 0 and opting for cores 1 through 8.
The recommendation of bypassing core 0 stems from previous versions of ESXi where management tasks and networking on a virtual machine would typically task the first physical core, CPU0. Although not tested here, this change can be made by logging directly into the host and utilizing the vSphere interface to adjust the affinity setting.
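Under the hood, such an adjustment is recorded in the VM's VMX file as a scheduler affinity entry; a minimal sketch for pinning to cores 1 through 8 would look something like the line below, though the key name and value format should be verified against the documentation for your vSphere release:
sched.cpu.affinity = "1,2,3,4,5,6,7,8"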
Power Management Optimization
The next optimization applied to the HPE ProLiant server was to change the default power management policy to one of the available policies that prioritizes server performance and reduces the CPU throttling known to inhibit latency-sensitive applications (VMware KB). The first step of this policy change is made in the BIOS: restart the server and, using the HPE iLO management interface as with the processor changes, press the F9 key to access the default System Utilities page.
As denoted by the red rectangle in the preceding image, follow the path System Utilities -> System Configuration -> BIOS/Platform Configuration (RBSU). On this screen, change the Workload Profile setting to “Virtualization - Max Performance” and acknowledge the confirmation pop-up by clicking OK.
As a second step toward a high-performance server, once the BIOS changes are saved and the server has rebooted, adjust the ESXi power management scheme from its balanced default: in the vSphere console, set the Power Management Active Policy to “High Performance”. The following image shows the default value and the location of the editor.
Select the “High Performance” option in the dialog box that appears, as shown below, and click OK.
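Alternatively, the same policy can usually be applied from the ESXi command line through the host's advanced settings; the option name below reflects our understanding and should be verified on your build:
# Show the current CPU power policy
esxcli system settings advanced list -o /Power/CpuPolicy
# Set the policy to High Performance
esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"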
High-speed Ethernet Adjustments
One primary concern in achieving high traffic rates through the HPE server is ensuring unhindered performance of the Mellanox 25 Gbps network interface cards (NICs). To do this, we first confirmed that the driver and firmware revision on the Mellanox MT2894 ConnectX-6 Lx adapter was among the latest available. A check against VMware’s Hardware Compatibility List (HCL) indicated that the HPE factory install contains a fully supported driver. To confirm this in your own setup, use the search function in the compatibility tool with values such as those shown below:
The resulting details (shown below) will both identify the latest driver compatible with ESXi 7.0 Update 3 and provide a link where the driver can be downloaded and applied to the server, if necessary.
To validate the currently installed drivers for the Mellanox interfaces, we simply opened an SSH session to the ESXi host and issued the two commands highlighted in the next two images. The second image reflects a subset of the driver's features, which spanned a full screen of parameters and default values.
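These checks typically amount to commands along the lines of the following; the vmnic numbering and the nmlx5_core module name are assumptions that should be confirmed on your host:
# List physical NICs and the driver each is bound to
esxcli network nic list
# Show driver and firmware versions for a given Mellanox port
esxcli network nic get -n vmnic2
# Dump the driver module parameters and their current values
esxcli system module parameters list -m nmlx5_core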
One of the chief considerations was to ensure Receive Side Scaling (RSS) was enabled on the Mellanox NICs. RSS is a technology that distributes incoming traffic across multiple hardware receive queues, and therefore across more than one core on the receiving host. The default driver and default configuration for the Mellanox NIC were confirmed to both support RSS and have it enabled. The RSS support can be seen highlighted in the Features list in the previous firmware availability screenshot.
By default, the virtual NICs of an ESXi-based virtual machine, such as the BIG-IP VE, will receive traffic from only one hardware queue on the physical Mellanox network interface. To achieve at least 10 Gbps of throughput, it is important to have the virtual machine request traffic from at least four queues offered by the Mellanox interface.
The virtual machine-layer parameter that governs the requested queue count is the variable ethernetX.pnicFeatures = “4”, where X is the NIC number in the VMX file (0 is MGMT, 1 Internal, 2 External, and 3 HA in a default BIG-IP OVA deployment).
Another parameter that we defined, one that would turn out to be critical, was ethernetX.ctxPerDev, which controls the maximum number of threads available per virtual NIC. Conventional wisdom suggested that, at the layer 2 speeds in question, we should set the parameter to 1, keeping in mind that a value of 3 would allow between 2 and 8 threads per virtual NIC. Our thinking was that we might set this to 3 in the future for much larger workloads. In our past study of Dell PowerEdge servers, an initial value of 1 was indeed sufficient to reach our performance goals. With the HPE ProLiant, however, the net result of using 1 was actually a slight decrease in overall achievable throughput; testing with a value of 3 generated the best performance. This divergence between HPE and Dell in terms of advanced settings was interesting to note.
There are two equivalent approaches to achieving a queue count of four and multi-thread support per vNIC: 1) manually edit the underlying VMX file on the HPE host, or 2) modify the variables using the vCenter/ESXi HTML5 console. The console approach is highlighted in the following screenshots, where both ethernetX.pnicFeatures and its companion variable ethernetX.ctxPerDev are set. For either approach, the result will be the following assignments in the VMX file:
ethernet1.ctxPerDev = "3"
ethernet2.ctxPerDev = "3"
ethernet3.ctxPerDev = "3"
ethernet1.pnicFeatures = "4"
ethernet2.pnicFeatures = "4"
ethernet3.pnicFeatures = "4"
To reach the editing screen for the above in vCenter, select the BIG-IP VE in the Inventory screen, right-click and select “Edit Settings,” followed by “VM Options.” Expand the “Advanced” nested menu item and click on the blue hyperlink labelled “Edit Configuration,” as shown here:
At this point, simply enter the list of variables, ensuring features such as each virtual NIC being able to receive traffic from all four hardware queues of the Mellanox physical interfaces. Click the “Add Configuration Params” button and confirm that the parameters and corresponding values shown in the following image are added.
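Whichever method is used, a quick way to double-check that the entries landed in the VMX file is to read it from an SSH session on the host; the datastore path below is a placeholder for your environment:
grep -E 'ctxPerDev|pnicFeatures' /vmfs/volumes/<datastore>/<vm-name>/<vm-name>.vmx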
TLS Performance of the Optimized Solution
With the HPE ProLiant and ESXi environment optimized, an effort was made to gauge the solution’s capacity when equipped with a standard 10 Gbps BIG-IP license and tested entirely with TLS traffic. Making the test scenario especially demanding was the fact that no form of TLS session reuse, such as classic session IDs or the newer TLS session tickets, was in use; every TCP connection required a full TLS handshake.
The first test utilized a traffic model aimed at producing the most bandwidth, with Ixia-simulated clients downloading 100 successive 512 KB (kilobyte) objects on each TCP connection. The BIG-IP is tasked with intercepting the TLS sessions, performing decryption and inspection, and, in this first scenario, performing an SSL offload function, passing clear-channel HTTP traffic to the emulated Ixia servers. The server-side profile is layer 7 HTTP aware.
As seen in the above image, the solution performed impressively, transporting more than 7.5 Gbps of TLS-encrypted traffic on the external side while using clear HTTP in the internal, server-facing direction. The test utilized elliptic-curve (EC) certificates and AES encryption for data flows.
To test another common deployment model, a different TLS load test was conducted, again aimed at bandwidth measurement, with 100 successive 512 KB objects downloaded by Ixia-emulated clients over each SSL session. There were two major adjustments in this second test. The first was to use traditional RSA certificates with the popular 2048-bit key length. The second was to impose a TLS requirement on the internal, server side of the BIG-IP traffic flows: traffic must also be encrypted toward the back-end servers, as opposed to the previous TLS offload model, which allowed HTTP internally. This model may be more popular than SSL offload due to increasingly stringent security objectives and regulatory requirements.
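For readers who want to reproduce the two deployment models, the difference comes down to whether a server-side SSL profile is attached to the virtual server. The tmsh sketch below illustrates this under stated assumptions: the virtual server addresses, pool name, and use of the default clientssl and serverssl profiles are illustrative, not the exact configuration used in our tests.
# Scenario 1: SSL offload (TLS to clients, clear HTTP to servers)
create ltm virtual vs_offload destination 10.1.20.100:443 ip-protocol tcp profiles add { tcp http clientssl } pool web_pool
# Scenario 2: re-encryption (TLS on both the client and server side)
create ltm virtual vs_reencrypt destination 10.1.20.101:443 ip-protocol tcp profiles add { tcp http clientssl serverssl } pool web_pool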
The result of the second TLS test was very similar to the first, with an average of 7.4 Gbps of TLS traffic carried. The requirement to support TLS on the internal, server side of the BIG-IP had little to no impact on the numbers. In both test cases, CPU load on the 8 cores used by the F5 software averaged between 65 and 70 percent. The only test scenarios that pushed core utilization toward maximum values were bulk transaction-rate tests, in which single, small objects were requested on new SSL sessions each time.
Summary of Findings
In the out-of-the-box experience with the HPE ProLiant DL365 Gen10 Plus server coupled with a newly deployed OVA of F5 BIG-IP Virtual Edition, users can expect exceptional performance results. Using an Ixia PerfectStorm load generator with a Performance L4 profile, a 512 KB payload, and 1,024 concurrent users (100 requests per connection), we were able to achieve greater than 9.5 Gbps of throughput on the standard 10 Gbps F5 license. With an L7 HTTP profile (the same concurrency and a 128-byte payload), we were able to achieve up to 365,000 transactions per second (TPS).
After applying the optimizations described above to the HPE server, the VMware hypervisor, and the BIG-IP virtual machine file (VMX), the BIG-IP produced even more impressive numbers. With the 10 Gbps Virtual Edition license, the test bed measured the full 10 Gbps of throughput on the same Performance L4 profile. With the same L7 HTTP profile, supported transactions per second also increased significantly, now exceeding 822,000 (a TPS increase of 125%).