high performance
How to Prepare Your Network Infrastructure to Add HPC Clusters for AI to Your Data Center
HPC AI clusters are often deployed as highly engineered "Lego blocks" that are opaque to established data center operations and standards. By taking advantage of established Kubernetes-based networking solutions that provide high-speed, intelligent networking, you can save yourself from expensive cost overruns, data center re-auditing, and delays. Because these Kubernetes-based solutions build on the high-speed networking already required by HPC AI deployments, they further optimize your investment in AI.

Handle Over 100 Gbps With a Single BIG-IP Virtual Edition
Cloud computing is an inescapable term. The general public knows that their cat pictures, videos, and memes somehow go to the cloud. Companies and application developers go cloud-first, or often cloud-only, when they develop a new service, and they think of the cloud as the group of resources and APIs offered by cloud providers. Large enterprises and service providers have a bifurcated view of cloud computing: they see a public cloud and a private cloud. A service provider might mandate that any new software or services must run within its orchestrated virtualization or bare-metal environment, actively discouraging or simply disallowing new vendor-specific hardware purchases. This pushes traditional networking vendors to improve the efficiency of their software offerings and to take advantage of available server hardware opportunistically. Behold the early fruits of our labor.

100+ Gbps L4 From a Single VE?!

We introduced the high performance license option for BIG-IP Virtual Edition with BIG-IP v13.0.0 HF1. Rather than having a throughput-capped license, you can purchase a license that is restricted only by the maximum number of vCPUs that can be assigned. This license allows you to optimize the utilization of the underlying hypervisor hardware. BIG-IP v13.0.0 HF1 introduced a limit of 16 vCPUs per VE; BIG-IP v13.1.0.1 raised the maximum to 24 vCPUs. Given that this is a non-trivial amount of computing capacity for a single VM, we decided to see what kind of performance can be obtained when you use the largest VE license on recent hypervisor hardware. The result is decidedly awesome. I want to show you precisely how we achieved 100+ Gbps in a single VE.

Test Harness Overview

The hypervisor for this test was KVM running on an enterprise-grade rack-mount server. The server had two sockets, and each socket had an Intel processor with 24 physical cores / 48 hyperthreads. We exposed 3 x 40 Gbps interfaces from Intel XL710 NICs to the guest via SR-IOV. Each NIC utilized a PCIe 3.0 x8 slot. There was no over-subscription of hypervisor resources. Support for "huge pages", or memory pages much larger than 4 KB, was enabled on the hypervisor. It is not a tuning requirement, but it proved beneficial on our hypervisor. See: Ubuntu community - using hugepages.

The VE was configured to run BIG-IP v13.1.0.1 with 24 vCPUs and 48 GB of RAM in an "unpacked" configuration, meaning we dedicated a single vCPU per physical core. This was done to prevent hyperthread contention within each physical core. Additionally, all of the physical cores were on the same socket, which eliminated inter-socket communication latency and bus limitations.

The VE was provisioned with LTM only, and all test traffic utilized a single FastL4 virtual server. There were two logical VLANs, and the 3 x 40 Gbps interfaces were logically trunked. The VE has only two L3 presences, one for the client network and one for the server network. In direct terms, this is a single application deployment achieving 100+ Gbps with a single BIG-IP Virtual Edition.
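The article does not reproduce the exact host commands, but a minimal sketch of this kind of hypervisor preparation on a Linux/KVM host might look like the following. The interface names, the libvirt domain name, the core numbering, and the huge page and VF counts are assumptions for illustration only; consult your server vendor's topology information and F5's deployment guides for your actual environment.

```
# Hypothetical KVM host prep for a high performance BIG-IP VE guest.
# Interface names (ens1f0 ...), the libvirt domain name (bigip-ve), and the
# core numbering are made up for illustration -- adjust them to match your
# hardware topology (see `lscpu` and `numactl --hardware`).

# 1. Reserve 1 GB huge pages at boot (kernel command line), then verify.
#    e.g. add: default_hugepagesz=1G hugepagesz=1G hugepages=64
grep Huge /proc/meminfo

# 2. Create one SR-IOV virtual function on each Intel XL710 40 Gbps port.
echo 1 | sudo tee /sys/class/net/ens1f0/device/sriov_numvfs
echo 1 | sudo tee /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 | sudo tee /sys/class/net/ens3f0/device/sriov_numvfs

# 3. Pin the 24 guest vCPUs to 24 physical cores on a single socket so the
#    guest runs "unpacked" (one vCPU per core, no hyperthread sharing).
for vcpu in $(seq 0 23); do
  sudo virsh vcpupin bigip-ve "${vcpu}" "${vcpu}"   # assumes cores 0-23 sit on socket 0
done

# 4. Keep the guest's memory on the same NUMA node as its cores.
sudo virsh numatune bigip-ve --mode strict --nodeset 0 --config
```

The point of the sketch is the shape of the setup, not the specific values: huge pages back the guest memory, SR-IOV hands the XL710 ports to the guest without a software switch in the path, and pinning keeps every vCPU and its memory on one socket.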
Result

The network load was generated using Ixia IxLoad and Ixia hardware appliances. The traffic was legitimate HTTP traffic with full TCP handshakes and graceful TCP teardowns. A single 512 kB HTTP transaction was completed for every TCP connection. We describe this scenario as one request per connection, or 1-RPC. It's worth noting that 1-RPC is the worst case for an ADC. For every Ixia client connection:

- Three-way TCP handshake
- HTTP request (less than 200 B) delivered to Ixia servers
- HTTP response (512 kB, multiple packets) from Ixia servers
- Three-way TCP termination

The following plot shows the L7 throughput in Gbps during the "sustained" period of a test, meaning that the device is under constant load and new connections are established immediately after a previous connection is satisfied. If you work in the network testing world, you'll probably note how stupendously smooth this graph is. The average for the sustained period ends up around 108 Gbps. Note that, as hardware continues to improve, this performance will only go up.

Considerations

Technical forums love car analogies and initialisms like "your mileage may vary" (YMMV). This caveat applies to the result described above. You should consider these factors when planning a high performance VE deployment:

- Physical hardware layout of the hypervisor - Non-uniform memory access (NUMA) architectures are ubiquitous in today's high-density servers. In very simple terms, the implication of NUMA architectures is that the physical locality of a computational core matters, so all of the work for a given task should be confined to a single NUMA node when possible. The slot placement of physical NICs can be a factor as well. Your server vendor can guide you in understanding the physical layout of your hardware. Example: you have a hypervisor with two sockets, and each socket has 20c / 40t. You have 160 Gbps of connectivity to the hypervisor. The recommended deployment would be two 20 vCPU high performance VE guests, one per socket, with each receiving 80 Gbps of connectivity. Spanning a 24 vCPU guest across both sockets would result in more CPU load per unit of work done, as the guest would be communicating between both sockets rather than within a single socket.

- Driver support - The number of drivers that BIG-IP supports for SR-IOV access is growing. See: https://support.f5.com/csp/article/K17204. We also have driver support for VMXNET3, virtio, and OvS-DPDK via virtio. Experimentation and an understanding of the available hypervisor configurations will allow you to select the proper deployment.

- Know the workload - This result was generated with a pure L4 configuration using simple load balancing, and no L5-L7 inspection or policy enforcement. The TMM CPU utilization was at maximum during this test. Additional inspection and manipulation of network traffic requires more CPU cycles per unit of work. A sketch of what such a minimal pure-L4 configuration can look like follows this list.
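For reference, a pure-L4 deployment like the one tested amounts to very little BIG-IP configuration. The tmsh sketch below is illustrative only: the object names, interface numbers, VLAN tags, and addresses are hypothetical, and your interface numbering and network design will differ.

```
# Hypothetical tmsh sketch of a pure-L4 deployment similar to the test setup.
# Object names, interface numbers, tags, and IP addresses are made up for
# illustration; substitute values that match your own environment.

# Trunk the three 40 Gbps SR-IOV interfaces together.
tmsh create net trunk trunk_40g interfaces add { 1.1 1.2 1.3 }

# Two logical VLANs (client side and server side) riding the trunk.
tmsh create net vlan vlan_client interfaces add { trunk_40g { tagged } } tag 100
tmsh create net vlan vlan_server interfaces add { trunk_40g { tagged } } tag 200

# One self IP per VLAN gives the VE its two L3 presences.
tmsh create net self self_client address 10.1.0.245/24 vlan vlan_client
tmsh create net self self_server address 10.2.0.245/24 vlan vlan_server

# Pool of test servers and a single FastL4 virtual server in front of it.
tmsh create ltm pool pool_http members add { 10.2.0.10:80 10.2.0.11:80 }
tmsh create ltm virtual vs_fastl4 destination 10.1.0.100:80 ip-protocol tcp \
    profiles add { fastL4 } pool pool_http

tmsh save sys config
```

Because the virtual server carries only the fastL4 profile, with no L7 profiles or policies attached, traffic stays on the L4 fast path; that is the condition under which the ~108 Gbps figure above was measured, and adding inspection or policy will cost CPU cycles per unit of work.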