Choosing an Instance for F5 BIG-IP in AWS
One of the responsibilities I have at F5 is helping our Solution Engineers work through performance and sizing for deployments in AWS. Sometimes this is a straightforward conversation and other times we need to dive deep. In this article I will focus on AWS, but the same concepts can be applied to most cloud environments, each of which has its own nuances.
Performance in the Cloud - Multiple Layers
When we look at sizing, we have the intersection of the AWS network layer, the EC2 network layer, and the impact on BIG-IP of losing FPGAs and using different network drivers. Before sharing rules of thumb for which instance types you should choose, let's dive into each of these to provide context.
AWS Network Layer
In AWS, there are performance dimensions at the network layer that exist independently of the EC2 instance. When customers are looking at deploying, migrating, or building a DR site in AWS, we need to account for these aspects.
Egress Bandwidth - Non-AWS Endpoints
When you have a client communicating from outside of AWS to a system running on EC2, you will run into limits based on the number of vCPUs allocated to your instance. The peer system could be in your data center or Internet-based.
At the time of this article, the limits are:
- Fewer than 32 vCPUs: outbound traffic leaving AWS is limited to 5 Gb/s
- 32+ vCPUs: outbound traffic leaving AWS is limited to 50% of the instance's network interface bandwidth
These limits can have a material impact on how many services can be deployed on an instance in AWS if you approach it from a vertically scaled deployment or if you are only looking at the bandwidth allocated to the instance. A quick way to reason about these caps is sketched after the questions below.
As you consider this performance dimension:
- How much bandwidth does the single largest application need to send out of AWS?
- What is my total outbound bandwidth for AWS?
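To make the arithmetic concrete, here is a minimal sketch (my own illustration, not an AWS tool) that estimates the effective egress cap for traffic leaving AWS from the vCPU count and the instance's advertised NIC bandwidth, using the rules above:

```python
def egress_cap_gbps(vcpus: int, nic_bandwidth_gbps: float) -> float:
    """Estimate the egress cap (in Gb/s) to non-AWS endpoints.

    Rules summarized above:
      - fewer than 32 vCPUs: capped at 5 Gb/s
      - 32+ vCPUs: capped at 50% of the instance's NIC bandwidth
    """
    if vcpus < 32:
        return min(5.0, nic_bandwidth_gbps)
    return nic_bandwidth_gbps * 0.5


# Example: an 8 vCPU instance with a 12.5 Gb/s NIC is still capped at 5 Gb/s
# for traffic leaving AWS, even though the NIC is rated much higher.
print(egress_cap_gbps(vcpus=8, nic_bandwidth_gbps=12.5))   # 5.0
print(egress_cap_gbps(vcpus=32, nic_bandwidth_gbps=50.0))  # 25.0
```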
Egress Bandwidth - Internal to AWS
When both systems are inside of AWS, there are different dimensions. While the instance does have full access to its defined network characteristics (more on this later), the number of flows and the diversity of communication peers (i.e., flow tuple entropy) matter. Let's review what AWS has in their documentation:
Multi-Flow Traffic - Bandwidth for aggregate multi-flow traffic available to an instance depends on the destination of the traffic.
- Within the Region – Traffic can utilize the full network bandwidth available to the instance.
- To other Regions, an internet gateway, Direct Connect, or local gateways (LGW) – Traffic can utilize up to 50% of the network bandwidth available to a current generation instance with a minimum of 32 vCPUs. Bandwidth for a current generation instance with less than 32 vCPUs is limited to 5 Gbps.
Single-Flow Traffic - Baseline bandwidth for single-flow traffic is limited to 5 Gbps when instances are not in the same cluster placement group. To reduce latency and increase single-flow bandwidth, try one of the following:
- Use a cluster placement group to achieve up to 10 Gbps bandwidth for instances within the same placement group.
- Set up multiple paths between two endpoints to achieve higher bandwidth with Multipath TCP (MPTCP).
- Configure ENA Express for eligible instances within the same subnet to achieve up to 25 Gbps between those instances.
Placement Groups - Note that using cluster placement groups has a material impact on single-flow traffic that is contained within the placement group.
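If single-flow bandwidth between instances is the constraint, a cluster placement group is one of the levers above. As a rough sketch (the group name, AMI ID, instance type, and region below are placeholders), creating one with boto3 and launching instances into it looks something like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an example

# Create a cluster placement group: higher single-flow bandwidth and lower
# latency between members, at the cost of reduced placement diversity.
ec2.create_placement_group(GroupName="bigip-cluster-pg", Strategy="cluster")

# Launch instances into the placement group. The AMI ID and instance type are
# placeholders for your own values.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m6i.8xlarge",
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "bigip-cluster-pg"},
)
```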
As you consider this performance dimension:
- Where are the network flow peers located?
- How many flows are in question?
- Does traffic cross an environmental or regional boundary?
Network Maximum Transmission Unit (MTU)
When we are talking about bandwidth, MTU matters. In the AWS environment, the supported MTU is affected by where the instance resides (Wavelength Zone vs. a VPC in a Region) and the communication pattern (Internet gateway? Internal? Direct Connect? Another Region?).
As you consider this performance dimension:
- What MTU can my systems support?
- What MTU can my deployment location support?
- What MTU can my network path support?
- Do my systems and security groups support PMTUD (i.e., allow the ICMP messages it relies on)?
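One quick, non-authoritative way to sanity-check a path is to ask the Linux kernel what MTU it currently believes applies toward a destination. The sketch below assumes a Linux test host; the socket option values are Linux-specific and the addresses are placeholders:

```python
import socket

# Linux-specific socket option values (from the kernel headers).
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DO = 2   # always set the DF bit, never fragment locally
IP_MTU = 14          # read the kernel's current path MTU for this route


def current_path_mtu(dest_ip: str, dest_port: int = 443) -> int:
    """Return the kernel's current path MTU estimate toward dest_ip.

    This starts at the route/interface MTU and only shrinks if PMTUD feedback
    (ICMP fragmentation-needed) has been received for the path.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect((dest_ip, dest_port))  # UDP connect sends no packets
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()


# Example: compare an in-VPC peer (often 9001) with an Internet-facing peer
# (typically 1500). These addresses are placeholders.
print(current_path_mtu("10.0.1.10"))
print(current_path_mtu("192.0.2.10"))
```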
EC2 Network Layer
At this layer of the network, we need to look at aspects that may not be expressly defined by AWS. This is the raw performance you can get from an instance based on its network characteristics, and it will vary based on both instance type and instance size.
EC2 Bandwidth Characteristics
Instances in EC2 have listed network characteristics. These characteristics will vary based on instance generation (2/3/4/5/6), sub-type (n - network optimized), and CPU scale (small, medium, large, xlarge, etc.). AWS does not publish all of these numbers, so the data below was observed in my own testbed using iperf3.
Descriptor | Observation
--- | ---
Low | up to 300 Mb/s
Medium | up to 700 Mb/s
High | up to 900 Mb/s
Up To | Full bandwidth for a time interval*
Dedicated | Full bandwidth
*interval - In AWS, some instance types have network behavior that works on a burst system; once you have exhausted the burst, you will be rate-limited at a base level until enough time has passed.
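If you want to reproduce this kind of measurement in your own environment, a rough sketch (my own script outline, not an AWS or F5 tool) is to run repeated iperf3 tests against a peer instance and log throughput over time; the drop from burst to baseline becomes obvious. The peer address is a placeholder and iperf3 must be running in server mode (iperf3 -s) on the other instance.

```python
import json
import subprocess
import time

PEER = "10.0.1.10"  # placeholder for the iperf3 server instance

for minute in range(60):
    # 30-second test with 8 parallel streams, JSON output for easy parsing.
    result = subprocess.run(
        ["iperf3", "-c", PEER, "-P", "8", "-t", "30", "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
    print(f"sample {minute}: {gbps:.2f} Gb/s")
    time.sleep(30)  # pause between samples
```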
In the graphs below, I test several instance types. We can see the impact on network bandwidth once the burst interval has been used, where applicable.
Below, I tested an m6i.2xl instance with iperf3 using multiple flows between two instances.
There are a few items of note in this test.
- The initial burst interval achieved the line rate of the "up to 12.5 Gb/s" rating for approximately 20 minutes.
- After the burst interval, traffic was policed to 3.1 Gb/s.
- A second test run (stopping and restarting traffic in succession) allowed for a much shorter burst period before throttling was applied by the hypervisor.
In the next test, I moved to a much larger instance, an m6i.8xl, with dedicated access to a NIC. While I will only display a single graph, I was not able to locate a throttle time period, aligning with AWS documentation of dedicated access to the NIC.
My testing of various m5, m5n, c6i, and c6in instances held consistent with the instance sizes and how they impact performance. An item of note is that consistent packet performance does seem to vary by instance class. If we look at the following graphs, we can see that the c5n.2xl has much greater variance than the c5.2xl. What is unclear is whether this variance was upstream (i.e., a noisy neighbor) or is consistent over time, as both of these instances are rated as "up to" in their network capacity.
c5n.2xl
c5.2xl
You can also see variation in sustained performance between similarly sized instances (again, was this an upstream event? changing instance types causes the instance to be moved to a new server, and not all tests were done on the same day). What if we look at systems that are not rate-limited and should have dedicated access? Again, we see variances between similar types.
c5n.9xl
c6in.8xl
c5.9xl
As you consider this performance dimension, you need to consider the following:
- What is the sustained bandwidth required for my instance?
- Does the communication peer pattern support this sustained bandwidth?
- What is the aggregate bandwidth I expect to see across my estate?
- Does auto-scaling make sense for this application, my required architecture and my operational model?
- What is the impact of uncontrollable factors, such as upstream congestion that I might not be able to see?
EC2 Security Group Flow Table Characteristics
Just like the network performance characteristics, the security group flow table changes based on instance type, class, and size. AWS states that security groups (SGs) are stateful; i.e., if you allow the connection to be initiated, the response will be allowed. If we consider how both TCP and UDP work, this means that a connection tracking table must be maintained in memory on the hypervisor, and that it can be exhausted. AWS has not published a list of connection table sizes, but customers with an NDA can ask for them. Flow table sizing and its impact on performance are influenced by your architecture, flow timeouts, and whether you can disable SG rules.
As you consider this performance dimension:
- Are your flows short-lived or long-lived?
- Are they bursty?
- You may need to tune the TCP idle timeout. Do all parts of the system support this?
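A back-of-the-envelope way to reason about connection-tracking pressure is Little's Law: concurrent tracked flows are roughly the new-flow rate multiplied by how long each entry lives (flow duration plus idle timeout). The numbers in the sketch below are purely illustrative:

```python
def tracked_flows(new_flows_per_sec: float,
                  avg_flow_duration_sec: float,
                  idle_timeout_sec: float) -> float:
    """Estimate concurrent connection-tracking entries (Little's Law).

    Each flow occupies a table entry for roughly its active lifetime plus the
    idle timeout before the entry is reaped.
    """
    return new_flows_per_sec * (avg_flow_duration_sec + idle_timeout_sec)


# Example with assumed numbers: 5,000 new flows/s of short-lived HTTP traffic
# (2 s average) with a 350 s idle timeout keeps ~1.76M entries in the table;
# shrinking the idle timeout (where the application tolerates it) has a far
# bigger effect than shortening the flows themselves.
print(f"{tracked_flows(5_000, 2, 350):,.0f}")  # 1,760,000
print(f"{tracked_flows(5_000, 2, 60):,.0f}")   # 310,000
```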
EC2 Packets Per Second (PPS)
Similar to the flow limits, EC2 instances have a limit to the PPS that can be processed. Once an instance starts exceeding the PPS limit, the hypervisor may drop packets, impacting traffic sent to or from the instance. While AWS does not list this dimension on the instance types, you can find references to its existence in multiple places in AWS documentation, so ask your AWS Solution Architect for the details (NDA required). Once you have the data, you will still need to baseline the performance, but it is a tool to ensure you are tuning the right dimensions.
As you consider this performance dimension:
- What are the expected PPS requirements for my application?
- Do the sustained PPS requirements sit below the instance PPS limit?
- Does my application need more (or less) PPS than the current instance provides?
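Because PPS, not bandwidth, is often the real ceiling, a simple conversion helps frame the question; the packet sizes below are illustrative:

```python
def pps(throughput_gbps: float, avg_packet_bytes: int) -> float:
    """Approximate packets per second for a given throughput and packet size."""
    return (throughput_gbps * 1e9) / (avg_packet_bytes * 8)


# The same 5 Gb/s of traffic is a very different load depending on packet size:
print(f"{pps(5, 1500):,.0f} pps")  # ~417,000 pps with large packets
print(f"{pps(5, 200):,.0f} pps")   # ~3,125,000 pps with small packets
```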
EC2 Number of IPs to Virtual Servers
Instances in AWS have a limit on the number of IP addresses that can be assigned to them. Some customers prefer a highly dense deployment of Virtual Servers, and these limits are material in their implications. The number of public IPs assigned to an instance cannot exceed the number of private IPs assigned to it. For applications that are only served to internal users, there are other architectures that can be used to drive density.
As you consider this architecture dimension:
- How many virtual servers do I need?
- How many virtual servers must have a public IP?
- How do I connect my AWS and non-AWS environments? Do these support using a routed model for internal apps?
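If you want to see the raw IP ceiling for a candidate instance, the ENI and per-ENI IPv4 limits are available from the EC2 API. A minimal boto3 sketch (region and instance types are examples) might look like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an example

# Look up how many ENIs and private IPv4 addresses per ENI an instance type
# supports; the product is a rough ceiling on addressable virtual servers
# before the configuration has to be disaggregated.
resp = ec2.describe_instance_types(InstanceTypes=["m6i.2xlarge", "m6i.8xlarge"])
for it in resp["InstanceTypes"]:
    net = it["NetworkInfo"]
    max_ips = net["MaximumNetworkInterfaces"] * net["Ipv4AddressesPerInterface"]
    print(it["InstanceType"],
          f'{net["MaximumNetworkInterfaces"]} ENIs x '
          f'{net["Ipv4AddressesPerInterface"]} IPv4/ENI = {max_ips} addresses')
```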
EC2 - Number of Network Interfaces
The number of interfaces in AWS changes the number of IP addresses, route tables, and subnets that BIG-IP can connect to, but it does not change the total bandwidth available to the instance. The network characteristic reported at the instance level is a function of attributes such as instance type and instance size, and it reflects best-case throughput.
As you consider this architecture dimension:
- If you are trying to replicate a VLAN pattern with network interfaces, remember that instances have interface limits
- Network interfaces add network complexity without adding additional capacity
BIG-IP Performance Considerations
When we deploy network-centric workloads in the public cloud, we see a shift. If a customer is in the midst of a migration, they are probably used to our appliances or chassis systems, which have been optimized from the network interface to the CPU and have FPGAs to assist with packet processing; this changes in the cloud.
BIG-IP - CPU, SSL and Packet Forwarding
In the cloud, we do not have FPGAs or Smart NICs, and deploying systems in AWS changes how packets are processed and how SSL is handled. With SSL and packet forwarding moving to the CPU, cloud-based deployments will see higher CPU utilization than appliance-based systems. Due to this change, I recommend that users adopt Elliptic Curve ciphers for TLS to increase performance.
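A quick client-side check (not an F5 tool, just the Python ssl module) can confirm whether a virtual server is actually negotiating an ECDHE-based cipher suite; the address below is a placeholder:

```python
import socket
import ssl


def negotiated_cipher(host: str, port: int = 443):
    """Report which TLS cipher suite a virtual server actually negotiates."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE  # test VIP may use a self-signed cert
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.cipher()  # (cipher name, protocol version, secret bits)


# Look for an ECDHE-based suite in the result to confirm elliptic curve key
# exchange is in use on this (placeholder) virtual server address.
print(negotiated_cipher("203.0.113.10"))
```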
BIG-IP - Network Drivers
When we deploy into AWS, the number of network interfaces on BIG-IP can cause the system to load different network drivers. Single-interface systems use a sock driver, which has lower performance than the AWS ENA driver we have implemented as part of our DPDK stack; the ENA driver provides higher network performance on systems with dedicated data plane interfaces.
Other Modules: Adv. WAF, AFM, APM
By moving SSL (L6) and packet processing (L4) to the CPU, we create additional resource considerations. If your deployment leverages additional modules, we may need more CPUs than you have deployed, or we may need to reshape your architecture by disaggregating it from a large, monolithic pattern to a wide, horizontal one.
When you consider BIG-IP performance and architecture dimensions:
- CPU matters. We no longer have FPGAs. How is your current system performing? How much SSL is being handled by an FPGA?
- What does your network traffic look like? Small packets? Large packets? Elephant flows?
- Are you using Elliptic Curve for TLS?
- Logging also generates CPU load. Do you have CPU cycles?
- Does your topology require a single NIC or can you have dedicated TMM interfaces?
Rule(s) of Thumb - Intersecting Performance and Rearchitecting
Below, I have captured the most common questions that need to be answered when moving a deployment to AWS; a simple checklist sketch follows the list.
- Do you have a specific VLAN network architecture that must be followed? If the answer is yes and if the number of VLANs is greater than the number of interfaces you can have on the TMM data plane, then you will need to disaggregate the configuration.
- Do you exceed the SSL TPS that can be achieved in software? I recommend that customers look at elliptic curve ciphers, but there are limits that will require a configuration to be disaggregated.
- Does your network performance requirement exceed any of the items listed in the sections above? If so, you will require disaggregation.
- Do you have a high number of VIPs that exceeds the number of IPs that can be assigned, AND do they need to be public? If so, your configuration will require disaggregation.
- Do you have a high number of internal VIPs and no ability to use an alien IP range? If so, your configuration will require disaggregation.
- Do you have additional modules that are driving CPU and memory consumption?
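One way to keep these questions actionable is to encode them as a simple gate. The function below is purely illustrative; the parameter names and example numbers are my own and only mirror the questions above:

```python
def needs_disaggregation(vlans_required: int, data_plane_nics: int,
                         ssl_tps_required: int, ssl_tps_in_software: int,
                         bandwidth_required_gbps: float, bandwidth_cap_gbps: float,
                         public_vips_required: int, public_ips_available: int) -> bool:
    """Return True if any single dimension forces a wider, horizontal design."""
    return any([
        vlans_required > data_plane_nics,
        ssl_tps_required > ssl_tps_in_software,
        bandwidth_required_gbps > bandwidth_cap_gbps,
        public_vips_required > public_ips_available,
    ])


# Example with illustrative numbers: the bandwidth requirement alone is enough
# to force the configuration to be split across multiple instances.
print(needs_disaggregation(4, 8, 2_000, 10_000, 18.0, 12.5, 20, 50))  # True
```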
Rule(s) Thumb - Instance Class
This chart provides general rules of thumb for the instance class you should choose based on software and use cases.
Wrapping Up
Clouds have the appearance of infinite capacity and infinite resources. While it is true that there is significant flexibility, it does not replace the engineering work required to architect the deployment for your organization. It is better to engage early in the cycle than after the fact. There are multiple network layers and multiple teams involved. Nothing beats proper planning to minimize your MTTR when something does go wrong.