Increase Security in AWS without Rearchitecting your Applications - Part 4: Thursday
Welcome back! On Tuesday we discussed the security concerns we are facing, different architecture patterns, and why using F5 SSL Orchestrator with AWS Gateway Load Balancer can solve a multi-dimensional security problem. On Wednesday morning we reviewed the configuration items that enable SSL Orchestrator to address the different security needs, followed by an afternoon investigation into how the AWS objects are structured to enable the solution. Today we will look at scalability, resiliency, common troubleshooting scenarios, and a brief discussion of how to deploy BIG-IP.
Reviewing the End to End Solution
When we look at the end-to-end solution we see a complex pattern composed of Application (protected) VPCs, a Security VPC, endpoints, and supporting objects. Yesterday we went through this architecture in depth.
Scaling The System
Instance Types
In AWS you can find an array of instance types that are general purpose, compute optimized, or memory optimized, spread across different hardware generations. Examples would be an m5.2xlarge (general purpose, 5th generation) and a c6i.4xlarge (compute optimized, 6th generation), each offering different numbers of vCPUs. My general recommendation is to lean towards the M family and nothing older than the 5th generation. The other aspect of an instance is how it behaves on the network. If you review the AWS documentation you will see that smaller instances are labeled with "up to" bandwidth numbers while instances with larger vCPU counts have dedicated bandwidth numbers. This is extremely important in the SSL Orchestrator use case because these numbers reflect how much bandwidth an instance can send in bursts versus at sustained levels. If we use a simple example of a server responding with a 1 Gb/s stream (not common, I know, but the math is easy) whose responses are inspected by SSL Orchestrator through two security chains, we are sending 3 Gb/s of traffic "off" the instance:
| Egress Flow | Bandwidth |
| --- | --- |
| out-to-sec-1 | 1 Gb/s |
| out-to-sec-2 | 1 Gb/s |
| out-to-client | 1 Gb/s |
| Total | 3 Gb/s |
The above example shows how easy it is to end up sending a lot of traffic. So let's take a look at how an instance with an "up to" designation behaves versus one with a dedicated network allowance. In the example below, iperf was used to send traffic until the rate limiter was applied (RUN 1), at which point the traffic was stopped and immediately restarted (RUN 2) to see how long it would take to impact traffic again. While the graphs are the same size, the numbers along the bottom are seconds; the rate limiter asserts itself much more quickly on the second run.
If we compare that to an m6i.8xlarge you will see that the rate limiter is never applied, since the instance has dedicated network bandwidth.
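If you would like to reproduce this test yourself, a minimal sketch using iperf3 follows (the original test used iperf; the receiver IP and durations below are placeholders). The key is to run the client long enough to exhaust the burst allowance, stop it, and immediately run it again:

```bash
# On the receiver (ideally an instance with dedicated bandwidth):
iperf3 -s

# RUN 1 - on the "up to" instance, push traffic until the rate limiter engages.
# 10.0.1.10 is a placeholder receiver IP; -P 8 opens eight parallel streams.
iperf3 -c 10.0.1.10 -P 8 -t 600

# RUN 2 - restart immediately and compare how quickly throughput drops.
iperf3 -c 10.0.1.10 -P 8 -t 600
```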
For our use case, instances with dedicated network bandwidth are encouraged.
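You can check whether an instance type has an "up to" or a dedicated network rating without leaving the command line; here is a quick sketch using the AWS CLI (substitute any instance types you are considering):

```bash
# Compare the advertised network performance of candidate instance types.
# "Up to 10 Gigabit" indicates burstable; a bare number indicates dedicated.
aws ec2 describe-instance-types \
  --instance-types m5.2xlarge m6i.8xlarge \
  --query 'InstanceTypes[].[InstanceType,NetworkInfo.NetworkPerformance]' \
  --output table
```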
Security Functions
Within the security VPC we have one or more SSL Orchestrators and their associated security appliances. I like to think of them as a disaggregated system with different considerations at the aggregate and component level; these are the "blocks" in my security system. We need to approach scaling by looking at the entire security processing capacity of a single SSL Orchestrator instance together with the security devices associated with that deployment.
SSL Orchestrator Scale Block
With SSL Orchestrator being the first and last hop of the security processing, these are the metrics of interest. We have deployed into the public cloud and pushed all of the work to the instance CPU (noting that FPGAs are no longer available to handle complex SSL operations), including any other L7 processing that needs to happen. Eventually, you may need to add additional parallel blocks (SSL Orchestrator + security services) to meet your traffic requirements.
| Resource | Description of Usage | Scale Option |
| --- | --- | --- |
| Instance CPU | Processing SSL, network traffic, protocol inspection, access policy, monitoring | Scale up to 16 vCPU, then add a parallel deployment block |
| Instance Bandwidth | AWS instances have network characteristics based on type; all egress bandwidth counts | Scale to a larger network instance, then add a parallel deployment block |
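To know when a block is approaching capacity, watching CPU and network egress on the SSL Orchestrator instance is a reasonable proxy. A minimal sketch using CloudWatch follows; the instance ID and time window are placeholders:

```bash
# Peak CPU for an SSL Orchestrator instance over one day (5-minute buckets).
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-06-01T00:00:00Z --end-time 2024-06-02T00:00:00Z \
  --period 300 --statistics Maximum

# NetworkOut (bytes per period) shows how close egress is to the allowance.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-06-01T00:00:00Z --end-time 2024-06-02T00:00:00Z \
  --period 300 --statistics Sum
```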
Service Chain Instances Scale Block
Instances in the security chain have limits similar to the SSL Orchestrator nodes, but the scaling options differ. You can add additional security service instances as long as the SSL Orchestrator deployment in the block has spare capacity, or you can redeploy the security service on a larger instance before deploying an additional scale block.
| Resource | Description of Usage | Scale Option 1 | Scale Option 2 |
| --- | --- | --- | --- |
| Instance CPU | Processing SSL, network traffic, protocol inspection, access policy, monitoring | Scale up to 16 vCPU, add parallel deployment block | If the SSL Orchestrator block is below capacity, add an inspection node |
| Instance Bandwidth | AWS instances have network characteristics based on type; all egress bandwidth counts | Scale to a larger network instance, add parallel deployment block | If the SSL Orchestrator block is below capacity, add an inspection node |
Understanding Resiliency
In the hardware world we think of active/standby and high availability via failover. In the cloud world we use the term resiliency, and it is accomplished by deploying N active instances horizontally. Inside the processing chain we can think of this as two layers: resiliency at the SSL Orchestrator layer and resiliency at the security device layer.
| Resiliency Item | Single AZ Considerations | Multiple AZ Considerations |
| --- | --- | --- |
| SSL Orchestrator | Instance Protection, Delete Protection, N number of scale blocks per AZ | Repeat the single-AZ pattern across 2 or more AZs |
| Inspection Device | Instance Protection, Delete Protection, N number of scale blocks per AZ | Repeat the single-AZ pattern across 2 or more AZs |
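The table mentions instance and delete protection; a minimal sketch of enabling both on a security instance with the AWS CLI (the instance ID is a placeholder):

```bash
# Prevent accidental termination of a security instance.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 --disable-api-termination

# Prevent accidental stop as well (EC2 stop protection).
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 --disable-api-stop
```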
If we borrow the diagram from the scale blocks we can demonstrate how horizontal scale works across three AWS AZs. In this scenario we have horizontal capacity during normal operations, and should an SSL Orchestrator instance need to be rebooted, traffic will simply be sent to the security processing chain of one of the other scale blocks.
All Active Deployment
When we deploy this solution each block is isolated. GWLB allows us to distribute traffic across N SSL Orchestrator systems, providing horizontal scale both inter- and intra-AZ. Should AWS suffer a full AZ failure, traffic will be sent to the other deployments until such time as the AZ is restored. Should you need to take an SSL Orchestrator instance offline, traffic will again be redistributed. Each deployment in the resiliency pattern is an active participant in processing traffic.
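For GWLB to spread traffic across blocks in every AZ, cross-zone load balancing must be enabled on the load balancer (this is also called out in the recommendations later). A minimal sketch, assuming you have the GWLB's ARN; the ARN below is a placeholder:

```bash
# Enable cross-zone load balancing on the Gateway Load Balancer so any
# endpoint can reach healthy SSL Orchestrator targets in any AZ.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/gwy/sslo-gwlb/0123456789abcdef \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true
```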
Troubleshooting Packets in the Security VPC
If you need to troubleshoot packets in the security VPC you need to understand the end-to-end flow through the security VPC. Let's look at a sample flow from the internet to a protected endpoint; the same pattern can be applied between two internal endpoints. We will use the following IP addresses to build the examples: internet client 1.1.1.100 and protected IP address 192.168.1.1. We will assume that all traffic is permitted.
| Interface Name | Interface Function |
| --- | --- |
| geneve-tunnel | VLAN associated with the inner tunnel traffic |
| out-to-sec-chain | VLAN associated with traffic leaving SSL Orchestrator toward one or more security devices |
| in-from-sec-chain | VLAN associated with traffic leaving one or more security devices and entering SSL Orchestrator |
Example Capture Commands
| Captured Flow | Interface | Example Command | Notes |
| --- | --- | --- | --- |
| 1.1.1.100 to 192.168.1.1 (ingress) | geneve-tunnel | `tcpdump -ni geneve-tunnel src 1.1.1.100 and dst 192.168.1.1` | Captures both the in and out flow of the ingress processing |
| 1.1.1.100 to 192.168.1.1 (ingress) | out-to-sec-chain | `tcpdump -ni out-to-sec-chain src 1.1.1.100 and dst 192.168.1.1` | Captures the out flow from SSL Orchestrator during ingress processing |
| 1.1.1.100 to 192.168.1.1 (ingress) | in-from-sec-chain | `tcpdump -ni in-from-sec-chain src 1.1.1.100 and dst 192.168.1.1` | Captures the in flow from the inspection device to SSL Orchestrator during ingress processing |
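If nothing appears on the geneve-tunnel interface at all, it can be useful to confirm the encapsulated traffic is reaching the instance in the first place. A sketch of capturing the outer GENEVE packets; "gwlb-vlan" is a placeholder for whatever VLAN hosts your tunnel endpoint:

```bash
# GWLB encapsulates traffic in GENEVE on UDP 6081; if these packets are
# absent, the problem is upstream (GWLB target health, security groups).
tcpdump -ni gwlb-vlan 'udp port 6081'
```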
Common Issues
| Issue | Validate |
| --- | --- |
| GWLB is not sending SSL Orchestrator any packets | Did you open the security group for the SSL Orchestrator VLAN that hosts the tunnel to UDP 6081? Did you use the correct SSL Orchestrator interface in the GWLB service? |
| SSL Orchestrator/security device is sending packets but my security device/SSL Orchestrator does not receive them | On the security device: did you disable the SRC/DST check? Did you permit 0.0.0.0/0 in your security group? |
| SSL Orchestrator is not sending packets back down the tunnel | Did you create the ARP entry? Is the pool marked down? Did you disable strictness in the security policy? |
| TCP handshakes complete but my files do not transfer | Did you create a virtual server for non-TCP and non-UDP traffic? This is required for PMTUD. Did you set the SSL Orchestrator interface MTU to 8500? |
| I do not see the packets arriving on my application instance | Ensure that your security group allows the source to send traffic to the instance. Validate your routing configurations. |
| The TCP handshake never completes, but I see the SYN on my app instance | Egress routing issue. |
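Two of the checks above can be verified from the AWS CLI rather than the console; a quick sketch (the instance ID is a placeholder):

```bash
# Confirm the source/destination check is disabled on a security device.
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 --attribute sourceDestCheck

# Disable it if it is still enabled (inspection devices forward traffic
# that is not addressed to them, so the check must be off).
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 --no-source-dest-check
```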
Deployment Options
As you are aware, F5 offers many different options to deploy BIG-IP into AWS. We have example CloudFormation templates and Terraform modules, or you can deploy it manually. As you look at building your initial environment, I recommend starting manually. Before automating the deployment you need to ensure your team has a solid understanding of the F5, AWS, and other security appliance data plane objects in the solution and how you want them applied to your environment. Deploying an F5 AMI in AWS requires that you subscribe to a BYOL-based instance (at the time of this writing we do not offer an hourly SSL Orchestrator enabled system) or upload your self-created image built with the F5 Image Generator.
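To locate the BYOL AMIs available in your region, you can query marketplace images from the CLI; a sketch (the name filter is an assumption and may need adjusting to match the exact listing you subscribe to):

```bash
# List F5 BYOL BIG-IP AMIs available in the current region.
# The name pattern is a guess; adjust it to match your marketplace listing.
aws ec2 describe-images \
  --owners aws-marketplace \
  --filters 'Name=name,Values=F5 BIGIP-*BYOL*' \
  --query 'Images[].[ImageId,Name]' --output table
```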
General VPC Architecture
From my testing and experience building complex network use cases in AWS, I have assembled a list of recommendations.
- Dedicate service subnets (and route tables) for objects like GWLB endpoints, TGW endpoints, S3 Endpoints, EC2 API endpoints.
- To ensure proper East / West traffic flow insertion you should create a dedicated subnet in each AZ and all VPCs for the GWLB endpoints. My testing demonstrated that placing the GWLB endpoints in dedicated subnets is best; trying to send traffic from an instance on subnet A through a GWLB endpoint on subnet A to an instance on subnet B resulted in traffic not being forwarded correctly.
- AWS defaults to /20 ranges when it creates subnets. This address scope is large if East / West inspection is of concern and should be decreased as appropriate.
- Production VPCs should not place management interfaces in a public route table.
- When using GWLB to inspect traffic between VPCs, think about the flow. For example, say we want to inspect traffic from VPC A to VPC B: if we insert into the network at both VPCs, traffic could be inspected twice. It may make more sense to insert at VPC A only.
- The GWLB Endpoints are AZ level constructs. It helps to give them a human friendly name to ensure you are using the correct one.
- All of your load balancers (ALB, NLB, ELB, GWLB) should be enabled for cross zone load balancing.
- If you are using Transit Gateway in your topology, set it to Appliance Mode (a sketch of this and the GWLB endpoint placement follows this list).
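A minimal sketch of the two CLI calls behind these recommendations: creating a GWLB endpoint in a dedicated subnet, and enabling Appliance Mode on a TGW VPC attachment. All IDs and the service name below are placeholders:

```bash
# Create a GWLB endpoint in a dedicated subnet (repeat once per AZ).
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type GatewayLoadBalancer \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0 \
  --subnet-ids subnet-0123456789abcdef0

# Enable Appliance Mode so TGW keeps both directions of a flow in one AZ.
aws ec2 modify-transit-gateway-vpc-attachment \
  --transit-gateway-attachment-id tgw-attach-0123456789abcdef0 \
  --options ApplianceModeSupport=enable
```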
Conclusion
Thank you for spending time with me over the last couple of days. Many customers are currently iterating on how to further inspect and secure traffic in complex multi-cloud environments, and F5 has solutions that can be leveraged across environments and in unique ways. Please let us know how we can assist you and your organization on your application security journey.