Using F5 Distributed Cloud AppStack & CE Site Survivability
Introduction
In private and in secure network environments, one of the greatest challenges is keeping apps running when connectivity goes down. Whether it’s due to network maintenance or an unintended outage, when apps can’t connect upstream to retrieve data, users are left with little or nothing to do.
F5 Distributed Cloud AppStack and Customer Edge (CE) survivability, a new feature in Distributed Cloud, provides a unique advantage in upstream outage situations. When all connectivity is lost between a CE and its Regional Edge (RE), including to the Global Controller, CE Survivability kicks in and lets users continue to access the services local to the site. Before this capability, routing to alternate sites would fail even when the sites were connected through a site mesh group, because the identity management certificate authorities (CAs) held within the Global Controller were unreachable and certificate verification could not complete.
With AppStack and CE survivability, local site services are covered with alternative upstreams, both remote and local, in the event of a total loss of external connectivity. When connectivity is lost, the CE brings up a local control plane to take over functions formerly provided by the Global Controller, and it establishes a local CA under a new root CA trusted by its dependent services. To reach the remote nodes that remain reachable within the CE’s site mesh group, Border Gateway Protocol (BGP) is used to exchange routes between sites.
Survivability now protects against loss of connectivity between each of the following:
- Nodes within a multi-node Customer Edge site
- Customer Edge site and Regional Edge (RE) sites
- Customer Edge site and the Global Controller (GC)
- Customer Edge sites within a Full Mesh Site Mesh Group
- Customer Edge sites within a Hub/Spoke Site Mesh Group (future release)
- Customer Edge sites within a DC Cluster Group (future release)
CE offline survivability solves this problem by enhancing each local site's control plane and routing:
- Local Control Plane – A local control plane is set up to manage certificates and secrets using a local Certificate Authority (CA). This CA is minted under a new Root CA and is trusted by services under that tenant.
- Routing – The routing will be exchanged via BGP among nodes of a site and among nodes across sites in a Site Mesh Group.
Note: A maximum of seven days is supported for a site to survive without connectivity to the Global Controller.
How To Enable CE Survivability on AppStack & Customer Edge Sites
Prerequisites:
- Two or more deployed CE Sites
Log in to the Distributed Cloud Console and go to Shared Configuration. Navigate to Virtual Sites, then Add Virtual Site. Enter the following data:
Name: full-mesh-ce-sites
Site Type: CE
Selector Expression: ves.io/siteType = ves-io-ce
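The selector expression above is a label match: the virtual site includes every site whose ves.io/siteType label equals ves-io-ce. As a rough illustration only (this is not how the Console evaluates selectors; the parse_selector and matches helpers, the example-re site, and the labels assigned to each site are all hypothetical), simple "key = value" matching can be sketched in Python as:

```python
def parse_selector(expression: str) -> dict[str, str]:
    """Parse a comma-separated list of 'key = value' terms into a dict."""
    terms = {}
    for term in expression.split(","):
        key, sep, value = term.partition("=")
        if not sep:
            raise ValueError(f"unsupported selector term: {term!r}")
        terms[key.strip()] = value.strip()
    return terms


def matches(labels: dict[str, str], expression: str) -> bool:
    """Return True when every selector term equals the site's label value."""
    return all(labels.get(k) == v for k, v in parse_selector(expression).items())


# Hypothetical site inventory. Only the label key and the ves-io-ce value
# come from the article; the labels on each site are assumed.
sites = {
    "azure-vnet-wus2": {"ves.io/siteType": "ves-io-ce"},
    "aws-tgw-site": {"ves.io/siteType": "ves-io-ce"},
    "example-re": {"ves.io/siteType": "ves-io-re"},
}

selected = sorted(
    name for name, labels in sites.items()
    if matches(labels, "ves.io/siteType = ves-io-ce")
)
print(selected)
```

Every site picked up by the selector becomes a member of the virtual site, which is why Offline Survivability needs to be enabled on each of them.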
Now, enable CE Survivability on each of your deployed CE sites. Navigate to Cloud & Edge Sites > Site Management > [your site type] > [Your Site] > Manage Configuration. Scroll to Advanced Configuration, click Show Advanced Fields, then Enable Offline Survivability Mode.
Repeat this action for each CE identified by the virtual site Selector Expression entered above.
Now, navigate to Cloud & Edge Sites > Networking > Add Site Mesh Group, and enter the following:
Name: full-mesh-ce-group
Virtual Site: shared/full-mesh-ce-sites
Mesh Choice: Full Mesh
Full Mesh Choice: Control and Data Plane Mesh
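In a full mesh, every site in the group is paired with every other site, so the number of site-to-site links grows as n(n-1)/2. A small sketch of that arithmetic follows. The on-prem-ce name is hypothetical and is added only to make a three-site example alongside the two sites used later in this article.

```python
from itertools import combinations

# azure-vnet-wus2 and aws-tgw-site appear later in the article;
# on-prem-ce is a hypothetical name added for illustration.
sites = ["on-prem-ce", "azure-vnet-wus2", "aws-tgw-site"]

# Every unordered pair of sites is one link in a full mesh.
links = list(combinations(sites, 2))

for a, b in links:
    print(f"{a} <-> {b}")

# n sites produce n * (n - 1) / 2 links.
assert len(links) == len(sites) * (len(sites) - 1) // 2
```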
To confirm that CE Survivability is enabled, navigate to each CE Site's Dashboard via the Site Mesh Group. Go to Cloud & Edge Sites > Site Connectivity > Site Mesh Group, select the group that was just created, then click on each site in the group. Confirm that "Offline Survivability" is "Enabled" on the Detail panel, and that the "Local Control Plane Status" is enabled under the Health panel. An example of a site mesh group with this configured is shown as follows.
Example: Bookinfo distributed app deployment
Bookinfo is a modern K8s distributed app maintained by Istio, and its deployment model demonstrates the power of Distributed Cloud’s Multi-Cloud Networking CE Survivability. For more information about Bookinfo, including how to deploy it, go to https://istio.io/latest/docs/examples/bookinfo/. For this exercise, each of the CEs in the site mesh group must have the Offline Survivability status Enabled. The status changes from Configured to Enabled once the CE restarts after the setting is applied. To confirm the current status, go to the CE Site’s Dashboard and, within the “Software Version” frame, check Offline Survivability and its associated status.
To confirm that the tunnel between CEs will L3-route traffic with the Site Mesh Group Full Mesh configuration, open the Site Dashboard, navigate to the Tools section, select Show routes, and use Virtual Network Type VIRTUAL_NETWORK_SITE_LOCAL_INSIDE. The route to the Origin server on the AWS TGW Site has a route entry on the Azure VNET Site, meaning that if L3 routing is required directly from Azure, traffic can transit directly between the two CE sites, bypassing the Global Network. This feature bolsters CE site connectivity, letting the site make L3 routing and L7 HTTP load balancing decisions locally should it lose access to the Global Network and the Global Controller while still retaining some connectivity to the CEs in the Site Mesh Group.
In the event of connectivity loss, the CE transitions to offline mode. To demonstrate this easily with HTTP load balancing, note that an origin pool configured with priority 0 is used only when no other origin pool is available. This makes it possible to use the local origin pool only when absolutely needed. Additionally, any SSL-backed services whose certs can no longer be validated against the Global Controller's root CA due to loss of connectivity automatically have their certs re-issued from the Site’s own internal CA. This preserves the SSL chain of authority while the site is unable to reach the Global Controller, and potentially also the Internet, so encrypted services that would otherwise fail without access to the root CA can keep running.
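The priority rule described above can be sketched as follows. This is an illustrative model, not the CE's actual load balancing code: the Pool type, the healthy flag, and the pick_pool helper are hypothetical, and the ordering among non-zero priorities (lower number preferred) is an assumption. The pool names match the configuration sample later in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    priority: int   # 0 = last resort, per the article
    healthy: bool   # hypothetical health flag for illustration


def pick_pool(pools: list[Pool]) -> Optional[Pool]:
    """Pick the preferred healthy pool; priority 0 is used only as a fallback."""
    healthy = [p for p in pools if p.healthy]
    preferred = [p for p in healthy if p.priority != 0]
    if preferred:
        # Assumption: among non-zero priorities, the lower number wins.
        return min(preferred, key=lambda p: p.priority)
    fallback = [p for p in healthy if p.priority == 0]
    return fallback[0] if fallback else None


aws = Pool("bookinfo-ratings-aws", priority=1, healthy=True)
local = Pool("bookinfo-ratings-local", priority=0, healthy=True)

online = pick_pool([aws, local])

# Simulate the site losing its path to the AWS origin pool.
aws.healthy = False
offline = pick_pool([aws, local])

print(online.name, "->", offline.name)
```

While both pools are reachable, the priority 1 pool in AWS is chosen; once it becomes unreachable, the priority 0 local pool takes over.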
When a CE Site is online, this distributed app example is configured for users to connect to the Product Info (details) page locally through an on-prem point-of-access controller. The details service then connects to the reviews service to show product feedback, and the reviews service in turn connects to the ratings service to pull in the more volatile 1–5-star ratings.
In the following Distributed Cloud HTTP Load Balancer configuration sample, each CE site is configured with a Static IP-based Origin Pool. Under normal circumstances, the Global Controller processes HTTP requests to the ratings service received by the On-Prem/Azure CE Site, and it reverse proxies the requests to the destination service running in AWS. However, with CE Survivability, when the HTTP LB service goes offline and is severed from the Global Controller and/or the Global Network, the CE configures itself to make load balancing decisions locally. Requests are then sent to the locally cached ratings DB until the site is back online, so users continue to have complete access to the app, albeit with potentially delayed or stale ratings.
The following YAML can be copied into the Distributed Cloud Console, in either the Load Balancers or Distributed Apps service, at Manage > Load Balancers > HTTP Load Balancers > Add HTTP Load Balancer. In the configuration section, select JSON view, then change to the YAML config format. The following HTTP LB config works with the Bookinfo ratings service deployed separately to multiple sites. (NOTE: the origin pools must already be configured before this step.)
metadata:
  name: booking-ratings-multisite
  namespace: default
  labels:
    ves.io/app_type: bookinfo
  annotations: {}
  disable: false
spec:
  domains:
    - ratings.cluster.local
    - ratings
  http:
    dns_volterra_managed: false
    port: 9080
  advertise_custom:
    advertise_where:
      - site:
          network: SITE_NETWORK_INSIDE_AND_OUTSIDE
          site:
            tenant: default
            namespace: system
            name: azure-vnet-wus2
            kind: site
        use_default_port: {}
  default_route_pools:
    - pool:
        tenant: default
        namespace: default
        name: bookinfo-ratings-aws
        kind: origin_pool
      weight: 1
      priority: 1
      endpoint_subsets: {}
    - pool:
        tenant: default
        namespace: default
        name: bookinfo-ratings-local
        kind: origin_pool
      weight: 1
      priority: 0
      endpoint_subsets: {}
  routes: []
While the On-Prem CE site is online, access to the ratings service can be observed as follows:
country: PRIVATE
kubernetes: {}
app_type: bookinfo
timeseries_enabled: false
browser_type: Firefox
device_type: Mac
req_id: 03145f22-ddc6-413a-b6e3-a511e1840a94
path: /
hostname: master-0
original_authority: ratings:9080
rtt_upstream_seconds: "0.015000"
src_instance: UNKNOWN
req_headers: "null"
tenant: tme-lab-works-oeaclgke
longitude: PRIVATE
app: obelix
rtt_downstream_seconds: "0.000000"
policy_hits:
  policy_hits:
    - result: allow
      policy_set: ves-io-active-service-policies-network-security-dpotter
      malicious_user_mitigate_action: MUM_NONE
      policy_namespace: shared
      policy_rule: allow
      policy: ves-io-allow-all
      rate_limiter_action: none
method: GET
as_number: "0"
rsp_body: UNKNOWN
time_to_last_downstream_tx_byte: 0.019969293
dst_instance: STATIC
vh_type: HTTP-LOAD-BALANCER
x_forwarded_for: 172.17.0.5
duration_with_no_data_tx_delay: "0.015985"
rsp_size: "164"
api_endpoint: UNKNOWN
authority: ratings:9080
domain: ratings
region: PRIVATE
time_to_first_downstream_tx_byte: 0.019928893
has_sec_event: true
rsp_code_class: 2xx
rsp_code_details: via_upstream
time_to_last_upstream_rx_byte: 0.019945393
dst: S:10.0.163.106
scheme: http
city: PRIVATE
dst_site: aws-tgw-site
latitude: PRIVATE
messageid: dea91c9a-beed-4561-67af-ab4112426b1f
tls_version: VERSION_UNSPECIFIED
connection_state: CLOSED
dst_ip: NOT-APPLICABLE
network: PRIVATE
src_site: azure-vnet-wus2
terminated_time: 2023-01-25T06:45:04.842657797Z
as_org: PRIVATE
duration_with_data_tx_delay: "0.016025"
src_ip: 172.17.0.5
connected_time: 2023-01-25T06:45:04.822615905Z
stream: svcfw
tls_cipher_suite: VERSION_UNSPECIFIED/TLS_NULL_WITH_NULL_NULL
original_path: /ratings/0
req_size: "391"
user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/109.0
severity: info
cluster_name: azure-vnet-wus2-default
tls_fingerprint: UNKNOWN
src: N:site-local-inside
time_to_first_upstream_rx_byte: 0.019900193
rsp_code: "200"
time_to_first_upstream_tx_byte: 0.003944379
src_port: "42106"
site: azure-vnet-wus2
"@timestamp": 2023-01-25T06:45:05.134969Z
req_body: UNKNOWN
sample_rate: 0
time_to_last_upstream_tx_byte: 0.003949779
dst_port: "80"
namespace: default
req_path: /ratings/0
time: 2023-01-25T06:45:05.134Z
asn: PRIVATE
user: IP-172.17.0.5
vh_name: ves-io-http-loadbalancer-booking-ratings-multisite
node_id: envoy_0
proxy_type: http
total_duration_seconds: 0.02
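A few fields in this log entry are enough to confirm the path the request took: src_site is azure-vnet-wus2, dst_site is aws-tgw-site, and the gap between connected_time and terminated_time should agree with total_duration_seconds. The sketch below checks that arithmetic. The parse_ns helper is hypothetical; it exists because Python's datetime keeps only microseconds, while these timestamps carry nanoseconds.

```python
from datetime import datetime, timezone


def parse_ns(timestamp: str) -> int:
    """Convert 'YYYY-MM-DDTHH:MM:SS.fffffffffZ' to nanoseconds since the epoch."""
    whole, _, fraction = timestamp.rstrip("Z").partition(".")
    seconds = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    # Pad or trim the fractional part to exactly nine digits.
    nanos = int(fraction.ljust(9, "0")[:9]) if fraction else 0
    return int(seconds.timestamp()) * 1_000_000_000 + nanos


# Values copied from the log entry above.
connected = parse_ns("2023-01-25T06:45:04.822615905Z")
terminated = parse_ns("2023-01-25T06:45:04.842657797Z")

duration_s = (terminated - connected) / 1_000_000_000
print(f"{duration_s:.3f} s")  # prints "0.020 s", matching total_duration_seconds
```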
While the CE is offline and isolated from the Global Network, the Global Controller, and the Internet, connection requests aren’t visible in the Distributed Cloud Console. Using a local client, however, we can see that requests continue to be delivered by the local ratings service. Below, we can infer that the local CE has assumed the primary role of making load balancing decisions. Depending on the severity of the outage, the CE can reach the priority origin pool over the Internet or over the direct CE-to-CE site mesh group tunnel, or it can handle the request entirely locally on site.
Note: To artificially isolate the On-Prem CE in this exercise, security policies can be configured to 1) block connections inbound from the remote CE site(s), and 2) block connections outbound to the F5 Global Network RE Sites and the Global Controller. Because the IPsec and SSL tunnel connections can be long-lived, you may need to SSH into the CE site’s CLI and soft-restart the ver, vpm, ike, and openvpn services running on the CE Site(s).
When the CE can no longer reach the Global Controller through the REs or the Site Mesh Group CEs, the CE begins making decisions, including load balancing, from its own local control plane. You can see this happening in the following example, in which a shell is opened on a container running on a K8s cluster local to the site. Curl requests to the ratings app continue to resolve via service discovery to the locally advertised VIP from Distributed Cloud. Because the Origin Pool in AWS is no longer reachable, traffic is sent instead to the backup local ratings service.
% k exec -it jump -- /bin/sh
/ # curl -v ratings:9080/health
* Trying 10.40.1.5:9080...
* Connected to ratings (10.40.1.5) port 9080 (#0)
> GET /health HTTP/1.1
> Host: ratings:9080
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/json
< date: Wed, 25 Jan 2023 07:11:36 GMT
< x-envoy-upstream-service-time: 7
< server: volt-adc
< transfer-encoding: chunked
<
* Connection #0 to host ratings left intact
{"status":"Ratings is healthy"}/ #
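The same check can be scripted so it can be repeated while the site is isolated. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the URL assumes the ratings hostname and port 9080 resolve from wherever the script runs, as they do in the shell session above.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True when the JSON body carries the healthy status shown above."""
    try:
        status = json.loads(body).get("status", "")
    except (json.JSONDecodeError, AttributeError):
        return False
    return status == "Ratings is healthy"


def check(url: str = "http://ratings:9080/health", timeout: float = 3.0) -> bool:
    """Fetch the health endpoint and evaluate the response body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and is_healthy(response.read().decode())
    except OSError:
        # Covers connection failures and HTTP errors (URLError is an OSError).
        return False


# Body copied from the curl output above; no network call is made here.
print(is_healthy('{"status":"Ratings is healthy"}'))
```

Calling check() from a pod on the local cluster before and after isolating the site gives a quick way to confirm that the VIP keeps answering from the local origin pool.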
The following video covers each of the settings in this article, including how to test offline survivability when disconnecting the outside network at the edge site isn't desirable or possible.
Conclusion
Outages can be unpredictable. CE Survivability enables a “best case scenario” suite of services, keeping your business apps available and your users productive until connectivity is restored. With Distributed Cloud AppStack and CE Site Survivability enabled at both the data plane and the control plane, services that are distinguished when online yet still usable while offline are the new level of service your users can expect. CE Survivability on Distributed Cloud delivers this and more.
For more information, please visit the following resources:
https://www.f5.com/cloud/products/multi-cloud-transit
https://docs.cloud.f5.com/docs/how-to/site-management/manage-site-offline-survivability
Video demo: https://youtu.be/InyJKwksbos