SLA Uptime, Resiliency, and Why 100% Uptime Can Be Misleading

Summary

SaaS providers often promise uptime of 99.99% or more for their services, and sometimes even 100%! But planning for resiliency in the cloud goes beyond trusting a Service Level Agreement (SLA) or the promise of credits in the event of an outage. As in many areas of life, some analysis, architecting for resiliency, and planning for the unexpected go further than a promise of high uptime. This article offers some technical advice for approaching the question of your SaaS provider's resiliency and your SLA.

SLAs and planning technical resiliency

An SLA can help define the uptime you should expect from your SaaS service, but of course that's just an expectation. There are usually details including metrics to determine uptime, tiers of service credits if the promised uptime is not achieved, and exclusions to the SLA.

My goal with this article is to analyze what is most important when considering a SaaS vendor's resiliency and SLA, and to plan for outages despite a vendor's promise that an outage will not occur (or is extremely unlikely). I'll break this into 3 steps:

Analyzing uptime, and why 100% can be misleading
Architecting for resiliency and transparency as a SaaS provider
Reasonable considerations and questions to ask your SaaS provider

Analyzing uptime, and why 100% can be misleading

Let's take a simple example of a SaaS-based DNS provider. This example is generic and not based on any single provider, but let's briefly compare some public-facing SLAs from F5, AWS, and Cloudflare. When reading through these, you may notice:

Some promise 100% uptime! But, when I search "major dns provider outages" I can read about how outages have affected even providers with 100% promised uptime.
Some promise 99.99% uptime or more. Consider the chart below, along with the reality that unexpected outages do occur. Are "additional nines" more important than resilient architecture and planning?
In most cases, these SLAs are detailed and complex, and the metrics are determined by the provider.
Many have exclusions to the SLA metrics, like emergency maintenance or default of third parties.
Many will apply only to a given tier of service (e.g., business vs. individual plans).

Outage times assumed by SLA uptime percentage
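The figures behind a chart like this are simple arithmetic: the allowed downtime is the fraction of time not covered by the uptime percentage, multiplied by the length of the period. Here is a minimal sketch in Python; the function name and the 365-day and 30-day periods are my own assumptions for illustration, and an actual SLA may define its measurement period differently.

```python
# Allowed downtime implied by an uptime percentage (illustrative arithmetic only).
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime permitted over period_days."""
    return (1 - uptime_percent / 100) * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Assumed periods: a 365-day year and a 30-day month.
    for pct in (99.0, 99.9, 99.99, 99.999, 100.0):
        per_year = allowed_downtime_minutes(pct, 365)
        per_month = allowed_downtime_minutes(pct, 30)
        print(f"{pct:>7}% uptime: {per_year:9.2f} min/year, {per_month:8.2f} min/30 days")
```

For example, 99.99% uptime leaves a budget of about 52.6 minutes per year, while 99.9% leaves about 8.8 hours. Note that this is the downtime the SLA permits, not a prediction of how long a real outage will last.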

After trying myself, I think there is no realistic way to confidently compare vendors based on SLAs alone. For this reason, I think it's wise to ask ourselves the following questions when analyzing an SLA:

Which components of this service fall under this SLA? Do I rely on other components?
Do I believe the vendor has taken the technical steps to ensure they can meet their SLA?
What are the exclusions to the SLA?
If my provider's third parties fail them, and in turn my provider fails me, does my SLA protect me?
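The third-party question is worth a quick calculation. If a service only works when every dependency in a chain works, the combined availability is the product of the individual availabilities, which is always lower than the weakest link. The sketch below assumes failures are independent (often not true in real outages), and the percentages are hypothetical values chosen for illustration, not any vendor's published numbers.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain that works only if every link works.

    Assumes failures are independent, which real outages often are not.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical figures, chosen only for illustration.
    provider_own = 0.9999      # the provider's own platform
    upstream_cloud = 0.9995    # a third party the provider depends on
    upstream_transit = 0.999   # another third party in the chain
    combined = serial_availability([provider_own, upstream_cloud, upstream_transit])
    print(f"combined availability: {combined:.4%}")
```

With these example inputs the chain comes out at roughly 99.84%, well below the 99.99% of the provider's own platform. This is why exclusions for third-party failures deserve a careful read.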

I encourage you to read the fine print of your SLA, and then ask yourself: would this have covered me during the major internet outages of last year? If the financial implications of an outage - despite the uptime expected from an SLA - are significant to me, how do I plan for the unexpected?

Architecting for resiliency and transparency as a SaaS provider

This section of my article is intended to give readers an insight into some measures that a SaaS provider takes to ensure resiliency and transparency of their platform. It is by no means everything that a SaaS provider must do, but it should help provide some confidence when discussing resilient architectures. Here are some interesting things I learned about F5 Distributed Cloud (XC) and resilient architecture.

Let's start with an easy example. BGP peering relationships provide performance and resiliency for networks in the case of outages of other global networks. If you review the Hurricane Electric Internet Services BGP Peering Report, you'll see F5 is one of the most peered global networks in the world, with over 5000 IPv4 adjacencies and almost 4000 IPv6 adjacencies. This means our network is more easily reachable, with fewer hops, than almost any other network in the world.

F5 is one of the most peered global networks

Secondly, as a SaaS provider, a big differentiator for F5 XC is the ability to deliver SaaS services within customers' premises by deploying nodes, called Customer Edges (CEs), to create customer sites. These nodes connect to the two closest Regional Edges (REs) to pull down their configuration, receive updates, and connect to the global backbone. Connections to these two REs are made over two separate tunnelling protocols, TLS and IPSec, which means any CE running at your remote site has four redundant connections to the global backbone and control plane.

A simplified view of highly-meshed Regional Edges and redundancy for customer sites.
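To see why two Regional Edges with two tunnels each help, here is a rough probability sketch. It treats a site as connected if at least one of the four paths is up, where a path needs both its RE and its tunnel to be up. The availability values are hypothetical, failures are assumed to be independent, and the function names are my own; this is not a description of how F5 measures or guarantees connectivity.

```python
def any_of(availabilities):
    """Availability of a group that works if at least one independent member works."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


def site_connectivity(re_availability, tunnel_availability, res=2, tunnels_per_re=2):
    """Probability that a site reaches the backbone through at least one path.

    The tunnels to one RE share that RE's fate, so the RE is modeled once
    and the tunnels to it are modeled as redundant.
    """
    via_one_re = re_availability * any_of([tunnel_availability] * tunnels_per_re)
    return any_of([via_one_re] * res)


if __name__ == "__main__":
    # Hypothetical 99% availability for each RE and each tunnel.
    single_path = 0.99 * 0.99
    four_paths = site_connectivity(0.99, 0.99)
    print(f"single path: {single_path:.4%}")
    print(f"two REs x two tunnels: {four_paths:.4%}")
```

Even with modest per-component numbers, the redundant design is far more available than any single path. The independence assumption is the weak point of this kind of estimate, since correlated failures (a shared upstream, a bad release) are exactly what cause large outages.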

Thirdly, let's look at the services this platform provides (e.g., IP switching, firewalling, security services, running K8s workloads, etc.). F5 XC is natively application-aware. The platform runs on our highly-meshed F5 XC global PoPs, each of which we call a Regional Edge (RE), and is extended to customer sites as explained above.

This platform is made of many microservices to perform the various functions of the platform, and these microservices are managed at scale using K8s. So then, the F5 XC platform is a modern application in itself - it's fully distributed, updated frequently via automated releases, and broken up into many small parts, each of which can scale individually.

Also, consider the hosting of the management portal (also called F5 Distributed Cloud Console). This component runs in a combination of F5 XC's global infrastructure as well as public cloud providers like AWS, Azure, and GCP.

Finally, let's consider transparency. You want your SaaS provider to have a real-time status page to show outages. For F5 XC, that is here: https://www.f5cloudstatus.com/ . In addition, you want in-depth root cause analysis and details of past incidents that you can refer to in your own internal reporting. Here's an example from F5: https://www.f5cloudstatus.com/incidents/z5zjhczbch21 and I know from experience that the SRE team here strives to provide transparency and details for these situations.

This section has been somewhat high level, so ask your account team if you'd like more detail. The takeaway here is that an extremely modern network has been built with frequent, automated updates and fully distributed management in mind.

Reasonable considerations and questions to ask your SaaS provider

By "reasonable considerations", I mean that all of us must live within some constraints when it comes to balancing cost, performance, availability and operational support. But it's also reasonable to expect an architect has planned for resiliency if he or she is using a SaaS service. In the case of SaaS DNS, I think that a second DNS provider is a good idea for major organizations, but probably overkill for a small business. However, given that DNS outages have recently taken down a good portion of the Internet, any business can ask the following questions of a DNS SaaS provider, or any provider of SaaS services.

Which providers do you rely on yourself? This varies greatly by service, but look for answers that show that your provider has planned for their own provider's failure.
How do you ensure redundancy across your own providers? Again, look for signs of resilient architecture in their platform.
What were the root causes behind your past outages? Look for lessons learned here and actions taken to mitigate future occurrences.
What is your frequency and procedure for upgrades and maintenance? Look for answers that reduce the potential for human error, like the use of automation in regular maintenance. Frequent platform upgrades are likely better than occasional updates.
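On your own side, one way to act on the idea of a second provider is to try each provider in turn and fall back to a last known good answer if all of them fail. The sketch below is only an illustration of that pattern: the two lookup functions are hypothetical stand-ins rather than any vendor's API, and a production setup for DNS would more likely rely on delegating to name servers from both providers than on client code like this.

```python
from typing import Callable, Optional, Sequence


def resolve_with_fallback(
    name: str,
    providers: Sequence[Callable[[str], str]],
    last_known_good: Optional[str] = None,
) -> str:
    """Try each provider in order; fall back to a cached answer if all fail."""
    errors = []
    for lookup in providers:
        try:
            return lookup(name)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    if last_known_good is not None:
        return last_known_good
    raise RuntimeError(f"all {len(providers)} providers failed for {name}: {errors}")


# Hypothetical stand-ins for two independent providers.
def primary_lookup(name: str) -> str:
    raise TimeoutError("primary provider unavailable")


def secondary_lookup(name: str) -> str:
    return "192.0.2.10"  # documentation-range address (RFC 5737)


if __name__ == "__main__":
    print(resolve_with_fallback("app.example.com", [primary_lookup, secondary_lookup]))
```

The cached answer gives you a graceful degradation path when every provider is down, at the cost of possibly serving stale data.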

Conclusion

An SLA can be useful in setting expectations, but it does not remove the responsibility on architects to ensure that a service is available. SLAs can be complex, so read the details and plan for what to do in the event of an outage. Where possible, architect solutions that assume failure of the SaaS services you rely on, whether that means using dual providers, backing up configuration of a single provider, planning for a graceful pause of services when your providers are unavailable, or at the very least choosing a SaaS provider that has a well-documented and thorough approach to resilient architectures of their own. Thanks for reading!

Published Mar 31, 2023

Version 1.0