Design for resiliency and protect against cloud outages with F5 DNS and application monitoring

How to reduce DNS recovery time and know when a provider, region, or control plane is having a bad day.

Why DNS resiliency matters

Major outages happen more often than many architectures assume. The most painful part is frequently not the incident itself, but the operational loss of control that comes from tightly coupling critical functions (like DNS) to a single platform or provider. When that platform is impaired, workarounds become limited and recovery slows.

Design principle: fail safely and recover fast

A useful way to frame resiliency is failures will occur, so the architecture should prioritize rapid, low-risk recovery. That typically means eliminating single points of dependency, automating failover where practical, and ensuring you can change traffic direction even when one control plane is degraded.

The DNS failure mode: what breaks and how long it takes

When authoritative DNS is hosted with a single vendor, a DNS incident can translate into recovery times on the order of 30 minutes to 3 hours (depending on the failure domain, TTLs, and operational procedures). With an automated, multi-provider design, recovery can be reduced dramatically, down to ~60 seconds in some scenarios.

Solution overview

This article describes an end-to-end resiliency pattern that combines (1) multi-provider authoritative DNS, using F5 BIG-IP DNS (commonly deployed on-prem or in IaaS) with F5 Distributed Cloud DNS as an additional authoritative provider, and (2) application assurance via F5 Distributed Cloud Synthetic Monitoring. The DNS design helps keep applications reachable during cloud-service impairments or regional failures by enabling automated failover and preserving the ability to shift control when a dependency is degraded. Synthetic DNS/HTTP checks then continuously validate external reachability and performance, so you can detect issues early and triage faster when incidents occur.

What you get from multi-provider authoritative DNS

  • Higher availability: a second authoritative provider reduces the blast radius of a single-vendor outage.
  • Lower query latency: globally distributed anycast networks can shorten resolver-to-authoritative RTT for many users.
  • Built-in DDoS resistance: distributed networks can absorb and disperse volumetric attacks more effectively than a small on-prem footprint.
  • Elastic capacity: the service can scale during traffic spikes without pre-provisioning appliances for peak usage.
  • Better visibility: per-query metrics and synthetic checks help validate reachability from multiple regions.

Example: improving availability and latency for Acme Bank

Acme Bank, whose name has been changed for the purposes of this article, struggled with higher DNS latency and periodic downtime when their on-prem DNS appliances failed. They also had to plan for peak capacity in advance to handle traffic spikes, an approach that can be expensive and still leave gaps when demand exceeds forecasts.

By adding Distributed Cloud DNS as an additional authoritative DNS provider alongside BIG-IP DNS, Acme Bank extended DNS serving closer to end users on a globally distributed network. This improved DNS availability and reduced query latency, while providing a platform that can scale to meet demand.

Reference architecture (high-level)

At a minimum, you are operating two authoritative DNS providers for the same zone:

  1. Primary authoritative: BIG-IP DNS serving the zone (often integrated with existing on-prem or cloud-adjacent infrastructure).
  2. Secondary/additional authoritative: Distributed Cloud DNS hosting the same zone data (via zone transfer and/or secondary zone configuration).
  3. Delegation: Your registrar/parent zone publishes NS records so recursive resolvers can reach either provider.

Configuration walkthrough

Step 1: Enable zone transfers from BIG-IP DNS

Configure BIG-IP DNS to allow zone transfers (AXFR/IXFR) to the Distributed Cloud DNS name servers for the zones you want to protect. Validate transfers and ensure TSIG and IP-based allowlists (as applicable) are in place to prevent unauthorized replication.

Step 2: Add the zone as secondary in Distributed Cloud DNS

Add your domain as a secondary DNS zone in Distributed Cloud DNS and point it to BIG-IP DNS for transfers. Once the initial transfer completes, verify the zone is online and that records (including SOA/NS) match expectations. Use the console to inspect zone content and confirm refresh/retry timers align with your operational goals.

Step 3: Update delegation at the registrar (planned cutover)

Update the domain delegation at your DNS registrar/parent zone to publish the desired authoritative name servers (for example, shifting primary delegation from BIG-IP DNS to Distributed Cloud DNS, or publishing both sets depending on your strategy). Plan for propagation by lowering TTLs ahead of time when feasible, and document a rollback procedure (e.g., reverting NS to the previous set) before making changes.

Monitoring and app assurance with synthetic checks

Once secondary DNS is active, use DNS and HTTP synthetic monitoring from multiple geographies to validate end-to-end reachability. Track query success rate, response codes, and latency, and alert on anomalies that indicate partial outages (e.g., a single region failing, increased NXDOMAIN/SERVFAIL rates, or unexpected record changes).

Application assurance (synthetic monitoring)

Even with resilient DNS, application incidents still happen and the worst-case operational pattern is learning about them from users first. Synthetic monitoring helps you detect externally visible failures early (often before customer reports), so response starts with evidence rather than guesswork.

F5 Distributed Cloud Synthetic Monitoring continuously simulates DNS lookups and HTTP requests to validate the external health and performance of your applications. Over time, you can establish a baseline for availability and latency, i.e., “what normal looks like,” which makes deviations easier to detect and triage.

  • Global vantage points: run checks from multiple regions to avoid a single-location “false negative.”
  • Multiple providers: compare results across providers to separate internet-path issues from app/origin issues.
  • Actionable alerts: alert on latency spikes, elevated error rates (e.g., HTTP 5xx), and DNS resolution failures.
  • Fast drill-down: pivot from an alert to region-level breakdowns, timelines, and event tables to isolate where the failure is occurring.

Example triage workflow: an alert flags a critical payroll application. In the console, you can correlate a single-region degradation (for example, West US) with a sharp increase in HTTP latency and a burst of HTTP 500 responses. A regional timing breakdown can further indicate whether time is being spent in network connect, TLS negotiation, or server processing, helping you route the incident to the correct owning team (e.g., origin/app servers for that region) without hours of cross-team war-room triage.

The practical outcome is reduced mean time to detect (MTTD) and faster “mean time to innocence” by quickly narrowing down which component is failing and which team should engage.

Video Demonstration

The following video reviews each of the challenges described in this article and how F5 solves this by providing cloud resiliency with DNS services and app assurance with synthetic monitoring.

 

Conclusion

DNS is a critical dependency, and a common amplification point during outages, so a multi-provider authoritative DNS design (BIG-IP DNS plus Distributed Cloud DNS) helps preserve reachability and control when a vendor, region, or control plane is degraded. But resiliency is strongest when DNS failover is paired with application assurance: synthetic DNS/HTTP checks provide early, external detection and rapid triage signals that shorten both MTTD and time to mitigation. Together, DNS resiliency with app assurance form an end-to-end resiliency solution, keeping users routed to healthy endpoints while simultaneously proving what is (and isn’t) failing, so teams can respond faster with less guesswork. Next, validate your zone-transfer security model, define failover/runbook procedures, instrument synthetic checks and alert thresholds, and test delegation changes in a lower environment before production cutover.

Additional Resources

F5 DNS Products

Distributed Cloud Synthetic Monitoring

Related Technical Articles

Accelerate Your Initiatives: Secure & Scale Hybrid Cloud Apps on F5 BIG-IP & Distributed Cloud DNS

The Power of &: F5 Hybrid DNS solution

Use F5 Distributed Cloud to control Primary and Secondary DNS

Using F5 Distributed Cloud DNS Load Balancer health checks and DNS observability

Demo Guide: F5 Distributed Cloud DNS (SaaS Console)

Published Apr 16, 2026
Version 1.0
No CommentsBe the first to comment