Building for the Inevitable Next Cloud Outage – Part 1

June 28, 2022

The following is based on a talk by Section’s Pavel Nikolov at the KubeCon+CloudNativeCon Europe 2022 event. This first post will discuss the challenges in building for the next cloud outage. Part Two will demonstrate how to deploy a Kubernetes application across clusters in multiple clouds and regions with built-in failover to automatically adapt to cloud outages.

Every few months we read about the widespread impact of a major cloud outage. These events are unpredictable and inevitable, and, quite frankly, keep site reliability engineering (SRE) teams up at night. No matter your type of business, it is prohibitively expensive to deploy your applications everywhere around the world at the same time while still ensuring high availability.

Public cloud remains the most popular data center approach among the cloud native community, with multi-cloud growing in adoption. However, adopting a multi-cloud strategy isn’t as simple as hitting the “go” button. What’s more, despite best efforts at building out redundancy, the cloud providers cannot guarantee 100% uptime. As such, it’s not a question of if your servers or services will go down but rather when. And it will probably happen either when you are not prepared or when you least expect it (hello, middle-of-the-night support calls).

This is true for a number of reasons. For one, there are external factors, such as your Domain Name System (DNS) going down or upstream internet provider connectivity issues, that are outside the control of the public clouds. Then, too, there are the human factors involved, like when we make mistakes in code deployment that can be difficult to roll back. Of course, there are also natural disasters that can take down entire regions or cause significant headaches for services around the globe.

As a result, organizations spend a significant amount of time and money prepping disaster recovery plans while preparing for that next inevitable cloud outage.

Disaster Recovery to the Rescue (maybe)

The vast majority of organizations fall into one of four disaster recovery categories when it comes to responding to an outage:

  1. Active / active deployment strategy: If your primary server goes down, you flip the switch on your DNS and your request goes to a second active server. While this is the fastest and least-disruptive disaster recovery, you’re among the lucky few if your IT budget supports this option!

  2. Active / passive deployment strategy: This is very similar to active / active but it’s cheaper because you’re not paying for the hosting of the passive instance or cluster when you’re not using it. However, you have to spin up the passive instance and flip the switch on your DNS before service is restored, delaying the return to service.

  3. Periodic backup of your databases: In this instance, when your service goes down you must first spin up your code, restore the backups, and then continue serving as normal. While viable, this should not be considered a rapid response and can potentially extend service outages over more than 24 hours. The only thing worse is…

  4. No disaster recovery strategy: Truth be told, far too many organizations fall into this category. It’s understandable; you’re busy building features and don’t have time to think about disaster recovery. When something happens, you’ll figure it out!

The challenge with any of these disaster recovery strategies (except for the fourth one, of course) is that they require a high level of discipline. Your entire team needs to understand what will happen and know what they must do when an outage occurs, and even the best-laid plans will likely require some level of human intervention to restore service. In addition, as you add new features or components to your system, you’ll need to retest your disaster recovery plan to account for the changes. Ideally, this should happen at least every quarter (preferably every month), but it’s easy to get caught up in day-to-day delivery deadlines and put off reviewing the disaster recovery plan until it’s too late.
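To make the “flip the switch” step concrete, here is a minimal Python sketch of the decision logic behind an active / passive failover. It is an illustration under stated assumptions, not a description of any particular product: the FailoverMonitor class, the probe callable, and the threshold of three consecutive failed checks are all hypothetical names and values chosen for this example. In a real setup, changing the active endpoint would correspond to updating a DNS record or a load balancer target.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Tracks health-check results and decides when to fail over (hypothetical helper)."""

    probe: Callable[[str], bool]  # assumed: returns True when the endpoint is healthy
    primary: str
    standby: str
    threshold: int = 3            # assumed: consecutive failures tolerated before failover
    failures: int = 0
    active: str = ""

    def __post_init__(self) -> None:
        # Traffic starts on the primary endpoint.
        self.active = self.primary

    def tick(self) -> str:
        """Run one health check and return the endpoint that should receive traffic."""
        if self.active != self.primary:
            # Already failed over; failing back is deliberately left as a manual step.
            return self.active
        if self.probe(self.primary):
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = self.standby
        return self.active
```

Calling tick() on a schedule keeps traffic on the primary until the probe has failed three times in a row, which avoids failing over on a single dropped health check. Even in this small sketch, someone has to decide when it is safe to fail back, which is the kind of human intervention a recovery plan needs to document and rehearse.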

Multi-Cluster Disaster Recovery

Since you’re reading this blog, let’s assume you’re running a modern Kubernetes containerized application. Let’s further assume that your application is running on multiple distributed clusters to maximize availability and performance. How does that impact disaster recovery?

Having multiple clusters does not, by itself, give you automatic failover during an outage. The culprit is often DNS. First off, DNS servers can (and often do) become unavailable. But even if the servers themselves don’t go down, DNS caching can cause problems during outages. Every DNS record carries a TTL (time to live) that tells resolvers how long they may cache the answer, and the problem is that there is no guarantee that, worldwide, all providers will honor your TTL. A resolver that keeps serving a stale record continues to send users to the failed cluster, which means your healthy clusters can be available but effectively invisible during an outage.
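The effect of a resolver that does not honor a record’s TTL can be shown with a toy model. The Python sketch below is an assumption-laden illustration, not how any specific DNS provider behaves: CachingResolver, its lookup callable, and the min_ttl setting are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy resolver that caches answers and may enforce its own minimum TTL."""

    def __init__(self, lookup: Callable[[str], Tuple[str, int]], min_ttl: int = 0) -> None:
        self.lookup = lookup    # assumed authoritative lookup: name -> (address, ttl_seconds)
        self.min_ttl = min_ttl  # a resolver that overrides short TTLs sets this above 0
        self.cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str, now: float) -> str:
        """Return the cached address if it has not expired, else query the authority."""
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        address, ttl = self.lookup(name)
        # The cache lifetime is the larger of the record's TTL and the resolver's minimum.
        self.cache[name] = (address, now + max(ttl, self.min_ttl))
        return address
```

With a record published at a 30-second TTL, a resolver created with min_ttl=0 picks up a changed address on its first query after 30 seconds, while one created with min_ttl=300 keeps returning the old address for five minutes. Clients behind the second resolver keep reaching the failed cluster long after the record has been updated.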

But what if there were another approach to disaster recovery? In our next post we’ll discuss a strategy using BGP + Anycast to significantly improve availability and recovery. If you’re eager to jump ahead, feel free to watch Pavel’s KubeCon talk.

Section’s Cloud-Native Hosting Solution Addresses Reliability (and much more)

On the other hand, if you need a solution today, why not turn to Section? As we know all too well, outages will happen eventually. It can be prohibitively expensive and labor-intensive to maintain disaster recovery strategies for your organization. Fortunately, Section offers a wide range of Cloud-Native Hosting solutions that address the complexity of building and operating distributed networks. Routing across multi-layer edge-cloud topologies is perhaps the most daunting part of building distributed systems, which is why organizations are increasingly turning to solutions like Section that take care of it for you.

In particular, Section’s Kubernetes Edge Interface (KEI), Adaptive Edge Engine (AEE) and Composable Edge Cloud work together to improve application availability. With KEI you can set policy-based controls using simple commands in tools like kubectl that control, among other things, cluster reliability and availability. AEE uses advanced artificial intelligence to interpret those commands and automatically handle configuration and routing in the background. Finally, Section’s Composable Edge Cloud features a heterogeneous mix of different cloud providers worldwide, ensuring application availability even when a provider network goes down.

To learn more, get in touch and we’ll show you how the Section platform can help you achieve the reliability, scalability, speed, security or other custom edge compute functionality that your applications demand.
