What We Can Learn from 2019 Cloud Outages to Do Better in 2020

January 27, 2020

Last year, the Internet experienced a run of major cloud outages, which had significant ripple effects on end users around the world. Most of these outages took place across the summer, disrupting many top tech companies, including China Telecom, Verizon, Cloudflare, AWS, Google Cloud, WhatsApp and Facebook.

2019 cloud outages

The Summer of Outages

9-hour outage demonstrates China Telecom’s global reach

The summer of outages started in May, when China Telecom experienced substantial packet loss across its backbone over a nine-hour period. The incident mainly took down network infrastructure in mainland China, but it also affected the company’s network in Singapore and in multiple parts of the US, including Los Angeles. More than one hundred services were affected, and many big Western sites, such as Apple, Slack, Amazon and Microsoft, reported disruption during that time. More broadly, the incident revealed that China Telecom’s reach extends well beyond the geographic limits of mainland China.

BGP route leak takes down Cloudflare and major US sites

In June, a tiny Internet service provider (ISP) in Pennsylvania yielded another lesson, again showing how fragile the modern Internet can be. DQE Communications, a small commercial ISP that services around 2,000 buildings in Pittsburgh, Pennsylvania, sent out mistaken routing announcements, which led to a BGP route leak. According to Cloudflare CTO John Graham-Cumming, “This little company said, ‘These 2,400 networks, including some bits of Cloudflare, some bits of Amazon, some bits of Google and Facebook, whole swathes of the Internet,’ they said those networks are ours, you can send us their traffic.”

The misconfiguration was likely an error due to automatic route optimization software rather than an intentional act; nonetheless, the effect was the same.

As soon as the new routes were announced, the leak spread all the way up to Verizon, which accepted the faulty routes and passed them on. Because Verizon is one of the world’s largest transit providers, the problem grew from there. Internet traffic originally destined for Cloudflare, AWS and Google was instead directed through DQE, its customer Allegheny Technologies and Verizon. Consequently, a huge swath of the Internet’s traffic (including traffic to major destinations such as Google, Facebook, Reddit and AWS) suddenly went nowhere. It was as if Google Maps had sent large numbers of drivers down an unmarked cliff.

2019 Cloudflare outage

Image source: https://www.theregister.co.uk/2019/06/24/verizon_bgp_misconfiguration_cloudflare/
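One reason a leak like the one described above can redirect so much traffic is that routers forward each packet along the most specific prefix that matches its destination. The following is a minimal Python sketch of that longest-prefix-match rule, assuming for illustration that the leaked route is more specific than the legitimate one. The prefixes come from the reserved documentation range, and the function name and next-hop labels are hypothetical.

```python
from ipaddress import ip_address, ip_network
from typing import List, Optional, Tuple

# Each route is a (prefix, next-hop label) pair. Labels are hypothetical.
Route = Tuple[object, str]


def best_route(destination: str, routes: List[Route]) -> Optional[str]:
    """Return the next hop of the most specific route covering destination."""
    addr = ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the route with the largest prefix length wins.
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


# Before the leak, only the legitimate /24 is known.
routes = [(ip_network("203.0.113.0/24"), "legitimate origin")]
before = best_route("203.0.113.10", routes)

# Assumption: the leak introduces a more specific /25 via a different path.
routes.append((ip_network("203.0.113.0/25"), "leaked path"))
after = best_route("203.0.113.10", routes)

print(before)  # legitimate origin
print(after)   # leaked path
```

In this sketch the legitimate route is never withdrawn. The leaked route simply wins for every address it covers, which is why a single small network's announcement can pull traffic away from much larger ones.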

The Increasing Complexity of the Internet

Both of these examples show how interconnected the modern Internet has become, with a large and complex ecosystem of Internet-facing services, from DNS to APIs to public cloud services. These services need to work in tandem to provide consistent performance for end users.

They also reveal how much we still rely on a relatively naïve, trust-based system devised when the Internet first came into being. In the case of the Cloudflare and AWS outages, the real problem, as Slate pointed out, was the Border Gateway Protocol (BGP), the Internet’s routing system, which essentially relies on trust to function correctly. BGP has worked incredibly well for more than 25 years, but increasingly, problems arise when even a small element goes wrong.

With IP traffic expected to reach 4.8 ZB per year by 2022, representing a threefold increase in global IP traffic over five years, more outages and slowdowns are to be expected.

IP traffic forecast

What Can We Do Better in 2020?

According to Mehdi Daoudi, CEO of digital experience monitoring firm Catchpoint, “We may be at a tipping point where the complexity and interdependence of our systems have grown to where the risks to business are greater.” Writing in Forbes last year, Daoudi advised that “a new type of vigilance” was necessary.

Increase Use of Grassroots Protocols

One cause for hope for the rest of 2020 is the increasing use of grassroots protocols to help solve some of these issues. QUIC, for instance, was upgraded last year with the goals of improved privacy, better online experiences and stronger security. The latest version of TLS has also been deployed more widely, offering improved performance and upgraded security between users and websites. Following the Cloudflare-Verizon outage, Graham-Cumming recommended the use of RPKI (Resource Public Key Infrastructure) to let networks better filter faulty BGP routes and avoid similar problems in the future. As Cloudflare said at the time, if Verizon had used RPKI, the invalid routes from DQE would have been dropped automatically rather than allowed through. Various providers, including AT&T, have started to embrace RPKI frameworks.
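To make the RPKI filtering idea concrete, here is a minimal Python sketch of route origin validation in the spirit of RFC 6811. It assumes a simplified, hard-coded list of Route Origin Authorizations (ROAs); a real validator would fetch and cryptographically verify them from the RPKI repositories. The class and function names, the documentation prefixes and the AS numbers are all illustrative assumptions.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, IPv6Network, ip_network
from typing import Iterable, Union

Network = Union[IPv4Network, IPv6Network]


@dataclass(frozen=True)
class Roa:
    """Simplified ROA: who may originate a prefix, and how specific it may be."""
    prefix: Network
    max_length: int
    origin_asn: int


def validate_origin(prefix: Network, origin_asn: int, roas: Iterable[Roa]) -> str:
    """Classify a BGP announcement as 'valid', 'invalid' or 'not-found'."""
    covered = False
    for roa in roas:
        # subnet_of() raises TypeError across IP versions, so check first.
        if roa.prefix.version != prefix.version:
            continue
        if not prefix.subnet_of(roa.prefix):
            continue
        covered = True  # at least one ROA covers this prefix
        if roa.origin_asn == origin_asn and prefix.prefixlen <= roa.max_length:
            return "valid"
    # Covered by a ROA but matching none of them means the route is invalid.
    return "invalid" if covered else "not-found"


# Hypothetical ROA: AS64500 may originate 192.0.2.0/24, and nothing more specific.
roas = [Roa(ip_network("192.0.2.0/24"), 24, 64500)]

print(validate_origin(ip_network("192.0.2.0/24"), 64500, roas))     # valid
print(validate_origin(ip_network("192.0.2.0/24"), 64501, roas))     # invalid
print(validate_origin(ip_network("192.0.2.0/25"), 64500, roas))     # invalid
print(validate_origin(ip_network("198.51.100.0/24"), 64500, roas))  # not-found
```

A network that drops every announcement classified as invalid would filter out routes with the wrong origin as well as routes that are more specific than the prefix owner authorized. Routes with no covering ROA are typically still accepted, which is why the benefit grows as more prefix owners publish ROAs.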

Build in Redundancy

As the Internet has grown more complex, enterprises have also been changing how they work with CDNs. One of Mehdi Daoudi’s recommendations is to “build in redundancies or have backups at the ready”. A growing number of companies are pursuing a dual or multi-CDN distribution strategy so that content delivery does not depend on a single point of failure. The goal is to maintain the highest-quality delivery while also building in redundancy and resiliency. Cost is also typically a factor for companies opting for a multi-CDN strategy.
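As a rough illustration of the redundancy idea, the following Python sketch picks the most preferred CDN that currently passes a health check and falls back to the next one otherwise. It is a simplified sketch under stated assumptions rather than a description of any particular vendor's setup: the CDN names, URLs and priorities are hypothetical, and the health check is passed in as a function so that the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Cdn:
    name: str
    base_url: str
    priority: int  # lower number means more preferred


def pick_cdn(cdns: Sequence[Cdn], is_healthy: Callable[[Cdn], bool]) -> Optional[Cdn]:
    """Return the most preferred healthy CDN, or None if all are down."""
    for cdn in sorted(cdns, key=lambda c: c.priority):
        if is_healthy(cdn):
            return cdn
    return None


# Hypothetical providers; in practice is_healthy might probe a status URL
# or consult real-user monitoring data.
cdns = [
    Cdn("primary-cdn", "https://cdn-a.example.com", priority=1),
    Cdn("secondary-cdn", "https://cdn-b.example.com", priority=2),
]

# Simulate an outage at the primary provider.
down = {"primary-cdn"}
chosen = pick_cdn(cdns, lambda cdn: cdn.name not in down)
print(chosen.name if chosen else "no CDN available")  # secondary-cdn
```

Priority order is only one possible policy. The same selection function could rank providers by measured latency or by cost, which is where the cost factor mentioned above would come in.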

Embrace Changing CDN Structures

Many CDNs have also been restructuring the services they offer to meet the demands of a decentralized web, moving towards providing edge compute resources. In its 2020 report, State of the Edge calls edge compute “the third act of the Internet” as it “seeks to resolve the problems with our current infrastructure and the challenges that come with supporting the applications we desire”.

Section’s Edge-First Approach

Section was designed as an edge compute platform from the get-go. When Stewart McGrath and Daniel Bartholomew founded the company, they wanted to build a new type of platform that delivered not only speed, security and scalability, but also the flexibility for developers to build software at the edge. In the words of our CEO, Stewart McGrath, “Section is the only [edge platform] around that is hooked up to provide an immediately seamless integration with agile development workflows” … “giving developers and engineers all the right tools to manage [edge configurations] completely”.

Reduce Complexity and Improve Performance at the Edge

In the Forrester Analytics 2019 survey, 57% of mobility decision makers said they have edge computing on their roadmap for 2020, a significant statistic given edge computing’s relatively recent emergence in the tech space. As Forrester concluded, edge computing can be a major part of simplifying the challenges of the modern-day Internet since “computing at the edge avoids network latency and allows faster responses”. The emerging edge compute paradigm can also bring about improvements in scalability, flexibility and security.

While 2019 was seen as “the year of outages”, perhaps there’s still hope that 2020 will manage to buck the trend…