The Challenges of Distributed Databases at the Edge

December 7, 2020

Global Internet traffic in 2021 will be equivalent to 135x the volume of the entire global Internet in 2005, according to Cisco. Globally, Internet traffic will reach 30 gigabytes per capita in 2021, up from 10 gigabytes per capita in 2016. Drivers of this huge increase in data volume include networked smart devices, the rapid uptake of emerging technologies such as IoT, 5G and AI, and industrial IoT (IIoT) in manufacturing. Remote working has also contributed to the trend toward distributed data throughout 2020, and this looks set to continue through 2021. In parallel, end users have ever-growing expectations of reliable connectivity, superior performance and fast service.

Demand is growing for edge computing, which offers many advantages in meeting these needs. By bringing data processing and storage as close as possible to the end user, edge computing offers benefits in speed, reliability and scalability, not to mention efficiency savings. Edge computing is a fast-growing market, with Statista forecasting global revenue to reach $9 billion by 2024. Meanwhile, Gartner predicts that by 2025, three-quarters of enterprise-generated data will be created and processed at the edge (compared to just 10% in 2018).

Before edge computing can truly deliver on its promise, however, the challenge of distributed databases at the edge needs to be solved. To date, edge computing workloads have been mostly stateless, but changing edge workloads are driving the need for persistent data at the edge. Using cloud and on-premise databases is not the ideal solution. We need to figure out the most efficient way to process the tsunami of data at the edge.

Why Conventional Distributed Databases Don’t Work at the Edge

Conventional distributed databases depend on the centralized coordination of stateful data, scaling out within a centralized datacenter. They rely on a specific set of design assumptions, including:

  • A reliable data center local network (low latency, high availability, few network splits)
  • Precise timekeeping using physical clocks and network time protocol (NTP) to ensure coordination
  • Efficient consensus mechanisms, made practical by low latencies and high availability

An alternative approach with geo-distributed databases (a single database spread across two or more geographically distinct locations) has a very different set of design assumptions:

  • Unreliable wide area networks
  • Lossy timekeeping
  • Dynamic network patterns with sporadic partitions
  • Consensus that is expensive and slow with geo-distributed users at scale
  • Coordination mechanisms that restrict how many participants or actors can take part in a network of coordinating nodes

For stateful edge computing and geo-distributed databases to operate at scale and handle real-world workloads, edge locations need a coordination-free way of working together, one that allows edge devices to make progress independently when network partitions do occur. Distributed systems need to be designed to work over the Internet's unpredictable network landscape, use a form of timekeeping that isn't lossy, and, as stated, avoid dependence on centralized forms of consensus.
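One family of coordination-free techniques, not named in this article but widely used for exactly this purpose, is conflict-free replicated data types (CRDTs). A minimal sketch is a grow-only counter: each node increments only its own slot, and merging takes the per-node maximum, so replicas converge regardless of message order or temporary partitions.

```python
# Sketch of coordination-free state: a grow-only counter (G-Counter CRDT).
# Merge is commutative, associative and idempotent, so no coordinator is needed.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count contributed by that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Take the per-node maximum; safe to apply in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("edge-a"), GCounter("edge-b")
a.increment(3)   # node A works independently
b.increment(2)   # node B works independently, partitioned from A
a.merge(b)       # partition heals; replicas exchange state
b.merge(a)
print(a.value(), b.value())  # both converge to 5
```

The trade-off is that CRDTs only support operations whose effects can be merged this way; they buy availability during partitions at the cost of restricting the data model.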

What are the Challenges of Persistent Databases at the Edge?

1. The Distributed Nature of Edge Computing Systems

Edge computing systems are highly distributed by design. However, distributed systems require the designer to make decisions about sources of truth, synchronization, and replication. Made well, these decisions yield systems where users can access data and applications independently.

Each edge device needs to work on its own to perform its function; however, these devices also need to share and synchronize data with other edge devices and nodes. Coordinating many edge devices while simultaneously enabling them to work independently has proved a continuing challenge for designers of distributed systems.

2. Stateless Data at the Edge

Edge computing is straightforward when the data is stateless or when state is local (when a device maintains its own state or the state is trivially partitionable). What do we mean when we talk about [stateless data](https://www.bizety.com/2018/08/21/stateful-vs-stateless-architecture-overview/)? In stateless data there is no stored knowledge of, or reference to, past transactions; examples include HTTP, IP and DNS. Stateless transactions consist of a single request and a single response, and typically use CDN, web, or print servers to handle these short-lived requests. Until recently, most edge computing use cases have been stateless.

Why is Stateless Straightforward?

Stateless works well for web applications that present static media and query database tables to inform applications. Stateless, database-centric applications also work well for performing batch analytics on historical data.

In stateless application design, application services don’t have to remember what they’ve done in the past. The database records an application’s state, and any time the application needs to do something, it will ask the database for information. This works well for data that is stored in one location, such as centralized cloud data centers.
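As a sketch of this pattern (the store and handler here are hypothetical stand-ins, not a real API), a stateless service keeps nothing between requests and re-fetches whatever it needs from the database every time:

```python
# Minimal sketch of a stateless request handler: the service holds no memory
# between calls; all state lives in a (hypothetical) central store.

STORE = {"user:42": {"name": "Ada", "plan": "pro"}}  # stand-in for a remote database

def handle_request(key):
    # Every request re-fetches state; nothing is remembered from prior calls,
    # so any instance of the service can answer any request.
    record = STORE.get(key)
    return {"status": 200, "body": record} if record else {"status": 404, "body": None}

print(handle_request("user:42"))  # each call is independent and repeatable
```

This is why stateless services scale out so easily: any replica can serve any request, at the price of a database round trip per call.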

The Real World isn’t Stateless

In reality, the majority of applications are stateful; they depend on data from previous request/response transactions in order to inform subsequent requests. Consider, for example, a banking application that keeps a ledger of expenses and deposits. In order to maintain the current balance, it must draw on insights from previous transactions, or state.
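The banking example can be sketched in a few lines: the current balance is a fold over the transaction history, so it cannot be computed without drawing on prior state.

```python
# Sketch of the ledger example: the balance depends on every prior transaction.
from functools import reduce

transactions = [("deposit", 100.0), ("expense", 30.0), ("deposit", 50.0)]

def apply_txn(balance, txn):
    kind, amount = txn
    return balance + amount if kind == "deposit" else balance - amount

balance = reduce(apply_txn, transactions, 0.0)
print(balance)  # 120.0 — each new transaction builds on accumulated state
```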

In distributed computing architectures, each edge node has its own local context that should inform the data it generates. This context ideally needs to be maintained and made locally available to applications, so that the latency savings that edge computing promises can be delivered on. When application context is available at the edge, it makes it easier to identify relevant insights from the full dataset and discard only the unnecessary information.

Stateless Design Causes Latency

Another significant challenge for stateless data at the edge is the centralized coordination required, which cancels out gains made on latency. If a request needs to travel across the network to a centrally stored database, latency is added with each trip (sending a data packet over a network, compared with computing locally at the edge, involves a difference on the order of 10⁷ in magnitude). Network latency, even with advances like 5G, will always be subject to the speed of light, and thus constrained by what is physically possible.
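The scale of that gap can be illustrated with back-of-envelope arithmetic. The figures below are rough assumptions chosen for illustration (a handful of CPU cycles for a local operation, a long-haul round trip for the network), not measurements:

```python
# Back-of-envelope latency arithmetic (assumed figures, not measurements).
local_compute_s = 1e-8   # ~10 ns: a handful of CPU cycles for a local operation
round_trip_s = 0.1       # ~100 ms: a long-haul network round trip to a database

ratio = round_trip_s / local_compute_s
print(f"network round trip is roughly {ratio:.0e}x slower than local compute")
```

Under these assumptions the ratio is on the order of 10⁷, and no protocol improvement can remove the speed-of-light floor under the network term.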

Approaches to Building Stateless Edge Applications

Several approaches to building stateless edge applications have emerged, each with its own pros and cons. These include:

Data filtering at the edge combined with data analysis in the cloud

  • Pro: This reduces the amount of data transmitted over networks.
  • Con: Due to the speed of light, waiting for the cloud will always be slower than processing data at the edge.

The use of IoT gateways or edge data centers

  • Pro: This approach moves the database closer in proximity to edge deployments, reducing the distance that network packets have to travel to reach a database and collect the required information.
  • Cons:
    • Delays – There will inevitably be delays in the network, which will cause increases in latency by orders of magnitude. This kind of delay for latency-critical applications (autonomous vehicles, IIoT, AR/VR, gaming) is unacceptable.
    • Consistency – Because the data is spread over more locations, the likelihood of network partitions increases, and thus the likelihood of data inconsistencies being introduced increases.

3. Why We Need Stateful Data at the Edge

Increasing numbers of use cases for edge computing are demanding the processing of stateful data, which is more complex and challenging than stateless.

What is stateful data? Stateful data comes with information about the history of previous events and interactions with other devices, programs, and users. Stateful applications use the same servers every time they process a user request.
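The "same servers every time" property is often implemented with hash-based session affinity. A minimal sketch follows (the server names are hypothetical; production systems typically use consistent hashing so that adding or removing servers reshuffles as few users as possible):

```python
# Sketch of session affinity: route each user to a stable server by hashing
# the user id, so that server-side state for the user stays in one place.
import hashlib

SERVERS = ["edge-1", "edge-2", "edge-3"]  # hypothetical server pool

def route(user_id):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(route("user-42"))  # deterministic: the same user always lands on the same server
```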

“Without stateful data, the edge will be doomed to forever being nothing more than a place to execute stateless code that routes requests, redirects traffic or performs simple local calculations via serverless functions… these edge applications would be incapable of remembering anything of significance, forced, instead to constantly look up state somewhere else.” - Chetan Venkatesh and Durga Gokina, founders of Macrometa Corporation

Stateful Use Cases at the Edge

Stateful is useful for applications that require more context about users or end-user devices, to deliver more personalized experiences. These include:

  • Mobile apps
  • IoT devices
  • Multi-player gaming platforms
  • Virtual or augmented reality
  • IIoT and manufacturing systems, e.g. retail inventory management, logistics, and fleet tracking and management

Stateful, real-time edge computing will enable latency-critical applications by providing the means for processing and distributing streaming data across complex systems without compromising on speed.

The Challenges of Stateful Data and Edge Computing

There are challenges involved in performing stateful computing at the edge. Chief among them, for the use cases above, is the ability to sync stateful data with guaranteed consistency. This is essential, for instance, to avoid lag in real-time gaming or to prevent freezes in real-time streaming video calls. Without reliable consistency, different applications, devices and users will see different versions of the data, leading to unreliable applications, data corruption and data loss.

How do you manage and coordinate state across a range of edge locations or edge nodes and sync data with guaranteed consistency?
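One classic building block for answering this question, not specific to any product mentioned here, is the vector clock: it determines whether two updates to the same record are causally ordered or concurrent (and therefore conflicting and in need of resolution). A minimal sketch:

```python
# Sketch of vector-clock comparison: decide whether two updates are ordered
# or concurrent. Each clock maps node id -> number of updates seen from it.

def compare(vc_a, vc_b):
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"
    if b_le_a:
        return "b happened before a"
    return "concurrent"  # conflicting updates: must be merged or resolved

# Two edge nodes updated the same record during a partition:
print(compare({"edge-a": 2, "edge-b": 1}, {"edge-a": 1, "edge-b": 2}))  # concurrent
print(compare({"edge-a": 1}, {"edge-a": 2}))  # a happened before b
```

Detecting a conflict is only half the problem; a sync protocol still needs a merge or resolution rule for the concurrent case, which is exactly where coordination-free designs earn their keep.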

Do Edge-Native Databases Provide the Solution?

One solution being worked on is edge-native databases: geo-distributed, multi-master data platforms capable of supporting multiple edge locations without the need for coordination. While they don't require centralized forms of consensus, they can still guarantee consistency and arrive at a shared version of the truth in real time. These databases promise to overcome the data processing limitations experienced at the edge until now. An additional benefit is that they won't require developers to have specialist knowledge of how to design, architect or construct such databases.

Conclusion

According to Morgan Stanley, manufacturers since 2010 have “collected 2,000 petabytes of potentially valuable data, but discarded 99% of it.” The 1% that is processed is handled with conventional Big Data tools. The challenge is how to filter through the remaining 99% to extract value most efficiently.

When application context is available at the edge, it is far easier to identify relevant insights from the full dataset and discard only the unnecessary information. True transformation will be possible when algorithms can have access to massive datasets and innovations in ML can be applied to huge volumes of streaming data in real-time without a dependency on historical datasets. When this is solved and distributed architectures can handle persistent data at the edge, it promises a huge breakthrough for the applications of tomorrow and the edge use cases they will enable.