Should you use a virtual waiting room or invest in scaling your infrastructure?

April 17, 2020

Recently, Kmart experienced some Twitter backlash when they launched an experimental online queuing system to help manage traffic surges on their Australian websites. While the queuing system is intended to keep the site online during high traffic periods, customers are criticizing Kmart for their decision to use a patchwork solution rather than invest in scaling their underlying infrastructure.

[Image: Kmart online queue tweet]
So, this raises the question: are virtual waiting rooms a cheap way to avoid investing in properly scaling resource capacity?

If only it were that simple.

What are virtual waiting rooms (aka website queuing)?

Virtual waiting rooms are a way to limit the number of users on your website at one time.

They are used to protect your website during unexpected peak traffic events, so that a portion of users are still able to navigate, search, and transact with the website, rather than all users suffering delays and outages.

Virtual waiting rooms sit in front of your website and let users through only when the site is judged to have spare capacity. Everyone else is held in a queue until a slot frees up.
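To make the admission idea concrete, here is a minimal sketch in Python. It is an illustration only, not how Section's module or any particular product works: the WaitingRoom class, its method names, and the fixed capacity number are all hypothetical, and a real system would also need session expiry, persistence, and a way to estimate capacity.

```python
from collections import deque
from typing import Deque, List, Optional, Set


class WaitingRoom:
    """Toy admission controller (hypothetical): at most `capacity` active users."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.active: Set[str] = set()      # users currently on the site
        self.queue: Deque[str] = deque()   # users waiting, first in, first out

    def arrive(self, user_id: str) -> bool:
        """Return True if the user is let through now, False if they must wait."""
        if user_id in self.active:
            return True
        # Admit directly only if there is room and nobody is already waiting,
        # so new arrivals cannot jump ahead of the queue.
        if len(self.active) < self.capacity and not self.queue:
            self.active.add(user_id)
            return True
        if user_id not in self.queue:
            self.queue.append(user_id)
        return False

    def position(self, user_id: str) -> Optional[int]:
        """1-based place in line, or None if the user is not waiting."""
        try:
            return self.queue.index(user_id) + 1
        except ValueError:
            return None

    def leave(self, user_id: str) -> List[str]:
        """Free the user's slot and return the users admitted from the queue."""
        self.active.discard(user_id)
        admitted: List[str] = []
        while self.queue and len(self.active) < self.capacity:
            next_user = self.queue.popleft()
            self.active.add(next_user)
            admitted.append(next_user)
        return admitted
```

With a capacity of 2, the first two arrivals are admitted and the third is queued at position 1. When one of the active users leaves, the queued user is admitted in their place.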

There is an assortment of virtual waiting room or website queuing solutions around, some provided by third parties to add to your site via a simple integration, others custom designed by the site owners from scratch. If you’ve ever bought tickets to a popular concert or sporting event online, you’ve likely experienced such a system.

Section offers a Virtual Waiting Room module for customers to deploy alongside other modules in their edge stack – for example caching, web application firewall (WAF), bot management, A/B testing, custom containers, etc.

But why not just address the underlying concern and scale your infrastructure?

In the modern world of elastic cloud infrastructure, automated scaling, and granular billing, it is tempting to suggest that we shouldn’t need to restrict the number of concurrent users accessing a service anymore. We should be able to add all the hardware we need to service the demand for the duration of the increased traffic, and then deprovision it all afterward, only paying for what was needed while it was being used.

Scaling Out

Scaling out, or adding more servers, is what the cloud excels at, but many existing software architectures were not designed to scale this way. One example is the single-master model often found in popular database systems: no matter how many replicas are added, there is always one server that needs to coordinate the rest and will be the bottleneck.
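A rough back-of-the-envelope model shows why extra replicas stop helping. The sketch below assumes that every write goes to a single primary and that reads are spread evenly across replicas. The function name and all the numbers are made up for illustration, and the model ignores the cost of replicating writes to each replica.

```python
import math


def max_requests_per_second(
    primary_writes_per_s: float,
    replica_reads_per_s: float,
    replicas: int,
    write_fraction: float,
) -> float:
    """Estimate total throughput under a simplified single-primary model.

    Assumptions (illustrative only): all writes hit one primary, reads are
    balanced across `replicas`, and replication overhead is ignored.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    read_fraction = 1.0 - write_fraction
    # Total traffic the primary can carry before its write capacity is used up.
    write_limit = (
        primary_writes_per_s / write_fraction if write_fraction > 0 else math.inf
    )
    # Total traffic the replicas can carry before their read capacity is used up.
    read_limit = (
        replicas * replica_reads_per_s / read_fraction
        if read_fraction > 0
        else math.inf
    )
    return min(write_limit, read_limit)


if __name__ == "__main__":
    # Hypothetical numbers: 10% writes, 1,000 writes/s on the primary,
    # 5,000 reads/s per replica.
    for n in (1, 2, 10):
        print(n, round(max_requests_per_second(1000, 5000, n, 0.1)))
```

Under these made-up numbers, going from one replica to two raises the ceiling, but going from two to ten changes nothing: the primary's write capacity caps total traffic at roughly 10,000 requests per second.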

Scaling Up

Scaling up, or adding more CPU, RAM, disk, or network capacity to the same number of servers, is also an option in most clouds. It can be tricky, though: applying such changes can disrupt the availability of the service, you may already be using one of the largest instance types, or the larger types may not be available in the datacenter where you need them.

Hybrid Approach

One approach is to use a virtual waiting room to guide infrastructure scaling decisions. For many systems, building for very large scale is simply wasteful until there is evidence that such scale needs to be supported. Using a queue keeps the system online for most users instead of offline for all, and frees engineers to focus on implementing better scaling solutions instead of firefighting.

Planning for traffic surges

Recent circumstances have left many DevOps teams scrambling to support increased online traffic volumes, highlighting the need for ongoing preparation for these types of events. ITNews recently covered the pacing efforts of Coles Liquor, one of Australia’s largest liquor retailers, calling out years of background prep as the reason they’ve been able to handle the ‘digital stock-up surge’.

“We have been doing a lot of stress testing over the last couple of years since I joined Coles.” Juan De La Pava, martech and performance manager for Coles Liquor

Regular performance, load, and stress tests are an important part of the development lifecycle. It’s critical that teams continually reevaluate systems, resources, and processes to ensure there aren’t gaps or underinvestments in technology and/or people.
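As a starting point, a load test can be as simple as firing many concurrent calls at a handler and looking at the latency distribution. The sketch below uses only the Python standard library. The helper names are hypothetical, the handler is a stand-in for a real request to your site, and dedicated load testing tools will give far more realistic results.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def timed_call(handler: Callable[[], None]) -> float:
    """Run the handler once and return how long it took, in seconds."""
    start = time.perf_counter()
    handler()
    return time.perf_counter() - start


def run_load_test(
    handler: Callable[[], None], total_requests: int, concurrency: int
) -> List[float]:
    """Call the handler `total_requests` times using `concurrency` threads.

    Returns the observed latencies sorted from fastest to slowest.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, handler) for _ in range(total_requests)]
        return sorted(f.result() for f in futures)


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


if __name__ == "__main__":
    # Stand-in handler: replace with a real request to the system under test.
    latencies = run_load_test(lambda: time.sleep(0.01), 200, 20)
    print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```

Tracking a high percentile such as p95 across regular test runs, rather than only the average, makes it easier to spot capacity gaps before a real surge does.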