Automated Testing Practices for Verifying the Behavior of Your Edge Configuration

February 6, 2020

In a recent article, we discussed how to manage changes to your edge configuration through a CD pipeline, and we specifically called out the importance of testing through each stage of the pipeline. Here, we’ll expand upon how to employ automated testing practices for verifying the behavior of your edge configuration.

“If you’re not testing and verifying that each step in your continuous deployment (CD) pipeline is producing the expected outcomes, you’re essentially just shipping failure faster to production.” - Lindsay Holmwood, VP of Product at Section (from his talk at DevOpsDays Melbourne)

Decide what guarantees you’re providing

As you’re building out your CD pipeline and integrating testing into each stage, it’s important to define what a successful system looks like so you can build tests that verify those requirements. Some of these decisions will stem from your users’ expectations and the guarantees you’ve promised, such as service level objectives (SLOs) and service level agreements (SLAs).

For example, an SLO might look like:

  • 95th percentile response time for HTTP requests in a one-hour window is < 250 milliseconds
  • 99.95% of requests should be served successfully

The goal is to implement tests that reduce the risk of introducing errors into production and ensure SLOs continue to be met as changes are pushed through your CD pipeline.
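
As a minimal sketch, the two example SLOs above could be checked against a window of request samples as follows. The function names, the nearest-rank percentile method, and the rule that any status below 500 counts as “served successfully” are assumptions made for illustration, not definitions from this article.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_slos(latencies_ms, status_codes, max_p95_ms=250.0, min_success_ratio=0.9995):
    """Evaluate one window of samples against the two example SLOs."""
    if not status_codes:
        raise ValueError("no requests in window")
    p95 = percentile(latencies_ms, 95)
    # Assumption: any status below 500 counts as a successfully served request.
    served = sum(1 for code in status_codes if code < 500)
    success_ratio = served / len(status_codes)
    return {
        "p95_ms": p95,
        "success_ratio": success_ratio,
        "latency_ok": p95 < max_p95_ms,
        "availability_ok": success_ratio >= min_success_ratio,
    }
```

A pipeline stage could call `check_slos` on the most recent window of samples after each change and fail the stage if either flag is false.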

Change one, test one

Chunking changes is one of the most critical testing practices when managing your edge configuration through CD pipelines. By breaking changes into smaller, verifiable units and testing as you go, you avoid being faced with failures at the end and not knowing what specific change introduced the failure.

Order matters

As you build tests into each stage of your CD pipeline, where you insert and run them most certainly matters!

When you start adding tests to an existing CD pipeline, it’s tempting to batch all the tests at the end of all your changes:

[Figure: tests batched at the end of the pipeline]

This makes it easier to see all the tests in one place, but it increases the delay between making a change and knowing if it worked. Worse still, it makes it harder for you to know what change caused the tests to fail – is the edge test failing because of the app change, the database change, or the edge change?

Instead, we can interweave the tests between each configuration change:

[Figure: tests interleaved between each configuration change]

This closes the feedback loop faster between making a change and knowing if it works. It also improves debug-ability by reducing the surface area of changes.
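
The interleaved ordering can be sketched as a small runner that verifies each change before moving on to the next one. The stage names and the `apply_change`/`verify` callables are hypothetical placeholders; a real pipeline would express this in its CD tool’s own configuration.

```python
def run_pipeline(stages):
    """Apply each change, then verify it immediately.

    `stages` is a list of (name, apply_change, verify) tuples.
    Returns the name of the first stage whose verification fails,
    or None if every stage passes. Later stages are never applied
    once a verification fails.
    """
    for name, apply_change, verify in stages:
        apply_change()
        if not verify():
            return name
    return None
```

Because the runner stops at the first failed verification, the returned name points at the change that introduced the failure, and later changes are never applied on top of a broken state.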

Speed matters

It’s extremely important that tests finish quickly. A good baseline is to keep tests executing in under 10 seconds, and preferably under 5 seconds.

Because you can’t test everything, be sure to focus on tests that deliver the best and fastest verification of health and requirements. Examples of things to test for might include:

  • Can I make an HTTP request and get a good response?
  • Are there any obviously bad log messages?
  • Is there a significant statistical deviation in metrics?
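
The first two checks in this list can be sketched with the Python standard library alone. The log pattern and the 2-second timeout are illustrative assumptions; what counts as an “obviously bad” log message depends on your own services.

```python
import re
import urllib.request

# Assumption: these severity keywords mark an "obviously bad" log line.
BAD_LOG_PATTERN = re.compile(r"\b(ERROR|CRITICAL|FATAL|panic)\b")


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` yields a 2xx response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts all derive from OSError.
        return False


def bad_log_lines(lines):
    """Return the log lines that match the bad-message pattern."""
    return [line for line in lines if BAD_LOG_PATTERN.search(line)]
```

Keeping the timeout short means a hung endpoint fails the check quickly instead of stalling the pipeline.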

Building your tests

Goodness of fit tests, or statistical tests for steady-state systems, are helpful in identifying how changes affect a normal distribution. Here are some good starting points:

Make feedback visual

Raw data is useful, but visual representations help tell a more informed story around that data. You can even use visualization tools directly in the command line to further streamline your operations. (The example below uses gnuplot’s dumb terminal output.)

  1480 ++---------------+----------------+----------------+---------------**
       +                +                +                + ************** +
  1460 ++                                            *******              ##
       |                                      *******                 #### |
  1440 ++                    *****************                 #######    ++
       |                  ***                                ##            |
  1420 *******************                                  #             ++
       |                                                   #               |
  1400 ++                                                ##               ++
       |                                             ####                  |
       |                                          ###                      |
  1380 ++                                      ###                        ++
       |                                     ##                            |
  1360 ++                               #####                             ++
       |                            ####                                   |
  1340 ++                    #######                                      ++
       |                  ###                                              |
  1320 ++          #######                                                ++
       ############     +                +                +                +
  1300 ++---------------+----------------+----------------+---------------++
       0                5                10               15               20

CRITICAL: Deviation (116.55) is greater than maximum allowed (100.00)
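
A check that emits a message in this style could be sketched as follows. How the tool above computes its deviation is not stated here, so this sketch assumes “deviation” means the largest absolute difference between paired samples of a baseline series and a current series. The return codes follow the Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
def max_deviation(baseline, current):
    """Largest absolute difference between paired samples (an assumed definition)."""
    if not baseline or len(baseline) != len(current):
        raise ValueError("series must be non-empty and the same length")
    return max(abs(b - c) for b, c in zip(baseline, current))


def check_deviation(baseline, current, maximum=100.0):
    """Return (exit_code, message) for a monitoring-style check."""
    deviation = max_deviation(baseline, current)
    if deviation > maximum:
        return 2, (
            f"CRITICAL: Deviation ({deviation:.2f}) is greater than "
            f"maximum allowed ({maximum:.2f})"
        )
    return 0, f"OK: Deviation ({deviation:.2f}) is within maximum allowed ({maximum:.2f})"
```

A wrapper script could print the message and exit with the returned code so that a monitoring system can consume the result directly.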

Level up tests by running them constantly

The key to making all of this work is a pipeline that integrates tests that run all the time. One simple way to get up and running quickly is to use Consul from HashiCorp.

[Figure: Consul test integration]

If you have Consul deployed to every node in a cluster, you can drop in per-node definitions for the services on each node and then add monitoring checks specific to each type of service. The advantage of running your system this way is that a pipeline stage only needs a simple check that queries the monitoring already running in the background, rather than running its own test.

[Figure: Consul per-node definitions]
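
The “simple check” a pipeline stage runs could look like the sketch below, which asks a local Consul agent for a service’s health checks over Consul’s HTTP API (`/v1/health/checks/<service>`) and reports any that are not passing. The agent address is Consul’s default; the function names and the service name in the usage note are hypothetical.

```python
import json
import urllib.request


def fetch_checks(service, consul_addr="http://127.0.0.1:8500", timeout=2.0):
    """Fetch the health checks registered for `service` from a Consul agent."""
    url = f"{consul_addr}/v1/health/checks/{service}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def failing_checks(checks):
    """Return the checks whose Status is anything other than "passing"."""
    return [check for check in checks if check.get("Status") != "passing"]
```

A stage might then call `failing_checks(fetch_checks("edge-proxy"))` and fail if the result is non-empty, where `edge-proxy` stands in for one of your own service names.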

Summary

When you automate testing through your CD pipeline, you’re able to increase your pace of iteration by identifying and addressing fault points faster without the operational overhead of testing on a change-by-change basis.