Beware Prometheus counters that do not begin at zero

November 24, 2020

After using Prometheus daily for a couple of years now, I thought I understood it pretty well. But recently I discovered that metrics I expected were not appearing in charts and not triggering alerts, so an investigation was required.

There were two related scenarios where expected metrics appeared to be missing, both concerning a web service hosted in a Kubernetes cluster. The first scenario was HTTP response count metrics, broken down by HTTP status code, not appearing for uncommon status codes that had not previously been served by a particular Pod, e.g. 405. The second scenario was HTTP response counts for error status codes, e.g. 500, that only happened during Pod start-up. The expected metrics were spiky, i.e. normally quiet but with occasional large values over a very short time.

I knew, first from third-party reports and ultimately by issuing my own HTTP requests to the service, that these responses were being served, but when I checked my charts they did not show the numbers I expected. My first suspicion was that my PromQL queries were incorrect. Here’s one:

sum(rate(http_response_total{}[2m])) by (status)

There isn’t much to it:

  • Query the last 2 minutes of the http_response_total counter. The scrape interval is 30 seconds so there should be enough data points in that window.
  • Calculate the change over time because http_response_total is a Counter metric, always increasing.
  • Sum all the matching series together, because there is a series with a different pod label for each replica of the web service, and group by status label.

Rather than trying to debug each element of the query, I went directly to the raw counter values:

http_response_total{status=~"405|500"}

The raw counter values reveal that the data I thought was missing actually exists. They also show a common pattern across the series that isn’t reflected in my summarised charts: they are all new series, i.e. each series has a data point with a non-zero value and there are no earlier data points for the same series inside my query time window.

Why is this happening?

It seems the rate PromQL function never counts the first recorded sample of a series as an increase, even when the sample value is non-zero. With only that one sample in the query window, rate returns no result at all, and once a second sample arrives the first serves only as the baseline to compare against. This is because the goal of the rate function is to compare multiple samples and interpolate the values in between. This interpolation behaviour is normally why counter metrics are ideal: they allow us to infer system behaviour in the time window between scrapes, a capability not offered by gauge metrics.

The problem with the first sample of a new metric series is that rate is attempting to compare against a non-existent previous value and Prometheus does not have enough data with which to interpolate. For counters, one might suggest that Prometheus should assume the missing previous value is zero, but the rate function also needs to know the timestamp of the non-existent sample to calculate the change over time, and the timestamp also isn’t available. It might be reasonable in some scenarios to assume the timestamp of the missing sample is one scrape interval earlier than the first known sample, but the scrape configuration is not part of the time-series database, and ultimately this is the current implementation in Prometheus that we need to work with.
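To make the missing first increment concrete, here is a minimal sketch of a rate-like calculation. It is a deliberate simplification and not the Prometheus implementation: the helper name simple_rate, the sample timestamps, and the counter values are all made up for illustration, and the real rate function additionally handles counter resets and extrapolates towards the edges of the window.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], window_start: float, window_end: float) -> Optional[float]:
    """Per-second increase between the first and last sample inside the window.

    Simplified on purpose: no counter-reset handling and no extrapolation.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [s for s in samples if window_start <= s[0] <= window_end]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# A brand-new series: 7 responses were served before the first scrape at t=90.
new_series = [(90.0, 7.0)]
first = simple_rate(new_series, 0.0, 120.0)    # only one sample, so no result

# One scrape later there are two samples, but the counter has not moved since.
new_series.append((120.0, 7.0))
second = simple_rate(new_series, 30.0, 150.0)  # 0.0, the 7 responses are never counted

# A long-lived series with the same 7 responses has an earlier baseline sample.
old_series = [(30.0, 100.0), (60.0, 100.0), (90.0, 107.0), (120.0, 107.0)]
third = simple_rate(old_series, 0.0, 120.0)    # 7 / 90 per second

print(first, second, third)
```

The only difference between the new series and the long-lived one is the existence of a sample before the increase, which is exactly what a freshly created series lacks.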

If most of your metric series (a series being a particular combination of metric name, label names, and label values) are long-lived, you’ll rarely experience this issue because you rarely query a counter’s first data point. However, new metric series are introduced quite often in some environments, including mine. Frequent Kubernetes deployments and Horizontal Pod Autoscaling create Pods with new names, which in turn produce metric series with new pod label values. On top of that, HTTP-related metrics with a status label are not exported until a response with a given status code is first served.

The problem predominantly impacts counter metrics because the value of a counter is essentially never consumed directly. Instead, a PromQL query function like rate or one of rate’s various friends is used to derive the value to render in a chart or compare against an alert threshold. In my experience, most interesting Prometheus metrics are counters; even histograms and summaries are implemented as counters under the hood.

How could we fix it?

It may be tempting to simply change the PromQL query to sum first and then rate, which would almost eliminate cases where there are too few data points to interpolate. However, this suffers from two problems:

  1. The rate function requires a range-vector as input but sum returns an instant-vector, although this could be solved with Recording Rules and more complicated queries.
  2. Any counter resets, e.g. due to container restarts, would corrupt the calculated value arguably worse than the current missing data issue, which is why we never sum then rate.
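The second point can be illustrated with a small worked example. This is a sketch under stated assumptions: the two Pods and their counter values are invented, and increase_with_resets is a hypothetical helper that mimics the usual reset rule (any drop in value is treated as the counter restarting from zero) rather than the actual Prometheus code.

```python
from typing import List


def increase_with_resets(values: List[float]) -> float:
    """Total increase of one counter, treating any drop as a restart from zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


# Four consecutive scrapes of the same counter on two Pods (made-up values).
pod_a = [100.0, 110.0, 120.0, 130.0]  # steady traffic, no restart
pod_b = [500.0, 510.0, 5.0, 15.0]     # container restarted before the third scrape

# Rate then sum: each series has its own reset detected and corrected.
rate_then_sum = increase_with_resets(pod_a) + increase_with_resets(pod_b)

# Sum then rate: the reset of pod_b looks like a reset of the whole sum,
# so the surviving value of pod_a is counted again as fresh increase.
summed = [a + b for a, b in zip(pod_a, pod_b)]
sum_then_rate = increase_with_resets(summed)

print(rate_then_sum)  # 55.0
print(sum_then_rate)  # 165.0
```

In this example the true number of responses served is 55, but summing first reports 165, because the drop from 620 to 125 in the summed series is interpreted as everything restarting from zero.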

I shared my experience with the Prometheus IRC community and received some helpful tips from Ben Kochie on approaches to mitigate this problem, at least in a Kubernetes environment or similar.

For the HTTP 405 scenario, where the Pod has been running for some time but has never served that status code before, we can modify the application to initialize all possible metric label combinations to zero during start-up. Naturally, this can surface issues with label value sets that have a large Cartesian product, but there are operational benefits to addressing high cardinality upfront, by choosing to use fewer labels or bucketing values together, instead of discovering an explosion of metrics at run-time.
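The start-up initialization could look roughly like the following sketch. To keep it self-contained it uses a tiny hand-rolled LabeledCounter class, which is hypothetical and stands in for a real client library; the method label and the lists of label values are also assumptions made for illustration. In the official Python client, as far as I know, calling labels(...) on a counter without incrementing it has the same effect of exporting the series at zero.

```python
from itertools import product
from typing import Dict, Tuple


class LabeledCounter:
    """Hypothetical stand-in for a client-library counter with labels."""

    def __init__(self, name: str, label_names: Tuple[str, ...]) -> None:
        self.name = name
        self.label_names = label_names
        self.values: Dict[Tuple[str, ...], float] = {}

    def init(self, *label_values: str) -> None:
        """Create the series at zero if it does not exist yet."""
        self.values.setdefault(label_values, 0.0)

    def inc(self, *label_values: str) -> None:
        """Increment the series by one, creating it if necessary."""
        self.values[label_values] = self.values.get(label_values, 0.0) + 1.0

    def expose(self) -> str:
        """Render every known series in the Prometheus text format."""
        lines = []
        for label_values, value in sorted(self.values.items()):
            labels = ",".join(f'{k}="{v}"' for k, v in zip(self.label_names, label_values))
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines)


# Assumed label values; the full set is the Cartesian product of both tuples.
METHODS = ("GET", "POST")
STATUSES = ("200", "405", "500")

http_response_total = LabeledCounter("http_response_total", ("method", "status"))

# At start-up, export every combination at zero before serving any traffic.
for method, status in product(METHODS, STATUSES):
    http_response_total.init(method, status)

http_response_total.inc("GET", "200")
text = http_response_total.expose()
print(text)
```

Two methods and three status codes already produce six series per Pod, which is why the size of the product is worth checking before adopting this approach.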

For the HTTP 500 scenario, where the Pod serves a number of error responses before it is first scraped, we can modify the Pod Readiness Probe so that requests are not delivered to the Pod until after Prometheus has scraped it at least once. This relies on the change above, initializing all metrics to zero, also having been made. It also requires Prometheus service discovery to be configured to scrape Pods that have not yet been marked as Ready.
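One way the application side of this could be sketched is a readiness endpoint that fails until the metrics endpoint has been requested once. Everything here is an assumption for illustration: the ScrapeGate class, the /metrics and /ready paths, and the simplification that any request to /metrics counts as a Prometheus scrape. The server is defined but not started.

```python
import threading
from http.server import BaseHTTPRequestHandler


class ScrapeGate:
    """Tracks whether the metrics endpoint has been requested at least once."""

    def __init__(self) -> None:
        self._scraped = threading.Event()

    def on_metrics_request(self) -> None:
        self._scraped.set()

    def readiness_status(self) -> int:
        """HTTP status for the readiness probe: 503 until the first scrape."""
        return 200 if self._scraped.is_set() else 503


gate = ScrapeGate()


class Handler(BaseHTTPRequestHandler):
    """Serves the hypothetical /metrics and /ready endpoints."""

    def do_GET(self) -> None:
        if self.path == "/metrics":
            # Any client hitting /metrics opens the gate in this sketch.
            gate.on_metrics_request()
            self._reply(200, "# metrics would be rendered here\n")
        elif self.path == "/ready":
            status = gate.readiness_status()
            self._reply(status, "ready\n" if status == 200 else "waiting for first scrape\n")
        else:
            self._reply(404, "not found\n")

    def _reply(self, status: int, body: str) -> None:
        data = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

The Readiness Probe would then point at the readiness path, so the Pod only starts receiving traffic once its zero-valued series have been recorded by Prometheus.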