Varnish Monitoring using Graphite for Metrics

April 1, 2015

Varnish Cache is a great tool in your reverse proxy chain because its superior programming model allows you to make complex caching decisions that are unachievable in other reverse proxy servers.

So, you spend a lot of time learning and deploying Varnish Cache. The next question is “How well is Varnish Cache working?”

In order to understand how Varnish Cache is behaving you’ll need to process logs provided by Varnish Cache’s toolset.

To start, you’ll need to configure varnishncsa to start writing logs (don’t forget to set up log file rotation!). You can then grep logs to find the requests you’re interested in.
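
Grepping gets tedious quickly, so here is a minimal sketch of computing a hit rate from those logs. It assumes, purely for illustration, that varnishncsa was given a custom format whose last field is `%{Varnish:hitmiss}x` (which prints `hit` or `miss`); the sample lines and the `hit_rate` helper are hypothetical, not part of any Varnish tooling.

```python
from collections import Counter


def hit_rate(lines):
    """Return the fraction of requests served from cache.

    Assumes the last whitespace-separated field of each line is
    'hit' or 'miss' (an assumed custom varnishncsa format).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        verdict = fields[-1]
        if verdict in ("hit", "miss"):
            counts[verdict] += 1
    total = counts["hit"] + counts["miss"]
    return counts["hit"] / total if total else 0.0


# Made-up sample lines in the assumed format.
sample = [
    '192.0.2.1 [01/Apr/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512 hit',
    '192.0.2.2 [01/Apr/2015:10:00:01 +0000] "GET /cart HTTP/1.1" 200 900 miss',
    '192.0.2.3 [01/Apr/2015:10:00:02 +0000] "GET /logo.png HTTP/1.1" 200 2048 hit',
    '192.0.2.1 [01/Apr/2015:10:00:03 +0000] "GET /site.css HTTP/1.1" 200 1024 hit',
]

print(hit_rate(sample))
```

In practice you would pass an open log file instead of `sample`, since the function only needs an iterable of lines.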

Also, varnishstat exposes counters such as cache hits, cache misses and backend connections. You can examine them when you have a problem, or review them over time to spot-check the health of your system.

Once you decide that your site needs high availability, you’ll be running Varnish Cache on multiple servers, and grepping logs and reading varnishstat by hand on each one stops being practical.

A more advanced setup can aggregate and consolidate this data in a form that is easy to use and can be digested by devs who do ops and ops who do dev.

We like to use a few components to achieve this, in two parts. All of these tools are freely available on the Internet.

Firstly, to understand things like cache hit rates by content-type, user-agent and geoip:

  1. Use varnishncsa with a decent log format that captures a lot of data.
  2. Ship those logs to a centralised log processing system, using rsyslog or syslog-ng.
  3. Run Logstash using a syslog input or UDP input to receive the log lines.
  4. During Logstash processing, use the geoip and useragent filters to enrich the data.
  5. Set up StatsD (from Etsy), and point the Logstash statsd output at it.
  6. We set statsd to flush aggregate data with means, medians and various percentiles to carbon-relay, the component of the Graphite stack that receives data.
  7. carbon-relay pushes to carbon-cache, which persists the data to disk.
  8. We then use graphite-web to perform ad hoc queries on the data.
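
To make the statsd step in this pipeline concrete, here is a rough sketch of what arrives at statsd. Logstash does this for you in the setup above; the Python below only illustrates the StatsD line protocol (`name:value|type`) sent over UDP. The metric names, the `format_metric` and `send_metrics` helpers, and the host are made up for the example; 8125 is the usual StatsD default port.

```python
import socket


def format_metric(name, value, kind):
    """Build one StatsD line: <name>:<value>|<type>.

    kind is 'c' (counter), 'ms' (timer) or 'g' (gauge).
    """
    if kind not in ("c", "ms", "g"):
        raise ValueError("unsupported metric type: %r" % kind)
    return f"{name}:{value}|{kind}"


def send_metrics(lines, host="127.0.0.1", port=8125):
    """Send newline-separated metrics to StatsD in a single UDP datagram."""
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names: one cache hit, and a 42 ms response time.
lines = [
    format_metric("varnish.web1.text_html.hit", 1, "c"),
    format_metric("varnish.web1.response_time", 42, "ms"),
]
print(lines)
# To actually emit them: send_metrics(lines)
```

Timers are what give you the means and percentiles mentioned in step 6, because statsd computes those aggregates for `ms` metrics at each flush.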

Secondly, for statistics from the instances of Varnish Cache:

  1. We run varnishstat as a collectd job periodically.
  2. collectd forwards the data obtained from varnishstat to our carbon-relay as above.
  3. carbon-relay sends the data to carbon-cache.
  4. We can then perform ad hoc queries on a single instance of Varnish Cache or look at the Varnish Cache cluster as a whole.
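
The steps above can be sketched outside collectd as well. The example below turns the JSON that `varnishstat -j` prints into Graphite plaintext lines (`path value timestamp`). Treat it as a sketch under assumptions: the JSON layout differs between Varnish versions (counters at the top level in older releases, under a `counters` key in newer ones), so the code tries both; the `varnish.web1` prefix and the helper names are invented for the example.

```python
import json
import re

# Characters outside this set are replaced so that metric paths stay clean.
_UNSAFE = re.compile(r"[^A-Za-z0-9_.-]")


def _counters(stats):
    """Return the counter mapping for either assumed JSON layout."""
    return stats.get("counters", stats)


def to_graphite_lines(stats, prefix, timestamp):
    """Convert parsed varnishstat JSON into Graphite plaintext lines."""
    lines = []
    for name, entry in _counters(stats).items():
        # Skip metadata such as "timestamp" or "version".
        if not isinstance(entry, dict) or "value" not in entry:
            continue
        path = _UNSAFE.sub("_", name)
        lines.append(f"{prefix}.{path} {entry['value']} {int(timestamp)}")
    return lines


def cache_hit_rate(stats):
    """Hit rate from the MAIN.cache_hit and MAIN.cache_miss counters."""
    counters = _counters(stats)
    hits = counters["MAIN.cache_hit"]["value"]
    misses = counters["MAIN.cache_miss"]["value"]
    total = hits + misses
    return hits / total if total else 0.0


# Trimmed, made-up sample in the older top-level layout.
sample = json.loads(
    '{"timestamp": "2015-04-01T10:00:00",'
    ' "MAIN.cache_hit": {"value": 900},'
    ' "MAIN.cache_miss": {"value": 100}}'
)

for line in to_graphite_lines(sample, "varnish.web1", 1427882400):
    print(line)
print(cache_hit_rate(sample))
```

Note that these counters are cumulative since Varnish started, so the hit rate computed here is a lifetime average; for a rate over a recent window you would take the difference between two samples.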

Graphite-web supports creating dashboards, so you can use those ad hoc queries you find interesting and group them together to build reusable metrics for Varnish Cache that you can use to maintain the health of your system.
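
Those ad hoc queries can also be scripted against graphite-web's render API. The sketch below only builds the URL; the hostname and metric paths are placeholders, and `render_url` is a helper written for this example rather than part of Graphite.

```python
from urllib.parse import urlencode


def render_url(base, targets, since="-1h"):
    """Build a graphite-web /render URL that returns JSON for the targets."""
    params = [("target", t) for t in targets]
    params += [("from", since), ("format", "json")]
    return base.rstrip("/") + "/render?" + urlencode(params)


# Per-second cache hits summed across every instance under a
# hypothetical "varnish.<host>" naming scheme.
cluster_hits = "sumSeries(nonNegativeDerivative(varnish.*.MAIN.cache_hit))"

print(render_url("http://graphite.example.com", [cluster_hits], since="-6h"))
```

`nonNegativeDerivative` is used because varnishstat counters only ever increase; it converts them into per-interval changes and ignores the reset when Varnish restarts.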

To make these more manageable, you can use Tessera (thanks Urban Airship!) to build slick-looking dashboards that suit your system’s requirements without any fancy programming.

Finally, don’t forget your high availability in all that setup. You don’t want to lose metrics when your system is having a problem.

So, if you’re using Varnish for Magento acceleration with Turpentine, perhaps you can use these techniques to make sure that you really are getting value out of your Varnish Cache instance. You might even be able to turn off a few of those application servers if you improve your cache hit rate a little.