Varnish Cache was the first reverse proxy offered by section.io and, for many of us, it is our favourite because it always just works and gives great results.
Yesterday though we encountered an issue where our Traces page in the section.io Aperture portal failed to retrieve the Varnish Logs for a particular section.io-hosted Varnish instance. It had timed out before the results returned. Our platform team investigated.
Our system gathers the Varnish logs on-demand by executing the
varnishlog tool in the relevant proxy container and capturing the output. By default,
varnishlog would tail the log indefinitely, reporting activity as traffic is handled by Varnish, but because we want a sample to return to the portal quickly, we specify the
-d option to
varnishlog which instructs it to output the recently recorded logs, then exit.
Upon investigation though we found that
varnishlog -d hadn’t exited, it was still running and still outputting logs as traffic flowed. My colleague, Ivan, discovered a Varnish bug reported in 2013 that
varnishlog was ignoring the
-d option entirely. This bug report included to a reference to a code change that resolved the issue, so we investigated that code further.
Ivan also noted that the output from the still active
varnishlog process included the text
Log overrun and
Log reacquired which lead us to another part of
varnishlog’s source code where we found logic almost identical to that which was changed in the aforementioned bug fix. This led us to conclude that there was an edge case where the
-d option would become ignored if the circular buffer used Varnish for its log was overrun whilst trying to read it. This edge case also appeared to exist in all versions of Varnish Cache since 4.0 inclusive.
We did not know how long it would take to fix this issue in Varnish’s log tool itself, or if any such fix would even be applied to all affected Varnish versions. The chance of the issue re-occuring is also somewhat unpredictable, so we needed to work around it in our system. We decided on an approach that would improve upon our platform’s general reliability and began to implement the changes.
Meanwhile, we submitted a description of the problem to the Varnish team along with a suggested code change to fix the issue. Within about 4 hours, our change had been accepted to be included in the eventual release of Varnish 5 and very likely in any updates to both Varnish 4.1 and 4.0.
I think I like Varnish even more now.