Varnish Cache was the first reverse proxy offered by Section and, for many of us, it is our favourite because it always just works and gives great results.
Yesterday, though, we encountered an issue where the Traces page in the Section Console failed to retrieve the Varnish Cache logs for a particular Section-hosted Varnish Cache instance: the request timed out before the results were returned. Our platform team investigated.
Broken
Our system gathers the Varnish Cache logs on demand by executing the varnishlog tool in the relevant proxy container and capturing the output. By default, varnishlog tails the log indefinitely, reporting activity as Varnish Cache handles traffic, but because we want a sample to return to the portal quickly, we pass the -d option, which instructs varnishlog to output the recently recorded logs and then exit.
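Conceptually, the capture step boils down to running varnishlog -d and reading whatever it prints. A minimal sketch in Go (not Section's actual implementation, and glossing over the detail of executing inside the proxy container) might look like this:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// recentVarnishLogs runs `varnishlog -d`, which dumps the records already in
// Varnish's shared-memory log and then exits, rather than tailing the log.
func recentVarnishLogs() (string, error) {
	out, err := exec.Command("varnishlog", "-d").CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("varnishlog -d failed: %w", err)
	}
	return string(out), nil
}

func main() {
	logs, err := recentVarnishLogs()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(logs)
}
```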
Upon investigation, though, we found that varnishlog -d had not exited: it was still running and still outputting logs as traffic flowed. My colleague Ivan discovered a Varnish bug reported in 2013 in which varnishlog ignored the -d option entirely. The bug report included a reference to the code change that resolved the issue, so we investigated that code further.
Ivan also noted that the output from the still-running varnishlog process included the text Log overrun and Log reacquired, which led us to another part of varnishlog's source code where we found logic almost identical to that changed in the aforementioned bug fix. This led us to conclude that there was an edge case in which the -d option is ignored if the circular buffer used by Varnish Cache for its log is overrun while varnishlog is reading it. The edge case also appeared to exist in every version of Varnish Cache from 4.0 onwards.
Fixed
We did not know how long it would take to fix this issue in Varnish Cache's log tool itself, or whether any such fix would even be applied to all affected Varnish Cache versions. The chance of the issue recurring is also somewhat unpredictable, so we needed to work around it in our system. We decided on an approach that would improve our platform's general reliability and began to implement the changes.
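The post does not spell out the exact workaround, but one natural safeguard in a situation like this is to bound the varnishlog call with a deadline, so that a process which ignores -d cannot stall the Traces request indefinitely. A hypothetical sketch, again in Go:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// recentVarnishLogsWithTimeout runs `varnishlog -d` but kills it if it has
// not exited within the given timeout, so a hung process cannot block the caller.
func recentVarnishLogsWithTimeout(timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, "varnishlog", "-d").CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		// varnishlog was killed; return whatever it printed before that point.
		return string(out), fmt.Errorf("varnishlog -d timed out after %s", timeout)
	}
	if err != nil {
		return "", fmt.Errorf("varnishlog -d failed: %w", err)
	}
	return string(out), nil
}

func main() {
	logs, err := recentVarnishLogsWithTimeout(5 * time.Second)
	if err != nil {
		fmt.Println("warning:", err)
	}
	fmt.Print(logs)
}
```

With a bound like this, the portal can still respond promptly, possibly with partial logs, even if varnishlog misbehaves.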
Meanwhile, we submitted a description of the problem to the Varnish Cache team along with a suggested code change to fix the issue. Within about four hours, our change had been accepted for inclusion in the eventual Varnish Cache 5 release, and very likely in future updates to both Varnish Cache 4.1 and 4.0.
I think I like Varnish Cache even more now.