varnishreload fails for non-trivial (large) VCLs #168

Open
Bilge opened this issue Oct 21, 2023 · 5 comments

Comments

Bilge commented Oct 21, 2023

Since Varnish 6.6/Jammy (official .deb), recompiling my VCL takes a non-trivial amount of time (about 10 seconds); for some reason this did not occur on 6.5/Bionic (Packagecloud). This may be because I have a large ACL weighing in at 165 KB. The delay causes varnishreload to fail, because the script uses two discrete invocations of varnishadm: one to compile (vcl.load) and one to activate (vcl.use) the VCL. Although it is possible to work around the issue by specifying an arbitrary wait (-w 10), that should be considered a poor solution for two reasons:

  1. The user should have the reasonable expectation that simply running varnishreload will reload their configuration, without needing to know about or use any command-line options.
  2. Arbitrary waits are a fundamentally flawed approach: we either wait too long (to be safe) or too little (causing a failure).

I would like to propose a better solution: send all commands to varnishadm together, in a single session. This works because vcl.load blocks within that session, so a vcl.use queued after it waits exactly as long as required before it is safe to run, which seems to me a much more elegant solution. The reason the current script fails is that its discrete invocations of varnishadm are independent and do not know of, or care about, commands pending in other processes.
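
For illustration, the failing pattern is roughly equivalent to the following simplified sketch (the real varnishreload script does considerably more; the VCL name here is made up):

#!/bin/bash
# Simplified sketch of the discrete-invocation pattern: each varnishadm call
# opens its own CLI session, so the second call knows nothing about how long
# the first one spends compiling the VCL.
name=$(date +'test-%s')

varnishadm vcl.load "$name" /etc/varnish/default.vcl
varnishadm vcl.use "$name"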

By contrast, the following proof of concept is all that is required to reload my configuration successfully, demonstrating the point.

#!/bin/bash
name=$(date +'test-%s')

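# Both commands share one varnishadm session, so vcl.use is not read until
# vcl.load has finished; no arbitrary wait is needed.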
varnishadm <<-.
	vcl.load "$name" /etc/varnish/default.vcl
	vcl.use "$name"
.

Might I suggest integrating this approach, and at the same time, dropping the unnecessary -w option?

dridi (Member) commented Oct 23, 2023

Thank you for the detailed report.

The -w option is there for asynchronous initialization at vcl.load time. It can be relevant to third-party VMODs, but as far as Varnish itself is concerned it helps with probes.

Probes have a default initial value of the threshold minus one, which means that a backend is not considered healthy until at least its first probe request succeeds. When varnishreload is invoked (or really, when a brand new VCL becomes active) in the middle of traffic, this can lead to a burst of 503 responses while the probes' first requests take place.
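
Roughly speaking, the intent is the same as inserting a warmup pause between loading and activating the new VCL, as in this simplified sketch (not the actual script code; ten seconds is only an example value):

name=$(date +'reload-%s')

varnishadm vcl.load "$name" /etc/varnish/default.vcl
sleep 10  # warmup: give the new VCL's probes a chance to succeed first
varnishadm vcl.use "$name"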

I don't know why using the -w option solves your problem. If anything, the varnishreload script is lacking a -t option to pass to varnishadm. Large ACLs can take a long time to compile, and while the load may succeed internally, it may take too long and appear to fail from varnishadm's point of view.
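
For comparison, varnishadm itself has a -t option to raise its CLI timeout, so a manual reload that tolerates a slow compile could look roughly like this (thirty seconds is only an example value):

name=$(date +'reload-%s')

# -t raises varnishadm's CLI timeout, so a slow vcl.load is not reported as
# a failure while the compiler is still working.
varnishadm -t 30 vcl.load "$name" /etc/varnish/default.vcl
varnishadm -t 30 vcl.use "$name"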

So what we probably should do is document the purpose of -w and add a -t option, and what we probably shouldn't do is batch commands (or at least not vcl.load and vcl.use).

It's also probably long overdue to move this script to the new contrib/ directory in the varnish-cache repository where it could grow a test suite before making any significant changes.

Bilge (Author) commented Oct 23, 2023

I won't pretend to understand any of that, but certainly tests sound like a good idea. It occurred to me this script is designed to suit a lot more use-cases than just my own, though I have no idea what those are.

dridi (Member) commented Oct 24, 2023

The script could move there: https://github.com/varnishcache/varnish-cache/tree/master/contrib

For the -t option, see https://varnish-cache.org/docs/7.4/reference/varnishadm.html#options

After reading the description again, I think I now understand: -w does not solve your problem (and it shouldn't), but the batch approach does. The reason for its success is that, despite timing out, varnishadm had already sent the vcl.use command, and it is eventually picked up by varnishd.

If we were to fix varnishcache/varnish-cache#4012 as you suggested, that would break your batch approach anyway. In particular, varnishadm would not dissect the input (only the output), and I'm not sure how it would behave with regard to timeouts. Regardless, a non-zero exit status would interrupt the execution of the script, so the reload would be effective but the cleanup wouldn't happen.

So really, the solution for you would be to pass -t through, to account for slow vcl.load (and to a lesser extent vcl.use) commands.

Bilge (Author) commented Oct 24, 2023

I'm afraid I still don't think you understand.

The issue, according to my limited understanding, is that varnishadm communicates over a single, blocking socket (exported by the server). Therefore, if only one instance of varnishadm is running at any given time there is no issue, even when a command (like vcl.load) takes a long time to complete and blocks for the duration, because the client waits until the socket becomes available again before sending any further commands.

However, if you use multiple varnishadm sessions, as this script does (by sending every single command from a new instance), a session cannot communicate with a socket that is blocked by another process, and that is where the timeout comes from. It is also why batching the commands together (in one varnishadm process) works.
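
To make that concrete, here is a rough way to observe the scenario I am describing (just a sketch; the VCL name is made up, and the timing of the second command is the thing to watch):

#!/bin/bash
# One session holds the CLI busy with a slow vcl.load while a second
# session tries to talk to varnishd at the same time.
varnishadm vcl.load "slow-$(date +%s)" /etc/varnish/default.vcl &
sleep 1
time varnishadm vcl.list  # observe whether/when this second session completes
wait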

dridi (Member) commented Oct 24, 2023

I'm afraid my patience is running thin, so my efforts at trying to engage just varnished.
