varnishreload fails for non-trivial (large) VCLs #168

Open
Bilge opened this issue Oct 21, 2023 · 5 comments

Comments

Bilge commented Oct 21, 2023

Since Varnish 6.6/Jammy (official .deb), recompiling my VCL takes a non-trivial amount of time (about 10 seconds); for some reason this did not occur on 6.5/Bionic (Packagecloud). This may be because I have a large ACL weighing in at 165 KB. The delay causes varnishreload to fail, because the script uses two discrete invocations of varnishadm: one to compile (vcl.load) and one to activate (vcl.use) the VCL. Although it is possible to work around the issue by specifying an arbitrary wait (-w 10), that should be considered a poor solution for two reasons:

  1. The user should have the reasonable expectation that simply running varnishreload will reload their configuration, without needing to know about or use any command-line options.
  2. Arbitrary waits are a fundamentally flawed approach: we either wait too long (to be safe) or too little (causing a failure).

I would like to propose a better solution: send all commands to varnishadm together, in a single session. This works because vcl.load blocks within that session, so a vcl.use queued after it waits exactly as long as required before it is safe to run, which seems to me a much more elegant solution. The reason the current script fails is that its discrete invocations of varnishadm are independent and do not know of, or care about, commands pending in other processes.
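
For illustration, the failing pattern is roughly equivalent to the following simplified sketch (the real varnishreload script does considerably more; the VCL name here is made up):

#!/bin/bash
# Simplified sketch of the discrete-invocation pattern: each varnishadm call
# opens its own CLI session, so the second call knows nothing about how long
# the first one spends compiling the VCL.
name=$(date +'test-%s')

varnishadm vcl.load "$name" /etc/varnish/default.vcl
varnishadm vcl.use "$name"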

By contrast, the following proof of concept is all that is required to reload my configuration successfully, demonstrating the point.

#!/bin/bash
name=$(date +'test-%s')

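# Both commands share one varnishadm session, so vcl.use is not read until
# vcl.load has finished; no arbitrary wait is needed.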
varnishadm <<-.
	vcl.load "$name" /etc/varnish/default.vcl
	vcl.use "$name"
.

Might I suggest integrating this approach, and at the same time, dropping the unnecessary -w option?

dridi (Member) commented Oct 23, 2023

Thank you for the detailed report.

The -w option is there for asynchronous initialization at vcl.load time. It can be relevant to third-party VMODs, but as far as Varnish itself is concerned it helps with probes.

Probes have a default initial value of the threshold minus one, which means that a backend is not considered healthy until at least its first probe request succeeds. When varnishreload is invoked (or really, when a brand new VCL becomes active) in the middle of traffic, this can lead to a burst of 503 responses while the probes' first requests take place.
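
Roughly speaking, the intent is the same as inserting a warmup pause between loading and activating the new VCL, as in this simplified sketch (not the actual script code; ten seconds is only an example value):

name=$(date +'reload-%s')

varnishadm vcl.load "$name" /etc/varnish/default.vcl
sleep 10  # warmup: give the new VCL's probes a chance to succeed first
varnishadm vcl.use "$name"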

I don't know why using the -w option solves your problem. If anything, the varnishreload script is lacking a -t option to pass to varnishadm. Large ACLs can take a long time to compile, and while the load may succeed internally, it may take too long and appear to fail from varnishadm's point of view.
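
For comparison, varnishadm itself has a -t option to raise its CLI timeout, so a manual reload that tolerates a slow compile could look roughly like this (thirty seconds is only an example value):

name=$(date +'reload-%s')

# -t raises varnishadm's CLI timeout, so a slow vcl.load is not reported as
# a failure while the compiler is still working.
varnishadm -t 30 vcl.load "$name" /etc/varnish/default.vcl
varnishadm -t 30 vcl.use "$name"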

So what we probably should do is document the purpose of -w and add a -t option, and what we probably shouldn't do is batch commands (or at least not vcl.load and vcl.use).

It's also probably long overdue to move this script to the new contrib/ directory in the varnish-cache repository where it could grow a test suite before making any significant changes.

Bilge (Author) commented Oct 23, 2023

I won't pretend to understand any of that, but certainly tests sound like a good idea. It occurred to me this script is designed to suit a lot more use-cases than just my own, though I have no idea what those are.

dridi (Member) commented Oct 24, 2023

The script could move there: https://github.com/varnishcache/varnish-cache/tree/master/contrib

For the -t option, see https://varnish-cache.org/docs/7.4/reference/varnishadm.html#options

After reading the description again, I think I now understand: -w does not solve your problem (and it shouldn't), but the batch approach does. The reason for its success is that, despite timing out, varnishadm had already sent the vcl.use command, and it is eventually picked up by varnishd.

If we were to fix varnishcache/varnish-cache#4012 as you suggested, that would break your batch approach anyway. In particular, varnishadm would not dissect the input (only the output), and I'm not sure how it would behave with regard to timeouts. Regardless, a non-zero exit status would interrupt the execution of the script, so the reload would be effective but the cleanup wouldn't happen.

So really, the solution for you would be to pass -t through, to account for slow vcl.load (and to a lesser extent vcl.use) commands.

Bilge (Author) commented Oct 24, 2023

I'm afraid I still don't think you understand.

The issue, according to my limited understanding, is that varnishadm communicates over a single, blocking socket (exported by the server). Therefore, if only one instance of varnishadm is running at any given time there is no issue, even when a command (like vcl.load) takes a long time to complete and blocks for the duration, because the client waits until the socket becomes available again before sending any further commands.

However, if you use multiple varnishadm sessions, as this script does (by sending every single command from a new instance), a session cannot communicate with a socket that is blocked by another process, and that is where the timeout comes from. It is also why batching the commands together (in one varnishadm process) works.
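
To make that concrete, here is a rough way to observe the scenario I am describing (just a sketch; the VCL name is made up, and the timing of the second command is the thing to watch):

#!/bin/bash
# One session holds the CLI busy with a slow vcl.load while a second
# session tries to talk to varnishd at the same time.
varnishadm vcl.load "slow-$(date +%s)" /etc/varnish/default.vcl &
sleep 1
time varnishadm vcl.list  # observe whether/when this second session completes
wait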

dridi (Member) commented Oct 24, 2023

I'm afraid my patience is running thin, so my efforts at trying to engage just varnished.
