Dynamically reload cert/keys if changed on disk #1595

Draft
wants to merge 1 commit into master

Conversation

jtackaberry
Contributor

Sketching out the ideas for #1554. Works for server certs, and should work for client certs, but I haven't tested that yet. CA certs aren't dynamically reloaded, but I don't think that's too important (and I don't think Go's crypto/tls makes that easy).

As mentioned in #1554, this implementation causes the node cert/key to be monitored multiple times (because they're passed multiple times to CreateClientConfig() and CreateServerConfig()), and so changes will result in logging several instances of "reloading certificate". I plan to fix this by keeping a global cache of monitoredCertificate objects keyed by cert/key and returning the same object when the same cert/key is passed, but if you have a better idea let me know.
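
Something along these lines is what I have in mind for the cache; a rough sketch only, with names and layout that aren't necessarily what will land in the branch:

package rtls

import (
	"crypto/tls"
	"sync"
	"sync/atomic"
)

// monitoredCertificate here is a stand-in: it remembers the cert/key paths and
// holds the most recently loaded pair.
type monitoredCertificate struct {
	certFile, keyFile string
	cert              atomic.Value // *tls.Certificate
}

var (
	monitoredMu    sync.Mutex
	monitoredCerts = map[string]*monitoredCertificate{} // keyed by cert+key path
)

// monitorCertificate returns the existing object for a given cert/key pair,
// creating and loading one on first use, so the same files are only watched
// once no matter how many times CreateClientConfig()/CreateServerConfig() run.
func monitorCertificate(certFile, keyFile string) (*monitoredCertificate, error) {
	monitoredMu.Lock()
	defer monitoredMu.Unlock()

	key := certFile + "\x00" + keyFile
	if mc, ok := monitoredCerts[key]; ok {
		return mc, nil
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	mc := &monitoredCertificate{certFile: certFile, keyFile: keyFile}
	mc.cert.Store(&cert)
	monitoredCerts[key] = mc
	return mc, nil
}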

monitorFiles() is the stat poller. Pretty simple. I left it under rtls for now and we can decide where to put it later.

No tests yet, I can look at that once the design is finalized.

@jtackaberry
Contributor Author

Also on reflection I'll probably rename "monitor" to "watch" (and variants thereof), just because it's a touch more idiomatic and "monitored" has other connotations.

@otoolep
Copy link
Member

otoolep commented Jan 14, 2024

You want feedback on this now? Or should I wait?

@jtackaberry
Contributor Author

jtackaberry commented Jan 14, 2024

You want feedback on this now? Or should I wait?

If you have thoughts now, fire away. I probably won't be able to resume work for a couple days.

Member

@otoolep left a comment


This makes sense to me, but it needs to be approached differently. For example, services like this usually return a channel, allowing them to be shut down cleanly. See https://github.com/rqlite/rqlite/blob/master/queue/queue.go#L198 for an example in rqlite. I would expect to see use of the "done" channel pattern.

Also, this code could be quite racy, since I don't see any read-write synchronization being done.

if changed != nil {
	cb(changed)
}
time.Sleep(2 * time.Second)
Member


This is not idiomatic Go. You should use the ticker pattern.

https://gobyexample.com/tickers
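
Roughly this shape, i.e. a ticker combined with a "done" channel so the loop can also be stopped cleanly (a sketch only; the names are illustrative, not from this PR):

package rtls

import "time"

// watchLoop replaces the Sleep-based loop: it runs check on every tick and
// returns as soon as done is closed, so the goroutine can be shut down cleanly.
func watchLoop(done <-chan struct{}, interval time.Duration, check func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			check()
		}
	}
}

The caller closes done (e.g. from a Close() method) to stop the goroutine, which is also what makes clean shutdown in tests possible.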

Contributor Author


Sure, I can adapt it easily enough. Although it's not clear to me what value using a ticker and channel will provide over sleep and callback in this case. A ticker makes sense when you want to select on multiple things (most notably a context), but in this case it was literally just a "do a thing every interval T."

Also, this code could be quite racy, since I don't see any read-write synchronization being done.

It uses an atomic.Value to hold a scalar (a pointer to tls.Certificate).
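
i.e. roughly this, which is why no explicit lock is needed (an illustrative sketch, not the exact layout in the branch):

package rtls

import (
	"crypto/tls"
	"sync/atomic"
)

// currentCert always holds a *tls.Certificate. Store and Load on an atomic.Value
// are safe for concurrent use, so the watcher goroutine can swap in a new
// pointer while TLS handshakes read the old one, with no explicit lock.
var currentCert atomic.Value

func setCertificate(c *tls.Certificate) { currentCert.Store(c) }

func getCertificate() *tls.Certificate {
	c, _ := currentCert.Load().(*tls.Certificate)
	return c
}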

Contributor Author


I would expect to see use of the "done" channel pattern.

What is it that will signal doneness?

In this case there's no cleanup activity for the goroutine to do, it's expected to run continuously until program exit, and it doesn't impede rqlited's ability to terminate.

Contributor Author


To be clear, I'm not objecting to making the changes, just asking to be administered the cluebat a little bit more. :)

@otoolep
Member

otoolep commented Jan 14, 2024

Yeah, following Go patterns is very important to me. It promotes quality, ease of testing, and makes it clearer for other programmers what your actual intention is. In this case using a Ticker makes it crystal clear what you're doing, in a way Sleep does not.

Not being able to shut down the monitoring service makes unit-testing more difficult. For example, unit testing takes place within the context of a single binary. Without clean shutdown, if I force a stack trace of the test binary (perhaps I'm debugging a test deadlock), I've got extra goroutines running that have nothing to do with the test failure.

@otoolep
Member

otoolep commented Jan 14, 2024

Not being able to shut down the monitoring service makes unit-testing more difficult. For example, unit testing takes place within the context of a single binary.

Multiple unit tests run inside a single binary. Multiple different unit tests of the monitoring service would leave multiple goroutines running, which makes debugging harder. Every time a given unit test finishes within the binary, I want all resources cleaned up so the runtime is fully reset for the next unit test.

@otoolep
Member

otoolep commented Jan 14, 2024

More generally, the monitoring code should be a struct, one the code can instantiate. You can see this pattern in the fsnotify system -- NewWatcher.
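
For reference, the fsnotify usage pattern looks roughly like this; the point is the instantiable watcher struct with a Close(), not that rqlite needs fsnotify itself (illustrative only):

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// NewWatcher returns a struct the caller owns; Close shuts it down cleanly.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add("/path/to/cert.pem"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&fsnotify.Write != 0 {
				log.Println("modified:", event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}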

@jtackaberry
Contributor Author

Not being able to shut down the monitoring service makes unit-testing more difficult. For example, unit testing takes place within the context of a single binary. Without clean shutdown, if I force a stack trace of the test binary (perhaps I'm debugging a test deadlock), I've got extra goroutines running that have nothing to do with the test failure.

Right, good point. I'm with you from a testing perspective.

Outside of the testing context, I'm not sure how this should work. The usual pattern is to propagate a context.Context throughout which is used to signal clean termination. rqlite isn't doing that, and with the goroutine being spawned from within rtls.CreateServerConfig() and rtls.CreateClientConfig(), that solution would end up being fairly invasive (needing an update to a large number of functions).

Alternatively, the file monitor goroutine could be spawned (and subsequently signaled to terminate) once from rqlited/main.go as a single goroutine (managed by a singleton struct) and rtls.Create{Client,Server}Config could register against that.

@jtackaberry
Contributor Author

More generally, the monitoring code should be a struct, one the code can instantiate. You can see this pattern in the fsnotify system -- NewWatcher.

Similar to what I concluded my last comment with. The only nuance with rqlite is that if the watcher's lifecycle is to be maintained from rqlited/main.go, it would AFAICT need to be a global singleton of that struct; otherwise, like with Context, a large number of functions would need to be updated to propagate the Watcher object.

@otoolep
Member

otoolep commented Jan 14, 2024

The use of Context within rqlite is weak; it should be better (rqlite development started before Context became a real thing, and that pattern has unfortunately stuck). If there is a better, idiomatic solution using Context that gives the code complete control over the Monitoring object's lifecycle, that would work better.

@jtackaberry
Contributor Author

If there is a better, idiomatic solution using Context that gives the code complete control over the Monitoring object's lifecycle, that would work better.

I think there is, similar to what you're doing with auto backups in rqlited/main.go -- creating a cancellable context and passing it to a function that spawns a goroutine -- the only thing is AFAICT it would require use of some form of global state. This is common enough in Go (even in rqlite I see globals are used for at least expvar.Map) but some folks strongly object to globals. (Presumably because they can indicate a code smell, though IMO they don't always.)

In fact, I actually started down these lines but back-pedaled to a single function when I thought I could get away with it. :) But it sounds like something more sophisticated is in order.

Two approaches come to mind:

  1. Make the file watcher itself a single global goroutine where rtls (and possibly elsewhere in future) registers a file watch (slice of filenames with a channel to be poked) via a Watch() function. The registered files would be global state (map plus mutex) in the file watcher package (whatever that is). rqlited/main.go then calls the Start() function in that package, which takes a Context and spawns a goroutine to poll the registered files, and which terminates when the context is cancelled. Watch() then takes care of updating the global map which is checked by the single goroutine.
  2. Or, NewFileWatcher() returns an instance of a FileWatcher struct, which keeps said map/mutex within the struct, and where each instance spawns its own goroutine similar to what's in my branch now. Except here the key difference is that the package also offers a SetDefaultContext() function which is called by rqlited/main.go to set a single context that's used to cancel all goroutines spawned by NewFileWatcher. Then the rtls package creates one global instance of FileWatcher for its use -- global mainly to minimize the duplication of watching the same file (namely the node cert/key, since IIRC CreateClientConfig() is called for each internode connection).

So in the first case the global state is a map (plus corresponding mutex) that tracks the files registered for watch, while in the second case the state is a shared context set by main() and a singleton instance of FileWatcher used by rtls.
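
A rough sketch of option 2, with every name hypothetical and the stat logic elided:

package rtls

import (
	"context"
	"sync"
	"time"
)

var (
	defaultCtxMu sync.Mutex
	defaultCtx   = context.Background()
)

// SetDefaultContext is called once from rqlited/main.go, before any watchers
// are created; cancelling the context stops every watcher created afterwards.
func SetDefaultContext(ctx context.Context) {
	defaultCtxMu.Lock()
	defer defaultCtxMu.Unlock()
	defaultCtx = ctx
}

// FileWatcher polls a set of files and invokes cb when any of them change.
type FileWatcher struct {
	files []string
	cb    func(changed []string)
}

// NewFileWatcher starts the polling goroutine, tied to the default context.
func NewFileWatcher(files []string, interval time.Duration, cb func([]string)) *FileWatcher {
	defaultCtxMu.Lock()
	ctx := defaultCtx
	defaultCtxMu.Unlock()

	fw := &FileWatcher{files: files, cb: cb}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if changed := fw.checkFiles(); len(changed) > 0 {
					fw.cb(changed)
				}
			}
		}
	}()
	return fw
}

// checkFiles would stat each file and return the ones that changed since the
// previous poll; elided here.
func (fw *FileWatcher) checkFiles() []string { return nil }

Option 1 would look similar, except the map of watched files and the single polling goroutine would live at package level and Start() would take the Context.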

@jtackaberry
Contributor Author

jtackaberry commented Jan 14, 2024

global mainly to minimize the duplication of watching the same file (namely the node cert/key, since IIRC CreateClientConfig() is called for each internode connection).

You know, maybe this is a premature/unnecessary optimization. Even if we do spawn a separate file watcher goroutine per node connection watching the same files, so what? We aren't talking about clusters with thousands or even hundreds of nodes. Ok, we stat the same files multiple times per interval, and this will look a little wonky to anyone paying attention to strace, but it'll simplify things in rtls which uses the file watcher.


// load reads the certificate and key files from disk and caches the result for Get()
func (mc *monitoredCertificate) load() error {
	cert, err := tls.LoadX509KeyPair(mc.certFile, mc.keyFile)


Question: does this fail safely if one file has changed but the other hasn't yet?

For example, what happens if the cert file changes and there is some random slowness before the key file also gets written? Before loading them it might be wise to add a delay, unless LoadX509KeyPair fails if there is any kind of mismatch.

Contributor Author


unless LoadX509KeyPair fails if there is any kind of mismatch

Exactly. If the key doesn't match the cert, LoadX509KeyPair() returns an error and the active cert/key in memory won't be replaced. When the other file gets updated (or whatever operational issue is resolved), this will get re-executed and the new pair will be picked up at that time.
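
In other words, the reload path is effectively this (sketch):

package rtls

import (
	"crypto/tls"
	"sync/atomic"
)

// reload sketches the behaviour described above: if either file is half-written,
// or the new key doesn't match the new cert, LoadX509KeyPair returns an error
// and the previously stored pair stays active; a later poll simply retries once
// both files are consistent again.
func reload(certFile, keyFile string, current *atomic.Value) error {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return err // keep serving the old cert/key
	}
	current.Store(&cert)
	return nil
}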


Oh great then, I should find the equivalent for LoadX509KeyPair in Node.js, thanks!
