Dynamically reload cert/keys if changed on disk #1595

Draft
wants to merge 1 commit into master

Conversation

jtackaberry
Contributor

Sketching out the ideas for #1554. Works for server certs, and should work for client certs, but I haven't tested that yet. CA certs aren't dynamically reloaded, but I don't think that's too important (and I don't think Go's crypto/tls makes that easy).

As mentioned in #1554, this implementation causes the node cert/key to be monitored multiple times (because they're passed multiple times to CreateClientConfig() and CreateServerConfig()), and so changes will result in logging several instances of "reloading certificate". I plan to fix this by keeping a global cache of monitoredCertificate objects keyed by cert/key and returning the same object when the same cert/key is passed, but if you have a better idea let me know.
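
Something along these lines is what I have in mind for the cache; a rough sketch only, with names and layout that aren't necessarily what will land in the branch:

package rtls

import (
	"crypto/tls"
	"sync"
	"sync/atomic"
)

// monitoredCertificate here is a stand-in: it remembers the cert/key paths and
// holds the most recently loaded pair.
type monitoredCertificate struct {
	certFile, keyFile string
	cert              atomic.Value // *tls.Certificate
}

var (
	monitoredMu    sync.Mutex
	monitoredCerts = map[string]*monitoredCertificate{} // keyed by cert+key path
)

// monitorCertificate returns the existing object for a given cert/key pair,
// creating and loading one on first use, so the same files are only watched
// once no matter how many times CreateClientConfig()/CreateServerConfig() run.
func monitorCertificate(certFile, keyFile string) (*monitoredCertificate, error) {
	monitoredMu.Lock()
	defer monitoredMu.Unlock()

	key := certFile + "\x00" + keyFile
	if mc, ok := monitoredCerts[key]; ok {
		return mc, nil
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	mc := &monitoredCertificate{certFile: certFile, keyFile: keyFile}
	mc.cert.Store(&cert)
	monitoredCerts[key] = mc
	return mc, nil
}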

monitorFiles() is the stat poller. Pretty simple. I left it under rtls for now and we can decide where to put it later.

No tests yet, I can look at that once the design is finalized.

@jtackaberry
Contributor Author

Also on reflection I'll probably rename "monitor" to "watch" (and variants thereof), just because it's a touch more idiomatic and "monitored" has other connotations.

@otoolep
Copy link
Member

otoolep commented Jan 14, 2024

You want feedback on this now? Or should I wait?

@jtackaberry
Contributor Author

jtackaberry commented Jan 14, 2024

You want feedback on this now? Or should I wait?

If you have thoughts now, fire away. I probably won't be able to resume work for a couple days.

Member

@otoolep left a comment


This makes sense to me, but it needs to be approached differently. For example, services like this usually return a channel, allowing them to be shut down cleanly. See https://github.com/rqlite/rqlite/blob/master/queue/queue.go#L198 for an example in rqlite. I would expect to see use of the "done" channel pattern.

Also, this code could be quite racy, since I don't see any read-write synchronization being done.

if changed != nil {
	cb(changed)
}
time.Sleep(2 * time.Second)
Member


This is not idiomatic Go. You should use the ticker pattern.

https://gobyexample.com/tickers
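
Roughly this shape, i.e. a ticker combined with a "done" channel so the loop can also be stopped cleanly (a sketch only; the names are illustrative, not from this PR):

package rtls

import "time"

// watchLoop replaces the Sleep-based loop: it runs check on every tick and
// returns as soon as done is closed, so the goroutine can be shut down cleanly.
func watchLoop(done <-chan struct{}, interval time.Duration, check func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			check()
		}
	}
}

The caller closes done (e.g. from a Close() method) to stop the goroutine, which is also what makes clean shutdown in tests possible.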

Contributor Author


Sure, I can adapt it easily enough. Although it's not clear to me what value using a ticker and channel will provide over sleep and callback in this case. A ticker makes sense when you want to select on multiple things (most notably a context), but in this case it was literally just a "do a thing every interval T."

Also, this code could be quite racy, since I don't see any read-write synchronization being done.

It uses an atomic.Value to hold a scalar (a pointer to tls.Certificate).
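
i.e. roughly this, which is why no explicit lock is needed (an illustrative sketch, not the exact layout in the branch):

package rtls

import (
	"crypto/tls"
	"sync/atomic"
)

// currentCert always holds a *tls.Certificate. Store and Load on an atomic.Value
// are safe for concurrent use, so the watcher goroutine can swap in a new
// pointer while TLS handshakes read the old one, with no explicit lock.
var currentCert atomic.Value

func setCertificate(c *tls.Certificate) { currentCert.Store(c) }

func getCertificate() *tls.Certificate {
	c, _ := currentCert.Load().(*tls.Certificate)
	return c
}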

Contributor Author


I would expect to see use of the "done" channel pattern.

What is it that will signal doneness?

In this case there's no cleanup activity for the goroutine to do, it's expected to run continuously until program exit, and it doesn't impede rqlited's ability to terminate.

Contributor Author


To be clear, I'm not objecting to making the changes, just asking to be administered the cluebat a little bit more. :)

@otoolep
Member

otoolep commented Jan 14, 2024

Yeah, following Go patterns is very important to me. It promotes quality, ease of testing, and makes it clearer for other programmers what your actual intention is. In this case using a Ticker makes it crystal clear what you're doing, in a way Sleep does not.

Not being able to shut down the monitoring service makes unit-testing more difficult. For example, unit testing takes place within the context of a single binary. Without clean shutdown, if I force a stack trace of the test binary (perhaps I'm debugging a test deadlock), I've got extra goroutines running that have nothing to do with the test failure.

@otoolep
Member

otoolep commented Jan 14, 2024

Not being able to shut down the monitoring service makes unit-testing more difficult. For example, unit testing takes place within the context of a single binary.

Multiple unit tests run inside a single binary. Multiple different unit tests of the monitoring service would leave multiple goroutines running, which makes debugging harder. Every time a given unit test finishes within the binary, I want all resources cleaned up so the runtime is fully reset for the next unit test.

@otoolep
Member

otoolep commented Jan 14, 2024

More generally, the monitoring code should be a struct, one the code can instantiate. You can see this pattern in the fsnotify system -- NewWatcher.
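
For reference, the fsnotify usage pattern looks roughly like this; the point is the instantiable watcher struct with a Close(), not that rqlite needs fsnotify itself (illustrative only):

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// NewWatcher returns a struct the caller owns; Close shuts it down cleanly.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add("/path/to/cert.pem"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&fsnotify.Write != 0 {
				log.Println("modified:", event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}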

@jtackaberry
Contributor Author

Not being able to shut down the monitoring service makes unit-testing more difficult. For example, unit testing takes place within the context of a single binary. Without clean shutdown, if I force a stack trace of the test binary (perhaps I'm debugging a test deadlock), I've got extra goroutines running that have nothing to do with the test failure.

Right, good point. I'm with you from a testing perspective.

Outside of the testing context, I'm not sure how this should work. The usual pattern is to propagate a context.Context throughout which is used to signal clean termination. rqlite isn't doing that, and with the goroutine being spawned from within rtls.CreateServerConfig() and rtls.CreateClientConfig(), that solution would end up being fairly invasive (needing an update to a large number of functions).

Alternatively, the file monitor goroutine could be spawned (and subsequently signaled to terminate) once from rqlited/main.go as a single goroutine (managed by a singleton struct) and rtls.Create{Client,Server}Config could register against that.

@jtackaberry
Contributor Author

More generally, the monitoring code should be a struct, one the code can instantiate. You can see this pattern in the fsnotify system -- NewWatcher.

Similar to what I concluded my last comment with. The only nuance with rqlite is that if the watcher's lifecycle is to be maintained from rqlited/main.go, it would AFAICT need to be a global singleton of that struct; otherwise, like with Context, a large number of functions would need to be updated to propagate the Watcher object.

@otoolep
Member

otoolep commented Jan 14, 2024

The use of Context within rqlite is weak; it should be better (rqlite development started before Context became a real thing, and that pattern has unfortunately stuck). If there is a better, idiomatic solution using Context that gives the code complete control over the Monitoring object's lifecycle, that would work better.

@jtackaberry
Contributor Author

If there is a better, idiomatic solution using Context that gives the code complete control over the Monitoring object's lifecycle, that would work better.

I think there is, similar to what you're doing with auto backups in rqlited/main.go -- creating a cancellable context and passing it to a function that spawns a goroutine -- the only thing is AFAICT it would require use of some form of global state. This is common enough in Go (even in rqlite I see globals are used for at least expvar.Map) but some folks strongly object to globals. (Presumably because they can indicate a code smell, though IMO they don't always.)

In fact, I actually started down these lines but back-pedaled to a single function when I thought I could get away with it. :) But it sounds like something more sophisticated is in order.

Two approaches come to mind:

  1. Make the file watcher itself a single global goroutine where rtls (and possibly elsewhere in future) registers a file watch (slice of filenames with a channel to be poked) via a Watch() function. The registered files would be global state (map plus mutex) in the file watcher package (whatever that is). rqlited/main.go then calls the Start() function in that package, which takes a Context and spawns a goroutine to poll the registered files, and which terminates when the context is cancelled. Watch() then takes care of updating the global map which is checked by the single goroutine.
  2. Or, NewFileWatcher() returns an instance of a FileWatcher struct, which keeps said map/mutex within the struct, and where each instance spawns its own goroutine similar to what's in my branch now. Except here the key difference is that the package also offers a SetDefaultContext() function which is called by rqlited/main.go to set a single context that's used to cancel all goroutines spawned by NewFileWatcher. Then the rtls package creates one global instance of FileWatcher for its use -- global mainly to minimize the duplication of watching the same file (namely the node cert/key, since IIRC CreateClientConfig() is called for each internode connection).

So in the first case the global state is a map (plus corresponding mutex) that tracks the files registered for watch, while in the second case the state is a shared context set by main() and a singleton instance of FileWatcher used by rtls.
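
A rough sketch of option 2, with every name hypothetical and the stat logic elided:

package rtls

import (
	"context"
	"sync"
	"time"
)

var (
	defaultCtxMu sync.Mutex
	defaultCtx   = context.Background()
)

// SetDefaultContext is called once from rqlited/main.go, before any watchers
// are created; cancelling the context stops every watcher created afterwards.
func SetDefaultContext(ctx context.Context) {
	defaultCtxMu.Lock()
	defer defaultCtxMu.Unlock()
	defaultCtx = ctx
}

// FileWatcher polls a set of files and invokes cb when any of them change.
type FileWatcher struct {
	files []string
	cb    func(changed []string)
}

// NewFileWatcher starts the polling goroutine, tied to the default context.
func NewFileWatcher(files []string, interval time.Duration, cb func([]string)) *FileWatcher {
	defaultCtxMu.Lock()
	ctx := defaultCtx
	defaultCtxMu.Unlock()

	fw := &FileWatcher{files: files, cb: cb}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if changed := fw.checkFiles(); len(changed) > 0 {
					fw.cb(changed)
				}
			}
		}
	}()
	return fw
}

// checkFiles would stat each file and return the ones that changed since the
// previous poll; elided here.
func (fw *FileWatcher) checkFiles() []string { return nil }

Option 1 would look similar, except the map of watched files and the single polling goroutine would live at package level and Start() would take the Context.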

@jtackaberry
Contributor Author

jtackaberry commented Jan 14, 2024

global mainly to minimize the duplication of watching the same file (namely the node cert/key, since IIRC CreateClientConfig() is called for each internode connection).

You know, maybe this is a premature/unnecessary optimization. Even if we do spawn a separate file watcher goroutine per node connection watching the same files, so what? We aren't talking about clusters with thousands or even hundreds of nodes. Ok, we stat the same files multiple times per interval, and this will look a little wonky to anyone paying attention to strace, but it'll simplify things in rtls which uses the file watcher.


// load reads the certificate and key files from disk and caches the result for Get()
func (mc *monitoredCertificate) load() error {
	cert, err := tls.LoadX509KeyPair(mc.certFile, mc.keyFile)


Question: does this fail safely if one file has changed but the other hasn't yet?

For example, what happens if the cert file changes and there is some random slowness before the key file also gets written? Before loading them it might be wise to add a delay, unless LoadX509KeyPair fails if there is any kind of mismatch.

Contributor Author


unless LoadX509KeyPair fails if there is any kind of mismatch

Exactly. If the key doesn't match the cert, LoadX509KeyPair() returns an error and the active cert/key in memory won't be replaced. When the other file gets updated (or whatever operational issue is resolved), this will get re-executed and the new pair will be picked up at that time.
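
In other words, the reload path is effectively this (sketch):

package rtls

import (
	"crypto/tls"
	"sync/atomic"
)

// reload sketches the behaviour described above: if either file is half-written,
// or the new key doesn't match the new cert, LoadX509KeyPair returns an error
// and the previously stored pair stays active; a later poll simply retries once
// both files are consistent again.
func reload(certFile, keyFile string, current *atomic.Value) error {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return err // keep serving the old cert/key
	}
	current.Store(&cert)
	return nil
}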


Oh great then, I should find the equivalent for LoadX509KeyPair in Node.js, thanks!
