Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dynamic reloading of TLS files for raft and http protocols #1554

Open
jtackaberry opened this issue Jan 2, 2024 · 7 comments
Open

Comments

@jtackaberry
Copy link
Contributor

jtackaberry commented Jan 2, 2024

Certificate rotations are a standard operational requirement. Today, certificate rotations require a rolling restart of the cluster.

It's increasingly common for certificates to be issued through automated means with relatively short validity periods to mitigate certificate compromises (because revocation in PKI is notoriously broken). For example, in Kubernetes, one can use cert-manager to manage generation and rotation of certificates. An operator may choose to use, for example, Let's Encrypt for the the client-facing HTTP port, which currently issues certificates with 90 day lifetimes (and has said they may consider ratcheting down the lifetime in future). Inter-node certificates could be generated using, say, GCP CAS or AWS ACM PCA, or even cert-manager's native CA issuer.

In all these cases, the certificate rotation is fully automated, and updated PEM files are automatically swapped into the running rqlite pods (typically by means of a mounted Kubernetes secret).

This FR is a small quality-of-life enhancement to offload certificate management to outside orchestration such as cert-manager, for both the HTTP API and internode communication.

As with #1553, this could be done using fsnotify or polling mtime every few seconds, and should follow symlinks.

@jtackaberry
Copy link
Contributor Author

I was thinking about this one and I think a simpler implementation would be to replace population of the tls.Config.Certificates field with GetCertificate and GetClientCertificate callbacks, which point to a function that reads the cert/key from disk, caching it for a period of time (30 seconds or so).

This means the cert loading happens in-band of the connection, and this should work equally well for both egress and ingress TLS connections.

I'll do up a PR.

@otoolep
Copy link
Member

otoolep commented Jan 12, 2024

What is the standard approach, isn't responding to SIGHUP the way these systems traditionally work? I'd rather implement support for that, caching files for N seconds leads to racy, hard-to-test code in practice.

@otoolep
Copy link
Member

otoolep commented Jan 12, 2024

All manner of things are harder test when systems do async things like expiring caches. For example, what if one of the refreshes from disk fails? How do we inform the user? Continually log every 30 seconds? It gets messy.

@jtackaberry
Copy link
Contributor Author

jtackaberry commented Jan 12, 2024

It tends to be a mix between dynamic reloading and SIGHUP. I see a couple options here:

  1. I can use fsnotify or basic os.Stat polling to detect changes instead of caching as in my original suggestion
  2. I can add a SIGHUP handler to reload certificates (and in the Helm chart use inotifyd to watch the cert files in the background and issue SIGHUP to rqlited -- this assumes use of alpine base images though)

In both cases though replacing tls.Config.Certificates with the callbacks would be needed.

Let me know what your preference is.

@otoolep
Copy link
Member

otoolep commented Jan 12, 2024

A goroutine that os.Stat polls the file every second or so would be acceptable. That way I can control testing more precisely -- touch the file.

Ideally the poll period is configurable, at the unit test level (it would not need to be user-configurable). That way we can set the poll period to, say, 50ms, to make testing faster.

If you do decide to add this functionality, I would suggest writing a generic io.Stat-based poller, for any file, which takes some sort of generic callback or supplies a channel. Then we unit-test that, and only then integrate it with the Certs stuff.

Roughly the same interface as the fsnotify module. I would prefer not to drag in a whole new module for something that seems easy enough to implement just using the standard library.

You could put it in rtls/file since the only use case we have right now is for certs.

@jtackaberry
Copy link
Contributor Author

If you do decide to add this functionality, I would suggest writing a generic io.Stat-based poller, for any file, which takes some sort of generic callback or supplies a channel. Then we unit-test that, and only then integrate it with the Certs stuff.

Ack. My instinct right now is a callback, since with a channel something ought to be on the receiving end of that continuously, but tls.Config.GetCertificate() would only be called when there are connections. In the simplest case, the caller could just pass a reference to an atomic.Bool to the poller and asking it to flip that to true when any of the provided files changes, and GetCertificate() consults that to decide if it should load from disk or use the previously loaded value.

Though of course the supplied callback could do that too. I'll start with the callback.

I would prefer not to drag in a whole new module for something that seems easy enough to implement just using the standard library.

Using the kernel for file notifications is much faster and more efficient, but, yeah, when you're polling a single digit number of files per second it's not going to make any difference. Context switching is a rounding error, and VFS cache will prevent any increase in physical disk I/O.

You could put it in rtls/file since the only use case we have right now is for certs.

Putting the code that implements the callbacks for tls.Config which uses the polling file monitor makes sense. (AFAICT the monitor would be setup from CreateClientConfig() and CreateServerConfig().)

The node cert/key gets passed to these functions multiple times, so either we live with polling the same file multiple times (from different goroutines), or maintain some sort of global state, probably in the form of a singleton of a struct holding the file monitor state.

There's nothing TLS-specific about such a file monitor though. And in fact it would make sense to use it for #1553 if you consider it worth implementing. It's more of a common utility function/struct. Would you still want to put it in rtls/file if it's used to watch the credentials file?

I can sketch out something in a PR and then you can decide if you want to tweak it, gut it, or toss it.

@otoolep
Copy link
Member

otoolep commented Jan 12, 2024

There's nothing TLS-specific about such a file monitor though. And in fact it would make sense to use it for #1553 if you consider it worth implementing. It's more of a common utility function/struct. Would you still want to put it in rtls/file if it's used to watch the credentials file?

True, if we have a use for it already, maybe it's a top-level module. Or maybe we pull in some other module, but I prefer to pull high-quality, well-tested code only if I can help it.

Yes, I would be interested in a PR, let's see what you have in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants