Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5756 Identity Management Overhaul] Rewrite sentry to authenticate clients before signing #6171

Merged
merged 77 commits into from
Aug 1, 2023

Conversation

JoshVanL
Copy link
Contributor

@JoshVanL JoshVanL commented Mar 31, 2023

Part of #5756

This is the first in a chain or PRs to complete this issue. The following PRs will implement the security changes for each of the other components.

This rewrites the sentry service to now authenticate clients before signing for their identity in a backwards compatible way.

It introduces a new pkg/security package. This package is designed to manage the components SPIFFE ID (fetched from sentry), and expose connection options for authentication and authorizing servers.

The sentry flag --token-audience has now been deprecated. The audience of the kubernetes token used by clients should always be the SPIFFE ID of sentry (not implemented in this PR). For backwards compatibility, sentry will still accept dapr.io/sentry which should be remove in v1.13.

Sentry will ensure clients request their app ID. Sentry will still accept clients sending a CSR request with the ID set to $namespace:$sa, however this exception should be removed in 1.13.

Sentry will still send the trust anchors in the sign response, but should be removed in 1.13 as trust distribution should happen out of band of signing.

Sentry will continue to include cluster.local as a DNS SAN on its serving cert, since pre v1.12 client authorize on this. This should be removed in v1.13

Sentry will also continue to include cluster.local as DNS SAN on the issuer certificate since this is used by the control plane components as serving certificates. Also to be removed in v1.13.

global.mtls.controlPlaneTrustDomain is a new top level helm chart to set the control plane trust domain. Currently only read by sentry.

The root CA is now created by sentry (no longer helm) and written to the dapr-trust-bundle secret. It also write the trust anchors to the config map dapr-trust-bundle. The control plane components will read the trust anchors from this ConfigMap as a volume mount.

Updates sentry validators:

  1. Renamed the "selfhosted" validator to "insecure", since it doesn't do any validation at all.
  2. Added a new mtls.tokenValidators option to the "daprsentry" Configuration resource to configure additional validators. This is an array that contains that contains:
    • mtls.tokenValidators[*].name: name of the validator.
    • mtls.tokenValidators[*].options: optional dictionary of key-values that are options that can be passed to validator.
  3. In Kubernetes mode, the built-in "kubernetes" validator is always enabled. In self-hosted mode, the built-in "insecure" validator is enabled only if there's no additional validator configured.
  4. Added the jwks validator that validates .
    • This accepts 1 option: source is the location of a JWKS that contains the public keys to validate the keys. This can be one of:
      • A path to a file on disk - this is watched for changes with fsnotify
      • A HTTP(S) URL - this is periodically refreshed if needed
      • The actual JWKS in the string value, optionally base64-encoded
    • This validator requires JWT tokens that contain the 2 claims that SPIFFE's JWT-SIV use too:
      • sub must be the SPIFFE ID of the app, for example spiffe://localhost/ns/default/myapp
      • aud must contain the SPIFFE ID of Sentry, for example spiffe://localhost/ns/dapr-system/dapr-sentry
      • The validator will validate time validity bounds if present, but it will not enforce their presence (so tokens can live as long as the issuer wants them to)
  5. daprd can load a token to send to Sentry that is signed with an arbitrary key. These are defined using:
    • A file pointed by the env var DAPR_SENTRY_TOKEN_FILE
    • (Note that the env var must start with DAPR_ for security reasons, since those are not accessible in the localenv secret store)
  6. Token requests that want to use a non-default validator must pass the "token_validator" option in the gRPC request.

@codecov
Copy link

codecov bot commented Mar 31, 2023

Codecov Report

Merging #6171 (c4e440c) into master (b118e9c) will decrease coverage by 0.51%.
Report is 1 commits behind head on master.
The diff coverage is 49.46%.

❗ Current head c4e440c differs from pull request most recent head 3057086. Consider uploading reports for the commit 3057086 to get more accurate results

@@            Coverage Diff             @@
##           master    #6171      +/-   ##
==========================================
- Coverage   65.02%   64.52%   -0.51%     
==========================================
  Files         216      226      +10     
  Lines       19705    20082     +377     
==========================================
+ Hits        12814    12957     +143     
- Misses       5853     6069     +216     
- Partials     1038     1056      +18     
Files Changed Coverage Δ
pkg/config/configuration.go 58.16% <0.00%> (-1.17%) ⬇️
pkg/security/security.go 0.00% <0.00%> (ø)
pkg/sentry/server/ca/fake/fake.go 7.69% <7.69%> (ø)
pkg/sentry/server/validator/fake/fake.go 7.69% <7.69%> (ø)
pkg/runtime/security/auth.go 20.00% <8.33%> (-0.28%) ⬇️
pkg/security/x509source.go 11.57% <11.57%> (ø)
pkg/security/fake/fake.go 14.28% <14.28%> (ø)
pkg/sentry/server/validator/insecure/insecure.go 25.00% <25.00%> (ø)
pkg/channel/http/http_channel.go 55.41% <50.00%> (ø)
pkg/grpc/server.go 46.10% <50.00%> (ø)
... and 23 more

... and 2 files with indirect coverage changes

Copy link
Contributor

@ItalyPaleAle ItalyPaleAle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my first round of review, although there are still a few files missing.

One question is on why we are still using both a root CA cert and and intermediate CA cert. Can't we remove one from the chain, and just use the root CA?

Also, since this is a major rewrite and reviewing this PR is not super easy (because the diff isn't much helpful), would you be available for a review during a call?

pkg/sentry/server/validator/kubernetes/kubernetes_test.go Outdated Show resolved Hide resolved
pkg/sentry/server/validator/kubernetes/kubernetes.go Outdated Show resolved Hide resolved
pkg/sentry/sentry.go Outdated Show resolved Hide resolved
pkg/sentry/sentry.go Outdated Show resolved Hide resolved
pkg/sentry/sentry.go Outdated Show resolved Hide resolved
pkg/sentry/server/ca/cert.go Show resolved Hide resolved
pkg/security/token.go Outdated Show resolved Hide resolved
pkg/security/pem/pem.go Outdated Show resolved Hide resolved
pkg/security/pem/pem.go Outdated Show resolved Hide resolved
cmd/sentry/main.go Outdated Show resolved Hide resolved
Copy link
Contributor

@mukundansundar mukundansundar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first set of comments.

cmd/sentry/main.go Outdated Show resolved Hide resolved
cmd/sentry/main.go Outdated Show resolved Hide resolved
pkg/sentry/server/validator/validator.go Outdated Show resolved Hide resolved
pkg/sentry/server/validator/internal/common_test.go Outdated Show resolved Hide resolved
panic(err)
}
if err := corev1.AddToScheme(scheme); err != nil {
panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are there panics during init? what would trigger this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error here would signal that something is fundamentally broken in our APIs, or those imported. We should catch these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still run time panic, this is not going to be during dev alone right? Or is it expected that this will help catch issues during dev?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this will catch issuing during dev. If this errors, the only thing we can ever do it terminate the program with an error anyway.

pkg/security/spiffe/spiffe_test.go Outdated Show resolved Hide resolved
pkg/sentry/certs/certs.go Show resolved Hide resolved
@JoshVanL
Copy link
Contributor Author

JoshVanL commented Apr 6, 2023

Thanks for the review @ItalyPaleAle!

This is my first round of review, although there are still a few files missing.

One question is on why we are still using both a root CA cert and and intermediate CA cert. Can't we remove one from the chain, and just use the root CA?

Could do in future, but since this is what we are doing already today we should keep doing that until we are in a future version, past needing to rely on any old PKI. Remember that all components are currently consuming the issuing certificate and private key and using that for serving and clients. Changing up the chain structure will add to the backwards/forwards compat complexity. Ideally, we should not care what PKI setup Dapr is using so long as it follows correct PKI.

Also, since this is a major rewrite and reviewing this PR is not super easy (because the diff isn't much helpful), would you be available for a review during a call?

Yes, definitely makes sense to do this! 🙂

@mukundansundar
Copy link
Contributor

@JoshVanL what will be the effect of these sentry changes wrt the CLI renew certificate command?

@ItalyPaleAle
Copy link
Contributor

One question is on why we are still using both a root CA cert and and intermediate CA cert. Can't we remove one from the chain, and just use the root CA?

Could do in future, but since this is what we are doing already today we should keep doing that until we are in a future version, past needing to rely on any old PKI. Remember that all components are currently consuming the issuing certificate and private key and using that for serving and clients. Changing up the chain structure will add to the backwards/forwards compat complexity. Ideally, we should not care what PKI setup Dapr is using so long as it follows correct PKI.

The problem with the current approach is that because we throw away the root key, we can not renew the root CA certs. 10 years is a long time (and is probably too long), and yet that's how we got the Y2K and Y2K38 problems :)

Since we're making so drastic changes to Sentry, I think this would be the perfect time to remove this choice, which IMHO doesn't even make much sense (I get protecting the root CA key, but if the intermediate CA can sign certificates, this makes no difference at the end of the day).

@JoshVanL
Copy link
Contributor Author

JoshVanL commented Apr 6, 2023

@ItalyPaleAle Completely agree- lets do this in a followup PR.

I would like to see a better option set for sentry so the user can tune these things (with more sane defaults). This is also where we can introduce external CAs.

}

// SentryID returns the SPIFFE ID of the sentry server.
func SentryID(sentryTrustDomain, sentryNamespace string) (spiffeid.ID, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like Spiffe Id for Sentry, do we need different Spiffe Ids for other control plane services as well, for Operator, Injector and Placement; as they may be also considered different workloads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that is the plan 🙂

You can see here in my "north star" diff which I am working towards constructs the expected SPIFFE ID in each control plane component client.

I am tackling each component in turn. Here are the drafts for the operator and placement PRs
#6193
#6179

pkg/sentry/server/ca/cert.go Show resolved Hide resolved
app: dapr-sentry
{{- range $key, $value := .Values.global.k8sLabels }}
{{ $key }}: {{ tpl $value $ }}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EncryptionConfiguration is a Kubernetes API Server configuration option which as a Kubernetes application have no control over (i.e. not a Dapr concern). This option would be configured by the cluster administrators or the cloud provider for managed Kubernetes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. It helps.
Just for my understanding: So, when we say that Trust Anchor can be accessed from ConfigMap using volume mount, so can any non-Dapr process inside Pod be also able to access this volume and use this Trust Anchor to get itself Authenticated from Sentry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trust anchor is used by clients (operator, placement, injector) to verify the server (sentry) connection. We use a ConfigMap here because:

  1. we don't want any component, apart from Sentry, to be mounting the intermediate certificate and private key,
  2. ConfigMap is a better resource type in this scenario because the TrustAnchor is not secret data,
  3. In future when we implement other CA signing sources (such as cert-manager), the intermediate Secret may never exist, so using dedicated ConfigMap for the TrustAnchor means that will always exist regardless of the signing/intermediate CA provider.

Daprd gets this TrustAnchor from the injector, set as a environment variable, which itself gets it from this ConfigMap. There may be an argument for implementing a controller to distribute the TrustAnchor to all (dapr enabled) namespaces as well so that daprds are able to update their trust store without being restarted (environment variables cannot be updated in place, whereas file mounts can).


We likely want to move to using ClusterTrustBundles once all supported Kubernetes versions serve this API.

kubernetes/enhancements#3257

@JoshVanL
Copy link
Contributor Author

As there are already major overhaul changes going on, should we also rename Issuer CA to Intermediate CA? Intermediate CA is quickly understandable, whereas all this while I too have been thinking that what is this Issuer CA.

Sure thing, more than happy to make this change!

@dapr-bot
Copy link
Collaborator

This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@ItalyPaleAle
Copy link
Contributor

@JoshVanL Good news :)

  • The E2E tests here confirm that we can run dapr 1.10 and 1.11 sidecars against "vNext" (with this PR)
    • Required a bit of a "hack" because current Dapr versions request a TLS certificate from Sentry by sending a CSR that has type "CERTIFICATE" in the PEM block, instead of "CERTIFICATE REQUEST". See: 4fcdf93
  • Although we cannot write an E2E test for that, I manually verified that we can deploy a Dapr 1.11 control plane and use "vNext" sidecars. The sidecars do get a TLS certificate correctly from v1.11 Sentry, and things like service invocation works.
    • Although this scenario is not officially supported, it's something we still make an effort to allow, especially because there can be a version mismatch during upgrades.

@JoshVanL
Copy link
Contributor Author

/test

@JoshVanL
Copy link
Contributor Author

/test-version-skew
/test-sdk-all

prevent panic

Signed-off-by: joshvanl <me@joshvanl.dev>
configuration crd

Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
Signed-off-by: joshvanl <me@joshvanl.dev>
@JoshVanL
Copy link
Contributor Author

/ok-to-test
/test-version-skew
/test-sdk-all

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 27, 2023

Dapr E2E test

🔗 Link to Action run

Commit ref: 2fae31e

✅ Build succeeded for linux/amd64

  • Image tag: dapre2e560ce6bdedl
  • Test image tag: dapre2e560ce6bdedl

✅ Infrastructure deployed

Cluster Resource group name Azure region
Linux Dapr-E2E-dapre2e560ce6bdedl westus3
Windows Dapr-E2E-dapre2e560ce6bdedw westus3
Linux/arm64 Dapr-E2E-dapre2e560ce6bdedla eastus

✅ Build succeeded for windows/amd64

  • Image tag: dapre2e560ce6bdedw
  • Test image tag: dapre2e560ce6bdedw

✅ Tests succeeded on linux/amd64

  • Image tag: dapre2e560ce6bdedl
  • Test image tag: dapre2e560ce6bdedl

✅ Tests succeeded on windows/amd64

  • Image tag: dapre2e560ce6bdedw
  • Test image tag: dapre2e560ce6bdedw

@JoshVanL
Copy link
Contributor Author

/test-sdk-all

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 27, 2023

Dapr SDK Python test

🔗 Link to Action run

Commit ref: 2fae31e

✅ Python SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 27, 2023

Dapr SDK Java test

🔗 Link to Action run

Commit ref: 2fae31e

❌ Java SDK tests failed

Please check the logs for details on the error.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 27, 2023

Dapr SDK Go test

🔗 Link to Action run

Commit ref: 2fae31e

✅ Go SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 27, 2023

Dapr SDK JS test

🔗 Link to Action run

Commit ref: 2fae31e

❌ JS SDK tests failed

Please check the logs for details on the error.

@JoshVanL
Copy link
Contributor Author

/test-sdk-all

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 28, 2023

Dapr SDK Go test

🔗 Link to Action run

Commit ref: c4e440c

✅ Go SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 28, 2023

Dapr SDK Python test

🔗 Link to Action run

Commit ref: c4e440c

✅ Python SDK tests passed

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 28, 2023

Dapr SDK Java test

🔗 Link to Action run

Commit ref: c4e440c

❌ Java SDK tests failed

Please check the logs for details on the error.

@dapr-bot
Copy link
Collaborator

dapr-bot commented Jul 28, 2023

Dapr SDK JS test

🔗 Link to Action run

Commit ref: c4e440c

✅ JS SDK tests passed

@artursouza
Copy link
Member

This PR is breaking service invocation and tracing ITs in Java SDK.

@JoshVanL
Copy link
Contributor Author

@artursouza those Java failing tests look unrelated to this PR. Here is a similar failure which was executed from the SDK test schedule https://github.com/dapr/dapr/actions/runs/5686772228

@artursouza artursouza merged commit c585729 into dapr:master Aug 1, 2023
9 checks passed
@artursouza
Copy link
Member

@JoshVanL This will be a release blocker.

resp, err := c.SignCertificate(context.Background(),
&sentryv1pb.SignCertificateRequest{
CertificateSigningRequest: certPem,
CertificateSigningRequest: csrPem,
Id: getSentryIdentifier(id),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This intentionally not just the app ID? (Instead, being SENTRY_LOCAL_IDENTITY which can be > 64 characters)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @a-shoemaker, indeed it has been done to ensure backwards compatibility with a 1.11 sentry server. The intention is to use the app ID in 1.13.

// is wrong- dapr identities are based on daprd namespace + _app ID_.
// Remove this allowance in v1.12.
if req.Namespace+":"+pod.Spec.ServiceAccountName == req.Id {
req.Id = expID

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you validated the req.ID is < 64 characters before this even though it wasn't the app ID. Which subsequently seems to make 1.12 unusable if you have any namespaces + app ID's of much length at ll.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @a-shoemaker, indeed this looks like a bug in 1.12 which needs to be fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a-shoemaker I have opened up an issue here to track.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants