-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add self-repair for malformed instance certs #41467
Conversation
342d525
to
9302729
Compare
9302729
to
36792f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a test that simulates an older agent starting up with/without new roles as well, i.e. where the initial local version state is not set?
lib/service/connect.go
Outdated
// a teleport instance that is running multiple services authorized by different join tokens, which poses challenges currently | ||
// and may be a strictly invalid state in the future if we start enforcing access-control limitations on agents via their join tokens. | ||
if role == types.RoleInstance && process.getLocalAuth() == nil { | ||
identity, err := process.GetIdentity(role) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to make sure I understand this right:
- the first check in this func is for non "instance" roles, i.e.
RoleDiscovery
,RoleDatabase
etc.). It needs to wait for the "instance" connector to be broadcast. - this check is for the "instance" role registration itself, so we look to see if we already have an instance identity stored
- if we do have an instance identity already, assert that we either have all the instance roles in that identity OR that the identity is older than v16
- if we dont have an identity already, then this is first time connect and we will write the teleport version in "InitialLocalVersion"
- finally, the instance connector is made available to 1. and we assert again that the role is present in the instance identity
36792f5
to
d521ba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orca Security Scan Summary
Status | Check | Issues by priority | |
---|---|---|---|
Passed | Infrastructure as Code | 0 0 0 0 | View in Orca |
Failed | Secrets | 1 0 0 0 | View in Orca |
Passed | Vulnerabilities | 0 0 0 0 | View in Orca |
🔑 The following Secrets have been detected in your pull request across all commits
NAME | FILE | LINE NUM | COMMIT | ||
---|---|---|---|---|---|
PEM File With Private Key | ...uth/webproxy_key.pem | 1 | f66cc33 | View in code |
"secrets" in the above scan result are self-signed certs used by a test. |
As someone who's been on the rough end of user experience with Teleport's join tokens for a while (and is particularly aware of the customer frustration/confusion which can exist around them), I wanted to suggest that it would be nicer if we allow the mix-and-match behaviour. To be clear; I always tell people that they should use a join token which is valid for the full set of services that an agent provides. However, people don't always take advice, and the fact that things have just worked in this circumstance historically means that users have pre-set expectations. They don't understand the difference between an App cert and an Instance cert, and nor should they. It sounds like there is an explicit requirement here for the agent's storage to be cleared so a new instance certificate can be issued? This will cause confusion and frustration for anyone who has mixed and matched tokens in the past and expects things to work. IMO, we should definitely make agents just "do the right thing" and automatically handle Instance cert regeneration/reissue as long as the agent has separate certs for each service it provides. Also - I don't know whether we still advertise Related (for a different, but no less frustrating reason): #2838 |
I have concerns about the current form of this PR and its impact on the product. First, agents installed via the Second, adding or removing services from an agent configuration is a common practice. Users often begin with limited knowledge of Teleport, implementing and adopting it incrementally. They might start with Kubernetes and later expand to application or database services to expose assets running in Kubernetes. This incremental adoption is standard and should not be hindered. The effort required to expand a service, especially when customers purchase new licenses for additional protocols or request trials, would be so great that it would discourage implementation. This could be a significant drawback for larger customers who expand their usage regularly by purchasing new protocols or features after a successful rollout. Instead, we should allow agents to exchange all their secrets for a new instance certificate valid for the appropriate services. By exchanging all secrets, I mean the agents should call an endpoint where, through a cryptographic challenge, they prove possession of the cert-key pair for a given role and a token for the new role they wish to adopt. After Auth validates the possession of the certificates for the current enabled services and the new token, it will issue a new instance certificate valid for all requested roles. This approach is more sustainable in the long term because:
This strategy provides a better framework for scalability and flexibility, accommodating the evolving needs of our users without significant overhead or disruption. When a service is disabled and moved to another service, Teleport should replace the stored cert-key pairs with a new instance certificate that has the necessary permissions for the remaining services. This ensures that only the required credentials are stored, enhancing security. Instead of retaining the cert-key pairs of all previously enabled services, we should transition to a more secure model where a new, less privileged instance certificate is issued. This certificate should be sufficient to meet the needs of the remaining active services without retaining unnecessary credentials. |
f66cc33
to
5af7a4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orca Security Scan Summary
Status | Check | Issues by priority | |
---|---|---|---|
Passed | Infrastructure as Code | 0 0 0 0 | View in Orca |
Failed | Secrets | 1 0 0 0 | View in Orca |
Passed | Vulnerabilities | 0 0 0 0 | View in Orca |
🔑 The following Secrets have been detected in your pull request across all commits
NAME | FILE | LINE NUM | COMMIT | ||
---|---|---|---|---|---|
PEM File With Private Key | ...uth/webproxy_key.pem | 1 | 5af7a4e | View in code |
5af7a4e
to
59f4482
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orca Security Scan Summary
Status | Check | Issues by priority | |
---|---|---|---|
Passed | Infrastructure as Code | 0 0 0 0 | View in Orca |
Failed | Secrets | 1 0 0 0 | View in Orca |
Passed | Vulnerabilities | 0 0 0 0 | View in Orca |
🔑 The following Secrets have been detected in your pull request across all commits
NAME | FILE | LINE NUM | COMMIT | ||
---|---|---|---|---|---|
PEM File With Private Key | ...uth/webproxy_key.pem | 1 | 59f4482 | View in code |
59f4482
to
7024909
Compare
Based on feedback from @tigrato and @webvictim, and some supplementary discussions elsewhere, this PR has been modified to no longer reject new mix-and-match attempts. The self-repair logic should fix the existing issues caused by mix-and-match. Future features may still require that we formalize a more specific model of how mix-and-match should work (and possibly limit the cases where it is permitted), but for now we are going to preserve mix-and-match ability. |
7024909
to
563e7c3
Compare
71c75bf
to
afa3d8e
Compare
func getFreeListenAddr() (string, error) { | ||
l, err := net.Listen("tcp", "localhost:0") | ||
if err != nil { | ||
return "", trace.Wrap(err) | ||
} | ||
|
||
defer l.Close() | ||
return l.Addr().String(), nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any guarantees that returned listen address will still be free after the listener is closed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ish... binding on 0
and then immediately closing is a well documented pattern for finding free ports on *nix systems during tests, and from my research seems to be accounted for/expected by socket API implementers who try to ensure that the logic behind selecting a free port is either random or biased toward not rapidly reusing free ports. For example, on macOS binding 0
doesn't return a duplicate until it has cycled through returning all other available free ports first.
7e0f143
to
4f1fc65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to not block merging with the assumption that s/instance-assets/testdata
happens prior to merging
4f1fc65
to
8228c07
Compare
@fspmarshall See the table below for backport results.
|
Fixes an issue where mix-and-match of join tokens with different system role permissions over time would cause the instance certificate to be malformed. This could lead to various issues, including services not showing up in instance heartbeats, or service-level heartbeats failing to be emitted. The current fix works by having agents prove which system roles they hold via assertions in order to get their primary instance cert reissued. This is the same mechanism by which the instance certs were originally generated back in v10, and has been ported forward and modified to work within our existing cert reissue framework.
In the long run, a more well-defined and structured model for token mix-and-match will likely be needed in order to properly support future features like scoped RBAC and static label assignments. This PR initially sidestepped the issue by disallowing mix-and-match in v16 onwards, but after some discussion that has been deemed too drastic of a change. The most realistic alternative I know of at the time of writing would be to start always requiring that that new tokens also grant all old roles, and fully regenerating the agent's identity based on the new token. This has the upside of mostly preserving existing behavior and not requiring users to reset their data directories, but also presents some confusing edge-cases, such as what to do about identities related to services that aren't currently active on the agent, but were in the past.
Related: https://github.com/gravitational/teleport.e/pull/4240
Fixes: #38977
changelog: fixed an issue where mix-and-match of join tokens could interfere with some services appearing correctly in heartbeats.