Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add self-repair for malformed instance certs #41467

Merged
merged 1 commit into from
May 29, 2024

Conversation

fspmarshall
Copy link
Contributor

@fspmarshall fspmarshall commented May 13, 2024

Fixes an issue where mix-and-match of join tokens with different system role permissions over time would cause the instance certificate to be malformed. This could lead to various issues, including services not showing up in instance heartbeats, or service-level heartbeats failing to be emitted. The current fix works by having agents prove which system roles they hold via assertions in order to get their primary instance cert reissued. This is the same mechanism by which the instance certs were originally generated back in v10, and has been ported forward and modified to work within our existing cert reissue framework.

In the long run, a more well-defined and structured model for token mix-and-match will likely be needed in order to properly support future features like scoped RBAC and static label assignments. This PR initially sidestepped the issue by disallowing mix-and-match in v16 onwards, but after some discussion that has been deemed too drastic of a change. The most realistic alternative I know of at the time of writing would be to start always requiring that that new tokens also grant all old roles, and fully regenerating the agent's identity based on the new token. This has the upside of mostly preserving existing behavior and not requiring users to reset their data directories, but also presents some confusing edge-cases, such as what to do about identities related to services that aren't currently active on the agent, but were in the past.


Related: https://github.com/gravitational/teleport.e/pull/4240

Fixes: #38977

changelog: fixed an issue where mix-and-match of join tokens could interfere with some services appearing correctly in heartbeats.

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 342d525 to 9302729 Compare May 15, 2024 18:04
@fspmarshall fspmarshall marked this pull request as ready for review May 15, 2024 18:05
lib/service/connect.go Show resolved Hide resolved
lib/service/connect.go Show resolved Hide resolved
@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 9302729 to 36792f5 Compare May 20, 2024 16:06
Copy link
Contributor

@GavinFrazar GavinFrazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test that simulates an older agent starting up with/without new roles as well, i.e. where the initial local version state is not set?

lib/auth/auth.go Outdated Show resolved Hide resolved
lib/auth/auth_with_roles.go Outdated Show resolved Hide resolved
lib/auth/auth_with_roles.go Show resolved Hide resolved
lib/services/local/unstable.go Outdated Show resolved Hide resolved
lib/services/local/unstable.go Outdated Show resolved Hide resolved
lib/services/local/unstable_test.go Outdated Show resolved Hide resolved
lib/service/service.go Show resolved Hide resolved
lib/service/connect.go Outdated Show resolved Hide resolved
// a teleport instance that is running multiple services authorized by different join tokens, which poses challenges currently
// and may be a strictly invalid state in the future if we start enforcing access-control limitations on agents via their join tokens.
if role == types.RoleInstance && process.getLocalAuth() == nil {
identity, err := process.GetIdentity(role)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to make sure I understand this right:

  1. the first check in this func is for non "instance" roles, i.e. RoleDiscovery, RoleDatabase etc.). It needs to wait for the "instance" connector to be broadcast.
  2. this check is for the "instance" role registration itself, so we look to see if we already have an instance identity stored
  3. if we do have an instance identity already, assert that we either have all the instance roles in that identity OR that the identity is older than v16
  4. if we dont have an identity already, then this is first time connect and we will write the teleport version in "InitialLocalVersion"
  5. finally, the instance connector is made available to 1. and we assert again that the role is present in the instance identity

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 36792f5 to d521ba0 Compare May 22, 2024 23:13
Copy link

@orca-security-us orca-security-us bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orca Security Scan Summary

Status Check Issues by priority
Passed Passed Infrastructure as Code high 0   medium 0   low 0   info 0 View in Orca
Failed Failed Secrets high 1   medium 0   low 0   info 0 View in Orca
Passed Passed Vulnerabilities high 0   medium 0   low 0   info 0 View in Orca
🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

NAME FILE LINE NUM COMMIT
high PEM File With Private Key ...uth/webproxy_key.pem 1 f66cc33 View in code

@fspmarshall
Copy link
Contributor Author

"secrets" in the above scan result are self-signed certs used by a test.

@webvictim
Copy link
Contributor

webvictim commented May 28, 2024

As someone who's been on the rough end of user experience with Teleport's join tokens for a while (and is particularly aware of the customer frustration/confusion which can exist around them), I wanted to suggest that it would be nicer if we allow the mix-and-match behaviour.

To be clear; I always tell people that they should use a join token which is valid for the full set of services that an agent provides. However, people don't always take advice, and the fact that things have just worked in this circumstance historically means that users have pre-set expectations. They don't understand the difference between an App cert and an Instance cert, and nor should they.

It sounds like there is an explicit requirement here for the agent's storage to be cleared so a new instance certificate can be issued? This will cause confusion and frustration for anyone who has mixed and matched tokens in the past and expects things to work. IMO, we should definitely make agents just "do the right thing" and automatically handle Instance cert regeneration/reissue as long as the agent has separate certs for each service it provides.

Also - I don't know whether we still advertise teleport app start and other similar commands as a viable strategy for adding more services to a running agent, but this seems in direct opposition to the spirit of this PR.

Related (for a different, but no less frustrating reason): #2838

@tigrato
Copy link
Contributor

tigrato commented May 28, 2024

@fspmarshall

I have concerns about the current form of this PR and its impact on the product.

First, agents installed via the teleport-kube-agents Helm chart and running in Kubernetes store their credentials in Kubernetes secrets following the pattern {pod-name}-state. In contrast, agents running on nodes store their credentials in /var/lib/teleport. It's not as simple as saying cleanup all /var/lib/teleport folders for all agents/replicas you wish to update. Credentials are crucial in Teleport, yet it is not transparent how and when agents store them to disk or secrets. Enforcing a complete data wipe introduces other issues, starting with the server-id. The server-id is vital in SSH as it allows dialing to remote hosts using the server ID instead of relying on the host's name or hostname. Forcing users to change the server-id can disrupt automations if a specific host ID is hardcoded.

Second, adding or removing services from an agent configuration is a common practice. Users often begin with limited knowledge of Teleport, implementing and adopting it incrementally. They might start with Kubernetes and later expand to application or database services to expose assets running in Kubernetes. This incremental adoption is standard and should not be hindered. The effort required to expand a service, especially when customers purchase new licenses for additional protocols or request trials, would be so great that it would discourage implementation. This could be a significant drawback for larger customers who expand their usage regularly by purchasing new protocols or features after a successful rollout.

Instead, we should allow agents to exchange all their secrets for a new instance certificate valid for the appropriate services. By exchanging all secrets, I mean the agents should call an endpoint where, through a cryptographic challenge, they prove possession of the cert-key pair for a given role and a token for the new role they wish to adopt. After Auth validates the possession of the certificates for the current enabled services and the new token, it will issue a new instance certificate valid for all requested roles. This approach is more sustainable in the long term because:

  • It allows users to expand their current usage seamlessly.
  • Inventory will consistently display the correct instance profiles.
  • In the future, it will enable hot config reloads to activate new services with minimal disruption.

This strategy provides a better framework for scalability and flexibility, accommodating the evolving needs of our users without significant overhead or disruption.

When a service is disabled and moved to another service, Teleport should replace the stored cert-key pairs with a new instance certificate that has the necessary permissions for the remaining services. This ensures that only the required credentials are stored, enhancing security. Instead of retaining the cert-key pairs of all previously enabled services, we should transition to a more secure model where a new, less privileged instance certificate is issued. This certificate should be sufficient to meet the needs of the remaining active services without retaining unnecessary credentials.

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from f66cc33 to 5af7a4e Compare May 28, 2024 19:42
Copy link

@orca-security-us orca-security-us bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orca Security Scan Summary

Status Check Issues by priority
Passed Passed Infrastructure as Code high 0   medium 0   low 0   info 0 View in Orca
Failed Failed Secrets high 1   medium 0   low 0   info 0 View in Orca
Passed Passed Vulnerabilities high 0   medium 0   low 0   info 0 View in Orca
🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

NAME FILE LINE NUM COMMIT
high PEM File With Private Key ...uth/webproxy_key.pem 1 5af7a4e View in code

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 5af7a4e to 59f4482 Compare May 28, 2024 21:40
Copy link

@orca-security-us orca-security-us bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orca Security Scan Summary

Status Check Issues by priority
Passed Passed Infrastructure as Code high 0   medium 0   low 0   info 0 View in Orca
Failed Failed Secrets high 1   medium 0   low 0   info 0 View in Orca
Passed Passed Vulnerabilities high 0   medium 0   low 0   info 0 View in Orca
🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

NAME FILE LINE NUM COMMIT
high PEM File With Private Key ...uth/webproxy_key.pem 1 59f4482 View in code

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 59f4482 to 7024909 Compare May 28, 2024 22:08
@fspmarshall
Copy link
Contributor Author

Based on feedback from @tigrato and @webvictim, and some supplementary discussions elsewhere, this PR has been modified to no longer reject new mix-and-match attempts. The self-repair logic should fix the existing issues caused by mix-and-match. Future features may still require that we formalize a more specific model of how mix-and-match should work (and possibly limit the cases where it is permitted), but for now we are going to preserve mix-and-match ability.

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 7024909 to 563e7c3 Compare May 28, 2024 22:55
@fspmarshall fspmarshall changed the title add self-repair for malformed instance certs and explicitly disallow future mix-and-match of join tokens add self-repair for malformed instance certs May 28, 2024
@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch 2 times, most recently from 71c75bf to afa3d8e Compare May 28, 2024 23:45
Comment on lines +82 to +88
func getFreeListenAddr() (string, error) {
l, err := net.Listen("tcp", "localhost:0")
if err != nil {
return "", trace.Wrap(err)
}

defer l.Close()
return l.Addr().String(), nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any guarantees that returned listen address will still be free after the listener is closed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ish... binding on 0 and then immediately closing is a well documented pattern for finding free ports on *nix systems during tests, and from my research seems to be accounted for/expected by socket API implementers who try to ensure that the logic behind selecting a free port is either random or biased toward not rapidly reusing free ports. For example, on macOS binding 0 doesn't return a duplicate until it has cycled through returning all other available free ports first.

integration/instance-assets/agent/host_uuid Outdated Show resolved Hide resolved
lib/auth/state/state.go Show resolved Hide resolved
integration/instance_test.go Outdated Show resolved Hide resolved
@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 7e0f143 to 4f1fc65 Compare May 29, 2024 18:53
Copy link
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to not block merging with the assumption that s/instance-assets/testdata happens prior to merging

@fspmarshall fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 4f1fc65 to 8228c07 Compare May 29, 2024 22:47
@fspmarshall fspmarshall added this pull request to the merge queue May 29, 2024
Merged via the queue into master with commit f198694 May 29, 2024
41 checks passed
@fspmarshall fspmarshall deleted the fspmarshall/reject-new-partial-tokens branch May 29, 2024 23:24
@public-teleport-github-review-bot

@fspmarshall See the table below for backport results.

Branch Result
branch/v13 Failed
branch/v14 Failed
branch/v15 Failed
branch/v16 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants