add self-repair for malformed instance certs #41467

fspmarshall · 2024-05-13T15:01:58Z

Fixes an issue where mix-and-match of join tokens with different system role permissions over time would cause the instance certificate to be malformed. This could lead to various issues, including services not showing up in instance heartbeats, or service-level heartbeats failing to be emitted. The current fix works by having agents prove which system roles they hold via assertions in order to get their primary instance cert reissued. This is the same mechanism by which the instance certs were originally generated back in v10, and has been ported forward and modified to work within our existing cert reissue framework.

In the long run, a more well-defined and structured model for token mix-and-match will likely be needed in order to properly support future features like scoped RBAC and static label assignments. This PR initially sidestepped the issue by disallowing mix-and-match in v16 onwards, but after some discussion that has been deemed too drastic of a change. The most realistic alternative I know of at the time of writing would be to start always requiring that that new tokens also grant all old roles, and fully regenerating the agent's identity based on the new token. This has the upside of mostly preserving existing behavior and not requiring users to reset their data directories, but also presents some confusing edge-cases, such as what to do about identities related to services that aren't currently active on the agent, but were in the past.

Related: https://github.com/gravitational/teleport.e/pull/4240

Fixes: #38977

changelog: fixed an issue where mix-and-match of join tokens could interfere with some services appearing correctly in heartbeats.

lib/service/connect.go

GavinFrazar

can we add a test that simulates an older agent starting up with/without new roles as well, i.e. where the initial local version state is not set?

api/proto/teleport/legacy/client/proto/authservice.proto

lib/auth/auth.go

lib/auth/auth_with_roles.go

lib/services/local/unstable.go

lib/services/local/unstable_test.go

lib/service/service.go

lib/service/connect.go

GavinFrazar · 2024-05-20T21:24:11Z

lib/service/connect.go

+	// a teleport instance that is running multiple services authorized by different join tokens, which poses challenges currently
+	// and may be a strictly invalid state in the future if we start enforcing access-control limitations on agents via their join tokens.
+	if role == types.RoleInstance && process.getLocalAuth() == nil {
+		identity, err := process.GetIdentity(role)


I just want to make sure I understand this right:

the first check in this func is for non "instance" roles, i.e. RoleDiscovery, RoleDatabase etc.). It needs to wait for the "instance" connector to be broadcast.

this check is for the "instance" role registration itself, so we look to see if we already have an instance identity stored

if we do have an instance identity already, assert that we either have all the instance roles in that identity OR that the identity is older than v16

if we dont have an identity already, then this is first time connect and we will write the teleport version in "InitialLocalVersion"

finally, the instance connector is made available to 1. and we assert again that the role is present in the instance identity

orca-security-us

Orca Security Scan Summary

Status	Check	Issues by priority
Passed	Infrastructure as Code	0 0 0 0	View in Orca
Failed	Secrets	1 0 0 0	View in Orca
Passed	Vulnerabilities	0 0 0 0	View in Orca

🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

	NAME	FILE	LINE NUM	COMMIT
	PEM File With Private Key	...uth/webproxy_key.pem	1	`f66cc33`	View in code

fspmarshall · 2024-05-24T04:19:49Z

"secrets" in the above scan result are self-signed certs used by a test.

webvictim · 2024-05-28T12:18:23Z

As someone who's been on the rough end of user experience with Teleport's join tokens for a while (and is particularly aware of the customer frustration/confusion which can exist around them), I wanted to suggest that it would be nicer if we allow the mix-and-match behaviour.

To be clear; I always tell people that they should use a join token which is valid for the full set of services that an agent provides. However, people don't always take advice, and the fact that things have just worked in this circumstance historically means that users have pre-set expectations. They don't understand the difference between an App cert and an Instance cert, and nor should they.

It sounds like there is an explicit requirement here for the agent's storage to be cleared so a new instance certificate can be issued? This will cause confusion and frustration for anyone who has mixed and matched tokens in the past and expects things to work. IMO, we should definitely make agents just "do the right thing" and automatically handle Instance cert regeneration/reissue as long as the agent has separate certs for each service it provides.

Also - I don't know whether we still advertise teleport app start and other similar commands as a viable strategy for adding more services to a running agent, but this seems in direct opposition to the spirit of this PR.

Related (for a different, but no less frustrating reason): #2838

tigrato · 2024-05-28T13:30:58Z

@fspmarshall

I have concerns about the current form of this PR and its impact on the product.

First, agents installed via the teleport-kube-agents Helm chart and running in Kubernetes store their credentials in Kubernetes secrets following the pattern {pod-name}-state. In contrast, agents running on nodes store their credentials in /var/lib/teleport. It's not as simple as saying cleanup all /var/lib/teleport folders for all agents/replicas you wish to update. Credentials are crucial in Teleport, yet it is not transparent how and when agents store them to disk or secrets. Enforcing a complete data wipe introduces other issues, starting with the server-id. The server-id is vital in SSH as it allows dialing to remote hosts using the server ID instead of relying on the host's name or hostname. Forcing users to change the server-id can disrupt automations if a specific host ID is hardcoded.

Second, adding or removing services from an agent configuration is a common practice. Users often begin with limited knowledge of Teleport, implementing and adopting it incrementally. They might start with Kubernetes and later expand to application or database services to expose assets running in Kubernetes. This incremental adoption is standard and should not be hindered. The effort required to expand a service, especially when customers purchase new licenses for additional protocols or request trials, would be so great that it would discourage implementation. This could be a significant drawback for larger customers who expand their usage regularly by purchasing new protocols or features after a successful rollout.

Instead, we should allow agents to exchange all their secrets for a new instance certificate valid for the appropriate services. By exchanging all secrets, I mean the agents should call an endpoint where, through a cryptographic challenge, they prove possession of the cert-key pair for a given role and a token for the new role they wish to adopt. After Auth validates the possession of the certificates for the current enabled services and the new token, it will issue a new instance certificate valid for all requested roles. This approach is more sustainable in the long term because:

It allows users to expand their current usage seamlessly.
Inventory will consistently display the correct instance profiles.
In the future, it will enable hot config reloads to activate new services with minimal disruption.

This strategy provides a better framework for scalability and flexibility, accommodating the evolving needs of our users without significant overhead or disruption.

When a service is disabled and moved to another service, Teleport should replace the stored cert-key pairs with a new instance certificate that has the necessary permissions for the remaining services. This ensures that only the required credentials are stored, enhancing security. Instead of retaining the cert-key pairs of all previously enabled services, we should transition to a more secure model where a new, less privileged instance certificate is issued. This certificate should be sufficient to meet the needs of the remaining active services without retaining unnecessary credentials.

orca-security-us

Orca Security Scan Summary

Status	Check	Issues by priority
Passed	Infrastructure as Code	0 0 0 0	View in Orca
Failed	Secrets	1 0 0 0	View in Orca
Passed	Vulnerabilities	0 0 0 0	View in Orca

🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

	NAME	FILE	LINE NUM	COMMIT
	PEM File With Private Key	...uth/webproxy_key.pem	1	`5af7a4e`	View in code

orca-security-us

Orca Security Scan Summary

Status	Check	Issues by priority
Passed	Infrastructure as Code	0 0 0 0	View in Orca
Failed	Secrets	1 0 0 0	View in Orca
Passed	Vulnerabilities	0 0 0 0	View in Orca

🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

	NAME	FILE	LINE NUM	COMMIT
	PEM File With Private Key	...uth/webproxy_key.pem	1	`59f4482`	View in code

fspmarshall · 2024-05-28T22:16:10Z

Based on feedback from @tigrato and @webvictim, and some supplementary discussions elsewhere, this PR has been modified to no longer reject new mix-and-match attempts. The self-repair logic should fix the existing issues caused by mix-and-match. Future features may still require that we formalize a more specific model of how mix-and-match should work (and possibly limit the cases where it is permitted), but for now we are going to preserve mix-and-match ability.

rosstimothy · 2024-05-29T13:05:40Z

integration/instance_test.go

+func getFreeListenAddr() (string, error) {
+	l, err := net.Listen("tcp", "localhost:0")
+	if err != nil {
+		return "", trace.Wrap(err)
+	}
+
+	defer l.Close()
+	return l.Addr().String(), nil
+}


Are there any guarantees that returned listen address will still be free after the listener is closed?

ish... binding on 0 and then immediately closing is a well documented pattern for finding free ports on *nix systems during tests, and from my research seems to be accounted for/expected by socket API implementers who try to ensure that the logic behind selecting a free port is either random or biased toward not rapidly reusing free ports. For example, on macOS binding 0 doesn't return a duplicate until it has cycled through returning all other available free ports first.

integration/instance-assets/agent/host_uuid

lib/auth/state/state.go

integration/instance_test.go

rosstimothy

Approving to not block merging with the assumption that s/instance-assets/testdata happens prior to merging

public-teleport-github-review-bot · 2024-05-29T23:26:35Z

@fspmarshall See the table below for backport results.

Branch	Result
branch/v13	Failed
branch/v14	Failed
branch/v15	Failed
branch/v16	Create PR

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 342d525 to 9302729 Compare May 15, 2024 18:04

fspmarshall marked this pull request as ready for review May 15, 2024 18:05

github-actions bot requested review from GavinFrazar and timothyb89 May 15, 2024 18:05

github-actions bot added the size/md label May 15, 2024

timothyb89 reviewed May 18, 2024

View reviewed changes

lib/service/connect.go Show resolved Hide resolved

lib/service/connect.go Show resolved Hide resolved

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 9302729 to 36792f5 Compare May 20, 2024 16:06

fspmarshall requested a review from timothyb89 May 20, 2024 16:14

GavinFrazar reviewed May 20, 2024

View reviewed changes

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 36792f5 to d521ba0 Compare May 22, 2024 23:13

orca-security-us bot reviewed May 24, 2024

View reviewed changes

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from f66cc33 to 5af7a4e Compare May 28, 2024 19:42

orca-security-us bot reviewed May 28, 2024

View reviewed changes

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 5af7a4e to 59f4482 Compare May 28, 2024 21:40

orca-security-us bot reviewed May 28, 2024

View reviewed changes

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 59f4482 to 7024909 Compare May 28, 2024 22:08

fspmarshall requested a review from GavinFrazar May 28, 2024 22:16

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 7024909 to 563e7c3 Compare May 28, 2024 22:55

fspmarshall added backport/branch/v14 backport/branch/v15 labels May 28, 2024

fspmarshall changed the title ~~add self-repair for malformed instance certs and explicitly disallow future mix-and-match of join tokens~~ add self-repair for malformed instance certs May 28, 2024

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch 2 times, most recently from 71c75bf to afa3d8e Compare May 28, 2024 23:45

timothyb89 approved these changes May 29, 2024

View reviewed changes

rosstimothy added the backport/branch/v16 label May 29, 2024

rosstimothy linked an issue May 29, 2024 that may be closed by this pull request

Upgrading from 14.3.18 to 14.3.20 broke configured agent applications #42040

Closed

rosstimothy reviewed May 29, 2024

View reviewed changes

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 7e0f143 to 4f1fc65 Compare May 29, 2024 18:53

rosstimothy approved these changes May 29, 2024

View reviewed changes

self-repair instance certs

8228c07

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 4f1fc65 to 8228c07 Compare May 29, 2024 22:47

fspmarshall added the backport/branch/v13 label May 29, 2024

fspmarshall enabled auto-merge May 29, 2024 22:58

fspmarshall added this pull request to the merge queue May 29, 2024

Merged via the queue into master with commit f198694 May 29, 2024
41 checks passed

fspmarshall deleted the fspmarshall/reject-new-partial-tokens branch May 29, 2024 23:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add self-repair for malformed instance certs #41467

add self-repair for malformed instance certs #41467

fspmarshall commented May 13, 2024 •

edited

GavinFrazar left a comment

GavinFrazar May 20, 2024 •

edited

orca-security-us bot left a comment

fspmarshall commented May 24, 2024

webvictim commented May 28, 2024 •

edited

tigrato commented May 28, 2024

orca-security-us bot left a comment

orca-security-us bot left a comment

fspmarshall commented May 28, 2024

rosstimothy May 29, 2024

fspmarshall May 29, 2024

rosstimothy left a comment

public-teleport-github-review-bot bot commented May 29, 2024

add self-repair for malformed instance certs #41467

add self-repair for malformed instance certs #41467

Conversation

fspmarshall commented May 13, 2024 • edited

GavinFrazar left a comment

Choose a reason for hiding this comment

GavinFrazar May 20, 2024 • edited

Choose a reason for hiding this comment

orca-security-us bot left a comment

Choose a reason for hiding this comment

Orca Security Scan Summary

fspmarshall commented May 24, 2024

webvictim commented May 28, 2024 • edited

tigrato commented May 28, 2024

orca-security-us bot left a comment

Choose a reason for hiding this comment

Orca Security Scan Summary

orca-security-us bot left a comment

Choose a reason for hiding this comment

Orca Security Scan Summary

fspmarshall commented May 28, 2024

rosstimothy May 29, 2024

Choose a reason for hiding this comment

fspmarshall May 29, 2024

Choose a reason for hiding this comment

rosstimothy left a comment

Choose a reason for hiding this comment

public-teleport-github-review-bot bot commented May 29, 2024

fspmarshall commented May 13, 2024 •

edited

GavinFrazar May 20, 2024 •

edited

webvictim commented May 28, 2024 •

edited