spire-agent: re-attest without restarting #4991

sorindumitru · 2024-03-18T11:25:28Z

Pull Request check list

Commit conforms to CONTRIBUTING.md?
Proper tests/regressions included?
Documentation updated?

Affected functionality
spire-agent behaviour in case of eviction or expired agent SVID

Description of change
When an agent is evicted it can re-attest to reconnect to spire-server but it currently needs to restart to do that. To avoid unavailability periods, which can lead to noticeable latency in workloads, reattest in process.

amartinezfayo

Thank you @sorindumitru for this contribution!

amartinezfayo · 2024-04-17T13:31:18Z

pkg/agent/svid/rotator.go

+		return fmt.Errorf("unexpected value type: %T", r.state.Value())
+	}
+
+	if state.Reattestable && !r.c.DisableReattestToRenew {


I would probably separate this into two different conditions and return separate errors, so it's easier to understand and debug what happened.

amartinezfayo · 2024-04-17T13:31:42Z

pkg/agent/svid/rotator.go

+	if state.Reattestable && !r.c.DisableReattestToRenew {
+		err = r.reattest(ctx)
+	} else {
+		return fmt.Errorf("attestation method is not re-attestable or re-attestation disabled")


No formatting is needed here.
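
To illustrate the point, here is a small sketch contrasting the two constructors. The function names `reattestUnavailable` and `unexpectedType` are hypothetical; the messages are the ones from the diff. `errors.New` is enough when the message has no format verbs, while `fmt.Errorf` is only needed when something is interpolated.

```go
package main

import (
	"errors"
	"fmt"
)

// reattestUnavailable (hypothetical name) returns a constant message,
// so errors.New is sufficient and no formatting is involved.
func reattestUnavailable() error {
	return errors.New("attestation method is not re-attestable or re-attestation disabled")
}

// unexpectedType (hypothetical name) interpolates the dynamic type with %T,
// so fmt.Errorf is the right constructor here.
func unexpectedType(v any) error {
	return fmt.Errorf("unexpected value type: %T", v)
}

func main() {
	fmt.Println(reattestUnavailable())
	fmt.Println(unexpectedType(42)) // prints "unexpected value type: int"
}
```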

MarcosDY · 2024-05-03T15:40:49Z

pkg/agent/manager/manager.go

+			m.deleteSVID()
+			return err
+		}
+		goto restart


instead of a goto, can you add a for before util.RunTasks?

for { err := util.RunTasks(ctx, ... }

MarcosDY · 2024-05-03T18:30:08Z

pkg/agent/manager/manager.go

@@ -204,9 +205,14 @@ func (m *manager) Run(ctx context.Context) error {
 		m.c.Log.Info("Cache manager stopped")
 		return nil
 	case nodeutil.ShouldAgentReattest(err):


NIT: there is no unit test that verifies this (the original ShouldAgentReattest case),
so it is hard for me to ask you to add one here...
In any case it is not a blocker. If you can add a unit test, that would be great; if not, we can add it later.

sorindumitru · 2024-05-10

I'll see if I can do something about this. The test doesn't use an attestor yet; is there some mock attestor available that can be used?

MarcosDY · 2024-05-03T18:32:35Z

pkg/agent/svid/rotator.go

@@ -137,6 +138,29 @@ func (r *rotator) SetRotationFinishedHook(f func()) {
 	r.rotationFinishedHook = f
 }

+func (r *rotator) Reattest(ctx context.Context) (err error) {


can you add unit tests for this new function?

MarcosDY · 2024-05-03T18:35:42Z

pkg/agent/svid/rotator.go

+		if !r.c.DisableReattestToRenew {
+			err = r.reattest(ctx)
+		} else {
+			return errors.New("re-attestation is disabled")


Not sure about this error, since a user must set DisableReattestToRenew to get into this case...
Maybe it is worth having a single if?

if state.Reattestable && !r.c.DisableReattestToRenew {

And if that is the case, this code is pretty much the same as what is in rotateSVIDIfNeeded; can we expose that instead?

sorindumitru · 2024-05-10

That's what I had before, and @amartinezfayo asked me to change it. Either way is fine with me.

rotateSVIDIfNeeded can't be used directly because that only rotates if the SVID is expired.

MarcosDY · 2024-05-03T18:40:37Z

test/integration/suites/evict-agent/08-delete-agent

@@ -1,21 +0,0 @@
-#!/bin/bash
-
-log-debug "deleting agent..."


This IT exercises:

evict agent

start again

evict again

So our current test case is to evict and verify that the agent was re-attested, with the agent never restarting and staying alive without a timeout.
May we update this IT to verify that?

sorindumitru · 2024-05-10

I've added some checks that the agent was able to re-attest. Note that I had to make a number of changes in this test because it was unreliable on my machine. For example, a server was brought up and the test wouldn't wait for it to be available before trying to use it.

MarcosDY · 2024-05-03T18:42:50Z

test/integration/suites/node-re-attestation/03-evict-agents

-check-attested-agents
+
+# spire-agent-a will re-attest but spire-agent-b won't because join_token implements trust on first use model.
+AGENT_A_SPIFFE_ID_PATH="/spire/agent/x509pop/$(fingerprint conf/agent/agent.crt.pem)"


can you move this PATH at the start and use it to create AGENT_A_SPIFFE_ID?

MarcosDY · 2024-05-03T18:46:13Z

test/integration/suites/spire-server-cli/06-agent

@@ -78,16 +78,6 @@ docker-compose exec -T spire-server \
 docker-compose exec -T spire-server \
 	/opt/spire/bin/spire-server agent evict -spiffeID "$agentID1" | grep "Agent evicted successfully" || fail-now "failed to evict agent 1"

-# Verify agent list after evict


NIT: if I follow this... this is no longer required because the attested agent was using a reattestable attestor,
and now we evict but the agent recovers by itself.
However, may we add a test case with a non-reattestable node, so we can verify that the agent was really re-attested?
Or catch some logs to verify we had a successful re-attestation?


sorindumitru requested review from evan2645, amartinezfayo, azdagron, MarcosDY and rturner3 as code owners March 18, 2024 11:25

evan2645 assigned MarcosDY and amartinezfayo Mar 19, 2024

amartinezfayo reviewed Apr 17, 2024


sorindumitru force-pushed the reattest branch from 502fc39 to 3824bc4 Compare April 28, 2024 08:42

azdagron added this to the 1.10.0 milestone May 2, 2024

MarcosDY reviewed May 3, 2024


sorindumitru force-pushed the reattest branch from 3824bc4 to 3a50920 Compare May 10, 2024 15:16

sorindumitru added 5 commits May 10, 2024 16:18

Add a 'Reattest()' test

110d299

Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>

Use for loop instead of goto

4a09e57

Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>

Verify agent reattested in evict-agent test

6f495f6

Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>

bring up checks in 06-agent

3a50920

they didn't seem to need to be removed

Signed-off-by: Sorin Dumitru <sdumitru@bloomberg.net>

MarcosDY approved these changes May 21, 2024


amartinezfayo approved these changes May 21, 2024


Merge branch 'main' into reattest

9eebedb

MarcosDY merged commit e33fb84 into spiffe:main May 21, 2024
33 checks passed
