
Implement custom containerd behavior for Windows Agents, ensure supporting processes exit #5419

Merged

Conversation

@HarrisonWAffel (Contributor) commented Feb 9, 2024

Proposed Changes

Update the pebinary and staticpod executors to conform to the updated k3s executor interface, which enables greater control over how containerd and docker are managed.

The pebinary executor partially reimplements the logic for starting containerd, with one minor change: if the containerd process is killed for any reason, os.Exit will not be invoked; instead, containerd will be restarted after 5 seconds. This is in line with how we start and restart other core components within the Windows agent, such as the kubelet.
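
A minimal sketch of this restart behavior, assuming a supervisor function along these lines (the name runContainerd and the exact argument plumbing are illustrative, not the actual rke2 code):

```go
package pebinary

import (
	"context"
	"os"
	"os/exec"
	"time"

	"github.com/sirupsen/logrus"
)

// runContainerd restarts containerd after a short delay whenever it exits,
// instead of calling os.Exit, and only returns once the agent context is
// cancelled. Sketch only; names and argument plumbing are illustrative.
func runContainerd(ctx context.Context, binary string, args, env []string) {
	for {
		cmd := exec.CommandContext(ctx, binary, args...)
		cmd.Env = env
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr

		logrus.Infof("Running containerd %v", args)
		if err := cmd.Run(); err != nil {
			logrus.Errorf("containerd exited: %v", err)
		}

		select {
		case <-ctx.Done():
			// Shutdown was requested; let the agent coordinate cleanup of
			// the remaining processes rather than exiting here.
			return
		case <-time.After(5 * time.Second):
			// containerd was killed for some other reason; restart it.
		}
	}
}
```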

This change ensures that the rke2 agent process and the Windows service do not prematurely exit alongside containerd, giving enough time for the signals.Context cancellation to fully propagate and stop all processes spawned from rke2/k3s.

In order to better coordinate how the RKE2 Windows service shuts down, additional logic has been added to monitor processes spawned by RKE2 for up to ten seconds. This ensures that the agent process does not exit until all other processes have been cleaned up or the time limit is reached.

The docker implementation for Windows agents has not been changed, and the pebinary executor simply invokes the k3s implementation for docker.

The Linux staticpod executor invokes the existing implementation within k3s for containerd and docker, resulting in no behavior changes for Linux agents.

Types of Changes

Updates to the pebinary executor, staticpod executor, and service_windows

Verification

This can be verified by joining an RKE2 Windows node to an existing cluster, manually stopping the rke2 service using PowerShell, and ensuring that all rke2 processes stop at the same time.

To verify this manually, I have done the following:

  • Provision a Linux server node and start rke2 with rke2 server --cni calico
  • Compile a Windows rke2 agent binary with the changes in this PR
  • Configure the Windows node to pull runtime images from a custom registry populated with the required images by adding the system-default-registry field within /etc/rancher/rke2/config.yaml
  • Add the rke2 Windows agent service and start it using Start-Service
  • Wait for the service to start and for the Windows agent to join the cluster successfully
  • Open Task Manager, run Stop-Service rke2, and ensure that all processes are stopped alongside rke2.exe. Optionally, watch the RKE2 service logs with Get-EventLog -LogName Application -Source rke2
  • Start the rke2 service again and ensure that no issues are encountered during startup

Testing

The existing end-to-end tests covering mixed-OS clusters (tests/e2e/mixedos/mixedos_test.go) already verify that the Windows agent starts and behaves as expected. I'm happy to help expand the e2e tests to cover this specific scenario if we feel it's necessary.

Linked Issues

#2204

User-Facing Change

NONE

Further Comments

Since this change uses the updated executor interface from k3s, I've bumped the k3s version to the latest commit on master (cfc3a124eed6) at the time of raising this PR.

@HarrisonWAffel HarrisonWAffel requested a review from a team as a code owner February 9, 2024 20:54
return true, nil
}

// MonitorProcessExit ensures that the kubelet, kube-proxy, calico-node, and containerd processes have stopped running.
Contributor

This is an interesting way to do this. I'm not a big fan of inline PowerShell and shelling out to do things that we should be able to do from native Go.

Can we not use a WaitGroup when starting these processes through the PEBinaryExecutor, and just wait on that when shutting down?

Contributor Author

A bit of a face-palm moment on my side - a WaitGroup is a much better idea. After looking into it a bit, I found apimachinery/pkg/util/wait, which offers wait groups that utilize contexts. I spent some time Friday and this morning testing the changes, and everything works as desired without any inlined PowerShell.

Let me know what you think. Also, I wanted to get your opinion on whether it would be a good idea to add a timeout for the WaitGroup.Wait call. I can't think of a situation where one of these commands would continue to run after the context cancellation, but I want to avoid a situation where we might wait forever for the WaitGroup to complete.

Contributor

No, I think this is good. If we're blocking on child process exit and want to be sure they get cleaned up, I think blocking shutdown indefinitely is appropriate.
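
For reference, a minimal sketch of the pattern that was settled on here, using the wait.Group type from k8s.io/apimachinery/pkg/util/wait shared by all spawned components (ProcessWaitGroup mirrors the diff further down; the helper names are placeholders):

```go
package win

import (
	"context"

	"k8s.io/apimachinery/pkg/util/wait"
)

// ProcessWaitGroup tracks every long-running process the Windows agent spawns
// (kubelet, kube-proxy, calico-node, containerd, ...).
var ProcessWaitGroup wait.Group

// startComponent launches a component under the shared context; run is
// expected to return once ctx is cancelled and the child process has exited.
func startComponent(ctx context.Context, run func(context.Context)) {
	ProcessWaitGroup.StartWithContext(ctx, run)
}

// waitForChildren blocks shutdown (indefinitely, per the discussion above)
// until every tracked goroutine has returned, so the Windows service is not
// reported as stopped while child processes are still running.
func waitForChildren() {
	ProcessWaitGroup.Wait()
}
```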

@HarrisonWAffel HarrisonWAffel force-pushed the windows-agent-containerd-behavior branch from bab1b5a to 7264dea Compare February 12, 2024 17:41
@dereknola (Contributor)

I would love to see the E2E test expanded, since you already have a manual verification method.

@HarrisonWAffel (Contributor Author)

@dereknola Is this something that could be done as a follow-up PR? My understanding is that the code freeze for the Feb patches is EOD, and so far I've had trouble setting up the e2e test suite on an Ubuntu VM.

@dereknola (Contributor) commented Feb 12, 2024

It can happen in another PR. Two things for that:

  1. Open an issue to track adding the E2E test
  2. Expand the current Verification section of this PR with more info/commands on how you verified the changes.

If you give me enough info on it, I can easily take over the testing changes.

@@ -614,6 +616,16 @@ func (s *StaticPodConfig) ETCD(ctx context.Context, args executor.ETCDConfig, ex
return staticpod.Run(s.ManifestsDir, spa)
}

// Containerd starts the k3s implementation of containerd
Contributor

staticpod is the executor implementation for rke2-linux, hence I think the comment is wrong

Contributor Author

The intent here is to call out that the rke2 static pod executor invokes the k3s implementation for containerd and docker. Current versions of rke2 already rely on the implementation within k3s; however, now that the executor interface has been expanded, I wanted to make it clear that the static pod executor invokes the k3s implementation.

return containerd.Run(ctx, config)
}

// Docker starts the k3s implementation of cridockerd
Contributor

same

Contributor

I think this is correct; the comment is just calling out that all this does is wrap the k3s implementation.

@@ -160,9 +165,9 @@ func (p *PEBinaryConfig) Kubelet(ctx context.Context, args []string) error {
cleanArgs = append(cleanArgs, arg)
}

logrus.Infof("Running RKE2 kubelet %v", cleanArgs)
go func() {
win.ProcessWaitGroup.StartWithContext(ctx, func(ctx context.Context) {
Contributor

Now that we use a context here, do you think it makes sense to still have a separate one for the CNI (cniCtx)?

Contributor Author

The initial implementation of pebinary had access to this context as well. IIUC, using a separate context ensures that the CNI plugin is restarted each time the kubelet is restarted, which may happen before the global context is canceled (for example, if someone accidentally runs Stop-Process on the kubelet). Since this PR does not focus on how the kubelet starts and stops the CNI plugins, I opted to leave this unchanged.
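
For illustration, a rough sketch of that pattern, with hypothetical helpers runKubeletOnce and startCNI standing in for the real functions:

```go
package pebinary

import "context"

// superviseKubelet runs the CNI plugin under a context derived from each
// kubelet invocation, so if the kubelet process dies and is relaunched before
// the global agent context is cancelled, the CNI plugin is torn down and
// restarted along with it. Sketch only; not the actual rke2 code.
func superviseKubelet(ctx context.Context, runKubeletOnce func(context.Context) error, startCNI func(context.Context)) {
	for ctx.Err() == nil {
		cniCtx, cancel := context.WithCancel(ctx)
		go startCNI(cniCtx)
		_ = runKubeletOnce(ctx) // blocks until the kubelet process exits
		cancel()                // stop the CNI plugin so both restart together
	}
}
```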

Comment on lines +256 to +257
stdOut := io.Writer(os.Stdout)
stdErr := io.Writer(os.Stderr)
Contributor

Did you manage to get containerd logs in Windows when using this? I had a lot of trouble and was unable to do it, so by default I decided to dump logs to files (kube-proxy, etc.).

Contributor Author

Yes, I was able to find the logs in C:\var\lib\rancher\rke2\agent\containerd\containerd.log

pair[0] = strings.TrimPrefix(pair[0], "CONTAINERD_")
cenv = append(cenv, strings.Join(pair, "="))
default:
env = append(env, strings.Join(pair, "="))
Contributor

Shouldn't it only check for CONTAINERD_ envs?

Contributor Author

This mirrors the k3s implementation; I'm not sure what the implications would be if we start ignoring specific environment variables.

Contributor

Yeah, this is all existing logic copy-pasted from k3s. It strips CONTAINERD_ from vars with that prefix and passes the rest through as-is. The intent is not to drop the unprefixed ones.
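
For clarity, a small sketch of that splitting behavior (it mirrors the snippet above; not the exact k3s/rke2 code):

```go
package pebinary

import (
	"os"
	"strings"
)

// splitContainerdEnv strips the CONTAINERD_ prefix from matching variables and
// collects them separately, while every other variable is passed through
// unchanged rather than dropped.
func splitContainerdEnv() (env, cenv []string) {
	for _, v := range os.Environ() {
		pair := strings.SplitN(v, "=", 2)
		if len(pair) != 2 {
			continue
		}
		if strings.HasPrefix(pair[0], "CONTAINERD_") {
			pair[0] = strings.TrimPrefix(pair[0], "CONTAINERD_")
			cenv = append(cenv, strings.Join(pair, "="))
		} else {
			env = append(env, strings.Join(pair, "="))
		}
	}
	return env, cenv
}
```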

}

for {
logrus.Infof("Running containerd %s", config.ArgString(args[1:]))
Contributor

It might be worth logging the env variables too; that's what we do in Calico to help with debugging. If we are picking up all of them, it probably makes more sense to log them only in Debug=true scenarios.

Contributor Author

I believe I have addressed this

@brandond (Contributor) Feb 12, 2024

I would avoid logging env vars. Using env vars to pass through secrets or credentials is a pattern that we want to continue to enable, since they don't get logged or show up in ps output.

Contributor Author

Ah, that's a good point. I'll update this.

@HarrisonWAffel HarrisonWAffel force-pushed the windows-agent-containerd-behavior branch from 7264dea to 0ea7135 Compare February 12, 2024 18:52
…ticpod executor

Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>
Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>
@HarrisonWAffel HarrisonWAffel force-pushed the windows-agent-containerd-behavior branch from 0ea7135 to 0cdd703 Compare February 12, 2024 19:55
@HarrisonWAffel HarrisonWAffel merged commit 3f6c90b into rancher:master Feb 12, 2024
2 checks passed