
CRD upgrade race condition #10032

Closed
g-gaston opened this issue Jan 22, 2024 · 11 comments
Assignees: g-gaston
Labels: area/clusterctl, kind/bug, priority/important-soon, triage/accepted

Comments

@g-gaston
Contributor

What steps did you take and what happened?

Upgrade capi components with clusterctl from v1.5.2 to v1.6.0

clusterctl upgrade apply completes successfully. However, on the next change I make to the cluster, reconciliation for Machines gets stuck due to this error when reading the provider machine object:

"Reconciler error" err="failed to retrieve DockerMachine external object "my-ns"/"m-docker-etcd-1705429991949-cpfqz": failed to get restmapping: failed to get API group resources: unable to retrieve the complete list of server APIs: infrastructure.cluster.x-k8s.io/v1alpha4: the server could not find the requested resource" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="my-ns/m-docker-etcd-fr7bh" namespace="my-ns" name="m-docker-etcd-fr7bh"

This error continues in a loop and is not resolved until the container is restarted. Once the container is restarted, reconciliation continues without any further issue.

This error also appears (sometimes) in the MachineSet and MachineDeployment controllers.

Note: I have only been able to replicate this using EKS-A's fork, not the official CAPI artifacts. I believe this is due to the nature of the race condition and not to differences in code (I haven't found a single patch in the code path involved in this issue).

What did you expect to happen?

All controllers should be able to continue reconciliation after upgrading with clusterctl.

Cluster API version

v1.5.2 to v1.6.0

Kubernetes version

1.27

Anything else you would like to add?

TL;DR: CAPI v1.6.0 marks the v1alpha4 APIs as not served. The problem comes from controller-runtime caching the results of the call to list APIGroups. If this call returns v1alpha4 as one of the available versions for infrastructure.cluster.x-k8s.io, the client will try to get the APIResource definition for v1alpha4. If the call to get this APIResource is made after the API is marked as not served, the client will receive a not found error. Since v1alpha4 has already been cached as available, the client will keep making this call and receive the same error forever.

This is the stack for the previously mentioned error:

sigs.k8s.io/controller-runtime/pkg/client/apiutil.(*mapper).fetchGroupVersionResources
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/apiutil/restmapper.go:294
sigs.k8s.io/controller-runtime/pkg/client/apiutil.(*mapper).addKnownGroupAndReload
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/apiutil/restmapper.go:191
sigs.k8s.io/controller-runtime/pkg/client/apiutil.(*mapper).RESTMapping
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/apiutil/restmapper.go:122
sigs.k8s.io/controller-runtime/pkg/client/apiutil.IsGVKNamespaced
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/apiutil/apimachinery.go:96
sigs.k8s.io/controller-runtime/pkg/client/apiutil.IsObjectNamespaced
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/apiutil/apimachinery.go:90
sigs.k8s.io/controller-runtime/pkg/cache.(*multiNamespaceCache).Get
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/cache/multi_namespace_cache.go:202
sigs.k8s.io/controller-runtime/pkg/cache.(*delegatingByGVKCache).Get
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/cache/delegating_by_gvk_cache.go:44
sigs.k8s.io/controller-runtime/pkg/client.(*client).Get
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/client.go:348
sigs.k8s.io/cluster-api/controllers/external.Get
  sigs.k8s.io/cluster-api/controllers/external/util.go:43
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).reconcileExternal
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller_phases.go:106
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).reconcileInfrastructure
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller_phases.go:256
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).reconcile
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller.go:297
sigs.k8s.io/cluster-api/internal/controllers/machine.(*Reconciler).Reconcile
  sigs.k8s.io/cluster-api/internal/controllers/machine/machine_controller.go:222
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
runtime.goexit
  runtime/asm_amd64.s:1598

The first time a resource from a particular group is requested, the client's restmapper detects that the group hasn't been seen before and fetches the definitions for all the available versions of that group. The restmapper caches both the group and its versions.

If there is an error while fetching one of the available versions, the restmapper aborts and returns an error. The restmapper doesn't distinguish between different errors, so a 404 will result in this call failing until either the group is re-created or the cache is invalidated (and the only way to do that is to restart the program).

The issue here is that if the first call to get the available versions returns a version that has been (or will soon be) deleted, the client becomes unable to make requests for any resource of that group, regardless of the version.
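
To make the failure mode concrete, here is a minimal, self-contained toy model of the caching behavior described above. This is not the actual controller-runtime code; lazyGroupCache and its maps are made up for illustration. It only shows the key point: once a stale version list is cached for a group, every lookup for that group fails, regardless of the version requested.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the 404 the API server returns once a version is no longer served.
var errNotFound = errors.New("the server could not find the requested resource")

// lazyGroupCache is a toy model of the lazy restmapper behavior described above:
// the list of versions for a group is fetched once and never refreshed.
type lazyGroupCache struct {
	cachedVersions map[string][]string // group -> versions, populated on first use
	servedVersions map[string][]string // simulates what the API server currently serves
}

// restMapping mimics the failure mode: fetch resources for every cached version
// and abort on the first error, without distinguishing a 404 from other failures.
func (c *lazyGroupCache) restMapping(group string) error {
	versions, ok := c.cachedVersions[group]
	if !ok {
		// First request for this group: cache whatever the server returns now.
		versions = append([]string(nil), c.servedVersions[group]...)
		c.cachedVersions[group] = versions
	}
	for _, v := range versions {
		if !contains(c.servedVersions[group], v) {
			return fmt.Errorf("failed to get API group resources: %s/%s: %w", group, v, errNotFound)
		}
	}
	return nil
}

func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

func main() {
	c := &lazyGroupCache{
		cachedVersions: map[string][]string{},
		servedVersions: map[string][]string{
			"infrastructure.cluster.x-k8s.io": {"v1alpha4", "v1beta1"},
		},
	}

	// First lookup happens while v1alpha4 is still served: the stale list gets cached.
	fmt.Println(c.restMapping("infrastructure.cluster.x-k8s.io")) // <nil>

	// The CRD upgrade marks v1alpha4 as not served.
	c.servedVersions["infrastructure.cluster.x-k8s.io"] = []string{"v1beta1"}

	// Every subsequent lookup for the group now fails, even though v1beta1 is still served.
	fmt.Println(c.restMapping("infrastructure.cluster.x-k8s.io")) // error, and it keeps failing
}
```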

My hypothesis is that the first call to get the APIGroup is made either just before the CRDs are updated, or immediately after but before the kube-apiserver's cache is refreshed. This call then returns v1alpha4. Immediately afterwards, the restmapper's call to get the APIResource for infrastructure.cluster.x-k8s.io/v1alpha4 returns a not found error, since this version has already been marked as not served in the CRD.

IMO the long-term solution is to not fail if a cached group version is not found. This has already been implemented in controller-runtime and released as part of v0.17.0, but since it's marked as a breaking change, it doesn't seem like it will be backported to v0.16. We have already bumped controller-runtime to v0.17.0, but that bump won't be backported to our v1.6 branch, so we can't leverage this fix there.

I propose updating external.Get to be able to identify this issue and restart the controller when it happens. This restart should happen at most once (immediately after a CRD upgrade) and I suspect it won't be frequent, since so far I seem to be the only one who has hit this race condition.
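
A rough sketch of the detection idea follows, assuming a hypothetical helper isStaleDiscoveryError. This is not the change that was merged; matching on error strings is fragile and is shown only to illustrate the "detect the poisoned restmapper and exit once" approach.

```go
package main

import (
	"errors"
	"log"
	"os"
	"strings"
)

// isStaleDiscoveryError is a hypothetical check for the specific failure mode:
// the restmapper cached an API version that the server no longer serves.
// A real implementation would prefer a typed error if controller-runtime exposed one.
func isStaleDiscoveryError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "failed to get API group resources") &&
		strings.Contains(msg, "the server could not find the requested resource")
}

// exitOnStaleDiscovery restarts the controller (by exiting and letting the kubelet
// restart the container) so the restmapper cache is rebuilt from scratch.
func exitOnStaleDiscovery(err error) {
	if isStaleDiscoveryError(err) {
		log.Printf("restmapper cache appears stale after a CRD upgrade, exiting so the container restarts: %v", err)
		os.Exit(1)
	}
}

func main() {
	// Simulated error, matching the Machine controller log shown above.
	err := errors.New("failed to get restmapping: failed to get API group resources: " +
		"unable to retrieve the complete list of server APIs: " +
		"infrastructure.cluster.x-k8s.io/v1alpha4: the server could not find the requested resource")
	exitOnStaleDiscovery(err)
}
```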

In addition, we could change the upgrade logic in clusterctl to:

  1. Update all the CRDs (for core and all providers).
  2. Wait for the API server to return only the served versions for all groups (a rough polling sketch follows below).
  3. Only then, scale the controller deployments back up.

This wouldn't be enough to guarantee the issue doesn't happen, so I don't think it can be an alternative to restarting the controller when the issue is detected. Given that this change is more involved and is only required as a short-term solution for the v1.6 releases, I would vote to implement only the first change for now.
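
For step 2 above, a hedged sketch of what the wait could look like, using client-go discovery to poll until a given group no longer advertises a given version. The kubeconfig path and the hardcoded group/version are assumptions for illustration; clusterctl would have to derive the expected served versions from the CRDs it just applied.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForVersionRemoved polls discovery until the given group no longer
// advertises the given version (e.g. v1alpha4 after it is marked not served).
func waitForVersionRemoved(ctx context.Context, dc discovery.DiscoveryInterface, group, version string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		groups, err := dc.ServerGroups()
		if err != nil {
			// Transient discovery errors are expected while the API server refreshes
			// its caches; keep polling instead of failing.
			return false, nil
		}
		for _, g := range groups.Groups {
			if g.Name != group {
				continue
			}
			for _, v := range g.Versions {
				if v.Version == version {
					return false, nil // still advertised, keep waiting
				}
			}
		}
		return true, nil
	})
}

func main() {
	// Assumed kubeconfig path, purely for illustration.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := waitForVersionRemoved(context.Background(), dc, "infrastructure.cluster.x-k8s.io", "v1alpha4"); err != nil {
		panic(err)
	}
	fmt.Println("v1alpha4 is no longer advertised; safe to scale the controllers back up")
}
```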

Label(s) to be applied

/kind bug
/area clusterctl

@k8s-ci-robot added the kind/bug, area/clusterctl, and needs-triage labels on Jan 22, 2024
@g-gaston
Contributor Author

/assign

@sbueringer
Member

@g-gaston

IMO the long term solution is to not fail if a cached group version is not found. Actually, this has already been implemented in kubernetes-sigs/controller-runtime#2571.

I think this is not correct. We should get stuck earlier in the func here:

func (m *mapper) findAPIGroupByName(groupName string) (*metav1.APIGroup, error) {
	// Looking in the cache first.
	{
		m.mu.RLock()
		group, ok := m.apiGroups[groupName]
		m.mu.RUnlock()
		if ok {
			return group, nil
		}
	}

	// ... (on a cache miss, the rest of the function lists the server's API
	// groups and caches the result; omitted here for brevity)
}
We'll take a closer look to see if we can fix the lazy restmapper in CR

@sbueringer
Member

Sorry you're right. It hides/ignores the error. But I still think the lazy restmapper should be improved :)

@g-gaston
Contributor Author

Sorry you're right. It hides/ignores the error. But I still think the lazy restmapper should be improved :)

I'm happy to look into the lazy rest mapper and make changes

What would you suggest? A possible improvement I can see is to remove the v1alpha4 version from the known group when we get a NotFound while calling APIResource, so we avoid extra calls to APIResource. On the flip side, this could create the opposite race condition: APIGroups returns a newly available group, but APIResource returns a NotFound because of a stale cache or something similar.

A more robust option would be to implement some kind of cache invalidation: triggered by errors, periodically, or both, refresh the cached groups and versions.

@sbueringer
Member

@g-gaston @fabriziopandini and I met in a Zoom call and discussed options.

This is a rough summary of the issues in the lazy restmapper in CR:

The high-level problem is that once we have a certain apiGroup in m.apiGroups we never ever refresh it again. This can lead to problems if either 1. an apiVersion is removed or 2. an apiVersion is added.

1. apiVersion is removed

The specific problem we're hitting with this is that fetchGroupVersionResources will fail in CR v0.16.3. In CR >= v0.17.0 we are ignoring the error. But we are still calling ServerResourcesForGroupVersion for the missing apiVersion, which is not ideal.

We would basically suggest removing the specific apiGroup from m.apiGroups as soon as we get the not found error. Getting a not found on a specific version of an apiGroup signals pretty clearly that the apiGroup we cached is outdated (i.e. it contains a version that doesn't exist anymore). On the next call of findAPIGroupByName, m.apiGroups will be refreshed and everything recovers by itself.

@g-gaston will open a PR in CR with a proposed fix. To me at this point this seems like a non-breaking change and we'll ask if we can backport it into v0.16 & v0.17 (so we can get this issue fixed in CAPI).
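
A hedged sketch of the eviction idea, using the same kind of toy model as the earlier sketch rather than the real lazy restmapper (the actual change operates on m.apiGroups inside controller-runtime; everything below is made up for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotServed = errors.New("the server could not find the requested resource")

// toyMapper applies the suggested fix: on NotFound the whole group is evicted
// from the cache, so the next lookup re-fetches the served versions and recovers.
type toyMapper struct {
	cachedVersions map[string][]string // group -> cached versions (m.apiGroups analogue)
	servedVersions map[string][]string // what the API server currently serves
}

func (m *toyMapper) restMapping(group string) error {
	versions, ok := m.cachedVersions[group]
	if !ok {
		versions = append([]string(nil), m.servedVersions[group]...)
		m.cachedVersions[group] = versions
	}
	for _, v := range versions {
		if !served(m.servedVersions[group], v) {
			// Stale cache detected: evict the group instead of failing forever.
			delete(m.cachedVersions, group)
			return fmt.Errorf("%s/%s: %w (group evicted, next call refreshes the cache)", group, v, errNotServed)
		}
	}
	return nil
}

func served(list []string, v string) bool {
	for _, s := range list {
		if s == v {
			return true
		}
	}
	return false
}

func main() {
	m := &toyMapper{
		cachedVersions: map[string][]string{},
		servedVersions: map[string][]string{
			"infrastructure.cluster.x-k8s.io": {"v1alpha4", "v1beta1"},
		},
	}
	_ = m.restMapping("infrastructure.cluster.x-k8s.io") // caches v1alpha4 and v1beta1

	// The CRD upgrade marks v1alpha4 as not served.
	m.servedVersions["infrastructure.cluster.x-k8s.io"] = []string{"v1beta1"}

	fmt.Println(m.restMapping("infrastructure.cluster.x-k8s.io")) // fails once and evicts the group
	fmt.Println(m.restMapping("infrastructure.cluster.x-k8s.io")) // recovers: <nil>
}
```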

2. apiVersion is added

Basically we won't get any errors like in case 1, but the RESTMappings returned by the mapper would simply be missing the new apiVersion. A potential fix could be to add a TTL to m.apiGroups, e.g. so we call ServerGroups roughly every 5 minutes; this shouldn't be a performance issue I think.
We don't have a problem with this case in CAPI (we don't have code paths where this is a problem). I would open a separate issue for this one.
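
A small, self-contained sketch of the TTL idea. The ttlGroupCache type and its fetch hook are made up for illustration; the real change would live in the lazy restmapper and call ServerGroups.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ttlGroupCache sketches the TTL idea for case 2: cached API groups are refreshed
// after a fixed interval, so newly added apiVersions show up without a restart.
type ttlGroupCache struct {
	mu          sync.Mutex
	ttl         time.Duration
	lastRefresh time.Time
	groups      map[string][]string
	fetch       func() map[string][]string // stands in for ServerGroups
}

func (c *ttlGroupCache) get(group string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.groups == nil || time.Since(c.lastRefresh) > c.ttl {
		// Cache expired (or never populated): re-list all groups.
		c.groups = c.fetch()
		c.lastRefresh = time.Now()
	}
	return c.groups[group]
}

func main() {
	served := map[string][]string{"infrastructure.cluster.x-k8s.io": {"v1beta1"}}
	cache := &ttlGroupCache{ttl: 5 * time.Minute, fetch: func() map[string][]string { return served }}

	fmt.Println(cache.get("infrastructure.cluster.x-k8s.io")) // [v1beta1]

	// A new apiVersion is added; it becomes visible once the TTL expires.
	served["infrastructure.cluster.x-k8s.io"] = []string{"v1beta1", "v1beta2"}
	cache.lastRefresh = time.Now().Add(-10 * time.Minute) // simulate expiry
	fmt.Println(cache.get("infrastructure.cluster.x-k8s.io")) // [v1beta1 v1beta2]
}
```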

@fabriziopandini
Member

Let's keep this issue open for tracking, but the work will happen in CR
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jan 24, 2024
@sbueringer
Member

Yeah makes sense to keep it open here. At the very minimum we have to bump CR and want to verify that the issue is fixed in CAPI.

Worst case we might want to consider implementing a workaround if we can't get a fix into CR v0.16.x

@fabriziopandini
Member

/priority important-soon
@sbueringer @g-gaston can we close this now?

@k8s-ci-robot added the priority/important-soon label on Apr 11, 2024
@sbueringer
Member

I think yes, @g-gaston ?

@g-gaston
Contributor Author

Oh yes! This is done done
/close

@k8s-ci-robot
Contributor

@g-gaston: Closing this issue.

In response to this:

Oh yes! This is done done
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
