
🐛 Fix KCP Controller reconcile always returning an error when the workload cluster is unreachable #9342

Conversation

@Levi080513 (Contributor) commented Aug 30, 2023

What this PR does / why we need it:
In PR #3082, we started treating a failure to connect to the workload cluster as a transient error, and stopped adding that error to the reconcile reterr:

if err := r.updateStatus(ctx, controlPlane); err != nil {
	var connFailure *internal.RemoteClusterConnectionError
	if errors.As(err, &connFailure) {
		// Connection failures are transient: log at info level and do not
		// aggregate them into reterr, so reconcile can still return cleanly.
		log.Info("Could not connect to workload cluster to fetch status", "err", err.Error())
	} else {
		// Any other status-update failure is a real error and is surfaced.
		log.Error(err, "Failed to update KubeadmControlPlane Status")
		reterr = kerrors.NewAggregate([]error{reterr, err})
	}
}
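
For illustration only (not the actual internal package code), a minimal, self-contained sketch of the errors.As pattern used above; RemoteClusterConnectionError here is a stand-in type:

package main

import (
	"errors"
	"fmt"
)

// Stand-in for internal.RemoteClusterConnectionError; the real type lives in
// KCP's internal package and carries the underlying connection failure.
type RemoteClusterConnectionError struct {
	Name string
	Err  error
}

func (e *RemoteClusterConnectionError) Error() string { return e.Name + ": " + e.Err.Error() }
func (e *RemoteClusterConnectionError) Unwrap() error { return e.Err }

func main() {
	// Simulate updateStatus failing because the workload cluster is
	// unreachable, with the connection failure wrapped by an outer error.
	err := fmt.Errorf("failed to update status: %w", &RemoteClusterConnectionError{
		Name: "default/my-cluster",
		Err:  errors.New("connection refused"),
	})

	// errors.As walks the wrap chain, so the check matches even when the
	// connection error is not the outermost error.
	var connFailure *RemoteClusterConnectionError
	if errors.As(err, &connFailure) {
		fmt.Println("transient connection failure:", connFailure.Name)
	}
}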

But this improvement was lost when we migrated to getting the workload ClusterClient from the ClusterCacheTracker. As a result, reconcile always returns an error while the workload cluster is unreachable, and the expected reconcile retry interval (RequeueAfter) may not take effect, because controller-runtime does not honor Result.RequeueAfter when Reconcile also returns an error.

So we should re-add this improvement.
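
To make the controller-runtime contract concrete, here is a minimal sketch of the two return shapes (DemoReconciler and the helper functions are hypothetical stand-ins, not CAPI code): when Reconcile returns a non-nil error, the request is requeued with rate-limited backoff and Result.RequeueAfter is ignored.

package demo

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// DemoReconciler is a hypothetical reconciler that only illustrates how
// controller-runtime treats Result.RequeueAfter versus a returned error.
type DemoReconciler struct{}

func (r *DemoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	err := pingWorkloadCluster(ctx)
	if err == nil {
		return ctrl.Result{}, nil
	}
	if isTransientConnectionError(err) {
		// Honored: the error is swallowed (logged elsewhere), so the
		// request is requeued after the fixed 15s delay.
		return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}
	// Not honored: the non-nil error wins, and controller-runtime requeues
	// with exponential backoff instead of the fixed delay.
	return ctrl.Result{RequeueAfter: 15 * time.Second}, err
}

// Hypothetical helpers, stubbed so the sketch is self-contained.
func pingWorkloadCluster(ctx context.Context) error { return nil }

func isTransientConnectionError(err error) bool { return false }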

Test
Before re-adding this improvement, when the first CP Machine fails to initialize, KCP reconcile always returns an error and RequeueAfter does not take effect.

I0831 12:58:38.532983       1 controller.go:286] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID=583dcf35-ea2d-4b09-99a9-dc2072740d84 Cluster="default/hw-sks-test-1.24.13-05"
I0831 12:58:38.576764       1 controller.go:414] "Scaling up control plane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID=583dcf35-ea2d-4b09-99a9-dc2072740d84 Cluster="default/hw-sks-test-1.24.13-05" Desired=3 Existing=1
I0831 12:58:38.580040       1 scale.go:213] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID=583dcf35-ea2d-4b09-99a9-dc2072740d84 Cluster="default/hw-sks-test-1.24.13-05" failures="[machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have APIServerPodHealthy condition, machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have ControllerManagerPodHealthy condition, machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have SchedulerPodHealthy condition, machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have EtcdPodHealthy condition, machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have EtcdMemberHealthy condition]"
E0831 12:58:48.590095       1 controller.go:199] "Failed to update KubeadmControlPlane Status" err="failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/hw-sks-test-1.24.13-05\": Get \"https://10.255.26.35:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID=583dcf35-ea2d-4b09-99a9-dc2072740d84 Cluster="default/hw-sks-test-1.24.13-05"
E0831 12:58:48.593623       1 controller.go:329] "Reconciler error" err="failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/hw-sks-test-1.24.13-05\": Get \"https://10.255.26.35:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID=583dcf35-ea2d-4b09-99a9-dc2072740d84

After re-adding this improvement, when the first CP Machine fails to initialize, KCP retries reconciliation every 15s.

I0831 12:45:03.963264       1 controller.go:340] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID="e6052bad-883f-44b8-a227-85e7e6f06622" Cluster="default/hw-sks-test-1.24.13-05"
I0831 12:45:03.986402       1 controller.go:432] "Scaling up control plane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID="e6052bad-883f-44b8-a227-85e7e6f06622" Cluster="default/hw-sks-test-1.24.13-05" Desired=3 Existing=1
I0831 12:45:03.987209       1 recorder.go:104] "events: Waiting for control plane to pass preflight checks to continue reconciliation: Machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have a corresponding Node yet (Machine.status.nodeRef not set)" type="Warning" object={"kind":"KubeadmControlPlane","namespace":"default","name":"hw-sks-test-1.24.13-05-controlplane","uid":"b34f6982-6643-4433-a0e2-5db6e00df841","apiVersion":"controlplane.cluster.x-k8s.io/v1beta1","resourceVersion":"6432542"} reason="ControlPlaneUnhealthy"
I0831 12:45:03.988335       1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID="e6052bad-883f-44b8-a227-85e7e6f06622" Cluster="default/hw-sks-test-1.24.13-05" failures="Machine hw-sks-test-1.24.13-05-controlplane-b4p25 does not have a corresponding Node yet (Machine.status.nodeRef not set)"
I0831 12:45:03.989069       1 cluster_cache_tracker.go:272] "Creating new cluster accessor" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID="e6052bad-883f-44b8-a227-85e7e6f06622" Cluster="default/hw-sks-test-1.24.13-05" cluster="default/hw-sks-test-1.24.13-05"
I0831 12:45:13.997396       1 controller.go:206] "Could not connect to workload cluster to fetch status" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID="e6052bad-883f-44b8-a227-85e7e6f06622" Cluster="default/hw-sks-test-1.24.13-05" err="failed to create remote cluster client: default/hw-sks-test-1.24.13-05: failed to create cluster accessor: error creating client for remote cluster \"default/hw-sks-test-1.24.13-05\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.255.26.35:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
I0831 12:45:14.005046       1 controller.go:331] "Reconcile done, requeueing after 15s" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/hw-sks-test-1.24.13-05-controlplane" namespace="default" name="hw-sks-test-1.24.13-05-controlplane" reconcileID="e6052bad-883f-44b8-a227-85e7e6f06622"

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

/area provider/control-plane-kubeadm

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 30, 2023
@k8s-ci-robot (Contributor)

Hi @Levi080513. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 30, 2023
@killianmuldoon (Contributor) left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 30, 2023
@jessehu (Contributor) commented Sep 2, 2023

Thanks @killianmuldoon. Is this PR ready for merge?

@killianmuldoon (Contributor) left a comment

/area provider/control-plane-kubeadm

Thanks for this!
/lgtm

@k8s-ci-robot k8s-ci-robot added area/provider/control-plane-kubeadm Issues or PRs related to KCP and removed do-not-merge/needs-area PR is missing an area label labels Sep 4, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: e17d1d3a60491cfa8f9ea6e34f17d4677dd1569a

@fabriziopandini (Member)

/lgtm
cc @sbueringer

@Levi080513 (Contributor, Author) commented Sep 17, 2023

@sbueringer It would be great if you had time to review this PR.

@killianmuldoon (Contributor) left a comment

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: killianmuldoon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 2e0acc6 into kubernetes-sigs:main Sep 18, 2023
42 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.6 milestone Sep 18, 2023
@sbueringer (Member)

/lgtm

The PR is fine; I've opened a PR to optimize this code a bit further: #9448 (review)

@sbueringer (Member)

Should we backport this PR and #9448?

@killianmuldoon (Contributor)

/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@killianmuldoon: new pull request created: #9449

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon (Contributor)

/cherry-pick release-1.4

@k8s-infra-cherrypick-robot

@killianmuldoon: new pull request created: #9450

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
