SkewTest failing gce-1.10-master-upgrade-cluster-new job in sig-release-master-upgrade #63506
@AishSundar: GitHub didn't allow me to assign the following users: for, triage. Note that only kubernetes members and repo collaborators can be assigned.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I don't think this is sig-testing.
/unassign
/sig cluster-lifecycle
@timothysc can you see if this is a cluster-lifecycle issue, please?
Chiming in for cluster-lifecycle: this is weird. Somehow the nodes are being put back into a cordoned state after the upgrade. The upgrade checks that, after a node is upgraded, it becomes both ready and schedulable: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/upgrade.sh#L347-L360

Here's an example from the linked failure:

I0505 21:53:28.217] == Recreating instance bootstrap-e2e-minion-group-srfd. ==
I0505 21:53:29.848] == Waiting for new node to be added to k8s. ==
I0505 21:55:56.502] ..............................................................................................................................== Waiting for bootstrap-e2e-minion-group-srfd to become ready. ==
I0505 21:55:56.502] Node bootstrap-e2e-minion-group-srfd recreated.
I0505 21:55:56.789] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:55:58.067] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:55:59.371] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:00.670] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:01.964] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:03.272] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:04.582] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:05.886] Node bootstrap-e2e-minion-group-srfd Ready=True

But just a few lines below that, we can see that the cluster validation after the upgrade shows 2 nodes are still unschedulable:

I0505 21:56:13.446] == Finished upgrading nodes to v1.11.0-alpha.2.110+8fcaa19dd06f19. ==
I0505 21:56:13.447] == Validating cluster post-upgrade ==
I0505 21:56:13.489] Validating gce cluster, MULTIZONE=
I0505 21:56:13.490] Project: k8s-jkns-gce-slow-1-4
I0505 21:56:13.490] Network Project: k8s-jkns-gce-slow-1-4
I0505 21:56:13.490] Zone: us-central1-f
I0505 21:56:14.846] Found 4 node(s).
I0505 21:56:15.005] NAME STATUS ROLES AGE VERSION
I0505 21:56:15.006] bootstrap-e2e-master Ready,SchedulingDisabled <none> 27m v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-l1rn Ready <none> 8m v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-ls61 Ready,SchedulingDisabled <none> 27m v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-srfd Ready,SchedulingDisabled <none> 26m v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.168] Validate output:
I0505 21:56:15.314] NAME STATUS MESSAGE ERROR
I0505 21:56:15.314] etcd-1 Healthy {"health": "true"}
I0505 21:56:15.314] etcd-0 Healthy {"health": "true"}
I0505 21:56:15.314] scheduler Healthy ok
I0505 21:56:15.314] controller-manager Healthy ok
I0505 21:56:15.316] Cluster validation succeeded
I0505 21:56:15.317] == Post-Upgrade Node OS and Kubelet Versions ==
I0505 21:56:15.487] name: "bootstrap-e2e-master", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.487] name: "bootstrap-e2e-minion-group-l1rn", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.488] name: "bootstrap-e2e-minion-group-ls61", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.488] name: "bootstrap-e2e-minion-group-srfd", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.493] May 5 21:56:15.492: INFO: Waiting up to 5m0s for all nodes to be ready after the upgrade
I0505 21:56:15.499] May 5 21:56:15.499: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:56:35.503] May 5 21:56:35.503: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:56:55.507] May 5 21:56:55.507: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:15.511] May 5 21:57:15.511: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:35.516] May 5 21:57:35.516: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:55.520] May 5 21:57:55.520: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:15.524] May 5 21:58:15.524: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:35.529] May 5 21:58:35.529: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:55.533] May 5 21:58:55.533: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:15.538] May 5 21:59:15.538: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:35.542] May 5 21:59:35.542: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:55.547] May 5 21:59:55.547: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:15.551] May 5 22:00:15.551: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:35.556] May 5 22:00:35.555: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:55.560] May 5 22:00:55.560: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:01:15.563] May 5 22:01:15.561: INFO: Unexpected error occurred: timeout waiting 5m0s for number of ready nodes to be 3

Does someone from the node team know why this might be happening? Is it a race condition? It worked for one node, but not two in this case. In other runs, I see it working for two nodes and only leaving one unschedulable. cc @kubernetes/sig-node-bugs
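The ready-and-schedulable check that upgrade.sh performs per node can be sketched roughly as follows. This is a minimal, stubbed sketch: `get_node_line` stands in for `kubectl get node "$node" --no-headers`, and the real script's checks differ in detail.

```shell
# Rough sketch of the post-recreate check: a node must be both Ready and
# schedulable before the upgrade moves on. get_node_line is a stub standing
# in for `kubectl get node "$node" --no-headers`, echoing a STATUS column
# like the one seen in the failure above.
get_node_line() {
  echo "bootstrap-e2e-minion-group-srfd   Ready,SchedulingDisabled   <none>   26m"
}

# Second column is the STATUS field (e.g. "Ready" or "Ready,SchedulingDisabled").
status=$(get_node_line | awk '{print $2}')
case "$status" in
  Ready) result="node ready and schedulable" ;;
  *SchedulingDisabled*) result="node still cordoned: $status" ;;
  *) result="node not ready: $status" ;;
esac
echo "$result"
# prints: node still cordoned: Ready,SchedulingDisabled
```

With the stubbed status above, the check correctly flags the node as cordoned even though it is Ready, which is exactly the post-upgrade state the validation output shows.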
/cc @dchen1107 @derekwaynecarr Can either of you please help investigate the node side of this issue to unblock @krousey? This is currently one of the few tests failing our sig-release upgrade-blocking suite. If this is not fixed, we will have to delay beta for 1.11, which is slated for 5/15.
/cc @jiayingz @Random-Liu
I am also running into this issue while trying to add a new upgrade test. First, I think the reason the test log said the node was ready is that the code is looking at the wrong field for SchedulingDisabled. Unschedulable is a field in node.spec, not a node status condition, so I think we should check node.spec instead.
I am also puzzled about why recreating the instance just rebooted the node. According to the 'gcloud instance-groups managed recreate-instances' doc, that command should delete the instance first and create a new one. For the passing cluster upgrade tests, that seems to be the case, but for the failed ones, I saw the node was just rebooted and so still got the old unschedulable state from the API server. While trying to figure out why recreating the instance sometimes just reboots the node, would it make sense to explicitly use 'gcloud instance-groups managed delete-instances' and 'gcloud instance-groups managed create-instances'?
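The spec-versus-status distinction above can be illustrated with a minimal sketch. The helper name and the stub Node JSON are illustrative only; in a real cluster the input would come from `kubectl get node "$node" -o json`.

```shell
# Sketch: derive schedulability from spec.unschedulable, not from the Ready
# status condition. Reads a Node object as JSON on stdin; a cordoned node
# carries "unschedulable": true in its spec even while status says Ready.
is_schedulable() {
  ! grep -q '"unschedulable": true'
}

# Stub Node spec for illustration (real input: kubectl get node -o json):
if echo '{"spec": {"unschedulable": true}}' | is_schedulable; then
  echo "schedulable"
else
  echo "cordoned"
fi
# prints: cordoned
```

This matches the failure mode in the logs: a node can report Ready=True in status.conditions while still being cordoned via spec.unschedulable.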
@dchen1107 Thanks, I'll work on a PR and send it out today.
From my manual run, I ssh-ed into the unschedulable node after the test and saw that the kubelet log contains multiple restarts; it logged an event that the node was previously registered and then later changed to NodeNotSchedulable based on the spec it got from the API server:

May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.587210 1073 kubelet_node_status.go:126] Node e2e-test-jiayingz-minion-group-ssbv was previously registered
@jiayingz That fits with the current theory. The instance is being recreated, but the kubelet is no longer re-registering itself since the externalID is the same.
Automatic merge from submit-queue (batch tested with PRs 62665, 62194, 63616, 63672, 63450). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Uncordon the node after upgrade

**What this PR does / why we need it**: Previous logic relied on the node recreating the node API object and, as a side effect, uncordoning itself. A change went in that no longer ensures the node recreates itself, so the bug in this logic was exposed.

**Which issue(s) this PR fixes**: Fixes #63506

**Release note**: ```release-note NONE ```

cc @dchen1107 @AishSundar
/reopen
@krousey: you can't re-open an issue/PR unless you authored it or you are assigned to it. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
/reopen
Note that this bug is flaky; it doesn't happen 100% of the time. Looks like we got a 100% pass in the latest run: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1431
Here's an upgrade test that's hitting this flake: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-parallel/3384
@dchen1107 @jiayingz
@jingxu97 Can you move that to another issue? It is a legitimate bug that the upgrade logic should not rely on the kubelet recreating the node object. I agree that no longer doing that may lead to other issues, but that should be a separate discussion.
Automatic merge from submit-queue (batch tested with PRs 63367, 63718, 63446, 63723, 63720). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Move uncordon to after the node is ready

**What this PR does / why we need it**: It is possible that the node API object doesn't exist in a brief window between recreation and registering. By moving the uncordon until after the node is ready, we can be sure the API object exists.

**Which issue(s) this PR fixes**: Fixes #63506 #63499

**Release note**: ```release-note NONE ```
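The reordering described in this PR can be sketched as a stubbed control-flow example. `node_ready` and `uncordon` are stand-ins for the real kubectl calls, and the node name is taken from the logs above; the actual upgrade.sh logic differs in detail.

```shell
# Sketch of the fixed ordering: poll until the recreated node reports Ready,
# and only then uncordon it, so the Node API object is guaranteed to exist
# when the uncordon runs. node_ready and uncordon are stubs; a real script
# would call kubectl here instead.
attempts=0
node_ready() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]   # pretend the node becomes Ready on the third poll
}
uncordon() { echo "uncordoned $1"; }

node="bootstrap-e2e-minion-group-srfd"
until node_ready "$node"; do
  sleep 0   # a real loop would sleep a few seconds between polls
done
uncordon "$node"
# prints: uncordoned bootstrap-e2e-minion-group-srfd
```

Sequencing the uncordon after the readiness wait closes the window in which the node object is absent (or stale) between instance recreation and kubelet registration.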
@krousey I see the test failing in the last few runs. Is it a different issue plaguing the test now? https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-1.10-master-upgrade-cluster-new
/reopen
[MILESTONENOTIFIER] Milestone Issue Needs Approval
@AishSundar @krousey @timothysc @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-node-misc
Action required: This issue must have the Issue Labels
Here are the Commit changes since the last failure. |
Spoke to @krousey offline; SkewTest is an overall result for the second half of the suite. So if the other tests in the job pass, this one should turn the corner. Closing this bug for now.
Automatic merge from submit-queue (batch tested with PRs 64344, 64709, 64717, 63631, 58647). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add gpu cluster upgrade test.

**Special notes for your reviewer**: Currently running the GPUMasterUpgrade test should pass with gpu nodes, but running the GPUClusterUpgrade test will run into #63506

**Release note**: ```release-note ```
Failing Job
gce-1.10-master-upgrade-cluster-new
Gubernator logs
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1422
Triage
https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new&test=SkewTest
/kind bug
/priority failing-test
/priority important-soon
/sig testing
/milestone v1.11
@kubernetes/sig-testing-bugs
cc @jberkus @tpepper
/assign @BenTheElder for triage