
SkewTest failing gce-1.10-master-upgrade-cluster-new job in sig-release-master-upgrade #63506

Closed
AishSundar opened this issue May 7, 2018 · 27 comments · Fixed by #63616 or #63718
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. milestone/needs-approval priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node.

@AishSundar
Contributor

Failing Job

gce-1.10-master-upgrade-cluster-new

Gubernator logs

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1422

Triage

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new&test=SkewTest

/kind bug
/priority failing-test
/priority important-soon
/sig testing
/milestone v1.11
@kubernetes/sig-testing-bugs

cc @jberkus @tpepper

/assign @BenTheElder for triage

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.11 milestone May 7, 2018
@k8s-ci-robot
Contributor

@AishSundar: GitHub didn't allow me to assign the following users: for, triage.

Note that only kubernetes members and repo collaborators can be assigned.

In response to this:

Failing Job

gce-1.10-master-upgrade-cluster-new

Gubernator logs

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1422

Triage

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new&test=SkewTest

/kind bug
/priority failing-test
/priority important-soon
/sig testing
/milestone v1.11
@kubernetes/sig-testing-bugs

cc @jberkus @tpepper

/assign @BenTheElder for triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2018
@BenTheElder
Member

I don't think this is sig-testing. This looks like [sig-cluster-lifecycle], [sig-autoscaling], [sig-storage].

/unassign
/remove-sig testing

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. and removed sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2018
@AishSundar
Contributor Author

/sig cluster-lifecycle
/assign @timothysc

@timothysc can you see if this is a cluster lifecycle issue please?

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 7, 2018
@krousey
Contributor

krousey commented May 8, 2018

Chiming in for cluster-lifecycle: this is weird. Somehow the nodes are being put back into a cordoned state after the upgrade. The upgrade checks that, after a node is upgraded, it becomes both ready and schedulable: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/upgrade.sh#L347-L360
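In sketch form, that post-upgrade check amounts to polling the node's Ready condition and cordon state. The helper below is hypothetical, written for illustration, not the actual upgrade.sh code; note that readiness comes from status conditions while cordoning lives in spec.unschedulable:

```shell
# Hypothetical sketch of the post-upgrade wait, not the real upgrade.sh code.
wait_for_node_ready() {
  local node="$1"
  local deadline=$(( $(date +%s) + 300 ))  # 5-minute timeout
  while [ "$(date +%s)" -lt "${deadline}" ]; do
    local ready unschedulable
    # Readiness is a status condition on the node object.
    ready=$(kubectl get node "${node}" \
      --output='jsonpath={.status.conditions[?(@.type=="Ready")].status}')
    # Cordoning is recorded in spec.unschedulable, not in status.
    unschedulable=$(kubectl get node "${node}" \
      --output='jsonpath={.spec.unschedulable}')
    if [ "${ready}" = "True" ] && [ "${unschedulable}" != "true" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```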

Here's an example from the linked failure:

I0505 21:53:28.217] == Recreating instance bootstrap-e2e-minion-group-srfd. ==
I0505 21:53:29.848] == Waiting for new node to be added to k8s.  ==
I0505 21:55:56.502] ..............................................................................................................................== Waiting for bootstrap-e2e-minion-group-srfd to become ready. ==
I0505 21:55:56.502] Node bootstrap-e2e-minion-group-srfd recreated.
I0505 21:55:56.789] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:55:58.067] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:55:59.371] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:00.670] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:01.964] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:03.272] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:04.582] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:05.886] Node bootstrap-e2e-minion-group-srfd Ready=True

But just a few lines below that, we can see that the cluster validation after the upgrade shows 2 nodes still unschedulable.

I0505 21:56:13.446] == Finished upgrading nodes to v1.11.0-alpha.2.110+8fcaa19dd06f19. ==
I0505 21:56:13.447] == Validating cluster post-upgrade ==
I0505 21:56:13.489] Validating gce cluster, MULTIZONE=
I0505 21:56:13.490] Project: k8s-jkns-gce-slow-1-4
I0505 21:56:13.490] Network Project: k8s-jkns-gce-slow-1-4
I0505 21:56:13.490] Zone: us-central1-f
I0505 21:56:14.846] Found 4 node(s).
I0505 21:56:15.005] NAME                              STATUS                     ROLES     AGE       VERSION
I0505 21:56:15.006] bootstrap-e2e-master              Ready,SchedulingDisabled   <none>    27m       v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-l1rn   Ready                      <none>    8m        v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-ls61   Ready,SchedulingDisabled   <none>    27m       v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-srfd   Ready,SchedulingDisabled   <none>    26m       v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.168] Validate output:
I0505 21:56:15.314] NAME                 STATUS    MESSAGE              ERROR
I0505 21:56:15.314] etcd-1               Healthy   {"health": "true"}   
I0505 21:56:15.314] etcd-0               Healthy   {"health": "true"}   
I0505 21:56:15.314] scheduler            Healthy   ok                   
I0505 21:56:15.314] controller-manager   Healthy   ok                   
I0505 21:56:15.316] Cluster validation succeeded
I0505 21:56:15.317] == Post-Upgrade Node OS and Kubelet Versions ==
I0505 21:56:15.487] name: "bootstrap-e2e-master", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.487] name: "bootstrap-e2e-minion-group-l1rn", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.488] name: "bootstrap-e2e-minion-group-ls61", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.488] name: "bootstrap-e2e-minion-group-srfd", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.493] May  5 21:56:15.492: INFO: Waiting up to 5m0s for all nodes to be ready after the upgrade
I0505 21:56:15.499] May  5 21:56:15.499: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:56:35.503] May  5 21:56:35.503: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:56:55.507] May  5 21:56:55.507: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:15.511] May  5 21:57:15.511: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:35.516] May  5 21:57:35.516: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:55.520] May  5 21:57:55.520: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:15.524] May  5 21:58:15.524: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:35.529] May  5 21:58:35.529: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:55.533] May  5 21:58:55.533: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:15.538] May  5 21:59:15.538: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:35.542] May  5 21:59:35.542: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:55.547] May  5 21:59:55.547: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:15.551] May  5 22:00:15.551: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:35.556] May  5 22:00:35.555: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:55.560] May  5 22:00:55.560: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:01:15.563] May  5 22:01:15.561: INFO: Unexpected error occurred: timeout waiting 5m0s for number of ready nodes to be 3

Does someone from the node team know why this might be happening? Is it a race condition? It worked for one node, but not two in this case. In other runs, I see it working for two nodes and only leaving one unschedulable.

cc @kubernetes/sig-node-bugs

@AishSundar
Contributor Author

/cc @dchen1107 @derekwaynecarr can either of you please help investigate the node side of this issue to unblock @krousey? This is currently one of the few tests failing in our sig-release upgrade-blocking suite. If this is not fixed, we will have to delay beta for 1.11, which is slated for 5/15.

@AishSundar
Contributor Author

/cc @dchen1107 @derekwaynecarr

@AishSundar
Contributor Author

cc @dchen1107 @derekwaynecarr

@jiayingz
Contributor

jiayingz commented May 9, 2018

/cc @jiayingz @Random-Liu

I am also running into this issue when trying to add a new upgrade test. First, I think the reason the test log said the node was ready is that the code is looking at the wrong field for SchedulingDisabled. The unschedulable field lives in node.spec, not in the node status conditions, so I think we should instead use:
local cordoned=$("${KUBE_ROOT}/cluster/kubectl.sh" get node "${instance}" --output='jsonpath={.spec.unschedulable}')

I am also puzzled about why recreating the instance just rebooted the node. According to the 'gcloud compute instance-groups managed recreate-instances' docs, that command should delete the instance first and create a new one. For the passing cluster upgrade tests that seems to be the case, but in the failing ones I saw the node simply reboot, so it still had the old unschedulable state from the API server.

While we figure out why recreating an instance sometimes just reboots the node, would it make sense to explicitly use 'gcloud compute instance-groups managed delete-instances' and 'gcloud compute instance-groups managed create-instances'?
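For illustration, an explicit delete-then-create flow might look like the sketch below. The helper name and the group/zone arguments are assumptions, not code from the repo, and a real version would also need to wait for the deletion to finish:

```shell
# Hypothetical sketch: replace a managed instance explicitly instead of
# using 'recreate-instances'. All names here are placeholders.
replace_instance() {
  local group="$1" zone="$2" instance="$3"
  gcloud compute instance-groups managed delete-instances "${group}" \
    --instances="${instance}" --zone="${zone}"
  # A real implementation would wait for the delete to complete before
  # recreating an instance under the same name.
  gcloud compute instance-groups managed create-instances "${group}" \
    --instances="${instance}" --zone="${zone}"
}
```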

@krousey
Contributor

krousey commented May 9, 2018

@dchen1107 Thanks, I'll work on a PR and send it out today.

@jiayingz
Contributor

jiayingz commented May 9, 2018

From my manual run, I ssh-ed into the unschedulable node after the test and saw that the kubelet log contained multiple restarts. It logged that the node was previously registered and then later changed to NodeNotSchedulable, based on the spec it got from the API server:

1754: May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.587210 1073 kubelet_node_status.go:126] Node e2e-test-jiayingz-minion-group-ssbv was previously registered
1755: May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.587238 1073 kubelet_node_status.go:85] Successfully registered node e2e-test-jiayingz-minion-group-ssbv
...
1789: May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.602719 1073 server.go:442] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"e2e-test-jiayingz-minion-group-ssbv", UID:"e2e-test-jiayingz-minion-group-ssbv", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotSchedulable' Node e2e-test-jiayingz-minion-group-ssbv status is now: NodeNotSchedulable

@krousey
Contributor

krousey commented May 9, 2018

@jiayingz That fits with the current theory. The instance is being recreated, but the kubelet is no longer re-registering itself since the externalID is the same.

k8s-github-robot pushed a commit that referenced this issue May 10, 2018
Automatic merge from submit-queue (batch tested with PRs 62665, 62194, 63616, 63672, 63450). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Uncordon the node after upgrade

**What this PR does / why we need it**:
Previous logic was relying on the node to recreate the node API object
and, as a side-effect, uncordon itself. A change went in that no
longer ensures the node recreates itself, so the bug in this logic was exposed.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #63506

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```

cc @dchen1107 @AishSundar
@jiayingz
Contributor

/cc @msau42 @jingxu97

@AishSundar
Contributor Author

@krousey was #63616 sufficient to fix this issue, or are we also waiting on #63631 to be merged? I am still waiting for the latest run to verify that the issue is fixed.

@krousey
Contributor

krousey commented May 11, 2018

/reopen
There's a bug with my fix. It fails if the uncordon command happens in the window where the node object doesn't exist. I will have a fix out shortly.

@k8s-ci-robot
Contributor

@krousey: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
There's a bug with my fix. It fails if the uncordon command happens in the window where the node object doesn't exist. I will have a fix out shortly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krousey
Contributor

krousey commented May 11, 2018

/assign

@krousey
Contributor

krousey commented May 11, 2018

/reopen

@k8s-ci-robot k8s-ci-robot reopened this May 11, 2018
@krousey
Contributor

krousey commented May 11, 2018

Note that this bug is flaky. It doesn't happen 100% of the time. Looks like we got 100% pass in the latest run: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1431

Here's an upgrade test that's hitting this flake: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-parallel/3384

@jingxu97
Contributor

@dchen1107 @jiayingz
cc @mikedanese
Please see my comments on the change at https://github.com/kubernetes/kubernetes/pull/61877/files#diff-6b0808bd1afb15f9f77986f4459601c2L147 and #61877 (comment).
Before the change, when an instance was recreated, the kubelet would delete the node object and a new one would be created. Now, I think, the kubelet will not delete the old object if the node name is the same. This might cause trouble for volumes, because the old node object will contain outdated volumesAreAttached information. We do eventually sync with the cloud provider and update that information, but I am not sure whether the stale object might cause other problems.

@krousey
Contributor

krousey commented May 11, 2018

@jingxu97 Can you move that to another issue? It is a legitimate bug that the upgrade logic should not rely on kubelet recreating the node object. I agree that no longer doing that may lead to other issues, but that should be a separate discussion.

k8s-github-robot pushed a commit that referenced this issue May 11, 2018
Automatic merge from submit-queue (batch tested with PRs 63367, 63718, 63446, 63723, 63720). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Move uncordon to after the node is ready

**What this PR does / why we need it**:
It is possible that the node API object doesn't exist in a brief
window between recreation and registering. By moving the uncordon
until after the node is ready, we can be sure the API object exists.
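The reordering described above can be sketched as follows (a simplification for illustration, not the literal merged code):

```shell
# Hypothetical sketch: wait for the node's Ready condition first, so the
# node API object is guaranteed to exist before uncordoning it.
uncordon_after_ready() {
  local node="$1"
  until [ "$(kubectl get node "${node}" \
      --output='jsonpath={.status.conditions[?(@.type=="Ready")].status}')" = "True" ]; do
    sleep 1
  done
  kubectl uncordon "${node}"
}
```

Because a node cannot report Ready until its kubelet has registered, the uncordon can no longer race with the window in which the node object does not exist.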

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #63506 #63499

**Release note**:

```release-note
NONE
```
@AishSundar
Contributor Author

@krousey I see the test failing in the last few runs. Is it a different issue plaguing the test now?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-1.10-master-upgrade-cluster-new

@AishSundar
Contributor Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this May 14, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@AishSundar @krousey @timothysc @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-node-misc

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer. If the label is not applied within 1 day, the issue will be moved out of the v1.11 milestone.

Issue Labels
  • sig/cluster-lifecycle sig/node: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@AishSundar
Contributor Author

Here are the Commit changes since the last failure.

@AishSundar
Contributor Author

Spoke to @krousey offline: SkewTest is an overall result for the second half of the suite, so if the other tests in the job pass, this one should turn the corner. Closing this bug for now.

k8s-github-robot pushed a commit that referenced this issue Jun 5, 2018
Automatic merge from submit-queue (batch tested with PRs 64344, 64709, 64717, 63631, 58647). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add gpu cluster upgrade test.

**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:
Currently, running the GPUMasterUpgrade test with gpu nodes should pass, but running the GPUClusterUpgrade test will run into #63506

**Release note**:

```release-note

```
9 participants