
SkewTest failing gce-1.10-master-upgrade-cluster-new job in sig-release-master-upgrade #63506

Closed
AishSundar opened this issue May 7, 2018 · 27 comments · Fixed by #63616 or #63718
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. milestone/needs-approval priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node.

@AishSundar
Contributor

Failing Job

gce-1.10-master-upgrade-cluster-new

Gubernator logs

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1422

Triage

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new&test=SkewTest

/kind bug
/priority failing-test
/priority important-soon
/sig testing
/milestone v1.11
@kubernetes/sig-testing-bugs

cc @jberkus @tpepper

/assign @BenTheElder for triage

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.11 milestone May 7, 2018
@k8s-ci-robot
Contributor

@AishSundar: GitHub didn't allow me to assign the following users: for, triage.

Note that only kubernetes members and repo collaborators can be assigned.

In response to this:

Failing Job

gce-1.10-master-upgrade-cluster-new

Gubernator logs

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1422

Triage

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new&test=SkewTest

/kind bug
/priority failing-test
/priority important-soon
/sig testing
/milestone v1.11
@kubernetes/sig-testing-bugs

cc @jberkus @tpepper

/assign @BenTheElder for triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2018
@BenTheElder
Member

I don't think this is sig-testing. This looks like [sig-cluster-lifecycle], [sig-autoscaling], [sig-storage].

/unassign
/remove-sig testing

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. and removed sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2018
@AishSundar
Contributor Author

/sig cluster-lifecycle
/assign @timothysc

@timothysc can you see if this is a cluster lifecycle issue please?

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 7, 2018
@krousey
Contributor

krousey commented May 8, 2018

Chiming in for cluster-lifecycle: this is weird. Somehow the nodes are being put back into a cordoned state after the upgrade. The upgrade checks that, after a node is upgraded, it becomes both ready and schedulable: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/upgrade.sh#L347-L360
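In sketch form, that post-upgrade check amounts to polling the node's Ready condition and cordon state. The helper below is hypothetical, written for illustration, not the actual upgrade.sh code; note that readiness comes from status conditions while cordoning lives in spec.unschedulable:

```shell
# Hypothetical sketch of the post-upgrade wait, not the real upgrade.sh code.
wait_for_node_ready() {
  local node="$1"
  local deadline=$(( $(date +%s) + 300 ))  # 5-minute timeout
  while [ "$(date +%s)" -lt "${deadline}" ]; do
    local ready unschedulable
    # Readiness is a status condition on the node object.
    ready=$(kubectl get node "${node}" \
      --output='jsonpath={.status.conditions[?(@.type=="Ready")].status}')
    # Cordoning is recorded in spec.unschedulable, not in status.
    unschedulable=$(kubectl get node "${node}" \
      --output='jsonpath={.spec.unschedulable}')
    if [ "${ready}" = "True" ] && [ "${unschedulable}" != "true" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```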

Here's an example from the linked failure:

I0505 21:53:28.217] == Recreating instance bootstrap-e2e-minion-group-srfd. ==
I0505 21:53:29.848] == Waiting for new node to be added to k8s.  ==
I0505 21:55:56.502] ..............................................................................................................................== Waiting for bootstrap-e2e-minion-group-srfd to become ready. ==
I0505 21:55:56.502] Node bootstrap-e2e-minion-group-srfd recreated.
I0505 21:55:56.789] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:55:58.067] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:55:59.371] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:00.670] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:01.964] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:03.272] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:04.582] Node bootstrap-e2e-minion-group-srfd is still not ready: Ready=False
I0505 21:56:05.886] Node bootstrap-e2e-minion-group-srfd Ready=True

But just a few lines below that, we can see that the cluster validation after the upgrade shows 2 nodes still unschedulable.

I0505 21:56:13.446] == Finished upgrading nodes to v1.11.0-alpha.2.110+8fcaa19dd06f19. ==
I0505 21:56:13.447] == Validating cluster post-upgrade ==
I0505 21:56:13.489] Validating gce cluster, MULTIZONE=
I0505 21:56:13.490] Project: k8s-jkns-gce-slow-1-4
I0505 21:56:13.490] Network Project: k8s-jkns-gce-slow-1-4
I0505 21:56:13.490] Zone: us-central1-f
I0505 21:56:14.846] Found 4 node(s).
I0505 21:56:15.005] NAME                              STATUS                     ROLES     AGE       VERSION
I0505 21:56:15.006] bootstrap-e2e-master              Ready,SchedulingDisabled   <none>    27m       v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-l1rn   Ready                      <none>    8m        v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-ls61   Ready,SchedulingDisabled   <none>    27m       v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.006] bootstrap-e2e-minion-group-srfd   Ready,SchedulingDisabled   <none>    26m       v1.11.0-alpha.2.110+8fcaa19dd06f19
I0505 21:56:15.168] Validate output:
I0505 21:56:15.314] NAME                 STATUS    MESSAGE              ERROR
I0505 21:56:15.314] etcd-1               Healthy   {"health": "true"}   
I0505 21:56:15.314] etcd-0               Healthy   {"health": "true"}   
I0505 21:56:15.314] scheduler            Healthy   ok                   
I0505 21:56:15.314] controller-manager   Healthy   ok                   
I0505 21:56:15.316] Cluster validation succeeded
I0505 21:56:15.317] == Post-Upgrade Node OS and Kubelet Versions ==
I0505 21:56:15.487] name: "bootstrap-e2e-master", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.487] name: "bootstrap-e2e-minion-group-l1rn", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.488] name: "bootstrap-e2e-minion-group-ls61", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.488] name: "bootstrap-e2e-minion-group-srfd", osImage: "Container-Optimized OS from Google", kubeletVersion: "v1.11.0-alpha.2.110+8fcaa19dd06f19"
I0505 21:56:15.493] May  5 21:56:15.492: INFO: Waiting up to 5m0s for all nodes to be ready after the upgrade
I0505 21:56:15.499] May  5 21:56:15.499: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:56:35.503] May  5 21:56:35.503: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:56:55.507] May  5 21:56:55.507: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:15.511] May  5 21:57:15.511: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:35.516] May  5 21:57:35.516: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:57:55.520] May  5 21:57:55.520: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:15.524] May  5 21:58:15.524: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:35.529] May  5 21:58:35.529: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:58:55.533] May  5 21:58:55.533: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:15.538] May  5 21:59:15.538: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:35.542] May  5 21:59:35.542: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 21:59:55.547] May  5 21:59:55.547: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:15.551] May  5 22:00:15.551: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:35.556] May  5 22:00:35.555: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:00:55.560] May  5 22:00:55.560: INFO: Waiting for ready nodes 3, current ready 1, not ready nodes 0
I0505 22:01:15.563] May  5 22:01:15.561: INFO: Unexpected error occurred: timeout waiting 5m0s for number of ready nodes to be 3

Does someone from the node team know why this might be happening? Is it a race condition? It worked for one node, but not two in this case. In other runs, I see it working for two nodes and only leaving one unschedulable.

cc @kubernetes/sig-node-bugs

@AishSundar
Contributor Author

/cc @dchen1107 @derekwaynecarr can either of you please help investigate the node side of this issue to unblock @krousey? This is currently one of the few tests failing in our sig-release upgrade-blocking suite. If this is not fixed, we will have to delay beta for 1.11, which is slated for 5/15.

@AishSundar
Contributor Author

/cc @dchen1107 @derekwaynecarr

@AishSundar
Contributor Author

cc @dchen1107 @derekwaynecarr

@jiayingz
Contributor

jiayingz commented May 9, 2018

/cc @jiayingz @Random-Liu

I am also running into this issue when trying to add a new upgrade test. First, I think the reason the test log said the node was ready is that the code is looking at the wrong field for SchedulingDisabled. The unschedulable field lives in node.spec, not in the node status conditions, so I think we should instead use:
local cordoned=$("${KUBE_ROOT}/cluster/kubectl.sh" get node "${instance}" --output='jsonpath={.spec.unschedulable}')

I am also puzzled about why recreating the instance just rebooted the node. According to the 'gcloud compute instance-groups managed recreate-instances' docs, that command should delete the instance first and create a new one. For the passing cluster upgrade tests that seems to be the case, but in the failing ones I saw the node simply reboot, so it still had the old unschedulable state from the API server.

While we figure out why recreating an instance sometimes just reboots the node, would it make sense to explicitly use 'gcloud compute instance-groups managed delete-instances' and 'gcloud compute instance-groups managed create-instances'?
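For illustration, an explicit delete-then-create flow might look like the sketch below. The helper name and the group/zone arguments are assumptions, not code from the repo, and a real version would also need to wait for the deletion to finish:

```shell
# Hypothetical sketch: replace a managed instance explicitly instead of
# using 'recreate-instances'. All names here are placeholders.
replace_instance() {
  local group="$1" zone="$2" instance="$3"
  gcloud compute instance-groups managed delete-instances "${group}" \
    --instances="${instance}" --zone="${zone}"
  # A real implementation would wait for the delete to complete before
  # recreating an instance under the same name.
  gcloud compute instance-groups managed create-instances "${group}" \
    --instances="${instance}" --zone="${zone}"
}
```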

@krousey
Contributor

krousey commented May 9, 2018

@dchen1107 Thanks, I'll work on a PR and send it out today.

@jiayingz
Contributor

jiayingz commented May 9, 2018

From my manual run, I ssh-ed into the unschedulable node after the test and saw that the kubelet log contained multiple restarts. It logged that the node was previously registered and then later changed to NodeNotSchedulable, based on the spec it got from the API server:

1754: May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.587210 1073 kubelet_node_status.go:126] Node e2e-test-jiayingz-minion-group-ssbv was previously registered
1755: May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.587238 1073 kubelet_node_status.go:85] Successfully registered node e2e-test-jiayingz-minion-group-ssbv
...
1789: May 08 18:03:18 e2e-test-jiayingz-minion-group-ssbv kubelet[1073]: I0508 18:03:18.602719 1073 server.go:442] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"e2e-test-jiayingz-minion-group-ssbv", UID:"e2e-test-jiayingz-minion-group-ssbv", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotSchedulable' Node e2e-test-jiayingz-minion-group-ssbv status is now: NodeNotSchedulable

@krousey
Contributor

krousey commented May 9, 2018

@jiayingz That fits with the current theory. The instance is being recreated, but the kubelet is no longer re-registering itself since the externalID is the same.

k8s-github-robot pushed a commit that referenced this issue May 10, 2018
Automatic merge from submit-queue (batch tested with PRs 62665, 62194, 63616, 63672, 63450). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Uncordon the node after upgrade

**What this PR does / why we need it**:
Previous logic was relying on the node to recreate the node API object
and, as a side-effect, uncordon itself. A change went in that no
longer ensures the node recreates itself, so the bug in this logic was exposed.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #63506

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```

cc @dchen1107 @AishSundar
@jiayingz
Contributor

/cc @msau42 @jingxu97

@AishSundar
Contributor Author

@krousey was #63616 sufficient to fix this issue, or are we also waiting on #63631 to be merged? I am still waiting for the latest run to verify that the issue is fixed.

@krousey
Contributor

krousey commented May 11, 2018

/reopen
There's a bug with my fix. It fails if the uncordon command happens in the window where the node object doesn't exist. I will have a fix out shortly.

@k8s-ci-robot
Contributor

@krousey: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
There's a bug with my fix. It fails if the uncordon command happens in the window where the node object doesn't exist. I will have a fix out shortly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krousey
Contributor

krousey commented May 11, 2018

/assign

@krousey
Contributor

krousey commented May 11, 2018

/reopen

@k8s-ci-robot k8s-ci-robot reopened this May 11, 2018
@krousey
Contributor

krousey commented May 11, 2018

Note that this bug is flaky. It doesn't happen 100% of the time. Looks like we got 100% pass in the latest run: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1431

Here's an upgrade test that's hitting this flake: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-parallel/3384

@jingxu97
Contributor

@dchen1107 @jiayingz
cc @mikedanese
Please see my comments on the change at https://github.com/kubernetes/kubernetes/pull/61877/files#diff-6b0808bd1afb15f9f77986f4459601c2L147 and #61877 (comment).
Before the change, when an instance was recreated, the kubelet would delete the node object and a new one would be created. Now, I think, the kubelet will not delete the old object if the node name is the same. This might cause trouble for volumes, because the old node object will contain outdated volumesAreAttached information. We do eventually sync with the cloud provider and update that information, but I am not sure whether the stale object might cause other problems.

@krousey
Contributor

krousey commented May 11, 2018

@jingxu97 Can you move that to another issue? It is a legitimate bug that the upgrade logic should not rely on kubelet recreating the node object. I agree that no longer doing that may lead to other issues, but that should be a separate discussion.

k8s-github-robot pushed a commit that referenced this issue May 11, 2018
Automatic merge from submit-queue (batch tested with PRs 63367, 63718, 63446, 63723, 63720). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Move uncordon to after the node is ready

**What this PR does / why we need it**:
It is possible that the node API object doesn't exist in a brief
window between recreation and registering. By moving the uncordon
until after the node is ready, we can be sure the API object exists.
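The reordering described above can be sketched as follows (a simplification for illustration, not the literal merged code):

```shell
# Hypothetical sketch: wait for the node's Ready condition first, so the
# node API object is guaranteed to exist before uncordoning it.
uncordon_after_ready() {
  local node="$1"
  until [ "$(kubectl get node "${node}" \
      --output='jsonpath={.status.conditions[?(@.type=="Ready")].status}')" = "True" ]; do
    sleep 1
  done
  kubectl uncordon "${node}"
}
```

Because a node cannot report Ready until its kubelet has registered, the uncordon can no longer race with the window in which the node object does not exist.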

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #63506 #63499

**Release note**:

```release-note
NONE
```
@AishSundar
Contributor Author

@krousey I see the test failing in the last few runs. Is it a different issue plaguing the test now?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-1.10-master-upgrade-cluster-new

@AishSundar
Contributor Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this May 14, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Approval

@AishSundar @krousey @timothysc @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-node-misc

Action required: This issue must have the status/approved-for-milestone label applied by a SIG maintainer. If the label is not applied within 1 day, the issue will be moved out of the v1.11 milestone.

Issue Labels
  • sig/cluster-lifecycle sig/node: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@AishSundar
Contributor Author

Here are the Commit changes since the last failure.

@AishSundar
Contributor Author

Spoke to @krousey offline: SkewTest is an overall result for the second half of the suite, so if the other tests in the job pass, this one should turn the corner. Closing this bug for now.

k8s-github-robot pushed a commit that referenced this issue Jun 5, 2018
Automatic merge from submit-queue (batch tested with PRs 64344, 64709, 64717, 63631, 58647). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add gpu cluster upgrade test.

**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:
Currently, running the GPUMasterUpgrade test with gpu nodes should pass, but running the GPUClusterUpgrade test will run into #63506

**Release note**:

```release-note

```
9 participants