
kubernetes thinks pod is running even after the node was deleted explicitly #27882

Closed
sols1 opened this issue Jun 22, 2016 · 20 comments
Labels: kind/support, priority/awaiting-more-evidence, sig/scheduling

@sols1 commented Jun 22, 2016

kubectl delete node 192.168.78.14

kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                READY     STATUS    RESTARTS   AGE       NODE
default       collectd-9pafi                      1/1       Running   0          23h       192.168.78.14
default       collectd-3dslw                      1/1       Running   0          23h       192.168.78.15
default       collectd-ja6p7                      1/1       Running   0          23h       192.168.78.16
default       graphite-zruml                      1/1       Running   0          1d        192.168.78.15
default       ha-service-loadbalancer-a7ssn       1/1       Running   0          1d        192.168.78.15
default       ha-service-loadbalancer-k3hq2       1/1       Running   0          1d        192.168.78.16
kube-system   kube-dns-v11-4qoi8                  4/4       Running   0          1d        192.168.78.16
kube-system   kube-registry-v0-69k0f              1/1       Running   0          1d        192.168.78.16
kube-system   kubernetes-dashboard-v1.0.0-cwg7k   1/1       Running   0          1d        192.168.78.15
@sols1 (Author) commented Jun 22, 2016

Actually, the node was already gone before I ran delete node.

@girishkalele commented

@sols1 I want to say that we need to wait before declaring the node/pods dead because of possible network partitioning. Forwarding to the experts for an explanation of the expected behaviour.

@dbsmith @kubernetes/goog-cluster @davidopp

How long will the pod collectd-9pafi on the deleted node 192.168.78.14 be shown in the Running state before we consider the node really dead and not just partitioned?

@davidopp added the kind/support and priority/awaiting-more-evidence labels Jun 27, 2016
@davidopp (Member) commented

Actually, the node was already gone before I ran delete node.

Can you clarify what you mean by "gone"?

To answer your question -- NodeController used to sync nodes against the cloud provider (treating the cloud provider as the source of truth), so "delete node" didn't really do anything unless you also deleted the VM in your cloud provider. (In fact "delete node" wasn't necessary, because once the VM was deleted from the cloud provider, the NodeController would detect that and do the "delete node" itself.) But for some reason I can't find the code that syncs from the cloud provider, only the code that deletes the node when NodeController sees it is missing from the cloud provider: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/node/nodecontroller.go#L647

In any event, I suspect that deleting the node from your cloud provider and then waiting a little bit should result in the pods being evicted and the node being deleted for real.

If you're running on bare metal, kill the kubelet and then run "kubectl delete node"

@gmarek Do you know what happened to the code that I'm referring to? Am I hallucinating?
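
For reference, a minimal sketch of the bare-metal procedure described above, assuming the kubelet runs under systemd on the node (the unit name and service manager are assumptions; adapt to your setup):

# On the node being removed: stop the kubelet so it cannot re-register
sudo systemctl stop kubelet
sudo systemctl disable kubelet

# From a machine with access to the API server: remove the Node object
kubectl delete node 192.168.78.14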

@sols1 (Author) commented Jun 27, 2016

Is kubernetes supposed to work on bare metal cluster (which is what I am doing)?

In bare metal cluster there is no cloud provider.

By "node was already gone" I mean that the node was switched off to another cluster (and it was a while).

Waiting does not help - the node and pods that were running on the node do not disappear no matter how long you wait.

What do you mean by saying "kill the kubelet"? Where? The node was gone already.

Why is this P3? If kubernetes does not support bare metal clusters, then you should state it in the docs very clearly.

@davidopp (Member) commented

Kubernetes supports bare metal clusters, hence my comment

If you're running on bare metal, kill the kubelet and then run "kubectl delete node"

If the node is no longer in touch with the master (killing the kubelet process such that it doesn't restart is one way to do that; shutting off the machine is another) and you do "kubectl delete node" then the node should be deleted from the master's state and the pods evicted. If that's not happening, it's a bug. As a starting point, can you run "kubectl describe pod" and "kubectl get pod -o yaml" on the pods that the master says are still running on the node that's gone (I guess collectd-9pafi is one), and post the output here?

Thanks
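
For example, the requested information could be captured with something like the following (collectd-9pafi is the pod name from the original report; substitute whichever pod is stuck in your cluster):

kubectl describe pod collectd-9pafi
kubectl get pod collectd-9pafi -o yaml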

@thockin (Member) commented Jun 27, 2016

You shouldn't need to delete the node, though. When kubelet stops checking in, all the pods should be deleted... It sounds to me like THAT is the bug here...

@sols1 (Author) commented Jun 27, 2016

Why do I need to do delete node if the node is already down? Why doesn't kubernetes itself detect such a situation and handle it correctly (just like with pods)? This looks like a bug.

Yes, even if I do delete node manually, the pods that were running on the node do not disappear - it doesn't matter how long you wait. This is another bug.

I cannot run kubectl describe pod right now since it has been a while, but I have observed this behavior multiple times. I can capture it the next time I see it.

Do you actually test kubernetes on bare metal clusters?

@davidopp (Member) commented

Yes, sorry, I mis-read the issue. As @thockin alluded to, the pods should be evicted once the node stops heartbeating for 5 minutes (actually it's 5m40s). You don't need to delete the node. The behavior is the same on bare metal and cloud.

The next time you see this, please send the information about the pod that I described in my previous comment, as it will help us debug. Doing the same (describe and get -o yaml) for the node that the pod claims to be running on would be very helpful too.
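
Concretely, something along these lines would capture both pieces of state (the placeholders stand in for the actual pod and node names):

kubectl describe pod <stuck-pod>
kubectl get pod <stuck-pod> -o yaml
kubectl describe node <gone-node>
kubectl get node <gone-node> -o yaml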

@xiangpengzhao (Contributor) commented

I once ran into a similar situation in my environment. At the time I thought it was a time-setting problem in the environment, so I ignored the issue and reset my VM. I will try to reproduce the issue.

@sols1 (Author) commented Jun 29, 2016

3 nodes up:

date; ~/kubectl --server=192.168.78.16:8080 get nodes
Tue Jun 28 16:21:20 PDT 2016
NAME            STATUS    AGE
192.168.78.14   Ready     42m
192.168.78.15   Ready     11d
192.168.78.16   Ready     11d

date; ~/kubectl --server=192.168.78.16:8080 get svc,ds -o wide --all-namespaces | cut -c1-175
Tue Jun 28 16:22:22 PDT 2016
NAMESPACE     NAME                       CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE       SELECTOR
default       graphite                   10.100.72.82     nodes         9093/TCP,2003/TCP,8125/UDP   11d       name=graphite
default       graphite-ext               10.100.83.96     10.10.10.20   9093/TCP                     8d        name=graphite
default       kubernetes                 10.100.0.1       <none>        443/TCP                      11d       <none>
development   dns-backend                10.100.117.111   <none>        8000/TCP                     5d        name=dns-backend
kube-system   kube-dns                   10.100.2.254     <none>        53/UDP,53/TCP                11d       k8s-app=kube-dns
kube-system   kube-registry              10.100.78.167    <none>        5000/TCP                     11d       k8s-app=kube-registry
kube-system   kubernetes-dashboard       10.100.97.204    <none>        80/TCP                       1d        k8s-app=kubernetes-dashboard
kube-system   kubernetes-dashboard-ext   10.100.190.239   10.10.10.20   9090/TCP                     1d        k8s-app=kubernetes-dashboard
production    dns-backend                10.100.70.161    <none>        8000/TCP                     4d        name=dns-backend
NAMESPACE     NAME                       DESIRED          CURRENT       NODE-SELECTOR                AGE       CONTAINER(S)            IMAGE(S)                                
default       collectd                   3                3             <none>                       4h        collectd                collectd-docker            
default       node-problem-detector      3                3             <none>                       3d        node-problem-detector   gcr.io/google_containers/node-problem-de

date; ~/kubectl --server=192.168.78.16:8080 get pods -o wide --all-namespaces
Tue Jun 28 16:22:52 PDT 2016
NAMESPACE     NAME                                READY     STATUS    RESTARTS   AGE       NODE
default       collectd-0z9ih                      1/1       Running   0          43m       192.168.78.14
default       collectd-1za4f                      1/1       Running   0          4h        192.168.78.15
default       collectd-ty56y                      1/1       Running   0          4h        192.168.78.16
default       graphite-5rr7m                      1/1       Running   0          6d        192.168.78.16
default       ha-service-loadbalancer-83c8n       1/1       Running   0          23h       192.168.78.15
default       ha-service-loadbalancer-fyihg       1/1       Running   0          23h       192.168.78.16
default       node-problem-detector-33r6c         1/1       Running   0          22h       192.168.78.14
default       node-problem-detector-bcrpv         1/1       Running   0          3d        192.168.78.15
default       node-problem-detector-i46vz         1/1       Running   0          3d        192.168.78.16
development   dns-backend-rtx9c                   1/1       Running   0          5d        192.168.78.15
kube-system   kube-dns-v11-mbkar                  4/4       Running   0          1d        192.168.78.15
kube-system   kube-registry-v0-5fly7              1/1       Running   0          6d        192.168.78.16
kube-system   kubernetes-dashboard-v1.0.0-y5jel   1/1       Running   0          1d        192.168.78.16
production    dns-backend-wkt5j                   1/1       Running   0          5d        192.168.78.16

Node 192.168.78.14 shuts down:

date; ~/kubectl --server=192.168.78.16:8080 get nodes
Tue Jun 28 16:33:11 PDT 2016
NAME            STATUS     AGE
192.168.78.14   NotReady   54m
192.168.78.15   Ready      11d
192.168.78.16   Ready      11d

date; ~/kubectl --server=192.168.78.16:8080 get svc,ds -o wide --all-namespaces | cut -c1-175
Tue Jun 28 16:33:31 PDT 2016
NAMESPACE     NAME                       CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE       SELECTOR
default       graphite                   10.100.72.82     nodes         9093/TCP,2003/TCP,8125/UDP   11d       name=graphite
default       graphite-ext               10.100.83.96     10.10.10.20   9093/TCP                     8d        name=graphite
default       kubernetes                 10.100.0.1       <none>        443/TCP                      11d       <none>
development   dns-backend                10.100.117.111   <none>        8000/TCP                     5d        name=dns-backend
kube-system   kube-dns                   10.100.2.254     <none>        53/UDP,53/TCP                11d       k8s-app=kube-dns
kube-system   kube-registry              10.100.78.167    <none>        5000/TCP                     11d       k8s-app=kube-registry
kube-system   kubernetes-dashboard       10.100.97.204    <none>        80/TCP                       1d        k8s-app=kubernetes-dashboard
kube-system   kubernetes-dashboard-ext   10.100.190.239   10.10.10.20   9090/TCP                     1d        k8s-app=kubernetes-dashboard
production    dns-backend                10.100.70.161    <none>        8000/TCP                     5d        name=dns-backend
NAMESPACE     NAME                       DESIRED          CURRENT       NODE-SELECTOR                AGE       CONTAINER(S)            IMAGE(S)                                
default       collectd                   3                3             <none>                       4h        collectd                collectd-docker            
default       node-problem-detector      3                3             <none>                       3d        node-problem-detector   gcr.io/google_containers/node-problem-de

date; ~/kubectl --server=192.168.78.16:8080 get pods -o wide --all-namespaces
Tue Jun 28 16:33:44 PDT 2016
NAMESPACE     NAME                                READY     STATUS    RESTARTS   AGE       NODE
default       collectd-0z9ih                      1/1       Running   0          54m       192.168.78.14
default       collectd-1za4f                      1/1       Running   0          4h        192.168.78.15
default       collectd-ty56y                      1/1       Running   0          4h        192.168.78.16
default       graphite-5rr7m                      1/1       Running   0          6d        192.168.78.16
default       ha-service-loadbalancer-83c8n       1/1       Running   0          23h       192.168.78.15
default       ha-service-loadbalancer-fyihg       1/1       Running   0          23h       192.168.78.16
default       node-problem-detector-33r6c         1/1       Running   0          22h       192.168.78.14
default       node-problem-detector-bcrpv         1/1       Running   0          3d        192.168.78.15
default       node-problem-detector-i46vz         1/1       Running   0          3d        192.168.78.16
development   dns-backend-rtx9c                   1/1       Running   0          5d        192.168.78.15
kube-system   kube-dns-v11-mbkar                  4/4       Running   0          1d        192.168.78.15
kube-system   kube-registry-v0-5fly7              1/1       Running   0          6d        192.168.78.16
kube-system   kubernetes-dashboard-v1.0.0-y5jel   1/1       Running   0          1d        192.168.78.16
production    dns-backend-wkt5j                   1/1       Running   0          5d        192.168.78.16

One hour later the pods are still "running" on 192.168.78.14:

date; ~/kubectl --server=192.168.78.16:8080 get nodes
Tue Jun 28 17:39:31 PDT 2016
NAME            STATUS     AGE
192.168.78.14   NotReady   2h
192.168.78.15   Ready      11d
192.168.78.16   Ready      11d

date; ~/kubectl --server=192.168.78.16:8080 get pods -o wide --all-namespaces
Tue Jun 28 17:39:37 PDT 2016
NAMESPACE     NAME                                READY     STATUS    RESTARTS   AGE       NODE
default       collectd-0z9ih                      1/1       Running   0          2h        192.168.78.14
default       collectd-1za4f                      1/1       Running   0          5h        192.168.78.15
default       collectd-ty56y                      1/1       Running   0          5h        192.168.78.16
default       graphite-5rr7m                      1/1       Running   0          6d        192.168.78.16
default       ha-service-loadbalancer-83c8n       1/1       Running   0          1d        192.168.78.15
default       ha-service-loadbalancer-fyihg       1/1       Running   0          1d        192.168.78.16
default       node-problem-detector-33r6c         1/1       Running   0          23h       192.168.78.14
default       node-problem-detector-bcrpv         1/1       Running   0          4d        192.168.78.15
default       node-problem-detector-i46vz         1/1       Running   0          4d        192.168.78.16
development   dns-backend-rtx9c                   1/1       Running   0          5d        192.168.78.15
kube-system   kube-dns-v11-mbkar                  4/4       Running   0          1d        192.168.78.15
kube-system   kube-registry-v0-5fly7              1/1       Running   0          6d        192.168.78.16
kube-system   kubernetes-dashboard-v1.0.0-y5jel   1/1       Running   0          1d        192.168.78.16
production    dns-backend-wkt5j                   1/1       Running   0          5d        192.168.78.16

Even after kubectl delete node the pods are still "running":

date; ~/kubectl --server=192.168.78.16:8080 delete node 192.168.78.14
Tue Jun 28 17:40:44 PDT 2016
node "192.168.78.14" deleted

date; ~/kubectl --server=192.168.78.16:8080 get nodes
Tue Jun 28 17:40:58 PDT 2016
NAME            STATUS    AGE
192.168.78.15   Ready     11d
192.168.78.16   Ready     11d

date; ~/kubectl --server=192.168.78.16:8080 get pods -o wide --all-namespaces
Tue Jun 28 17:41:07 PDT 2016
NAMESPACE     NAME                                READY     STATUS    RESTARTS   AGE       NODE
default       collectd-0z9ih                      1/1       Running   0          2h        192.168.78.14
default       collectd-1za4f                      1/1       Running   0          5h        192.168.78.15
default       collectd-ty56y                      1/1       Running   0          5h        192.168.78.16
default       graphite-5rr7m                      1/1       Running   0          6d        192.168.78.16
default       ha-service-loadbalancer-83c8n       1/1       Running   0          1d        192.168.78.15
default       ha-service-loadbalancer-fyihg       1/1       Running   0          1d        192.168.78.16
default       node-problem-detector-33r6c         1/1       Running   0          23h       192.168.78.14
default       node-problem-detector-bcrpv         1/1       Running   0          4d        192.168.78.15
default       node-problem-detector-i46vz         1/1       Running   0          4d        192.168.78.16
development   dns-backend-rtx9c                   1/1       Running   0          5d        192.168.78.15
kube-system   kube-dns-v11-mbkar                  4/4       Running   0          1d        192.168.78.15
kube-system   kube-registry-v0-5fly7              1/1       Running   0          6d        192.168.78.16
kube-system   kubernetes-dashboard-v1.0.0-y5jel   1/1       Running   0          1d        192.168.78.16
production    dns-backend-wkt5j                   1/1       Running   0          5d        192.168.78.16

@gmarek (Contributor) commented Sep 1, 2016

Sorry for dropping this so late, but it completely fell off my radar. What both @davidopp and @thockin wrote is right. Kubernetes should delete pods from NotReady Nodes after 5 minutes, with the exception of daemons - isn't collectd a pod that is part of a DaemonSet? If so, then sadly this is expected and has been discussed a couple of times (@mikedanese added this behavior). We probably should have logic somewhere that deletes daemon pods from nonexistent Nodes, and the lack of it is a bug. Because it was decided that the NodeController shouldn't touch daemons, I guess it's the responsibility of the DaemonSet controller.

As for 'automatically deleting Nodes in on-prem setups' - that is something we consciously do not do. It's impossible to tell whether a non-responsive Node is really gone or just has some temporary problem - all we can say is that we can't hear from it. Theoretically we could have some heuristic that decides that if the cluster size is X and the Node has been unresponsive for more than Y it's probably not coming back, but that would cause more problems than it would solve.

Again - sorry for the late answer. @davidopp please triage the problem of daemon deletion from non-existent Nodes.
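
To check whether a stuck pod falls under the DaemonSet exception described above, one can confirm what created it. In the repro output above, collectd and node-problem-detector already appear as DaemonSets in the "get svc,ds" listing; more generally, on a cluster of roughly this vintage (assuming the creator is still recorded in the kubernetes.io/created-by annotation), something like this works:

# List DaemonSets and see whether the stuck pod belongs to one of them
kubectl get ds --all-namespaces

# Inspect the pod's creator reference (collectd-0z9ih is the stuck pod from the repro above)
kubectl get pod collectd-0z9ih -o yaml | grep -A2 created-by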

@gmarek (Contributor) commented Sep 8, 2016

@sols1 - is my explanation enough for you?

@sols1 (Author) commented Sep 27, 2016

@gmarek If what you are saying is true, then the k8s documentation should say clearly and directly that dead nodes must be removed manually in bare metal k8s clusters.

For DaemonSets this is a bug, isn't it?

@gmarek (Contributor) commented Sep 28, 2016

cc @devin-donnelly for the former

It turned out that there were two different places in which we were removing orphaned Pods. One was omitting DaemonSets; the other one probably wasn't. There's a PR (#32495) that cleans up this mess - it might help with your case as well.

@mtbbiker commented Feb 3, 2017

I believe this is still an issue. I just tested this by shutting down the VM (I am running CoreOS on VMware).

My Pods on that Node (in the NotReady state) still show Running, and after 10 minutes they have not yet been re-created on a different Node (the Pods are created from a ReplicationController).

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:22:15Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2+coreos.0", GitCommit:"3ed7d0f453a5517245d32a9c57c39b946e578821", GitTreeState:"clean", BuildDate:"2017-01-13T00:23:19Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Could this be caused by network partitioning? If so, how can I test? I don't know VMware too well.

@davidopp added the sig/scheduling label Feb 5, 2017
@gmarek (Contributor) commented Feb 6, 2017

@mtbbiker - I replied on the second thread (#8335). Please add xrefs instead of sending the same message on multiple threads.

@mtbbiker commented Feb 8, 2017

@gmarek Thanks for the reply and heads-up.

@gmarek closed this as completed May 10, 2017
@jomeier commented Jul 7, 2018

Maybe this is a possible workaround: #65936

@rasberrypie commented

Is there a way to reduce the 5-minute heartbeat timeout?
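
(A hedged note, since this went unanswered in the thread: the roughly 5-minute window is governed by the controller manager's eviction settings rather than by the kubelet heartbeat itself. On self-managed clusters of roughly this era the relevant flags are believed to be the following, with defaults shown; exact names should be checked against your version's documentation.)

# kube-controller-manager
--node-monitor-grace-period=40s   # how long missed status updates are tolerated before the node is marked NotReady
--pod-eviction-timeout=5m0s       # how long a node stays NotReady before its pods are evicted

# kubelet
--node-status-update-frequency=10s   # how often the kubelet posts node status to the API server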
