
containerd not responding leads to logspam and tight loop. #97404

Closed
brendandburns opened this issue Dec 19, 2020 · 3 comments
Labels
kind/bug, needs-sig, needs-triage

Comments

@brendandburns
Contributor

What happened:
containerd started misbehaving, and kubelet went into a tight loop consuming 100% of CPU.

...
Dec 19 12:34:08 kube3 kubelet[546]: W1219 12:34:08.888663     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Dec 19 12:34:08 kube3 kubelet[546]: W1219 12:34:08.930522     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Dec 19 12:34:08 kube3 kubelet[546]: W1219 12:34:08.972428     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Dec 19 12:34:09 kube3 kubelet[546]: W1219 12:34:09.014214     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Dec 19 12:34:09 kube3 kubelet[546]: W1219 12:34:09.056127     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Dec 19 12:34:09 kube3 kubelet[546]: W1219 12:34:09.103635     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Dec 19 12:34:09 kube3 kubelet[546]: W1219 12:34:09.145329     546 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
...

What you expected to happen:
Kubelet appears to be tight-looping, retrying its connection to containerd roughly every 40ms. There should be exponential back-off, and 40ms is probably too short even for the first retry.

How to reproduce it (as minimally and precisely as possible):
Make containerd hang (I'm not sure how it got wedged on my machine), then run kubelet.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.19.4
  • Cloud provider or hardware configuration: Bare Metal (Intel x64)
  • OS (e.g: cat /etc/os-release): Debian Buster
  • Kernel (e.g. uname -a): Linux kube3 4.19.0-12-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: containerd 1.43.1-1
@brendandburns added the kind/bug label Dec 19, 2020
@k8s-ci-robot
Contributor

@brendandburns: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-sig label Dec 19, 2020
@k8s-ci-robot
Contributor

@brendandburns: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label Dec 19, 2020
@brendandburns
Contributor Author

Oops, realized this is a duplicate of #95727, closing.
