
The cpu of kubelet is always 100% after containerd.service restart #95727

Closed
lisongmin opened this issue Oct 20, 2020 · 25 comments · Fixed by #97174
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@lisongmin

What happened:

When I restart containerd.service, kubelet cannot reconnect to it, which causes 100% CPU usage.

    PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
2556974 root       20   0 2085M 90964 52752 S 134.  2.3  4:43.21 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-drive

There are many log entries like this:

Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.523936  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.527771  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.531614  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.535471  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.539306  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...

I can still show container info via crictl in the meantime:

root@bypass-route:~# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID
3f365bb70e010       0983ebec5ab78       26 hours ago        Running             calico-node               2                   0b9859bf8eb8e
61a3d24c4b172       a94d6c76e8fe0       26 hours ago        Running             kube-scheduler            5                   3fac68115766b

Restarting kubelet.service resolves the problem.

What you expected to happen:

kubelet should reconnect to containerd after containerd restarts.

How to reproduce it (as minimally and precisely as possible):

  1. systemctl restart containerd
  2. Check CPU usage and syslog.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
root@bypass-route:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/arm64"}
The connection to the server localhost:8080 was refused - did you specify the right host or port?
  • Cloud provider or hardware configuration:
    armbian(debian on arm64)

  • OS (e.g: cat /etc/os-release):

root@bypass-route:~# cat /etc/os-release
PRETTY_NAME="Armbian 20.08.9 Buster"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
root@bypass-route:~# uname -a
Linux bypass-route 5.8.13-rockchip64 #20.08.8 SMP PREEMPT Mon Oct 5 15:59:02 CEST 2020 aarch64 GNU/Linux
  • Install tools:
    kubeadm
  • Network plugin and version (if this is a network-related bug):
  • Others:
@k8s-ci-robot
Contributor

@lisongmin: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123
Member

/sig node

@hanlins
Member

hanlins commented Oct 26, 2020

/assign

@hanlins
Member

hanlins commented Oct 27, 2020

Reproduce:

  1. I can reproduce the issue by restarting containerd when containerd is the container runtime.
  2. In another setup where docker was used as the container runtime, restarting docker did not trigger the issue, but restarting the underlying containerd did, and kubelet had to be restarted to recover.

Will continue the investigation; kubelet might need to reconnect to containerd after the socket is broken.

@neolit123
Member

To me it seems there are a couple of problems here: 1. kubelet does not reconnect to the containerd socket. 2. The reconnect is aggressive and consumes a lot of CPU.

I don't know the cause of 1, but 2 might be an indication that the retry interval should be relaxed.
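For illustration, grpc-go exposes connection backoff settings that could be used to relax the reconnect cadence. Below is a minimal sketch; the function name, endpoint handling, and timings are only examples, not what kubelet actually configures:

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialWithRelaxedBackoff sketches how a gRPC dial could be configured so that
// reconnect attempts back off instead of hammering the endpoint in a tight loop.
func dialWithRelaxedBackoff(ctx context.Context, endpoint string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, endpoint,
		grpc.WithInsecure(),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,  // first retry after one second
				Multiplier: 1.6,              // grow the delay between attempts
				Jitter:     0.2,              // randomize to avoid synchronized retries
				MaxDelay:   30 * time.Second, // cap the retry interval
			},
			MinConnectTimeout: 10 * time.Second,
		}),
	)
}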

@hanlins
Member

hanlins commented Oct 29, 2020

Spent a few hours on the investigation; it looks like some internal ClientConn state is getting messed up in gRPC.

@hanlins
Member

hanlins commented Oct 31, 2020

TL;DR: It should be related to gRPC.

Hacked the vendored gRPC module and changed its internal dial function to something like this:

func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error), addr string) (net.Conn, error) {
	if fn != nil {
		// return fn(ctx, addr)
		// Hack: bypass the customized dialer that was passed in and always dial
		// the unix socket directly, logging which dialer would have been used.
		grpclog.Warningf("#! grpc: OVERWRITE dial addr: '%s', given fn: '%#v'", addr, reflect.ValueOf(fn))
		return (&net.Dialer{}).DialContext(ctx, "unix", addr)
	}
	return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
}

With this change, CPU only spikes for a few seconds after a containerd restart and then comes back to normal, instead of the dial failing indefinitely.

kubelet passes a customized dialer to the gRPC client. I logged the dialer address during kubelet startup:

Oct 30 23:44:19 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:19.012111   10365 util_unix.go:103] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
Oct 30 23:44:19 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:19.012253   10365 kubelet.go:319] !!! NewRemoteRuntimeService Dialer: '(func(context.Context, string) (net.Conn, error))(0x33f2d40)'

As you can see, the dialer is at address 0x33f2d40. Before containerd restarts, the log shows that the gRPC dial function uses the customized dialer that kubelet passed every time:

Oct 30 23:44:19 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:19.019038   10365 http2_client.go:179] #! grpc: OVERWRITE dial addr: '/var/run/containerd/containerd.sock', given fn: '(func(context.Context, string) (net.Conn, error))(0x33f2d40)'

However, once containerd is restarted, the address of the dialer changes:

Oct 30 23:44:25 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:25.498923   10365 http2_client.go:179] #! grpc: OVERWRITE dial addr: 'unix:///run/containerd/containerd.sock', given fn: '(func(context.Context, string) (net.Conn, error))(0x1a06c10)'

The function at 0x1a06c10 keeps failing. Once I forced dial to use the original customized dialer, the issue was gone.
Without much knowledge of gRPC, my blind guess is that since ClientConn is passed by reference to the gRPC client, some component modifies it during the process lifetime and the change gets persisted.

Anyway, I'll check with the gRPC team on this issue and see if there's already a fix on their side. If so, I can port the fix here.
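For reference, the kind of customized dialer kubelet hands to gRPC is essentially a unix-socket dialer wired in through grpc.WithContextDialer. The sketch below is illustrative rather than the actual kubelet code; the function names and socket path are examples:

package main

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
)

// unixDialer is an illustrative stand-in for the customized dialer kubelet
// passes to gRPC: it dials the given unix socket directly.
func unixDialer(ctx context.Context, addr string) (net.Conn, error) {
	return (&net.Dialer{}).DialContext(ctx, "unix", addr)
}

// connectRuntime shows how such a dialer is attached to the client connection.
func connectRuntime() (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, "/run/containerd/containerd.sock",
		grpc.WithInsecure(),
		grpc.WithContextDialer(unixDialer),
	)
}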

@dims
Member

dims commented Nov 10, 2020

/cc

@jr200

jr200 commented Nov 13, 2020

Hi, are there any updates on this? I am experiencing what appears to be the same issue.

If I restart containerd while the kubelet is running, my kubelet logs get flooded with:

Nov 13 12:14:32 hostname kubelet[30658]: W1113 12:14:32.928349 30658 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...

Restarting the kubelet appears to be the only way to re-establish the connection with containerd.

Version information is:

containerd github.com/containerd/containerd v1.4.1 c623d1b36f09f8ef6536a057bd658b3aa8632828
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

@hanlins
Member

hanlins commented Nov 13, 2020

Hi @jr200, thanks for the info, I'm still working on the root cause. As I mentioned above, it's something related to gRPC. I collected the ClientConn object info; this is the object before containerd restart:
(screenshot: clientconn1)
And this is the object after containerd restart:
(screenshot: clientconn2)
One thing that concerns me is that the connection state is still 2 (i.e. Ready, see here) after the failure. I'm not sure if this is related to the cause, but it should still be something related to gRPC. Btw, I've followed their suggestion to use the latest gRPC client without a customized dialer and still see this issue; I'll keep working on the root cause.
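If anyone wants to observe the same symptom, the connectivity state of an existing *grpc.ClientConn can be polled directly. A small sketch, assuming the connection object is already available:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// watchConnState logs gRPC connectivity transitions for an existing
// *grpc.ClientConn, which makes it easy to confirm whether the state ever
// leaves Ready (value 2) after containerd is restarted.
func watchConnState(ctx context.Context, conn *grpc.ClientConn) {
	for {
		state := conn.GetState() // connectivity.Ready has value 2 and prints as READY
		log.Printf("grpc connection state: %v", state)
		// Block until the state changes or the context is cancelled.
		if !conn.WaitForStateChange(ctx, state) {
			return
		}
	}
}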

@skaegi

skaegi commented Nov 24, 2020

We're seeing this now when we update our nodes to 1.19, where our Kata Containers install does a systemctl restart containerd. Our current workaround is to also do systemctl restart kubelet, but it would be good to get this fixed.

@lisongmin
Author

Is this MR related to the problem?

grpc/grpc-go#4021

@hanlins
Member

hanlins commented Nov 28, 2020

Is this MR related to the problem?

grpc/grpc-go#4021

No, I don't think so. I cherry-picked that change and tried it in my environment weeks back, and the issue was still there. I only got around the issue by modifying the gRPC code as mentioned in the comment above. I'll continue to check which component modifies the dial function in ClientConn after Thanksgiving.

@scrwr

scrwr commented Dec 2, 2020

For the sake of priority: this hit us hard these days too, as security updates for containerd triggered a restart and a ton of nodes flooded the central log management. ;)

Thanks for working hard on a fix.

@moohdyy

moohdyy commented Dec 2, 2020

Same here. Two systems were affected by automatic containerd upgrades and restarts.
But now I wonder whether "unattended-upgrades" (Ubuntu 18.04) is generally a wise thing to do, or whether the OS for Kubernetes deployments should always be updated manually and in a controlled way. I cannot find best practices for this in the documentation.

@pacoxu
Member

pacoxu commented Dec 2, 2020

After the recent CVE-2020-15257, many users will upgrade and restart containerd. This would be a big problem.

@hanlins
Member

hanlins commented Dec 2, 2020

I see, I will prioritize the root cause and fix. In the meantime, you need to restart kubelet after you restart containerd.

@hanlins
Member

hanlins commented Dec 3, 2020

Had another round of debugging and made one step towards the root cause. I added the following snippet to the original hack mentioned in my previous comment:

// GetFuncInfo reports the source file, line, and name of the function value i,
// so the concrete dialer being used can be identified from the logs.
func GetFuncInfo(i interface{}) string {
	f := runtime.FuncForPC(reflect.ValueOf(i).Pointer())
	file, line := f.FileLine(f.Entry())
	name := f.Name()
	return fmt.Sprintf("File: %s#%d, Name: %s", file, line, name)
}
...
func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error), addr string) (net.Conn, error) {
	if fn != nil {
		// Log which dialer function is about to be called before using it.
		grpclog.Warningf("#DEBUG!!! DIALER info: %s", GetFuncInfo(fn))
		return fn(ctx, addr)
	}
	grpclog.Warningf("#DEBUG!!! DIALER fn is NIL!!!")
	return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
}

When I reproduced the error, I got the following information for the dialer function:

Dec 03 01:16:32 tkg-mgmt-vsphere-20201202142246-control-plane-25mbz kubelet[1481]: W1203 01:16:30.535400    1481 http2_client.go:198] #DEBUG!!! DIALER info: File: /Users/hanlins/goproj/src/github.com/kubernetes/kubernetes/vendor/github.com/containerd/containerd/pkg/dialer/dialer.go#33, Name: github.com/containerd/containerd/pkg/dialer.ContextDialer

That leads to cAdvisor, which is the only component in the repo that uses dialer.ContextDialer:

➜  kubernetes (master) ✗ ag -s "\.ContextDialer"
vendor/github.com/google/cadvisor/container/containerd/client.go
75:			grpc.WithContextDialer(dialer.ContextDialer),

So I'm convinced that this is related to cAdvisor and will update the thread once I've made further progress.
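For reference, the wiring in that client looks roughly like the sketch below: containerd's dialer helpers are handed to gRPC via grpc.WithContextDialer. This is a simplified illustration, not the verbatim cAdvisor code; the timeout and address handling are examples:

package main

import (
	"context"
	"time"

	"github.com/containerd/containerd/pkg/dialer"
	"google.golang.org/grpc"
)

// connectContainerd sketches how a gRPC connection to containerd can be set up
// with containerd's dialer helpers, similar in spirit to cAdvisor's containerd client.
func connectContainerd(address string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, dialer.DialAddress(address),
		grpc.WithInsecure(),
		grpc.WithBlock(),
		// dialer.ContextDialer is the function whose address showed up in the debug log above.
		grpc.WithContextDialer(dialer.ContextDialer),
	)
}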

@hanlins
Member

hanlins commented Dec 3, 2020

EOD updates:

  1. The issue is indeed caused by cAdvisor; I reproduced the issue in this repo.
  2. The confusing timeout in the log should be this one.
  3. The root cause might be a bug in cAdvisor where a context whose deadline has already been exceeded is reused (see the sketch at the end of this comment). To validate that, I need to explore all methods of this client:
type client struct {
	containerService containersapi.ContainersClient
	taskService      tasksapi.TasksClient
	versionService   versionapi.VersionClient
}

Will work on the cAdvisor fix for the rest of the week.
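To illustrate what point 3 means, here is a minimal, purely illustrative sketch of the suspected pattern: a context created with a deadline is stored and reused for later calls, so once the deadline has passed every redial fails immediately with a timeout, no matter how healthy the endpoint is:

package main

import (
	"context"
	"fmt"
	"time"
)

// doCall stands in for any gRPC dial or request that honours the context deadline.
func doCall(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the call itself takes 50ms
		return nil
	case <-ctx.Done():
		return ctx.Err() // fires immediately once the stored deadline has passed
	}
}

func main() {
	// Buggy pattern: a context with a deadline is created once and kept around.
	storedCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	fmt.Println("first call:", doCall(storedCtx)) // succeeds, deadline not yet reached

	time.Sleep(3 * time.Second) // e.g. containerd restarts some time later

	// Every retry that reuses the stored context now returns
	// "context deadline exceeded" without ever reaching the socket.
	fmt.Println("retry with stale context:", doCall(storedCtx))

	// Correct pattern: derive a fresh per-call context instead.
	freshCtx, cancelFresh := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelFresh()
	fmt.Println("retry with fresh context:", doCall(freshCtx))
}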

@bobbypage
Member

bobbypage commented Dec 10, 2020

Fix was made in google/cadvisor#2749

Copying from google/cadvisor#2749 (comment) to ensure we can capture it on the issue here:

I've cut new cAdvisor releases (v0.37.3 and v0.38.6) with this fix cherrypicked.

We should get the cAdvisor fix back into kubernetes. We'll need 3 PRs to k/k with the following updates:

@andrewsykim
Member

Thanks @hanlins for fixing this one, definitely a tricky one to track down.

@hanlins
Member

hanlins commented Dec 11, 2020

Thanks for all the patience and team effort on getting this issue fixed. Credit also to @bobbypage for the guidance and discussions; I wouldn't have been able to get it done without your help!

@bobbypage
Member

bobbypage commented Dec 21, 2020

The fix was applied on master as well as cherrypicked into v1.20.1 and v1.19.6 releases.

@SergeyKanzhelev
Member

The fix was applied on master as well as cherrypicked into v1.20.1 and v1.19.6 releases.

I remember you told me, but I forgot. Does this issue exist in 1.18 and 1.17?

@bobbypage
Member

I don't believe 1.18 or 1.17 is affected. I think this issue was introduced by google/cadvisor#2513, which was included in the cAdvisor v0.37 release. cAdvisor v0.37 was vendored into k8s 1.19.
