
The cpu of kubelet is always 100% after containerd.service restart #95727

Closed
lisongmin opened this issue Oct 20, 2020 · 25 comments · Fixed by #97174
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@lisongmin

What happened:

When I restart containerd.service, kubelet cannot reconnect to it, which causes 100% CPU usage.

    PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
2556974 root       20   0 2085M 90964 52752 S 134.  2.3  4:43.21 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-drive

There are many log entries like this:

Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.523936  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.527771  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.531614  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.535471  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...
Oct 20 13:47:39 bypass-route kubelet[312802]: W1020 13:47:39.539306  312802 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...

I can still show container info via crictl in the meantime:

root@bypass-route:~# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID
3f365bb70e010       0983ebec5ab78       26 hours ago        Running             calico-node               2                   0b9859bf8eb8e
61a3d24c4b172       a94d6c76e8fe0       26 hours ago        Running             kube-scheduler            5                   3fac68115766b

Restarting kubelet.service resolves the problem.

What you expected to happen:

kubelet should reconnect to containerd after containerd restarts.

How to reproduce it (as minimally and precisely as possible):

  1. systemctl restart containerd
  2. Check CPU usage and syslog.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
root@bypass-route:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/arm64"}
The connection to the server localhost:8080 was refused - did you specify the right host or port?
  • Cloud provider or hardware configuration:
    armbian(debian on arm64)

  • OS (e.g: cat /etc/os-release):

root@bypass-route:~# cat /etc/os-release
PRETTY_NAME="Armbian 20.08.9 Buster"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
root@bypass-route:~# uname -a
Linux bypass-route 5.8.13-rockchip64 #20.08.8 SMP PREEMPT Mon Oct 5 15:59:02 CEST 2020 aarch64 GNU/Linux
  • Install tools:
    kubeadm
  • Network plugin and version (if this is a network-related bug):
  • Others:
@k8s-ci-robot
Contributor

@lisongmin: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123
Member

/sig node

@hanlins
Member

hanlins commented Oct 26, 2020

/assign

@hanlins
Member

hanlins commented Oct 27, 2020

Reproduce:

  1. I can reproduce the issue by restarting containerd when containerd is the container runtime.
  2. In another setup where docker was used as the container runtime, restarting docker did not trigger the issue, but restarting the underlying containerd did, and kubelet had to be restarted to recover.

Will continue the investigation; kubelet might need to reconnect to containerd after the socket is broken.

@neolit123
Member

To me it seems there are a couple of problems here: 1. kubelet does not reconnect to the containerd socket. 2. The reconnect is aggressive and consumes a lot of CPU.

I don't know the cause of 1, but 2 might be an indication that the retry interval should be relaxed.
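For illustration, grpc-go exposes connection backoff settings that could be used to relax the reconnect cadence. Below is a minimal sketch; the function name, endpoint handling, and timings are only examples, not what kubelet actually configures:

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialWithRelaxedBackoff sketches how a gRPC dial could be configured so that
// reconnect attempts back off instead of hammering the endpoint in a tight loop.
func dialWithRelaxedBackoff(ctx context.Context, endpoint string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, endpoint,
		grpc.WithInsecure(),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,  // first retry after one second
				Multiplier: 1.6,              // grow the delay between attempts
				Jitter:     0.2,              // randomize to avoid synchronized retries
				MaxDelay:   30 * time.Second, // cap the retry interval
			},
			MinConnectTimeout: 10 * time.Second,
		}),
	)
}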

@hanlins
Member

hanlins commented Oct 29, 2020

Spent a few hours on the investigation; it looks like some internal ClientConn state is getting messed up in gRPC.

@hanlins
Member

hanlins commented Oct 31, 2020

TL;DR: It should be related to gRPC.

Hacked the vendored gRPC module and changed its internal dial function to something like this:

func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error), addr string) (net.Conn, error) {
	if fn != nil {
		// return fn(ctx, addr)
		// Hack: bypass the customized dialer that was passed in and always dial
		// the unix socket directly, logging which dialer would have been used.
		grpclog.Warningf("#! grpc: OVERWRITE dial addr: '%s', given fn: '%#v'", addr, reflect.ValueOf(fn))
		return (&net.Dialer{}).DialContext(ctx, "unix", addr)
	}
	return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
}

With this change, CPU only spikes for a few seconds after a containerd restart and then comes back to normal, instead of the dial failing indefinitely.

kubelet passes a customized dialer to the gRPC client. I logged the dialer address during kubelet startup:

Oct 30 23:44:19 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:19.012111   10365 util_unix.go:103] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
Oct 30 23:44:19 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:19.012253   10365 kubelet.go:319] !!! NewRemoteRuntimeService Dialer: '(func(context.Context, string) (net.Conn, error))(0x33f2d40)'

As you can see, the dialer is at address 0x33f2d40. Before containerd restarts, the log shows that the gRPC dial function uses the customized dialer that kubelet passed every time:

Oct 30 23:44:19 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:19.019038   10365 http2_client.go:179] #! grpc: OVERWRITE dial addr: '/var/run/containerd/containerd.sock', given fn: '(func(context.Context, string) (net.Conn, error))(0x33f2d40)'

However, once containerd is restarted, the address of the dialer changes:

Oct 30 23:44:25 test-4-md-0-74fb968bd8-zhjpq kubelet[10365]: W1030 23:44:25.498923   10365 http2_client.go:179] #! grpc: OVERWRITE dial addr: 'unix:///run/containerd/containerd.sock', given fn: '(func(context.Context, string) (net.Conn, error))(0x1a06c10)'

The function at 0x1a06c10 keeps failing. Once I forced dial to use the original customized dialer, the issue was gone.
Without much knowledge of gRPC, my blind guess is that since ClientConn is passed by reference to the gRPC client, some component modifies it during the process lifetime and the change gets persisted.

Anyway, I'll check with the gRPC team on this issue and see if there's already a fix on their side. If so, I can port the fix here.
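For reference, the kind of customized dialer kubelet hands to gRPC is essentially a unix-socket dialer wired in through grpc.WithContextDialer. The sketch below is illustrative rather than the actual kubelet code; the function names and socket path are examples:

package main

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
)

// unixDialer is an illustrative stand-in for the customized dialer kubelet
// passes to gRPC: it dials the given unix socket directly.
func unixDialer(ctx context.Context, addr string) (net.Conn, error) {
	return (&net.Dialer{}).DialContext(ctx, "unix", addr)
}

// connectRuntime shows how such a dialer is attached to the client connection.
func connectRuntime() (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, "/run/containerd/containerd.sock",
		grpc.WithInsecure(),
		grpc.WithContextDialer(unixDialer),
	)
}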

@dims
Member

dims commented Nov 10, 2020

/cc

@jr200

jr200 commented Nov 13, 2020

Hi, are there any updates on this? I am experiencing what appears to be the same issue.

If I restart containerd while the kubelet is running, my kubelet logs get flooded with:

Nov 13 12:14:32 hostname kubelet[30658]: W1113 12:14:32.928349 30658 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout". Reconnecting...

Restarting the kubelet appears to be the only way to re-establish the connection with containerd.

Version information is:

containerd github.com/containerd/containerd v1.4.1 c623d1b36f09f8ef6536a057bd658b3aa8632828
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

@hanlins
Member

hanlins commented Nov 13, 2020

Hi @jr200, thanks for the info, I'm still working on the root cause. As I mentioned above, it's something related to gRPC. I collected the ClientConn object info; this is the object before containerd restart:
(screenshot: clientconn1)
And this is the object after containerd restart:
(screenshot: clientconn2)
One thing that concerns me is that the connection state is still 2 (i.e. Ready, see here) after the failure. I'm not sure if this is related to the cause, but it should still be something related to gRPC. Btw, I've followed their suggestion to use the latest gRPC client without a customized dialer and still see this issue; I'll keep working on the root cause.
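If anyone wants to observe the same symptom, the connectivity state of an existing *grpc.ClientConn can be polled directly. A small sketch, assuming the connection object is already available:

package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// watchConnState logs gRPC connectivity transitions for an existing
// *grpc.ClientConn, which makes it easy to confirm whether the state ever
// leaves Ready (value 2) after containerd is restarted.
func watchConnState(ctx context.Context, conn *grpc.ClientConn) {
	for {
		state := conn.GetState() // connectivity.Ready has value 2 and prints as READY
		log.Printf("grpc connection state: %v", state)
		// Block until the state changes or the context is cancelled.
		if !conn.WaitForStateChange(ctx, state) {
			return
		}
	}
}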

@skaegi

skaegi commented Nov 24, 2020

We're seeing this now when we update our nodes to 1.19, where our Kata Containers install does a systemctl restart containerd. Our current workaround is to also do systemctl restart kubelet, but it would be good to get this fixed.

@lisongmin
Author

Is this MR related to the problem?

grpc/grpc-go#4021

@hanlins
Member

hanlins commented Nov 28, 2020

Is this MR related to the problem?

grpc/grpc-go#4021

No, I don't think so. I cherry-picked that change and tried it in my environment weeks back, and the issue was still there. I only got around the issue by modifying the gRPC code as mentioned in the comment above. I'll continue to check which component modifies the dial function in ClientConn after Thanksgiving.

@scrwr

scrwr commented Dec 2, 2020

For the sake of priority: this hit us hard these days too, as security updates for containerd triggered a restart and a ton of nodes flooded the central log management. ;)

Thanks for working hard on a fix.

@moohdyy

moohdyy commented Dec 2, 2020

Same here. Two systems were affected by automatic containerd upgrades and restarts.
But now I wonder whether "unattended-upgrades" (Ubuntu 18.04) is generally a wise thing to do, or whether the OS for Kubernetes deployments should always be updated manually and in a controlled way. I cannot find best practices for this in the documentation.

@pacoxu
Member

pacoxu commented Dec 2, 2020

After the recent CVE-2020-15257, many users will upgrade and restart containerd. This would be a big problem.

@hanlins
Member

hanlins commented Dec 2, 2020

I see, I will prioritize the root cause and fix. In the meantime, you need to restart kubelet after you restart containerd.

@hanlins
Member

hanlins commented Dec 3, 2020

Had another round of debugging and made one step towards the root cause. I added the following snippet to the original hack mentioned in my previous comment:

// GetFuncInfo reports the source file, line, and name of the function value i,
// so the concrete dialer being used can be identified from the logs.
func GetFuncInfo(i interface{}) string {
	f := runtime.FuncForPC(reflect.ValueOf(i).Pointer())
	file, line := f.FileLine(f.Entry())
	name := f.Name()
	return fmt.Sprintf("File: %s#%d, Name: %s", file, line, name)
}
...
func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error), addr string) (net.Conn, error) {
	if fn != nil {
		// Log which dialer function is about to be called before using it.
		grpclog.Warningf("#DEBUG!!! DIALER info: %s", GetFuncInfo(fn))
		return fn(ctx, addr)
	}
	grpclog.Warningf("#DEBUG!!! DIALER fn is NIL!!!")
	return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
}

When I reproduced the error, I got the following information for the dialer function:

Dec 03 01:16:32 tkg-mgmt-vsphere-20201202142246-control-plane-25mbz kubelet[1481]: W1203 01:16:30.535400    1481 http2_client.go:198] #DEBUG!!! DIALER info: File: /Users/hanlins/goproj/src/github.com/kubernetes/kubernetes/vendor/github.com/containerd/containerd/pkg/dialer/dialer.go#33, Name: github.com/containerd/containerd/pkg/dialer.ContextDialer

That leads to cAdvisor, which is the only component in the repo that uses dialer.ContextDialer:

➜  kubernetes (master) ✗ ag -s "\.ContextDialer"
vendor/github.com/google/cadvisor/container/containerd/client.go
75:			grpc.WithContextDialer(dialer.ContextDialer),

So I'm convinced that this is related to cAdvisor and will update the thread once I've made further progress.
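For reference, the wiring in that client looks roughly like the sketch below: containerd's dialer helpers are handed to gRPC via grpc.WithContextDialer. This is a simplified illustration, not the verbatim cAdvisor code; the timeout and address handling are examples:

package main

import (
	"context"
	"time"

	"github.com/containerd/containerd/pkg/dialer"
	"google.golang.org/grpc"
)

// connectContainerd sketches how a gRPC connection to containerd can be set up
// with containerd's dialer helpers, similar in spirit to cAdvisor's containerd client.
func connectContainerd(address string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, dialer.DialAddress(address),
		grpc.WithInsecure(),
		grpc.WithBlock(),
		// dialer.ContextDialer is the function whose address showed up in the debug log above.
		grpc.WithContextDialer(dialer.ContextDialer),
	)
}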

@hanlins
Member

hanlins commented Dec 3, 2020

EOD updates:

  1. The issue is indeed caused by cAdvisor; I reproduced the issue in this repo.
  2. The confusing timeout in the log should be this one.
  3. The root cause might be a bug in cAdvisor where a context whose deadline has already been exceeded is reused (see the sketch at the end of this comment). To validate that, I need to explore all methods of this client:
type client struct {
	containerService containersapi.ContainersClient
	taskService      tasksapi.TasksClient
	versionService   versionapi.VersionClient
}

Will work on the cAdvisor fix for the rest of the week.
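To illustrate what point 3 means, here is a minimal, purely illustrative sketch of the suspected pattern: a context created with a deadline is stored and reused for later calls, so once the deadline has passed every redial fails immediately with a timeout, no matter how healthy the endpoint is:

package main

import (
	"context"
	"fmt"
	"time"
)

// doCall stands in for any gRPC dial or request that honours the context deadline.
func doCall(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the call itself takes 50ms
		return nil
	case <-ctx.Done():
		return ctx.Err() // fires immediately once the stored deadline has passed
	}
}

func main() {
	// Buggy pattern: a context with a deadline is created once and kept around.
	storedCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	fmt.Println("first call:", doCall(storedCtx)) // succeeds, deadline not yet reached

	time.Sleep(3 * time.Second) // e.g. containerd restarts some time later

	// Every retry that reuses the stored context now returns
	// "context deadline exceeded" without ever reaching the socket.
	fmt.Println("retry with stale context:", doCall(storedCtx))

	// Correct pattern: derive a fresh per-call context instead.
	freshCtx, cancelFresh := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelFresh()
	fmt.Println("retry with fresh context:", doCall(freshCtx))
}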

@bobbypage
Member

bobbypage commented Dec 10, 2020

Fix was made in google/cadvisor#2749

Copying from google/cadvisor#2749 (comment) to ensure we can capture it on the issue here:

I've cut new cAdvisor releases (v0.37.3 and v0.38.6) with this fix cherrypicked.

We should get the cAdvisor fix back into kubernetes. We'll need 3 PRs to k/k with the following updates:

@andrewsykim
Member

Thanks @hanlins for fixing this one, definitely a tricky one to track down.

@hanlins
Member

hanlins commented Dec 11, 2020

Thanks for all the patience and team effort on getting this issue fixed. Credit also to @bobbypage for the guidance and discussions; I wouldn't have been able to get it done without your help!

@bobbypage
Member

bobbypage commented Dec 21, 2020

The fix was applied on master as well as cherrypicked into v1.20.1 and v1.19.6 releases.

@SergeyKanzhelev
Member

The fix was applied on master as well as cherrypicked into v1.20.1 and v1.19.6 releases.

I remember you told me, but I forgot. Does this issue exist in 1.18 and 1.17?

@bobbypage
Member

I don't believe 1.18 or 1.17 is affected. I think this issue was introduced by google/cadvisor#2513, which was included in the cAdvisor v0.37 release. cAdvisor v0.37 was vendored into k8s 1.19.
