The cpu of kubelet is always 100% after containerd.service restart #95727
Comments
@lisongmin: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
/assign
Reproduce:
Will continue with the investigation, might need to reconnect to
To me it seems there are a couple of problems here. 1. kubelet does not reconnect to the containerd socket. 2. The reconnect is aggressive and consumes a lot of CPU. I don't know the cause for 1, but for 2 this might be an indication that the retry interval should be relaxed.
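To illustrate what a relaxed retry interval could look like, here is a minimal, hypothetical sketch of a reconnect loop with exponential backoff. The function name, constants, and socket handling are illustrative only and are not kubelet's actual reconnect code:

```go
package reconnect

import (
	"context"
	"log"
	"net"
	"time"
)

// reconnectWithBackoff repeatedly tries to dial a unix socket, doubling the
// wait between attempts up to a cap so a dead endpoint does not spin the CPU.
// Purely illustrative; not kubelet's implementation.
func reconnectWithBackoff(ctx context.Context, socket string) (net.Conn, error) {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second
	for {
		conn, err := (&net.Dialer{}).DialContext(ctx, "unix", socket)
		if err == nil {
			return conn, nil
		}
		log.Printf("dial %s failed: %v; retrying in %v", socket, err, backoff)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```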
Spent a few hours on the investigation; it's likely to be some internal
TL;DR: It should be related to gRPC. I hacked the vendored gRPC mod and changed its internal dial to something like this:

```go
func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error), addr string) (net.Conn, error) {
	if fn != nil {
		// return fn(ctx, addr)
		// Ignore the supplied custom dialer and dial the unix socket directly.
		grpclog.Warningf("#! grpc: OVERWRITE dial addr: '%s', given fn: '%#v'", addr, reflect.ValueOf(fn))
		return (&net.Dialer{}).DialContext(ctx, "unix", addr)
	}
	return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
}
```

Then the CPU only spikes for a few seconds after kubelet passes a customized dialer to the gRPC client. I logged the
As you can see, the address of the
However, once
And that function at
Anyway, I'll check with the gRPC team on this issue and see if there's already a fix on their side. If so, I can port the fix here.
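For context, this is roughly how a caller hands a custom unix-socket dialer to gRPC; the function registered here is the `fn` argument that shows up in the hacked `dial` above. This is a minimal sketch under my own assumptions about the call site, so the socket path and dial options are illustrative rather than the exact kubelet/cAdvisor code:

```go
package main

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Illustrative socket path; the real path depends on the containerd config.
	const socket = "/run/containerd/containerd.sock"

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// grpc.WithContextDialer registers the custom dialer; gRPC invokes it
	// whenever it needs a new transport connection to the target.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```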
/cc
Hi, are there any updates on this? I am experiencing what appears to be the same issue. If I restart containerd while the kubelet is running, my kubelet logs get flooded with:
Restarting the kubelet appears to be the only way to re-establish the connection with containerd. Version information is:
Hi @jr200, thanks for the info. I'm still working on the root cause. As I mentioned above, it's something related to gRPC. I collected the
We're seeing this now when we update our nodes to 1.19 where our Kata Containers install does a
Is this MR related to the problem?
No, I don't think so. I cherry-picked that change and tried it in my environment weeks back, and the issue is still there. I can only get around the issue by modifying the gRPC code as I mentioned in the comment above. I'll continue to check which component modified the
For the sake of prioritization: this hit us hard these days too, as security updates for containerd triggered a restart and a ton of nodes flooded the central log management. ;) Thanks for working hard on a fix.
Same here. Two systems were affected by automatic containerd upgrades & restarts.
After the recent CVE-2020-15257, many users will upgrade and restart containerd. It would be a big problem.
I see, will prioritize the root cause and fix. In the meantime, you need to restart kubelet as a workaround.
Had another round of debugging and made one step towards the root cause. I added the following snippet to the original hack I mentioned in a previous comment:

```go
// GetFuncInfo reports the file, line, and name of the function value passed in,
// so we can see which component registered the custom dialer.
func GetFuncInfo(i interface{}) string {
	f := runtime.FuncForPC(reflect.ValueOf(i).Pointer())
	file, line := f.FileLine(f.Entry())
	name := f.Name()
	return fmt.Sprintf("File: %s#%d, Name: %s", file, line, name)
}

...

func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error), addr string) (net.Conn, error) {
	if fn != nil {
		grpclog.Warningf("#DEBUG!!! DIALER info: %s", GetFuncInfo(fn))
		return fn(ctx, addr)
	}
	grpclog.Warningf("#DEBUG!!! DIALER fn is NIL!!!")
	return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
}
```

When I reproduced the error, I got the following information for the dialer function:
That leads to cadvisor, which is the only component that uses the
So I'm convinced that this should be related to
EOD updates:

```go
type client struct {
	containerService containersapi.ContainersClient
	taskService      tasksapi.TasksClient
	versionService   versionapi.VersionClient
}
```

Will work on the fix in cadvisor for the rest of the week.
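The general shape of the problem is that a client struct like the one above caches a gRPC connection that never recovers once containerd restarts. As a rough illustration only (not the actual cAdvisor fix in google/cadvisor#2749; the type and field names here are hypothetical), a cache could check the connection state before reusing it and re-dial when the transport is dead:

```go
package cache

import (
	"context"
	"net"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// clientCache is a hypothetical cache holding a single containerd gRPC connection.
type clientCache struct {
	mu   sync.Mutex
	conn *grpc.ClientConn
}

// get returns a healthy connection, dropping and re-dialing a cached one
// whose transport has shut down or is stuck in TransientFailure.
func (c *clientCache) get(ctx context.Context, socket string) (*grpc.ClientConn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.conn != nil {
		switch c.conn.GetState() {
		case connectivity.Shutdown, connectivity.TransientFailure:
			c.conn.Close()
			c.conn = nil // force a fresh dial below
		}
	}
	if c.conn == nil {
		conn, err := grpc.DialContext(ctx, socket,
			grpc.WithInsecure(),
			grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", addr)
			}),
		)
		if err != nil {
			return nil, err
		}
		c.conn = conn
	}
	return c.conn, nil
}
```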
The fix was made in google/cadvisor#2749. Copying from google/cadvisor#2749 (comment) to ensure we capture it on this issue: I've cut new cAdvisor releases (v0.37.3 and v0.38.6) with this fix cherry-picked. We should get the cAdvisor fix back into kubernetes. We'll need 3 PRs to k/k with the following updates:
Thanks @hanlins for fixing this one, definitely a tricky one to track down.
Thanks for all the patience and team effort on getting this issue fixed. Credits also to @bobbypage for the guidance and discussions; I wouldn't have been able to get it done without your help!
I don't believe 1.18 or 1.17 should be affected. I think this issue was introduced by google/cadvisor#2513, which was included in the cAdvisor v0.37 release. cAdvisor v0.37 was vendored into k8s 1.19.
What happened:
When I restart containerd.service, kubelet cannot connect to it again, which causes 100% CPU usage.
There are many logs like this:
I can still show container info via crictl in the meantime,
and restarting kubelet.service resolves the problem.
What you expected to happen:
kubelet should reconnect to containerd after containerd restarts.
How to reproduce it (as minimally and precisely as possible):
systemctl restart containerd
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration: armbian (debian on arm64)
- OS (e.g: cat /etc/os-release):
- Kernel (e.g: uname -a):
- Install tools: kubeadm