Containers appear to be crashing with fork/exec resource unavailable #239

Closed
dcherniv opened this issue Apr 1, 2019 · 14 comments

dcherniv commented Apr 1, 2019

What happened:
On one particular node that runs two RabbitMQ containers belonging to two different RabbitMQ clusters, the containers appear to be crashing with the error in the title.
What you expected to happen:
The containers to run properly
How to reproduce it (as minimally and precisely as possible):

helm install -n rabbit-1 stable/rabbitmq-ha
helm install -n rabbit-2 stable/rabbitmq-ha

Connect some clients to both clusters that create 5-10 queues; ping/pong-type commands should do (see the sketch after these steps).
Wait for a day or two.
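
For illustration, a minimal sketch of the kind of client traffic described above, assuming the pod is named rabbit-1-rabbitmq-ha-0 (the name depends on your release) and that the management plugin's rabbitmqadmin CLI and default credentials are available in the image:

# Hypothetical load generator: declare a queue and push/pull small "ping"
# messages against the rabbit-1 release; repeat for rabbit-2 and a few more
# queue names. Add -u/-p to rabbitmqadmin if the chart's default user differs.
kubectl exec rabbit-1-rabbitmq-ha-0 -- rabbitmqadmin declare queue name=ping-test
kubectl exec rabbit-1-rabbitmq-ha-0 -- rabbitmqadmin publish exchange=amq.default routing_key=ping-test payload="ping"
kubectl exec rabbit-1-rabbitmq-ha-0 -- rabbitmqadmin get queue=ping-test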

kubectl get pods -o yaml -n demo demo-rabbitmq-0
[...]
        message: 'failed to start shim: fork/exec /usr/bin/docker-containerd-shim:
          resource temporarily unavailable: unknown'
        reason: ContainerCannotRun

Anything else we need to know?:

 journalctl -u docker | tail -n 200
[...]
Mar 31 19:15:32 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:32Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/f5537aadb449b48a2e3f3e75cc8e01d9b2b11139fe172bb8d30d0fec1cd58ed1/shim.sock" debug=false pid=17287
Mar 31 19:15:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:46Z" level=info msg="shim reaped" id=cc95a7934f7203251394599bb8800700b595e7457998b34e38b393f6c78d6e9b
Mar 31 19:15:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:46.837885110Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 31 19:15:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:46.875961539Z" level=warning msg="Cannot kill container cc95a7934f7203251394599bb8800700b595e7457998b34e38b393f6c78d6e9b: unknown error after kill: fork/exec /usr/bin/docker-runc: resource temporarily unavailable: : unknown"
Mar 31 19:15:47 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:47Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/b48df07b3f3c9ade5a4ac8bb853b21edf3ae1b9f6a327479f8319c283cffaf1a/shim.sock" debug=false pid=18696
Mar 31 19:16:26 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:16:26Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/9ae4aff4045752cb655114caa38c086916dfc04f8af450a08e5086ce626d7245/shim.sock" debug=false pid=21630
Mar 31 21:27:21 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T21:27:21.408313532Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 21:27:21 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T21:27:21.411989927Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 21:27:21 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T21:27:21.425543549Z" level=error msg="Error running exec e9c4cc3e1d5cc42356399896c12a8887f8fc7b940c5b70e29d226a3dfbc8fc7b in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused \"process_linux.go:90: adding pid 30275 to cgroups caused \\\"failed to write 30275 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/poddd27e9d8-51b1-11e9-871b-0e31a129cd1a/9ae4aff4045752cb655114caa38c086916dfc04f8af450a08e5086ce626d7245/cgroup.procs: invalid argument\\\"\": unknown"
Mar 31 23:03:05 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:03:05.622015961Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 23:03:05 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:03:05.622443440Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 23:03:05 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:03:05.636902513Z" level=error msg="Error running exec 28ee970a77964be05eaf971008ef6ae10198f7f205cc5e752256968c773c32cd in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused \"process_linux.go:90: adding pid 18681 to cgroups caused \\\"failed to write 18681 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podad780255-51b1-11e9-871b-0e31a129cd1a/b48df07b3f3c9ade5a4ac8bb853b21edf3ae1b9f6a327479f8319c283cffaf1a/cgroup.procs: invalid argument\\\"\": unknown"
Mar 31 23:33:44 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:33:44Z" level=info msg="shim reaped" id=642c774c1d4a6c656dd5da88e436aebc59bc65cd3fb2ae6a4b07589b04f655c1
Mar 31 23:33:44 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:33:44.672119161Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 31 23:33:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:33:46Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/9ffdd43b4b3d92982107f6a38369bef5c24ce0c659e666d6c571ea0237b8b48f/shim.sock" debug=false pid=5037
[root@ip-10-128-95-215 ~]# cat /lib/systemd/system/docker.service | grep -v "#"
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/run/docker/runtimes.env
ExecStartPre=/bin/mkdir -p /run/docker
ExecStartPre=/usr/libexec/docker/docker-setup-runtimes.sh
ExecStart=/usr/bin/dockerd $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_ADD_RUNTIMES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target
[root@ip-10-128-95-215 ~]# cat /lib/systemd/system/docker.service | grep Tasks
# Uncomment TasksMax if your systemd version supports it.
#TasksMax=infinity
[root@ip-10-128-95-215 ~]# 
[root@ip-10-128-95-215 ~]# systemctl status docker | grep Tasks
    Tasks: 734
[root@ip-10-128-95-215 ~]# 
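
(A sketch of additional limit checks, not from the original report, that could show which task/PID ceiling is actually being hit; the TasksMax property assumes a systemd version with tasks accounting, and the cgroup path is an assumption.)

# Effective tasks limit for the docker unit, if supported
systemctl show docker -p TasksMax
# Kernel-wide PID and thread ceilings
sysctl kernel.pid_max kernel.threads-max
# pids-cgroup limit for the kubepods hierarchy, if the pids controller is mounted
cat /sys/fs/cgroup/pids/kubepods/pids.max 2>/dev/null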

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): c5.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): 1.11 "eks.2"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.11 eks.2
  • AMI Version: amazon-eks-node-1.11-v20190327 (ami-05fe3f841ac4df3bb)
  • Kernel (e.g. uname -a):
[root@ip-10-128-95-215 ~]# uname -a
Linux ip-10-128-95-215.ec2.internal 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-10-128-95-215 ~]# 
  • Release information (run cat /tmp/release on a node):
[root@ip-10-128-95-215 ~]# cat /tmp/release
cat: /tmp/release: No such file or directory
[root@ip-10-128-95-215 ~]# 

bnutt commented Apr 8, 2019

Experiencing a similar issue with Redis, on the same EKS AMI and platform version. It ends up crashing other services in my cluster that depend on it after approximately 2 days.

@james-ingold

Getting a similar issue in EKS 1.12 trying to exec into any pods:
kubectl exec -it {podName} -- /bin/sh
failed to create runc console socket: stat /tmp: no such file or directory: unknown
command terminated with exit code 126

us-east-1
ami-0abcb9f9190e867ab

@james-ingold

Not sure if this will help your issue, but I restarted the kubelet on each node and was then able to run an exec, which fixed my problem.
sudo systemctl restart kubelet

bnutt commented Apr 24, 2019

Seeing this on every cluster I have that uses 1.12:

  State:          Waiting
      Reason:       RunContainerError
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to start shim: fork/exec /usr/bin/docker-containerd-shim: resource temporarily unavailable: unknown
      Exit Code:    128
      Started:      Wed, 24 Apr 2019 09:08:36 -0700
      Finished:     Wed, 24 Apr 2019 09:08:36 -0700
    Ready:          False

Environment:

AWS Region: us-east-1
Instance Type(s): m5.2xlarge
EKS Platform version (use aws eks describe-cluster --name --query cluster.platformVersion): 1.12 "eks.1"
Kubernetes version (use aws eks describe-cluster --name --query cluster.version): 1.12 eks.1
AMI Version: amazon-eks-node-1.12-v20190329 (ami-0abcb9f9190e867ab)
Kernel (e.g. uname -a): Can't get this because the node's ssh process is dead due to the issue

shshe commented Jun 28, 2019

I'm getting this too after a node has been running for ~2 days. New containers fail to start, and processes seem to have issues starting as well.

 Error: failed to start container "...": Error response from daemon: failed to start shim: fork/exec /usr/bin/docker-containerd-shim: resource temporarily unavailable: unknown
  Warning  BackOff    7s (x2 over 13s)   kubelet, ip-172-31-100-123.ec2.internal  Back-off

Running AMI: amazon-eks-node-1.12-v20190614 (ami-0200e65a38edfb7e1)

@whereisaaron

One possible cause is running out of process IDs. Check that you don't have 40,000 defunct processes or similar on the nodes with problems.
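
A quick way to run that check on an affected node, as a sketch:

# Count defunct (zombie) processes
ps -eo stat | grep -c '^Z'
# Group zombies by parent PID to see which process is leaving them behind
ps -eo ppid=,stat= | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head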

@tecnobrat

We experienced this on Thursday of last week. It definitely feels like it's running out of process IDs, likely due to some fork-bomb-style bug.

Unfortunately, when it gets into this state we can't even SSH onto the node to find out what is going on.

@tecnobrat

The node in question for us happened to be running coredns and tiller-deploy, which don't run on every node in the cluster, so I wanted to point this out as a possible breadcrumb.

Did anyone else notice whether specific things were running on that node?

shshe commented Jul 8, 2019

@whereisaaron Thanks for the tip! Our nodes were indeed being filled with defunct processes that were started by a Kubernetes health check command to check the status of our Celery workers. We've removed the check for now.

@tecnobrat

@shshe do you have more information on that? What sort of checks were they?

pawelprazak commented Nov 5, 2019

I can confirm this issue is still happening from time to time on v1.13.11-eks-5876d6, docker://18.6.1.

The affected container this time was istio-sidecar-injector.

shshe commented Nov 5, 2019

> @shshe do you have more information on that? What sort of checks were they?

Sorry for the really late response. Our defunct processes were being spawned when we ran celery inspect ping as part of our liveness probe. Here's the related issue in the celery project:

celery/celery#4079 (comment)
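
For anyone hitting the same pattern, a hypothetical sketch of the kind of probe command involved and a way to confirm it is the source of the zombies (the app module "myapp" and the pod name are assumptions, not from this thread):

# Exec-style liveness probe command of the kind described above; each run
# forks worker subprocesses that may never be reaped.
celery -A myapp inspect ping
# Inside the worker pod, defunct children from past probe runs show up like this
# (assumes ps is present in the image):
kubectl exec my-celery-worker-0 -- ps -ef | grep '[d]efunct'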

mogren commented Aug 16, 2020

@dcherniv @shshe Is this still an issue, or can we close the issue?

shshe commented Aug 16, 2020

> @dcherniv @shshe Is this still an issue, or can we close the issue?

Yes, you can close it. I'm not experiencing this anymore since removing the ping command. I believe the zombie processes were spawned by celery and were unrelated to the AMI.

mogren closed this as completed Aug 17, 2020