Containers appear to be crashing with fork/exec resource unavailable #239

Closed
dcherniv opened this issue Apr 1, 2019 · 14 comments

dcherniv commented Apr 1, 2019

What happened:
On one particular node that runs two RabbitMQ containers belonging to two different RabbitMQ clusters, the containers appear to be crashing with the error in the title.
What you expected to happen:
The containers to run properly
How to reproduce it (as minimally and precisely as possible):

helm install -n rabbit-1 stable/rabbitmq-ha
helm install -n rabbit-2 stable/rabbitmq-ha

Connect some clients to both clusters that create 5-10 queues; ping/pong-type commands should do (see the sketch after these steps).
Wait for a day or two.
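
For illustration, a minimal sketch of the kind of client traffic described above, assuming the pod is named rabbit-1-rabbitmq-ha-0 (the name depends on your release) and that the management plugin's rabbitmqadmin CLI and default credentials are available in the image:

# Hypothetical load generator: declare a queue and push/pull small "ping"
# messages against the rabbit-1 release; repeat for rabbit-2 and a few more
# queue names. Add -u/-p to rabbitmqadmin if the chart's default user differs.
kubectl exec rabbit-1-rabbitmq-ha-0 -- rabbitmqadmin declare queue name=ping-test
kubectl exec rabbit-1-rabbitmq-ha-0 -- rabbitmqadmin publish exchange=amq.default routing_key=ping-test payload="ping"
kubectl exec rabbit-1-rabbitmq-ha-0 -- rabbitmqadmin get queue=ping-test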

kubectl get pods -o yaml -n demo demo-rabbitmq-0
[...]
        message: 'failed to start shim: fork/exec /usr/bin/docker-containerd-shim:
          resource temporarily unavailable: unknown'
        reason: ContainerCannotRun

Anything else we need to know?:

 journalctl -u docker | tail -n 200
[...]
Mar 31 19:15:32 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:32Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/f5537aadb449b48a2e3f3e75cc8e01d9b2b11139fe172bb8d30d0fec1cd58ed1/shim.sock" debug=false pid=17287
Mar 31 19:15:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:46Z" level=info msg="shim reaped" id=cc95a7934f7203251394599bb8800700b595e7457998b34e38b393f6c78d6e9b
Mar 31 19:15:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:46.837885110Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 31 19:15:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:46.875961539Z" level=warning msg="Cannot kill container cc95a7934f7203251394599bb8800700b595e7457998b34e38b393f6c78d6e9b: unknown error after kill: fork/exec /usr/bin/docker-runc: resource temporarily unavailable: : unknown"
Mar 31 19:15:47 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:15:47Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/b48df07b3f3c9ade5a4ac8bb853b21edf3ae1b9f6a327479f8319c283cffaf1a/shim.sock" debug=false pid=18696
Mar 31 19:16:26 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T19:16:26Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/9ae4aff4045752cb655114caa38c086916dfc04f8af450a08e5086ce626d7245/shim.sock" debug=false pid=21630
Mar 31 21:27:21 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T21:27:21.408313532Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 21:27:21 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T21:27:21.411989927Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 21:27:21 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T21:27:21.425543549Z" level=error msg="Error running exec e9c4cc3e1d5cc42356399896c12a8887f8fc7b940c5b70e29d226a3dfbc8fc7b in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused \"process_linux.go:90: adding pid 30275 to cgroups caused \\\"failed to write 30275 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/poddd27e9d8-51b1-11e9-871b-0e31a129cd1a/9ae4aff4045752cb655114caa38c086916dfc04f8af450a08e5086ce626d7245/cgroup.procs: invalid argument\\\"\": unknown"
Mar 31 23:03:05 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:03:05.622015961Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 23:03:05 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:03:05.622443440Z" level=error msg="stream copy error: reading from a closed fifo"
Mar 31 23:03:05 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:03:05.636902513Z" level=error msg="Error running exec 28ee970a77964be05eaf971008ef6ae10198f7f205cc5e752256968c773c32cd in container: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused \"process_linux.go:90: adding pid 18681 to cgroups caused \\\"failed to write 18681 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podad780255-51b1-11e9-871b-0e31a129cd1a/b48df07b3f3c9ade5a4ac8bb853b21edf3ae1b9f6a327479f8319c283cffaf1a/cgroup.procs: invalid argument\\\"\": unknown"
Mar 31 23:33:44 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:33:44Z" level=info msg="shim reaped" id=642c774c1d4a6c656dd5da88e436aebc59bc65cd3fb2ae6a4b07589b04f655c1
Mar 31 23:33:44 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:33:44.672119161Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 31 23:33:46 ip-10-128-95-215.ec2.internal dockerd[3601]: time="2019-03-31T23:33:46Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/9ffdd43b4b3d92982107f6a38369bef5c24ce0c659e666d6c571ea0237b8b48f/shim.sock" debug=false pid=5037
[root@ip-10-128-95-215 ~]# cat /lib/systemd/system/docker.service | grep -v "#"
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/run/docker/runtimes.env
ExecStartPre=/bin/mkdir -p /run/docker
ExecStartPre=/usr/libexec/docker/docker-setup-runtimes.sh
ExecStart=/usr/bin/dockerd $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_ADD_RUNTIMES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target
[root@ip-10-128-95-215 ~]# cat /lib/systemd/system/docker.service | grep Tasks
# Uncomment TasksMax if your systemd version supports it.
#TasksMax=infinity
[root@ip-10-128-95-215 ~]# 
[root@ip-10-128-95-215 ~]# systemctl status docker | grep Tasks
    Tasks: 734
[root@ip-10-128-95-215 ~]# 
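
(A sketch of additional limit checks, not from the original report, that could show which task/PID ceiling is actually being hit; the TasksMax property assumes a systemd version with tasks accounting, and the cgroup path is an assumption.)

# Effective tasks limit for the docker unit, if supported
systemctl show docker -p TasksMax
# Kernel-wide PID and thread ceilings
sysctl kernel.pid_max kernel.threads-max
# pids-cgroup limit for the kubepods hierarchy, if the pids controller is mounted
cat /sys/fs/cgroup/pids/kubepods/pids.max 2>/dev/null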

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): c5.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): 1.11 "eks.2"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.11 eks.2
  • AMI Version: amazon-eks-node-1.11-v20190327 (ami-05fe3f841ac4df3bb)
  • Kernel (e.g. uname -a):
[root@ip-10-128-95-215 ~]# uname -a
Linux ip-10-128-95-215.ec2.internal 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-10-128-95-215 ~]# 
  • Release information (run cat /tmp/release on a node):
[root@ip-10-128-95-215 ~]# cat /tmp/release
cat: /tmp/release: No such file or directory
[root@ip-10-128-95-215 ~]# 

bnutt commented Apr 8, 2019

Experiencing a similar issue with Redis, on the same EKS AMI and platform version. It ends up crashing other services in my cluster that depend on it after approximately 2 days.

@james-ingold

Getting a similar issue in EKS 1.12 trying to exec into any pods:
kubectl exec -it {podName} -- /bin/sh
failed to create runc console socket: stat /tmp: no such file or directory: unknown
command terminated with exit code 126

us-east-1
ami-0abcb9f9190e867ab

@james-ingold

Not sure if this will help your issue, but I restarted the kubelet on each node and was then able to run an exec, which fixed my problem.
sudo systemctl restart kubelet

bnutt commented Apr 24, 2019

Seeing this on every cluster I have that uses 1.12:

  State:          Waiting
      Reason:       RunContainerError
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      failed to start shim: fork/exec /usr/bin/docker-containerd-shim: resource temporarily unavailable: unknown
      Exit Code:    128
      Started:      Wed, 24 Apr 2019 09:08:36 -0700
      Finished:     Wed, 24 Apr 2019 09:08:36 -0700
    Ready:          False

Environment:

AWS Region: us-east-1
Instance Type(s): m5.2xlarge
EKS Platform version (use aws eks describe-cluster --name --query cluster.platformVersion): 1.12 "eks.1"
Kubernetes version (use aws eks describe-cluster --name --query cluster.version): 1.12 eks.1
AMI Version: amazon-eks-node-1.12-v20190329 (ami-0abcb9f9190e867ab)
Kernel (e.g. uname -a): Can't get this because the node's ssh process is dead due to the issue

shshe commented Jun 28, 2019

I'm getting this too after a node has been running for ~2 days. New containers fail to start, and processes seem to have issues starting as well.

 Error: failed to start container "...": Error response from daemon: failed to start shim: fork/exec /usr/bin/docker-containerd-shim: resource temporarily unavailable: unknown
  Warning  BackOff    7s (x2 over 13s)   kubelet, ip-172-31-100-123.ec2.internal  Back-off

Running AMI: amazon-eks-node-1.12-v20190614 (ami-0200e65a38edfb7e1)

@whereisaaron

One possible cause is running out of process IDs. Check that you don't have 40,000 defunct processes or similar on the nodes with problems.
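
A quick way to run that check on an affected node, as a sketch:

# Count defunct (zombie) processes
ps -eo stat | grep -c '^Z'
# Group zombies by parent PID to see which process is leaving them behind
ps -eo ppid=,stat= | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head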

@tecnobrat

We experienced this on Thursday of last week. It definitely feels like it's running out of process IDs, likely due to some fork-bomb-style bug.

Unfortunately, when it gets into this state we can't even SSH onto the node to find out what is going on.

@tecnobrat

The node in question for us happened to be running coredns and tiller-deploy, which don't run on every node in the cluster, so I wanted to point this out as a possible breadcrumb.

Did anyone else notice whether specific things were running on that node?

shshe commented Jul 8, 2019

@whereisaaron Thanks for the tip! Our nodes were indeed being filled with defunct processes that were started by a Kubernetes health check command to check the status of our Celery workers. We've removed the check for now.

@tecnobrat

@shshe do you have more information on that? What sort of checks were they?

pawelprazak commented Nov 5, 2019

I can confirm this issue is still happening from time to time on v1.13.11-eks-5876d6, docker://18.6.1.

The affected container this time was istio-sidecar-injector.

shshe commented Nov 5, 2019

> @shshe do you have more information on that? What sort of checks were they?

Sorry for the really late response. Our defunct processes were being spawned when we ran celery inspect ping as part of our liveness probe. Here's the related issue in the celery project:

celery/celery#4079 (comment)
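
For anyone hitting the same pattern, a hypothetical sketch of the kind of probe command involved and a way to confirm it is the source of the zombies (the app module "myapp" and the pod name are assumptions, not from this thread):

# Exec-style liveness probe command of the kind described above; each run
# forks worker subprocesses that may never be reaped.
celery -A myapp inspect ping
# Inside the worker pod, defunct children from past probe runs show up like this
# (assumes ps is present in the image):
kubectl exec my-celery-worker-0 -- ps -ef | grep '[d]efunct'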

mogren commented Aug 16, 2020

@dcherniv @shshe Is this still an issue, or can we close the issue?

shshe commented Aug 16, 2020

> @dcherniv @shshe Is this still an issue, or can we close the issue?

Yes, you can close it. I'm not experiencing this anymore since removing the ping command. I believe the zombie processes were spawned by celery and were unrelated to the AMI.

mogren closed this as completed Aug 17, 2020