Containers appear to be crashing with fork/exec resource unavailable #239
Comments
Experiencing a similar issue with Redis, on the same version of the EKS AMI and EKS platform. It ends up crashing other services in my cluster that depend on it after approximately 2 days.
Getting a similar issue in EKS 1.12 when trying to exec into any pods (us-east-1).
Not sure if this will help your issue, but I restarted the kubelet on each node and was then able to run an exec, which fixed my problem.
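For reference, on the stock EKS AMI the kubelet runs as a systemd service, so the per-node fix above looks roughly like this (a minimal sketch, assuming the standard unit name kubelet; run it on each affected node via SSH or SSM):

sudo systemctl restart kubelet                # restart the kubelet service
sudo systemctl status kubelet --no-pager      # confirm it is active again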
Seeing this on every cluster I have that uses 1.12.
Environment: AWS Region: us-east-1
I'm getting this too after ~2 days of a node running. New containers fail to start, and processes seem to have issues starting as well.
Running AMI: amazon-eks-node-1.12-v20190614 (ami-0200e65a38edfb7e1)
One possible cause is running out of process IDs. Check that you don't have 40,000 defunct processes or similar on the nodes with problems.
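A quick way to check is to count processes in the zombie ("defunct") state; this is a minimal sketch using the standard procps ps (the Z state flag marks zombies):

ps -eo stat | grep -c '^Z'    # number of zombie processes on the node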
We experienced this on Thursday of last week; it definitely feels like it's running out of process IDs, likely due to some fork-bomb-style bug. Unfortunately, when it gets into this state we can't even SSH onto the node to find out what is going on.
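If a node is still reachable, comparing the number of allocated PIDs against the kernel limit gives a rough sense of how close it is to exhaustion (a sketch using the standard /proc interfaces):

cat /proc/sys/kernel/pid_max     # kernel-wide PID limit
ls /proc | grep -c '^[0-9]'      # approximate count of PIDs currently in use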
The node in question for us happened to be running [...]. Did anyone else notice whether there were specific things running on that node?
@whereisaaron Thanks for the tip! Our nodes were indeed being filled with defunct processes that were started by a Kubernetes health check command used to check the status of our Celery workers. We've removed the check for now.
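To track down which process is leaking zombies (for example, an exec-style health check whose children are never reaped), grouping defunct processes by parent PID usually points at the culprit. A sketch using standard ps/awk; the <ppid> placeholder is whatever parent PID the first command surfaces:

ps -eo ppid,stat | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn    # zombies grouped by parent PID
ps -p <ppid> -o pid,cmd                                                      # inspect the most prolific parent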
@shshe do you have more information on that? What sort of checks were they?
I can confirm this issue is still happening from time to time on the affected container, this time was
Sorry for the really late response. Our defunct processes were being spawned when we ran
What happened:
On one particular node that runs two RabbitMQ containers connected to two different RabbitMQ clusters, the RabbitMQ containers appear to be crashing with the fork/exec error above.
What you expected to happen:
The containers to run properly
How to reproduce it (as minimally and precisely as possible):
Connect some clients that create 5-10 queues to both clusters; ping/pong-type commands should do.
Wait for a day or two.
Anything else we need to know?:
Environment:
Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.11
Platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.2"
AMI: amazon-eks-node-1.11-v20190327 (ami-05fe3f841ac4df3bb)
Kernel (uname -a):
Release information (cat /tmp/release on a node):