Scaling down "gce-mig" drains node with running batch job #571
Comments
Hi @vincenthuynh 👋 Yeah, I think it would make sense to skip nodes that have active batch jobs, since they are often time-consuming to restart. Something like […]. The problem is that we currently only allow a single […]
Hi @lgfa29, perhaps I'm misunderstanding the docs, but I'm interpreting it as […]. I'm not sure if this is related, but at one point we end up with a single node that is set to drain and ineligible, preventing any new jobs from being scheduled.
Oh sorry, I totally misunderstood your original message 😅 So yeah, […]
Hi @lgfa29, I've sent the complete autoscaler allocation logs over to […]. But here's a snippet of the logs; notice the seg violation at the end: […]
Thank you for the extra info. I got your logs in that email, so we're all good. I think the problem you're experiencing there was fixed in #539, more specifically the second point there, where we tweaked the node selection algorithm to not halt in case of a node marked as ineligible. For the […]
Hi @lgfa29, the autoscaler left us in the same situation again, where we have a single ineligible client: […]

Here are the autoscaler logs during the scale-out/in operations: […]

Logs in the autoscaler regarding the ineligible node: […]

I hope this is sufficient to investigate how it gets into this state. Thanks!
Thanks for the new info! Unfortunately I haven't been able to figure out what could be wrong yet 🤔 That […]
When scaling down, the autoscaler picks a node with a long-running batch job to drain, despite using the `empty_ignore_system` strategy. What we end up seeing is that any long-running alloc that exceeds the drain deadline gets killed.
Sample batch job:
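(The reporter's original job spec was not captured in this copy of the issue. As an illustration only, a minimal long-running batch job of this shape would reproduce the scenario; the job name, image, and sleep duration here are hypothetical.)

```hcl
job "batch-example" {
  datacenters = ["dc1"]
  type        = "batch" # batch jobs are the ones being killed by the drain

  group "work" {
    task "process" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "sh"
        # Simulates a long-running batch workload that outlives the
        # node drain deadline.
        args = ["-c", "sleep 7200"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```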
Policy:
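(The reporter's actual scaling policy was likewise lost from this copy. A sketch of a cluster scaling policy using the `gce-mig` target with the `empty_ignore_system` node selector strategy might look like the following; the policy name, check, thresholds, project, and MIG name are assumptions, not the original values.)

```hcl
scaling "cluster_policy" {
  enabled = true
  min     = 1
  max     = 10

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated" {
      source = "nomad-apm"
      query  = "percentage-allocated_cpu"

      strategy "target-value" {
        target = 70
      }
    }

    target "gce-mig" {
      project    = "my-project"   # hypothetical GCP project
      zone       = "us-central1-a"
      mig_name   = "nomad-clients" # hypothetical MIG name
      datacenter = "dc1"

      # Prefer draining empty nodes, ignoring system jobs; the issue is
      # that a node running a batch alloc is still selected.
      node_selector_strategy = "empty_ignore_system"
      node_drain_deadline    = "1h"
    }
  }
}
```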
Autoscaler version: 0.3.4
Nomad version: 1.1.6