Spot interrupt taint/label/annotation on node #6103

Open
stijndehaes opened this issue Apr 26, 2024 · 6 comments

@stijndehaes
Contributor

Description

What problem are you trying to solve?

When a node is being shut down because of a spot interruption, I want to be able to detect that from my pod. That way we can report the correct reason why the pod was shut down.
Currently we use the aws-node-termination-handler, which adds different taints depending on why the node is being shut down. I would love to switch to Karpenter's spot interruption handling; however, this missing feature is blocking that move.
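
For context, a minimal sketch in Go of the kind of taint check this enables; the taint key below is an assumption for illustration and depends on the NTH version and configuration:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// hasSpotInterruptTaint reports whether a node carries an NTH-style spot
// interruption taint. The key below is an assumption for illustration;
// check the taint keys your NTH deployment actually applies.
func hasSpotInterruptTaint(node *corev1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == "aws-node-termination-handler/spot-itn" {
			return true
		}
	}
	return false
}

func main() {
	node := &corev1.Node{} // in practice, fetched from the API server
	fmt.Println(hasSpotInterruptTaint(node))
}
```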

How important is this feature to you?

This feature is very important: providing this visibility to users is key for the platform we are building.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@stijndehaes added the feature (New feature or request) and needs-triage (Issues that need to be triaged) labels on Apr 26, 2024
@stijndehaes
Contributor Author

I am willing to work on this myself, as I have experience writing Go and building Kubernetes operators. It could be that extra support needs to be added to upstream Karpenter, but I am not sure what the best architecture would be.

@engedaam removed the needs-triage (Issues that need to be triaged) label on May 3, 2024
@engedaam
Contributor

engedaam commented May 3, 2024

Would it be enough for Karpenter to fire metrics on the nodes that were interrupted?

@stijndehaes
Contributor Author

> Would it be enough for Karpenter to fire metrics on the nodes that were interrupted?

Sadly, for our use case it doesn't. What we currently do is this: when a pod is being shut down, we look at its node to see whether a spot interruption is in progress. If Karpenter only fires metrics, there is no easy way to query this interactively. Currently the pod logs whether there is a spot interruption; with metrics we would need another way to visualise it.
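
Concretely, the lookup at shutdown looks roughly like this hypothetical sketch (client-go from inside the pod; NODE_NAME is assumed to be injected via the Downward API, and the taint key is again an assumption):

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Intended to run inside the pod, e.g. from a SIGTERM or preStop handler.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// NODE_NAME is assumed to be injected via the Downward API (spec.nodeName).
	nodeName := os.Getenv("NODE_NAME")
	node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	for _, taint := range node.Spec.Taints {
		// Assumed NTH taint key; a Karpenter equivalent is what this issue asks for.
		if taint.Key == "aws-node-termination-handler/spot-itn" {
			fmt.Println("pod is shutting down because of a spot interruption")
		}
	}
}
```

The same check would work against whatever taint, label, or annotation Karpenter ends up applying.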

@jonathan-innis
Contributor

> What we currently do is this: when a pod is being shut down, we look at its node to see whether a spot interruption is in progress.

What about Kubernetes events? We also fire an event here alongside the metric. I'm skeptical of changing our tainting logic to support an observability use case. What if we added a condition to the NodeClaim? Would that be enough to satisfy the observability use case?
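
For illustration, events attached to a node can be listed with a sketch like the one below; the field selector is standard client-go usage, but the reason/message strings Karpenter emits are not shown here, so a consumer would filter on whatever it actually fires:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "ip-10-0-0-1.eu-west-1.compute.internal" // hypothetical node name
	// Events for Node objects are recorded in the "default" namespace.
	events, err := clientset.CoreV1().Events("default").List(context.Background(), metav1.ListOptions{
		FieldSelector: "involvedObject.kind=Node,involvedObject.name=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		// Filter on the reason/message Karpenter actually emits for interruptions.
		fmt.Printf("%s: %s\n", e.Reason, e.Message)
	}
}
```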

@jonathan-innis jonathan-innis self-assigned this May 13, 2024
@stijndehaes
Contributor Author

> What about Kubernetes events? We also fire an event here alongside the metric. I'm skeptical of changing our tainting logic to support an observability use case. What if we added a condition to the NodeClaim? Would that be enough to satisfy the observability use case?

I didn't notice that there are Kubernetes events about disruption; I could use that!
A condition on the NodeClaim would be better, but I will see how far I can get with the events to start with.

I've closed the PR for now; I can always open a new one for the NodeClaim condition. I will look at that later this week and make a proposal here :)

@stijndehaes
Contributor Author

@jonathan-innis what do you think?

The new condition could look like this:

```yaml
conditions:
- lastTransitionTime: "2024-05-10T00:05:07Z"
  status: "True"
  type: Interrupted
  reason: SpotInterrupt
```

In the reason field we record why the node was interrupted: SpotInterrupt, ScheduledChange, ....
The type could just be Interrupted.

Would this new condition type need to be added to the upstream Karpenter project, or could we add it in the provider-aws implementation?
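
If the condition lands roughly as proposed, a consumer could read it without depending on Karpenter's Go types by using the dynamic client, as in this hypothetical sketch; the karpenter.sh/v1beta1 group/version mirrors the current NodeClaim API, and the NodeClaim name is made up:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// NodeClaims are cluster-scoped resources in the karpenter.sh API group.
	gvr := schema.GroupVersionResource{Group: "karpenter.sh", Version: "v1beta1", Resource: "nodeclaims"}
	nc, err := client.Resource(gvr).Get(context.Background(), "my-nodeclaim", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	conditions, found, err := unstructured.NestedSlice(nc.Object, "status", "conditions")
	if err != nil || !found {
		return
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		// "Interrupted" is the condition type proposed in this issue, not an
		// existing Karpenter condition.
		if cond["type"] == "Interrupted" && cond["status"] == "True" {
			fmt.Printf("node interrupted, reason: %v\n", cond["reason"])
		}
	}
}
```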
