Karpenter unable to create spot instances in a local zone #6183

Open
STollenaar opened this issue May 12, 2024 · 9 comments
Labels
bug Something isn't working

Comments


STollenaar commented May 12, 2024

Description

When trying to use a spot instance in a local zone Karpenter fails to launch one even if there is capacity for it. As an example I am using the t3.medium in the eu-north-1-cph-1a local zone.

Observed Behavior:

{"level":"ERROR","time":"2024-05-11T19:42:18.235Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"eks-cluster-prefect\", daemonset overhead={\"cpu\":\"210m\",\"memory\":\"240Mi\",\"pods\":\"6\"}, no instance type satisfied resources {\"cpu\":\"210m\",\"memory\":\"240Mi\",\"pods\":\"7\"} and requirements karpenter.sh/capacity-type In [spot], karpenter.sh/nodepool In [eks-cluster-prefect], node.kubernetes.io/instance-type In [t3.medium], topology.kubernetes.io/zone In [eu-north-1-cph-1a], worker/type In [prefect] (no instance type met the scheduling requirements or had a required offering); incompatible with nodepool \"eks-cluster-default\", daemonset overhead={\"cpu\":\"210m\",\"memory\":\"240Mi\",\"pods\":\"7\"}, incompatible requirements, key worker/type, worker/type In [prefect] not in worker/type In [worker]","commit":"a70b39e","pod":"prefect/inquisitive-carp-mwbgm-vtxj7"}

Karpenter is able to discover the spot instances:

karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1-cph-1a"} 0
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1a"} 0
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1b"} 0
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1c"} 0
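For anyone checking the same thing on their own cluster, here is a minimal sketch of reading these samples out of a metrics dump. It only assumes the Prometheus text format of the lines quoted above; the helper names (`parse_offerings`, `unavailable_zones`) are made up for illustration and are not part of Karpenter.

```python
import re

# Matches one sample of the metric quoted above: the label set in braces,
# followed by the sample value.
SAMPLE = re.compile(
    r'karpenter_cloudprovider_instance_type_offering_available'
    r'\{([^}]*)\}\s+([0-9.eE+-]+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_offerings(text):
    """Return a list of (labels_dict, value) for every sample found in text."""
    samples = []
    for labels, value in SAMPLE.findall(text):
        samples.append((dict(LABEL.findall(labels)), float(value)))
    return samples


def unavailable_zones(text, instance_type, capacity_type):
    """Zones where the given offering is reported as unavailable (value 0)."""
    return sorted(
        labels["zone"]
        for labels, value in parse_offerings(text)
        if labels.get("instance_type") == instance_type
        and labels.get("capacity_type") == capacity_type
        and value == 0
    )


# The four samples from this issue, copied verbatim.
METRICS = """
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1-cph-1a"} 0
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1a"} 0
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1b"} 0
karpenter_cloudprovider_instance_type_offering_available{capacity_type="spot",instance_type="t3.medium",zone="eu-north-1c"} 0
"""

print(unavailable_zones(METRICS, "t3.medium", "spot"))
```

With the samples above, every zone is listed as unavailable for spot, which is the state described in this report.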

AWS is offering this instance type but is not publishing the price history.

Expected Behavior:

Have a spot instance scheduled, or get the warning that no capacity is available.

Reproduction Steps (Please include YAML):
NodeClass is basically the default one.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "2250080940955023888"
  creationTimestamp: "2023-12-26T18:51:41Z"
  generation: 5
  name: eks-cluster-prefect
  resourceVersion: "55330654"
  uid: 4afa6704-20b1-4398-8834-02002b4dbdc5
spec:
  disruption:
    consolidateAfter: 30s
    consolidationPolicy: WhenEmpty
    expireAfter: 720h
  template:
    metadata:
      labels:
        worker/type: prefect
    spec:
      kubelet:
        evictionHard:
          memory.available: 200Mi
      nodeClassRef:
        name: eks-cluster
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3.medium
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-north-1-cph-1a
      taints:
      - effect: NoSchedule
        key: prefect/dedicated
status: {}

Versions:

  • Chart Version: 0.36.0; Image version: 0.36.1
  • Kubernetes Version (kubectl version):

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3-eks-adc7111

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
STollenaar added the bug and needs-triage labels May 12, 2024
@jonathan-innis
Contributor

> AWS Is offering this instance type but is not publishing the price history

Do you see different behavior for on-demand instance types? Is this issue just for spot or for all instance types?

jonathan-innis self-assigned this May 13, 2024
@STollenaar
Author

STollenaar commented May 13, 2024

Yeah, if I change this exact config to use on-demand, everything gets scheduled as normal. But as soon as it's switched to spot, it fails.
Also, even if I include all the instance types offered in this local zone, it fails when using spot but succeeds when using on-demand.

jonathan-innis removed the needs-triage label May 13, 2024
@jmdeal
Contributor

jmdeal commented May 15, 2024

Are you able to launch this instance in non-local zones? Based on the metric it looks like we believe the instance type is unavailable across all zones, not just the local zone. Do you have a full set of Karpenter logs?

@STollenaar
Author

STollenaar commented May 15, 2024

Yes, if I include a non-local AZ, both on-demand and spot work in the regular AZ. The local zone continues to fail for spot instances. (Please don't mind the closed spot instance. It was in the running state, I was just too late grabbing a screenshot from the spot requests page.)
karpenter_bad.log
karpenter_success.log
Screenshot from 2024-05-15 13-09-42

@jmdeal
Contributor

jmdeal commented May 15, 2024

It does look like we assume that if we can discover spot pricing within a region, we can discover pricing across all zones. If a zone is missing from the pricing information, offerings for that zone are marked as unavailable. It's probably reasonable to continue to fall back to the on-demand pricing if pricing for a specific zone is unavailable, like we would for the region if it were unavailable. I'm still a bit confused about your metric values though: Karpenter believed there were no offerings available in any zone for t3.medium, not just the local zone. Did you see different values for the karpenter_cloudprovider_instance_type_offering_available metric when you were able to launch instances in the normal AZs?
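To make the difference between the two behaviors concrete, here is a rough sketch in Python. It is not Karpenter's code (Karpenter is written in Go); the function names, the dictionary layout, and all prices are hypothetical placeholders chosen only to illustrate a strict per-zone lookup versus a per-zone fallback to the on-demand price.

```python
def strict_spot_price(spot_prices, instance_type, zone):
    """Behavior described in the comment: a zone missing from the spot
    pricing data yields no price, so the offering is treated as unavailable."""
    return spot_prices.get(instance_type, {}).get(zone)


def spot_price_with_fallback(spot_prices, on_demand_prices, instance_type, zone):
    """Proposed behavior: use the zonal spot price when it is known,
    otherwise fall back to the on-demand price for that instance type.
    Returns (price, source); price is None if nothing is known."""
    zonal = spot_prices.get(instance_type, {})
    if zone in zonal:
        return zonal[zone], "spot"
    if instance_type in on_demand_prices:
        return on_demand_prices[instance_type], "on-demand-fallback"
    return None, "unknown"


# Placeholder numbers, not real AWS prices. The local zone is deliberately
# missing from the spot data, as in this issue.
SPOT = {"t3.medium": {"eu-north-1a": 0.013, "eu-north-1b": 0.014, "eu-north-1c": 0.013}}
ON_DEMAND = {"t3.medium": 0.045}

print(strict_spot_price(SPOT, "t3.medium", "eu-north-1-cph-1a"))
print(spot_price_with_fallback(SPOT, ON_DEMAND, "t3.medium", "eu-north-1-cph-1a"))
print(spot_price_with_fallback(SPOT, ON_DEMAND, "t3.medium", "eu-north-1a"))
```

Under the strict lookup the local zone has no price at all, while the fallback variant still yields a (deliberately pessimistic) price, so the offering could remain schedulable.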

@STollenaar
Author

Yes, I noticed that when I allowed it to be scheduled in the other AZ, the value changed to 1. I attached the entire metrics list; hopefully that also helps.
metrics.txt

@jmdeal
Contributor

jmdeal commented May 18, 2024

After taking another look at your logs, I realized the reason the metrics showed the instance as unavailable was that you're only selecting subnets from that zone. I had assumed you were only using the nodepool to filter down the zone. That clears up my point of confusion, and the metrics you provided look right to me.
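One way to picture why all four samples read 0 is the simplified model below. It is an assumption made for illustration, not Karpenter's actual logic: an offering counts as available only if the zone has a selected subnet and a spot price is known for it. The function and variable names are hypothetical.

```python
def offering_available(zone, subnet_zones, priced_zones):
    """Simplified model: a zonal offering is available only if a subnet is
    selected in that zone AND spot pricing was discovered for that zone."""
    return zone in subnet_zones and zone in priced_zones


ZONES = ["eu-north-1-cph-1a", "eu-north-1a", "eu-north-1b", "eu-north-1c"]

# Spot pricing is published for the regular AZs but not for the local zone.
PRICED_ZONES = {"eu-north-1a", "eu-north-1b", "eu-north-1c"}

# Setup from the report: subnets are selected only in the local zone.
local_only = {z: offering_available(z, {"eu-north-1-cph-1a"}, PRICED_ZONES) for z in ZONES}

# After also selecting a subnet in one regular AZ.
with_regular_az = {
    z: offering_available(z, {"eu-north-1-cph-1a", "eu-north-1a"}, PRICED_ZONES)
    for z in ZONES
}

print(local_only)
print(with_regular_az)
```

In this model the regular AZs fail the subnet check and the local zone fails the pricing check, so nothing is available until a regular AZ subnet is selected, which lines up with the metric flipping to 1 in that case.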

I guess the question at this point is what are the implications of falling back to on-demand pricing in a single zone. The only thing I can think of is that we may not consolidate when we should have been able to, but the same can be said for full region fallback. I'm going to try and think if there's anything else, but that seems like a reasonable path forward at the moment.

@STollenaar
Author

> what are the implications of falling back to on-demand pricing in a single zone

I'm a bit confused by this one. Do you mean showing the on-demand price as a fallback? The implication I can see is that defining both a cheap t3 instance and an expensive g4dn type might cause issues. I noticed that the g4dn has ~77% cost savings and would be cheaper than the t3 one.

For my use case this would not be a problem, but for others it could be.

Could another option be to create an average from all other valid discovered zones?
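A minimal sketch of what that averaging idea could look like, purely as an illustration: the function name and the prices are hypothetical, and nothing here reflects how Karpenter actually estimates prices.

```python
from statistics import mean


def estimate_zone_price(zonal_prices, zone):
    """Return the known spot price for the zone, or the mean of the prices
    discovered in the other zones when this zone has none.
    Returns None if no zone has a price at all."""
    if zone in zonal_prices:
        return zonal_prices[zone]
    known = list(zonal_prices.values())
    return mean(known) if known else None


# Placeholder spot prices for one instance type, not real AWS data.
T3_MEDIUM_SPOT = {"eu-north-1a": 0.012, "eu-north-1b": 0.014, "eu-north-1c": 0.013}

estimate = estimate_zone_price(T3_MEDIUM_SPOT, "eu-north-1-cph-1a")
print(f"estimated local-zone price: {estimate:.4f}")
```

Compared with an on-demand fallback, an average keeps the estimate in the same range as the real spot prices, which matters when instance types are ranked by cost. The estimate may still be off if local-zone pricing differs from the parent region's.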

billrayburn assigned jmdeal and unassigned njtran May 22, 2024
@STollenaar
Author

@jmdeal Have you been able to make any progress on this yet?

Projects
None yet
Development

No branches or pull requests

4 participants