
[RFE] Optimize IP allocation at scale #313

Open
adrianchiris opened this issue Mar 29, 2023 · 4 comments
Labels
enhancement New feature or request

@adrianchiris

Is your feature request related to a problem? Please describe.
When scheduling multiple pods with multiple secondary networks, it may take a long time for whereabouts to allocate IPs for all interfaces if too many pods spin up at the same time, possibly hitting kubelet's default 4 minute limit to run the pod sandbox [1].

This has been encountered in K8s clusters (128 worker nodes) running AI/ML jobs that spin up 128 Pods at the same time.
Each pod has a total of 17 networks; 16 of those are sriov + whereabouts as IPAM (essentially the same secondary network specified 16 times).

[1] https://github.com/kubernetes/kubernetes/blob/c3e7eca7fd38454200819b60e58144d5727f1bbc/pkg/kubelet/cri/remote/remote_runtime.go#L163

Describe the solution you'd like
So, I performed some experiments:

I decided to try this in a more compact and simple environment (master + 2 workers, all bare metal), creating a simple deployment with 128 replicas specifying 16 additional networks (macvlan + whereabouts), and indeed I hit the kubelet timeout before the deployment was ready and some pods entered restart.

Splitting into 16 different networks with separate IP ranges sped up the process, as there is now less data to retrieve from the k8s API on each call (and less work iterating over that data). The same deployment now ran in about 3:35 min.
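
As a minimal sketch of why the split helps (simplified, not whereabouts' actual data model): every allocation has to fetch the pool object from the API and walk its existing reservations to find a free slot, so both the object size and the walk shrink when one big shared range becomes 16 smaller per-network pools.

```go
package main

import (
	"fmt"
	"strconv"
)

// firstFreeOffset is a simplified stand-in for the "find a free IP in this
// pool" step: walk the pool's reservation map until an unused offset is
// found. Both the size of the pool object fetched from the API and this walk
// scale with the range size, which is why splitting one shared range into 16
// smaller per-network ranges reduces the per-allocation work.
func firstFreeOffset(reservations map[string]string, rangeSize int) (int, bool) {
	for i := 0; i < rangeSize; i++ {
		if _, taken := reservations[strconv.Itoa(i)]; !taken {
			return i, true
		}
	}
	return 0, false // pool exhausted
}

func main() {
	reservations := map[string]string{"0": "pod-a", "1": "pod-b"}
	offset, ok := firstFreeOffset(reservations, 256)
	fmt.Println(offset, ok) // 2 true
}
```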

My next step was to split the global lease used by whereabouts into a per-pool lease. After implementing some POC-level code, I tried it out, and the deployment now took 1:23 min to be ready (all pods running).
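
A rough sketch of the per-pool lease idea, assuming client-go's Lease-based leader election and a hypothetical pool-derived lease name (the actual POC may look different): each pool gets its own lock, so allocations against different pools no longer serialize behind one global lease.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// acquireAndAllocate grabs a lease scoped to a single IP pool (instead of the
// one global whereabouts lease), runs the allocation while holding it, and
// releases it as soon as the work is done. poolName is a hypothetical
// identifier derived from the network's range; allocate stands in for the
// real allocation logic.
func acquireAndAllocate(parent context.Context, namespace, poolName string, allocate func(context.Context)) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// One Lease object per pool, e.g. "whereabouts-<poolName>", instead of a
	// single cluster-wide lock shared by every allocation on every node.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		namespace,
		fmt.Sprintf("whereabouts-%s", poolName),
		clientset.CoreV1(),
		clientset.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	)
	if err != nil {
		return err
	}

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // hand the lease back as soon as we are done
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				allocate(ctx) // critical section: touch only this pool's data
				cancel()      // done, release the per-pool lease
			},
			OnStoppedLeading: func() {},
		},
	})
	return nil
}

func main() {
	if err := acquireAndAllocate(context.Background(), "kube-system", "macvlan-net-1",
		func(ctx context.Context) { fmt.Println("allocating an IP from macvlan-net-1") }); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```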

As a reference, I ran the same deployment (128 pods) on my setup with just the primary network, no whereabouts involved, and it took 1:10 min for the deployment to be ready (all pods running).

So essentially my solution for optimizing IP allocation at scale consists of two things:

  1. Recommend that users split ranges (separate IPPools)
  2. Use leader election with a lease per pool

I will upload POC code for this approach shortly.

Describe alternatives you've considered
An alternative we discussed internally was to avoid using leader election at the CNI level and drive IP allocation from a central place:

A controller would watch for pods and assign IP addresses for their networks via a CRD (creating either a CR instance per pod, or per pod and network).

The CNI plugin would just GET the object and return the IPs within it, or retry if they are not set yet.

This would be a relatively large change in both the approach of how whereabouts assigns IPs and the code base.
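
For illustration only, a sketch of what the CNI side of that alternative could look like (the CR shape and names here are hypothetical, not an agreed design): the controller populates a per-pod object with the assigned IPs, and the plugin just polls for it instead of taking any lock.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// PodIPAllocation is a hypothetical shape for the per-pod (or per pod and
// network) CR status the central controller would populate: one entry per
// secondary network, filled in once the controller has reserved the IPs.
type PodIPAllocation struct {
	Network string   // network-attachment-definition name
	IPs     []string // CIDRs assigned by the controller, empty until assigned
}

var errNotFound = errors.New("allocation object not found yet")

// waitForAllocations is the CNI-plugin side of the alternative: no leader
// election, just GET the object for this pod and retry until the controller
// has set the allocations (or the CNI call times out).
func waitForAllocations(ctx context.Context,
	get func(ctx context.Context) ([]PodIPAllocation, error)) ([]PodIPAllocation, error) {

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		allocs, err := get(ctx)
		switch {
		case err == nil && len(allocs) > 0:
			return allocs, nil // controller has done its work
		case err != nil && !errors.Is(err, errNotFound):
			return nil, err // real API error, bail out
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("timed out waiting for controller to assign IPs: %w", ctx.Err())
		case <-ticker.C:
			// not assigned yet, retry
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Fake "GET" that succeeds on the second attempt, standing in for a real
	// client read of the hypothetical per-pod CR.
	attempts := 0
	get := func(context.Context) ([]PodIPAllocation, error) {
		attempts++
		if attempts < 2 {
			return nil, errNotFound
		}
		return []PodIPAllocation{{Network: "sriov-net-1", IPs: []string{"10.10.1.5/24"}}}, nil
	}

	allocs, err := waitForAllocations(ctx, get)
	fmt.Println(allocs, err)
}
```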


@adrianchiris (Author)

@maiqueb @dougbtv thoughts on this one?

maiqueb (Collaborator) commented Apr 18, 2023

On paper, it makes sense, @adrianchiris.

Let me think this through a little bit more, I'll get back to you.

Maybe add an entry to this week's community meeting so we can start an initial discussion of this proposal?

samba commented Mar 20, 2024

Hey friends, was there a conclusion on this topic (almost a year ago)?

My team is encountering related problems at scale, and if there are solutions to this, we'd love to explore them.

@xagent003 (Contributor)

+1 I am interested in this as well, @maiqueb @dougbtv
