
[RFE] Optimize IP allocation at scale #313

Open
adrianchiris opened this issue Mar 29, 2023 · 4 comments
Labels
enhancement New feature or request

@adrianchiris

Is your feature request related to a problem? Please describe.
When scheduling multiple pods with multiple secondary networks, it may take a long time for whereabouts to allocate IPs for all interfaces if too many pods spin up at the same time, possibly hitting kubelet's default 4 minute limit to run the pod sandbox [1].

This has been encountered in K8s clusters (128 worker nodes) running AI/ML jobs that spin up 128 Pods at the same time.
Each pod has a total of 17 networks; 16 of those are sriov + whereabouts as IPAM (essentially the same secondary network specified 16 times).

[1] https://github.com/kubernetes/kubernetes/blob/c3e7eca7fd38454200819b60e58144d5727f1bbc/pkg/kubelet/cri/remote/remote_runtime.go#L163

Describe the solution you'd like
So, I performed some experiments:

I decided to try this in a more compact and simple environment (master + 2 workers, all bare metal), creating a simple deployment with 128 replicas specifying 16 additional networks (macvlan + whereabouts), and indeed I hit the kubelet timeout before the deployment was ready and some pods entered restart.

Splitting into 16 different networks with separate IP ranges sped up the process, as there is now less data to retrieve from the k8s API on each call (and less work iterating over that data). The same deployment now ran in about 3:35 min.
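
As a minimal sketch of why the split helps (simplified, not whereabouts' actual data model): every allocation has to fetch the pool object from the API and walk its existing reservations to find a free slot, so both the object size and the walk shrink when one big shared range becomes 16 smaller per-network pools.

```go
package main

import (
	"fmt"
	"strconv"
)

// firstFreeOffset is a simplified stand-in for the "find a free IP in this
// pool" step: walk the pool's reservation map until an unused offset is
// found. Both the size of the pool object fetched from the API and this walk
// scale with the range size, which is why splitting one shared range into 16
// smaller per-network ranges reduces the per-allocation work.
func firstFreeOffset(reservations map[string]string, rangeSize int) (int, bool) {
	for i := 0; i < rangeSize; i++ {
		if _, taken := reservations[strconv.Itoa(i)]; !taken {
			return i, true
		}
	}
	return 0, false // pool exhausted
}

func main() {
	reservations := map[string]string{"0": "pod-a", "1": "pod-b"}
	offset, ok := firstFreeOffset(reservations, 256)
	fmt.Println(offset, ok) // 2 true
}
```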

My next step was to split the global lease used by whereabouts into a per-pool lease. After implementing some POC-level code, I tried it out, and the deployment now took 1:23 min to be ready (all pods running).
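
A rough sketch of the per-pool lease idea, assuming client-go's Lease-based leader election and a hypothetical pool-derived lease name (the actual POC may look different): each pool gets its own lock, so allocations against different pools no longer serialize behind one global lease.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// acquireAndAllocate grabs a lease scoped to a single IP pool (instead of the
// one global whereabouts lease), runs the allocation while holding it, and
// releases it as soon as the work is done. poolName is a hypothetical
// identifier derived from the network's range; allocate stands in for the
// real allocation logic.
func acquireAndAllocate(parent context.Context, namespace, poolName string, allocate func(context.Context)) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// One Lease object per pool, e.g. "whereabouts-<poolName>", instead of a
	// single cluster-wide lock shared by every allocation on every node.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		namespace,
		fmt.Sprintf("whereabouts-%s", poolName),
		clientset.CoreV1(),
		clientset.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	)
	if err != nil {
		return err
	}

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // hand the lease back as soon as we are done
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				allocate(ctx) // critical section: touch only this pool's data
				cancel()      // done, release the per-pool lease
			},
			OnStoppedLeading: func() {},
		},
	})
	return nil
}

func main() {
	if err := acquireAndAllocate(context.Background(), "kube-system", "macvlan-net-1",
		func(ctx context.Context) { fmt.Println("allocating an IP from macvlan-net-1") }); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```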

As a reference, I ran the same deployment (128 pods) on my setup with just the primary network, no whereabouts involved, and it took 1:10 min for the deployment to be ready (all pods running).

So essentially my solution for optimizing IP allocation at scale consists of two things:

  1. Recommend that users split ranges (separate IPPools)
  2. Use leader election with a lease per pool

I will upload POC code for this approach shortly.

Describe alternatives you've considered
An alternative we discussed internally was to avoid using leader election at the CNI level and drive IP allocation from a central place:

A controller would watch for pods and assign IP addresses for their networks via a CRD (creating either a CR instance per pod, or per pod and network).

The CNI plugin would just GET the object and return the IPs within it, or retry if they are not set yet.

This would be a relatively large change in both the approach of how whereabouts assigns IPs and the code base.
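
For illustration only, a sketch of what the CNI side of that alternative could look like (the CR shape and names here are hypothetical, not an agreed design): the controller populates a per-pod object with the assigned IPs, and the plugin just polls for it instead of taking any lock.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// PodIPAllocation is a hypothetical shape for the per-pod (or per pod and
// network) CR status the central controller would populate: one entry per
// secondary network, filled in once the controller has reserved the IPs.
type PodIPAllocation struct {
	Network string   // network-attachment-definition name
	IPs     []string // CIDRs assigned by the controller, empty until assigned
}

var errNotFound = errors.New("allocation object not found yet")

// waitForAllocations is the CNI-plugin side of the alternative: no leader
// election, just GET the object for this pod and retry until the controller
// has set the allocations (or the CNI call times out).
func waitForAllocations(ctx context.Context,
	get func(ctx context.Context) ([]PodIPAllocation, error)) ([]PodIPAllocation, error) {

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		allocs, err := get(ctx)
		switch {
		case err == nil && len(allocs) > 0:
			return allocs, nil // controller has done its work
		case err != nil && !errors.Is(err, errNotFound):
			return nil, err // real API error, bail out
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("timed out waiting for controller to assign IPs: %w", ctx.Err())
		case <-ticker.C:
			// not assigned yet, retry
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Fake "GET" that succeeds on the second attempt, standing in for a real
	// client read of the hypothetical per-pod CR.
	attempts := 0
	get := func(context.Context) ([]PodIPAllocation, error) {
		attempts++
		if attempts < 2 {
			return nil, errNotFound
		}
		return []PodIPAllocation{{Network: "sriov-net-1", IPs: []string{"10.10.1.5/24"}}}, nil
	}

	allocs, err := waitForAllocations(ctx, get)
	fmt.Println(allocs, err)
}
```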


@adrianchiris (Author)

@maiqueb @dougbtv thoughts on this one?

maiqueb (Collaborator) commented Apr 18, 2023

On paper, it makes sense, @adrianchiris.

Let me think this through a little bit more, I'll get back to you.

Maybe add an entry to this week's community meeting so we can start an initial discussion of this proposal?

samba commented Mar 20, 2024

Hey friends, was there a conclusion on this topic (almost a year ago)?

My team is encountering related problems at scale, and if there are solutions to this, we'd love to explore them.

@xagent003 (Contributor)

+1 I am interested in this as well, @maiqueb @dougbtv
