
Kourier at large scale #941

Open
daraghlowe opened this issue Oct 19, 2022 · 3 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@daraghlowe

daraghlowe commented Oct 19, 2022

What's the issue?
We have started testing Kourier at large scale to see if deployment times are better than with Istio (i.e. the time for a KSVC to become ready to serve traffic). Deploy times are good with Kourier: it consistently takes less than 10 seconds for a newly added KSVC to become ready, all the way up to 2000 KSVCs on the cluster.

However, if you delete a KSVC and then try to add a new KSVC, times are much slower: even with only 500 KSVCs on the cluster it takes several minutes before the new KSVC is ready.
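
For reference, "time to become ready" here is just the delay between creating a KSVC and its Ready condition turning True. A minimal way to measure it is sketched below; this is not the exact script we used, and it assumes kubectl access to the cluster, with a placeholder service name and a sample image rather than our real workloads.

```python
# Sketch: measure how long a newly created KSVC takes to become Ready.
# Assumes kubectl is pointed at the cluster; name and image are placeholders.
import subprocess
import time

KSVC_TEMPLATE = """
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: scale-test-{i}
spec:
  template:
    spec:
      containers:
      - image: gcr.io/knative-samples/helloworld-go  # placeholder image
"""

def time_to_ready(i: int) -> float:
    """Create KSVC number i and return seconds until its Ready condition is True."""
    start = time.monotonic()
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=KSVC_TEMPLATE.format(i=i), text=True, check=True)
    subprocess.run(["kubectl", "wait", f"ksvc/scale-test-{i}",
                    "--for=condition=Ready", "--timeout=600s"], check=True)
    return time.monotonic() - start

print(f"KSVC became ready in {time_to_ready(0):.1f}s")
```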

Looking at the logs of the net-kourier-controller, you can see that it starts reconciling all of the Ingresses on the cluster when you delete a KSVC, and presumably this needs to finish before the Ingress for the new KSVC can be created.

Why is this a problem?
This leads to inconsistent deploy times for our workloads, which creates an inconsistent user experience: sometimes a new KSVC is ready really quickly, and other times it can take minutes.

Results
Here are the times it took for a single KSVC to become ready right after I deleted a different single KSVC, alongside the number of KSVCs that were on the cluster.

[Image: table of time-to-ready for a newly created KSVC after deleting another KSVC, by number of KSVCs on the cluster]

Why are we doing this?
We are running a cluster with Knative and Istio with 1500 KSVCs and have started to run into a problem with the time it takes for newly added KSVCs (specifically, their Ingress) to become ready.

We opened an issue for this here: knative/serving#13247

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 18, 2023
@dprotaso
Contributor

/lifecycle frozen

@knative-prow knative-prow bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2023
@dprotaso dprotaso reopened this Feb 22, 2023
@norbjd
Contributor

norbjd commented Jul 22, 2023

Hello 👋,

We have noticed this kind of inconsistent deploy time on our Knative clusters too. As of now, this is the main reason we're running multiple clusters, with every cluster having only between 400 and 500 ksvc. Above that, we start seeing some slowness, mostly ksvc taking a while to become ready.

I've started investigating based on @daraghlowe's example, on a simple kind cluster, measuring the time for the kingress resources to become ready. Here's my experiment, with the latest kourier version (main as of 2023-07-22: 85c062d):

  1. create a single Pod and a Service
  2. create 2000 Ingresses sequentially, all pointing to the Service created in the first step
  3. delete the first Ingress
  4. create 10 new Ingresses sequentially
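
A rough driver for steps 2 to 4 could look like the sketch below. The KIngress manifest fields and the Kourier ingress class annotation are written from memory of the networking.internal.knative.dev/v1alpha1 API, and the backend Service name/namespace and host domain are placeholders, so they should be checked against the CRD before reuse.

```python
# Sketch: create Knative Ingresses sequentially against an existing Service
# and time how long each takes to reach the Ready condition.
# "backend", "default" and example.com below are placeholders.
import subprocess
import time

KINGRESS_TEMPLATE = """
apiVersion: networking.internal.knative.dev/v1alpha1
kind: Ingress
metadata:
  name: ing-{i}
  annotations:
    networking.knative.dev/ingress.class: kourier.ingress.networking.knative.dev
spec:
  rules:
  - hosts: ["ing-{i}.example.com"]
    visibility: ExternalIP
    http:
      paths:
      - splits:
        - serviceName: backend
          serviceNamespace: default
          servicePort: 80
          percent: 100
"""

KINGRESS = "ingresses.networking.internal.knative.dev"

def create_and_time(i: int) -> float:
    """Create ingress number i and return seconds until it is Ready."""
    start = time.monotonic()
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=KINGRESS_TEMPLATE.format(i=i), text=True, check=True)
    subprocess.run(["kubectl", "wait", f"{KINGRESS}/ing-{i}",
                    "--for=condition=Ready", "--timeout=300s"], check=True)
    return time.monotonic() - start

for i in range(2000):                                    # step 2
    print(i, f"{create_and_time(i):.2f}s")

subprocess.run(["kubectl", "delete", f"{KINGRESS}/ing-0"], check=True)  # step 3

for i in range(2000, 2010):                              # step 4
    print(i, f"{create_and_time(i):.2f}s")
```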

For the first ingresses (up to 1200), 95% of the time it takes less than 1s for each ingress to become ready. But when we have more ingresses, this time increases, up to 2 seconds. The more ingress objects we have, the longer it takes for a new ingress to become ready, but it's always less than or equal to 2 seconds, so it's not that bad. See this plot showing the percentage of ingress creations taking between 1 and 2 seconds as a function of the total number of ingresses:

[Image: plot of the percentage of ingress creations taking between 1 and 2 seconds, by total number of ingresses]

These results might be normal; I don't know the intricacies.

But once I had 2000 ingresses and deleted an existing ingress (3rd step), the next ingress creation (4th step) took between 7 and 8 seconds. The ones after that were consistent with the results I showed before, between 1 and 2 seconds.

I'm still not sure why we got that big "time-to-ready" increase (from 1-2s to 7-8s) just after deleting an ingress, but from an outside perspective, adding an ingress should always take about the same time to become ready.

I could not reproduce @daraghlowe's numbers, because I only focused on ingresses here; there are obviously other things created when we create a ksvc (revision, configuration, etc.).

I'll continue to investigate, but I thought it was worth posting this first experiment as it could prompt further discussion.
