Configurable concurrency per replica setting #12
The setting is here: Line 78 in 4350d67
I have already changed the default from 1 to 100; what's left is making this configurable through the annotation and reconciling on it as needed.
Do I understand this correctly: you suggest a new annotation on the model deployment, so that instead of a global value in Lingo's main this can be customized per model? That sounds very reasonable to me. The deployment manager receives updates on …
@alpe Yes, that's correct. There should be a global default value, and in addition each deployment should be able to override that default by setting an annotation.
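The annotation-with-global-fallback lookup described above could be sketched roughly like this in Go (the annotation key `lingo.substratus.ai/max-concurrency` and the function name are hypothetical, not taken from the Lingo codebase):

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical annotation key; the actual key chosen for Lingo may differ.
const concurrencyAnnotation = "lingo.substratus.ai/max-concurrency"

// concurrencyFor returns the per-deployment concurrency when the annotation
// is present and valid, otherwise the global default.
func concurrencyFor(annotations map[string]string, globalDefault int) int {
	if raw, ok := annotations[concurrencyAnnotation]; ok {
		if v, err := strconv.Atoi(raw); err == nil && v > 0 {
			return v
		}
	}
	return globalDefault
}

func main() {
	// No annotation set: fall back to the global default.
	fmt.Println(concurrencyFor(nil, 100))
	// Annotation set on the deployment: override the default.
	fmt.Println(concurrencyFor(map[string]string{concurrencyAnnotation: "42"}, 100))
}
```

The deployment manager would re-run this lookup whenever it receives a deployment update, so annotation changes are reconciled without a restart.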
Currently it seems Lingo quickly creates more replicas as requests come in while the pod isn't ready to serve yet. It should be configurable how many requests a single pod can handle concurrently.
This could be done by using the following annotation in the deployment:
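The annotation snippet itself did not survive in this copy of the issue; a sketch of what it could look like follows (the key name is hypothetical, modeled on Knative's `autoscaling.knative.dev/target` annotation):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
  annotations:
    # Hypothetical key: target number of concurrent requests per replica.
    lingo.substratus.ai/max-concurrency: "100"
```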
In this case Lingo should only scale up when a single pod is handling more than 100 HTTP requests in parallel. I think 100 is a good default value, which is also what Knative uses: https://knative.dev/docs/serving/autoscaling/concurrency/#soft-versus-hard-concurrency-limits