Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add kv-store connection check to readiness probe #135

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Legion2
Copy link
Contributor

@Legion2 Legion2 commented Mar 19, 2024

Motivation

When the modelmesh is not able to connect to the kv store to update its instance recording or sync with the other instances it can not reliably serve inference requests. For a short time a disconnect can be tolerated and the cached values can be used to serve requests. However after some time the data may be stale and the routing of requests may result in errors. For example with instance A and B, if A has connection issues while B leaves the mesh, A still have the outdated instance record and will still route inference requests to B, which fail. To prevent this, A should be marked unready if it can not connect to the kv store, to inform upstream proxies to not route traffic to A.

Modifications

Add existing verifyKvStoreConnection check to isReady check.

Result

isReady will return false when the model mesh instance lost connection to the kv store. Allowing systems such as kubernetes to react to this condition.

Signed-off-by: Leon Kiefer <leon.k97@gmx.de>
Copy link

oss-prow-bot bot commented Mar 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Legion2
Once this PR has been reviewed and has the lgtm label, please assign rafvasq for approval by writing /assign @rafvasq in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@rafvasq rafvasq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me and the changes look fine, but I wonder if @tjohnson31415 or @njhill could chime in to confirm the idea before merging especially since it doesn't seem like there are any tests related to readiness probe.

FYI @ckadner

@rafvasq rafvasq requested a review from njhill April 1, 2024 21:19
@rafvasq rafvasq changed the title Add kv-store connection check to readiness probe feat: Add kv-store connection check to readiness probe Apr 1, 2024
@tjohnson31415
Copy link
Contributor

tjohnson31415 commented Apr 9, 2024

Hmm, I'm not sure about this behavior being the default.

As stated, ModelMesh is intended to handle downtime in etcd by continuing to serve requests for existing models (adding models and scaling would fail). If all pods reported as Unready when etcd goes down, then no requests would succeed. Even if the downtime extends over a longer period, I think it would be better to serve the requests it can until the system recovers.

Where this change would help is in the situation described in the issue where one or two of the pods are disconnected from the KV store, but this situation is not easily distinguishable from a complete etcd downtime at the pod level. Merging this PR would prevent partial service in the case that etcd does go down. 🤔

That said, depending on the failure mode, perhaps reporting as NOT READY would be useful in some circumstances. If you do want this behavior for your use-case, I think we can add an experimental feature-flag via environment variable to control this behavior (off by default).

@Legion2
Copy link
Contributor Author

Legion2 commented Apr 13, 2024

Thanks for the feedback, I will add a feature flag for it.

@rafvasq rafvasq marked this pull request as draft April 18, 2024 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants