Seldon core operator is restarting due to failed renewal of lease #4147

Closed
sujaykulkarn opened this issue Jun 13, 2022 · 7 comments · Fixed by #4211

@sujaykulkarn

Describe the bug

The Seldon Core operator pod restarts because it fails to retrieve the resource lock, with the following error logs:

{"version": "1.0", "level": "INFO", "host": "ccs-seldon-75dfcb5bf9-bwkfv.ccs", "system": "ml-inf-seldon", "type": "log", "log": {"message": "E0610 16:36:08.554283 7 leaderelection.go:325] error retrieving resource lock ccs/a33bd623.machinelearning.seldon.io: Get \"https://10.254.0.1:443/apis/coordination.k8s.io/v1/namespaces/ccs/leases/a33bd623.machinelearning.seldon.io\": context deadline exceeded"}, "time": "2022-06-10T16:36:09.511Z"}
{"version": "1.0", "level": "INFO", "host": "ccs-seldon-75dfcb5bf9-bwkfv.ccs", "system": "ml-inf-seldon", "type": "log", "log": {"message": "I0610 16:36:08.554523 7 leaderelection.go:278] failed to renew lease ccs/a33bd623.machinelearning.seldon.io: timed out waiting for the condition"}, "time": "2022-06-10T16:36:09.512Z"}
{"version": "1.0", "level": "INFO", "host": "ccs-seldon-75dfcb5bf9-bwkfv.ccs", "system": "ml-inf-seldon", "type": "log", "log": {"message": "setup : problem running manager"}, "time": "2022-06-10T16:36:09.512Z"}

I would like more insight into this issue. Is it related to kubernetes/client-go#966?

To reproduce

  1. Install the Seldon chart with 2 replicas and keep it running for 2-3 days; one or more restarts will occur.

Expected behaviour

The Seldon pod should not restart; it should retry the lease renewal instead.

Environment

  • Cloud Provider: Bare Metal

  • Kubernetes Cluster Version:
    [root:ccs-01-control-01 /root]$ kubectl version
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:18:51Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.5", GitCommit:"e338cf2c6d297aa603b50ad3a301f761b4173aa6", GitTreeState:"clean", BuildDate:"2020-12-09T11:10:32Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

  • Deployed Seldon System Images: v1.11.0

Model Details

  • Images of your model: NA
  • Logs of your model: NA
@ukclivecox
Contributor

Is this related to kubernetes-sigs/kubebuilder#2604?
I imagine it's a kubebuilder or controller-runtime issue.

@ukclivecox ukclivecox added this to Needs triage in Backlog via automation Jun 24, 2022
@ukclivecox ukclivecox moved this from Needs triage to In Discussion in Backlog Jun 24, 2022
@ukclivecox ukclivecox self-assigned this Jun 27, 2022
@ukclivecox ukclivecox added this to To do in Seldon Core 1.15 via automation Jul 6, 2022
@ukclivecox ukclivecox removed this from In Discussion in Backlog Jul 6, 2022
@ukclivecox
Contributor

Is there anything particular about your cluster that would cause resource lock requests to fail?

@ukclivecox
Contributor

Similar issue: kedacore/keda#2836

@ukclivecox
Contributor

One option might be to allow longer deadlines, so that users can tolerate noisy clusters or network issues.

@ukclivecox ukclivecox moved this from To do to Review in progress in Seldon Core 1.15 Jul 8, 2022
Seldon Core 1.15 automation moved this from Review in progress to Done Jul 14, 2022
@sujaykulkarn
Author

sujaykulkarn commented Jul 15, 2022

Hi @cliveseldon @axsaucedo, many thanks for the change.
These changes were much needed: clusters are sometimes under heavy load, and these parameters make it easier to control the leader election process for Seldon. One small query: is there any documentation for the above fix? Thank you.

@ukclivecox
Contributor

There are no explicit docs at present. Setting these values requires understanding the Kubernetes leader election process from the controller-runtime docs. We look forward to hearing how you get on. Adding docs based on your experience as a PR would also be welcome. Feel free to open an issue.

@sujaykulkarn
Author

Sure, thank you. May I know when the release of Seldon 1.15 is planned?
