Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set leader election client and renew timeout #65094

Merged
merged 2 commits into from Jul 3, 2018

Conversation

hzxuzhonghu
Copy link
Member

@hzxuzhonghu hzxuzhonghu commented Jun 14, 2018

What this PR does / why we need it:

set leader-election client timeout

set timeout for tryAcquireOrRenew

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65090 #65257

Special notes for your reviewer:

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2018
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jun 14, 2018
@hzxuzhonghu
Copy link
Member Author

/assign @mikedanese @timothysc

/cc @kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2018
@hzxuzhonghu
Copy link
Member Author

@hzxuzhonghu hzxuzhonghu changed the title set leader election client timeout set leader election client and renew timeout Jun 20, 2018
@@ -201,7 +201,19 @@ func (le *LeaderElector) renew() {
stop := make(chan struct{})
wait.Until(func() {
err := wait.Poll(le.config.RetryPeriod, le.config.RenewDeadline, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poll already retries and timesout. What does this do differently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed wait.Poll check timeout before running ConditionFunc. If ConditionFunc takes too long, it can not quit quickly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, gross. Ok, let me think about this.

leaderElectionClient := clientset.NewForConfigOrDie(restclient.AddUserAgent(kubeconfig, "leader-election"))
// shallow copy, do not modify the kubeconfig.Timeout.
config := *kubeconfig
config.Timeout = s.GenericComponent.LeaderElection.RenewDeadline.Duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be sufficient to set this to RetryPeriod? I think then we can get away without the other timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this timeout is to set for http client, used to prevent blocking forever.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what it is used for. If we retry every RetryPeriod anyway, why would we want a client to block for longer than RetryPeriod?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, the timeout less than RetryPeriod can work well. But if the network is unstable, and it takes longer to finish the renew. During this period, this client loses leadership.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my own thought. Would not mind different voice.

return le.tryAcquireOrRenew(), nil
var succeed bool
done := make(chan struct{})
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These goroutines are going to leak. This concerns me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After setting timeout, the goroutine can definitely exit with at most less than twice RenewDeadline time.

@timothysc timothysc removed their assignment Jun 25, 2018
@mikedanese mikedanese assigned awly and mikedanese and unassigned mikedanese Jun 26, 2018
@@ -289,7 +290,9 @@ func createClients(config componentconfig.ClientConnectionConfiguration, masterO
return nil, nil, nil, err
}

leaderElectionClient, err := clientset.NewForConfig(restclient.AddUserAgent(kubeConfig, "leader-election"))
restConfig := *restclient.AddUserAgent(kubeConfig, "leader-election")
restConfig.Timeout = timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You set Timeout on return of restclient.AddUserAgent here and on kubeconfig in cmd/kube-controller-manager/app/options/options.go
Make them consistent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

done := make(chan struct{})
go func() {
defer close(done)
succeed = le.tryAcquireOrRenew()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making done = make(chan bool, 1) and sending result onto it and removing succeed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also achieve the same effect, but I think it is not direct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue this is more direct.
And avoids the potential of racing on the succeed var. Access is synchronized on done channel right now but my first reaction was "it's read and written to concurrently".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you persuaded me


select {
case <-time.After(le.config.LeaseDuration - le.config.RetryPeriod):
return false, fmt.Errorf("timeout trying acquire or renew")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add timeout duration to the message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

}()

select {
case <-time.After(le.config.LeaseDuration - le.config.RetryPeriod):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not re-use timeoutCtx.Done()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense

@k8s-ci-robot k8s-ci-robot removed the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 27, 2018
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 27, 2018
@hzxuzhonghu
Copy link
Member Author

/retest


select {
case <-timeoutCtx.Done():
return false, fmt.Errorf("tryAcquireOrRenew timed out after %s", le.config.RenewDeadline)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, now that this is based on the context, add timeoutCtx.Err() to the message.
The parent ctx passed to renew could've been cancelled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

done := make(chan struct{})
go func() {
defer close(done)
succeed = le.tryAcquireOrRenew()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue this is more direct.
And avoids the potential of racing on the succeed var. Access is synchronized on done channel right now but my first reaction was "it's read and written to concurrently".

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 29, 2018
@awly
Copy link
Contributor

awly commented Jun 29, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2018
@mikedanese
Copy link
Member

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 29, 2018
@mikedanese
Copy link
Member

/hold
needs a squash

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 29, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2018
@hzxuzhonghu
Copy link
Member Author

@awly @mikedanese squashed the last two commits

@awly
Copy link
Contributor

awly commented Jul 2, 2018

/lgtm
@mikedanese for re-approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awly, hzxuzhonghu, mikedanese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mikedanese
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2018
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 65094, 65533, 63522, 65694, 65702). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 47020f3 into kubernetes:master Jul 3, 2018
@hzxuzhonghu hzxuzhonghu deleted the le-client-timeout branch July 3, 2018 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

leader election client should set timeout
7 participants