Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancellable leader election #57932

Merged

Conversation

ash2k
Copy link
Member

@ash2k ash2k commented Jan 6, 2018

What this PR does / why we need it:
Adds ability to cancel leader election. Useful in integration tests where the whole app is started and stopped in each test.

Special notes for your reviewer:
I used the context package - it is impossible/hard to achieve the same behaviour with just channels without spawning additional goroutines but it is trivial with context. See acquire() and renew() methods.

Release note:

NONE

/kind enhancement
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/enhancement sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 6, 2018
@caesarxuchao
Copy link
Member

/assign @cheftako

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2018
@ash2k ash2k force-pushed the cancellable-leader-election branch from a67cfa9 to b013fef Compare January 17, 2018 09:41
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2018
@ash2k
Copy link
Member Author

ash2k commented Jan 17, 2018

/retest

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 18, 2018
@ash2k ash2k force-pushed the cancellable-leader-election branch from b013fef to ddf6fdb Compare January 19, 2018 11:48
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2018
@ash2k
Copy link
Member Author

ash2k commented Jan 19, 2018

/retest

@ash2k ash2k force-pushed the cancellable-leader-election branch from ddf6fdb to fdd468d Compare February 1, 2018 10:15
@ash2k
Copy link
Member Author

ash2k commented Feb 2, 2018

@kubernetes/sig-api-machinery-pr-reviews PTAL

@timothysc timothysc self-assigned this Feb 5, 2018
@timothysc
Copy link
Member

timothysc commented Feb 5, 2018

@ash2k where is your integration test that makes your case? It's pretty trivial to plumb signal-handlers through the stop channel and others have written code that depends on this external to k8s.

You're basically forcing everyone who uses this to change, so let's be certain to dot all the i's and cross all the t's.

@ash2k
Copy link
Member Author

ash2k commented Feb 6, 2018

@timothysc There seem to be some misunderstanding here. I'm saying once you start the leader election, there is no way to ask it to stop. I.e. there is no way to make the method LeaderElector.Run() return. It only ever returns if it fails to renew the lease. I want to start and stop integration tests that start the whole application (in a Go test) with leader election enabled (because it is code that needs to be tested). But once the application does what it needs to do in the test, the test should be able to tell the application to cleanly shutdown. No way to do it right now.
You seem to be talking about plumbing a channel though to stop the application itself - that is not the problem I'm trying to solve here.

Examples of my integration tests can be found here https://github.com/atlassian/smith/tree/master/it
You probably want to take a look at the SetupApp() function that starts the application for each test and then does clean tear down.

@timothysc
Copy link
Member

I'm ok with it, but you're changing common code that will affect multiple parties that use the client. I'm not certain how apimachinery folks communicate these changes, and what policies are in place nowadays.

/cc @sttts

@sttts
Copy link
Contributor

sttts commented Feb 12, 2018

Would be much easier to review if the conversion from stopCh to context would be separate.

@ash2k
Copy link
Member Author

ash2k commented Feb 12, 2018

It would be harder to implemented the same logic with a stop channel - channels panic if closed more than once unlike the cancel() function that is used to cancel a context. So I would need to have some sort of locking to ensure the channel has been closed only once.
For example the renew() function would need to start an additional goroutine to receive from the stopCh and then close a temporary channel which would signal wait.Until() to stop. And that temporary channel would be used to cancel the wait from inside of the wait.Until(). So need a lock around that - could use sync.Once to do this. Temporary channel is needed because it cannot and must not close the stop channel that was passed to it.
All of that is not needed with a context and it would probably be harder to review.

@sttts
Copy link
Contributor

sttts commented Feb 12, 2018

All of that is not needed with a context and it would probably be harder to review.

I don't argue against a context :) just that we have two changes mixed into one commit here.

@sttts
Copy link
Contributor

sttts commented Feb 12, 2018

I don't argue against a context :)

Would like us to move to contexts much more throughout the code base.

@ash2k
Copy link
Member Author

ash2k commented Feb 12, 2018

I understand and do not argue that it's two changes in 1 commit. My point is that it would not be easier to review. But if you insist I'll do it in another branch as an experiment. I'll post a link.
Yes, moving from stop channels to context would be nice.

@sttts
Copy link
Contributor

sttts commented Feb 12, 2018

But if you insist I'll do it in another branch as an experiment

Two commits in this PR would be fine.

In general, change sgtm.

@ash2k
Copy link
Member Author

ash2k commented Jun 8, 2018

/retest

@@ -144,7 +145,7 @@ func Run(c *config.CompletedConfig) error {
}
}

run := func(stop <-chan struct{}) {
run := func(runCtx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same nit as above. s/runCtx/ctx/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It clashes with ctx, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, runCtx.Done()) below.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in general, variables of type context.Context prefer to be named ctx. Naming a "ControllerContext" struct variable ctx is probably worth changing so that readers don't get confused.

In fact, looking at the definition of ControllerContext, I don't think it is named correctly. It looks more like a config struct. (-:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

@mikedanese
Copy link
Member

One more comment.

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 8, 2018
Copy link

@misterikkit misterikkit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would have been nicer to add new functionality to leaderelect without requiring clients to change in the same PR. Each controller could be updated in a follow-up PR so that you don't get caught up in all the approvals needed.

Since the work is already done here, I won't push for this to be split up. See my comment about how I think it could have been done, though.

@@ -144,7 +145,7 @@ func Run(c *config.CompletedConfig) error {
}
}

run := func(stop <-chan struct{}) {
run := func(runCtx context.Context) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in general, variables of type context.Context prefer to be named ctx. Naming a "ControllerContext" struct variable ctx is probably worth changing so that readers don't get confused.

In fact, looking at the definition of ControllerContext, I don't think it is named correctly. It looks more like a config struct. (-:

le.maybeReportTransition()
if !succeeded {
glog.V(4).Infof("failed to acquire lease %v", desc)
return
}
le.config.Lock.RecordEvent("became leader")
glog.Infof("successfully acquired lease %v", desc)
close(stop)
}, le.config.RetryPeriod, JitterFactor, true, stop)
once.Do(func() { close(tmpStop) })

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of tmpStop in this function? cancellation propagation?

JitterUntil may call this anonymous func multiple times... aha

Okay, tmpStop signals to JitterUntil that it has succeeded. I suggest renaming it to done. I also think that the entire wait library tends to result in unreadable code like this, when we could have just written a for loop.

glog.Fatalf("error running controllers: %v", err)
}
}

runCtx, cancel := context.WithCancel(context.Background())
defer cancel()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer cancel() is a pretty common pattern with context.WithCancel. I think the expectation is that after creating the context, you are going to call some blocking Run func. If that func starts goroutines, it should pass the context (or child contexts) to them. So if Run ever exits, we will signal to any stray goroutines to clean up and exit.

@@ -146,28 +146,28 @@ type LeaderElector struct {
}

// Run starts the leader election loop
func (le *LeaderElector) Run(stop <-chan struct{}) {
func (le *LeaderElector) Run(ctx context.Context) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also create an overload like this:
func (le *LeaderElector) Run() { le.Run(context.TODO()) }

Then updating clients to use the new thing would be much more graceful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doesn't support overloading. We discussed adding WithContext style functions to maintain compatibility but the package is documented as having an alpha API. #57932 (review)

If people feel strongly we can do this, but I wouldn't fight for it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops - I am still learning quite a bit about golang.

I am not concerned with backward compatibility, but I thought the author's life would have been easier by temporarily preserving backward compatibility.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2018
@mikedanese
Copy link
Member

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 11, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, mikedanese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton smarterclayton added this to the v1.12 milestone Jun 13, 2018
@smarterclayton
Copy link
Contributor

This is too late for 1.11 given the potential for regression. I'm marking at as 1.12 to be explicit with that.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@ash2k @cheftako @deads2k @fgrzadkowski @mikedanese @pmorie

Action required: This pull request requires label changes. If the required changes are not made within 3 days, the pull request will be moved out of the v1.12 milestone.

priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 65256, 64236, 64919, 64879, 57932). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 571b9be into kubernetes:master Jun 21, 2018
@ash2k ash2k deleted the cancellable-leader-election branch June 21, 2018 00:56
akhilerm pushed a commit to akhilerm/apimachinery that referenced this pull request Sep 20, 2022
Automatic merge from submit-queue (batch tested with PRs 60012, 63692, 63977, 63960, 64008). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Use Dial with context

**What this PR does / why we need it**:
`net/http/Transport.Dial` field is deprecated:
```go
// DialContext specifies the dial function for creating unencrypted TCP connections.
// If DialContext is nil (and the deprecated Dial below is also nil),
// then the transport dials using package net.
DialContext func(ctx context.Context, network, addr string) (net.Conn, error)

// Dial specifies the dial function for creating unencrypted TCP connections.
//
// Deprecated: Use DialContext instead, which allows the transport
// to cancel dials as soon as they are no longer needed.
// If both are set, DialContext takes priority.
Dial func(network, addr string) (net.Conn, error)
```
This PR switches all `Dial` usages to `DialContext`. Fixes #63455.

**Special notes for your reviewer**:
Also related: kubernetes/kubernetes#59287 kubernetes/kubernetes#58532 kubernetes/kubernetes#815 kubernetes/community#1166 kubernetes/kubernetes#58677 kubernetes/kubernetes#57932

**Release note**:
```release-note
HTTP transport now uses `context.Context` to cancel dial operations. k8s.io/client-go/transport/Config struct has been updated to accept a function with a `context.Context` parameter. This is a breaking change if you use this field in your code.
```
/sig api-machinery
/kind enhancement
/cc @sttts

Kubernetes-commit: ddf551c24b7d88454f8332ce6855e53281440958
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. milestone/incomplete-labels release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet