
Leadership election #341


Merged (17 commits) on Oct 10, 2022

Conversation

avestuk
Contributor

@avestuk avestuk commented Sep 15, 2022

Hello

This PR addresses running Reloader with multiple replicas, as requested in #112, and it's something that we'd love to have at Nutmeg!

I'm opening the PR at this point hoping to get some feedback on my approach and to find out whether you are amenable to it.

Outstanding items include, but are likely not limited to:

  • Verifying that RBAC perms are correct if Reloader is running in a single namespace
  • Adding the liveness probe to the Helm chart
  • Understanding the behaviour of controllers when leadership is assumed
    • When items are added to the queue on startup, they'll be processed via Add if reloadOnCreate is true; should reloadOnCreate therefore be set by default in HA mode?

I'm not sure how to test the shutdown on failing to renew the lease, so that is potentially also an outstanding item.

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-76a68ef1

@faizanahmad055
Contributor

Hi @avestuk, thank you so much for the contribution. I can take a look at the PR and will try to test it as well. Can you please also add some test cases?

@avestuk
Contributor Author

avestuk commented Sep 15, 2022

@faizanahmad055 I'll absolutely add some tests. I think what would be really helpful at this point is confirmation that you're happy with the overall approach, and then I'll flesh this PR out with tests.

I'm currently just digging into the behaviour of the controllers on startup. I can see that there's a flag to allow reloadOnCreate and I just want to be sure I've completely understood whether it's fine for the controllers to come to life having perhaps missed an update event.

Consider a scenario where Reloader pod A is the leader and Reloader pod B is also running.

Reloader A has previously updated the config map for pod test. Reloader A dies, and the config map is updated. Reloader B takes over leadership, and should reconcile that the config map has been updated and perform the update.

EDIT:

I've been through the logic in https://github.com/stakater/Reloader/blob/master/internal/pkg/handler/upgrade.go#L134 and it's clear that updates will happen if you have also set reloadOnCreate=true. The delay before items are reloaded in the scenario above will therefore be at most the LeaseDuration. That seems acceptable, even with the default 15s LeaseDuration.

We've settled this issue here: #341 (comment)
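That worst case can be illustrated with a minimal stdlib-only sketch. Here `shouldReload` is a hypothetical stand-in for the decision in `upgrade.go`, not Reloader's actual function: after a failover, existing objects are re-delivered to the new leader as Add events, and those only trigger a reload when reloadOnCreate is enabled.

```go
package main

import "fmt"

// EventType models the kinds of events a controller informer can deliver.
// On startup (or after a leadership failover), every existing object is
// re-delivered as an Add event.
type EventType int

const (
	Add EventType = iota
	Update
)

// shouldReload is a hypothetical stand-in for the decision in
// internal/pkg/handler/upgrade.go: Update events always trigger a reload,
// while Add events only do so when reloadOnCreate is enabled.
func shouldReload(e EventType, reloadOnCreate bool) bool {
	switch e {
	case Update:
		return true
	case Add:
		return reloadOnCreate
	}
	return false
}

func main() {
	// A config map changed while no leader was running; the new leader
	// only sees it as an Add event on startup.
	fmt.Println(shouldReload(Add, true))  // true: change is picked up by the new leader
	fmt.Println(shouldReload(Add, false)) // false: change is missed until the next Update
}
```

So with reloadOnCreate=true, the longest a change can go unprocessed during a failover is roughly the LeaseDuration it takes a new leader to acquire the lock.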

@stakater-user
Contributor

@avestuk Yikes! You better fix it before anyone else finds out! Build has Failed!

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-63ff290a

@avestuk
Contributor Author

avestuk commented Sep 16, 2022

@faizanahmad055 I've added test coverage for the liveness probe and leadership election. I believe the one test should be sufficient, as I've not otherwise modified the controller behaviour. If there are other cases you'd like me to add, please let me know.

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-87d03dea

@faizanahmad055
Contributor

@avestuk Thank you for the update. I will review this soon. It says there are conflicts that need to be resolved. Can you please pull the latest changes in the meantime? :)

@avestuk avestuk force-pushed the leadership-election branch from 87d03de to 11e76fa Compare September 20, 2022 07:18
@avestuk
Contributor Author

avestuk commented Sep 20, 2022

@faizanahmad055 Should be up to date now.

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-11e76fa8

@faizanahmad055
Contributor

@avestuk regarding reload on create event,

I think this should still be disabled by default. The default functionality is to reload based on modifications, but we can enable the flag for reloading on create, and it will then reload the pod upon resource creation. The issue is that this feature has a limitation: Reloader doesn't know whether the secret was created before or after the deployment itself, which can cause a false reload when your application is deployed for the first time, since Reloader will see the new secret and simply reload the deployment.

What is your opinion?

@avestuk
Contributor Author

avestuk commented Sep 23, 2022

@faizanahmad055 I agree with you. I think documenting exactly how reloadOnCreate works and outlining the trade-offs would work nicely.

I'll make the changes and draft some documentation.

@avestuk
Contributor Author

avestuk commented Sep 23, 2022

@faizanahmad055 I've expanded the README to explain what I see as the trade-offs of reloadOnCreate. Ultimately, the desired behaviour will depend on how your workloads are configured. In the ideal scenario, a rolling upgrade won't cause disruption because terminationGracePeriods, podDisruptionBudgets, etc. are set correctly, so workloads restart gracefully.

However, if workloads are not ideally configured, it seems to me that defaulting reloadOnCreate to true carries a higher potential for disruption than defaulting it to false.

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-6c02b8ff

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-2bf892cb

@avestuk
Contributor Author

avestuk commented Sep 23, 2022

I'm happy to split the pod anti-affinity out into a separate PR, btw; it just occurred to me that it'd be useful to have for HA.

@faizanahmad055
Contributor

@avestuk Apologies for the delay. I have been sick for the past couple of days and couldn't take a look. I will try to test and merge this today, or tomorrow at the latest.

@faizanahmad055
Contributor

I think we should also add a condition here: if HA is false, the replica count should default to 1; otherwise it should be whatever the user has assigned. That should avoid any issues.
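A hypothetical sketch of that guard in the chart's Deployment template. The value keys (reloader.enableHA, reloader.replicas) are assumptions for illustration, not the chart's actual keys:

```yaml
# deployment.yaml (sketch): force a single replica unless HA is enabled
spec:
  {{- if .Values.reloader.enableHA }}
  replicas: {{ .Values.reloader.replicas }}
  {{- else }}
  replicas: 1
  {{- end }}
```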

@faizanahmad055
Contributor

@avestuk, when I try to run with HA, both Pods keep trying to become the leader and keep crashing.
Pod-1

[27/09/22 9:34:14]  ~ kc logs -f stakater-reloader-784d88cf6f-mtrz4
time="2022-09-27T19:35:04Z" level=info msg="Environment: Kubernetes"
time="2022-09-27T19:35:04Z" level=info msg="Starting Reloader"
time="2022-09-27T19:35:04Z" level=warning msg="KUBERNETES_NAMESPACE is unset, will detect changes in all namespaces."
time="2022-09-27T19:35:04Z" level=info msg="created controller for: configMaps"
time="2022-09-27T19:35:04Z" level=info msg="created controller for: secrets"
I0927 19:35:04.195413       1 leaderelection.go:248] attempting to acquire leader lease default/stakaer-reloader-lock...
I0927 19:35:04.199897       1 leaderelection.go:258] successfully acquired lease default/stakaer-reloader-lock
time="2022-09-27T19:35:04Z" level=info msg="still the leader!"
time="2022-09-27T19:35:04Z" level=info msg="became leader, starting controllers"

Pod-2

kc logs -f stakater-reloader-784d88cf6f-flrgb
time="2022-09-27T19:35:04Z" level=info msg="Environment: Kubernetes"
time="2022-09-27T19:35:04Z" level=info msg="Starting Reloader"
time="2022-09-27T19:35:04Z" level=warning msg="KUBERNETES_NAMESPACE is unset, will detect changes in all namespaces."
time="2022-09-27T19:35:04Z" level=info msg="created controller for: configMaps"
time="2022-09-27T19:35:04Z" level=info msg="created controller for: secrets"
I0927 19:35:04.192962       1 leaderelection.go:248] attempting to acquire leader lease default/stakaer-reloader-lock...
time="2022-09-27T19:35:04Z" level=info msg="new leader is stakater-reloader-784d88cf6f-mtrz4"

And when I try to run with HA false, it still keeps crashing. I am testing with minikube v1.27.0. Can you please test it locally and see if it works for you?

@faizanahmad055
Contributor

left a review comment

@avestuk I have requested some changes and also some feedback, can you please check? Also, can you please pull the latest changes from the upstream master and resolve the conflicts if any?

Alex Vest added 9 commits October 4, 2022 16:41
Should move leadership bits to own pkg?
Pull liveness into leadership to ease testing, logically the liveness
probe is directly affected by leadership so it makes sense here.

Moved some of the components of the controller tests into the testutil
package for reuse in my own tests.
@avestuk avestuk force-pushed the leadership-election branch from bc90f07 to 488eaa9 Compare October 4, 2022 15:41
@avestuk
Contributor Author

avestuk commented Oct 4, 2022

@faizanahmad055 Firstly, thanks very much! I've addressed your feedback. I'm sorry for the delay, but I was on holiday.

I'm sorry about the issue with the restarts. The call that runs leader election is blocking, and I must've just missed that when I added the liveness probe.

Everything appears to be working properly now.

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-bc90f074

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-488eaa9b

@faizanahmad055
Contributor

@avestuk Thank you so much for the update, and I hope you had a really good vacation. Community contributions are always welcome, and this feature was much needed, so thank you for adding it. The PR is a bit big, and I will try to test and merge it as soon as possible. In the meantime, can you please resolve the conflicts? Apologies for the inconvenience; another PR that had been pending for some time was merged recently.

@avestuk
Contributor Author

avestuk commented Oct 6, 2022

@faizanahmad055 Conflicts have been resolved

@stakater-user
Contributor

@avestuk Image is available for testing. docker pull stakater/reloader:SNAPSHOT-PR-341-1c719088

@avestuk avestuk mentioned this pull request Oct 6, 2022
@faizanahmad055
Contributor

faizanahmad055 commented Oct 9, 2022

@avestuk Apologies, the PR is taking a long time to review, as I only get time over weekends to review and test it properly. Everything seems good so far, but while testing I found that the logs show the older, terminated pod as the new leader.

As you can see, there are two pods here: one is terminated and a new one has been created.

[screenshot: pod list showing one terminated pod and one newly created pod]

but here, in the logs of the new pod, it shows that the older, terminated pod was selected as the leader

[screenshot: new pod's logs showing the terminated pod as the elected leader]

Should it be id or current_id here?

@avestuk
Contributor Author

avestuk commented Oct 10, 2022

@faizanahmad055 A pod attempts to acquire the leader lock, but it cannot do so until the current lease has expired; so even if the old leader pod has been terminated, it still holds the lock until the lease runs out. The lease duration is 15s, so it's perfectly possible for the new pod to see the old pod listed as the leader.

@faizanahmad055 faizanahmad055 merged commit 50791ad into stakater:master Oct 10, 2022
3 participants