Leadership election #341
Conversation
@avestuk Image is available for testing.
Hi @avestuk, thank you so much for the contribution. I can take a look at the PR and will try to test it as well. Could you also add some test cases?
@faizanahmad055 I'll absolutely add some tests. What would be really helpful at this point is confirmation that you're happy with the overall approach; then I'll flesh this PR out with tests. I'm currently digging into the behaviour of the controllers on startup. I can see that there's a flag to allow…

Consider: we have Reloader pod A as leader and Reloader pod B also running. Reloader A has previously updated the config map for pod test. Reloader A dies, and the config map is updated again. Reloader B takes over leadership and should reconcile that the config map has been updated and perform the update.

EDIT: I've been through the logic in https://github.com/stakater/Reloader/blob/master/internal/pkg/handler/upgrade.go#L134 and it's clear that updates will happen if you have also set…

We've settled this issue here: #341 (comment)
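The scenario above (a standby replica reconciling changes the dead leader missed) can be sketched as a toy Go program. This is not the client-go API or Reloader's actual code; the types and names are illustrative only:

```go
package main

import "fmt"

// reloader is a toy model of a standby replica: it tracks config maps
// that were updated while it was not the leader.
type reloader struct {
	name    string
	pending []string // config maps updated while another pod led
}

// onStartedLeading mimics the "became leader" callback: a full drain
// of pending updates covers anything the old leader missed between
// its death and our acquisition of the lock.
func (r *reloader) onStartedLeading() {
	for _, cm := range r.pending {
		fmt.Printf("%s: reconciling update to %s\n", r.name, cm)
	}
	r.pending = nil
}

func main() {
	// Leader A has died; B observed an update to "test-configmap"
	// in the meantime and now acquires leadership.
	b := &reloader{name: "reloader-b", pending: []string{"test-configmap"}}
	b.onStartedLeading() // prints: reloader-b: reconciling update to test-configmap
}
```

In the real implementation the equivalent hook is the `OnStartedLeading` callback in client-go's `leaderelection.LeaderCallbacks`.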
@avestuk Image is available for testing.
@faizanahmad055 I've added test coverage for the liveness probe and leadership election. I believe the one test should be sufficient, as I've not otherwise modified the controller behaviour. If there are other cases you'd like me to add, please let me know.
@avestuk Image is available for testing.
@avestuk Thank you for the update. I will review this soon. It says there are conflicts that need to be resolved. Could you please pull the latest changes in the meantime? :)
Force-pushed from 87d03de to 11e76fa
@faizanahmad055 Should be up to date now.
@avestuk Image is available for testing.
@avestuk Regarding the reload-on-create event, I think this should still be disabled by default. The default behaviour is to reload based on modification, but the flag can be enabled to also reload the pod upon resource creation. The issue is that this feature has a limitation: Reloader doesn't know whether the secret was created before or after the deployment itself, which can cause a false reload when your application is deployed for the first time, since it will see the new secret and simply reload the deployment. What is your opinion?
@faizanahmad055 I agree with you; I think documenting exactly how… I'll make the changes and draft some documentation.
@faizanahmad055 I've expanded the README to explain what I see as the trade-offs for reloadOnCreate. I think that ultimately the desired behaviour will depend on how your workloads are configured. In the ideal scenario, a rolling upgrade won't cause disruption because terminationGracePeriods, podDisruptionBudgets, etc. are set correctly, so workloads restart gracefully. However, it seems to me that there's a higher potential for disruption by defaulting…
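For reference, keeping the behaviour opt-in might look like the following container args. The flag spelling `--reload-on-create` is an assumption based on this discussion, not confirmed from the README; check the project's documentation for the exact name:

```yaml
# Hypothetical Deployment snippet: reload-on-create stays off unless
# the operator explicitly enables it.
containers:
  - name: reloader
    image: stakater/reloader:latest
    args:
      - "--reload-on-create=true"   # opt-in; assumed default is false
```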
@avestuk Image is available for testing.
I'm happy to split the pod anti-affinity out into a separate PR, btw; it just occurred to me that it'd be useful to have for HA.
@avestuk Apologies for the delay. I have been sick for the past couple of days and couldn't take a look. I will try to test and merge this today, or tomorrow at the latest.
I think we should also put a condition here: if HA is false, the replica count should default to 1; otherwise it should be whatever the user has assigned, to avoid any issues.
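That condition could be expressed in the chart's Deployment template roughly like this. The values paths (`reloader.enableHA`, `reloader.deployment.replicas`) are illustrative and may not match the chart's real keys:

```yaml
# Hypothetical Helm template fragment: force a single replica unless
# HA mode is explicitly enabled.
spec:
  {{- if .Values.reloader.enableHA }}
  replicas: {{ .Values.reloader.deployment.replicas }}
  {{- else }}
  replicas: 1
  {{- end }}
```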
@avestuk, when I try to run with HA, both pods keep trying to become the leader and keep crashing.

Pod 1:

```
[27/09/22 9:34:14] ~ kc logs -f stakater-reloader-784d88cf6f-mtrz4
time="2022-09-27T19:35:04Z" level=info msg="Environment: Kubernetes"
time="2022-09-27T19:35:04Z" level=info msg="Starting Reloader"
time="2022-09-27T19:35:04Z" level=warning msg="KUBERNETES_NAMESPACE is unset, will detect changes in all namespaces."
time="2022-09-27T19:35:04Z" level=info msg="created controller for: configMaps"
time="2022-09-27T19:35:04Z" level=info msg="created controller for: secrets"
I0927 19:35:04.195413 1 leaderelection.go:248] attempting to acquire leader lease default/stakaer-reloader-lock...
I0927 19:35:04.199897 1 leaderelection.go:258] successfully acquired lease default/stakaer-reloader-lock
time="2022-09-27T19:35:04Z" level=info msg="still the leader!"
time="2022-09-27T19:35:04Z" level=info msg="became leader, starting controllers"
```

Pod 2:

```
kc logs -f stakater-reloader-784d88cf6f-flrgb
time="2022-09-27T19:35:04Z" level=info msg="Environment: Kubernetes"
time="2022-09-27T19:35:04Z" level=info msg="Starting Reloader"
time="2022-09-27T19:35:04Z" level=warning msg="KUBERNETES_NAMESPACE is unset, will detect changes in all namespaces."
time="2022-09-27T19:35:04Z" level=info msg="created controller for: configMaps"
time="2022-09-27T19:35:04Z" level=info msg="created controller for: secrets"
I0927 19:35:04.192962 1 leaderelection.go:248] attempting to acquire leader lease default/stakaer-reloader-lock...
time="2022-09-27T19:35:04Z" level=info msg="new leader is stakater-reloader-784d88cf6f-mtrz4"
```

And when I try to run with HA false, it still keeps crashing. I am testing with minikube version v1.27.0. Can you please test it locally and see if it works for you?
@avestuk I have requested some changes and left some feedback; can you please check? Also, can you please pull the latest changes from the upstream master and resolve any conflicts?
Should we move the leadership bits into their own pkg?
Pull liveness into leadership to ease testing; logically the liveness probe is directly affected by leadership, so it makes sense there. Moved some components of the controller tests into the testutil package for reuse in my own tests.
Liveness probe endpoint will always be blocking on the main thread
Force-pushed from bc90f07 to 488eaa9
@faizanahmad055 Firstly, thanks very much! I've addressed your feedback; I'm sorry for the delay, but I was on holiday. I'm also sorry about the issue with the restarts: the call to run leadership election is blocking, and I must've missed that when I added the liveness probe. Everything appears to be working properly now.
@avestuk Image is available for testing.
@avestuk Thank you so much for the update, and I hope you had a really good vacation. Community contributions are always welcome, and this feature was much needed, so thank you for working on it. The PR is a bit big, so I will try to test and merge it as soon as possible. In the meantime, can you please resolve the conflicts? Apologies for the inconvenience; another PR that had been pending for some time was merged recently.
@faizanahmad055 Conflicts have been resolved.
@avestuk Image is available for testing.
@avestuk Apologies, the PR is taking too long to review, as I only get time over the weekends to review and test it properly. Everything seems good so far, but while testing I found that the logs show the older, terminated pod as the new leader. As you can see, there are two pods here: one is terminated and a new one has been created, but the logs of the new pod show the older, terminated pod being selected as leader. Should it be…
@faizanahmad055 A pod attempts to acquire the leader lock. It cannot acquire the lock until the lease has expired, so even if the old leader pod has been terminated, it still holds the lock until the lease expires. The lease duration is 15s, so it's perfectly possible for the new pod to see the old pod as the leader.
Hello
This PR is to address running Reloader with multiple replicas as requested in #112 and it's something that we'd love to have at Nutmeg!
I'm opening the PR at this point hoping to get some feedback on my approach and to find out whether you are amenable to it.
Outstanding items include, but are likely not limited to:
- Verifying that RBAC perms are correct if Reloader is running in a single namespace
- Adding the liveness probe to the Helm chart
- Understanding the behaviour of controllers when leadership is assumed
- When items are added to the queue on startup they'll be processed via Add if `reloadOnCreate` is true; therefore `reloadOnCreate` should be set by default when in HA mode?
- I'm not sure how to test the shutdown on failing to renew the lease, so that's potentially also an outstanding item.
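The startup concern in the list above (every pre-existing object replayed as an Add event) can be sketched as a toy Go handler. This is illustrative only, not Reloader's actual controller code: when an informer cache syncs, every existing secret and config map arrives as an Add, so with `reloadOnCreate` enabled each of them could trigger a reload on boot:

```go
package main

import "fmt"

// handleAdd is a toy model of a create-event handler: with
// reloadOnCreate disabled, Add events (including the replay of
// pre-existing objects at startup) are ignored.
func handleAdd(obj string, reloadOnCreate bool) string {
	if reloadOnCreate {
		return "reload triggered by " + obj
	}
	return "ignored create event for " + obj
}

func main() {
	// Objects that already existed before this replica started; the
	// informer delivers them as Add events once its cache syncs.
	existing := []string{"secret/app-creds", "configmap/app-config"}
	for _, obj := range existing {
		fmt.Println(handleAdd(obj, false)) // default: no reload on create
	}
}
```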