[RFC] Gating Flux reconciliation #3158

stefanprodan · 2022-09-29T15:16:11Z

Add proposal for introducing a gating mechanism to Flux that would offer a way for cluster admins and other teams involved in the release process to manually approve the rollout of changes onto clusters.

In addition, Flux could offer a way to define maintenance time windows and other time-based gates, to allow a better control of applications and infrastructure changes to critical system.

Note that this proposal is WIP, we need to cover more use-cases from #870

Signed-off-by: Stefan Prodan <stefan.prodan@gmail.com>

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>

hassenius · 2022-10-04T15:39:11Z

I'm curious whether this proposal would easily be able to support

When the gate is closed, will the kustomization/helmrelease controllers still be able to perform drift avoidance (i.e. continue to reconcile currently desired state)
That the gate is open "for a specific revision/change", in other words that when the gate is open it can be known "what" the gate is open for?

pjbgf · 2022-10-04T15:40:12Z

rfcs/XXXX-gating/README.md

+
+```yaml
+apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
+kind: Kustomization


I propose gates are to be applied at sources only instead of Kustomization and HelmRelease objects, as otherwise may face the non-trivial challenges below:

Incompatible desired state when Kustomization / HelmRelease objects are gated whilst others are not, specially when they depend on the same Flux source that is not gated.

Example 1: Kustomization objects app2 and app1 depends on GitRepository team-source. A change is merged into the repository and reconciled for object team-source, however app2 has a close Gate and app1 does not. In this case the desired state at both Git repository and the Flux source in the cluster are valid, but only one Kustomizations is being applied leading to an invalid observed state in the cluster. Same applies if both Kustomizations are behind a Gate which has its state changed before the second object is reconcilied (e.g. example 2 below).

Enforcing the desired state which was allowed before the Gate closing (User Story 2) will be difficult to manage from any point beyond the source, which could lead to the cluster to be in an invalid state.

Example 1: user manually deletes Kubernetes object managed in the hope Flux will re-create it.

Example 2: two kustomizations (parent and child) having the same Gate, the latter has a dependsOn pointing to the former. If the Gate closes after parent but before child is applied, the cluster observed state may be invalid (considering the desired state at source).

pjbgf · 2022-10-04T15:41:12Z

rfcs/XXXX-gating/README.md

+- How can an operator determine if the feature is in use?
+- Are there any drawbacks when enabling this feature?
+-->
+


A few ideas on design details to be discussed:

Multiple Gates

Flux objects that support gating can specify multiple gates. By default,
all gates specified must be open for the reconciliation to go ahead. To
change the behavior spec.gates.require can be set to oneOf instead:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1 kind: GitRepository metadata: name: flux-system namespace: flux-system spec: gates: # all (default): all gates must be open for the reconciliation to go ahead. # oneOf: at least one of the gates must be open for the reconciliation to go ahead. require: oneOf # <all|oneOf> refs: - change-freeze # gate that enforces a change freeze time window - bypass-signoff # gate that allows other gates to be overriden.

When oneOf is used, a single open Gate is required for reconcilations
to proceed.

Recovering from wiped Storage

The source artifact storage lives in the running source controller instance. If its Pod
is recreated, all the artifacts will need to be reacquired. This process happens
automatically as part of each source reconciliation, however, when sources
are restricted by a gate they will require a new process.

In such situation the controller will need to fetch the version used before the Gate was
closed, and that was in the storage before it being wiped. Not all types will support this
and therefore here are the outcomes expected:

GitRepository: restores artifact based on the last known revision (commit ID).

S3 Buckets: fail reconciliation with "artifact not found artifact whilst under closed Gate".

OCI: TBC.

Helm Repository: TBC.

For OCI we have the upstream digest in status, so like with Git, we can use it to pull the data.

hassenius · 2022-10-06T21:11:57Z

rfcs/XXXX-gating/README.md

+```
+
+You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.
+


nvmnd...moved...

hassenius · 2022-10-06T21:18:35Z

rfcs/XXXX-gating/README.md

+> As a member of the SRE team, I want existing deployments to still be
+> reconciled during a change freeze.
+
+Gates can be used to block Flux sources from being refreshed, resulting in Flux


To clarify that block Flux sources from being refreshed in this case means the gate prevent the source controller from making the new assets available, rather than to suspend the polling in the source controller?

It is desirable for the source to still be able to detect changes, such that in-cluster logic that is tied to the gating mechanisms can be notified of available changes.

If it stopped polling, it would avoid all the resource consumption (CPU, memory, Network and storage) for an operation that would not cause any side effect to the cluster, which would stream line the controllers to better handle reconciliations that were not gated.

@hassenius is there any specific scenario you have in mind?

hassenius · 2022-10-06T21:24:58Z

rfcs/XXXX-gating/README.md

+
+### Goals
+
+- Offer a dedicated API for defining time-based gates in a declarative manner.


I would suggest that the gates should not be exclusively time based, but also support specific revisions

hassenius · 2022-10-06T21:42:13Z

rfcs/XXXX-gating/README.md

+This would ensure that Gate changes would not impact the eventual consistency of
+mid-flight reconciliations that were already deployed in the cluster. Flux would also
+continue to re-create Flux managed objects that were manually deleted from the cluster.
+


I would propose an additional use case, bringing in the reference dimension in addition to the time dimension for gates, such that a gate can be defined as open to a specific reference, either in spec, or as an annotation

Story 3

As a multi-cluster operator, I want to control how cluster configuration changes
are rolled out across the cluster estate from a single configuration source

Gates can be used to control the timing and order of where configurations are reconciled

For example

apiVersion: gating.toolkit.fluxcd.io/v1alpha1 kind: Gate metadata: name: approved-revisions namespace: flux-system spec: openTo: 4f08fc93e31

or

apiVersion: gating.toolkit.fluxcd.io/v1alpha1 kind: Gate metadata: annotations: open.gate.fluxcd.io/revision: 4f08fc93e31 name: approved-revisions namespace: flux-system spec: default: closed

hiddeco · 2022-10-07T12:32:42Z

Have been thinking about the current proposal, and am wondering if the ecosystem would benefit from a simple static Gate object which is actually controlled by a Gatekeeper (which knows about the mechanics around e.g. a CronJob).

By doing this, it would be much easier for others to write their own "gatekeeper" which e.g. listens to button presses in a UI.

pjbgf · 2022-10-07T17:35:39Z

rfcs/XXXX-gating/README.md

+objects.
+
+## Design Details
+


In highly regulated environments Gating can be used to ensure specific processes were observed before a new change made its way to a given environment. Flux controllers that support such mechanism should extend their log messages to express what Gates were taken into account at the time of a reconciliation, and their states.

rupelu · 2022-11-07T19:42:19Z

Just throwing in another idea. It would be great if a Gate could evaluate a metric. This way we could decline reconciliation on high load or when some long running task is not yet completed. A metrics based Gate would make checks, whether a release can be rolled out, more automatic. Of course as already mentioned by @pjbgf it would be highly appreciated if we would know the reason why a reconciliation did not happen and which Gate was involved - maybe even what the evaluation of the said metric was.

Also it would be great to force-open one or all Gates. This could make especially sense when a rollout starts with Gates opened but errors come up during the rollout process and the maintenance window reaches its end. Sometimes the way forward is better than a rollback and Operators should be able to decide that the release should be fixed and rolled out never the less. Maybe this could be done with a label or annotation overriding the close-time of the Gate.

pjbgf · 2022-11-25T14:28:24Z

rfcs/XXXX-gating/README.md

+
+### Goals
+
+- Offer a dedicated API for defining time-based gates in a declarative manner.


Suggested change

- Offer a dedicated API for defining time-based gates in a declarative manner.

- Offer a dedicated API for defining gates in a declarative manner.

pjbgf · 2022-11-25T14:52:33Z

rfcs/XXXX-gating/README.md

+
+The `Gate` API would replace Flagger's current
+[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).
+


Additional audit trail for this feature:

Enriching existing logs so that actions that can be gated should log:

What gate(s) were checked (if any).

What gate(s) caused the final result of (not) going ahead with the action.

If multiple gates can be defined, changes to which gates are being observed and what logical operators (and/or) are being used, should be descriptively logged.

Logs for the new Gate controller:

Descriptive summary of the gate state at creation time.

Any change of Gate state should log a descriptive summary.

The additional logs above would be an "opt-in" feature, so that users that do not require this level of governance don't have to process the extra data.

pjbgf · 2022-11-25T15:42:49Z

Have been thinking about the current proposal, and am wondering if the ecosystem would benefit from a simple static Gate object which is actually controlled by a Gatekeeper (which knows about the mechanics around e.g. a CronJob).

By doing this, it would be much easier for others to write their own "gatekeeper" which e.g. listens to button presses in a UI.

I agree with @hiddeco, this would make the implementation quite extensible. Users could implement whatever logic they need to toggle the state on top of the static Gate.

In-tree "GateKeeper" implementations could be added based on most popular requests amongst the community.

pjbgf · 2022-11-25T15:54:36Z

Just throwing in another idea. It would be great if a Gate could evaluate a metric. This way we could decline reconciliation on high load or when some long running task is not yet completed. A metrics based Gate would make checks, whether a release can be rolled out, more automatic. Of course as already mentioned by @pjbgf it would be highly appreciated if we would know the reason why a reconciliation did not happen and which Gate was involved - maybe even what the evaluation of the said metric was.

@rupelu The Gate happens before a new state is applied into the cluster. So I think they are a better fit for defining whether or not sources should be refreshed, for the reasons I mentioned here. However, you could have similar results by using Flagger to do the progressive roll-out side of things for you.
If the release was broken, no (or few) users would be redirected to the new version. Rolling fixes back/forward, would still require the Gate to be opened though.

pjbgf · 2022-12-22T14:28:16Z

I have worked out an example on what this could look like if we were to follow @hiddeco's suggestion on using static gates. This is what I got to: https://hackmd.io/N-xK-y5fTCiFFnusF22Sfw. I talked with a few users to ascertain whether this would mitigate their key use cases, and so far the response has been positive.

The example is not meant to define the implementation details, but rather to flesh out the workflow so folks can raise concerns sooner rather than later.

haggishunk · 2023-09-12T23:35:46Z

Counterpoint:

I am fond of the shifting left that flux encourages with its watchful polling of sources like git repos and image registries. Maybe it's a blocker to widespread community adoption to not have a way to put a human in the loop, but in some cases that human that pushes the deploy button is not the one that merged the app code. Flux is differentiated from many other imperative pipeline CD tools by this very nature which sets the "human in the loop" to be those that commit artifacts to the watched sources of truth and the policies/controls around them.

Let's take the scenario of putting a gate up on a Friday afternoon, having the source evolve with new commits/pushes. Some developers finish their projects and update the source of truth, say with refs to new images. On Monday when the gate is lifted the accumulation of several code changes are released. If things perform unexpectedly, what is the last good state? The last commit before the gate went up, sure, but this is cross domain. I would much rather roll back a merge train that hit Monday morning with reverting commits. You will have better team engagement with the release if you remove the safety of a gate window.

Promoting the suspend feature to a more first class implementation seems to strike the right balance IMHO.

Disclaimer:

Please forgive my rant, but to me flux means "minutes of time difference between my gitops source of truth and my target environment" and I love it that way.

antonmatsiuk · 2023-09-13T07:32:59Z

Counterpoint:

I am fond of the shifting left that flux encourages with its watchful polling of sources like git repos and image registries. Maybe it's a blocker to widespread community adoption to not have a way to put a human in the loop, but in some cases that human that pushes the deploy button is not the one that merged the app code. Flux is differentiated from many other imperative pipeline CD tools by this very nature which sets the "human in the loop" to be those that commit artifacts to the watched sources of truth and the policies/controls around them.

Let's take the scenario of putting a gate up on a Friday afternoon, having the source evolve with new commits/pushes. Some developers finish their projects and update the source of truth, say with refs to new images. On Monday when the gate is lifted the accumulation of several code changes are released. If things perform unexpectedly, what is the last good state? The last commit before the gate went up, sure, but this is cross domain. I would much rather roll back a merge train that hit Monday morning with reverting commits. You will have better team engagement with the release if you remove the safety of a gate window.

Promoting the suspend feature to a more first class implementation seems to strike the right balance IMHO.

Disclaimer:

Please forgive my rant, but to me flux means "minutes of time difference between my gitops source of truth and my target environment" and I love it that way.

I think your argument is unrelated to what most of the people here ask for. The most valuable use case for the gating feature is to have "maintenance windows" on the environments that reconcile from a stable release branch, and not for having CD-style of deployment that you've described. Imagine you have dozens of environments that have to be updated across the globe within fixed timeframes. Do you want to wake up at night to press the merge button to update the environment? If not, then you have to play around with scheduling of "PR merge" events as CronJobs and creating different branches for different timezones in your version control platform. This will become messy and complex pretty fast. Gating feature is an elegant solution for this problem, and if you need to update the environment outside the maintenance window, an "override" flag should be there.

Add proposal for adding a gating mechanism to Flux

650bea4

Signed-off-by: Stefan Prodan <stefan.prodan@gmail.com>

stefanprodan added the area/rfc Feature request proposals in the RFC format label Sep 29, 2022

stefanprodan assigned pjbgf Sep 29, 2022

pjbgf mentioned this pull request Sep 30, 2022

New Feature: Gating Flux reconciliation #3164

Open

4 tasks

rfc: Add story 2 and alternatives

35785b8

Signed-off-by: Paulo Gomes <paulo.gomes@weave.works>

pjbgf reviewed Oct 4, 2022

View reviewed changes

hassenius reviewed Oct 6, 2022

View reviewed changes

pjbgf reviewed Oct 7, 2022

View reviewed changes

pjbgf reviewed Nov 25, 2022

View reviewed changes

darkowlzz mentioned this pull request Mar 22, 2024

Feature Request: Replace reconciliation interval with cron schedule in HelmRelease CRD fluxcd/helm-controller#920

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Gating Flux reconciliation #3158

[RFC] Gating Flux reconciliation #3158

stefanprodan commented Sep 29, 2022 •

edited

hassenius commented Oct 4, 2022

pjbgf Oct 4, 2022

pjbgf Oct 4, 2022

stefanprodan Oct 6, 2022

hassenius Oct 6, 2022 •

edited

hassenius Oct 6, 2022

pjbgf Nov 25, 2022

hassenius Oct 6, 2022

hassenius Oct 6, 2022

hiddeco commented Oct 7, 2022

pjbgf Oct 7, 2022

rupelu commented Nov 7, 2022 •

edited

pjbgf Nov 25, 2022

pjbgf Nov 25, 2022

pjbgf commented Nov 25, 2022

pjbgf commented Nov 25, 2022

pjbgf commented Dec 22, 2022 •

edited

haggishunk commented Sep 12, 2023

antonmatsiuk commented Sep 13, 2023

		```

		You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.


		### Goals

		- Offer a dedicated API for defining time-based gates in a declarative manner.

	- Offer a dedicated API for defining time-based gates in a declarative manner.
	- Offer a dedicated API for defining gates in a declarative manner.


		The `Gate` API would replace Flagger's current
		[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).

[RFC] Gating Flux reconciliation #3158

Are you sure you want to change the base?

[RFC] Gating Flux reconciliation #3158

Conversation

stefanprodan commented Sep 29, 2022 • edited

hassenius commented Oct 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Multiple Gates

Recovering from wiped Storage

Choose a reason for hiding this comment

hassenius Oct 6, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Story 3

hiddeco commented Oct 7, 2022

Choose a reason for hiding this comment

rupelu commented Nov 7, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pjbgf commented Nov 25, 2022

pjbgf commented Nov 25, 2022

pjbgf commented Dec 22, 2022 • edited

haggishunk commented Sep 12, 2023

antonmatsiuk commented Sep 13, 2023

stefanprodan commented Sep 29, 2022 •

edited

hassenius Oct 6, 2022 •

edited

rupelu commented Nov 7, 2022 •

edited

pjbgf commented Dec 22, 2022 •

edited