KEP-4578: Server Feature Gate in etcd #4610

siyuanfoundation · 2024-05-02T00:38:55Z

One-line PR description: KEP to introduce feature gate in etcd

Issue link: Server Feature Gate in etcd #4578

Other comments:

siyuanfoundation · 2024-05-02T00:48:23Z

/sig etcd

cc @serathius @ahrtr @logicalhan @jmhbnz @fuweid @stackbaek @henrybear327 @wenjiaswe

jmhbnz

Thanks for getting this started team. We have already had some discussions on this at KubeCon etc so overall intention makes sense to me.

Two minor questions below from my first read.

keps/sig-etcd/4578-feature-gate/README.md

jmhbnz · 2024-05-04T08:38:15Z

keps/sig-etcd/4578-feature-gate/README.md

+}
+```
+
+The feature gates for a server can only be set with the `--feature-gates` flag during startup. We do not support dynamically changing the feature gates when the server is running.


Should we support setting feature flags via configuration file as well? Imo yes as many of our users prefer to configure etcd via configuration file.

sounds good. Thanks for the suggestion!

k8s-ci-robot · 2024-05-06T18:16:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: siyuanfoundation
Once this PR has been reviewed and has the lgtm label, please ask for approval from serathius. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-etcd/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

keps/sig-etcd/4578-feature-gate/README.md

keps/sig-etcd/4578-feature-gate/kep.yaml

sftim · 2024-05-07T10:56:51Z

keps/sig-etcd/4578-feature-gate/README.md

+<!--
+What other approaches did you consider, and why did you rule them out? These do
+not need to be as detailed as the proposal, but should include enough
+information to express the idea and why it was not acceptable.
+-->


We could introduce the internal feature gate support in a minor version of etcd (eg 3.6), so that an 3.6 etcd can see what feature gates are active. And then release etcd 3.7 that actually lets you turn on feature gates.

I think that would be more effort but a better downgrade story.

etcd has very long release cycles. If possible, it would be good to have the feature gate available in 3.6. That's why we are proposing to have minimal backport in 3.5, so that it is ok to start using them in 3.6.

keps/sig-etcd/4578-feature-gate/README.md

ahrtr · 2024-05-07T12:57:33Z

keps/sig-etcd/4578-feature-gate/README.md

+
+It would be easier for developers to add new features without worrying too much about breaking etcd, because the feature would be Alpha first and disabled by default, and new changes would be mostly gated behind the feature gate.
+
+It would be a smoother process to move the feature to the next stage without the need to introduce new flags.


Adding feature gate and introducing new flags should be orthogonal? Enabling a feature gate may enable users to configure certain flags for the feature.

Rephrase it a little bit. Can you check if it is more clear?

ahrtr · 2024-05-07T13:15:04Z

keps/sig-etcd/4578-feature-gate/README.md

+  // if true, query if the features are enabled for the single server.
+  // otherwise, query if the features are enabled for the cluster.
+  bool isServerFeature = 1;


For simplicity, suggest not to add the concept of "cluster level feature gate". If inconsistency across all nodes in the cluster may cause problem, then let's clearly clarify it in the document, and print warning/error in log if possible. If the inconsistency may cause critical issue (i.e. data inconsistency), then we can even fail the etcd's startup (but I do not see such case so far).

Also this is a generic issue, not limited to only feature gate. etcd has lots of flags, the inconsistencies across all nodes for each flag may cause issue or confusion.

Suggested change

// if true, query if the features are enabled for the single server.

// otherwise, query if the features are enabled for the cluster.

bool isServerFeature = 1;

inconsistencies is inevitable in upgrade/downgrade scenarios. For example a feature can be alpha (disabled by default) in one version, and beta (enabled by default) in the next. I think we have to have a clear way of handling the inconsistencies. Just warning/error messages is not enough.

Name clusterLevelFeature makes more sense then isServerFeature. But see bigger concern below

With respect to the inconsistence,

As mentioned above, inconsistence is a generic issue when upgrading/updating a cluster, not only limited to the feature gate. For example, when we change the value of a flag (it can be any flag, including the feature gate) during upgrading or updating a cluster, there is a small window of inconsistence. So if we really want to resolve such issue, it's out of the scope of feature gate, we should have a more generic solution.

But I don't see any major or critical issue for such inconsistence. So we need to have a list of such issues firstly, and clearly clarify each issue, then discuss the solution. It can be discussed separately.

feature gate is a potential solution to inconsistencies.

if a flag is available in both versions, it is probably ok to assume they are consistently set during upgrade/downgrade.

if a flag is available only in one version, ideally it should be gated behind a feature gate, and feature gate would gracefully handle the case when it is set in one version but not in the other.

Let's imagine a hypothetical case:

a new write type is introduced in V_n+1, and some K8s function depends on this new feature.

K8s asks an etcd server (which happens to be V_n+1) if the feature is enabled, and gets confirmation, and starts sending the new write requests.

the new write requests would be sent to etcd leader (which happens to be V_n), and it does not recognize this new write type.

However short this intermediate mixed version state is, the workload could potentially break the whole cluster. Right now, we have to introduce a new raft type in V_n before using it in V_n+1 to prevent such failures. But if we can handle cluster feature enablement, we could remove the step of introducing a new raft type in V_n.
I think there is great benefit in speeding up the dev process with this cluster level feature gate.

Another example is the ProgressNotify, which KEP-2340 depends on. So K8s performance can be greatly impacted by whether this feature is available or not. If there is no cluster level feature information, depending on which api server is the master, or which etcd server it talks to, ProgressNotify could flip flop, and making the performance inconsistent. There is no guarantee etcd upgrade/downgrade could be done in x seconds, I think we should make our best to make the intermediate state predictable.

cc @serathius

ahrtr · 2024-05-07T13:23:40Z

keps/sig-etcd/4578-feature-gate/README.md

+### Upgrade / Downgrade Strategy
+
+The feature gate feature would available in 3.6+. 
+In 3.5, the necessary proto changes would be backported, but feature gate would not be available. 


Why do we need to backport proto change to 3.5? We are just adding the feature of feature gate. We do not add any new feature which may change or break the proto.

For the impact on the upgrade/downgrade, we need to analyze each new feature case by case. But for this feature (feature-gate), it seems safe for now.

If we are planning to migrate some existing experimental feature, we can do it in 3.7.

The plan is for the cluster level features, each member needs to know the enabled features in other members, and save that in a new member Attribute filed.

Let's try not impact 3.5 in any way. We should keep the downgrade as simple & easy as possible.

FYI. I am thinking to simplify the existing downgrade workflow. But it can be discussed separately.

Added the Risks and Mitigations section to explain this change should not really impact 3.5. Does it make sense @ahrtr ?

keps/sig-etcd/4578-feature-gate/README.md

ahrtr · 2024-05-13T18:34:12Z

keps/sig-etcd/4578-feature-gate/README.md

+  // if true, query if the features are enabled for the single server.
+  // otherwise, query if the features are enabled for the cluster.
+  bool isServerFeature = 1;


Name clusterLevelFeature makes more sense then isServerFeature. But see bigger concern below

With respect to the inconsistence,

As mentioned above, inconsistence is a generic issue when upgrading/updating a cluster, not only limited to the feature gate. For example, when we change the value of a flag (it can be any flag, including the feature gate) during upgrading or updating a cluster, there is a small window of inconsistence. So if we really want to resolve such issue, it's out of the scope of feature gate, we should have a more generic solution.

But I don't see any major or critical issue for such inconsistence. So we need to have a list of such issues firstly, and clearly clarify each issue, then discuss the solution. It can be discussed separately.

ahrtr · 2024-05-13T18:35:44Z

keps/sig-etcd/4578-feature-gate/README.md

+### Upgrade / Downgrade Strategy
+
+The feature gate feature would available in 3.6+. 
+In 3.5, the necessary proto changes would be backported, but feature gate would not be available. 


Let's try not impact 3.5 in any way. We should keep the downgrade as simple & easy as possible.

FYI. I am thinking to simplify the existing downgrade workflow. But it can be discussed separately.

serathius · 2024-05-15T08:30:21Z

keps/sig-etcd/4578-feature-gate/README.md

+
+## Summary
+
+We are introducing a new `--feature-gates` flag in etcd, and the underlying framework to gate future feature enhancement to etcd. 


Adding a flag should not be a goal of it's self, but a means. Please describe what you are proposing in one paragraph.

Thanks for the suggestion. Updated the summary, can you take a look again?

shyamjvs · 2024-05-20T05:21:55Z

keps/sig-etcd/4578-feature-gate/README.md

+
+* Introduce apis/methods to query the enablement state of a feature.
+
+* Establish lifecycle stages of a feature, and clear criteria of lifecycle progression.


Since we also cover consensus-based feature enablement in this KEP, can we add that as an explicit goal here? Say Establish a mechanism to safely and consistently enable/disable cluster-level features

Thanks @shyamjvs. These are all very good points. After some thoughts, we decided to move cluster level feature gate into another KEP, we can continue the discussions there #4647

+1 to that - makes this KEP more tractable.

One question - is there any coupling we expect between member-level and cluster-level features (like the APIs/mechanisms to enable and expose them)? If so, it'll help to roughly address in this KEP how we're leaving space to extend to the latter usecase in future.

In this KEP, there is a line of we will not expose querying feature enablement of any cluster level features, because we cannot risk possible conflicting settings that might break cluster correctness..
The alternative is to expose the cluster level features as per server feature, and users can aggregate them across all members, but that could be prone to misusage like when to use aggregation, and not really based on a raft consensus.

yeah verifiability and correctness will be hard to prove from client-side imho (similar to how /members API gives consistent list of members rather than clients assuming it to be "running instances")

shyamjvs · 2024-05-20T05:26:51Z

keps/sig-etcd/4578-feature-gate/README.md

+
+#### Story 4
+
+During cluster upgrade/downgrade, feature changes across versions should have predictable behavior in this mix version scenario. 


Can we assert a stronger claim/invariant here, say - predictable behavior governed by the lowest member version? Using concrete stories early on in the KEP can improve the mental model for what's proposed later on.

Also if there's a good reason why we don't want to be that specific here, open to hearing it :)

shyamjvs · 2024-05-20T05:51:29Z

keps/sig-etcd/4578-feature-gate/README.md

+
+#### Story 3
+
+In a HA cluster, users should be able to enable/disable a feature without bring down the whole cluster, by restarting the servers with the new feature gate flag one by one. 


Given cluster-level etcd features decouple "when a feature gate gets turned on for member(s)" and "when the feature actually gets started to be used by the cluster" - I'd recommend completing this story by adding .. And for cluster-level features, be able to query whether the feature is currently in use by the cluster.

And for debuggability, we might need to go a step further and also return when the feature was enabled/disabled (say the consistent_index where cluster agreed to do that).

shyamjvs · 2024-05-20T06:04:21Z

keps/sig-etcd/4578-feature-gate/README.md

+We think the answer is NO:
+* the new feature would need to be enabled by default to always apply the bug fix for new releases.
+* it changes the API which is not desirable in patch version releases.


One exception to consider here is - a long-standing bug that can only be fixed by making a breaking API/semantic change. In some cases, the bug fix may be important enough to release in a patch version. To do that safely though, we'd likely need a feature gate.

A concrete example coming to my mind is kubernetes/kubernetes#123935.

I do realize this is more of an exception than the norm. Nonetheless, it feels prudent for us to keep that door open.

shyamjvs · 2024-05-20T06:21:26Z

keps/sig-etcd/4578-feature-gate/README.md

+
+## Proposal
+
+### User Stories (Optional)


One story that's unclear for me (not covered in any of the below so asking) - do we foresee feature gates that once enabled by the cluster, cannot (or should not) be disabled? For e.g because it would lead to breaking changes in the apply/storage layer. Depending on the feature, we may need an offline migration step or enforce the feature to be uni-directional or something else. But this is something we should deliberate on here imho.

FWIW this problem seems equivalent to version downgrade support - so cc @serathius for ideas/opinions.

For e.g because it would lead to breaking changes in the apply/storage layer

We should never allow etcd running into such situation, otherwise it may lead to data inconsistencies.

One story that's unclear for me (not covered in any of the below so asking) - do we foresee feature gates that once enabled by the cluster, cannot (or should not) be disabled?

cluster-level feature isn't in the scope of this KEP, it's a complicated topic. But personally I think we should only allow users to disable a cluster-level feature offline.

Signed-off-by: Siyuan Zhang <sizhang@google.com>

ahrtr · 2024-05-23T13:20:55Z

keps/sig-etcd/4578-server-feature-gate/README.md

+| --- | --- | --- |
+| Alpha | <ul><li>Disabled by default. </li><li>Might be buggy. Enabling the feature may expose bugs. </li><li>Support for feature may be dropped at any time without notice. </li><li>The API may change in incompatible ways in a later software release without notice. </li><li>Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.</li></ul> | Before moving a feature to Beta, it should have <ul><li> Full unit/integration/e2e/robustness test coverage.</li><li>Full performance benchmark/test if applicable.</li><li> No significant changes for at least 1 minor release.</li><li>Other feature specific criteria in the feature KEP. </li></ul> |
+| Beta | <ul><li>Enabled by default. </li><li>The feature is well tested. Enabling the feature is considered safe.</li><li>Support for the overall feature will not be dropped, though details may change.</li><li>Recommended for only non-business-critical uses because of potential for discovering new hard-to-spot bugs through wider adoption.</li></ul> | Before moving a feature to GA, it should have <ul><li> Widespread usage.</li><li>No bug reported for at least 1 minor release.</li><li>Other feature specific criteria in the feature KEP. </li></ul> |
+| GA | <ul><li>The feature is always enabled; you cannot disable it.</li><li>The corresponding feature gate is no longer needed.</li><li>Stable versions of features will appear in released software for many subsequent versions.</li></ul> | Before deprecating a GA feature, it should have <ul><li> Feature deprecation announcement.</li><li>No user impacting change for at least 1 minor release.</li><li>Other feature specific criteria in the feature KEP. </li></ul> |


GA | The feature is always enabled; you cannot disable it.

Question: does it mean that a GAed feature will never have a bool flag (i.e. --pre-vote) for users to enable or disable it?

When a feature is GAed, it only means that its quality meets some criteria. But it may still make sense for users to enable or disable it for different scenarios or purpose, and it's exactly the reason why there are still some bool flags in api-server and etcd.

feature gates are not replacement of boolean flags. They are different concepts. Feature gate is meant to facilitate the development of a feature. A feature could have many other flags, including boolean flags. Here is one example in K8s: https://github.com/kubernetes/kubernetes/blob/5a121aad53a819c3c9b26a6b5866b1ea8c686f5b/pkg/kubelet/apis/config/validation/validation.go#L142

In the case of pre-vote, a feature gate of PreVote means there is a new PreVote feature people can use, and in which you can toggle --pre-vote on and off. This is redundant in this simple boolean case. But if a feature is big and with many options, then it would be more clear.

keps/sig-etcd/4578-server-feature-gate/README.md

ahrtr · 2024-05-23T15:15:30Z

The initial motivation is only to replace the existing new feature guidance with feature gate, so as to not break anything when graduating a feature.

The cluster level feature is a complicated topic, it's OK to discuss it separately.

Co-authored-by: Benjamin Wang <benjamin.wang@broadcom.com>

tengqm · 2024-05-23T23:24:47Z

keps/sig-etcd/4578-server-feature-gate/README.md

+* graduating a feature is no longer a breaking change with new flags.
+
+* well-established feature lifecycle convention from Kubernetes.
+
+* a unified place to store all the feature enablement state in the code base for simplicity, and easy to refer to in guarding the implementation code.
+
+* easy to query if a feature is available for an ectd server.


graduating a feature is no longer a breaking change with new flags.

No. You still need to toggle the gate, which is equivalent to changing command line args.

well-established feature lifecycle convention from Kubernetes

It works for Kubernetes doesn't necessarily mean it will work for etcd.
Not everything in the world are there to be copied.

a unified place to store all the feature enablement state in the code base for simplicity, and easy to refer to in guarding the implementation code.

This has nothing to do with feature gates. Users typically don't care how code are organized and we should not force them to.
Exposing flags or configuration items are acceptable when there are user visible changes are introduced.

easy to query if a feature is available for an ectd server.

When people over engineering something, there always are "reasons".
Please think twice ... do we really need this?

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 2, 2024

k8s-ci-robot requested a review from ahrtr May 2, 2024 00:39

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label May 2, 2024

k8s-ci-robot requested a review from jmhbnz May 2, 2024 00:39

k8s-ci-robot added sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 2, 2024

siyuanfoundation force-pushed the etcd branch 8 times, most recently from 5e3a2db to 9e23e71 Compare May 3, 2024 16:08

jmhbnz reviewed May 4, 2024

View reviewed changes

serathius self-assigned this May 6, 2024

siyuanfoundation force-pushed the etcd branch from 9e23e71 to b2c236f Compare May 6, 2024 18:16

sftim reviewed May 7, 2024

View reviewed changes

ahrtr reviewed May 7, 2024

View reviewed changes

keps/sig-etcd/4578-feature-gate/README.md Outdated Show resolved Hide resolved

ahrtr reviewed May 7, 2024

View reviewed changes

siyuanfoundation force-pushed the etcd branch 6 times, most recently from f229b14 to e621249 Compare May 9, 2024 21:33

ahrtr reviewed May 13, 2024

View reviewed changes

serathius reviewed May 15, 2024

View reviewed changes

siyuanfoundation force-pushed the etcd branch 5 times, most recently from 116081b to b0bc99d Compare May 17, 2024 15:52

siyuanfoundation mentioned this pull request May 17, 2024

Add server level feature gate etcd-io/etcd#18023

Open

8 tasks

siyuanfoundation force-pushed the etcd branch 3 times, most recently from c84592c to 95cef2e Compare May 18, 2024 04:03

shyamjvs reviewed May 20, 2024

View reviewed changes

etcd: server feature gate in etcd.

3f539dd

Signed-off-by: Siyuan Zhang <sizhang@google.com>

siyuanfoundation force-pushed the etcd branch from 95cef2e to 3f539dd Compare May 20, 2024 22:46

siyuanfoundation changed the title ~~KEP-4578: Feature Gate in etcd~~ KEP-4578: Server Feature Gate in etcd May 20, 2024

ahrtr reviewed May 23, 2024

View reviewed changes

keps/sig-etcd/4578-server-feature-gate/README.md Outdated Show resolved Hide resolved

Update keps/sig-etcd/4578-server-feature-gate/README.md

94bbd6a

Co-authored-by: Benjamin Wang <benjamin.wang@broadcom.com>

ahrtr mentioned this pull request May 23, 2024

Support setting LocalAddr in peer communication - with e2e tests etcd-io/etcd#17661

Merged

tengqm reviewed May 23, 2024

View reviewed changes


		It would be easier for developers to add new features without worrying too much about breaking etcd, because the feature would be Alpha first and disabled by default, and new changes would be mostly gated behind the feature gate.

		It would be a smoother process to move the feature to the next stage without the need to introduce new flags.

	// if true, query if the features are enabled for the single server.
	// otherwise, query if the features are enabled for the cluster.
	bool isServerFeature = 1;


		## Summary

		We are introducing a new `--feature-gates` flag in etcd, and the underlying framework to gate future feature enhancement to etcd.


		* Introduce apis/methods to query the enablement state of a feature.

		* Establish lifecycle stages of a feature, and clear criteria of lifecycle progression.


		#### Story 4

		During cluster upgrade/downgrade, feature changes across versions should have predictable behavior in this mix version scenario.


		#### Story 3

		In a HA cluster, users should be able to enable/disable a feature without bring down the whole cluster, by restarting the servers with the new feature gate flag one by one.

KEP-4578: Server Feature Gate in etcd #4610

Are you sure you want to change the base?

KEP-4578: Server Feature Gate in etcd #4610

Conversation

siyuanfoundation commented May 2, 2024

siyuanfoundation commented May 2, 2024

jmhbnz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented May 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shyamjvs May 21, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shyamjvs May 20, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

siyuanfoundation May 23, 2024 • edited

Choose a reason for hiding this comment

ahrtr commented May 23, 2024

Choose a reason for hiding this comment

shyamjvs May 21, 2024 •

edited

shyamjvs May 20, 2024 •

edited

siyuanfoundation May 23, 2024 •

edited