Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Enable cluster label verification in Helm #2865

Closed
krajorama opened this issue Aug 31, 2022 · 8 comments · Fixed by #7961
Closed

Docs: Enable cluster label verification in Helm #2865

krajorama opened this issue Aug 31, 2022 · 8 comments · Fixed by #7961
Assignees
Labels
helm keep-alive squad/customer type/docs Improvements or additions to documentation
Milestone

Comments

@krajorama
Copy link
Contributor

krajorama commented Aug 31, 2022

Describe the bug

Two separate memberlist gossip rings may join into one when IP addresses are reused between the two due to for example a kubernetes migration restarting a lot of PODs quickly.

To Reproduce

N/A incident RCA exists

Expected behavior

Gossip rings do not interact

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Additional Context

#2349

Requires a three step migration, so major version bump and migration document needed.

@krajorama krajorama added the helm label Aug 31, 2022
@dimitarvdimitrov
Copy link
Contributor

This looks like a docs issue to me. The migration is possible right now, helm doesn't prevent it. We can document how to do the migration and then enable this by default and do a major release.

@krajorama
Copy link
Contributor Author

This looks like a docs issue to me. The migration is possible right now, helm doesn't prevent it. We can document how to do the migration and then enable this by default and do a major release.

Yes, I agree, I was thinking about how to write a migration path, but it's three steps, not super complicated. And the user can keep backwards compatibility with one setting in mimir.structuredConfig.memberlist.cluster_label_verification_disabled: true which is also the first step to do a migration.

@krajorama krajorama added the type/docs Improvements or additions to documentation label Sep 1, 2022
@mattmendick
Copy link
Contributor

Please link to this from the (yet to be completed) "Running Helm in production" document

@mattmendick mattmendick changed the title Helm: enable cluster label verification [Docs] Helm: enable cluster label verification Oct 11, 2022
@dimitarvdimitrov
Copy link
Contributor

@lamida said he's interested in picking this up, so I'm posting my notes to help pick this up

Breaking change

The reason for doing this migration is because making all changes in one go is a breaking change. The agreement with @krajorama is that we want to eventually have cluster label verification enabled by default in the chart.

We have a deprecation policy of 2 major releases. I think it might make sense to follow the same period for this breaking change. Otherwise, a period of 6 months to a year sounds viable for this. I'm curious to hear what others think.

Migration doc

Because it's a breaking change we need a migration doc. I think it would make sense to have multiple parts for this doc. Would be nice to get feedback from @osg-grafana and other involved if you think this is viable.

  1. intro
    • why we're doing this change and who it applies to. This is basically anyone who is running multiple Mimir, Loki or Tempo (or their enterprise versions) in the same kubernetes cluster. More generally in any cluster where the pods may reuse each other's IPs after churning. Mention the risk of not doing the migration - clusters might merge and start sharding data.
    • An estimation of the effort and time it will take.
    • The release when we will introduce the breaking change.
    • The alternative - what happens if you do all of this in one go without a migration - the risk is that old members of the cluster will not have access to the hash ring. This will be crucial for the new ingesters who will experience a huge resharding. I think this should be brief
  2. More detailed explanation for how this migration solves this problem for folks that are interested to understand what is actually happening
  3. The actual migration
  4. How to verify that the migration has been successful

I think most of the sections are self-explanatory. I have notes on two of them

2. More detailed explanation for how this migration solves this problem for folks that are interested to understand what is actually happening

This is an excerpt from the internal docs for this migration. Might be good to include it in some form or another

The memberlist label filter discards all incoming memberlist messages and streams which the memberlist client receives which have a label that doesn’t match its own configured label. By default the configured label is “” and the label filter is enabled (the option to disable it is disabled).

3. The actual migration

The actual migration has three parts: I've taken them from this blog post and the internal procedure for doing the migration

  1. Explicitly disable the cluster label verification in memberlist - mimir.config.memberlist.cluster_label_verification_disabled: true. Roll out all the whole cluster.
  2. Add the memberlist cluster label to all components. mimir.config.memberlist.cluster_label: "$helm-release-name". Roll out the whole cluster.
  3. Change the verification to true and roll out the cluster again.

Helping with the migration

To help users with the migration we can do step 1. and include it in a helm release. The change is small and non-breaking. I think we should do it. This way the mandatory migration with the major release will be shorter for people that have a relatively recent helm chart version.

@osg-grafana osg-grafana changed the title [Docs] Helm: enable cluster label verification Docs: Enable cluster label verification in Helm Nov 10, 2023
@osg-grafana osg-grafana added this to the FY24Q4 milestone Nov 10, 2023
@osg-grafana
Copy link
Contributor

SMEs were @dimitarvdimitrov and @krajorama; SMEs will be @lamida and @francoposa.

@lamida
Copy link
Contributor

lamida commented Apr 29, 2024

I just finished the first draft of the documentation in #7961 @krajorama @dimitarvdimitrov .

However, I haven't put anything yet on stating step 1 will be part of our future Mimir helm release.

To help users with the migration we can do step 1. and include it in a helm release.

Do we still want to release a helm chart version to disable cluster label verification to remove one migration step if eventually in the future we will enable cluster label verification again by default in the chart?

@dimitarvdimitrov
Copy link
Contributor

To help users with the migration we can do step 1. and include it in a helm release.

Do we still want to release a helm chart version to disable cluster label verification to remove one migration step if eventually in the future we will enable cluster label verification again by default in the chart?

I feel like we're past this point. If we have the migration doc already, we can switch on cluster label verification on in the next release.

@krajorama
Copy link
Contributor Author

About the cluster label itself, I propose cluster_label to include the namespace and the helm release name, because technically you can install the same release name in different namespaces.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
helm keep-alive squad/customer type/docs Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants