Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for helm memberlist cluster label migration #7961

Merged
merged 34 commits into from May 24, 2024

Conversation

lamida
Copy link
Contributor

@lamida lamida commented Apr 24, 2024

What this PR does

Add documentation to migrate mimir helm installation to enable cluster label verification. The rendered doc is here.

Which issue(s) this PR fixes or relates to

Fixes #2865

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
@lamida lamida changed the title First pass helm memberlist cluster label migration doc Add documentation for helm memberlist cluster label migration Apr 24, 2024
weight: 110
---

# Introduction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [doc-validator] reported by reviewdog 🐶
The first heading 'Introduction' doesn't match the title 'Migrating Mimir Helm Installation to Use Memberlist Cluster Label'.
Decide which of the first heading or title is most applicable and change the other to match, or update them both.


The migration should take around 30 minutes. The risk of not doing this migration if you have Mimir and Tempo or Loki in the same cluster is the possibility of those different backend to talk with each other.

# How The Migration Solve The Issue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [doc-validator] reported by reviewdog 🐶
The first heading 'How The Migration Solve The Issue' doesn't match the title 'Migrating Mimir Helm Installation to Use Memberlist Cluster Label'.
Decide which of the first heading or title is most applicable and change the other to match, or update them both.


Cluster label will prevent such situation by only allowing communication for components that has same cluster label. Once that is enabled, before memberlist try to communicate with other components, it will verify whether that new components are having same cluster label and only allowing memberlist to communicate if the parties are having the same cluster label.

# Migration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [doc-validator] reported by reviewdog 🐶
The first heading 'Migration' doesn't match the title 'Migrating Mimir Helm Installation to Use Memberlist Cluster Label'.
Decide which of the first heading or title is most applicable and change the other to match, or update them both.


Apply the configuration changes by running `helm upgrade mimir-distributed -f values.yaml`.

# 4. Verifying The Migration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [doc-validator] reported by reviewdog 🐶
The first heading '4. Verifying The Migration' doesn't match the title 'Migrating Mimir Helm Installation to Use Memberlist Cluster Label'.
Decide which of the first heading or title is most applicable and change the other to match, or update them both.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
@lamida lamida marked this pull request as ready for review April 29, 2024 08:39
@lamida lamida requested review from jdbaldry and a team as code owners April 29, 2024 08:39
Copy link
Member

@jdbaldry jdbaldry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems more like this is an important configuration step rather than a migration task. Perhaps the title should be "Configure a unique cluster label"?

This looks like a task topic, there's some useful information for how to structure a task topic in https://grafana.com/docs/writers-toolkit/structure/topic-types/task/.

I haven't reviewed the prose of the PR yet, I think we should consider the name of the file and get the structure more aligned with the guidance in https://grafana.com/docs/writers-toolkit/structure/topic-types/task/ and then I can provide a style review for the wording.

@@ -0,0 +1,96 @@
---
description: "Migrating Mimir Helm Installation to Use Memberlist Cluster Label"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the description should provide more information on what the reader can expect to get from this page.
The description is presented to the user in search engines and on social media when links to this page are used.

A good example is https://github.com/grafana/mimir/blob/main/docs/sources/mimir/set-up/migrate/migrate-from-cortex.md?plain=1#L4-L7

https://grafana.com/docs/writers-toolkit/write/front-matter/#description-required

Comment on lines 3 to 4
title: "Migrating Mimir Helm Installation to Use Memberlist Cluster Label"
menuTitle: "Migrating Mimir Helm Installation to Use Memberlist Cluster Label"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Titles and headings should use sentence case rather than title case.
We should also avoid gerunds ("ing" endings in verbs).
Titles should also use the full product name "Grafana Mimir" for search engine optimization.

I have corrected this as an example, but please see my general review comment because we may want to rename this page.

Suggested change
title: "Migrating Mimir Helm Installation to Use Memberlist Cluster Label"
menuTitle: "Migrating Mimir Helm Installation to Use Memberlist Cluster Label"
title: "Migrate Grafana Mimir Helm chart to use memberlist cluster label"

Also the menuTitle can be omitted if it is the same as the title.
In this case, I think we need a much shorter title that omits the product name.
Considering a page rename may make this menuTitle much shorter, perhaps Configure a memberlist cluster label?

Comment on lines 8 to 10
# Migrating Mimir Helm Installation to Use Memberlist Cluster Label

## Introduction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we don't want stacked headings (headings with no content between them).

The introductory content can be moved out of the H2 and become the conceptual material at the start of this page.

Comment on lines 93 to 96
{{% docs/reference %}}
[memberlist]: "/ -> /docs/mimir/<MIMIR_DOCS_VERSION>/references/architecture/memberlist-and-the-gossip-protocol"
[hash ring]: "/ -> /docs/mimir/<MIMIR_DOCS_VERSION>/references/architecture/hash-ring"
{{% /docs/reference %}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any need to use docs/reference here because the links are the same and this content isn't reused via Hugo mounts as far as I know. You only need to use that shortcode for content that is used in multiple website projects like being both in Grafana and Grafana Cloud documentation.

Instead, you can just use URLs with <MIMIR_DOCS_VERSION> for version substitution.

Comment on lines 12 to 24
Grafana Mimir uses [memberlist] to share works and deciding which component replica to send the workload such as when ingesting series. Memberlist encodes the replica information in data structure called [hash ring], this is a consistent hashing scheme in which different ingesters instance tokens are placed around the ring. The token position in the ring determines which ingester should handle a request. The information of this hash ring is delivered to different components by using gossip protocol.

By default, hash ring is global. If we also have Grafana Tempo and Grafana Loki in the same Kubernetes cluster, their components can unintentionally talk and sending data with each other, because Loki and Tempo also uses memberlist. This possibility to happen increases in cluster where the pods may reuse each other's IPs after churning.

In this document we will describe the step on how to migrate a Mimir installation to prevent two separate memberlist gossip hash ring to join into one.

There are three steps of the migration which we will describe in details in the migration section. But in brief summary the steps are:

1. Disable memberlist cluster label verification
1. Set cluster label to all Mimir components
1. Enable memberlist cluster label verification again

The migration should take around 30 minutes. The risk of not doing this migration if you have Mimir and Tempo or Loki in the same cluster is the possibility of those different backend to talk with each other.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second sentence on L24 is the most important bit of this introduction. A reader wants to know why they should perform this task and they may not really care about the conceptual information on L12.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed many repetition and make a more concise introduction in the beginning of the section.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall I think this article conflates memberlist (which is a cluster membership protocol) with our key-value implementation on top of memberlist. It's not a huge deal for the purpose of the article, but it would be nice to have it correct

lamida added 10 commits May 6, 2024 16:19
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
weight: 110
---

# Configure a unique Grafana Mimir's Memberlist cluster label in helm installation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [doc-validator] reported by reviewdog 🐶
The first heading 'Configure a unique Grafana Mimir's Memberlist cluster label in helm installation' doesn't match the title 'Configure a unique Grafana Mimir's Memberlist cluster label in the mimir-distributed Helm chart installation'.
Decide which of the first heading or title is most applicable and change the other to match, or update them both.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
@lamida
Copy link
Contributor Author

lamida commented May 6, 2024

It seems more like this is an important configuration step rather than a migration task. Perhaps the title should be "Configure a unique cluster label"?

This looks like a task topic, there's some useful information for how to structure a task topic in https://grafana.com/docs/writers-toolkit/structure/topic-types/task/.

I haven't reviewed the prose of the PR yet, I think we should consider the name of the file and get the structure more aligned with the guidance in https://grafana.com/docs/writers-toolkit/structure/topic-types/task/ and then I can provide a style review for the wording.

I was not fully aware on the writers-toolkit before this. It's really useful and I will refer to it next time working on other documentation.

I have updated the structure slightly and hopefully matched the task topic structure. But please advise again if this is still not enough.

lamida added 2 commits May 6, 2024 18:55
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
@lamida
Copy link
Contributor Author

lamida commented May 6, 2024

overall I think this article conflates memberlist (which is a cluster membership protocol) with our key-value implementation on top of memberlist. It's not a huge deal for the purpose of the article, but it would be nice to have it correct

I think it is important to make it correct. Thanks for highlighting. I rewrote and shortened most of the section, hopefully, now the information is more accurate.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
@lamida
Copy link
Contributor Author

lamida commented May 6, 2024

I haven't tested step by step of configuration update. I will do it tomorrow and adjust the step accordingly if find out something is not accurate after the test.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving mostly wording suggestions

Multiple [Memberlist](https://grafana.com/docs/mimir/<MIMIR_VERSION>/references/architecture/memberlist-and-the-gossip-protocol/) [gossip ring](https://grafana.com/docs/mimir/<MIMIR_VERSION>/references/architecture/hash-ring/) cluster is at risk of merging into one without enabling cluster label verification.
For example, if a Mimir, Tempo or Loki are running in the same Kubernetes cluster, they might communicate with each other without this configuration update.
Once cluster label verification is enabled, before Mimir components communicate with other components, it will verify whether those other components are having same cluster label.
The process to update the configuration should take around 30 minutes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be more than 30 minutes for a large cluster - 3 rollouts 1hr each can span 3hrs. Should we just say that it will take 3 rollouts of the whole cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this make more sense. Even 30 minutes that I put there was also quite abrupt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to apply this change to the doc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange. I remember that I changed this as what was suggested. Maybe it got stashed in between of other commits. I confirm the commit this time in cf11b13.

lamida and others added 12 commits May 8, 2024 09:34
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work and thanks for addressing my comments!

@jdbaldry do you want to take another look before we merge this? Or should we ask the tech writer starting soon to take a look?

lamida and others added 2 commits May 9, 2024 10:03
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@lamida
Copy link
Contributor Author

lamida commented May 9, 2024

I did one round of testing running the steps from the documentation. I also ran the test when we updated the cluster label directly without following the three steps in the doc.

Updating the cluster label directly will create a temporary Memberlist partition where old pods will run in the old Memberlist cluster and can't join the new Memberlist cluster with the new cluster label. Eventually, the old Memberlist cluster will be gone and all components will join the new Memberlist cluster. Nevertheless, I guess the goal of following the steps in the doc is to avoid that inconsistency when temporarily different mimir components are running in the different Memberlist clusters.

lamida and others added 2 commits May 9, 2024 11:16
…grate-enable-cluster-label-verification/_index.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
@jdbaldry
Copy link
Member

Thanks for the revision here. I think it looks a lot better with the changes you have made and we can definitely get this in.

After the new writer has ramped up, they can come back to this and do a closer pass for style conformity :)

@dimitarvdimitrov
Copy link
Contributor

sounds good; @lamida should we merge this in then?

@lamida
Copy link
Contributor Author

lamida commented May 23, 2024

@dimitarvdimitrov actually I am under impression that @jdbaldry wanted me to wait for the new writer to review this first before merging. 🤔

@jdbaldry
Copy link
Member

Sorry for the confusion. I meant that you can merge this as it is and the new writer can revise this later as they become more acquainted with the project

@dimitarvdimitrov dimitarvdimitrov merged commit 80b9794 into main May 24, 2024
29 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the lamida/helm-enable-cluster-label branch May 24, 2024 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Docs: Enable cluster label verification in Helm
3 participants