Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEP-19] Migrate cache Prometheus deployment and configuration #9128

Merged
merged 23 commits into from
Feb 9, 2024

Conversation

rfranzke
Copy link
Member

@rfranzke rfranzke commented Feb 6, 2024

How to categorize this PR?

/area dev-productivity monitoring
/kind enhancement

What this PR does / why we need it:
With this PR, prometheus-operator (introduced in #9067) takes over management of the cache Prometheus deployment and its configuration. In a nutshell:

  • Prometheus custom resource is created (which will result in a StatefulSet)
  • standard scrape config for services in the seed cluster is provided via ServiceMonitor custom resources
  • special scrape config (for kubelet/cadvisor) is still provided in "raw format" via the additional scrape config Secret
  • the PV of an existing cache Prometheus instance will be reused

Which issue(s) this PR fixes:
Part of #9065

Special notes for your reviewer:
/cc @oliver-goetz @ScheererJ
FYI @istvanballok @rickardsjp

Release note:

It is now possible to provide configuration for the cache Prometheus running in seed clusters' `garden` namespaces. Read all about it [here](https://github.com/gardener/gardener/tree/master/docs/extensions/logging-and-monitoring.md#cache-prometheus).

@gardener-prow gardener-prow bot added area/dev-productivity Developer productivity related (how to improve development) area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 6, 2024
@ScheererJ
Copy link
Contributor

/assign

Copy link
Contributor

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this greatly structured pull request. I have a few minor questions/comments.

Do you also want to consider adapting

to the new reality or do you prefer to remove the file all together once all prometheus instances are migrated to the prometheus-operator?

@rfranzke
Copy link
Member Author

rfranzke commented Feb 7, 2024

Do you also want to consider adapting

to the new reality or do you prefer to remove the file all together once all prometheus instances are migrated to the prometheus-operator?

I cannot do it yet because the seed-prometheus deployment also refers to these values. I'll do it as soon as this instance has been migrated as well :)

@ScheererJ
Copy link
Contributor

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2024
Copy link
Contributor

gardener-prow bot commented Feb 7, 2024

LGTM label has been added.

Git tree hash: c3238c36df4903574013b01c18e86440f94714f9

@rfranzke
Copy link
Member Author

rfranzke commented Feb 8, 2024

/assign @oliver-goetz

@oliver-goetz
Copy link
Member

/retest

Copy link
Member

@oliver-goetz oliver-goetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks 🚀
I have some small remarks only.

// A pod selector to select the node-exporter pods in the kube-system namespace does not work here
// because the node-exporter uses the host network. Network policies are currently not supported with
// pods in the host network.
To: []networkingv1.NetworkPolicyPeer{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it make sense to restrict the traffic to the nodes network IP range of the seed then?

Copy link
Member Author

@rfranzke rfranzke Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but I'd like to keep such improvements for a dedicated PR - for now, I only focus on the "translation"/refactoring to the custom resources. I've added a TODO statement to potentially reconsider this.

pkg/gardenlet/controller/seed/seed/reconciler_reconcile.go Outdated Show resolved Hide resolved
pkg/component/monitoring/prometheus/migration.go Outdated Show resolved Hide resolved
pkg/component/monitoring/prometheus/migration.go Outdated Show resolved Hide resolved
pkg/component/monitoring/prometheus/migration.go Outdated Show resolved Hide resolved
… component

It is shared amongst all Prometheis, so let's manage it centrally.
`prometheus-operator` does not manage the `ServiceAccount` for the created `StatefulSet`
`prometheus-operator` does not manage the `Service` for the created `StatefulSet`
This results in the `StatefulSet` after reconciliation of `prometheus-operator`.
We decided to no longer deploy an `Hvpa` resource since
- the HVPA controller is getting removed eventually
- it's considered no longer necessary to restrict the downscaling
These rules cannot be directly associated with a component package in `pkg/component`, hence they are considered "general rules" for the cache prometheus.
Some scrape configs might not be translatable into the CRDs (or its beyond the scope of this PR), see next commits, so we need to keep transmitting them in "raw" format. This is possible via the "additional scrape config" configuration option in the `Prometheus` CRD.
This way, we don't need to copy the data from one disk to another but can simply take over the existing PV (claimed by a new PVC).

I consciously decided to not invest into unit tests for the new migration.go file since
- they wouldn't yield much benefit to increase confidence in a working migration path (if we wanted to have this, we would need to invest into proper e2e tests which is even more costly and complex)
- this is temporary code that is about to get deleted soon again
- even if there's a bug that we wouldn't have caught during development/testing phase, loosing the Prometheus data is not too bad
@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2024
@oliver-goetz
Copy link
Member

/lgtm
/approve

/retest

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2024
Copy link
Contributor

gardener-prow bot commented Feb 9, 2024

LGTM label has been added.

Git tree hash: 9bab393b945ebb5bc69376341569f07cb1d32c26

Copy link
Contributor

gardener-prow bot commented Feb 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: oliver-goetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/dev-productivity Developer productivity related (how to improve development) area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants