Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Bump Prometheus-operator and node_exporter to latest versions #1661

Closed
wants to merge 2 commits into from

Conversation

lilic
Copy link
Member

@lilic lilic commented Jan 18, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:
While writing kube-state-metrics perf tests, I noticed that the version of Prometheus operator and node_exporter are out of date, so I decided to contribute. Let me know if I overstepped the line, happy to close the PR!

I tried to keep the overriding changes as close to as they were, let me know if I missed something important?

I also added node_exporter DaemonSet manifest instead of the pod one, but happy to revert the changes if you had a reason for one pod!

And switched replicas of Prometheus to 2, but since I do not know the full purpose of this Prometheus instance, maybe having 1 replica is enough in your case? (e.g. the trafeoffs of consuming less memory in a scale and perf test vs availability of Prometheus monitoring might be indeed a normal thing)

Special notes for your reviewer:
(Note I am the maintainer of both Prometheus operator and kube-prometheus which the original manifests are based on)

Does this PR introduce a user-facing change?:


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 18, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lilic
To complete the pull request process, please assign jkaniuk after the PR has been reviewed.
You can assign the PR to them by writing /assign @jkaniuk in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

@lilic: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-perf-tests-clusterloader2 64c57c8 link /test pull-perf-tests-clusterloader2
pull-perf-tests-clusterloader2-kubemark 64c57c8 link /test pull-perf-tests-clusterloader2-kubemark

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mm4tt
Copy link
Contributor

mm4tt commented Jan 19, 2021

The PR is marked as WIP, let me know when it's ready for review.

@lilic
Copy link
Member Author

lilic commented Jan 19, 2021

@mm4tt Thanks for reply! Just wanted to first get initial feedback if this is something you all want? If yes, can you maybe addressed the comments in the description. After that I can address them and unmark it as WIP. Hope that makes sense? :) Thank you!

- --kubelet-service=kube-system/kubelet
{{end}}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like some templating was removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added back in, indeed, the variable is declared at the top 👍

@mm4tt
Copy link
Contributor

mm4tt commented Jan 19, 2021

I took a cursory glance. A few notes:

  • node-exporter has been updated to the newest version in Update node-exporter to 1.0.1 #1665
  • node-exporter needs to be kept as a pod - it's installed as a static pod on master nodes
  • prometheus doesn't need two replicas in our setup, we've never faced any loss of prometheus availability in our scale tests. On the other hand, having two instances comes with a cost, more resources needed, more api-calls, we'd have to change some of our custom logic (e.g. we have a custom logic for dumping prometheus databases after a test), etc. TL;DR, the gains of having 2+ replicas don't justify the costs here.

It'd be great if you can update the prometheus operator and dependencies to the newest version. It's something that we'll have to do eventually anyway. But, could you keep the change as minimalistic as possible? I see that you changed/added a lot of things, e.g. a lot of new grafana dashboards, that aren't really needed. Also, please make sure to honor our custom overrides and settings (e.g. the thing Pawel caught).

Thanks!

verbs:
- get
- list
- watch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we can remove this?
If I am not mistaken when we are using service/pod monitors prometheus is doing list and watch for these two objects, right?

{{if $PROMETHEUS_SCRAPE_WINDOWS_NODES}}
additionalScrapeConfigs:
name: windows-scrape-configs
key: windows-scrape-configs.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove whitespaces

{{end}}
image: gcr.io/k8s-testimages/quay.io/coreos/prometheus-operator:v0.31.0
- --prometheus-config-reloader=gcr.io/k8s-testimages/quay.io/prometheus-operator/prometheus-config-reloader:v0.45.0
image: gcr.io/k8s-testimages/quay.io/prometheus-operator/prometheus-operator:v0.45.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR these images do not exist right now, we will need to push them manually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i see. Can you advise how we can get this image updated?

- --secure-listen-address=:8443
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- --upstream=http://127.0.0.1:8080/
image: quay.io/brancz/kube-rbac-proxy:v0.8.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what this container is doing, but either way if we want to use it then we will need to mirror it to gcr (in the past we had issues, where our tests were failing, because quay.io was down.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: There are also talks in kubebuilder to make automatic cloning of kube-rbac-proxy. Plus we (kube-rbac-proxy maintainers) want to donate that to k8s in the future when we find time to actually do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems 0.8.0 is available as per kubernetes-sigs/kubebuilder#1955, updating this PR accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marseel: I am new to this ecosystem, but can gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0 be referenced her as well or can you advise how I can help in having a gcr.io/k8s-testimages/quay.io/brancz/kube-rbac-proxy:v0.8.0?

tldr: kube-rbac-proxy is protects endpoints by executing a token and subject access review against k8s api to validate if service account token in flight has permissions to query metrics.

@marseel
Copy link
Member

marseel commented Jan 19, 2021

I took quick look. Definitely it's something that we want ;)
I also agree with @mm4tt that it's better not to increase number of replicas for now. It would be safer and easier to review if we would split it into smaller PRs (as @mm4tt mentioned)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2021
@k8s-ci-robot
Copy link
Contributor

@lilic: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lilic
Copy link
Member Author

lilic commented Jan 26, 2021

Thank you all!! Just heads up I did not forget about this :)

@s-urbaniak
Copy link
Contributor

side note: to spread up the work load, I am taking this over from @lilic :-)

@s-urbaniak
Copy link
Contributor

@mm4tt @marseel let me actually submit separate PRs for the suggested changes so it is easier to track 👍 side note: i am also prometheus-operator and kube-prometheus maintainer, hence it might look like big changes here but they were more or less taken 1:1 from kube-prometheus upstream which evolved gradually.

re: grafana dashboards: do you use explicit custom grafana dashboards? Otherwise we could indeed vendor them 1:1 to catch up with upstream changes from kube-prometheus.

@mm4tt
Copy link
Contributor

mm4tt commented Feb 2, 2021

re: grafana dashboards: do you use explicit custom grafana dashboards?

Yes, we have our own grafana dashboards defined in https://github.com/kubernetes/perf-tests/tree/97fedf5c484e6438c2c13854bb901241493377ba/clusterloader2/pkg/prometheus/manifests/dashboards

@lilic
Copy link
Member Author

lilic commented Mar 29, 2021

Thanks Serg 🎉

@lilic lilic closed this Mar 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants