
Allow disabling of PersistentVolumeUsageNearFull/PersistentVolumeUsageCritical alerts on workloads that are expected to be fully utilized #9568

Closed
akalenyu opened this issue Jan 12, 2022 · 10 comments

Comments

@akalenyu

akalenyu commented Jan 12, 2022

Is this a bug report or feature request?

  • Feature Request

What should the feature do:
Allow components to add a label to a PVC that prevents PersistentVolumeUsageNearFull/PersistentVolumeUsageCritical alerts from firing.
(Similar to openshift/cluster-monitoring-operator#1493, can/should use the same key/value pair?)
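For illustration, a PVC created by such a component might carry an opt-out label along these lines. The label key below is made up for this example; the actual key/value pair would need to match whatever the alert rules end up recognizing:

```yaml
# Sketch only: "monitoring.example.com/exclude-usage-alerts" is a hypothetical
# label key, not an existing convention in Rook or OCS.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: imported-disk-image
  labels:
    monitoring.example.com/exclude-usage-alerts: "true"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```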

What is use case behind this feature:
Some workloads (kubevirt/CDI) request a PV that is, by default, exactly the size of the file (disk image) it holds. This causes the alerts to fire, when in reality the size of the data will never grow and the alert is not actionable.

Environment:

Clusters running kubevirt, but other use cases where a PVC is full by design may exist.

@akalenyu akalenyu changed the title Allow disabling of PersistentVolumeUsageNearFull/PersistentVolumeUsageCritical on workloads that are expected to be fully utilized Allow disabling of PersistentVolumeUsageNearFull/PersistentVolumeUsageCritical alerts on workloads that are expected to be fully utilized Jan 12, 2022
@parth-gr
Member

I didn't understand the point. If we need to grow the cluster in the future, we would add more disk PVCs, which would allow the data to balance out and the alert would go away.
Or are you saying you won't need to add any more data?

@travisn
Member

travisn commented Mar 10, 2022

Work is in progress in #9837 that will allow the Prometheus rules to be customized with a Helm post processor. @akalenyu Please take a look to confirm this will be covered.

@akalenyu
Author

I didn't understand the point. If we need to grow the cluster in the future, we would add more disk PVCs, which would allow the data to balance out and the alert would go away. Or are you saying you won't need to add any more data?

The idea is that we give an escape hatch for workloads that are expected to take up the entire PVC by design,
so a critical alert won't pop up for them.

We would then label our PVCs in https://github.com/kubevirt/containerized-data-importer to exclude them from triggering the alert.
Is something similar to openshift/cluster-monitoring-operator#1493 not a reasonable way to go about this?

@BlaineEXE
Member

BlaineEXE commented Mar 15, 2022

@akalenyu did you miss Travis's comment here? #9568 (comment)

I believe this may alleviate the issue. (You could edit the rules to have them ignore PVCs containing a label of your choosing.)
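As a rough sketch of what such an edit could look like once the rules are user-managed (per #9837), the alert expression could be joined against kube-state-metrics' PVC label metric so that labeled PVCs are skipped. The expression, threshold, and label key below are illustrative, not the actual Rook rule, and kube-state-metrics would need to be configured to export the PVC label (e.g. via --metric-labels-allowlist) for the join to work:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-ceph-rules-override
  namespace: rook-ceph
spec:
  groups:
    - name: persistent-volume-alert.rules
      rules:
        - alert: PersistentVolumeUsageNearFull
          expr: |
            # Fire when a PVC is more than 75% used...
            (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.75
            # ...unless the PVC carries the (hypothetical) opt-out label.
            unless on (namespace, persistentvolumeclaim)
              kube_persistentvolumeclaim_labels{label_monitoring_example_com_exclude_usage_alerts="true"}
          for: 5m
          labels:
            severity: warning
          annotations:
            message: PVC {{ $labels.persistentvolumeclaim }} is nearing full. Data deletion or PVC expansion is required.
```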

@akalenyu
Author

akalenyu commented Mar 16, 2022

@akalenyu did you miss Travis's comment here? #9568 (comment)

I believe this may alleviate the issue. (You could edit the rules to have them ignore PVCs containing a label of your choosing.)

Sorry, I should have been clearer about this: we can't really edit the Ceph rules from our project (containerized-data-importer). We're looking to handle this just by labeling the objects we manage (PVCs). That is why the OpenShift monitoring approach worked for us.

@BlaineEXE
Member

we can't really edit the Ceph rules from our project

Do I take this to mean that you are not the admin of your Kubernetes cluster?

How is Rook being installed in your clusters? With Travis's PR #9837, Rook will no longer deploy the Ceph Prometheus rules itself; after that, users will have to deploy the rules manually or via Helm.

@akalenyu
Author

akalenyu commented Mar 17, 2022

we can't really edit the Ceph rules from our project

Do I take this to mean that you are not the admin of your Kubernetes cluster?

How is Rook being installed in your clusters? With Travis's PR #9837, Rook will no longer deploy the Ceph Prometheus rules itself; after that, users will have to deploy the rules manually or via Helm.

I am not an admin of a particular cluster, no.
I am working on a project that is basically a Kubernetes controller offering an abstraction over PVCs, so the only way for us to silence this programmatically for our PVCs (which are expected to take up all the space) is by labeling them up front.

We don't install Rook ourselves as part of the project; I noticed this alert on one of the clusters I was debugging (which had OCS and our project installed on it).

#9837 might solve the issue, thank you - but I have a feeling that at some point OCS will decide to deploy this alerting rule automatically, bringing us back to the alert firing even though we expect the workloads to be nearly 100% utilized.

@travisn
Member

travisn commented Mar 17, 2022

OpenShift has a proposal for alert customization that would benefit OCS. Until then, if you don't have control of the PrometheusRule CRs created by OCS/Rook, I'm not sure how you can suppress these alerts.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
