
Ceph prometheus rules created by OCS operator instead of Rook #1615

Merged

Conversation

travisn (Contributor) commented Apr 4, 2022

Rook has stopped creating the Prometheus rules via the CephCluster monitoring.enabled setting. The rules must now be created separately from the cluster CR, as described in the Rook PR rook/rook#9837. The rules are fully owned downstream by the OCS operator now, since upstream they are only installed by the Helm chart. This also gives the downstream full flexibility to update the rules only when QE determines we are ready to test all the new rules.

At the same time, vendor for golang 1.17 to pick up the prometheus packages. The Prometheus API would not resolve unless go 1.17 compatibility was required: Go 1.16 is confused by the Prometheus API moving between different packages from v0.43 to v0.46. The packages are now updated for go 1.17 compatibility.

Resolves https://bugzilla.redhat.com/show_bug.cgi?id=2066514
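
A minimal sketch of the kind of go.mod change described above. The module path for the prometheus-operator monitoring API is real, but the exact versions shown here are illustrative and not taken from this PR:

```go
// go.mod (sketch): bumping the go directive to 1.17 so the module
// graph resolves the relocated prometheus-operator API packages.
module github.com/red-hat-storage/ocs-operator

go 1.17

require (
	// The monitoring types (PrometheusRule etc.) moved between
	// packages in the v0.43 -> v0.46 range; version is illustrative.
	github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.46.0
)
```

With a `go 1.16` directive, the module graph is pruned differently and the relocated packages fail to resolve, which is why the directive bump accompanies the vendoring change.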

travisn changed the title to "Bug 2066514: Ceph prometheus rules created by OCS operator instead of Rook" on Apr 4, 2022
openshift-ci bot added the bugzilla/severity-unspecified and bugzilla/valid-bug labels on Apr 4, 2022
openshift-ci bot (Contributor) commented Apr 4, 2022

@travisn: This pull request references Bugzilla bug 2066514, which is valid.

No validations were run on this bug

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (mbukatov@redhat.com), skipping review request.

In response to this:

> Bug 2066514: Ceph prometheus rules created by OCS operator instead of Rook

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

travisn (Contributor, Author) commented Apr 4, 2022

@umangachapagain More testing is still needed, but please do let me know if there is any initial feedback, thanks.

Review thread on Makefile (resolved).
agarwal-mudit changed the title to "Ceph prometheus rules created by OCS operator instead of Rook" on Apr 5, 2022
openshift-ci bot removed the bugzilla/severity-unspecified label on Apr 5, 2022
openshift-ci bot (Contributor) commented Apr 5, 2022

@travisn: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

> Ceph prometheus rules created by OCS operator instead of Rook


openshift-ci bot removed the bugzilla/valid-bug label on Apr 5, 2022
agarwal-mudit (Member) commented:
No need to specify the bug in the title for main branch PRs.

Bug 2066514 can be mentioned in the comments if required.

Review thread on go.mod (outdated, resolved).
travisn force-pushed the rook-prometheus-rules branch 2 times, most recently from bcfef64 to 9ba2d2e, on April 5, 2022 at 18:32
travisn (Contributor, Author) commented Apr 5, 2022

/retest

travisn (Contributor, Author) commented Apr 5, 2022

@umangachapagain Do you see an error in the CI? I'm not seeing the failure and it doesn't seem to be intermittent since rerunning gave the same results.

// state.
func createPrometheusRules(r *StorageClusterReconciler, sc *ocsv1.StorageCluster, cluster *rookCephv1.CephCluster) error {
	if !cluster.Spec.Monitoring.Enabled {
		r.Log.Info("prometheus rules skipped", "CephCluster")

Reviewer (Contributor) suggested change:

	r.Log.Info("prometheus rules skipped", "CephCluster")
	r.Log.Info("prometheus rules skipped", "CephCluster", klog.KRef(cluster.Namespace, cluster.Name))

	}
	prometheusRule, err := parsePrometheusRule(rules)
	if err != nil {
		r.Log.Error(err, "Unable to retrieve prometheus rules.", "CephCluster")

Reviewer (Contributor) suggested change:

	r.Log.Error(err, "Unable to retrieve prometheus rules.", "CephCluster")
	r.Log.Error(err, "Unable to retrieve prometheus rules.", "CephCluster", klog.KRef(cluster.Namespace, cluster.Name))

	applyLabels(getCephClusterMonitoringLabels(*sc), &prometheusRule.ObjectMeta)

	if err := createOrUpdatePrometheusRule(r, sc, prometheusRule); err != nil {
		r.Log.Error(err, "Prometheus rules could not be created.", "CephCluster")

Reviewer (Contributor) suggested change:

	r.Log.Error(err, "Prometheus rules could not be created.", "CephCluster")
	r.Log.Error(err, "Prometheus rules could not be created.", "CephCluster", klog.KRef(cluster.Namespace, cluster.Name))

		return err
	}

	r.Log.Info("prometheus rules deployed", "CephCluster")

Reviewer (Contributor) suggested change:

	r.Log.Info("prometheus rules deployed", "CephCluster")
	r.Log.Info("prometheus rules deployed", "CephCluster", klog.KRef(cluster.Namespace, cluster.Name))

	return nil
}

// OverwriteApplyToObjectMeta adds labels to object meta, overwriting keys that are already defined.

Reviewer (Contributor) suggested change:

	// OverwriteApplyToObjectMeta adds labels to object meta, overwriting keys that are already defined.
	// applyLabels adds labels to object meta, overwriting keys that are already defined.

	}
}
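
The renamed helper merges labels with overwrite semantics. A standalone sketch of that behavior on a plain map — the real function takes a *metav1.ObjectMeta, so this simplified signature is hypothetical:

```go
package main

import "fmt"

// applyLabels merges src into dst, overwriting keys that are already
// defined, matching the behavior described in the doc comment above.
// Operating on a plain map is a simplification of the real helper,
// which mutates the Labels field of an ObjectMeta.
func applyLabels(src, dst map[string]string) map[string]string {
	if dst == nil {
		dst = map[string]string{}
	}
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

func main() {
	existing := map[string]string{"app": "old", "keep": "yes"}
	merged := applyLabels(map[string]string{"app": "ocs", "extra": "new"}, existing)
	fmt.Println(merged["app"], merged["keep"], merged["extra"]) // ocs yes new
}
```

The overwrite direction matters here: the monitoring labels from getCephClusterMonitoringLabels win over any labels already on the PrometheusRule, keeping ownership labels consistent across updates.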

// GetPrometheusRule returns provided prometheus rules or an error

Reviewer (Contributor) suggested change:

	// GetPrometheusRule returns provided prometheus rules or an error
	// parsePrometheusRule returns provided prometheus rules or an error
umangachapagain (Contributor) commented:

> @umangachapagain Do you see an error in the CI? I'm not seeing the failure and it doesn't seem to be intermittent since rerunning gave the same results.

All required tests have passed. Remaining CI error is due to an issue with test scripts, not related to this PR.

Rook has stopped creating the prometheus rules with the cephcluster
monitoring.enabled setting. Now the rules must be created separately
from the cluster CR as described in the rook PR
rook/rook#9837. The rules are fully owned
downstream by the ocs operator now since upstream they are only
installed by the helm chart. This also gives full flexibility downstream
to update the rules only when QE determines we are ready for testing
all the new rules.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
The prometheus api was not resolving unless requiring go 1.17
compatibility. Go 1.16 is confused by the moving around of the
prometheus api between different packages from v0.43 to v0.46.
Now the packages are updated for go 1.17 compatibility.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
The latest rook release is based on v1.9 instead of v1.8.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
openshift-ci bot (Contributor) commented Apr 6, 2022

@travisn: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/red-hat-storage-ocs-ci-e2e-aws
Commit: 5668229
Required: false
Rerun command: /test red-hat-storage-ocs-ci-e2e-aws



jarrpa (Member) left a review:
/lgtm

openshift-ci bot added the lgtm label on Apr 12, 2022
openshift-ci bot (Contributor) commented Apr 12, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jarrpa, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the relevant OWNERS files. Approvers can indicate their approval by writing /approve in a comment and cancel it by writing /approve cancel.

openshift-ci bot added the approved label on Apr 12, 2022
@openshift-merge-robot openshift-merge-robot merged commit 1a5f34d into red-hat-storage:main Apr 12, 2022
@travisn travisn deleted the rook-prometheus-rules branch April 12, 2022 20:45