Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG-5426: Fix stale metrics in telemetry #2443

Merged

Conversation

xperimental
Copy link
Contributor

@xperimental xperimental commented Apr 22, 2024

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
    • The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
    • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 22, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 22, 2024
@openshift-ci openshift-ci bot requested a review from cahartma April 22, 2024 18:44
@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 22, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 23, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 23, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 23, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@xperimental
Copy link
Contributor Author

Took another stab at some of the questions today and updated the PR code and description to reflect that.

@xperimental
Copy link
Contributor Author

/retest-required

1 similar comment
@xperimental
Copy link
Contributor Author

/retest-required

@xperimental
Copy link
Contributor Author

Had a short chat about this topic with @cahartma . There are still some open points which I need to address, so this PR will change again on Monday.

@jcantrill
Copy link
Contributor

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 26, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 29, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 29, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 29, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Apr 29, 2024

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

In response to this:

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are done to the metrics themselves (except for the help text) to avoid incompatibilities with previous telemetry queries.

/cc @cahartma
/assign @alanconway

Known issues / ToDo

  • One change to the telemetry is that Loki's "default outputs" now also count into the "default output" label of the log_forwarder_output_info metric.
  • What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when there's at least one ClusterLogging / ClusterLogForwarder instance available.
  • The implementation now checks whether the different resources are available (even if for CLF and LFME none of the metrics are based on data in the resource)
  • My favorite implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  • I now fixed this for the ClusterLogging metric, ClusterLogForwarder is still based on the old model
  • It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, it might also be an option to remove it.
  • Once everything is cleared up, this needs to be squashed 馃檪

Links

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@xperimental
Copy link
Contributor Author

Ready for review again, now already squashed as discussed 馃檪

@xperimental
Copy link
Contributor Author

It looks to me as if builds fail because of the recent change to Go 1.21 . There's already a PR to fix that issue, so I'll probably need a rebase after that one merges. 馃

@jcantrill
Copy link
Contributor

/approve

"github.com/openshift/cluster-logging-operator/internal/constants"
)

func UpdateInfofromCLF(forwarder logging.ClusterLogForwarder) {
func (t *telemetryCollector) updateDefaultInfo(forwarder *loggingv1.ClusterLogForwarder) {
if forwarder.Namespace != constants.OpenshiftNS || forwarder.Name != constants.SingletonName {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe its a step after this but seems like we would be interested in telemetry for all CLFs not just the legacy instance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code provides metrics for all the CLF instances.

This function is a separate codepath for the "default/legacy instance", because it might not even exist (in the case when a user has ClusterLogging without any customization in ClusterLogForwarder). So this function collects the final state of the "legacy instance" to expose it in metrics if the ClusterLogging instance exists while for all other CLF instances the information is directly gathered from the resource (code here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again, I might have forgotten to describe this in more detail in a comment. I can add that to this function, if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be reasonable to clarify though I'm the only one asking so maybe limited value. This is not a blocking issue for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an explanation comment to the function. 鉁旓笍

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2024
Copy link
Contributor

@alanconway alanconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2024
Copy link
Contributor

openshift-ci bot commented May 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, jcantrill, xperimental

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alanconway,jcantrill,xperimental]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xperimental
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2024
Copy link
Contributor

openshift-ci bot commented May 14, 2024

@xperimental: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 0eccb99 into openshift:release-5.9 May 14, 2024
7 checks passed
@xperimental xperimental deleted the refactor-telemetry-5.9 branch May 16, 2024 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. release/5.9
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants