LOG-5426: Fix stale metrics in telemetry #2443

xperimental · 2024-04-22T18:44:51Z

Description

This PR aims to fix the issue of stale telemetry metrics present in the operator. It does this by creating a custom collector instead of relying on the standard instrumentation primitives (like GaugeVec), which keep the previous state of the metrics.

No changes are made to the metrics themselves (except for the help text), to avoid breaking existing telemetry queries.
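To illustrate why a stateful primitive can leave stale series behind, here is a minimal, stdlib-only sketch. It does not use the Prometheus client library, and every name in it (`statefulGauges`, `scrapeCollector`, `Sample`) is hypothetical, not taken from this PR. It only models the difference between a store that remembers every label it was given and a collector that rebuilds its samples on each scrape.

```go
package main

import "sort"

// Sample is one exported series: a label value plus a gauge value.
type Sample struct {
	Label string
	Value float64
}

// statefulGauges models a GaugeVec-like store: a label stays exported
// until it is explicitly deleted, even if its source no longer exists.
type statefulGauges struct {
	values map[string]float64
}

func newStatefulGauges() *statefulGauges {
	return &statefulGauges{values: map[string]float64{}}
}

// Set records a value for a label and keeps it for all later scrapes.
func (g *statefulGauges) Set(label string, v float64) { g.values[label] = v }

// Collect returns everything that was ever set and not removed.
func (g *statefulGauges) Collect() []Sample { return sortedSamples(g.values) }

// scrapeCollector models a custom collector: it stores no samples and
// derives them from the current state every time it is scraped.
type scrapeCollector struct {
	current func() map[string]float64
}

// Collect asks for the current state and exports only what is in it.
func (c scrapeCollector) Collect() []Sample { return sortedSamples(c.current()) }

// sortedSamples converts a map into samples ordered by label.
func sortedSamples(m map[string]float64) []Sample {
	out := make([]Sample, 0, len(m))
	for l, v := range m {
		out = append(out, Sample{Label: l, Value: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Label < out[j].Label })
	return out
}
```

In this model, removing an entry from the underlying state changes what `scrapeCollector` reports on the next scrape, while `statefulGauges` keeps reporting the old label until someone deletes it by hand.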

/cc @cahartma
/assign @alanconway

Known issues / ToDo

- One change to the telemetry is that Loki's "default outputs" now also count toward the "default output" label of the log_forwarder_output_info metric.
- What this PR does not fix (yet) is that the metrics do not disappear when the ClusterLogging instance is removed. One way to solve this would be to only emit metrics when at least one ClusterLogging / ClusterLogForwarder instance is available.
  - The implementation now checks whether the different resources are available (even though, for CLF and LFME, none of the metrics are based on data in the resource).
- My preferred implementation would be to have the telemetry collect all the metrics directly from the resources (for an example, see LokiStack metrics), because the List call is essentially free (the list is cached locally). This would also solve the "stale metrics after delete" issue. The issue I have with implementing this approach in CLO is that I don't know how to identify the "healthy" status, for example. Input appreciated.
  - I have now fixed this for the ClusterLogging metric; ClusterLogForwarder is still based on the old model.
- It looks to me as if the LogFileMetricExporter metric is not actually ingested into telemetry. If we do not use it anywhere, removing it might also be an option.
- Once everything is cleared up, this needs to be squashed 🙂
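The "collect directly from the resources" idea in the list above can be sketched as follows. This is a stdlib-only illustration under stated assumptions: `Resource`, `InfoSample`, `Lister` and `CollectInfo` are hypothetical names, and the `Lister` function stands in for a cached List call. It is not the operator's code.

```go
package main

// Resource is a hypothetical stand-in for a ClusterLogging or
// ClusterLogForwarder object; only the identifying fields are modelled.
type Resource struct {
	Namespace string
	Name      string
}

// InfoSample is one "info"-style series with a constant value of 1.
type InfoSample struct {
	Namespace string
	Name      string
	Value     float64
}

// Lister returns the resources that exist right now, for example from
// a locally cached list.
type Lister func() ([]Resource, error)

// CollectInfo builds the samples for a single scrape from the listed
// resources. An empty list yields no samples, so nothing is left
// behind once a resource has been deleted.
func CollectInfo(list Lister) ([]InfoSample, error) {
	resources, err := list()
	if err != nil {
		return nil, err
	}
	samples := make([]InfoSample, 0, len(resources))
	for _, r := range resources {
		samples = append(samples, InfoSample{
			Namespace: r.Namespace,
			Name:      r.Name,
			Value:     1,
		})
	}
	return samples, nil
}
```

Because the samples are derived from the list on every call, the "only emit metrics when at least one instance exists" behaviour falls out of the loop itself and needs no separate cleanup step.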

Links

JIRA: LOG-5426

openshift-ci-robot · 2024-04-22T18:44:54Z

@xperimental: This pull request references LOG-5426 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

xperimental · 2024-04-23T18:13:57Z

Took another stab at some of the questions today and updated the PR code and description to reflect that.

xperimental · 2024-04-23T21:42:13Z

/retest-required

xperimental · 2024-04-24T13:47:13Z

/retest-required

xperimental · 2024-04-26T19:42:02Z

Had a short chat about this topic with @cahartma. There are still some open points which I need to address, so this PR will change again on Monday.

jcantrill · 2024-04-26T23:31:47Z

/hold

xperimental · 2024-04-29T18:05:25Z

Ready for review again, now already squashed as discussed 🙂

xperimental · 2024-04-29T18:09:50Z

It looks to me as if builds fail because of the recent change to Go 1.21. There's already a PR to fix that issue, so I'll probably need a rebase after that one merges. 🤞

jcantrill · 2024-04-30T16:53:44Z

/approve

jcantrill · 2024-04-30T16:46:36Z

internal/metrics/telemetry/cluster_log_forwarder.go

 	"github.com/openshift/cluster-logging-operator/internal/constants"
 )

-func UpdateInfofromCLF(forwarder logging.ClusterLogForwarder) {
+func (t *telemetryCollector) updateDefaultInfo(forwarder *loggingv1.ClusterLogForwarder) {
+	if forwarder.Namespace != constants.OpenshiftNS || forwarder.Name != constants.SingletonName {


Maybe it's a step after this, but it seems like we would be interested in telemetry for all CLFs, not just the legacy instance.

xperimental · 2024-05-07

The code provides metrics for all the CLF instances.

This function is a separate codepath for the "default/legacy instance", because it might not even exist (in the case when a user has ClusterLogging without any customization in ClusterLogForwarder). So this function collects the final state of the "legacy instance" to expose it in metrics if the ClusterLogging instance exists, while for all other CLF instances the information is gathered directly from the resource (code here).

xperimental · 2024-05-07

Thinking about this again, I might have forgotten to describe this in more detail in a comment. I can add that to this function, if you like.

jcantrill · 2024-05-08

It may be reasonable to clarify, though I'm the only one asking, so it may be of limited value. This is not a blocking issue for me.

xperimental · 2024-05-08

Added an explanation comment to the function. ✔️
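As a small illustration of the guard in the diff above, the check for the legacy instance can be written as a positive predicate. This is only a sketch: `forwarderKey` and `isLegacyInstance` are hypothetical names, and the two constant values are assumptions standing in for `constants.OpenshiftNS` and `constants.SingletonName`, whose values are not shown in this conversation.

```go
package main

// Assumed values for illustration only; the operator reads these from
// constants.OpenshiftNS and constants.SingletonName.
const (
	legacyNamespace = "openshift-logging"
	legacyName      = "instance"
)

// forwarderKey is a hypothetical stand-in for the namespace and name
// of a ClusterLogForwarder.
type forwarderKey struct {
	Namespace string
	Name      string
}

// isLegacyInstance reports whether the key identifies the default
// (legacy) forwarder. It is the logical negation of the guard in the
// diff, which matches every forwarder that is not the legacy one.
func isLegacyInstance(k forwarderKey) bool {
	return k.Namespace == legacyNamespace && k.Name == legacyName
}
```

With a predicate like this, the separate codepath described in the thread would apply only when `isLegacyInstance` returns true, and every other forwarder would be reported from its own resource.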

alanconway

/lgtm

openshift-ci · 2024-05-09T14:33:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, jcantrill, xperimental

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alanconway,jcantrill,xperimental]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xperimental · 2024-05-14T15:10:40Z

/unhold

openshift-ci · 2024-05-14T17:17:45Z

@xperimental: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot assigned alanconway Apr 22, 2024

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 22, 2024

openshift-ci bot requested a review from cahartma April 22, 2024 18:44

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 26, 2024

jcantrill added the release/5.9 label Apr 26, 2024

xperimental force-pushed the refactor-telemetry-5.9 branch from cf0e427 to 22fd218 Compare April 29, 2024 18:04

xperimental mentioned this pull request Apr 30, 2024

LOG-5471: Fix stale metrics in telemetry #2458

Merged

jcantrill reviewed Apr 30, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2024

xperimental mentioned this pull request Apr 30, 2024

LOG-5472: Fix stale metrics in telemetry #2459

Merged

xperimental force-pushed the refactor-telemetry-5.9 branch from 22fd218 to c4669b3 Compare April 30, 2024 23:00

Fix stale metrics in telemetry

e93eef8

xperimental force-pushed the refactor-telemetry-5.9 branch from c4669b3 to e93eef8 Compare May 8, 2024 18:02

alanconway approved these changes May 9, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2024

openshift-merge-bot bot merged commit 0eccb99 into openshift:release-5.9 May 14, 2024
7 checks passed

This was referenced May 14, 2024

LOG-5529: Fix stale metrics in telemetry #2491

Closed

LOG-5529: Fix stale metrics in telemetry #2492

Merged

xperimental deleted the refactor-telemetry-5.9 branch May 16, 2024 10:52
