
[Bug]: strimzi.resources metric is missing in new unidirectional topic operator #9802

Closed
cthtrifork opened this issue Mar 8, 2024 · 12 comments


@cthtrifork

Bug Description

It is documented here:
https://github.com/strimzi/proposals/blob/main/051-unidirectional-topic-operator.md#metrics

But it does not seem to be carried over from the old operator.
After upgrading, we no longer see the strimzi_resource_state metric for each topic, as we did before.

Steps to reproduce

No response

Expected behavior

No response

Strimzi version

0.39.0

Kubernetes version

Kubernetes 1.27.7

Installation method

Yaml files

Infrastructure

Azure AKS

Configuration files and logs

No response

Additional context

No response

@scholzj
Member

scholzj commented Mar 8, 2024

I seem to have it:

$ kubectl exec -ti my-cluster-entity-operator-6879c6b9d9-ccldp -c topic-operator -- curl -s localhost:8080/metrics | grep strimzi.resources
# HELP strimzi_resources Number of custom resources the operator sees
# TYPE strimzi_resources gauge
strimzi_resources{kind="KafkaTopic",namespace="myproject",selector="strimzi.io/cluster=my-cluster",} 1.0

@cthtrifork
Author

I seem to have it:

$ kubectl exec -ti my-cluster-entity-operator-6879c6b9d9-ccldp -c topic-operator -- curl -s localhost:8080/metrics | grep strimzi.resources
# HELP strimzi_resources Number of custom resources the operator sees
# TYPE strimzi_resources gauge
strimzi_resources{kind="KafkaTopic",namespace="myproject",selector="strimzi.io/cluster=my-cluster",} 1.0

Do you have any topics provisioned? We are seeing that this metric is no longer populated per topic, as it was before.

We use the metric to check for status != 1, together with the reason label, to monitor reconcile errors for topics.

I will try to provide more data next week.

@scholzj
Member

scholzj commented Mar 8, 2024

Ahh, ok. No, I do not have the per-topic metrics there. Just the counter. Not sure we want to keep these detailed metrics as they are hard to manage. But I guess that can be discussed when the issue is triaged.

@cthtrifork
Author

I see your concern. As a minimum, we need to know whether there are any reconciliation issues, and that is really hard to monitor with this feature removed. If we cannot use the metric together with the "reason" label, we would need to extract the information from logs, which would be a pain.

Perhaps it could be a configurable option?

@ppatierno
Member

Triaged on 21.3.2024: @fvaleri is going to take a look at this one.

@fvaleri
Contributor

fvaleri commented Mar 21, 2024

Hi @cthtrifork, thanks for raising this.

It was decided not to provide this metric in the UTO because it does not scale well (it adds one metric per managed topic). Additionally, we don't have anything similar for the other operators.

If we do not use the label together with "reason" we would need to extract it from logs, which would be a pain.

Why would you want to extract the reason from logs? My suggestion is to leverage the KafkaTopic status. You can use the strimzi.reconciliations.failed metric to be alerted, and then run a kubectl command to detect the failed KafkaTopics. Alternatively, you can run the kubectl command periodically and send an alert when it finds something.

A command similar to this one:

$ kubectl get kt -o custom-columns=TOPIC:.metadata.name,REASON:.status.conditions[0].reason,MESSAGE:.status.conditions[0].message,READY:.status.conditions[0].status | grep False
t1      NotSupported   Replication factor change not supported, but required for partitions []   False
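For the alerting side, a Prometheus rule on the failed-reconciliations counter could be sketched roughly like this. The Prometheus name assumes Micrometer's usual translation of strimzi.reconciliations.failed; the label set, window, and severity are illustrative assumptions, not an official Strimzi configuration:

```yaml
groups:
  - name: strimzi-topic-operator
    rules:
      - alert: KafkaTopicReconciliationFailing
        # Assumed Prometheus name for the strimzi.reconciliations.failed counter;
        # verify the exact name and labels against your /metrics endpoint.
        expr: increase(strimzi_reconciliations_failed_total{kind="KafkaTopic"}[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Topic Operator reconciliations are failing"
          description: "Run the kubectl command above to find KafkaTopics with Ready=False."
```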

Perhaps it could be a configurable option?

Personally, I don't like the idea because metrics are supposed to track the system behavior and performance, not the state of every single managed resource. The UTO has optional metrics to track internal operations that you can use for performance tests or troubleshooting, but they are aggregated.

That said, let's see what others think.

@scholzj @ppatierno @tombentley

@sebastiangaiser

sebastiangaiser commented Mar 21, 2024

Hey,
for me the scaling concern is a valid point and should be considered, as @fvaleri wrote.
But I also see the value of knowing from the alert itself which resource is having a problem.
So to (maybe) find a compromise, I would suggest using kube-state-metrics, as e.g. Flux does for every custom resource: https://fluxcd.io/flux/monitoring/metrics/ . This could expose the metrics without adding overhead inside the operator.
It would also be a Kubernetes-native approach, without resorting to hacks via kubectl...
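A CustomResourceState configuration for kube-state-metrics along those lines might look roughly like this. This is a sketch based on the customresourcestate-metrics docs; the metric name prefix and label choices are assumptions, not an official Strimzi example:

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: kafka.strimzi.io
        version: v1beta2
        kind: KafkaTopic
      # Hypothetical prefix; pick whatever fits your naming scheme.
      metricNamePrefix: strimzi_kafkatopic
      metrics:
        - name: resource_state
          help: "Ready condition of each KafkaTopic"
          each:
            type: StateSet
            stateSet:
              labelName: state
              # Selects the Ready condition entry from status.conditions.
              path: [status, conditions, "[type=Ready]", status]
              list: ["True", "False", "Unknown"]
```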

@scholzj
Member

scholzj commented Mar 21, 2024

I did not know that kube-state-metrics can be configured to monitor custom resources. But if it can do so, it sounds like it is just a question of someone putting it together and sharing/contributing the configuration?


@ppatierno
Member

So I agree with @fvaleri that we should not provide these metrics because of the scaling concerns, but I also think that a contribution to Strimzi (maybe in the examples folder?) providing a kube-state-metrics configuration could be a really interesting thing. The documentation seems pretty straightforward: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md

@fvaleri
Contributor

fvaleri commented May 20, 2024

We can have a dedicated improvement issue or PR if you think the kube-state-metrics example is necessary, but I would close this bug report. Wdyt?

@cthtrifork
Author

We can have a dedicated improvement issue or PR if you think the kube-state-metrics example is necessary, but I would close this bug report. Wdyt?

Yes, I will close this bug report. Thanks for assisting. We will look at kube-state-metrics!
