
Metrics about how opentelemetry collector is used #2829

Open
rubenvp8510 opened this issue Apr 9, 2024 · 11 comments · May be fixed by #2825
Labels
enhancement New feature or request

Comments

@rubenvp8510
Contributor

Component(s)

collector

Is your feature request related to a problem? Please describe.

I want to expose metrics about how the opentelemetry collector is used.

For this concrete feature I want to know:

  • How many collectors use certain receivers/exporters/processors.
  • How many collectors use a certain mode (sidecar, daemonset, etc.).

Describe the solution you'd like

Expose those metrics at the operator's Prometheus endpoint.

Describe alternatives you've considered

No response

Additional context

No response

@rubenvp8510 rubenvp8510 added enhancement New feature or request needs triage labels Apr 9, 2024
@rubenvp8510 rubenvp8510 linked a pull request Apr 9, 2024 that will close this issue
@jaronoff97
Contributor

Is there a reason you cannot use the collector's own metrics to obtain information about the components it runs? Similarly, you should be able to utilize k8s state metrics or the k8s cluster receiver to answer the second question. I'm trying to better understand the value of having this from the operator as opposed to existing sources.

@pavolloffay
Member

In general, this feature request proposes defining a set of operator metrics that will enable cluster administrators to better understand how the collector is used: for instance, which collector components are used, how the collector is deployed, whether the Prometheus target allocator (TA) is enabled, and so on.

@jaronoff97
Contributor

Seems like a good SIG discussion topic :)

@jaronoff97 jaronoff97 added the discuss-at-sig This issue or PR should be discussed at the next SIG meeting label Apr 11, 2024
@jaronoff97
Contributor

we discussed this at the SIG meeting, @rubenvp8510 is going to get together a list of metrics and why we want to track each of them. Ruben will also put this behind a feature gate initially.

@jaronoff97 jaronoff97 removed needs triage discuss-at-sig This issue or PR should be discussed at the next SIG meeting labels Apr 11, 2024
@rubenvp8510
Contributor Author

rubenvp8510 commented Apr 16, 2024

> we discussed this at the SIG meeting, @rubenvp8510 is going to get together a list of metrics and why we want to track each of them. Ruben will also put this behind a feature gate initially.

This is the list of metrics I want to expose at this first version:

opentelemetry_collector_receivers
opentelemetry_collector_exporters
opentelemetry_collector_processors
opentelemetry_collector_extensions
opentelemetry_collector_mode

For the receivers/exporters/processors/extensions metrics I'm using a label called type, whose value is the component name, for example:

opentelemetry_collector_receivers{type="otlp"} 2

This is useful because we want to know, for the collectors handled by this operator, which components those collectors are using.

For the opentelemetry_collector_mode metric I'm using the label mode, and its value will be one of the available modes.
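A rough sketch of this per-component-kind scheme, rendering the Prometheus exposition lines from collector configurations (the configs and component names below are hypothetical, purely for illustration):

```python
# Hypothetical sketch: one gauge per component kind, with a `type` label
# holding the component name, as described above. The collector configs
# are made-up examples, not taken from a real cluster.
from collections import Counter

collectors = [
    {"receivers": ["otlp", "jaeger"], "exporters": ["otlp"], "mode": "deployment"},
    {"receivers": ["otlp"], "exporters": ["prometheus"], "mode": "sidecar"},
]

def render_metrics(collectors):
    """Count component usage across collectors and emit exposition-format lines."""
    lines = []
    for kind, metric in [("receivers", "opentelemetry_collector_receivers"),
                         ("exporters", "opentelemetry_collector_exporters")]:
        counts = Counter(c for col in collectors for c in col[kind])
        for comp_type, n in sorted(counts.items()):
            lines.append(f'{metric}{{type="{comp_type}"}} {n}')
    mode_counts = Counter(col["mode"] for col in collectors)
    for mode, n in sorted(mode_counts.items()):
        lines.append(f'opentelemetry_collector_mode{{mode="{mode}"}} {n}')
    return lines

for line in render_metrics(collectors):
    print(line)
```

With the two example configs above, this emits `opentelemetry_collector_receivers{type="otlp"} 2`, matching the sample series in the comment.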

@jaronoff97
Contributor

I wonder if, instead of a metric for each component type, we could simply emit a single gauge:

opentelemetry_operator_collector_info{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1
opentelemetry_operator_collector_info{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1
... etc.

Going off of the Prometheus recommendations, it feels like we should have a single metric with each of these as labels, to be aggregated later. For example, you could run a query like sum(opentelemetry_operator_collector_info{receiver_type=~".+"}) by (receiver_name) to get all of the unique receivers. You could then query sum(opentelemetry_operator_collector_info) by (collector_name) to get the number of components each collector in your infrastructure runs.

Though maybe this would be better as two metrics: one for collector metadata and one for the components:

opentelemetry_operator_collector_info{mode="<mode>", collector_name="<collector_name>", deployed_namespace="<namespace>"} 1
opentelemetry_operator_collector_components{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>"} 1
opentelemetry_operator_collector_components{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>"} 1

Either way, I think these are some constraints for the metrics:

  • they should use an operator namespace (not a collector one)
  • the labels should include the collectors names and namespaces where they are deployed
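The single-gauge proposal and its sum(...) by (...) aggregation can be simulated in a few lines; the series, collector names, and namespaces below are illustrative assumptions, not real data:

```python
# Illustrative simulation of the single info-gauge proposal: each series is a
# label map whose sample value is implicitly 1, and sum_by mirrors PromQL's
# `sum(metric) by (label)` by grouping on one label and summing the samples.
from collections import Counter

series = [
    {"receiver_type": "otlp", "receiver_name": "otlp",
     "collector_name": "gateway", "deployed_namespace": "obs", "mode": "deployment"},
    {"exporter_type": "otlp", "exporter_name": "otlp",
     "collector_name": "gateway", "deployed_namespace": "obs", "mode": "deployment"},
    {"receiver_type": "jaeger", "receiver_name": "jaeger",
     "collector_name": "agent", "deployed_namespace": "apps", "mode": "sidecar"},
]

def sum_by(series, label):
    """Rough analogue of PromQL sum(metric) by (label) for value-1 series."""
    return Counter(s[label] for s in series if label in s)

# sum(opentelemetry_operator_collector_info) by (collector_name):
# how many components each collector runs.
print(sum_by(series, "collector_name"))
```

Grouping the same series by receiver_type instead yields the unique receivers in use, which is the other query sketched above.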

@rubenvp8510
Contributor Author

rubenvp8510 commented Apr 22, 2024

> I wonder if instead of a metric for each component type, we could simply emit a single gauge opentelemetry_operator_collector_info{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1 opentelemetry_operator_collector_info{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1 ... etc.
>
> Going off of the Prometheus recommendations it feels like we should have a single metric with each of these as labels to then be aggregated later. i.e. you could run a query like sum(opentelemetry_operator_collector_info{receiver_type=~".+"}) by (receiver_name) to get all of the unique receivers. You could then query sum(opentelemetry_operator_collector_info) by (collector_name) to get the amount of components each collector in your infrastructure runs.
>
> Though maybe this would be better as two metrics: one for collector metadata and one for the components:
>
> opentelemetry_operator_collector_info{mode="<mode>", collector_name="<collector_name>", deployed_namespace="<namespace>"} 1
> opentelemetry_operator_collector_components{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>"} 1
> opentelemetry_operator_collector_components{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>"} 1
>
> Either way, I think these are some constraints for the metrics:
>
> • they should use an operator namespace (not a collector one)
> • the labels should include the collectors names and namespaces where they are deployed

I'll make the changes for this; the only thing I'm not sure we need to include is the collector_name. I would be worried about the cardinality of that label. Essentially, what I want with these metrics is a summary of the status of the cluster.

@rubenvp8510
Contributor Author

I'm also not sure about using the same metric, as my understanding is that Prometheus considers each unique combination of label names and values as a different time series; if we use a single metric with a label for each combination of receivers/exporters/processors etc., this will grow too much!

I think we should keep the metrics in this form:

opentelemetry_operator_collector_info{mode="<mode>"} 1
opentelemetry_operator_collector_receivers{type="<receiver_type>"} 2
opentelemetry_operator_collector_exporters{type="<exporter_type>"} 1
opentelemetry_operator_collector_processors{type="<processor_type>"} 1

This will give us good insight into what is deployed on the cluster without producing high-cardinality time series. This is, IMHO, a good tradeoff.

I'm not sure the namespace/collector name is useful, at least not for my use case.
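The cardinality tradeoff being debated here can be put into rough numbers; the counts below are hypothetical assumptions, just to show how the label multiplies the series count:

```python
# Back-of-the-envelope series-count comparison (all numbers hypothetical).
# Without a collector_name label, the series count is bounded by the number
# of distinct component types; with it, the worst case multiplies that by
# the number of collector pools the operator manages.
distinct_component_types = 12   # assumed unique receiver/exporter/processor types
collector_pools = 40            # assumed collectors managed by the operator

per_kind_series = distinct_component_types                       # type label only
per_collector_series = distinct_component_types * collector_pools  # worst case

print(per_kind_series, per_collector_series)
```

In practice the worst case is rarely reached, since each collector uses only a few component types, which is part of the counterargument in the next comment.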

@jaronoff97
Contributor

I think we should definitely include the namespace/collector name as that would be really useful for someone trying to determine when and where a new receiver came online. I think it's fine to start with these as separate metrics, we can always change this in the future.

@rubenvp8510
Contributor Author

> I think we should definitely include the namespace/collector name as that would be really useful for someone trying to determine when and where a new receiver came online. I think it's fine to start with these as separate metrics, we can always change this in the future.

I agree that it would be useful, but I still have the label cardinality concern. I would prefer to ship this first version with separate metrics, and then we can move forward and add new things if new use cases require it.

@jaronoff97
Contributor

I already know that the SREs and Ops folks at my company would want this granularity :) The added cardinality would only be the number of collector pools you run (because a collector pool can only run in a single namespace), so it's really not that much more.
