
Metrics about how opentelemetry collector is used #2829

Open
rubenvp8510 opened this issue Apr 9, 2024 · 11 comments · May be fixed by #2825
Labels
enhancement New feature or request

Comments

@rubenvp8510
Contributor

Component(s)

collector

Is your feature request related to a problem? Please describe.

I want to expose metrics about how the opentelemetry collector is used.

For this concrete feature I want to know:

  • How many collectors use certain receivers/exporters/processors.
  • How many collectors use a certain mode (sidecar, daemonset, etc.).

Describe the solution you'd like

Expose those metrics at the operator's Prometheus endpoint.

Describe alternatives you've considered

No response

Additional context

No response

@rubenvp8510 rubenvp8510 added enhancement New feature or request needs triage labels Apr 9, 2024
@rubenvp8510 rubenvp8510 linked a pull request Apr 9, 2024 that will close this issue
@jaronoff97
Contributor

Is there a reason you cannot use the collector's own metrics to obtain information about the components it runs? Similarly, you should be able to utilize k8s state metrics or the k8s cluster receiver to answer the second question. I'm trying to better understand the value of having this from the operator as opposed to existing sources.

@pavolloffay
Member

In general, this feature request proposes defining a set of operator metrics that will enable cluster administrators to better understand how the collector is used: for instance, which collector components are used, how the collector is deployed, whether the Prometheus target allocator (TA) is enabled, and so on.

@jaronoff97
Contributor

Seems like a good SIG discussion topic :)

@jaronoff97 jaronoff97 added the discuss-at-sig This issue or PR should be discussed at the next SIG meeting label Apr 11, 2024
@jaronoff97
Contributor

we discussed this at the SIG meeting, @rubenvp8510 is going to get together a list of metrics and why we want to track each of them. Ruben will also put this behind a feature gate initially.

@jaronoff97 jaronoff97 removed needs triage discuss-at-sig This issue or PR should be discussed at the next SIG meeting labels Apr 11, 2024
@rubenvp8510
Contributor Author

rubenvp8510 commented Apr 16, 2024

> we discussed this at the SIG meeting, @rubenvp8510 is going to get together a list of metrics and why we want to track each of them. Ruben will also put this behind a feature gate initially.

This is the list of metrics I want to expose at this first version:

opentelemetry_collector_receivers
opentelemetry_collector_exporters
opentelemetry_collector_processors
opentelemetry_collector_extensions
opentelemetry_collector_mode

For the receivers/exporters/processors/extensions metrics I'm using a label called type, whose value is the component name, for example:

opentelemetry_collector_receivers{type="otlp"} 2

This is useful because we want to know, for the collectors handled by this operator, which components those collectors are using.

For the opentelemetry_collector_mode metric I'm using the label mode, and its value will be one of the available modes.
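A rough sketch of this per-component-kind scheme, rendering the Prometheus exposition lines from collector configurations (the configs and component names below are hypothetical, purely for illustration):

```python
# Hypothetical sketch: one gauge per component kind, with a `type` label
# holding the component name, as described above. The collector configs
# are made-up examples, not taken from a real cluster.
from collections import Counter

collectors = [
    {"receivers": ["otlp", "jaeger"], "exporters": ["otlp"], "mode": "deployment"},
    {"receivers": ["otlp"], "exporters": ["prometheus"], "mode": "sidecar"},
]

def render_metrics(collectors):
    """Count component usage across collectors and emit exposition-format lines."""
    lines = []
    for kind, metric in [("receivers", "opentelemetry_collector_receivers"),
                         ("exporters", "opentelemetry_collector_exporters")]:
        counts = Counter(c for col in collectors for c in col[kind])
        for comp_type, n in sorted(counts.items()):
            lines.append(f'{metric}{{type="{comp_type}"}} {n}')
    mode_counts = Counter(col["mode"] for col in collectors)
    for mode, n in sorted(mode_counts.items()):
        lines.append(f'opentelemetry_collector_mode{{mode="{mode}"}} {n}')
    return lines

for line in render_metrics(collectors):
    print(line)
```

With the two example configs above, this emits `opentelemetry_collector_receivers{type="otlp"} 2`, matching the sample series in the comment.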

@jaronoff97
Contributor

I wonder if, instead of a metric for each component type, we could simply emit a single gauge:

opentelemetry_operator_collector_info{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1
opentelemetry_operator_collector_info{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1
... etc.

Going off of the Prometheus recommendations, it feels like we should have a single metric with each of these as labels, to be aggregated later. For example, you could run a query like sum(opentelemetry_operator_collector_info{receiver_type=~".+"}) by (receiver_name) to get all of the unique receivers. You could then query sum(opentelemetry_operator_collector_info) by (collector_name) to get the number of components each collector in your infrastructure runs.

Though maybe this would be better as two metrics: one for collector metadata and one for the components:

opentelemetry_operator_collector_info{mode="<mode>", collector_name="<collector_name>", deployed_namespace="<namespace>"} 1
opentelemetry_operator_collector_components{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>"} 1
opentelemetry_operator_collector_components{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>"} 1

Either way, I think these are some constraints for the metrics:

  • they should use an operator namespace (not a collector one)
  • the labels should include the collectors names and namespaces where they are deployed
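The single-gauge proposal and its sum(...) by (...) aggregation can be simulated in a few lines; the series, collector names, and namespaces below are illustrative assumptions, not real data:

```python
# Illustrative simulation of the single info-gauge proposal: each series is a
# label map whose sample value is implicitly 1, and sum_by mirrors PromQL's
# `sum(metric) by (label)` by grouping on one label and summing the samples.
from collections import Counter

series = [
    {"receiver_type": "otlp", "receiver_name": "otlp",
     "collector_name": "gateway", "deployed_namespace": "obs", "mode": "deployment"},
    {"exporter_type": "otlp", "exporter_name": "otlp",
     "collector_name": "gateway", "deployed_namespace": "obs", "mode": "deployment"},
    {"receiver_type": "jaeger", "receiver_name": "jaeger",
     "collector_name": "agent", "deployed_namespace": "apps", "mode": "sidecar"},
]

def sum_by(series, label):
    """Rough analogue of PromQL sum(metric) by (label) for value-1 series."""
    return Counter(s[label] for s in series if label in s)

# sum(opentelemetry_operator_collector_info) by (collector_name):
# how many components each collector runs.
print(sum_by(series, "collector_name"))
```

Grouping the same series by receiver_type instead yields the unique receivers in use, which is the other query sketched above.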

@rubenvp8510
Contributor Author

rubenvp8510 commented Apr 22, 2024

> I wonder if instead of a metric for each component type, we could simply emit a single gauge opentelemetry_operator_collector_info{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1 opentelemetry_operator_collector_info{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>", deployed_namespace="<namespace>", mode="<mode>"} 1 ... etc.
>
> Going off of the Prometheus recommendations it feels like we should have a single metric with each of these as labels to then be aggregated later. i.e. you could run a query like sum(opentelemetry_operator_collector_info{receiver_type=~".+"}) by (receiver_name) to get all of the unique receivers. You could then query sum(opentelemetry_operator_collector_info) by (collector_name) to get the amount of components each collector in your infrastructure runs.
>
> Though maybe this would be better as two metrics: one for collector metadata and one for the components:
>
> opentelemetry_operator_collector_info{mode="<mode>", collector_name="<collector_name>", deployed_namespace="<namespace>"} 1
> opentelemetry_operator_collector_components{receiver_type="<receiver_type>", receiver_name="<receiver_name>", collector_name="<collector_name>"} 1
> opentelemetry_operator_collector_components{exporter_type="<exporter_type>", exporter_name="<exporter_name>", collector_name="<collector_name>"} 1
>
> Either way, I think these are some constraints for the metrics:
>
> • they should use an operator namespace (not a collector one)
> • the labels should include the collectors names and namespaces where they are deployed

I'll make the changes for this; the only thing I'm not sure we need to include is the collector_name. I would be worried about the cardinality of that label. Essentially, what I want with these metrics is a summary of the status of the cluster.

@rubenvp8510
Contributor Author

I'm also not sure about using the same metric, as my understanding is that Prometheus considers each unique combination of label names and values as a different time series; if we use a single metric with a label for each combination of receivers/exporters/processors etc., this will grow too much!

I think we should keep the metrics in this form:

opentelemetry_operator_collector_info{mode="<mode>"} 1
opentelemetry_operator_collector_receivers{type="<receiver_type>"} 2
opentelemetry_operator_collector_exporters{type="<exporter_type>"} 1
opentelemetry_operator_collector_processors{type="<processor_type>"} 1

This will give us good insight into what is deployed on the cluster without producing high-cardinality time series. This is, IMHO, a good tradeoff.

I'm not sure the namespace/collector name is useful, at least not for my use case.
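The cardinality tradeoff being debated here can be put into rough numbers; the counts below are hypothetical assumptions, just to show how the label multiplies the series count:

```python
# Back-of-the-envelope series-count comparison (all numbers hypothetical).
# Without a collector_name label, the series count is bounded by the number
# of distinct component types; with it, the worst case multiplies that by
# the number of collector pools the operator manages.
distinct_component_types = 12   # assumed unique receiver/exporter/processor types
collector_pools = 40            # assumed collectors managed by the operator

per_kind_series = distinct_component_types                       # type label only
per_collector_series = distinct_component_types * collector_pools  # worst case

print(per_kind_series, per_collector_series)
```

In practice the worst case is rarely reached, since each collector uses only a few component types, which is part of the counterargument in the next comment.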

@jaronoff97
Contributor

I think we should definitely include the namespace/collector name as that would be really useful for someone trying to determine when and where a new receiver came online. I think it's fine to start with these as separate metrics, we can always change this in the future.

@rubenvp8510
Contributor Author

> I think we should definitely include the namespace/collector name as that would be really useful for someone trying to determine when and where a new receiver came online. I think it's fine to start with these as separate metrics, we can always change this in the future.

I agree that it would be useful, but I still have the label cardinality concern. I would prefer to ship this first version with separate metrics, and then we can move forward and add new things if new use cases require it.

@jaronoff97
Contributor

I already know that the SREs and Ops folks at my company would want this granularity :) The added cardinality would only be the number of collector pools you run (because a collector pool can only run in a single namespace), so it's really not that much more.
