Custom tagging infrastructure for Otel Metrics #36306

joybestourous · 2024-04-09T13:17:03Z

Is your feature request related to a problem? Please describe.

We're hoping to migrate to gRPC's Otel metrics when instrumentation is complete in the languages we support. In the meantime, we have custom internal metrics designed to match the Otel specs (grpc/proposal#380). We've noticed several friction points with these metrics internally, and we're hoping we can work with you folks to build solutions into gRPC for two reasons: 1. if these are problems we're having internally, we imagine other gRPC customers are having similar issues, and 2. if we can't fix these upstream, it's likely we can't adopt the metrics you folks are building out for us.

Describe the solution you'd like (and alternatives you've considered)

There are 3 main solutions we're looking for here:

1. The ability to add custom static tags at startup. This is useful for information that will not change for the life of a program. For example, gRPC currently has a grpc.target tag, whose value is the fully qualified target. We have several customers that would like the ability to add a target_service tag with just the service name, as it eases the process of migrating existing dashboards and prevents them from having to use wildcards to find all services speaking to them. Another example use case we have for this is a service that has multiple teams contributing endpoints; they currently allow teams to break endpoints into groups by a tag so that they can be monitored together without manually adding each new endpoint (e.g., grouping endpoints into distinct SLO categories).
2. The ability to identify the caller in the server metrics via custom metadata sent by clients. This is incredibly useful for debugging services. We did briefly discuss this with the team last September, and we've since been unable to come up with another strategy for this.
3. The ability to configure non-error codes. These metrics currently treat any non-OK response as an error, but sometimes services make their own decisions as to what they consider an "error" (for example, many of our services consider "Canceled" to be a non-error code, as it is usually a decision made by the caller, not a server error). This request is a little less urgent, since we can currently filter these out in our graphs and in calculations like Success Rate. It's just a nice-to-have.
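To make the three requests concrete, here is a minimal sketch of how they might combine when a server records a metric. Everything in it is hypothetical and for illustration only: `TagConfig`, `build_server_tags`, the `x-caller-service` metadata key, and the `caller` and `is_error` tag names are assumptions, not gRPC APIs or attributes. Only `grpc.method` and `grpc.status` mirror existing attribute names.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Mapping


@dataclass(frozen=True)
class TagConfig:
    """Hypothetical startup-time configuration; not a gRPC type."""
    # Request 1: static tags fixed for the life of the process.
    static_tags: Mapping[str, str] = field(default_factory=dict)
    # Request 2: metadata key the client uses to identify itself (assumed name).
    caller_metadata_key: str = "x-caller-service"
    # Request 3: status codes the service does not count as errors.
    non_error_codes: FrozenSet[str] = frozenset({"OK"})


def build_server_tags(
    config: TagConfig,
    method: str,
    status_code: str,
    metadata: Mapping[str, str],
) -> Dict[str, str]:
    """Return the tag set for one completed server call."""
    # Apply static tags first so built-in attributes cannot be overridden.
    tags: Dict[str, str] = dict(config.static_tags)
    tags["grpc.method"] = method
    tags["grpc.status"] = status_code
    # Fall back to "unknown" when the client sent no identifying metadata.
    tags["caller"] = metadata.get(config.caller_metadata_key, "unknown")
    # Classify the call using the service's own definition of an error.
    is_error = status_code not in config.non_error_codes
    tags["is_error"] = "true" if is_error else "false"
    return tags


config = TagConfig(
    static_tags={"target_service": "billing", "slo_group": "tier1"},
    non_error_codes=frozenset({"OK", "CANCELLED"}),
)
tags = build_server_tags(
    config,
    method="billing.Invoices/Get",
    status_code="CANCELLED",
    metadata={"x-caller-service": "checkout"},
)
```

With this configuration, a cancelled call from `checkout` would be tagged with `target_service`, `slo_group`, and `caller`, and would not count against the error rate.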

We recognize that letting customers define custom tags carries a risk of high cardinality, but we're hopeful we can work with you folks on a strategy that is still safe without sacrificing the usefulness of the metrics you're building.
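One possible shape for such a strategy, offered only as a sketch: cap the number of distinct values a custom tag may take and fold everything past the cap into a single overflow value. The `CardinalityLimiter` class and the `"other"` overflow value are hypothetical, not something gRPC provides.

```python
class CardinalityLimiter:
    """Hypothetical guard that bounds the distinct values of one tag."""

    def __init__(self, max_values: int, overflow: str = "other") -> None:
        self._max_values = max_values
        self._overflow = overflow
        self._seen: set[str] = set()

    def normalize(self, value: str) -> str:
        """Return the value if it fits within the cap, else the overflow value."""
        if value in self._seen:
            return value
        if len(self._seen) < self._max_values:
            self._seen.add(value)
            return value
        return self._overflow


limiter = CardinalityLimiter(max_values=2)
callers = [limiter.normalize(c) for c in ["checkout", "search", "ads", "checkout"]]
```

The first two distinct callers keep their names; any later new caller is reported under the overflow value, so the tag can never produce more than `max_values + 1` series per metric.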


joybestourous · 2024-05-06T14:43:51Z

Hey @yashykt, any update on this?

joybestourous added kind/enhancement lang/core priority/P2 untriaged labels Apr 9, 2024

grpc-bot assigned drfloob Apr 9, 2024

yashykt assigned yashykt and unassigned drfloob Apr 9, 2024
