Custom tagging infrastructure for Otel Metrics #36306

Open
joybestourous opened this issue Apr 9, 2024 · 1 comment

Comments

@joybestourous

Is your feature request related to a problem? Please describe.

We're hoping to migrate to gRPC's OTel metrics once instrumentation is complete in the languages we support. In the meantime, we have custom internal metrics designed to match the OTel spec (grpc/proposal#380). We've noticed several friction points with these metrics internally, and we're hoping we can work with you folks to build solutions into gRPC for two reasons: 1. if these are problems we're having internally, we imagine other gRPC customers are having similar issues, and 2. if we can't fix these upstream, it's likely we can't adopt the metrics you folks are building out for us.

Describe the solution you'd like (and alternatives you've considered)

There are three main solutions we're looking for here:

  1. The ability to add custom static tags at startup. This is useful for information that will not change for the life of a program. For example, gRPC currently has a grpc.target tag, whose value is the fully qualified target. We have several customers that would like to add a target_service tag containing just the service name, as it eases the process of migrating existing dashboards and saves them from having to use wildcards to find all services speaking to them. Another use case is a service that has multiple teams contributing endpoints; it currently lets teams break endpoints into groups by a tag so that they can be monitored together without manually registering each new endpoint (e.g., grouping endpoints into distinct SLO categories).
  2. The ability to identify the caller in the server metrics via custom metadata sent by clients. This is incredibly useful for debugging services. We did briefly discuss this with the team last September, and we've since been unable to come up with another strategy for this.
  3. The ability to configure non-error codes. The metrics currently treat any non-OK response as an error, but services sometimes decide for themselves what they consider an "error" (for example, many of our services consider "Canceled" to be a non-error code, as it is usually a decision made by the caller, not a server error). This request is a little less urgent, since we can currently filter these codes out in our graphs and in calculations like Success Rate. It's just a nice-to-have.
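
As a rough illustration of the three requests above, the sketch below shows how such a configuration might shape the labels on a server metric. This is not an existing gRPC API: `MetricsTagConfig`, `server_labels`, the `is_error` label, and the `x-caller-service` metadata key are all hypothetical names chosen for the example. Only `grpc.method` and `grpc.status` are taken from the existing metric labels.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Mapping


@dataclass(frozen=True)
class MetricsTagConfig:
    """Hypothetical config object; none of these fields exist in gRPC today."""

    # Request 1: constant tags fixed at startup.
    static_tags: Mapping[str, str] = field(default_factory=dict)
    # Request 2: request-metadata key -> tag name to copy onto server metrics.
    metadata_tags: Mapping[str, str] = field(default_factory=dict)
    # Request 3: status codes that should not be counted as errors.
    non_error_codes: FrozenSet[str] = frozenset({"OK"})


def server_labels(
    config: MetricsTagConfig,
    method: str,
    status: str,
    metadata: Mapping[str, str],
) -> Dict[str, str]:
    """Build the tag set for one finished server call (illustrative only)."""
    labels = {"grpc.method": method, "grpc.status": status}
    labels.update(config.static_tags)
    for md_key, tag_name in config.metadata_tags.items():
        # Callers that do not send the metadata key fall into one bucket.
        labels[tag_name] = metadata.get(md_key, "unknown")
    # "is_error" is an assumed tag name, not part of the OTel metrics spec.
    labels["is_error"] = str(status not in config.non_error_codes).lower()
    return labels


config = MetricsTagConfig(
    static_tags={"target_service": "billing"},
    metadata_tags={"x-caller-service": "caller"},
    non_error_codes=frozenset({"OK", "CANCELLED"}),
)
example = server_labels(
    config, "pkg.Billing/Charge", "CANCELLED", {"x-caller-service": "checkout"}
)
```

With this configuration, a cancelled call from `checkout` would carry the `target_service` and `caller` tags and would not be counted as an error.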

We recognize that allowing customers to define custom tags carries a risk of high cardinality, but we're hopeful we can work with you folks on a strategy that is still safe without sacrificing the usefulness of the metrics you're building.
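
One possible shape for such a safeguard, sketched under the assumption that a per-tag cap on distinct values is acceptable: once the cap is reached, unseen values collapse into a single overflow bucket. `BoundedTag` and the `"other"` bucket name are hypothetical and not part of gRPC or OpenTelemetry.

```python
import threading


class BoundedTag:
    """Caps the number of distinct values a custom tag may take (illustrative)."""

    def __init__(self, max_values: int, overflow: str = "other") -> None:
        if max_values < 1:
            raise ValueError("max_values must be at least 1")
        self._max_values = max_values
        self._overflow = overflow
        self._seen = set()
        self._lock = threading.Lock()

    def normalize(self, value: str) -> str:
        """Return the value if it fits under the cap, else the overflow bucket."""
        with self._lock:
            if value in self._seen:
                return value
            if len(self._seen) < self._max_values:
                self._seen.add(value)
                return value
            return self._overflow


caller_tag = BoundedTag(max_values=2)
normalized = [caller_tag.normalize(v) for v in ["checkout", "search", "cart", "checkout"]]
```

Here the first two callers keep their own tag values and any later new caller is reported under the overflow bucket, which bounds the number of time series per custom tag. An explicit allow-list of values would be an alternative with more predictable results.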

@joybestourous
Author

Hey @yashykt, any update on this?


3 participants