Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop histogram aggregation, default to explicit bucket histogram #2429

Merged
merged 6 commits into from
Apr 12, 2022

Conversation

jack-berg
Copy link
Member

Resolves #2382.

Now that metrics is marked as stable, we're stuck with the imprecise histogram aggregation which gives SDKs the latitude to change implementations from one version to the next, potentially breaking backend integrations.

I suggest we remove the histogram aggregation, and default histogram instruments to explicit bucket histogram.

Later once exponential histogram aggregation is specified, we can add back the histogram aggregation with its current "best available" semantics. However, we should have users opt-in to this uncertainty rather than making it the default.

@jack-berg jack-berg requested review from a team as code owners March 22, 2022 20:26
@jack-berg jack-berg changed the title Drop histogram aggregation, default histogram instrument to explicit … Drop histogram aggregation, default to explicit bucket histogram Mar 22, 2022
@reyang
Copy link
Member

reyang commented Mar 22, 2022

@jmacd @pirgeo I think this is intended for exponential bucket histogram (e.g. base-2 one)?

@reyang reyang added spec:metrics Related to the specification/metrics directory area:sdk Related to the SDK labels Mar 23, 2022
@pirgeo
Copy link
Member

pirgeo commented Mar 23, 2022

To make sure I understand correctly:

  1. You suggest removing the Histogram Aggregation for now, and using the explicit-bucket Histogram as defined in the spec as the default.
  2. Later we could add the Histogram Aggregation back, which would allow (and require) users to explicitly state "give me whatever histogram is specified as the default for this language/SDK version combination", and that would then provide either a fixed-bucket histogram or an exponential one.

Alternatively, users can always explicitly specify that they either want an explicit histogram with buckets that they themselves set up or an exponential histogram

The default behavior (explicit bucket histogram with buckets as specified in the spec) will never change, this way the SDK will never produce incompatible data when updating SDK versions. All histograms, if not explicitly set to something else, will always be fixed bucket ones, and look exactly like defined in the spec today.

Did I misunderstand something here?

If we merge this change, we can never specify the ExponentialHistogram as the default for Histogram aggregation anymore, right?

@jack-berg
Copy link
Member Author

Yes that is what I'm suggesting.

If we merge this change, we can never specify the ExponentialHistogram as the default for Histogram aggregation anymore, right?

Correct. The assertion is that SDKS shouldn't ever change the default behavior because doing so may be a breaking change for users. For example, backends might rely on explicit bucket histogram and not support exponential histogram, or some user experience might rely on the default bucket boundaries of explicit bucket histogram.

In the java SIG we've decided that this type of change would be unacceptable and have gotten rid of the "best available" histogram aggregation for the time being.

@reyang
Copy link
Member

reyang commented Mar 23, 2022

Here is what we should consider:

We want to have "Hint API" (e.g. #1753 which was abandoned to scope down the initial spec release) in the Metrics API specification. And it is very likely that the instrumentation authors would want to provide the hint saying "by default this histogram should use base-2 exponential buckets", and we want to the SDK to respect the hint - unless the application owner explicitly specified "I WANT to use explicit buckets".

And I think there is a way to avoid breaking change:

  1. If there is no "hint" (this is where we are today), we will use the default explicit buckets.
  2. If there is "hint" (later), it should be respected, unless the user explicitly specified the buckets in a View configuration.

@pirgeo
Copy link
Member

pirgeo commented Mar 24, 2022

Adding the Hint API might still break some scenarios. Consider the following:

  1. Somebody sets up an app with the SDK as it is today (no Hint API). They use the default explicit bucket aggregation because it fits with their backend. The app uses a HTTP server library that is autoinstrumented and files away request sizes as a histogram.
  2. The Hint API is released. The authors of the server library decide to provide a Hint to use the exponential histogram layout.
  3. The User of the server library updates to the latest version. They also update the SDK to the latest version.
  4. The SDK now respects hints, and picks up that the instrumented server library suggests using an exponential histogram. The exported data will therefore be in exponential format.
  5. Now the User would have to go back and explicitly change their code to use the explicit-bucket histogram in order to have data continuity.

Do we think this is okay, since it happens during a package-upgrade process? It would be a change of output of the system without any explicit code change (except bumping package versions). Is this considered a breaking change, even if the code still compiles and runs?
If yes, adding the Hint to the server library might be breaking behavior.

TL;DR: As far as I can tell, the fundamental question to address here is: Do we want to keep flexibility to change the default behavior in later releases? If the answer is No, then we will have to be very careful how we add Hints, or be very clear to make sure folks double check their instrumentation and add explicit defaults where necessary when the Hint API comes around, so their expected outputs don't change.

@jmacd
Copy link
Contributor

jmacd commented Mar 24, 2022

@pirgeo Similar questions came up in making recommendations about handling duplicate instrument conflicts in #2317. The data model has this section that defines duplicates conflicts to occur for a single Resource, which to me allows point kind to change across runs of the same application through View reconfiguration. If we excluded Resource from that statement, it would prevent you from ever changing an aggregation without being considered a conflict. I think the fact that we have a Views API means we support the idea of changing Aggregation across process starts, but we want to ensure the defaults do not change unexpectedly, hence I support this PR.

@cijothomas
Copy link
Member

Do we think this is okay, since it happens during a package-upgrade process? It would be a change of output of the system without any explicit code change (except bumping package versions).

This seems okay to me. As user updated the version of the library, he can be affected by any changes in that library - not just limited to addition of Hints..

To be clear, user is not bumping the package version of the OTel SDK - if all they did was bump SDK version and behavior changes, thats a different story.

@pirgeo
Copy link
Member

pirgeo commented Mar 24, 2022

@jmacd I agree that we do not want the defaults to change unexpectedly. I am trying to point out (and make sure I understood that correctly) that by merging this change we will bar ourselves forever from using any HistogramAggregation except the one specified today as the default without calling it a breaking change. Once this change is in, the default is an explicit bucket histogram as defined today, forever. If I want something else, I can just add a view, and specify the preferred aggreation. As soon as I specified something explicitly, the default doesn't apply anymore (which totally makes sense to me).

This also means that when the Hint API is added, the default behavior cannot change - if I don't specify anything for my Histogram, it should always export the default bucket boundaries.

@cijothomas

To be clear, user is not bumping the package version of the OTel SDK - if all they did was bump SDK version and behavior changes, that's a different story.

I think this might be the case for the version bump that introduces Hints: Take the following scenario (and I am not sure if its a valid one):

  • A library you use in your project is instrumented with OTel, it produces a Histogram, and provides a Hint to use an exponential Histogram.
  • The SDK you use in your project is not the latest version, and does not support Hints, so it ignores the Hint from the library.
  • Since the Hint is ignored, the SDK exports the default, which is the explicit bucket histogram, and that is what you expect in your backend.
  • One day you decide to update your SDK, and the new SDK version implements the Hints API, picks up that the library provides a hint (to export the exponential histogram) and starts exporting that instead.

--> no code change, just a bump in SDK version and the output behavior changes. This of course may only ever happen once, when Hints are added to the SDK, so I am not sure how valid this example is, but that's why I am asking whether this is considered a breaking change. It feels to me like the Hint would allow us to have "conditional defaults", but if we set the default in stone here, I am not sure we can use those "conditional defaults" any more. @reyang do you see any problems with this?

@reyang
Copy link
Member

reyang commented Mar 25, 2022

@pirgeo

Here is my take:

Scenario 1

  1. The customer was using an instrumentation library, and the library was not using Hint.
  2. The customer took the SDK which understands the Hint. Since there is no Hint, default buckets are used.
  3. The customer upgraded the instrumentation library which provides the Hint. The behavior is now changed to use the buckets provided by the Hint.

I consider this as a breaking change in the instrumentation library, and the instrumentation library author should either:

  • If the library is Stable, bump the major version, because adding Hint is a contract change.

OR

  • If the library is Preview, keep releasing the preview version since preview version doesn't promise compat.

Scenario 2

  1. The customer was using an instrumentation library, and the library is using Hint.
  2. The customer took the SDK which does not understands the Hint. Since the SDK does not understand Hint, default buckets are used.
  3. The customer upgraded the SDK which understands the Hint. The behavior is now changed to use the buckets provided by the Hint.

I think this is in the gray area and personally I don't treat it as a breaking change in the SDK.

Here is my analogy - if the customer is using an old SDK which does not understand certain environment variable, and later upgraded to a new SDK which starts to understand that environment variable. If that specific environment variable has been existing (even before the old SDK was used), and the upgrade triggered side effect (e.g. the environment variable is saying "disable sampling"), do we call that "breaking change"?

@anuraaga
Copy link
Contributor

anuraaga commented Mar 25, 2022

@reyang I don't think it's possible to use hints to control histogram vs exponential. The exported data for both is completely different

https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L227

So whether to use histogram vs exponential is part of the contract between the app and backend, instrumentation could never pick appropriately between those. Hints could provide buckets for use the the explicit bucket histogram because that doesn't change the exported data format itself though - if the user had configured exponential histograms, then the hint would be ignored appropriately.

Perhaps the solution will be for metric exporters to define their preferred histogram aggregation, which is used by the SDK, similar to the temporality preference.

@joaopgrassi
Copy link
Member

After reading the PR and the discussions here, I feel we might have a hidden problem:

Correct. The assertion is that SDKS shouldn't ever change the default behavior because doing so may be a breaking change for users. For example, backends might rely on explicit bucket histogram and not support exponential histogram, or some user experience might rely on the default bucket boundaries of explicit bucket histogram.

The PR makes it clear that the default should be explicit bucket histogram and I agree that changing it could be breaking for users, so I think this is a good change.

BUT: With the start of discussions on the "Hint API" I feel that it defeats the purpose of the changes proposed here.

And I think there is a way to avoid breaking change:

  1. If there is no "hint" (this is where we are today), we will use the default explicit buckets.
    2. If there is "hint" (later), it should be respected, unless the user explicitly specified the buckets in a View configuration.

Point two (emphasis mine), IMHO contradicts what's proposed here. If we agree this PR makes sense, to "shield" from possible SDKs changing implementations from one version to the next, then why it's ok to allow Hints changing the behavior after? Even worse, it's widening the scope of introducing breaking changes to not only the SDK now, but to any other instrumentation library in the future.

Consider a case where I have a repo with 30+ microservices, all configured with the default as of now. By following point 2 above, I, as the application owner will have to change all my 30 services, just to stay on the current behavior. Thinking on the end-user experience of OTel, I don't think I would be happy if this was happening to me.

The Hint Api might still be useful, for example if I don't know much about histograms and if the instrumentation library offers me a better alternative it's great. But IMHO it should be my decision still to do it, because as @jack-berg mentioned, it can break many things. If I get logs telling me, "hey you might want to use exponential histograms or xyz buckets to get better metrics", I can plan this ahead and work on it later.

@reyang
Copy link
Member

reyang commented Mar 25, 2022

@reyang I don't think it's possible to use hints to control histogram vs exponential. The exported data for both is completely different

This is something we will need to explore/discuss once we started working on Hint API. From the API perspective both explicit bucket and exponential bucket histogram might share the same instrument type.

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to focus on @anuraaga's remarks above:

@reyang I don't think it's possible to use hints to control histogram vs exponential. The exported data for both is completely different

https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L227

The "completely different" remark is doubtful, because (a) there is a non-lossy conversion from the exponential histogram data point to an explicit bucket histogram data point, and (b) a simple reconfiguration of explicit boundaries also makes for "completely different" data the way it is commonly used. Changing explicit buckets is likely to be a problem, in other words, yet this is something we want to be configurable.

I think we should expect "Compliant" consumers of OTLP to accept exponential histogram data; consumers are free to convert the data and downsample into explicit bucket histogram format as needed.

Speaking as a vendor, we will support both formats some time after the SDK specification lands.

So whether to use histogram vs exponential is part of the contract between the app and backend, instrumentation could never pick appropriately between those. Hints could provide buckets for use the the explicit bucket histogram because that doesn't change the exported data format itself though - if the user had configured exponential histograms, then the hint would be ignored appropriately.

We shouldn't debate what Hints are since they aren't formally proposed in any specification document.

Perhaps the solution will be for metric exporters to define their preferred histogram aggregation, which is used by the SDK, similar to the temporality preference.

Exporters already define the default Aggregation for each instrument type (from #2404). The spec for MetricReader reads The default output aggregation (optional), a function of instrument kind. If not configured, the default aggregation SHOULD be used.

I support your premise that, like Temporality, the choice of Aggregation is something the exporter should have control over, however this appears to be in a "limiting" fashion.

The SDK/Prometheus exporter today would supply the explicit bucket Histogram aggregator as the default when constructing the corresponding MetricReader.

The SDK/OTLP exporter will need (like Temporality) an option to choose the desired default histogram aggregation. This debate is about whether the default choice can later change for the OTLP exporter (particularly the SDK auto-configured one), and if so how. I agree with @reyang that this will require a major version number increase for OTel SDKs to accomplish, but I don't expect this will happen until a Collector's Prometheus Remote Write exporter has a way to write exponential histogram data points.

FWIW when it comes to "Hints" I see them as the programmatic equivalent of entering Views; however, I think @anuraaga is correct that changing the type of histogram aggregation is probably an Exporter-wide decision, like Temporality, and I (personally) would not expect a user to mix the two styles of histogram in a single session.

This leads me to suggest a different fix: Where the View clause currently includes an aggregation, this is not actually a configurable parameter. Instead, the View aggregation is just a section heading for the configuration that follows. This means you can't change the underlying aggregator type chosen for Histogram instruments through Views (or Hints), only provide configuration for relevant aggregator choices that will take effect when the aggregation configured by the MetricReader matches.

@reyang
Copy link
Member

reyang commented Mar 26, 2022

The "completely different" remark is doubtful, because (a) there is a non-lossy conversion from the exponential histogram data point to an explicit bucket histogram data point, and (b) a simple reconfiguration of explicit boundaries also makes for "completely different" data the way it is commonly used. Changing explicit buckets is likely to be a problem, in other words, yet this is something we want to be configurable.

+1 💯

@jack-berg
Copy link
Member Author

This leads me to suggest a different fix: Where the View clause currently includes an aggregation, this is not actually a configurable parameter. Instead, the View aggregation is just a section heading for the configuration that follows. This means you can't change the underlying aggregator type chosen for Histogram instruments through Views (or Hints), only provide configuration for relevant aggregator choices that will take effect when the aggregation configured by the MetricReader matches.

I can't imagine we could make this change in a way that is backwards compatible with an initial stable release. This does seem like a good way to think about how to interpret aggregation options from a future hint API.

I think we should expect "Compliant" consumers of OTLP to accept exponential histogram data; consumers are free to convert the data and downsample into explicit bucket histogram format as needed.

I agree with this. Perhaps there's a way to retain the flexibility to change the histogram aggregation for OTLP export while giving users a way to opt in to consistency, and always using explicit bucket histograms for prometheus export. What if we:

  • Extended the prometheus exporter to say that it must be paired with a MetricReader who's default aggregation for histogram instruments is explicit bucket histogram.
  • Add a new environment variable to the otlp exporter called something like OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION. Its options could be:
    • best_histogram The best histogram available in the SDK implementation, which may include explicit bucket or exponential histogram implementations.
    • explicit_bucket_histogram Explicit bucket histogram.
  • The default option would be best_histogram, which would give SDKs the ability to change histogram implementations over time, and which is based on assumption that compliant OTLP metric receivers accept both. But if users aren't comfortable with the potential for change, they can opt into consistency with explicit_bucket_histogram.
  • Once explicit bucket histograms are fully specced, we could include a third exponential_histogram option.

@anuraaga
Copy link
Contributor

anuraaga commented Mar 30, 2022

I think we should expect "Compliant" consumers of OTLP to accept exponential histogram data; consumers are free to convert the data and downsample into explicit bucket histogram format as needed.

This doesn't seem appropriate if a version of the spec is released with a data model that includes stable explicit bucket histograms but experimental exponential histograms. We'd expect a system that targets a stable version of the spec to be compliant, and any additive changes after that without a major version change could never be expected for compliance. My reading of the spec is that it's acceptable for a backend to only read the explicit histogram field of an incoming OTLP message. But it is admittedly not clear how to reconcile the data model doc and protocol doc when deciding that, perhaps experimental in data model but stable in OTLP means we do actually require backends to support exponential histograms to be compliant. It's not my first reaction though.

best_histogram The best histogram available in the SDK implementation, which may include explicit bucket or exponential histogram implementations.

I still disagree with this sort of dynamic approach since it requires the user to know what that best histogram is actually set to so they can confirm their backend reads that field from the OTLP message, at which point they could have just specified the histogram type itself. Unless we really double down on the statement that all OTLP receiving backends MUST read exponential histograms, and are OK with users having confusing problems if they run into a backend that does not comply with this. It would be good to document this, it probably means that backends only look at the proto to determine requirements, not the data model, and are compliant if they support all fields in the first stable release of a proto file. This is probably fine but I can't find it mentioned explicitly right now.

@pirgeo
Copy link
Member

pirgeo commented Mar 31, 2022

The default option would be best_histogram

Doesn't this lead to the same situation where, if users don't explicitly state that they want explicit_bucket_histograms the histogram they get might just change from under them? I am trying to look at this from the perspective of a developer who does not know the OTel spec well (which I think will be the majority of users). I think we should give users the possibility of opting into the histogram automatically changing, instead of them having to opt into keeping the histogram that they use stable.

More generally, and after thinking about it for a little while longer I think I agree with @anuraaga. Are there scenarios where we want the histogram aggregation for a metric to change over time? The two different histograms are intended for somewhat different use-cases: The exponential one when I need a specific error bound, and the explicit one when I want to specify buckets. I think I have been convinced that I don't want them changing, and definitely not without my explicitly stating it.

The more I think about it, the more I believe that the actual type of histogram has to be decided depending on its use-case, and there is no catch-all. Which is probably why we have two histogram types and this wording in the first place. Of course we will have to provide a default, but I am not sure what the right default is. In the interest of compatibility, the currently specced out explicit histogram would make sense, however, its layout will probably fail for a lot of cases (e.g. recording double values between 0 and 1). Are there users who want all of their histograms to be exponential? My guess is yes, so we should give an option to set it explicitly. If we change it in the spec, we are potentially breaking or at least changing the data format for folks with backends that don't support exponential histograms.

@jack-berg
Copy link
Member Author

But it is admittedly not clear how to reconcile the data model doc and protocol doc when deciding that, perhaps experimental in data model but stable in OTLP means we do actually require backends to support exponential histograms to be compliant.

I don't know the backstory here - the inconsistency between the data model and the protos is strange.

Doesn't this lead to the same situation where, if users don't explicitly state that they want explicit_bucket_histograms the histogram they get might just change from under them? [....] I think we should give users the possibility of opting into the histogram automatically changing, instead of them having to opt into keeping the histogram that they use stable.

Yes it would. It relies on the assumption that OTLP receivers should work with either explicit or exponential histograms, and thus the ambiguity of which histogram is used isn't relevant. The idea of opting into a stable histogram was an attempt to be able to retain some sort of ability to use exponentials as the default, since I think exponential histograms will ultimately prove superior to explicit bucket histograms in their ability to represent a very different data distributions without configuration.

With that said, the points that @pirgeo and @anuraaga make about a predictable histogram also resonate with me. While I think the ideal situation would be if we had exponential histograms as a stable part of the initial release and used as the default, I suppose a reasonable compromise would be to ensure that once exponential histograms are available, that there's an easy way to opt into them as the default.

@jack-berg
Copy link
Member Author

We spoke at the 4/5 Specification SIG and there appeared to be general consensus that retaining the "best possible histogram aggregation" whose definition is subject to change is a bad experience for users and something we should not have.

The hesitation around removing "best possible histogram aggregation" stems from the belief that once available, exponential histograms will likely be a better default histogram option. We discussed that it would be valuable to add an option to make exponential histogram the default aggregation for the OTLP exporter. I've opened #2471 to capture this sentiment, which hopefully unblocks this PR.

@jmacd
Copy link
Contributor

jmacd commented Apr 5, 2022

The discussion in today's Spec SIG leads to the following course of action:

  1. Accept this PR as is
  2. Accept Specify optional Exponential Histogram Aggregation, add example code in the data model #2252
  3. Add an OTLP exporter environment variable e.g., OTEL_EXPORTER_OTLP_DEFAULT_HISTOGRAM_AGGREGATION that effectively lets "exponential" be selected.

@jack-berg
Copy link
Member Author

@jmacd can you remove your change request?

@reyang
Copy link
Member

reyang commented Apr 11, 2022

@reyang reyang closed this Apr 11, 2022
@reyang reyang reopened this Apr 11, 2022
@reyang reyang closed this Apr 11, 2022
@reyang reyang reopened this Apr 11, 2022
@reyang reyang closed this Apr 12, 2022
@reyang reyang reopened this Apr 12, 2022
@reyang reyang closed this Apr 12, 2022
@reyang reyang reopened this Apr 12, 2022
@trask
Copy link
Member

trask commented Apr 12, 2022

@reyang I like that you keep trying 👍😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sdk Related to the SDK spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove "best Histogram Aggregation available" option
9 participants