
Prometheus and OTel High Resolution Histograms incompatibilities #2611

Closed
gouthamve opened this issue Jun 8, 2022 · 12 comments · Fixed by #2633
Labels: spec:metrics (Related to the specification/metrics directory), triaged-needmoreinfo (The issue is triaged - the OTel community needs more information to decide)

Comments

@gouthamve
Member

In the recent Spec SIG call on May 31st, we discussed the incompatibilities between the specs of Prometheus Sparse High Resolution Histograms and OTel Exponential Histograms.

I've documented the differences and how things work in this doc here: https://docs.google.com/document/d/11z8hc_T4RHPHaXuWiETxGa81Ix89dBV5jh9byh9ySgE/edit

Please add your suggestions and comments to the doc; we will be discussing this in the Prometheus WG meeting on June 22 (8am-9am PT). Feel free to join that meeting if you're interested.

@jmacd
Contributor

jmacd commented Jul 7, 2022

During the review of #2633, I both discovered a mistake in the exposition for scales <= 0 and was surprised to learn something new about Prometheus' plans. The Prometheus (proposed) mapping function is off-by-one relative to what is contained in #2633. This is not an incompatibility; it just means that to convert between Prometheus and OpenTelemetry bucket indexes one has to add or subtract one. On the surface, this feels like an unnecessary defect. @beorn7 even made it sound like artificial complexity:

Just as a side note: If you changed this definition to "greater than base**(index-1) and less than or equal to base**index", you would save a whole bunch of - 1 terms below. (Not saying that you should do it. This is really just an internal implementation detail. But during the discussion, it was mentioned that the "upper inclusive" way was unnatural because you need those - 1 terms. But in reality, it depends on the definition of index. (And fittingly, in Prometheus, the bucket with id==0 is the one with an upper bound of 1.))

This just illustrates how few people are actually following this debate. I will state here that @beorn7 is confounding the issues. The implied change in this statement would be a breaking change relative to OTLP v0.11 (unlike #2633), but it bears analysis that (unfortunately) the Prometheus team has not done.

When we look at scale values > 0, where we are able to approximate the mapping function using a built-in logarithm function, these approaches look more-or-less natural. There are three variations:

Description            Scale > 0 formula
OTLP v0.11             math.Floor(math.Log(value) * scaleFactor)
OTel PR #2633          math.Ceil(math.Log(value) * scaleFactor) - 1
Prometheus (proposed)  math.Ceil(math.Log(value) * scaleFactor)

At this point, if we had to choose what appears more or less natural, both OTLP v0.11 and Prometheus (proposed) look equally natural. PR #2633 looks unnatural, by some definition, because of the -1, which would appear to be an unnecessary complication relative to either alternative (even though removing it would be a breaking change).
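
To make the comparison concrete, here is a minimal, illustrative Go sketch of the three scale > 0 variants (the helper names are mine, and scaleFactor = 2^scale / ln(2) is the factor assumed by the formulas in the table; the math import is assumed):

func scaleFactor(scale int) float64 {
    // 2^scale / ln(2), so that math.Log(value)*scaleFactor(scale) is the
    // base-(2^(2^-scale)) logarithm of value.
    return math.Ldexp(math.Log2E, scale)
}

func indexV011(value float64, scale int) int {
    return int(math.Floor(math.Log(value) * scaleFactor(scale)))
}

func indexPR2633(value float64, scale int) int {
    return int(math.Ceil(math.Log(value)*scaleFactor(scale))) - 1
}

func indexPromProposed(value float64, scale int) int {
    return int(math.Ceil(math.Log(value) * scaleFactor(scale)))
}

For example, for value 3 at scale 1 (base = sqrt(2)), log2(3)*2 is about 3.17, so the OTLP v0.11 and #2633 indexes are both 3 while the Prometheus (proposed) index is 4, which is exactly the off-by-one described above.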

Now, comparing these three approaches for scales <= 0, we will see differences that make OTLP v0.11 look the most "natural", as follows.

In OTLP v0.11, for scale <= 0, the formula is simplest because IEEE floating point has a built-in (exactly lower-inclusive) base-2 logarithm function.

func MapToIndex(value float64) int32 {
    // scale <= 0 here, so -scale is a valid (non-negative) shift count.
    return getBase2(value) >> -scale
}

The getBase2() function can be implemented using bitwise operations and arithmetic or by calling math.Frexp():

func getBase2(value float64) int32 {
    _, exp := math.Frexp(value)
    return int32(exp) - 1 // The -1 is because math.Frexp uses a range of [0.5, 1).
}
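
For reference, the bitwise-and-arithmetic variant mentioned above might look like the following sketch (illustrative name; it assumes a normal, positive float64 and skips subnormal handling):

func getBase2Bits(value float64) int32 {
    // IEEE 754 float64: the exponent field occupies bits 52..62 with a bias of 1023.
    bits := math.Float64bits(value)
    return int32((bits>>52)&0x7FF) - 1023
}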

In PR #2633, to change from lower-inclusive to upper-inclusive for power-of-two values (not breaking otherwise), we need a correction for exact powers of two. The correction is based on the significand:

func MapToIndex(value float64) int32 {
    exp := getBase2(value)
    // The following expression subtracts 1 exactly when the significand is zero,
    // i.e., when value is an exact power of two.
    exp += int32((getSignificand(value) - 1) >> SignificandWidth)
    return exp >> -scale
}

This assumes getSignificand(), which extracts the 52-bit (i.e., SignificandWidth) significand from the float64 value (returned as a signed integer, so the shift above propagates the sign). Equivalently, using math.Frexp():

func MapToIndex(value float64) int32 {
    frac, exp := math.Frexp(value)
    if frac == 0.5 {
        // Exact power of two: move it into the bucket below.
        exp--
    }
    return int32(exp-1) >> -scale
}
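
For completeness, the getSignificand() helper assumed above might be sketched like this (illustrative; the signed return type is what makes the (getSignificand(value) - 1) >> SignificandWidth expression an arithmetic, sign-propagating shift):

const SignificandWidth = 52 // number of explicit significand bits in a float64

func getSignificand(value float64) int64 {
    // Mask off the low 52 bits of the IEEE 754 representation.
    return int64(math.Float64bits(value)) & ((1 << SignificandWidth) - 1)
}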

For the Prometheus (proposed) mapping, the result is similar to that of PR #2633, with an additional complication: we cannot directly use the built-in shift-right operator to compute the index as we have above, because it will not round correctly. We have to add one after the shift.

One approach:

func MapToIndex(value float64) int32 {
    exp := getBase2(value)
    // The following expression subtracts 1 exactly when the significand is zero,
    // i.e., when value is an exact power of two.
    exp += int32((getSignificand(value) - 1) >> SignificandWidth)
    // Shift, then add 1 to correct the rounding.
    return ((exp - 1) >> -scale) + 1
}

Note that the right-shift operation used above is applied "off-by-one"; that is, we have an expression ((X - 1) >> S) + 1. It turns out that any time there is a change of scale under this formulation, the index has to be modified using a similar formula. For example:

// downscaleIndex returns the new bucket index corresponding to `idx` for a downward change of scale.
func downscaleIndex(idx, scaleChange int32) int32 {
    return ((idx - 1) >> scaleChange) + 1
}

Hopefully, this demonstrates why the OpenTelemetry project still considers the OTLP v0.11 design to be the simpler option: it allows direct use of the hardware base-2 exponent and the right-shift instruction for downscaling. The compromise in #2633 is an acceptable middle ground where we add special cases for exact powers of two but preserve the ability to use right-shift for downscaling histogram buckets.
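
To make that contrast concrete, here is a sketch (illustrative names; a map is used only for brevity) of merging bucket counts on a downscale under the OTLP v0.11 / #2633 formulation, where the plain right-shift is sufficient:

// downscaleCounts merges per-index counts when the scale decreases by
// scaleChange (> 0). Under this formulation the new index is simply
// idx >> scaleChange; under the Prometheus (proposed) formulation it would
// be ((idx - 1) >> scaleChange) + 1, as shown above.
func downscaleCounts(counts map[int32]uint64, scaleChange uint) map[int32]uint64 {
    merged := make(map[int32]uint64, len(counts))
    for idx, count := range counts {
        merged[idx>>scaleChange] += count
    }
    return merged
}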

jmacd assigned jsuereth and unassigned SergeyKanzhelev Jul 8, 2022
@jsuereth
Contributor

I know this has likely been discussed, but I think it's important we push on some of these technical discussions from the perspective of requirements and tradeoffs. I'm going to list the MAJOR requirements that OpenTelemetry had when designing exponential histograms (and you can see that the things @jmacd is fighting for reflect these requirements).

OTel Exponential Histogram Requirements

This is a post-design list of all the considerations that went into OTel's exponential histogram. SOME of these requirements exist because it was proven that we have algorithms capable of meeting them.

Basic Behavior

  • Exponential histograms allow users to calculate percentiles at query time, e.g. latency monitoring + alerting.
  • Users should be able to visualize the distribution of data via Exponential Histograms.

Convenience

  • Exponential Histograms should not require the user to specify dynamic bounds to record; instead, these should ideally be inferred.
    • Consequence: We expect re-scaling operations to be frequent during "histogram bootstrap".
    • Consequence: We pick a default scale + bucket factor that meets our desired accuracy requirements.
  • Exponential Histograms should be opt-in where existing histograms are used today.
  • Exponential Histograms should be a clear win where percentile calculation can be done query-side rather than in-process in instrumentation.
    • Consequence: Memory efficiency requirements
    • Consequence: accuracy requirements.
  • [Low] Exponential Histograms should be easily implementable in all programming languages.
    • Consequence: We want math.Log usage to be "fast enough" for naive implementations where a precomputed index lookup may not be practical in a language.

Efficiency

  • Exponential Histograms should have runtime cost as close to the same order of magnitude as explicit bucket histograms as possible.
    • Consequence: We are highly sensitive to
  • Exponential Histograms should have the same instrumentation runtime performance as existing low-error-rate "summary" calculations leveraging similarly sized exemplar pools or exponential techniques. This includes all cases where gauges of percentiles are calculated at ~95% accuracy, e.g. Micrometer's distribution calculation.

Accuracy

  • Exponential Histogram metric streams should be able to achieve 95% accuracy (at query time) when calculating percentiles.
    • Note: We targeted this bar while limiting ourselves to 320 buckets. We are also OK losing accuracy as the dynamic range expands (see this comment, e.g.). A rough sketch of how scale, bucket count, and error relate follows this list.
  • Users should be able to choose runtime overhead vs. accuracy via selecting the amount of memory (# of buckets) used in exponential histograms.
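
As referenced above, a rough sketch of how scale, bucket count, and accuracy relate (illustrative only; it uses the arithmetic-midpoint error bound, and the exact error definition varies between write-ups; math import assumed):

// histogramAccuracy relates a scale to the bucket base, the worst-case
// relative error when a value is reported as its bucket midpoint, and the
// dynamic range covered by numBuckets contiguous buckets.
// For example, scale 3 gives base ≈ 1.09, relative error ≈ 4.3% (roughly the
// 95% accuracy target), and 320 buckets cover a dynamic range of 2^40.
func histogramAccuracy(scale, numBuckets int) (base, relErr, dynamicRange float64) {
    base = math.Exp2(math.Exp2(float64(-scale))) // base = 2^(2^-scale)
    relErr = (base - 1) / (base + 1)
    dynamicRange = math.Pow(base, float64(numBuckets))
    return
}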

Issues with Prometheus plans for Sparse Histograms

Prometheus would like to preserve backwards compatibility with existing histogram semantics. This means having metric samples with an le label, where the sample's value is the number of sampled measurements that fall less than or equal to that bucket's boundary.
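
For reference, a sketch (illustrative name) of what the le-style upper boundary of an OTel exponential bucket would be, assuming the (base^index, base^(index+1)] convention adopted in #2633:

// bucketUpperBound returns the "less than or equal to" boundary of the
// exponential bucket identified by index at the given scale.
func bucketUpperBound(index, scale int32) float64 {
    base := math.Exp2(math.Exp2(float64(-scale))) // base = 2^(2^-scale)
    return math.Pow(base, float64(index+1))
}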

The proposal here outlines a path forward for OTel + Prometheus that adds this new requirement and attempts to resolve the resulting issues.

However, I still see the following issues unresolved and unaddressed in the proposal:

  • What is the performance impact on rescaling? We expect this operation to be very important during histogram bootstrap. See @jmacd's comments concerning this here.
  • What is the accuracy impact of switching? I think we're all assuming it will be trivial, but @jmacd's discoveries make me think more investigation is required.

TL;DR:

I think the crux of the argument between Prometheus / OpenTelemetry revolves around which requirement is more important:

  • Runtime / in-process overhead of instrumentation calculating a histogram
  • "Backwards compatibility" of histograms in prometheus.

My take: almost every user of monitoring/metrics I've seen would prefer lower-overhead instrumentation, as cost is one of the most significant concerns when adding instrumentation to code. This is one of those decisions that would be hard to tune the instrumentation for later, so I'd like us to really understand the performance impact of the solutions being proposed and trade off these two priorities from an end-to-end user perspective.

@weyert

weyert commented Jul 14, 2022

Yes, from a consumer perspective I am more concerned about the resources (e.g. memory) that are needed when emitting metrics or when consuming them in Prometheus.

Regarding backward compatibility, I personally think this might cause more confusion than it is worth. Currently in Prometheus you need to define the bucket sizes. The exponential histogram's main feature is that you don't need to do this; instead you define the level of precision you want. I think these two are a bit in conflict. I think it would be totally reasonable to introduce a new metric type in Prometheus.

@gouthamve
Member Author

Hi, sorry for the delay in answering; it's been a busy week.

I will do a benchmark early next week to see what changes need to be made and how much more memory/CPU the change in #2633 takes, especially when rescaling. I do not expect there to be more memory or CPU usage, but hopefully the benchmarks will make that clear.
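
For what it's worth, a benchmark along these lines could be as simple as the following sketch (a Go testing benchmark; MapToIndex stands in for whichever mapping variant is measured, the value range is arbitrary, and the math, math/rand, and testing imports are assumed):

func BenchmarkMapToIndex(b *testing.B) {
    // Pre-generate inputs so the loop measures the mapping, not rand.
    values := make([]float64, 1024)
    for i := range values {
        values[i] = math.Ldexp(1+rand.Float64(), rand.Intn(40)-20)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = MapToIndex(values[i%len(values)])
    }
}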

Regarding backward compatibility, I personally think this might cause more confusion than it is worth

The backward compatibility is to allow applications which are instrumented with the newer sparse histograms to be scraped by older versions of Prometheus. We also believe that we can make the histogram_quantile function work across both old and new histograms, and that way if you upgrade your code from the old histograms to new, you don't need to change your alerts / dashboards (using histogram_quantile).

@jmacd
Contributor

jmacd commented Jul 14, 2022

I do not believe that we should use "benchmarking" as the yardstick to evaluate these proposals. The number of CPU cycles spent on additional work here is truly negligible.

The big concern being raised here, to me, is about code complexity. Of the three forms that were outlined in my post above, each of them has the same relative accuracy, but they are successively more difficult to implement. The time spent in update is small relative to the total cost of reporting the metric. What I don't want to see is a long comment explaining why you have to perform an operation like ((X-1)>>S)+1 in a dozen places where a simpler design would've let you perform X>>S instead.

@jsuereth
Contributor

The backward compatibility is to allow applications which are instrumented with the newer sparse histograms to be scraped by older versions of Prometheus. We also believe that we can make the histogram_quantile function work across both old and new histograms, and that way if you upgrade your code from the old histograms to new, you don't need to change your alerts / dashboards (using histogram_quantile).

@gouthamve Does prometheus allow flexible bucket boundaries per-metric-stream already? Can an old prometheus pull in an exponential histogram if bucket boundaries change during its lifecycle?

Also, thanks for the response!

@jmacd
Contributor

jmacd commented Jul 15, 2022

I have back-ported Lightstep's reviewed, tested, and running-in-production exponential histogram data structure to the OTel-Go repository here: open-telemetry/opentelemetry-go#3022

@gouthamve
Member Author

Of the three forms that were outlined in my post above, each of them has the same relative accuracy, but they are successively more difficult to implement.

Gotcha. Do you think #2633 is a decent compromise?

I do realise that the Prometheus (proposed) mapping caused more confusion than intended, and if it's too complicated, it can be ignored. The Prometheus community is happy with #2633.

Can an old prometheus pull in an exponential histogram if bucket boundaries change during its lifecycle?

Yes, as long as the bucket boundaries don't change frequently. It'll create more timeseries, and the query will likely break when dealing with the time-range where the change happens, but everywhere else it'll work.

jsuereth added the triaged-needmoreinfo label Jul 15, 2022
@jsuereth
Contributor

jsuereth commented Jul 15, 2022

Can an old prometheus pull in an exponential histogram if bucket boundaries change during its lifecycle?
Yes, as long as the bucket boundaries don't change frequently. It'll create more timeseries, and the query will likely break when dealing with the time-range where the change happens, but everywhere else it'll work.

I think we likely want to see how much this breaks in practice. I have concerns here: I feel like spikes in latency are LIKELY to break timeseries the first time they happen per-process. Can we get a good demo of exponential-on-existing-Prometheus so we can evaluate the tradeoffs?

@jsuereth
Contributor

jsuereth commented Jul 19, 2022

Follow-up discussion from the Specification SIG:

  • @gouthamve is going to look into dynamic rescaling of buckets in Prometheus and let us know if this use case will be supported in newer prometheus
  • We will NOT expose exponential buckets to older prometheus as explicit buckets. Instead we'll use a conversion to known bucket boundaries (cc @jmacd).

We still need to make progress on both the technical issues around:

  • < vs >
  • Zero threshold (limit near-zero buckets); a sketch follows below
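
On the second point, a minimal sketch of what a zero threshold does (illustrative type and field names; math import assumed):

type expHistogram struct {
    zeroThreshold float64 // values with |v| <= zeroThreshold go to the zero bucket
    zeroCount     uint64
    // positive/negative bucket counts elided
}

// record sketches how a zero threshold limits the number of near-zero buckets:
// tiny values are counted in a single zero bucket instead of being mapped to
// very negative bucket indexes.
func (h *expHistogram) record(v float64) {
    if math.Abs(v) <= h.zeroThreshold {
        h.zeroCount++
        return
    }
    // otherwise map |v| to a bucket index as in the earlier comments
}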

@gouthamve
Member Author

dynamic rescaling of buckets in Prometheus and let us know if this use case will be supported in newer prometheus

Yes, this is completely possible and one of the core ideas.

We will NOT expose exponential buckets to older prometheus as explicit buckets. Instead we'll use a conversion to known bucket boundaries.

This is also okay.

@jmacd
Contributor

jmacd commented Jul 27, 2022

Please consider approving #2633 @open-telemetry/specs-metrics-approvers
