
Support for synchronous Gauge and an UpDownCounter that can be Set() #2318

Closed
jmacd opened this issue Feb 4, 2022 · 34 comments · Fixed by #3540
Labels
spec:metrics Related to the specification/metrics directory

Comments

@jmacd
Contributor

jmacd commented Feb 4, 2022

What are you trying to achieve?

There are long-standing requests to support a settable Gauge, as this is an API both Prometheus and Statsd support.

In the OpenTelemetry data model, the same request arises for an UpDownCounter that can be set. Whereas an asynchronous UpDownCounter can be set from a callback, users are asking for an API to set a Gauge or cumulative UpDownCounter from synchronous code (and, for that matter, a cumulative Counter).

The OpenTelemetry data model says exactly what kind of point should be produced for these events (Gauges and cumulative Sums), so we know what the API should be and what the data should look like, but still there's been an objection. Let's focus on Gauge, since it has a longer history.

You can argue that by making the OTel Gauge instrument only support asynchronous use from a callback, we are improving the efficiency of a monitoring solution by preventing the user from making extraneous observations.

The counter-argument is that you're making the user's life more difficult, particularly in situations where they are holding a value and do not want to defer it until a callback. See, e.g., #2054.

One of the objections to having a synchronous Gauge is that depending on the collection interval, there may or may not be gaps in the series. However, the data model discusses gap handling, and the resulting series with gaps is well-defined. The result, some will argue, is working as intended.

The big distinction in terms of implementation between the synchronous and asynchronous instruments inside the SDK is whether an aggregator will see many updates per interval or is expected to see one update per interval. Our specification is not very strong on what should happen when multiple observations are made for a single instrument and attributes during a collection interval. The result is currently unspecified.

Proposal

This proposal would be easy to pull off in OTel-Go because we are leaning toward using similar APIs for synchronous and asynchronous instruments, i.e., in OTel-Go an asynchronous instrument is just like a synchronous instrument that you only call from a registered callback.

To add the requested APIs:

  1. Allow registering "background asynchronous" Counter, UpDownCounter, and Gauge instruments. These have the same API as the asynchronous instruments, i.e., Observe(value) for setting cumulative and current values, but would be used outside of registered callbacks.
  2. Enforce that at most 1 value is observed per collection per aggregator. We have to prevent the natural merge function from happening if more than one observation is received between collections, otherwise this does not provide the requested functionality. This would resolve the unspecified behavior of the current specification for duplicate observations in favor of a strict last-value-wins policy.

The result would be as expected. If the user does not update a synchronous gauge during a collection interval, the result is missing data (i.e., a gap). If the user updates once or more between collections, the last value will be observed.

Overall, this seems like a small extension to the current specification, does not create any new instruments or APIs or data model, and gives users what they've been asking for.
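For illustration, here is a minimal sketch of what the proposed usage could look like from the caller's side, in Python-flavored code. The create_background_* constructors and the observe() calls are hypothetical names standing in for the proposal; they are not part of any existing OpenTelemetry API.

# Hypothetical sketch of the proposed "background asynchronous" instruments.
# None of the create_background_* methods below exist in the current API.
from opentelemetry import metrics

meter = metrics.get_meter("proposal-sketch")

queue_depth = meter.create_background_gauge("queue.depth")            # hypothetical
items_total = meter.create_background_up_down_counter("items.total")  # hypothetical

def handle_work(current_depth, running_total):
    # Called from ordinary synchronous code, outside any registered callback.
    # If several observations land in one collection interval, last value wins.
    queue_depth.observe(current_depth)
    items_total.observe(running_total)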

@emilgelman

Hey @jmacd,

I appreciate your effort on this matter. I was wondering if there was any progress made regarding this issue?

I'll try to provide some additional context. We are using the OpenTelemetry Go SDK to instrument our services.
In one of our services, we would like to measure the duration of a specific operation. We have the value during the execution of a specific synchronous function - so we can't use an async instrument - and we do not wish to use traces to achieve the same behavior because our backend has different billing and functionality for traces.

Our current workaround is to use a custom in-memory map (to achieve last-value aggregation) and an asynchronous gauge to report those map values, flushing it every collection interval. Having to implement a custom data structure just to support a simple metric type (which exists in practically every other metrics stack) is both cumbersome and unintuitive.
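For readers unfamiliar with this workaround, here is a minimal sketch of the pattern (shown in Python for brevity; the comment above is about the Go SDK): a dictionary keeps the last value per attribute set, and an observable gauge callback reports whatever is in the map at each collection.

from opentelemetry import metrics
from opentelemetry.metrics import Observation

meter = metrics.get_meter("last-value-workaround")

# Last-value store: attribute tuples -> most recent measurement.
_last_values = {}

def _report(options):
    # Called by the SDK at every collection interval.
    return [Observation(value, dict(attrs)) for attrs, value in _last_values.items()]

meter.create_observable_gauge("operation.duration", callbacks=[_report])

def record_duration(seconds, **attrs):
    # Called from synchronous application code when the value is known.
    _last_values[tuple(sorted(attrs.items()))] = seconds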

I think that the argument for not implementing this because it improves monitoring efficiency is somewhat inconsistent with the whole approach of OpenTelemetry. Is there anything else that can be done to make this happen?

@Aaronontheweb

Related since I asked about this in the dotnet implementation repo open-telemetry/opentelemetry-dotnet#2362 (comment)

Our use case is that we have millions of actors whose mailbox depth we would like to periodically report on - but only when the actor is actually processing messages. Having an observable gauge for each of those would be non-performant for our use cases - right now we're trying to use histograms and it's a bit of a hack.

@jmacd
Contributor Author

jmacd commented Apr 6, 2022

From CNCF Slack we have a use-case described:

if the CRD is present then it is denoted by 1 as the metric value and if it's not (if it's deleted) then it is denoted by 0 ...so everytime I want to set the metric value to 0 (and observe it) whenever I delete the CRD.

In this case, the application as built does not keep persistent state; it just observes changes to the current value of a cumulative sum or gauge quantity.

@bogdandrutu
Member

@Aaronontheweb why is the current "UpDownCounter" not good? Curious what exactly the metric is that you want to report about the actors.

@bogdandrutu
Member

if the CRD is present then it is denoted by 1 as the metric value and if it's not (if it's deleted) then it is denoted by 0 ...so everytime I want to set the metric value to 0 (and observe it) whenever I delete the CRD.

@jmacd now that I see this example better, can that be an UpDownCounter where +1 on start and -1 on end? I know it is a bit of an abuse of the API, but I want to understand if this is good as a short-term solution until we get more proof that we need a sync Gauge.
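A minimal sketch of that suggestion, in Python for illustration (the metric name is made up): presence is encoded as a sum that is incremented on creation and decremented on deletion.

from opentelemetry import metrics

meter = metrics.get_meter("crd-presence-sketch")
# The exported value is the running sum: 1 while the CRD exists, 0 after deletion.
crd_present = meter.create_up_down_counter("crd.present")

def on_crd_created(crd_name):
    crd_present.add(1, {"crd": crd_name})

def on_crd_deleted(crd_name):
    crd_present.add(-1, {"crd": crd_name})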

@Aaronontheweb

@Aaronontheweb why is the current "UpDownCounter" not good? Curious what exactly the metric is that you want to report about the actors.

Because I'm measuring the point-in-time queue depth at any given moment, rather than its rate of change. I can measure the rate at which actors process messages with a SumCounter and per-message processing latency with a histogram, but I really need a gauge to measure the backlog of the queue. Rather than create an ObservableGauge for hundreds of thousands / millions of small queues, it'd be better to share a single Gauge as a bounded metric that is differentiated by its attributes (which are low cardinality) - same as we do for the counter and histogram types.

@bogdandrutu
Member

@Aaronontheweb "UpDownCounter" is not for "rate". The "UpDownCounter" is as a name says a counter (sum) that supports positive and negative values, so your backlog will do "+1" when element is added and "-1" when is removed. And you get the right value.

@Aaronontheweb

I can't track when a value is added to it - I can only check it when the actor is running.

@tomasfreund

I have a similar issue - when processing messages from e.g. Azure EventHub, I have the offset and enqueued time of all messages in the batch and I get the last offset and last enqueued time from EventHub every time I fetch a new batch.
Ideally, I'd use this to calculate consumer lag (both in messages and seconds) and record the value in a gauge.
I can't use an UpDownCounter because I don't want to add to the value; I need to set the value, which comes from an external system.
I also don't want to use an asynchronous gauge because that would require me to write additional code just to handle fetching the value on demand and I would also need to do extra I/O every time metrics are collected.

There are many other cases where you might need a synchronous gauge. An example might be an IoT device where some module periodically reports values to the main control unit, and the control unit would like to record these values the moment it gets them. At this point there is no way to record gauge metrics this way. Sure, you could store them in local variables and then use an async gauge to collect the values at metrics collection time, but it seems very convoluted.
Of course an async gauge is useful for many things; it's just that I don't understand why it would be a good idea to only have an async version.

@Aaronontheweb

Any updates on this as a consideration?

@rlinehan

Would really, really like to see this supported.

Our use case: we run a set of tests against our service via an AWS Lambda every minute. We want to emit a gauge indicating whether the tests succeeded or not (basically a binary "1" or "0"). The code is much more complicated with an async gauge than a synchronous gauge since we just want to issue it once and right at the end of the test run, and because we're in a Lambda and need to make sure it emits while the Lambda is running.

@jmacd
Contributor Author

jmacd commented Jun 28, 2022

This is a top priority (IMO) and I will begin working on the (very straightforward) proposal in the coming few weeks.

@jmacd
Contributor Author

jmacd commented Jul 8, 2022

^^^ Here is the prototype I am working with.

@ty-elastic

Hi, I'd love to see support for synchronous Gauges as well! The lack of sync gauges can make it difficult to interface with existing libraries that push metrics to your app at intervals that they define. In this example (https://github.com/ty-elastic/videojs_observability), metrics are pushed to me, which I then copy into a global map. From there, I have to coax otel into turning around and reading/reporting those metrics (some as gauges) asynchronously. At best, the timing is wrong (the data is timestamped at the time of async reading rather than the time when it is pushed to me). A better approach would allow one to write gauge values synchronously with a proper timestamp; the last value is then reported when the exporter gathers metrics.

@jack-berg
Member

A better approach would allow one to write gauge values synchronously with a proper timestamp

I've been assuming that the timestamps reported for synchronous gauges would be the time of collection, not the time the instrument was most recently "set". A deviation from this would be inconsistent with other instruments / aggregations.

@ty-elastic

ty-elastic commented Jul 19, 2022

I've been assuming that the timestamps reported for synchronous gauges would be the time of collection, not the time the instrument was most recently "set". A deviation from this would be inconsistent with other instruments / aggregations.

Interesting... that is a reasonable assumption for sync counters, but for sync gauges it seems like ideally you would assert the timestamp and value together so that, from an observability perspective, you can say "oh - this gauge was at this water mark at this specific time" rather than "this gauge was at this water mark sometime within this collection interval". For most apps it probably doesn't matter. For "real-time" apps (e.g., a video compression engine), that difference could be confusing.

@zbindenren

zbindenren commented Dec 9, 2022

Any news on this? I think a lot of people need this.

@gaeljw

gaeljw commented Dec 13, 2022

Just a quick comment to describe a use-case of mine: we have an application that runs a scheduled task to count inconsistencies in our data. This computation might be long enough that we don't want to do it when the instrument is observed, but only on a schedule that the application itself defines and is responsible for.

We used to use Prometheus gauges.

With OpenTelemetry, this is not possible with gauges. Technically it would be possible to use a counter, but from a "functional" point of view we are not computing delta values; we are computing a "total value" at a scheduled rate.

Edit: I guess that the thing that makes more sense in my case is to store the value on my end somehow (in memory) and read it async via the async gauge.

@nickb937

I have a few applications that do data processing, and I would like to expose some internal variables as metrics around the internal health of the system. The metrics aren't time-series as such; they are samples of the value of a variable at a point in time that I want to emit to the monitoring system for observation. For example, several batches of CSV files come in each day and I may want to observe the number of lines read from those files.

The Gauge datatype looks like the correct type for publishing internal metrics because each sample is distinct from the previous sample.

However, there's no way to publish a Gauge value directly (synchronously), like this:

file_lines = open("/tmp/datafile.csv", "r").readlines()
metric_lines = meter.create_gauge("line-count")
metric_lines.set_value(len(file_lines))

It's unclear how I would go about emitting the value of a variable through an ObservableGauge without first publishing the variable into a global variable for sampling with the callback, which isn't a realistic solution:

from opentelemetry.metrics import Observation, get_meter

meter = get_meter(__name__)

LINE_COUNT = 0

def observable_gauge_func(options):
    global LINE_COUNT
    yield Observation(LINE_COUNT, {})

file_lines = open("/tmp/datafile.csv", "r").readlines()
meter.create_observable_gauge("line-count", [observable_gauge_func])
LINE_COUNT = len(file_lines)

@Woodrow1026

We also need this functionality very much.

In our scenario, there are a large number of test cases to be executed, but for performance reasons they are only run once a day (configurable). We use a histogram to record whether each case is successful (binary 0/1).

But now there are some test cases whose return values we want to track. A histogram is no longer appropriate for those, because individual values are lost in the buckets. Asynchronous gauges are also not suitable for us because we execute the test cases through scheduling, and using an UpDownCounter for this feels strange.

@benpoliquin

+1 on the need for this synchronous gauge type.

@ascrookes

ascrookes commented Mar 1, 2023

I could also benefit from a synchronous gauge type. Like others in this thread, I want to use it to measure queue depth and unique users in the queue across time. I'm using the Python library.

Do let me know if there is another metrics instrument that could be used for this but I didn't see one that would be a better fit.

In the meantime, I have a simple Python class that wraps the asynchronous gauge so I can set point-in-time values without using the callback logic in my application code. It only reports on the interval you have set on the metrics exporter. That means if I set a value and then set another value before the initial value has been exported, the first value will get dropped. There is probably a way to fix that, but for my use case that trade-off is fine.

from typing import Iterable

from opentelemetry import metrics

# Alias to make it easier to access the Observations class.
# This takes in a single value (float or int) and an optional dictionary
# of attributes.
# https://opentelemetry-python.readthedocs.io/en/latest/api/metrics.html#opentelemetry.metrics.Observation
MetricsGaugeObservation = metrics.Observation


class MetricsGauge:
    def __init__(self, name: str, meter: metrics.Meter):
        self._name = name
        self._meter = meter
        self._internal_gauge = None
        self._observations = []

    def set_observations(self, observations: Iterable[MetricsGaugeObservation]):
        self._observations = observations
        # The async gauge starts reporting as soon as it is created. Create
        # the async gauge once a real value has been set to avoid creating noise
        # prior to values getting set.
        if self._internal_gauge is None:
            self._internal_gauge = self._meter.create_observable_gauge(
                name=self._name, callbacks=[self._gauge_callback]
            )

    def _gauge_callback(self, _options):
        return self._observations


# In application code

meter = metrics.get_meter(__name__)
gauge = MetricsGauge("gauge.name", meter)
# When the gauge value changes
new_value, new_attributes = func_to_get_gauge_value_and_attributes()

# This method takes an iterable so that you can set multiple values and attributes
# for the same timestamp.
gauge.set_observations([MetricsGaugeObservation(new_value, new_attributes)])

@GlassonGlassoff

@jmacd Did you ever get any traction on the proposal you mentioned in July?

The lack of sync gauges is causing a massive headache for us in a service where state isn't saved, so there isn't a great way to handle callbacks.

@lsytj0413

any update for this proposal?

@triblondon

Here's my polyfill for JavaScript:

// Types from @opentelemetry/api
import { Meter, ObservableGauge } from '@opentelemetry/api';

class OTelGauge {
  private currentValue?: number;
  private gauge: ObservableGauge;
  constructor(meter: Meter, name: string, initValue?: number) {
    this.currentValue = initValue;
    this.gauge = meter.createObservableGauge(name);
    // Only report once a value has actually been set.
    this.gauge.addCallback((result) => {
      if (this.currentValue !== undefined) {
        result.observe(this.currentValue);
      }
    });
  }
  observeValue(newValue: number) {
    this.currentValue = newValue;
  }
}

@tommaybe

tommaybe commented May 5, 2023

For queue tracking purposes this one is also really essential for us.

@MatisseHack

Having a Set method on UpDownCounters would be really useful for my team! We maintain an event driven service and would like to track various metrics related to each event. At the moment we've resorted to exposing a LastEvent property on some of our classes that can be accessed by ObservableUpDownCounters, but this approach isn't nearly as elegant as using a Set method directly when handling the event.

@jaelrod

jaelrod commented May 25, 2023

A synchronous version of the gauge would be helpful for our use case as well, as we need to emit metrics with negative values set by absolute value rather than by incrementing, and currently the only support for negative values is the asynchronous gauge/up-down-counter or the synchronous up-down-counter.

@agunglotto

I've ported an application from the Prometheus client to OpenCensus. Both have a synchronous gauge implementation, so the switch wasn't that hard. Now OpenCensus is no longer supported starting from July and recommends using OpenTelemetry. And here there is only an asynchronous gauge.
It's really hard to understand why the developers decided "randomly" on the sync/async model in OpenTelemetry. And then to see this request open for one year...

@jack-berg
Member

FYI, I've built a prototype java implementation of this here and am planning on opening a spec PR with a change proposal.

@bogdandrutu
Member

bogdandrutu commented Jun 13, 2023

@jmacd very nicely written argument for how to implement Gauge, but I feel that there are not enough examples where the async/observer pattern does not work. The linked issue regarding the async UpDownCounter is very weak; I'm unable to understand the argument for why that needs to expose state...

jack-berg added a commit that referenced this issue Aug 16, 2023
…ence on metric point persistence (#3540)

Resolves #2318.

Extends the metrics API to include a new synchronous gauge instrument,
whose value can be set as measurements occur. This is useful when the
measurements are made accessible via subscription, such as JFR events.

There are interesting questions around the behavior of
`start_time_unix_nano` and whether or not sync gauges continue to export
series without new measurements. This PR has gone through some iteration
(available to see via commit history), but as it currently stands, I've
made the following changes:
- Add new language stating that metric reader's configurable aggregation
temporality as a function of instrument type influences the persistence
of metric points across successive collections, as described in this
#3540 (comment).
This influence occurs regardless of whether the resulting metric point
has an aggregation temporality field. In other words, sync gauge
persistence behavior is influenced by aggregation temporality selection.
- Add new language stating that metric reader's configurable aggregation
temporality as a function of instrument type influences the starting
timestamp of metric points. When an instrument is cumulative, its
starting timestamp will always be the same. When an instrument is delta,
its starting timestamp advances with successive collections. This
influence occurs regardless of whether the resulting metric point has an
aggregation temporality field. In other words, sync gauge starting
timestamp is influenced by aggregation temporality selection.

---------

Co-authored-by: Reiley Yang <reyang@microsoft.com>
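
For context, a minimal sketch of how the synchronous gauge introduced by #3540 is used, borrowing the consumer-lag example from earlier in the thread. Python is shown; the exact constructor and method names depend on the SDK and version.

from opentelemetry import metrics

meter = metrics.get_meter("sync-gauge-sketch")
# Synchronous gauge: the last value set per attribute set is exported at collection.
lag_gauge = meter.create_gauge("consumer.lag")

def on_batch(last_offset, committed_offset, partition):
    lag_gauge.set(last_offset - committed_offset, {"partition": partition})
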
@agunglotto

There are commits for Python, C++ and .Net - do you plan to provide it for other languages / SDKs like Golang etc.?

@freemansoft

freemansoft commented Feb 21, 2024

My bad. Saw that this was picked up in #3540

@amedveshchek

amedveshchek commented Feb 23, 2024

I found that the synchronous Gauge metric is already implemented for Golang, but in low-level code, which is still available for use -- they simply haven't implemented the convenience methods around it.

Here's the official code of telemetrygen tool, which submits the Gauge metrics by default: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/telemetrygen/internal/metrics/worker.go#L52-L64

So I used this code as a workaround for now. I will replace it with the real synchronous Gauge once it's implemented in the official lib for Golang.
