Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666

Closed
2 tasks done
Tracked by #3674
MrAlias opened this issue Feb 3, 2023 · 7 comments
Closed
2 tasks done
Tracked by #3674
Assignees
Labels
area:metrics Part of OpenTelemetry Metrics pkg:SDK Related to an SDK package

Comments

@MrAlias
Copy link
Contributor

MrAlias commented Feb 3, 2023

  • Identify all the normative requirements, recommendations, and options the specification defines as comments to this issue
  • Ensure the current metric SDK implementation is compliant with these normative requirements, recommendations, and options in those comments.
@MadVikingGod
Copy link
Contributor

Requirements from reading the Push Metric Exporter Section of the spec.

  • Push Metric Exporter Interface Must have the following functions:
    • Export

      • SDK MUST provide a way for the exporter to get the Meter information (e.g. name, version, etc.) associated with each Metric point.

      • Export will never be called concurrently for the same exporter instance. Export can be called again only after the current call returns.

        • (Interpretation) Push Reader must not call Export concurrently.
      • Export MUST NOT block indefinitely, there MUST be a reasonable upper limit after which the call must time out with an error result (Failure).

        • (Interpretation) There must be a cancellation mechanism for Export. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.
      • The default SDK SHOULD NOT implement retry logic [around Export]

      • Export must accept a batch of Metric Points (this is language Dependant).
      • Export Must return ExportResult - Success or Failure
    • ForceFlush

      • ForceFlush SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.

      • ForceFlush SHOULD only be called in cases where it is absolutely necessary, such as when using some FaaS providers that may suspend the process after an invocation, but before the exporter exports the completed metrics.

      • ForceFlush SHOULD complete or abort within some timeout. ForceFlush can be implemented as a blocking API or an asynchronous API which notifies the caller via a callback or an event.

      • (optional) flush timeout configurable.
    • Shutdown

      • Shutdown SHOULD be called only once for each MetricExporter instance.

        • (Interpretation) Push Reader must call Shutdown on its Exporter
        • (Interpretation) Push Reader must not call Shutdown after it has been called once.
      • (Non-Normative) After the call to Shutdown subsequent calls to Export are not allowed and should return a Failure result.
      • Shutdown SHOULD NOT block indefinitely (e.g. if it attempts to flush the data and the destination is unavailable).
        • (Interpretation) There must be a cancellation mechanism for Shutdown. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.
      • (optional) shutdown timeout configurable.

Notes:
Non-Normative means that there is a description in the spec but does not include normative vocabulary indicating requirements to the spec.

@dashpole
Copy link
Contributor

Export(batch)

Our exporter interface has an export function that takes a batch of metrics as an argument.

Export(context.Context, *metricdata.ResourceMetrics) error

The SDK MUST provide a way for the exporter to get the Meter information (e.g. name, version, etc.) associated with each Metric point.

The batch of metrics includes the meter name and version as the scope in the batch of metrics. Similar to OTLP, the scope is grouped with the metrics that were created by the meter:

// Scope is the Scope that the Meter was created with.
Scope instrumentation.Scope
// Metrics are a list of aggregations created by the Meter.
Metrics []Metrics

Export will never be called concurrently for the same exporter instance. Export can be called again only after the current call returns.

(Interpretation) Push Reader must not call Export concurrently.

collectAndExport calls from the periodic collect and forceflush are serialized in the run() function:

func (r *PeriodicReader) run(ctx context.Context, interval time.Duration) {
ticker := newTicker(interval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
err := r.collectAndExport(ctx)
if err != nil {
otel.Handle(err)
}
case errCh := <-r.flushCh:
errCh <- r.collectAndExport(ctx)
ticker.Reset(interval)
case <-ctx.Done():
return
}
}

The only other call to export comes during shutdown, which stops the above run() loop before exporting:

// Stop the run loop.
r.cancel()
<-r.done
// Any future call to Collect will now return ErrReaderShutdown.
ph := r.sdkProducer.Swap(produceHolder{
produce: shutdownProducer{}.produce,
})
if ph != nil { // Reader was registered.
// Flush pending telemetry.
m := r.rmPool.Get().(*metricdata.ResourceMetrics)
err = r.collect(ctx, ph, m)
if err == nil {
err = r.export(ctx, m)

Export MUST NOT block indefinitely, there MUST be a reasonable upper limit after which the call must time out with an error result (Failure).

(Interpretation) There must be a cancellation mechanism for Export. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.

Export takes a context, which provides a cancellation mechanism, and the reader's timeout is applied to the context before Export() is called:

Export(context.Context, *metricdata.ResourceMetrics) error

ctx, cancel := context.WithTimeout(ctx, r.timeout)
defer cancel()
// TODO (#3047): Use a sync.Pool or persistent pointer instead of allocating rm every Collect.
rm := r.rmPool.Get().(*metricdata.ResourceMetrics)
err := r.Collect(ctx, rm)
if err == nil {
err = r.export(ctx, rm)

The timeout is not applied during Shutdown(). That will be fixed by #4356

The default SDK SHOULD NOT implement retry logic, as the required logic is likely to depend heavily on the specific protocol and backend the metrics are being sent to.

We do not implement any retry logic.

Parameters: batch - a batch of Metric points.

Our export function takes a batch of metric points:

Export(context.Context, *metricdata.ResourceMetrics) error

Note: it is highly recommended that implementors design the Metric data type based on the Data Model, rather than directly use the data types generated from the proto files (because the types generated from proto files are not guaranteed to be backward compatible).

We do not directly rely on the OTLP proto files. The metricdata correctly mirrors the data model.

Returns: ExportResult
ExportResult is one of:

  • Success - The batch has been successfully exported. For protocol exporters this typically means that the data is sent over the wire and delivered to the destination server.
  • Failure - exporting failed. The batch must be dropped. For example, this can happen when the batch contains bad data and cannot be serialized.

Export returns an error, which allows determining if the export succeeded or failed.

ForceFlush()
We have a ForceFlush method on Exporter:

ForceFlush(context.Context) error

ForceFlush SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.

ForceFlush returns an error.

ForceFlush SHOULD only be called in cases where it is absolutely necessary, such as when using some FaaS providers that may suspend the process after an invocation, but before the exporter exports the completed metrics.

We only invoke ForceFlush on exporters when ForceFlush is invoked on the associated Reader. The SDK does not make any other calls to ForceFlush:

func (r *PeriodicReader) ForceFlush(ctx context.Context) error {
errCh := make(chan error, 1)
select {
case r.flushCh <- errCh:
select {
case err := <-errCh:
if err != nil {
return err
}
close(errCh)
case <-ctx.Done():
return ctx.Err()
}
case <-r.done:
return ErrReaderShutdown
case <-ctx.Done():
return ctx.Err()
}
return r.exporter.ForceFlush(ctx)

Should we add disclaimers on ForceFlush to mirror this language in the specification?

ForceFlush SHOULD complete or abort within some timeout. ForceFlush can be implemented as a blocking API or an asynchronous API which notifies the caller via a callback or an event. OpenTelemetry SDK authors MAY decide if they want to make the flush timeout configurable.

ForceFlush is implemented as a blocking API. collectAndExport calls DO use the reader's timeout, but ForceFlush on the PeriodicReader does not apply a timeout to ForceFlush calls to the exporter.

Should we apply the PeriodicReader's timeout on the ForceFlush call to the exporter?

(optional) flush timeout configurable.

Timeout is configurable via the context argument.

Shutdown()

Our Exporter interface has a Shutdown Method:

Shutdown(context.Context) error

Called when SDK is shut down.
Shutdown SHOULD be called only once for each MetricExporter instance.

(Interpretation) Push Reader must call Shutdown on its Exporter

sErr := r.exporter.Shutdown(ctx)

(Interpretation) Push Reader must not call Shutdown after it has been called once.

exporter.Shutdown() is called within a block that is invoked within shutdownOnce.Do(, which ensures exporter.Shutdown is only invoked once.

r.shutdownOnce.Do(func() {

(Non-Normative) After the call to Shutdown subsequent calls to Export are not allowed and should return a Failure result.

We return ErrReaderShutdown in subsequent calls to reader.Shutdown:

err := ErrReaderShutdown

Shutdown SHOULD NOT block indefinitely (e.g. if it attempts to flush the data and the destination is unavailable).

(Interpretation) There must be a cancellation mechanism for Shutdown. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.

Shutdown takes a context as a argument, which is used for timeout and cancellation

(optional) shutdown timeout configurable.

Timeout is configurable via the provided context.Context argument.

@dashpole
Copy link
Contributor

After #4356 and #4359, the only item above that isn't addressed is: Should we apply a timeout on the ForceFlush call to the exporter?

I'm leaning toward not doing this for a few reasons:

  • Applying a timeout without making it configurable prevents users from lengthening the timeout using the provided context.
  • Adding explicit configuration just for the timeout of ForceFlush seems overkill
  • Using the existing timeout for export seems wrong. Someone would have to lengthen their export timeout if forceflush takes longer. ForceFlush may do more than export does, so I'm not sure this makes sense.
  • It isn't hard to set a timeout on the provided context.

We can discuss on Thursday.

@pellared
Copy link
Member

Should we apply a timeout on the ForceFlush call to the exporter

I think that we already had some precedence where we agreed that using context.Context is idiomatic for Go and that we do not want to provide any other configuration or defaults. CC @MrAlias @MadVikingGod

@MrAlias
Copy link
Contributor Author

MrAlias commented Jul 27, 2023

From SIG meeting:

Ideally, the timeout in the passed context from the user would be used for both the ForceFlush and Shutdown methods. If there is no timeout there, use the export timeout.

@MrAlias
Copy link
Contributor Author

MrAlias commented Jul 27, 2023

From SIG meeting:

Ideally, the timeout in the passed context from the user would be used for both the ForceFlush and Shutdown methods. If there is no timeout there, use the export timeout.

#4377

@MrAlias
Copy link
Contributor Author

MrAlias commented Jul 30, 2023

Closing as this looks done. Please reopen if this was in error.

@MrAlias MrAlias closed this as completed Jul 30, 2023
@MrAlias MrAlias added this to the v1.17.0/v0.40.0 milestone Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:metrics Part of OpenTelemetry Metrics pkg:SDK Related to an SDK package
Projects
No open projects
Development

No branches or pull requests

4 participants