Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666

MrAlias · 2023-02-03T16:38:30Z

Identify all the normative requirements, recommendations, and options the specification defines as comments to this issue
Ensure the current metric SDK implementation is compliant with these normative requirements, recommendations, and options in those comments.

MadVikingGod · 2023-06-27T19:18:19Z

Requirements from reading the Push Metric Exporter Section of the spec.

Push Metric Exporter Interface Must have the following functions:
- Export
  - SDK MUST provide a way for the exporter to get the Meter information (e.g. name, version, etc.) associated with each Metric point.
  - Export will never be called concurrently for the same exporter instance. Export can be called again only after the current call returns.
    - (Interpretation) Push Reader must not call Export concurrently.
  - Export MUST NOT block indefinitely, there MUST be a reasonable upper limit after which the call must time out with an error result (Failure).
    - (Interpretation) There must be a cancellation mechanism for Export. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.
  - The default SDK SHOULD NOT implement retry logic [around Export]
  - Export must accept a batch of Metric Points (this is language Dependant).
  - Export Must return ExportResult - Success or Failure
- ForceFlush
  - ForceFlush SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.
  - ForceFlush SHOULD only be called in cases where it is absolutely necessary, such as when using some FaaS providers that may suspend the process after an invocation, but before the exporter exports the completed metrics.
  - ForceFlush SHOULD complete or abort within some timeout. ForceFlush can be implemented as a blocking API or an asynchronous API which notifies the caller via a callback or an event.
  - (optional) flush timeout configurable.
- Shutdown
  - Shutdown SHOULD be called only once for each MetricExporter instance.
    - (Interpretation) Push Reader must call Shutdown on its Exporter
    - (Interpretation) Push Reader must not call Shutdown after it has been called once.
  - (Non-Normative) After the call to Shutdown subsequent calls to Export are not allowed and should return a Failure result.
  - Shutdown SHOULD NOT block indefinitely (e.g. if it attempts to flush the data and the destination is unavailable).
    - (Interpretation) There must be a cancellation mechanism for Shutdown. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.
  - (optional) shutdown timeout configurable.

Notes:
Non-Normative means that there is a description in the spec but does not include normative vocabulary indicating requirements to the spec.

dashpole · 2023-07-24T18:58:09Z

Export(batch)

Our exporter interface has an export function that takes a batch of metrics as an argument.

opentelemetry-go/sdk/metric/exporter.go

Line 52 in e26d8bd

Export(context.Context, *metricdata.ResourceMetrics) error

The SDK MUST provide a way for the exporter to get the Meter information (e.g. name, version, etc.) associated with each Metric point.

The batch of metrics includes the meter name and version as the scope in the batch of metrics. Similar to OTLP, the scope is grouped with the metrics that were created by the meter:

opentelemetry-go/sdk/metric/metricdata/data.go

Lines 36 to 39 in e26d8bd

    
           // Scope is the Scope that the Meter was created with. 
        
           Scope instrumentation.Scope 
        
           // Metrics are a list of aggregations created by the Meter. 
        
           Metrics []Metrics

Export will never be called concurrently for the same exporter instance. Export can be called again only after the current call returns.

(Interpretation) Push Reader must not call Export concurrently.

collectAndExport calls from the periodic collect and forceflush are serialized in the run() function:

opentelemetry-go/sdk/metric/periodic_reader.go

Lines 168 to 185 in e26d8bd

    
           func (r *PeriodicReader) run(ctx context.Context, interval time.Duration) { 
        
           	ticker := newTicker(interval) 
        
           	defer ticker.Stop() 
        
           	for { 
        
           		select { 
        
           		case <-ticker.C: 
        
           			err := r.collectAndExport(ctx) 
        
           			if err != nil { 
        
           				otel.Handle(err) 
        
           			} 
        
           		case errCh := <-r.flushCh: 
        
           			errCh <- r.collectAndExport(ctx) 
        
           			ticker.Reset(interval) 
        
           		case <-ctx.Done(): 
        
           			return 
        
           		} 
        
           	}

The only other call to export comes during shutdown, which stops the above run() loop before exporting:

opentelemetry-go/sdk/metric/periodic_reader.go

Lines 326 to 340 in e26d8bd

    
           // Stop the run loop. 
        
           r.cancel() 
        
           <-r.done 
        
           // Any future call to Collect will now return ErrReaderShutdown. 
        
           ph := r.sdkProducer.Swap(produceHolder{ 
        
           	produce: shutdownProducer{}.produce, 
        
           }) 
        
           if ph != nil { // Reader was registered. 
        
           	// Flush pending telemetry. 
        
           	m := r.rmPool.Get().(*metricdata.ResourceMetrics) 
        
           	err = r.collect(ctx, ph, m) 
        
           	if err == nil { 
        
           		err = r.export(ctx, m)

Export MUST NOT block indefinitely, there MUST be a reasonable upper limit after which the call must time out with an error result (Failure).

(Interpretation) There must be a cancellation mechanism for Export. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.

Export takes a context, which provides a cancellation mechanism, and the reader's timeout is applied to the context before Export() is called:

opentelemetry-go/sdk/metric/exporter.go

Line 52 in e26d8bd

Export(context.Context, *metricdata.ResourceMetrics) error

opentelemetry-go/sdk/metric/periodic_reader.go

Lines 226 to 233 in e26d8bd

    
           ctx, cancel := context.WithTimeout(ctx, r.timeout) 
        
           defer cancel() 
        
           // TODO (#3047): Use a sync.Pool or persistent pointer instead of allocating rm every Collect. 
        
           rm := r.rmPool.Get().(*metricdata.ResourceMetrics) 
        
           err := r.Collect(ctx, rm) 
        
           if err == nil { 
        
           	err = r.export(ctx, rm)

The timeout is not applied during Shutdown(). That will be fixed by #4356

The default SDK SHOULD NOT implement retry logic, as the required logic is likely to depend heavily on the specific protocol and backend the metrics are being sent to.

We do not implement any retry logic.

Parameters: batch - a batch of Metric points.

Our export function takes a batch of metric points:

opentelemetry-go/sdk/metric/exporter.go

Line 52 in e26d8bd

Export(context.Context, *metricdata.ResourceMetrics) error

Note: it is highly recommended that implementors design the Metric data type based on the Data Model, rather than directly use the data types generated from the proto files (because the types generated from proto files are not guaranteed to be backward compatible).

We do not directly rely on the OTLP proto files. The metricdata correctly mirrors the data model.

Returns: ExportResult
ExportResult is one of:

Success - The batch has been successfully exported. For protocol exporters this typically means that the data is sent over the wire and delivered to the destination server.

Failure - exporting failed. The batch must be dropped. For example, this can happen when the batch contains bad data and cannot be serialized.

Export returns an error, which allows determining if the export succeeded or failed.

ForceFlush()
We have a ForceFlush method on Exporter:

opentelemetry-go/sdk/metric/exporter.go

Line 60 in e26d8bd

ForceFlush(context.Context) error

ForceFlush SHOULD provide a way to let the caller know whether it succeeded, failed or timed out.

ForceFlush returns an error.

ForceFlush SHOULD only be called in cases where it is absolutely necessary, such as when using some FaaS providers that may suspend the process after an invocation, but before the exporter exports the completed metrics.

We only invoke ForceFlush on exporters when ForceFlush is invoked on the associated Reader. The SDK does not make any other calls to ForceFlush:

opentelemetry-go/sdk/metric/periodic_reader.go

Lines 299 to 317 in e26d8bd

    
           func (r *PeriodicReader) ForceFlush(ctx context.Context) error { 
        
           	errCh := make(chan error, 1) 
        
           	select { 
        
           	case r.flushCh <- errCh: 
        
           		select { 
        
           		case err := <-errCh: 
        
           			if err != nil { 
        
           				return err 
        
           			} 
        
           			close(errCh) 
        
           		case <-ctx.Done(): 
        
           			return ctx.Err() 
        
           		} 
        
           	case <-r.done: 
        
           		return ErrReaderShutdown 
        
           	case <-ctx.Done(): 
        
           		return ctx.Err() 
        
           	} 
        
           	return r.exporter.ForceFlush(ctx)

Should we add disclaimers on ForceFlush to mirror this language in the specification?

ForceFlush SHOULD complete or abort within some timeout. ForceFlush can be implemented as a blocking API or an asynchronous API which notifies the caller via a callback or an event. OpenTelemetry SDK authors MAY decide if they want to make the flush timeout configurable.

ForceFlush is implemented as a blocking API. collectAndExport calls DO use the reader's timeout, but ForceFlush on the PeriodicReader does not apply a timeout to ForceFlush calls to the exporter.

Should we apply the PeriodicReader's timeout on the ForceFlush call to the exporter?

(optional) flush timeout configurable.

Timeout is configurable via the context argument.

Shutdown()

Our Exporter interface has a Shutdown Method:

opentelemetry-go/sdk/metric/exporter.go

Line 72 in e26d8bd

Shutdown(context.Context) error

Called when SDK is shut down.
Shutdown SHOULD be called only once for each MetricExporter instance.

(Interpretation) Push Reader must call Shutdown on its Exporter

opentelemetry-go/sdk/metric/periodic_reader.go

Line 345 in e26d8bd

sErr := r.exporter.Shutdown(ctx)

(Interpretation) Push Reader must not call Shutdown after it has been called once.

exporter.Shutdown() is called within a block that is invoked within shutdownOnce.Do(, which ensures exporter.Shutdown is only invoked once.

opentelemetry-go/sdk/metric/periodic_reader.go

Line 325 in e26d8bd

r.shutdownOnce.Do(func() {

(Non-Normative) After the call to Shutdown subsequent calls to Export are not allowed and should return a Failure result.

We return ErrReaderShutdown in subsequent calls to reader.Shutdown:

opentelemetry-go/sdk/metric/periodic_reader.go

Line 324 in e26d8bd

err := ErrReaderShutdown

Shutdown SHOULD NOT block indefinitely (e.g. if it attempts to flush the data and the destination is unavailable).

(Interpretation) There must be a cancellation mechanism for Shutdown. Also there is no way to stop execution of an exporter that won't respect cancellation, this is a limitation of the language.

Shutdown takes a context as a argument, which is used for timeout and cancellation

(optional) shutdown timeout configurable.

Timeout is configurable via the provided context.Context argument.

dashpole · 2023-07-24T19:42:36Z

After #4356 and #4359, the only item above that isn't addressed is: Should we apply a timeout on the ForceFlush call to the exporter?

I'm leaning toward not doing this for a few reasons:

Applying a timeout without making it configurable prevents users from lengthening the timeout using the provided context.
Adding explicit configuration just for the timeout of ForceFlush seems overkill
Using the existing timeout for export seems wrong. Someone would have to lengthen their export timeout if forceflush takes longer. ForceFlush may do more than export does, so I'm not sure this makes sense.
It isn't hard to set a timeout on the provided context.

We can discuss on Thursday.

pellared · 2023-07-25T07:54:04Z

Should we apply a timeout on the ForceFlush call to the exporter

I think that we already had some precedence where we agreed that using context.Context is idiomatic for Go and that we do not want to provide any other configuration or defaults. CC @MrAlias @MadVikingGod

MrAlias · 2023-07-27T21:15:38Z

From SIG meeting:

Ideally, the timeout in the passed context from the user would be used for both the ForceFlush and Shutdown methods. If there is no timeout there, use the export timeout.

MrAlias · 2023-07-27T21:21:08Z

From SIG meeting:

Ideally, the timeout in the passed context from the user would be used for both the ForceFlush and Shutdown methods. If there is no timeout there, use the export timeout.

#4377

MrAlias · 2023-07-30T14:48:09Z

Closing as this looks done. Please reopen if this was in error.

MrAlias mentioned this issue Feb 3, 2023

Have OTel TC review metric SDK and sign-off on stable release #3674

Closed

33 tasks

MrAlias added pkg:SDK Related to an SDK package area:metrics Part of OpenTelemetry Metrics labels Feb 3, 2023

MrAlias mentioned this issue May 30, 2023

Verify compliant metric SDK specification implementation: MeterProvider/Shutdown #3643

Closed

2 tasks

MadVikingGod mentioned this issue Jun 27, 2023

Verify compliant metric SDK specification implementation: MetricExporter #3665

Closed

2 tasks

MadVikingGod mentioned this issue Jun 27, 2023

Verify compliant metric SDK specification implementation: MetricExporter/Pull Metric Exporter #3667

Closed

2 tasks

dashpole self-assigned this Jul 24, 2023

This was referenced Jul 24, 2023

PeriodicReader.Shutdown now applies the periodic reader's timeout by default #4356

Merged

Add disclaimer to ForceFlush to recommend calling only when necessary #4359

Closed

MrAlias mentioned this issue Jul 27, 2023

Use PeriodicReader timeout for ForceFlush #4377

Merged

MrAlias closed this as completed Jul 30, 2023

MrAlias added this to the v1.17.0/v0.40.0 milestone Aug 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666

Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666

MrAlias commented Feb 3, 2023 •

edited

MadVikingGod commented Jun 27, 2023

dashpole commented Jul 24, 2023

dashpole commented Jul 24, 2023

pellared commented Jul 25, 2023

MrAlias commented Jul 27, 2023

MrAlias commented Jul 27, 2023

MrAlias commented Jul 30, 2023

Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666

Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666

Comments

MrAlias commented Feb 3, 2023 • edited

MadVikingGod commented Jun 27, 2023

dashpole commented Jul 24, 2023

dashpole commented Jul 24, 2023

pellared commented Jul 25, 2023

MrAlias commented Jul 27, 2023

MrAlias commented Jul 27, 2023

MrAlias commented Jul 30, 2023

MrAlias commented Feb 3, 2023 •

edited