
Exporters shutdown takes longer than a minute when failing to send metrics/traces #3309

Open
Elli-Rid opened this issue May 11, 2023 · 7 comments · May be fixed by #3764
Labels
bug Something isn't working

Comments

@Elli-Rid

Elli-Rid commented May 11, 2023

Environment: macOS, RHEL 8 (the OS doesn't matter)

Configure the traces and/or metrics gRPC exporter with an invalid collector URL:

  # init tracing
  trace_provider = TracerProvider(resource=resource)
  trace_exporter = create_trace_exporter(exporter, otlp_config, jaeger_config)
  trace_processor = BatchSpanProcessor(trace_exporter)
  trace_provider.add_span_processor(trace_processor)
  telemetry_trace.set_tracer_provider(trace_provider)

  # init metrics
  global _telemetry_meter  # pylint: disable=global-statement
  metrics_exporter = create_metrics_exporter(exporter, otlp_config)
  metric_reader = PeriodicExportingMetricReader(
      metrics_exporter, export_interval_millis=export_interval, export_timeout_millis=3000
  )
  metrics_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
  telemetry_metrics.set_meter_provider(metrics_provider)
  _telemetry_meter = telemetry_metrics.get_meter_provider().get_meter(service_name, str(service_version))
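
For context, a minimal hypothetical sketch of what create_trace_exporter / create_metrics_exporter could return when pointed at an unreachable collector (the helper signatures and endpoint value are illustrative, not taken from the original setup):

    # Hypothetical stand-ins for the helpers above. The endpoint is an example of an
    # invalid/unreachable collector URL that triggers the retry behaviour described below.
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    def create_trace_exporter(*_args, **_kwargs):
        return OTLPSpanExporter(endpoint="http://unreachable-collector:4317", insecure=True)

    def create_metrics_exporter(*_args, **_kwargs):
        return OTLPMetricExporter(endpoint="http://unreachable-collector:4317", insecure=True)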

What is the expected behavior?
When executing:

meter_provider = telemetry_metrics.get_meter_provider()
if meter_provider and (shutdown_meter := getattr(meter_provider, 'shutdown', None)):
    shutdown_meter(timeout_millis=3000)

it should shut down in about 3 seconds.

What is the actual behavior?
It takes ~60 seconds to shut down, with logs like:

WARNING - opentelemetry.exporter.otlp.proto.grpc.exporter::_export:363 | Transient error StatusCode.UNAVAILABLE encountered while exporting metrics, retrying in 8s.

Additional context
From the looks of it, both the metrics and traces exporters use this base/mixin, which has a timeout_millis parameter.

Now, going one level up, OTLPMetricExporter.shutdown calls it correctly:

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        OTLPExporterMixin.shutdown(self, timeout_millis=timeout_millis)

However, PeriodicExportingMetricReader.shutdown calls OTLPMetricExporter.shutdown with a different kwarg name, which seems to be completely ignored:

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        deadline_ns = time_ns() + timeout_millis * 10**6

        def _shutdown():
            self._shutdown = True

        did_set = self._shutdown_once.do_once(_shutdown)
        if not did_set:
            _logger.warning("Can't shutdown multiple times")
            return

        self._shutdown_event.set()
        if self._daemon_thread:
            self._daemon_thread.join(
                timeout=(deadline_ns - time_ns()) / 10**9
            )
        self._exporter.shutdown(timeout=(deadline_ns - time_ns()) / 10**6)  # <--- timeout vs timeout_millis
  • Note that, given the use of time_ns(), even with the correct kwarg name this could end up supplying a negative timeout value (see the sketch below).
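
A runnable sketch of that deadline arithmetic (the 4-second sleep stands in for a daemon-thread join that eats the whole budget; the numbers are illustrative):

    from time import sleep, time_ns

    timeout_millis = 3000
    deadline_ns = time_ns() + timeout_millis * 10**6

    # Stand-in for self._daemon_thread.join(...) consuming the whole budget while the
    # exporter keeps retrying against an unreachable collector.
    sleep(4)

    remaining_millis = (deadline_ns - time_ns()) / 10**6
    print(remaining_millis)  # negative, yet this is what would be passed to exporter.shutdown()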

As for traces, the exporter calls OTLPExporterMixin.shutdown without propagating any timeout at all.

This leads to bad behaviour when combined with Kubernetes and async applications, since a thread lock without a timeout blocks the event loop and leaves containers hanging in the cluster until they are killed.

@Elli-Rid Elli-Rid added the bug Something isn't working label May 11, 2023
@Elli-Rid
Author

Looking deeper into this issue, it appears that the shutdown methods of both PeriodicExportingMetricReader and BatchSpanProcessor first perform self._daemon_thread.join and only then notify the child loops via self._exporter.shutdown, which I believe is an incorrect use of threading.

Finally, the OTLPExporterMixin._export method has this logic in place:

        max_value = 64
        # expo returns a generator that yields delay values which grow
        # exponentially. Once delay is greater than max_value, the yielded
        # value will remain constant.
        for delay in _expo(max_value=max_value):

which basically means that even if we notify the exporter loop to shut down, it can be in the middle of a sleep of up to a minute and only become aware of the event afterwards (illustrated below).
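
A simplified illustration of that retry loop (the _expo generator here is a stand-in for the real backoff helper, not its actual implementation):

    import time

    def _expo(max_value=64):
        # Stand-in for the backoff generator: yields 1, 2, 4, ... capped at max_value.
        delay = 1
        while True:
            yield min(delay, max_value)
            delay *= 2

    for delay in _expo(max_value=64):
        # ... attempt the export, hit StatusCode.UNAVAILABLE ...
        time.sleep(delay)  # unconditional sleep: a shutdown signalled now is only
                           # noticed after the sleep (up to 64 s) has elapsed
        break  # break here only so this sketch terminates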

A possible solution could be to add an exporter shutdown event and a slightly customized sleep method that is aware of that event:

    def sleep(self, delay: int) -> None:
        # Sleep in 1-second slices so a shutdown signalled mid-backoff is noticed
        # within a second instead of only after the full delay (up to 64 s).
        slept = 0
        while (slept < delay) and not self._shutdown_event.wait(1):
            slept += 1

something like this ^

And finally, changing the order in which shutdown happens, so that it first notifies the exporter(s) and only then waits for the thread to die (.join()); see the sketch below.
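
Something along these lines, sketched against the PeriodicExportingMetricReader.shutdown shown above (kwarg name fixed to timeout_millis; this is not the actual implementation):

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        deadline_ns = time_ns() + timeout_millis * 10**6

        def _shutdown():
            self._shutdown = True

        did_set = self._shutdown_once.do_once(_shutdown)
        if not did_set:
            _logger.warning("Can't shutdown multiple times")
            return

        # Notify the exporter (and its retry loop) first, with the right kwarg name ...
        self._shutdown_event.set()
        self._exporter.shutdown(timeout_millis=(deadline_ns - time_ns()) / 10**6)

        # ... and only then wait for the daemon thread to die.
        if self._daemon_thread:
            self._daemon_thread.join(timeout=(deadline_ns - time_ns()) / 10**9)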

@EdLeafe

EdLeafe commented May 19, 2023

I have a similar issue: some users of our application run it on systems where the required port is blocked by a firewall. The code defines the UNAVAILABLE error as transient and thus keeps retrying on the blocked port. I would request a setting to not retry on StatusCode.UNAVAILABLE, because retrying makes no sense in this case.

@corbands

Same for me.
Agreed with @Elli-Rid: the method invocations inside BatchSpanProcessor.shutdown should be swapped to

self.span_exporter.shutdown()
self.worker_thread.join()

Another change concerns OTLPSpanExporter._export. It's better to change time.sleep to threading.Event.wait(delay), so as not to wait out the last retry, which can last up to 64 seconds. Something like this:

import threading

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


class OTLPSpanExporterExtended(OTLPSpanExporter):
    def __init__(self, *args, **kwargs):
        self.shutdown_event = threading.Event()
        super().__init__(*args, **kwargs)

    def shutdown(self) -> None:
        super().shutdown()
        self.shutdown_event.set()

    def _export(self, *args, **kwargs):
        ...
        # Event.wait instead of time.sleep(delay): returns as soon as shutdown_event is set
        self.shutdown_event.wait(delay)
        ...

@ocelotl
Contributor

ocelotl commented Jun 13, 2023

@Elli-Rid it seems like you almost have a solution; do you think you could open a PR for this issue? 🙂

@rajat315315

Kindly review my PR. I hope that shutting down the exporters before calling thread.join() resolves the problem, so users won't have to wait.

@aabmass
Member

aabmass commented Jul 20, 2023

I have a bunch of additional context in #2663, but I'm not sure if it's all still relevant. @rajat315315 your PR looks valuable, but I think the biggest issue is the one called out by @Elli-Rid above in #3309 (comment): the exponential backoff sleeps unconditionally.

@rajat315315 could you make a separate PR with this?

@rajat315315

@aabmass I have updated my PR to wait on a shutdown_event instead of sleeping unconditionally.

Arnatious added commits to Arnatious/opentelemetry-python that referenced this issue Mar 7, 2024
@Arnatious Arnatious linked a pull request Mar 7, 2024 that will close this issue