
Exporters shutdown takes longer than a minute when failing to send metrics/traces #3309

Open
Elli-Rid opened this issue May 11, 2023 · 7 comments · May be fixed by #3764
Labels
bug Something isn't working

Comments

@Elli-Rid

Elli-Rid commented May 11, 2023

Environment: macOS, RHEL 8 (the OS doesn't matter)

Configure the traces and/or metrics gRPC exporter with an invalid collector URL:

  # init tracing
  trace_provider = TracerProvider(resource=resource)
  trace_exporter = create_trace_exporter(exporter, otlp_config, jaeger_config)
  trace_processor = BatchSpanProcessor(trace_exporter)
  trace_provider.add_span_processor(trace_processor)
  telemetry_trace.set_tracer_provider(trace_provider)

  # init metrics
  global _telemetry_meter  # pylint: disable=global-statement
  metrics_exporter = create_metrics_exporter(exporter, otlp_config)
  metric_reader = PeriodicExportingMetricReader(
      metrics_exporter, export_interval_millis=export_interval, export_timeout_millis=3000
  )
  metrics_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
  telemetry_metrics.set_meter_provider(metrics_provider)
  _telemetry_meter = telemetry_metrics.get_meter_provider().get_meter(service_name, str(service_version))
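
For context, a minimal hypothetical sketch of what create_trace_exporter / create_metrics_exporter could return when pointed at an unreachable collector (the helper signatures and endpoint value are illustrative, not taken from the original setup):

    # Hypothetical stand-ins for the helpers above. The endpoint is an example of an
    # invalid/unreachable collector URL that triggers the retry behaviour described below.
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    def create_trace_exporter(*_args, **_kwargs):
        return OTLPSpanExporter(endpoint="http://unreachable-collector:4317", insecure=True)

    def create_metrics_exporter(*_args, **_kwargs):
        return OTLPMetricExporter(endpoint="http://unreachable-collector:4317", insecure=True)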

What is the expected behavior?
When executing:

meter_provider = telemetry_metrics.get_meter_provider()
if meter_provider and (shutdown_meter := getattr(meter_provider, 'shutdown', None)):
    shutdown_meter(timeout_millis=3000)

it should shut down in about 3 seconds.

What is the actual behavior?
It takes ~60 seconds to shut down, with logs like:

WARNING - opentelemetry.exporter.otlp.proto.grpc.exporter::_export:363 | Transient error StatusCode.UNAVAILABLE encountered while exporting metrics, retrying in 8s.

Additional context
From the looks of it, both the metrics and traces exporters use this base/mixin, which has a timeout_millis parameter.

Now, going one level up, OTLPMetricExporter.shutdown calls it correctly:

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        OTLPExporterMixin.shutdown(self, timeout_millis=timeout_millis)

However, PeriodicExportingMetricReader.shutdown calls OTLPMetricExporter.shutdown with a different kwarg name, which seems to be completely ignored:

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        deadline_ns = time_ns() + timeout_millis * 10**6

        def _shutdown():
            self._shutdown = True

        did_set = self._shutdown_once.do_once(_shutdown)
        if not did_set:
            _logger.warning("Can't shutdown multiple times")
            return

        self._shutdown_event.set()
        if self._daemon_thread:
            self._daemon_thread.join(
                timeout=(deadline_ns - time_ns()) / 10**9
            )
        self._exporter.shutdown(timeout=(deadline_ns - time_ns()) / 10**6)  # <--- timeout vs timeout_millis
  • Note that, given the use of time_ns(), even with the correct kwarg name this could end up supplying a negative timeout value (see the sketch below).
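
A runnable sketch of that deadline arithmetic (the 4-second sleep stands in for a daemon-thread join that eats the whole budget; the numbers are illustrative):

    from time import sleep, time_ns

    timeout_millis = 3000
    deadline_ns = time_ns() + timeout_millis * 10**6

    # Stand-in for self._daemon_thread.join(...) consuming the whole budget while the
    # exporter keeps retrying against an unreachable collector.
    sleep(4)

    remaining_millis = (deadline_ns - time_ns()) / 10**6
    print(remaining_millis)  # negative, yet this is what would be passed to exporter.shutdown()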

As for traces, the exporter calls OTLPExporterMixin.shutdown without propagating any timeout at all.

This leads to bad behaviour when combined with Kubernetes and async applications, since a thread lock without a timeout blocks the event loop and leaves containers hanging in the cluster until they are killed.

@Elli-Rid Elli-Rid added the bug Something isn't working label May 11, 2023
@Elli-Rid
Author

Looking deeper into this issue, it appears that the shutdown methods of both PeriodicExportingMetricReader and BatchSpanProcessor first perform self._daemon_thread.join and only then notify the child loops via self._exporter.shutdown, which I believe is an incorrect use of threading.

Finally, the OTLPExporterMixin._export method has this logic in place:

        max_value = 64
        # expo returns a generator that yields delay values which grow
        # exponentially. Once delay is greater than max_value, the yielded
        # value will remain constant.
        for delay in _expo(max_value=max_value):

which basically means that even if we notify the exporter loop to shut down, it can be in the middle of a sleep of up to a minute and only become aware of the event afterwards (illustrated below).
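
A simplified illustration of that retry loop (the _expo generator here is a stand-in for the real backoff helper, not its actual implementation):

    import time

    def _expo(max_value=64):
        # Stand-in for the backoff generator: yields 1, 2, 4, ... capped at max_value.
        delay = 1
        while True:
            yield min(delay, max_value)
            delay *= 2

    for delay in _expo(max_value=64):
        # ... attempt the export, hit StatusCode.UNAVAILABLE ...
        time.sleep(delay)  # unconditional sleep: a shutdown signalled now is only
                           # noticed after the sleep (up to 64 s) has elapsed
        break  # break here only so this sketch terminates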

A possible solution could be to add an exporter shutdown event and a slightly customized sleep method that is aware of that event:

    def sleep(self, delay: int) -> None:
        # Sleep in 1-second slices so a shutdown signalled mid-backoff is noticed
        # within a second instead of only after the full delay (up to 64 s).
        slept = 0
        while (slept < delay) and not self._shutdown_event.wait(1):
            slept += 1

something like this ^

And finally, changing the order in which shutdown happens, so that it first notifies the exporter(s) and only then waits for the thread to die (.join()); see the sketch below.
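
Something along these lines, sketched against the PeriodicExportingMetricReader.shutdown shown above (kwarg name fixed to timeout_millis; this is not the actual implementation):

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        deadline_ns = time_ns() + timeout_millis * 10**6

        def _shutdown():
            self._shutdown = True

        did_set = self._shutdown_once.do_once(_shutdown)
        if not did_set:
            _logger.warning("Can't shutdown multiple times")
            return

        # Notify the exporter (and its retry loop) first, with the right kwarg name ...
        self._shutdown_event.set()
        self._exporter.shutdown(timeout_millis=(deadline_ns - time_ns()) / 10**6)

        # ... and only then wait for the daemon thread to die.
        if self._daemon_thread:
            self._daemon_thread.join(timeout=(deadline_ns - time_ns()) / 10**9)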

@EdLeafe

EdLeafe commented May 19, 2023

I have a similar issue: some users of our application run it on systems where the required port is blocked by a firewall. The code defines the UNAVAILABLE error as transient and thus keeps retrying on the blocked port. I would request a setting to not retry on StatusCode.UNAVAILABLE, because retrying makes no sense in this case.

@corbands

Same for me.
Agreed with @Elli-Rid: the method invocations inside BatchSpanProcessor.shutdown should be swapped to

self.span_exporter.shutdown()
self.worker_thread.join()

Another change concerns OTLPSpanExporter._export. It's better to change time.sleep to threading.Event.wait(delay), so as not to wait out the last retry, which can last up to 64 seconds. Something like this:

import threading

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


class OTLPSpanExporterExtended(OTLPSpanExporter):
    def __init__(self, *args, **kwargs):
        self.shutdown_event = threading.Event()
        super().__init__(*args, **kwargs)

    def shutdown(self) -> None:
        super().shutdown()
        self.shutdown_event.set()

    def _export(self, *args, **kwargs):
        ...
        # Event.wait instead of time.sleep(delay): returns as soon as shutdown_event is set
        self.shutdown_event.wait(delay)
        ...

@ocelotl
Contributor

ocelotl commented Jun 13, 2023

@Elli-Rid it seems like you almost have a solution; do you think you could open a PR for this issue? 🙂

@rajat315315

Kindly review my PR. I hope that shutting down the exporters before calling thread.join() resolves the problem, so users won't have to wait.

@aabmass
Member

aabmass commented Jul 20, 2023

I have a bunch of additional context in #2663, but I'm not sure if it's all still relevant. @rajat315315 your PR looks valuable, but I think the biggest issue is the one called out by @Elli-Rid above in #3309 (comment): the exponential backoff sleeps unconditionally.

@rajat315315 could you make a separate PR with this?

@rajat315315

@aabmass I have updated my PR to wait on a shutdown_event instead of sleeping unconditionally.

Arnatious added commits to Arnatious/opentelemetry-python that referenced this issue Mar 7, 2024
@Arnatious Arnatious linked a pull request Mar 7, 2024 that will close this issue