BatchSpanProcessor deadlock in flushing spans #3886

Open
vipinsankhhwar opened this issue Apr 30, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@vipinsankhhwar

We have a Python 3 service (Gunicorn with gevent workers) instrumented with the OpenTelemetry library. We are using BatchSpanProcessor with a max_queue_size of 2048 and a max_export_batch_size of 512. Load on the service is very light, roughly 5-10 API calls per minute.
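For reference, here is a minimal sketch of roughly how the processor is wired up on our side; the OTLP HTTP exporter shown here is a placeholder (our actual exporter may differ), but the queue and batch sizes match the numbers above:

```python
# Minimal sketch of our setup; the exporter choice is a placeholder.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(),           # placeholder exporter
        max_queue_size=2048,          # queue size mentioned above
        max_export_batch_size=512,    # export batch size mentioned above
    )
)
trace.set_tracer_provider(provider)
```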

In the above service, we randomly see a significant memory spike in production in a few of the service instances. On looking further into the issue, we found that all API calls (greenlets) get stuck in opentelemetry/sdk/trace/export/__init__.py, line 235, in on_end, at the line with self.condition:. The reason is that the worker in opentelemetry/sdk/trace/export/__init__.py, line 264, is stuck while holding the lock, causing the API greenlets to hang forever, waiting for the lock so they can notify the condition variable that the queue length has grown beyond max_export_batch_size.
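To make the hang easier to follow, here is a simplified, self-contained sketch (not the SDK source) of the producer/consumer locking shape that BatchSpanProcessor relies on. Under gevent both sides run as greenlets on a single hub, so if the worker side holds the condition's lock and does not yield it back, every on_end() caller blocks at with condition:, which is exactly what the dumps below show:

```python
# Simplified sketch of the locking pattern only; this is NOT the actual
# SDK code. Names and sizes mirror our configuration.
import collections
import threading

MAX_QUEUE_SIZE = 2048
MAX_EXPORT_BATCH_SIZE = 512

queue = collections.deque(maxlen=MAX_QUEUE_SIZE)
condition = threading.Condition(threading.Lock())


def on_end(span):
    # Producer side: runs on every request greenlet when a span finishes.
    queue.appendleft(span)
    if len(queue) >= MAX_EXPORT_BATCH_SIZE:
        with condition:          # request greenlets are stuck acquiring this
            condition.notify()


def worker(export_batch):
    # Consumer side: the batch-export worker.
    while True:
        with condition:          # in our dumps the worker is parked in here
            condition.wait(timeout=5.0)
        while queue:
            export_batch(queue.pop())
```

(The sketch only shows the locking shape; the real SDK also handles flush, shutdown, and batch-size bookkeeping.)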

We do not have a simple reproducer as of now. Below are the greenlet dumps for the worker and the API greenlet states. The symptoms are very similar to this issue.

For now, we have disabled the OpenTelemetry instrumentation, and as a next step we are going to downgrade urllib3 to see whether that fixes the issue.

```
# Greenlet id:139830751128928 parent:139830869965216)[]
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/threading.py", line 937, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/threading.py", line 980, in _bootstrap_inner
  self.run()
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/threading.py", line 917, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 264, in worker
  self.condition.wait(timeout)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/threading.py", line 303, in wait
  if not self._is_owned():
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/threading.py", line 274, in _is_owned
  if self._lock.acquire(False):
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/gevent/thread.py", line 132, in acquire
  sleep()
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/gevent/hub.py", line 159, in sleep
  waiter.get()
```

```
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/flask/app.py", line 1488, in __call__
  return self.wsgi_app(environ, start_response)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/instrumentation/flask/__init__.py", line 356, in _wrapped_app
  result = wsgi_app(wrapped_app_environ, _start_response)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/flask/app.py", line 1479, in wsgi_app
  ctx.pop(error)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/flask/ctx.py", line 410, in pop
  self.app.do_teardown_request(exc)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/flask/app.py", line 1308, in do_teardown_request
  self.ensure_sync(func)(exc)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/instrumentation/flask/__init__.py", line 470, in _teardown_request
  activation.__exit__(None, None, None)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/contextlib.py", line 126, in __exit__
  next(self.gen)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/trace/__init__.py", line 600, in use_span
  span.end()
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 895, in end
  self._span_processor.on_end(self._readable_span())
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in on_end
  sp.on_end(span)
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 235, in on_end
  with self.condition:
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/threading.py", line 257, in __enter__
  return self._lock.__enter__()
File: "/usr/local/pyenv/versions/3.9.16/lib/python3.9/site-packages/gevent/thread.py", line 112, in acquire
  acquired = BoundedSemaphore.acquire(self, blocking, timeout)
```

vipinsankhhwar added the bug label on Apr 30, 2024