Possible race condition leading to a connection reset if worker is gracefully terminating #2315
Replies: 3 comments
-
Hi, we are seeing the same(?) problem: when a worker restarts due to max-requests, sometimes a request gets lost. In these cases, an RST/ACK can be observed.

Setup: our app is an API server with async FastAPI endpoints. It receives relatively large requests (say 2-15 KB); I think the larger requests have a better chance of triggering the race. While trying to repro this, I'm also seeing this error message in the error log now:
-
Oh yeah, it's not very rare for us. With max_requests = 10000 and 4 workers, we hit this every few hours :-)
-
Repro code, `asgi_sample.py`:

```python
async def app(scope, receive, send):
    headers = [(b"content-type", b"text/html")]
    body = b"<html>hi!<br>"
    await send({"type": "http.response.start", "status": 200, "headers": headers})
    await send({"type": "http.response.body", "body": body})
```

gunicorn invocation:
curl script, `repro_vpt.sh`:

```bash
#!/bin/bash
REMOTE='192.168.103.39:9222'
echo 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' | \
curl -XPUT 'http://'$REMOTE'/404/404/404/404/404/404/404' \
  -H 'user-agent: python-requests/2.31.0' \
  -H 'sentry-trace: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' \
  -H 'baggage: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE' \
  -H 'x-request-id: FFFFFFFFFFFFFFFFFFFF' \
  -H 'accept-encoding: gzip, deflate' \
  -H 'accept: */*' \
  -H 'content-type: application/json' \
  --data-binary @-
echo Running
```

Note that `repro_vpt.sh` has the IP address of the host running gunicorn in the curl command line. Having the "useless data" in the curl call seems to help with reproducing, but it is not completely necessary.
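For anyone who prefers to drive the reproducer from Python instead of curl, a rough equivalent is sketched below. The host, port, path, and payload size mirror the shell script above; `send_once` and the loop structure are my own additions, not from the thread.

```python
# Hypothetical Python version of the curl reproducer: repeatedly send the same
# large PUT request and count connection resets. Host/port are taken from the
# shell script above; adjust them to the machine running gunicorn.
import http.client

REMOTE_HOST = "192.168.103.39"
REMOTE_PORT = 9222

# Large-ish payload; per the thread, bigger requests seem to trigger the race
# more often (the original used ~1.3 KB of filler).
BODY = b"A" * 1300


def send_once(host: str = REMOTE_HOST, port: int = REMOTE_PORT) -> int:
    """Send one PUT request and return the HTTP status code.

    A worker hitting the race mid-request surfaces here as
    ConnectionResetError (or http.client.RemoteDisconnected).
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request(
            "PUT",
            "/404/404/404/404/404/404/404",
            body=BODY,
            headers={"content-type": "application/json"},
        )
        return conn.getresponse().status
    finally:
        conn.close()


def hammer(n: int = 100_000) -> int:
    """Send n requests and return how many hit a connection reset."""
    resets = 0
    for i in range(n):
        try:
            send_once()
        except (ConnectionResetError, http.client.RemoteDisconnected):
            resets += 1
            print(f"reset #{resets} at request {i}")
    return resets
```

Running `hammer()` from several processes at once approximates the concurrent load described later in the thread.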
-
We have encountered a relatively rare connection error which is probably due to a race condition as the `uvicorn` worker is trying to shut down. Here is the setup:

- `uvicorn` worker
- `gunicorn` with `--max-requests` for regularly restarting workers

I can reproduce it with Python 3.11, both `uvloop` and `asyncio`, but couldn't reproduce with `asyncio` and Python 3.12.

To reproduce, I launch the app below as:
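The exact launch command was not preserved in this thread. A gunicorn configuration file matching the setup described above might look like the following sketch; every value here is an assumption, not the author's actual configuration.

```python
# gunicorn_conf.py -- illustrative config for the setup described in the post:
# uvicorn workers under gunicorn, restarted regularly via max_requests.
# All values are guesses based on the thread, not the author's real settings.

workers = 4                                     # several workers, as mentioned above
worker_class = "uvicorn.workers.UvicornWorker"  # run uvicorn under gunicorn
max_requests = 10_000                           # recycle workers regularly
bind = "0.0.0.0:9222"                           # port used by the curl script
```

This would be launched as something like `gunicorn -c gunicorn_conf.py asgi_sample:app` (again, an assumed command line).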
Then I stress test the application with many concurrent users constantly hitting the API. After some waiting I eventually hit `Connection reset by peer` errors.

I did some initial investigation into what is happening. Here is a `tcpdump` for one of these errors, which I tried to correlate with events in the code. It always happens around the time when `max-requests` is reached and the worker is shutting down. It seems that in certain cases the worker doesn't shut down gracefully despite data having just arrived in the TCP stack.
After some deep-dive I noticed that every time the error happens, `self.cycle` is `None` in `HttpToolsProtocol.shutdown()`, and if I am correct the reverse is true as well. It seems to me that adding a guard block into `httptools_impl.py` or `h11_impl.py` solves the issue, but I am not really sure what this means.
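To illustrate the kind of guard being described, here is a simplified sketch, not uvicorn's actual implementation: a protocol that, on shutdown, closes the transport immediately only when no request cycle is in flight and no bytes have arrived that haven't yet been parsed into a cycle. The class and attribute names (`pending_data`, `closed`) are invented for this sketch.

```python
# Simplified model of the race described above (NOT uvicorn's real code).
# The race: the worker is told to shut down, sees no active request cycle
# (self.cycle is None), and closes the socket -- even though request bytes
# may have just landed in the TCP stack, producing an RST for the client.
class ProtocolSketch:
    def __init__(self) -> None:
        self.cycle = None          # current request/response cycle, if any
        self.pending_data = False  # bytes received but not yet parsed
        self.closed = False
        self.keep_alive = True

    def data_received(self, data: bytes) -> None:
        # Real code would feed the HTTP parser here, which may create a cycle;
        # for the sketch we just record that unprocessed bytes exist.
        self.pending_data = True

    def shutdown(self) -> None:
        if self.cycle is None and not self.pending_data:
            # Truly idle: safe to close the transport immediately.
            self.closed = True
        else:
            # A request is in progress (or bytes just arrived): let it finish,
            # then close the connection by disabling keep-alive.
            self.keep_alive = False
```

Without the `pending_data` check, the first branch would also fire when data had just arrived but no cycle existed yet, which matches the `self.cycle is None` observation above.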