-
Notifications
You must be signed in to change notification settings - Fork 760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WebSocketApp close() hangs on broken connection despite of timeout #592
Comments
@yhhsteven when you say that your websocket tries to reconnect, how is the websocket reconnecting? Are you using run_forever(), which handles the reconnection automatically, or are you manually adding logic to reconnect your websocket? It would be very helpful if you could share a minimum reproducible example with an explanation of what triggers this issue, so I can try to recreate the issue. Maybe the issue has been resolved in the latest code. The documentation timeout examples run without problems, so I would like to see how your situation is different. |
I'll pickup from where this user left off as the issue's still open, and I'm experiencing a similar one, if not the same entirely. Occasionally, our ISP's core router loses connection for whatever reason. When this happens, about 1/3 of the time, my WS is left in a hung state, documenting that it died, but not returning to service. I have two copies of the application running at all times (one for development and one for production), so it's rare that both are left in a hung state at the same time after an ISP outage, which leads me to believe that this might be something to do with where the outage occurs in code while the WS is running, to build on what @yhhsteven said originally. This is where I'm catching the connection's death, however, my on_close listener isn't called at the end. I should also note the on_close listener has a logging mechanism that I'm looking for when these events occur. Typically, the last log I see is that it's detected the connection's death, but as on_close isn't called, there's no further logging to signify its resurrection. I'm wondering if it's worth killing the WS instance entirely when these connection issues occur, and restarting it from scratch until the issue is resolved as a workaround. Perhaps I can call on_close manually instead... Is there anything that reports the WS connection state? Perhaps this can be used to further track the issue on my end. As I'm diving down the rabbit's hole here, and others have had similar issue in other tickets, I'm wondering if this has anything to do with connection deaths specifically. I use status code 1000 to signify the connection needs to be closed, rather than resumed, but if the connection to the internet is dead, could it be that this causes the wsApp to hang entirely? Perhaps it's less likely to hang without a status code in the close call. Lots to consider here, but definitely related to the unexpected connection interruption I mentioned in #686. In #612, you mention artifacts from previous connections. Perhaps this has something to do with that, but more a ghost of the previous connection, rather than an artifact, as the new connection and closure evert aren't called. |
Bumping to see if there's been any progress on your end @engn33r. I do know you're very busy. Perhaps I'll look into the loop and see if there's a place things could potentially lock up when a close event's fired after an error (if the close event's even happening on connection error as it probably should be), as we seem to be dying somewhere in that event. I think I stated it above, but we're catching the error message, but not the closure. This doesn't happen every time, thus is somewhat difficult to reproduce. |
@elderlabs have you tried replicating this issue with multiple WebSocket servers, or do you always encounter it with the same server? I haven't looked at this problem closely yet, but if you have a minimum reproducible example that lets me recreate the issue on my end, I'll be able to dig deeper into this when I have the time. This seems to be the type of issue that's hard to recreate, which obviously makes it harder to debug. Hopefully a workaround can be found soon |
That doesn't feel relevant here. The issue is that close events aren't functioning when the connection dies. The upstream WS server is negligible in this case, though I'm only working with one, others with the issue aren't working with the same. It should be replicable in testing with any WS server, then removing the network connection from the client while connected to an upstream server. The issue should present itself randomly in a small series of tests. |
So I've tested something on my end that appears to correct the issue, and may highlight the issue in the WS App client. Calling Perhaps this will help some time down the road when we finally identify what causes the WS App to hang when the network connection is interrupted. |
Thanks for the tip @elderlabs. As you probably noticed, I haven't had time to dig into the remaining open issues yet but it's good to know a workaround was found. |
No problem there, we've all had our busy times. The important thing is to remember time for family and rest. This workaround does appear to handle the issue in the meantime, for those that need a quick fix. It's interesting that a dead connection doesn't trigger an error or run I'm still investigating another issue I've seen, particularly |
Following up on the |
A minor update, this might be the same issue reported in #777, which references an empty frame. There is no mention of the exact error in 777, but it might be related to the I finally caught up on your latest notes. I wrote some example code to replicate the error message you shared. The following code triggers the same error:
If that no length error appears each time you have this problem, an ABNF object is passed to len() instead of ABNF.data. For comparison, the following code passes ABNF.data (not the ABNF object) to len() and yields a result of zero without any error.
Like you did, I looked for all occurrences of len() in the websocket-client code for places where ABNF may be passed to len(). I didn't see anything that looked obviously wrong, but the two more suspect cases of len() in my view were line 305 of _core.py, although I'm still not sure how the error gets triggered, but you can try running the latest code with PR #795, which fixes a 10 year old bug. |
Following up, the issue is still present in the current release (1.3.2). I can also confirm the issue appears to be present in somewhat unstable network connections (in addition to closed connections), but further than that, I have little idea. Perhaps we can isolate and ignore an empty frame, unless there's some means of requesting the frame again, rather than terminating the connection entirely. While the cause does appear to be external, ideally we should account for this behavior and offer some means of handling it. I wish there was more I could offer, but given the rarity of the issue, and how it appears under very specific conditions, this is the most I can offer at this time. EDIT: a follow up to this. After several hours of debugging with an end-user, we found an issue within a wrapper in our library that seemingly broke error codes (pervious maintainer duplicated code already in websocket-client's handler). Now that we can properly see error codes on our end, our application exits before processing empty frames. There are likely still cases where this issue will appear, but one cause on our end is no longer an issue. The fix on our end is to look for an expected list of codes on close, and exit when appropriate, though the true fix may have been confirming we could see error codes to begin with. |
I have similar problem on the websocket connection. def on_close(ws, close_code, close_reason):
logging.warning(f"连接关闭回调别调用! 状态代码:[{close_code}],关闭原因:[{close_reason}]")
if close_code == 1000:
ws.run_forever(reconnect=5)
websocket.enableTrace(True)
ws = AdvancedWebSocketApp(ws_address, on_open=on_open, on_message=on_message, on_close=on_close,
on_error=on_error)
# 断开后每隔5秒尝试重新建立连接
ws.run_forever(reconnect=5) I don't mean anything else, I just hope to help developers with similar needs. |
When my WebSocket client try to reconnect, it is set to closing the original connection first. Howerver, the target WebSocket server is in a proxied environment behind nginx. It might times out after few hour. Then the close() hangs the whole program for 10-15 mins.
From the observation based on following code, it probably hangs in "frame = self.recv_frame()". So no matter how many timeout I set, it didn't work. I wonder if this is a bug, or something I should change. Please give me some advises, thank you!
websocket/_core.py close()
The text was updated successfully, but these errors were encountered: