Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebSocketApp close() hangs on broken connection despite of timeout #592

Open
yhhsteven opened this issue Dec 24, 2019 · 13 comments
Open

WebSocketApp close() hangs on broken connection despite of timeout #592

yhhsteven opened this issue Dec 24, 2019 · 13 comments
Labels

Comments

@yhhsteven
Copy link

When my WebSocket client try to reconnect, it is set to closing the original connection first. Howerver, the target WebSocket server is in a proxied environment behind nginx. It might times out after few hour. Then the close() hangs the whole program for 10-15 mins.

From the observation based on following code, it probably hangs in "frame = self.recv_frame()". So no matter how many timeout I set, it didn't work. I wonder if this is a bug, or something I should change. Please give me some advises, thank you!

websocket/_core.py close()

try:
    self.connected = False
    self.send(struct.pack('!H', status) +
              reason, ABNF.OPCODE_CLOSE)
    sock_timeout = self.sock.gettimeout()
    self.sock.settimeout(timeout)
    start_time = time.time()
    print(f'websocket.close() entering timeout loop gettimeout()={self.sock.gettimeout()}')
    while timeout is None or time.time() - start_time < timeout:
        try:
            frame = self.recv_frame()
            if frame.opcode != ABNF.OPCODE_CLOSE:
                continue
            if isEnabledForError():
                recv_status = struct.unpack("!H", frame.data[0:2])[0]
                if recv_status != STATUS_NORMAL:
                    error("close status: " + repr(recv_status))
            break
        except Exception as e:
            print(f'websocket.close() timeout loop exception',e)
            break
    self.sock.settimeout(sock_timeout)
    print(f'websocket.close() entering sock.shutdown gettimeout()={self.sock.gettimeout()}')
    self.sock.shutdown(socket.SHUT_RDWR)
except Exception as e:
    print(f'websocket.close() exception',e)
    pass

self.shutdown()
@engn33r
Copy link
Collaborator

engn33r commented May 15, 2021

@yhhsteven when you say that your websocket tries to reconnect, how is the websocket reconnecting? Are you using run_forever(), which handles the reconnection automatically, or are you manually adding logic to reconnect your websocket?

It would be very helpful if you could share a minimum reproducible example with an explanation of what triggers this issue, so I can try to recreate the issue. Maybe the issue has been resolved in the latest code. The documentation timeout examples run without problems, so I would like to see how your situation is different.

@elderlabs
Copy link

elderlabs commented Aug 26, 2021

I'll pickup from where this user left off as the issue's still open, and I'm experiencing a similar one, if not the same entirely. Occasionally, our ISP's core router loses connection for whatever reason. When this happens, about 1/3 of the time, my WS is left in a hung state, documenting that it died, but not returning to service. I have two copies of the application running at all times (one for development and one for production), so it's rare that both are left in a hung state at the same time after an ISP outage, which leads me to believe that this might be something to do with where the outage occurs in code while the WS is running, to build on what @yhhsteven said originally.

This is where I'm catching the connection's death, however, my on_close listener isn't called at the end. I should also note the on_close listener has a logging mechanism that I'm looking for when these events occur. Typically, the last log I see is that it's detected the connection's death, but as on_close isn't called, there's no further logging to signify its resurrection.

I'm wondering if it's worth killing the WS instance entirely when these connection issues occur, and restarting it from scratch until the issue is resolved as a workaround. Perhaps I can call on_close manually instead...

Is there anything that reports the WS connection state? Perhaps this can be used to further track the issue on my end.

As I'm diving down the rabbit's hole here, and others have had similar issue in other tickets, I'm wondering if this has anything to do with connection deaths specifically. I use status code 1000 to signify the connection needs to be closed, rather than resumed, but if the connection to the internet is dead, could it be that this causes the wsApp to hang entirely? Perhaps it's less likely to hang without a status code in the close call. Lots to consider here, but definitely related to the unexpected connection interruption I mentioned in #686. In #612, you mention artifacts from previous connections. Perhaps this has something to do with that, but more a ghost of the previous connection, rather than an artifact, as the new connection and closure evert aren't called.

@elderlabs
Copy link

Bumping to see if there's been any progress on your end @engn33r. I do know you're very busy. Perhaps I'll look into the loop and see if there's a place things could potentially lock up when a close event's fired after an error (if the close event's even happening on connection error as it probably should be), as we seem to be dying somewhere in that event. I think I stated it above, but we're catching the error message, but not the closure. This doesn't happen every time, thus is somewhat difficult to reproduce.

@engn33r
Copy link
Collaborator

engn33r commented Sep 2, 2021

@elderlabs have you tried replicating this issue with multiple WebSocket servers, or do you always encounter it with the same server? I haven't looked at this problem closely yet, but if you have a minimum reproducible example that lets me recreate the issue on my end, I'll be able to dig deeper into this when I have the time. This seems to be the type of issue that's hard to recreate, which obviously makes it harder to debug. Hopefully a workaround can be found soon

@elderlabs
Copy link

elderlabs commented Sep 2, 2021

That doesn't feel relevant here. The issue is that close events aren't functioning when the connection dies. The upstream WS server is negligible in this case, though I'm only working with one, others with the issue aren't working with the same.

It should be replicable in testing with any WS server, then removing the network connection from the client while connected to an upstream server. The issue should present itself randomly in a small series of tests.

@elderlabs
Copy link

So I've tested something on my end that appears to correct the issue, and may highlight the issue in the WS App client. Calling on_close after the connection is severed does appear to force the WS back into action. I'm curious what the WS is hanging on internally when the internet connection is no longer available, but for the time being, this patch seems to work. For those interested, my force fix is here.

Perhaps this will help some time down the road when we finally identify what causes the WS App to hang when the network connection is interrupted.

@engn33r
Copy link
Collaborator

engn33r commented Oct 7, 2021

Thanks for the tip @elderlabs. As you probably noticed, I haven't had time to dig into the remaining open issues yet but it's good to know a workaround was found.

@elderlabs
Copy link

elderlabs commented Oct 7, 2021

No problem there, we've all had our busy times. The important thing is to remember time for family and rest.

This workaround does appear to handle the issue in the meantime, for those that need a quick fix. It's interesting that a dead connection doesn't trigger an error or run on_close() when .close() is called. Likely something no one planned for, but an interesting challenge, nonetheless. I'm interested in where it's hanging, or what's causing it to hang, given you can manually call on_close() afterwards.

I'm still investigating another issue I've seen, particularly object of type 'ABNF' has no len(), which I believe is buried somewhere in websocket-client, though I'm not certain yet. Likely worthy of another issue write-up, and tends to appear when the WS is already closed, but something inside websocket-client attempts to act upon it. I'll create a ticket for it once I know more. So far, I've only seen it in audio websockets, and in reports from users with spotty internet connections. So far, the most I've found suggests the culprit might be here or here, but still more to dig into before being certain. Seems to be some amount of validation run when close() is called in the first link. Second link is unlikely to be it, but noted for reference.

@engn33r engn33r added the bug label Oct 20, 2021
@elderlabs
Copy link

Following up on the TypeError("object of type 'ABNF' has no len()"), I now have reason to believe it comes from _core.py :L305. It seems to occur when there is data sent over the WS, but is interrupted. I'm still working out a means to silence the error, but it's quite difficult to do so as there's no way to tell when it's not going to be in an expected format, which only happens when the socket is unexpectedly closed. Still digging.

@engn33r
Copy link
Collaborator

engn33r commented Feb 25, 2022

A minor update, this might be the same issue reported in #777, which references an empty frame. There is no mention of the exact error in 777, but it might be related to the TypeError("object of type 'ABNF' has no len()") text you shared.

I finally caught up on your latest notes. I wrote some example code to replicate the error message you shared. The following code triggers the same error:

from websocket import ABNF
frame = ABNF.create_frame(b'', ABNF.OPCODE_CLOSE)
len(frame)

If that no length error appears each time you have this problem, an ABNF object is passed to len() instead of ABNF.data. For comparison, the following code passes ABNF.data (not the ABNF object) to len() and yields a result of zero without any error.

from websocket import ABNF
frame = ABNF(0, 0, 0, 0, opcode=ABNF.OPCODE_CLOSE, data="")
len(frame.data)
frame2 = ABNF.create_frame(b'', ABNF.OPCODE_CLOSE)
len(frame2.data)

Like you did, I looked for all occurrences of len() in the websocket-client code for places where ABNF may be passed to len(). I didn't see anything that looked obviously wrong, but the two more suspect cases of len() in my view were line 305 of _core.py, although data = frame.format() should set data to a string and return a length of zero at worst, and line 47 of _abnf.py, which might be able to return the no length error if an ABNF object is passed instead of data to the mask() function.

I'm still not sure how the error gets triggered, but you can try running the latest code with PR #795, which fixes a 10 year old bug.

@elderlabs
Copy link

elderlabs commented May 23, 2022

Following up, the issue is still present in the current release (1.3.2). I can also confirm the issue appears to be present in somewhat unstable network connections (in addition to closed connections), but further than that, I have little idea. Perhaps we can isolate and ignore an empty frame, unless there's some means of requesting the frame again, rather than terminating the connection entirely.

While the cause does appear to be external, ideally we should account for this behavior and offer some means of handling it. I wish there was more I could offer, but given the rarity of the issue, and how it appears under very specific conditions, this is the most I can offer at this time.

EDIT: a follow up to this. After several hours of debugging with an end-user, we found an issue within a wrapper in our library that seemingly broke error codes (pervious maintainer duplicated code already in websocket-client's handler). Now that we can properly see error codes on our end, our application exits before processing empty frames. There are likely still cases where this issue will appear, but one cause on our end is no longer an issue. The fix on our end is to look for an expected list of codes on close, and exit when appropriate, though the true fix may have been confirming we could see error codes to begin with.

@engn33r
Copy link
Collaborator

engn33r commented Jun 17, 2023

Similar to #612, this may have been solved by #918, in which case using v1.6.0 will help.

@Haoke98
Copy link

Haoke98 commented Nov 22, 2023

I have similar problem on the websocket connection.
My ws connection will be closed automatically per 24h.
After some researching on this, i have fixed this problem on my way:

def on_close(ws, close_code, close_reason):
    logging.warning(f"连接关闭回调别调用! 状态代码:[{close_code}],关闭原因:[{close_reason}]")
    if close_code == 1000:
        ws.run_forever(reconnect=5)
websocket.enableTrace(True)
 ws = AdvancedWebSocketApp(ws_address, on_open=on_open, on_message=on_message, on_close=on_close,
                              on_error=on_error)
# 断开后每隔5秒尝试重新建立连接
ws.run_forever(reconnect=5)

I don't mean anything else, I just hope to help developers with similar needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants