-
-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(WebSocketShard): Add RESUMED and READY timeout #8759
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ |
Makes RESUMED timeout delay itself if there were events received in the last 20 seconds
Makes RESUMED timeout delay itself if there were events received in the last 20 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
п
This stuff looks great! I will test this fix on production 400 shards to see if it reduces or fixes the problem on my side 👍 |
I didn't test it on a large bot, because it was a rewrite of me moving my main bot from eris as it is not supported anymore. I tested it on a combination of mobile data + mobile hotspot + an unstable Linux wi-fi driver, with 4 shards. This is very unstable and I found this issue quickly as the test bot randomly died and never connected back, I had 4 shards spread across 2 processes to make sure the fix worked, I always looked through the debug logs and seeing that it got READY was rare, most of the times it got RESUMED and after 10 days all the shards were up and working before I had to restart my pc. You can check the session starts yourself, because before it died on the second or even the first reconnect and I was unable to check that at all |
Ok, good job 🔥 Edit: #9001 has been done for that 🎉Also, what do you think about this fix 2e1c68e ? It had for me a good impact on zombie connections (but it increased a lot the new logins), if we read the documenation of the > discord api : https://discord.com/developers/docs/topics/gateway#sending-heartbeats We should change the |
Co-authored-by: DraftMan <contact@draftman.fr>
I don't think we need this pr anymore since #8989 has fixed #8486 Moreover, we shouldn't be handling resumed and ready event timeouts by ourselves since if any connection goes dead it's going to be picked up by zombie connection code in a few seconds closeTimeout |
It can get stuck on RESUMING, where it should never heartbeat until RESUMED, so if it gets stuck there is no way to detect it |
Can it really get stuck there though? None of the errors in the issue you linked had that as their cause, all were related to reconnecting over and over and then getting stuck because of that. |
@gamer0mega If it gets stuck on resuming you need a better host for the bot. This is highly unlikely unless the host has a bad internet connection. Resuming -> Resumed depends on Our Server to Discord Server, if the resumed didn't happen even with a good connection (which is highly unlikely) its a discord issue. |
It is not a problem with the network, the socket gets disconnected, the library doesnt know that and keeps RESUMING for eternity |
@gamer0mega |
By that, I meant that the library should reconnect even if my networking sucks and not get stuck P.S. You disagree? That's funny then, it's impossible to have a hosting which has no network issues, eventually your network can go down for a few minutes, and if it doesn't reconnect then some shards will be down |
Superseded by #9099. |
Please describe the changes this PR makes and why it should be merged:
This PR should solve the issue #8486, where the connection gets stuck on IDENTIFY or RESUME.
It adds new private properties of WebSocketShard: readyDispatchTimeout and resumedDispatchTimeout, as well as 2 more private methods: setReadyDispatchTimeout and setResumedDispatchTimeout.
It also removes one .unref() of a timeout as the process exited on it if that was the only shard.
I agree that the names could be confusing (readyTimeout and readyDispatchTimeout), feel free to change them before merging.
I ran a bot with 4 shards for 2 days and all the shards are still up and working fine, whereas before they stopped reconnecting after a few hours of running.
Status and versioning classification: