Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(WebSocketShard): Add RESUMED and READY timeout #8759

Closed
wants to merge 11 commits into from

Conversation

gamer0mega
Copy link

Please describe the changes this PR makes and why it should be merged:
This PR should solve the issue #8486, where the connection gets stuck on IDENTIFY or RESUME.

It adds new private properties of WebSocketShard: readyDispatchTimeout and resumedDispatchTimeout, as well as 2 more private methods: setReadyDispatchTimeout and setResumedDispatchTimeout.

It also removes one .unref() of a timeout as the process exited on it if that was the only shard.

I agree that the names could be confusing (readyTimeout and readyDispatchTimeout), feel free to change them before merging.

I ran a bot with 4 shards for 2 days and all the shards are still up and working fine, whereas before they stopped reconnecting after a few hours of running.

Status and versioning classification:

  • Code changes have been tested against the Discord API, or there are no code changes
  • I know how to update typings and have done so, or typings don't need updating
  • This PR changes the library's interface (methods or parameters added)

@vercel
Copy link

vercel bot commented Oct 16, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

2 Ignored Deployments
Name Status Preview Comments Updated
discord-js ⬜️ Ignored (Inspect) Jan 7, 2023 at 5:21PM (UTC)
discord-js-guide ⬜️ Ignored (Inspect) Jan 7, 2023 at 5:21PM (UTC)

Makes RESUMED timeout delay itself if there were events received in the last 20 seconds
Makes RESUMED timeout delay itself if there were events received in the last 20 seconds
@gamer0mega gamer0mega requested review from kyranet and removed request for vladfrangu and iCrawl November 20, 2022 11:51
Copy link

@l1v0n1 l1v0n1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

п

@DraftProducts
Copy link
Contributor

This stuff looks great!
What did you test it on? @gamer0mega
I realized that the issue can be increased when the network speed is low or not proportional to the amount of servers supported.
Did you compare the session start tokens used? If so, is there a significant difference?

I will test this fix on production 400 shards to see if it reduces or fixes the problem on my side 👍

@gamer0mega
Copy link
Author

This stuff looks great!
What did you test it on? @gamer0mega
I realized that the issue can be increased when the network speed is low or not proportional to the amount of servers supported.
Did you compare the session start tokens used? If so, is there a significant difference?

I will test this fix on production 400 shards to see if it reduces or fixes the problem on my side 👍

I didn't test it on a large bot, because it was a rewrite of me moving my main bot from eris as it is not supported anymore. I tested it on a combination of mobile data + mobile hotspot + an unstable Linux wi-fi driver, with 4 shards. This is very unstable and I found this issue quickly as the test bot randomly died and never connected back, I had 4 shards spread across 2 processes to make sure the fix worked, I always looked through the debug logs and seeing that it got READY was rare, most of the times it got RESUMED and after 10 days all the shards were up and working before I had to restart my pc. You can check the session starts yourself, because before it died on the second or even the first reconnect and I was unable to check that at all

@DraftProducts
Copy link
Contributor

DraftProducts commented Dec 27, 2022

Ok, good job 🔥
I added this patch in production along with the patch made in collaboration with @legendhimself.
After +10 hours, I don't seem to have had any other issues caused by this RP.

Edit: #9001 has been done for that 🎉

Also, what do you think about this fix 2e1c68e ?

It had for me a good impact on zombie connections (but it increased a lot the new logins), if we read the documenation of the > discord api : https://discord.com/developers/docs/topics/gateway#sending-heartbeats

image

We should change the reset value to false on L607 to allow a RESUME on a zombie connection.

Co-authored-by: DraftMan <contact@draftman.fr>
@legendhimself
Copy link
Contributor

legendhimself commented Jan 13, 2023

I don't think we need this pr anymore since #8989 has fixed #8486
Correct me if I am wrong but we don't need the READY timeout since we already have a timeout on the ShardingManager

Moreover, we shouldn't be handling resumed and ready event timeouts by ourselves since if any connection goes dead it's going to be picked up by zombie connection code in a few seconds closeTimeout

@gamer0mega
Copy link
Author

I don't think we need this pr anymore since #8989 has fixed #8486 Correct me if I am wrong but we don't need the READY timeout since we already have a timeout on the ShardingManager

Moreover, we shouldn't be handling resumed and ready event timeouts by ourselves since if any connection goes dead it's going to be picked up by zombie connection code in a few seconds closeTimeout

It can get stuck on RESUMING, where it should never heartbeat until RESUMED, so if it gets stuck there is no way to detect it

@Qjuh
Copy link
Contributor

Qjuh commented Jan 15, 2023

I don't think we need this pr anymore since #8989 has fixed #8486 Correct me if I am wrong but we don't need the READY timeout since we already have a timeout on the ShardingManager
Moreover, we shouldn't be handling resumed and ready event timeouts by ourselves since if any connection goes dead it's going to be picked up by zombie connection code in a few seconds closeTimeout

It can get stuck on RESUMING, where it should never heartbeat until RESUMED, so if it gets stuck there is no way to detect it

Can it really get stuck there though? None of the errors in the issue you linked had that as their cause, all were related to reconnecting over and over and then getting stuck because of that.

@legendhimself
Copy link
Contributor

It can get stuck on RESUMING, where it should never heartbeat until RESUMED, so if it gets stuck there is no way to detect it

@gamer0mega If it gets stuck on resuming you need a better host for the bot. This is highly unlikely unless the host has a bad internet connection. Resuming -> Resumed depends on Our Server to Discord Server, if the resumed didn't happen even with a good connection (which is highly unlikely) its a discord issue.

@gamer0mega
Copy link
Author

It can get stuck on RESUMING, where it should never heartbeat until RESUMED, so if it gets stuck there is no way to detect it

@gamer0mega If it gets stuck on resuming you need a better host for the bot. This is highly unlikely unless the host has a bad internet connection. Resuming -> Resumed depends on Our Server to Discord Server, if the resumed didn't happen even with a good connection (which is highly unlikely) its a discord issue.

It is not a problem with the network, the socket gets disconnected, the library doesnt know that and keeps RESUMING for eternity

@legendhimself
Copy link
Contributor

@gamer0mega
Can you provide me with a repro
This repro should work even with a good connection as you just mentioned "It is not a problem with the network"

@gamer0mega
Copy link
Author

gamer0mega commented Jan 30, 2023

@gamer0mega
Can you provide me with a repro
This repro should work even with a good connection as you just mentioned "It is not a problem with the network"

By that, I meant that the library should reconnect even if my networking sucks and not get stuck

P.S. You disagree? That's funny then, it's impossible to have a hosting which has no network issues, eventually your network can go down for a few minutes, and if it doesn't reconnect then some shards will be down

@Jiralite
Copy link
Member

Superseded by #9099.

@Jiralite Jiralite closed this Mar 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

10 participants