Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(websocketshard): deal with zombie connection caused by 4009 #7581

Merged
merged 51 commits into from Jun 5, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
a2326a7
fix(websocketshard): deal with zombie connection caused by 4009
legendhimself Mar 1, 2022
aab64b8
chore: remove eslint-disable for console
legendhimself Mar 1, 2022
25ac5ea
fix: use enum instead of using numbers
legendhimself Mar 1, 2022
4847898
Apply suggestions from code review
legendhimself Mar 1, 2022
52b0abd
fix: apply suggestions
legendhimself Mar 1, 2022
ac2beb3
Apply suggestions from code review
legendhimself Mar 1, 2022
2cb93a6
fix: remove extra reassignment of status
legendhimself Mar 1, 2022
792f38d
fix: remove extra reassignment of status
legendhimself Mar 1, 2022
78f84f9
fix: clear timeout on ws close
legendhimself Mar 1, 2022
b1a061f
Merge branch 'v13' into v13
legendhimself Mar 2, 2022
e80a867
Apply suggestions from code review
legendhimself Mar 6, 2022
78b15bc
Apply suggestions from code review
legendhimself Mar 6, 2022
c154ceb
refactor: apply suggestions from code review
legendhimself Mar 6, 2022
47b9587
Merge branch 'v13' into v13
legendhimself Mar 6, 2022
d3dfbac
Apply suggestions from code review
legendhimself Mar 6, 2022
fb3ab02
fix: close emitting twice
legendhimself Mar 7, 2022
edd617d
Merge branch 'v13' into v13
legendhimself Mar 7, 2022
130a62f
Merge branch 'v13' into v13
legendhimself Mar 10, 2022
043fe3e
feat: add zombieConnection to typings
legendhimself Mar 10, 2022
196ec16
fix: handle edge cases with setWsCloseTimeout
legendhimself Mar 14, 2022
f26a03d
Merge branch 'v13' into v13
legendhimself Apr 18, 2022
60f6425
fix: connection object before accessing readyState
legendhimself Apr 19, 2022
9ec75ff
Merge branch 'v13' into v13
legendhimself Apr 19, 2022
a69dd35
Merge branch 'v13' into v13
legendhimself May 3, 2022
8493928
Merge branch 'discordjs:v13' into v13
legendhimself May 12, 2022
1856ac2
feat(websocketshard): better way of handling zombie connection
legendhimself May 12, 2022
170edfa
fix: ackHeartbeat which was used for testing purpose
legendhimself May 12, 2022
d4cbd64
fix: docs for websocketshard
legendhimself May 12, 2022
c05a062
refactor: apply suggestions
legendhimself May 12, 2022
c023b18
Update src/client/websocket/WebSocketShard.js
legendhimself May 12, 2022
37759f0
docs: websocketshard
legendhimself May 12, 2022
8dbb5b0
fix: deal with edge cases
legendhimself May 12, 2022
836d934
refactor(websocketshard): move close debug log into emitClose method
legendhimself May 12, 2022
2f4a7e6
fix(websocketshard): reset session id on 4009
legendhimself May 12, 2022
f0770f8
fix: reconnect loops
legendhimself May 13, 2022
b2fd715
Merge branch 'v13' into v13
legendhimself May 13, 2022
9b5f5f6
fix: add timeout only on closing states
legendhimself May 13, 2022
b57f27f
Merge branch 'v13' of https://github.com/legendhimslef/discord.js int…
legendhimself May 13, 2022
97a2e5f
fix: closeEmitted variable
legendhimself May 13, 2022
7194f49
fix: debug logs
legendhimself May 14, 2022
f74985a
chore: cleanup
legendhimself May 14, 2022
7453895
chore: debug logs
legendhimself May 14, 2022
a3b94b2
refactor: remove else statement
legendhimself May 14, 2022
7a97d2f
Merge branch 'v13' of https://github.com/legendhimslef/discord.js int…
legendhimself May 14, 2022
4a9e393
fix: debug logs & clearTimeout at start
legendhimself May 15, 2022
0daf304
fix: docs
legendhimself May 16, 2022
5877246
docs: elaborated a bit more
legendhimself May 20, 2022
7b108a4
fix: if statement
legendhimself May 20, 2022
be04819
feat: handle edge cases while reconnecting
legendhimself May 24, 2022
88fb5f6
fix: deal with an edge case
legendhimself May 26, 2022
931fac5
chore: cleanup
legendhimself May 29, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
196 changes: 180 additions & 16 deletions src/client/websocket/WebSocketShard.js
@@ -1,7 +1,8 @@
'use strict';

const EventEmitter = require('node:events');
const { setTimeout, setInterval } = require('node:timers');
const { setTimeout, setInterval, clearTimeout } = require('node:timers');
const { setTimeout: sleep } = require('node:timers/promises');
const WebSocket = require('../../WebSocket');
const { Status, Events, ShardEvents, Opcodes, WSEvents } = require('../../util/Constants');
const Intents = require('../../util/Intents');
Expand Down Expand Up @@ -81,6 +82,13 @@ class WebSocketShard extends EventEmitter {
*/
this.lastHeartbeatAcked = true;

/**
* Used to prevent from calling the onClose event twice while closing or terminating the WebSocket.
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
* @type {boolean}
* @private
*/
this.closeEmitted = false;

/**
* Contains the rate limit queue and metadata
* @name WebSocketShard#ratelimit
Expand Down Expand Up @@ -126,6 +134,14 @@ class WebSocketShard extends EventEmitter {
*/
Object.defineProperty(this, 'helloTimeout', { value: null, writable: true });

/**
* The WebSocket timeout.
* @name WebSocketShard#WsCloseTimeout
* @type {?NodeJS.Timeout}
* @private
*/
Object.defineProperty(this, 'WsCloseTimeout', { value: null, writable: true });
legendhimself marked this conversation as resolved.
Show resolved Hide resolved

/**
* If the manager attached its event handlers on the shard
* @name WebSocketShard#eventsAttached
Expand Down Expand Up @@ -188,6 +204,7 @@ class WebSocketShard extends EventEmitter {
this.removeListener(ShardEvents.RESUMED, onResumed);
this.removeListener(ShardEvents.INVALID_SESSION, onInvalidOrDestroyed);
this.removeListener(ShardEvents.DESTROYED, onInvalidOrDestroyed);
this.removeListener(ShardEvents.ZOMBIE_CONNECTION, this.handleZombieConnection);
};

const onReady = () => {
Expand Down Expand Up @@ -253,7 +270,8 @@ class WebSocketShard extends EventEmitter {

this.connectedAt = Date.now();

const ws = (this.connection = WebSocket.create(gateway, wsQuery));
// Adding a handshake timeout to just make sure no zombie connection appears.
const ws = (this.connection = WebSocket.create(gateway, wsQuery, { handshakeTimeout: 30000 }));
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
ws.onopen = this.onOpen.bind(this);
ws.onmessage = this.onMessage.bind(this);
ws.onerror = this.onError.bind(this);
Expand Down Expand Up @@ -351,10 +369,25 @@ class WebSocketShard extends EventEmitter {
this.setHeartbeatTimer(-1);
this.setHelloTimeout(-1);
// If we still have a connection object, clean up its listeners
if (this.connection) this._cleanupConnection();
if (this.connection) {
this.debug('[WebSocket] onClose: Cleaning up the Connection.');
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
this._cleanupConnection();
}
this.closeEmitted = true;
this.status = Status.DISCONNECTED;

// Step 1: Null the connection object
this.debug('Step 1: Null the connection object.');
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
this.connection = null;

this.debug('Step 2: Set the shard status to DISCONNECTED');
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
// Step 2: Set the shard status to DISCONNECTED
this.status = Status.DISCONNECTED;

this.debug('Step 3: Cache the old sequence (use to attempt a resume)');
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
// Step 3: Cache the old sequence (use to attempt a resume)
if (this.sequence !== -1) this.closeSequence = this.sequence;

/**
* Emitted when a shard's WebSocket closes.
* @private
Expand Down Expand Up @@ -519,10 +552,50 @@ class WebSocketShard extends EventEmitter {
this.debug('Setting a HELLO timeout for 20s.');
this.helloTimeout = setTimeout(() => {
this.debug('Did not receive HELLO in time. Destroying and connecting again.');
this.destroy({ reset: true, closeCode: 4009 });
this.destroy({ closeCode: 4009 });
}, 20_000).unref();
}

/**
* Sets the WebSocket Close timeout.
* This method is responsilble to detect any zombie connections if the ws fails to close properly,
* Wait for 6s for the ws#close event after ws.close() or ws.terminate() is called.
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
* @param {number} [time] If set to -1, it will clear the timeout
* @private
*/
setWsCloseTimeout(time) {
if (time === -1) {
if (this.WsCloseTimeout) {
this.debug('Clearing the WebSocket Close timeout.');
clearTimeout(this.WsCloseTimeout);
this.WsCloseTimeout = null;
}
return;
}
this.WsCloseTimeout = setTimeout(() => {
this.debug(`[WebSocket] Close Emitted: ${this.closeEmitted}`);
// Check connection is null or if close event was emitted.
if (!this.connection || this.closeEmitted) {
this.debug(
`[WebSocket] WebSocket close not detected. | WS State: ${
CONNECTION_STATE[this.connection?.readyState ?? 3]
} | Close Emitted: ${this.closeEmitted}`,
);
this.closeEmitted = false;
this.setWsCloseTimeout(-1);
return;
}
// Waiting for approx 5s.

/**
* Emitted when a shard's WebSocket is not closed in time.
* @private
* @event WebSocketShard#zombieConnection
*/
this.emit(ShardEvents.ZOMBIE_CONNECTION);
}, 6000).unref();
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
}

/**
* Sets the heartbeat timer for this shard.
* @param {number} time If -1, clears the interval, any other number sets an interval
Expand Down Expand Up @@ -563,8 +636,7 @@ class WebSocketShard extends EventEmitter {
Sequence : ${this.sequence}
Connection State: ${this.connection ? CONNECTION_STATE[this.connection.readyState] : 'No Connection??'}`,
);

this.destroy({ closeCode: 4009, reset: true });
this.destroy({ closeCode: 4009 });
return;
}

Expand Down Expand Up @@ -702,31 +774,49 @@ class WebSocketShard extends EventEmitter {
* @private
*/
destroy({ closeCode = 1_000, reset = false, emit = true, log = true } = {}) {
// Making the variable false to check for zombie connections.
this.closeEmitted = false;

if (log) {
this.debug(`[DESTROY]
Close Code : ${closeCode}
Reset : ${reset}
Emit DESTROYED: ${emit}`);
}

this.debug(`[WS Destroy] Step 0: Remove all timers.`);
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
// Step 0: Remove all timers
this.setHeartbeatTimer(-1);
this.setHelloTimeout(-1);

this.debug(
`[WS Destroy] Step 1: Attempting to close the WebSocket. | WS State: ${
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);
// Step 1: Close the WebSocket connection, if any, otherwise, emit DESTROYED
if (this.connection) {
// If the connection is currently opened, we will (hopefully) receive close
if (this.connection.readyState === WebSocket.OPEN) {
this.connection.close(closeCode);
this.debug(
`[WebSocket] Close: Tried closing. | WS State: ${CONNECTION_STATE[this.connection?.readyState ?? 3]}`,
);
} else {
// Connection is not OPEN
this.debug(`WS State: ${CONNECTION_STATE[this.connection.readyState]}`);
this.debug(`WS State: ${CONNECTION_STATE[this.connection?.readyState ?? 3]}`);
// Remove listeners from the connection
this._cleanupConnection();
// Attempt to close the connection just in case
try {
this.connection.close(closeCode);
} catch {
this.debug(
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
`[WebSocket] Close: Something went wrong while closing the WebSocket, Now calling terminate. | WS State: ${
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);
this.connection.terminate();
// No-op
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
}
// Emit the destroyed event if needed
Expand All @@ -737,30 +827,104 @@ class WebSocketShard extends EventEmitter {
this._emitDestroyed();
}

// Step 2: Null the connection object
this.connection = null;

// Step 3: Set the shard status to DISCONNECTED
this.status = Status.DISCONNECTED;
/**
* Just to make sure the readyState is not stuck at CLOSING incase of a closeCode 4009.
* we can use this for other closeCodes aswell but rightnow so far only 4009 is failing to close.
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
**/
if (closeCode === 4009) {
this.debug(
`[WebSocket] closeCode[4009]: Adding a timeout to check the connection state. | WS State: ${
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);
this.setWsCloseTimeout();
}

// Step 4: Cache the old sequence (use to attempt a resume)
if (this.sequence !== -1) this.closeSequence = this.sequence;
this.debug(
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
`[WS Destroy] Step 2: Resetting the sequence and session id if requested. | WS State: ${
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);

// Step 5: Reset the sequence and session id if requested
// Step 2: Reset the sequence and session id if requested
if (reset) {
this.sequence = -1;
this.sessionId = null;
}

// Step 6: reset the rate limit data
// Step 3: Watch for zombie connection event which could be emitted when step 1 fails to close the WebSocket
this.on(ShardEvents.ZOMBIE_CONNECTION, this.handleZombieConnection);
kyranet marked this conversation as resolved.
Show resolved Hide resolved

this.debug('Step 4: reset the rate limit data');
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
// Step 4: reset the rate limit data
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
this.ratelimit.remaining = this.ratelimit.total;
this.ratelimit.queue.length = 0;
if (this.ratelimit.timer) {
this.debug(`[WS Destroy] Step 4: Clearing ratelimit timer`);
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
clearTimeout(this.ratelimit.timer);
this.ratelimit.timer = null;
}
}

/**
* Forcefully closes the WebSocket when the destory() method could not close it normally
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
* using the ws.close(); and ws.terminate();
* @private
* @returns {Promise<void>}
*/
async handleZombieConnection() {
// Continuing from step 2 from the destroy method.

// Step 3 Check if the connection was closed and readyState is at CLOSED. If not close it.
if (this.connection?.readyState === WebSocket.CLOSING || this.connection?.readyState === WebSocket.OPEN) {
this.debug(
`[WS Destroy] Step 3: WebSocket Still Connected: trying to terminate. | WS State: ${
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);

// Trying to close WebSocket with terminate..
try {
this.connection.terminate();
} catch {
// No op.
}

// Wait for 400ms for ws to close.
await sleep(400);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 400 ms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after several days of testing, I found out that it takes around 200-400ms to close the ws.
image

👆 The first date here is logged just before closing, the second one is logged in onClose method.
The difference of these 2 dates is 315ms. This why I kept it 400ms [max]

another example 👇
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out that it takes around 200-400ms to close the ws.

That's with your internet (or rather, with your provider's internet). It's highly dependant on location and how fast the packets are routed to Discord and back, so yeah, this is an unreliable thing.

If anything, you should use events.once to wait for a closing event.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out that it takes around 200-400ms to close the ws.

That's with your internet (or rather, with your provider's internet). It's highly dependant on location and how fast the packets are routed to Discord and back, so yeah, this is an unreliable thing.

If anything, you should use events.once to wait for a closing event.

I agree, We can make it a option "6000"

this.debug(`[WebSocket] Final Close Check. | WS State: ${CONNECTION_STATE[this.connection?.readyState]}`);
if (this.connection?.readyState === WebSocket.CLOSING || this.connection?.readyState === WebSocket.OPEN) {
this.debug(
// eslint-disable-next-line max-len
`[WebSocket] Connection still stuck at Closing: Calling a Manual Destroy of the socket to Close and reconnect. | WS State: ${
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);

/**
* This is an important step to deal with zombie connections where in shard never reconnects
* after a 4009 closeCode due to WebSocket being stuck at CLOSING ready state.
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
* Check the issue https://github.com/discordjs/discord.js/issues/7450
*
* The _socket.destroy() method Ensures that no more I/O activity happens on this socket.
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
* Destroys the stream and closes the connection. Refer: https://nodejs.org/api/net.html#socketdestroy
* _socket.destroy() is also being invoked in ws.terminate() method.
* ._socket.destroy emits the close event and makes the readyState to CLOSED.
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
*/

// manual destory
this.connection._socket.destroy();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure what to feel about this, since it's an object managed by a third-party package, and also a private property.

Also, typo on destory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no proper answer from ws package on "why" readyState is stuck at CLOSING forever. I just had to directly destroy the socket for them and emit the event manually. Even tho the code for ws.terminate() is pretty much similar.

All these changes and codes are in production right now and I have seen 0 downtimes. except a few 4009 where some session cannot be resumed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes and codes are in production right now

It's more referring to the fact that because it's private, the creators of the ws package may change it without prior notice, e.g. making the field a true-private one.

In case that happens, I believe we should be prepared for it.

wdyt @vladfrangu?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo, the proper fix for a 4009 is to not care anymore about the connection. Remove all listeners, terminate the connection and start a-new, no waiting or anything. If we get in a state where we emit a 4009, no point to wait for something that won't happen 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo, the proper fix for a 4009 is to not care anymore about the connection. Remove all listeners, terminate the connection and start a-new, no waiting or anything. If we get in a state where we emit a 4009, no point to wait for something that won't happen 👍

But as far I have seen with my bot, There are many times where the bot resumes the session after a 4009 with this pr changes ie 0 downtimes, but if we do what you said, > Remove all listeners, terminate the connection, and start a-new.
I think this will cause the bot to have tiny downtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll always have downtime, what matters is trying to resume afterwards

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll always have downtime, what matters is trying to resume afterwards

So we should directly clear everything and make a new connection for 4009 without even trying to resume the session even once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also want to ask if anyone knows why only one of my shards has this issue? and it[4009] only appears on it. The rest of my 15 shards are fine.
If it was my connection shouldn't it be random and appear on random shards?
@kyranet @vladfrangu

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll always have downtime, what matters is trying to resume afterwards

So we should directly clear everything and make a new connection for 4009 without even trying to resume the session even once?

No. I said you should clean the ws listeners (this.clearConnectionListeners or w/e its called), call this.connection.terminate() and simply not care if it works or doesn't and re-schedule the shard to connect. Thus, you get to attempt a resume, if it works 🎉, if it doesn't oh well, identify

this.connection.emitClose();

this.debug(
`[WebSocket] Applied a Manual Destroy, Clearing the WS State Logger in 5s. | WS State: ${
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
CONNECTION_STATE[this.connection?.readyState ?? 3]
}`,
);
}
}
}

/**
* Cleans up the WebSocket connection listeners.
* @private
Expand Down
1 change: 1 addition & 0 deletions src/util/Constants.js
Expand Up @@ -218,6 +218,7 @@ exports.ShardEvents = {
READY: 'ready',
RESUMED: 'resumed',
ALL_READY: 'allReady',
ZOMBIE_CONNECTION: 'zombieConnection',
legendhimself marked this conversation as resolved.
Show resolved Hide resolved
};

/**
Expand Down
5 changes: 5 additions & 0 deletions typings/index.d.ts
Expand Up @@ -2608,6 +2608,7 @@ export interface WebSocketShardEvents {
invalidSession: [];
close: [event: CloseEvent];
allReady: [unavailableGuilds?: Set<Snowflake>];
zombieConnection: [];
}

export class WebSocketShard extends EventEmitter {
Expand All @@ -2623,6 +2624,8 @@ export class WebSocketShard extends EventEmitter {
private eventsAttached: boolean;
private expectedGuilds: Set<Snowflake> | null;
private readyTimeout: NodeJS.Timeout | null;
private closeEmitted: boolean;
private WsCloseTimeout: NodeJS.Timeout | null;

public manager: WebSocketManager;
public id: number;
Expand All @@ -2638,6 +2641,7 @@ export class WebSocketShard extends EventEmitter {
private onPacket(packet: unknown): void;
private checkReady(): void;
private setHelloTimeout(time?: number): void;
private setWsCloseTimeout(time?: number): void;
private setHeartbeatTimer(time: number): void;
private sendHeartbeat(): void;
private ackHeartbeat(): void;
Expand All @@ -2647,6 +2651,7 @@ export class WebSocketShard extends EventEmitter {
private _send(data: unknown): void;
private processQueue(): void;
private destroy(destroyOptions?: { closeCode?: number; reset?: boolean; emit?: boolean; log?: boolean }): void;
private handleZombieConnection(): Promise<void>;
private _cleanupConnection(): void;
private _emitDestroyed(): void;

Expand Down