fix: ping race condition #144

koraykoska · 2023-07-07T16:42:25Z

I have been guessing in the dark for a while but this seems to fix the crash (see below) that is happening in certain cases. I cannot reproduce them other than extremely heavy load for hours or sometimes days.
In general, having ~2000 active Websocket connections with 10 second ping intervals each on 8 different servers generates 1-2 of those crashes per day.

I assumed it was a race condition, which was confirmed by a few people on the Discord as well. These changes seem to fix it, which means I haven't been able to see the crash since.

In general, being in the EventLoop is a good approach anyways, even if I am not sure why there are cases of race conditions.

One idea I had is that setting pingInterval triggers the schedule function, but is not guaranteed to be in the EventLoop. If channel is accessed at the same time from somewhere else the race condition might occur. It's a guess though.

Original Crash Report:

Received signal 11. Backtrace:

0x559e53f44ce2, Backtrace.(printBacktrace in _B82A8C0ED7C904841114FDF244F9E58E)(signal: Swift.Int32) -> () at /build/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:66
0x559e53f44f57, closure #1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /build/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:80
0x559e53f44f57, @objc closure #1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /build/<compiler-generated>:79
0x7f04ddd7551f
0x559e5495064a, WebSocketKit.WebSocket.(pingAndScheduleNextTimeoutTask in _449C2CFC2E87FDBFEFFF87218556DA24)@Sendable () -> () at /build/.build/checkouts/websocket-kit/Sources/WebSocketKit/WebSocket.swift:311
0x559e549539bb, reabstraction thunk helper from @escaping @callee_guaranteed @Sendable () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed @Sendable () -> (@out (), @error @owned Swift.Error) at /build/<compiler-generated>:0
0x559e549539bb, partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed @Sendable () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed @Sendable () -> (@out (), @error @owned Swift.Error) at /build/<compiler-generated>:0
0x559e54515a2b, closure #1 () -> () in NIOPosix.SelectableEventLoop.scheduleTask<A>(deadline: NIOCore.NIODeadline, _: () throws -> A) -> NIOCore.Scheduled<A> at /build/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:258
0x559e54514803, closure #4 () -> () in NIOPosix.SelectableEventLoop.run() throws -> () at /build/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:518
0x559e54514803, generic specialization <()> of NIOPosix.withAutoReleasePool<A>(() throws -> A) throws -> A at /build/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:29
0x559e54514803, NIOPosix.SelectableEventLoop.run() throws -> () at /build/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:517
0x559e544f85d6, static NIOPosix.MultiThreadedEventLoopGroup.(runTheLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(thread: NIOPosix.NIOThread, parentGroup: Swift.Optional<NIOPosix.MultiThreadedEventLoopGroup>, canEventLoopBeShutdownIndividually: Swift.Bool, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> (), _: (NIOPosix.SelectableEventLoop) -> ()) -> () at /build/.build/checkouts/swift-nio/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift:95
0x559e544f85d6, closure #1 (NIOPosix.NIOThread) -> () in static NIOPosix.MultiThreadedEventLoopGroup.(setupThreadAndEventLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(name: Swift.String, parentGroup: NIOPosix.MultiThreadedEventLoopGroup, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> ()) -> NIOPosix.SelectableEventLoop at /build/.build/checkouts/swift-nio/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift:116
0x559e544fb13e, partial apply forwarder for closure #1 (NIOPosix.NIOThread) -> () in static NIOPosix.MultiThreadedEventLoopGroup.(setupThreadAndEventLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(name: Swift.String, parentGroup: NIOPosix.MultiThreadedEventLoopGroup, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> ()) -> NIOPosix.SelectableEventLoop at /build/<compiler-generated>:0
0x559e544fd199, reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed NIOPosix.NIOThread) -> () to @escaping @callee_guaranteed (@in_guaranteed NIOPosix.NIOThread) -> (@out ()) at /build/<compiler-generated>:0
0x559e545302fe, closure #1 (Swift.Optional<Swift.UnsafeMutableRawPointer>) -> Swift.Optional<Swift.UnsafeMutableRawPointer> in static NIOPosix.ThreadOpsPosix.run(handle: inout Swift.Optional<Swift.UInt>, args: NIOPosix.Box<(body: (NIOPosix.NIOThread) -> (), name: Swift.Optional<Swift.String>)>, detachThread: Swift.Bool) -> () at /build/.build/checkouts/swift-nio/Sources/NIOPosix/ThreadPosix.swift:105
0x7f04dddc7b42
0x7f04dde599ff
0xffffffffffffffff

gwynne · 2023-07-07T18:10:18Z

While you are correct that the setter for the pingInterval property does not properly dispatch to the socket's event loop when calling pingAndScheduleNextTimeoutTask(), that does not quite look to be the direct proximate cause of this crash. The backtrace shows that the crash occurs in a call made by the scheduled task, which is explicitly on the event loop. I also can't see anything that pingAndScheduleNextTimeoutTask() does that's sufficiently unsafe to cause the segmentation fault that backtrace is reporting, certainly not on the first line of the method (this suggests that self has become invalid, which shouldn't be possible since the scheduled task should be holding at least one retain on it).

This being said, the logic for handling "locking" in this code has at least three TOCTOU errors I've seen so far. (To be sure, I mean no insult to @0xTim or his efforts to make the code Sendable; this code was already a twisty mess to begin with and Sendable is no picnic to get right. Whole thing should be rewritten as a state machine...) I suspect your fix is changing the timing of something enough to sidestep some race or another. My best guess is the TOCTOU races are allowing reentrant execution of something on multiple threads in a way that defeats the locks, resulting in a crash (although I still can't quite work out exactly what sequence of execution would cause that).

In any event, while your fix is definitely in the correct spirit (and that method should definitely be making that check, even if only as an assertion), I don't think it's actually solving the underlying issue, just hiding it. Off the top of my head, the fixes that are needed include:

Get rid of the separate NIOLockedValueBoxes for the three ping-related properties (waitingForPong, scheduledTimeoutTask, _pingInterval); replace those three properties with a little struct of their own (something like struct AutopingState { var waitingForPong: Bool /* etc */ }), and wrap that entire struct with a NIOLockedValueBox. When the code is appropriately updated for this change, this removes the various races that can occur between checking one property and updating another.

Put the checks of property values and the corresponding updates of those properties within the same locking context - and do this for all locked properties, not just the ping-related ones. For example, this code:

guard !self.waitingForClose.withLockedValue({ $0 }) else {
    promise?.succeed(())
    return
}
self.waitingForClose.withLockedValue { $0 = true }

should look more like this:

let wasAlreadyWaiting = self.waitingForClose.withLockedValue {
    defer { $0 = true }
    return $0
}
guard !wasAlreadyWaiting else { return promise?.succeed() ?? () }

Speaking of the "close" logic, waitingForClose and _closeCode should also be part of the same locked value box.
The scheduledTimeoutTask should be canceled and set to nil when 1) close() is called, and 2) pingInterval is set to nil.

I think that covers all the obvious issues I saw so far.

I know this is a lot - if you aren't sure how to tackle it, I'll see if I can do at least some of it this weekend (assuming @0xTim doesn't beat me to it).

koraykoska · 2023-07-07T18:26:26Z

@gwynne Thank you for the detailed response. I was aware that this PR doesn't really fix the cause.

Unfortunately, I have little experience with Sendable, especially NIOLockedValueBox, so I would be grateful if you could take this over.

Let me know if I can help in any way. Thanks again for the fast response!

koraykoska · 2023-07-12T10:04:47Z

@gwynne Quick update: with the changes in this PR the crash still occurs, just less frequently. I am down to 1 crash every 3-4 days with the same setup as described in my first post. So you are right that this is not the fix yet.

fix: ping race condition

49b4437

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: ping race condition #144

fix: ping race condition #144

koraykoska commented Jul 7, 2023 •

edited by MahdiBM

gwynne commented Jul 7, 2023

koraykoska commented Jul 7, 2023

koraykoska commented Jul 12, 2023

fix: ping race condition #144

Are you sure you want to change the base?

fix: ping race condition #144

Conversation

koraykoska commented Jul 7, 2023 • edited by MahdiBM

gwynne commented Jul 7, 2023

koraykoska commented Jul 7, 2023

koraykoska commented Jul 12, 2023

koraykoska commented Jul 7, 2023 •

edited by MahdiBM