Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition on opening a Stream over a DataChannel #2443

Open
justin0mcateer opened this issue Mar 15, 2024 · 3 comments
Open

Race Condition on opening a Stream over a DataChannel #2443

justin0mcateer opened this issue Mar 15, 2024 · 3 comments
Labels
need/triage Needs initial labeling and prioritization

Comments

@justin0mcateer
Copy link

justin0mcateer commented Mar 15, 2024

  • Version:
    1.2.4

  • Platform:
    Linux master 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 16:31:48 +0000 x86_64 GNU/Linux
    Windows 11
    MacOS
    Google Chrome 122.0.6261.94
    Brave Browser 122.1.63.165

  • Subsystem:
    WebRTC Transport

Severity:

High

Description:

DataChannels open but Streams timeout and fail.
The main symptom is that we end up with 'one way' Streams, where the incoming Stream was successfully opened, but the outgoing Stream aborts.

Steps to reproduce the error:

The timing window for the race condition is extremely small ~4ms, so the error only occurs when the time required to open a DataChannel is within a very small window. We typically see it when trying to establish connections in a LAN environment, but then the timing varies slightly based on the operating system and browser. Changing browsers may fix the problem, enabling a VPN will almost always fix the problem (due to added latency). The problem seems to occur with much greater frequency on Chrome on MacOS.

We only began seeing this problem when upgrading from 0.46 to 1.1 (now 1.2). The problem seems to have been introduced in the refactoring done here:
@achingbrain
#2237

What we are seeing is that there is a very small delay between checking for the current state of the DataChannel and pEvent actually registering the event handler to capture the 'open' event. Such that if the timing of the DataChannel open event is just right, pEvent never receives the event and throws on timing out.
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L228-L230

Catching the Timeout error and checking again, proceeding if the DataChannel is in fact open works around the issue.

Debugging log points have been added in this log output which illustrate the issue:
image

If pEvent caught the 'open' event this log message should be printed, but it is not:
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L231

It seems that due to the way pEvent is implemented, it leaves a tiny window of time between when the pEvent function is called and when 'addEventHandler' is actually called. A custom function that periodically polls, in addition to capturing the event may be required, since the timing is so tight in this case.

@justin0mcateer
Copy link
Author

Upon reviewing the pEvent code it more detail and thinking about it a little more. I believe this issue can only occur with events emitted from native code, such as the DataChannel event the WebRTC Transport library is trying to await. Otherwise, the pEvent code is effectively synchronous, so this would not happen with an event emitted from JavaScript code.

@justin0mcateer
Copy link
Author

Just to follow up on this, we added a small patch to address this, it was happening fairly regularly and we haven't seen this problem any more after applying the patch.

Basically, we shortened the timeout and we just check it again if pEvent times out. This second check covers the case that pEvent misses the open event from the DataChannel.

I think a better solution long-term here might be to 'poll' the state periodically in addition to listening for the event.

+++ b/dist/src/stream.js
@@ -38,7 +38,7 @@ export const FIN_ACK_TIMEOUT = 5000;
  * When sending data messages, if the channel is not in the "open" state, wait
  * this long for the "open" event to fire.
  */
-export const OPEN_TIMEOUT = 5000;
+export const OPEN_TIMEOUT = 500;
 export class WebRTCStream extends AbstractStream {
     /**
      * The data channel used to send and receive data
@@ -177,7 +177,16 @@ export class WebRTCStream extends AbstractStream {
         }
         if (this.channel.readyState !== 'open') {
             this.log('channel state is "%s" and not "open", waiting for "open" event before sending data', this.channel.readyState);
-            await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+            try {
+               await pEvent(this.channel, 'open', { timeout: this.openTimeout });
+               this.log('channel state is now "%s", sending data', this.channel.readyState);
+            } catch (err) {
+               // check if the channel is open after the timeout
+               if (this.channel.readyState !== 'open') {
+                   this.log.error('channel state is still "%s" after waiting for "open" event', this.channel.readyState);
+                   throw err;
+               }
+            }
             this.log('channel state is now "%s", sending data', this.channel.readyState);
         }
         // send message without copying data

@justin0mcateer
Copy link
Author

Would an using 'Promise.all' across pWaitFor and pEvent be an acceptable solution? If so, we would be happy to submit a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/triage Needs initial labeling and prioritization
Projects
None yet
Development

No branches or pull requests

1 participant