Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peerConnection.Close() hangs forever (deadlock?) while handling a callback from Pion #2404

Closed
daniel-abramov opened this issue Feb 6, 2023 · 3 comments

Comments

@daniel-abramov
Copy link

What did you do?

A little bit of a context: I'm working on the SFU, currently, we have the following rough architecture:

  • We subscribe to all callbacks from the peer connection.
  • The callbacks are 'transformed' into a stream of events, i.e. they are just sent over the channel chan PeerConnectionEvent.
  • This means that each time we receive a callback, we send a message over the aforementioned unbounded channel.
  • The handler of the messages listens for them and executes certain modifications of the conference state including modifying the peer connection.

This approach worked well when we used buffered channels with quite a large buffer, but occasionally the conference appeared to be frozen. I replaced the buffered channel with an unbounded channel (i.e. the sender blocks until the receiver reads the message). Right after that, I observed that quite often the conference gets frozen by indefinitely waiting inside a peerConnection.Close() method.

After some investigations, I've noticed that it primarily happens when shortly before calling the Close() function, we receive a callback for OnICECandidate. Close() hangs forever and never returns.

Here is the callback:

runtime.gopark (/opt/homebrew/Cellar/go/1.19.4/libexec/src/runtime/proc.go:364)
runtime.chanrecv (/opt/homebrew/Cellar/go/1.19.4/libexec/src/runtime/chan.go:583)
runtime.chanrecv1 (/opt/homebrew/Cellar/go/1.19.4/libexec/src/runtime/chan.go:442)
github.com/pion/ice/v2.(*Agent).Close (/Users/danielabramov/go/pkg/mod/github.com/pion/ice/v2@v2.2.3/agent.go:920)
github.com/pion/webrtc/v3.(*ICEGatherer).Close (/Users/danielabramov/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.31/icegatherer.go:185)
github.com/pion/webrtc/v3.(*ICETransport).Stop (/Users/danielabramov/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.31/icetransport.go:198)
github.com/pion/webrtc/v3.(*PeerConnection).Close (/Users/danielabramov/go/pkg/mod/github.com/pion/webrtc/v3@v3.1.31/peerconnection.go:1982)

It hangs here:

https://github.com/pion/ice/blob/4dea7246b63bd6230dcb40b14c77679a10844e62/agent.go#L927

It seems like there is some race or a deadlock in Pion that acquires certain resources when calling the OnICECandidate callback which is then expected to be released by the moment the callback finishes. I.e. essentially the workaround that we found is to never block or wait for any I/O inside the OnICECandidate callback, but we're a bit concerned that there might be more places where such a thing can occur. It seems like sometimes it hangs for no apparent reason.

What did you expect?

I expected that Close() does not hang forever. Ideally, the behavior for the callbacks should be unified and predictable unless there is a reason to have different requirements for the callbacks that we pass to Pion (I think it probably would be a good idea to document them), i.e. if the callbacks are not allowed to block of it it's not allowed to access the peer connection from the callback closure, it probably needs to be documented. This happens not every time which gives a hint that there must be a race somewhere inside Pion that leads to a deadlock.

What happened?

peerConnection.Close() hung forever.

@edaniels
Copy link
Member

edaniels commented Feb 9, 2023

We have a similar issue and workaround in our rpc library https://github.com/viamrobotics/goutils/blob/2d9c3aeea721c2d61835e5b94db63dfff4c3ece9/rpc/wrtc_client.go#L231

@jech
Copy link
Contributor

jech commented Jul 14, 2023

I'm not sure whether this is a bug. The callbacks are called synchronously, if they need to block, then you should make them asynchronous yourself. This is the usual idiom in Go, where making a synchronous function asynchronous is easy (just add the go keyword), but making an asynchronous invocation synchronous is difficult (requires either a channel or a workgroup).

@Sean-Der
Copy link
Member

Sean-Der commented May 9, 2024

Thanks for filing this @daniel-abramov I have had users ask about this/be confused. I think this is best for users

  • A user shouldn't be able to cause a deadlock by calling external APIs
  • In cases where high performance matters (like OnMessage for DataChannels) we might have to accept the sharp API
  • I am going to go through and add unit tests that ensure you can call Close in every callback! Will fix the places I find that have issues.

I am going to merge this into #2405

@Sean-Der Sean-Der closed this as completed May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants