Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasionally stuck WebTransport streams #1451

Open
ansemjo opened this issue Jul 13, 2023 · 20 comments
Open

Occasionally stuck WebTransport streams #1451

ansemjo opened this issue Jul 13, 2023 · 20 comments

Comments

@ansemjo
Copy link

ansemjo commented Jul 13, 2023

Hi! I opened an issue in quic-go/webtransport-go two days ago because certain BidirectionalStreams got stuck when I sent too much data in a short interval and I suspected the Go server at first. However, I couldn't reproduce this issue with neither a Go client, nor Chromium. The reproducer I used is in this Gist of mine.

@marten-seemann suggested that I record a qlog and visualize it with qvis. I set network.http.http3.enable_qlog in Firefox' about:config and tried to visualize the log. But the recorded JSON is broken / corrupt. I've attached such a log from opening the reproducer above in a fresh Firefox profile here: localhost.qlog.txt

Opening the file in VSCode highlights a few places where it looks like maybe writes to logfile were interleaved, so the objects were not properly printed?
image

I'm not sure how I can debug this further and I haven't worked with neqo directly yet but this seemed like the most sensible place to report this issue? For now, it seems I found a workaround for my initial issue, anyway.

@martinthomson
Copy link
Member

@jesup and @LPardue might both be interested in this one.

@LPardue
Copy link
Contributor

LPardue commented Jul 16, 2023

Yes I'm interested. Each event should be atomically written. If you're able to pinpoint some issue in the qlog crate I'm happy to take a look.

@ansemjo
Copy link
Author

ansemjo commented Jul 21, 2023

Yes I'm interested. Each event should be atomically written. If you're able to pinpoint some issue in the qlog crate I'm happy to take a look.

If you're waiting on me @LPardue .. I'm afraid I can't help you with that, sorry. :/

I haven't used the qlog create directly yet; in fact I haven't used neqo directly yet either. And to be honest, I only picked this place over Bugzilla because I already have an account here and I figured that since neqo is the QUIC component in Firefox it might make sense. 🙃

If you have some specific instructions, e.g. on how to capture a more detailed trace with Firefox, or questions about my reproducer, I'm all ears though! 😄

@LPardue
Copy link
Contributor

LPardue commented Jul 21, 2023

Thanks @ansemjo , I think we need on of the mozilla team to give a pointer how Firefox uses the qlog crate. It seems to me like there could be a multi-thread problem in how the app accesses qlog access that is causing a single even to be truncated. I'm not familiar enough with the neqo code base and lack the time right now to spend on figuring that out, but I am very interested in getting to the root cause and a fix.

@ansemjo
Copy link
Author

ansemjo commented Jul 21, 2023

This is slightly offtopic to the "corrupted logs" issue but in my original issue I tried capturing some more logs to find the cause. The last capture in quic-go/webtransport-go#84 (comment) shows that both qlog files from Firefox and webtransport-go logged all packets properly, while the JavaScript code was stuck again after just a fraction of that.

To me, that sounds like an issue somewhere at the interface between neqo and Firefox. As you say, maybe it's a multi-threading issue somewhere ...

@LPardue
Copy link
Contributor

LPardue commented Jul 22, 2023

On closer inspection, neqo is using an old version of the qlog crate that provided a hack-ish way to do streaming JSON serialization,this is a possible cause of the weird logs. Since 0.4.0 we've moved to the JSON-SEQ format which could possibly fix this problem, see the qlog bump PR #1454

@larseggert
Copy link
Collaborator

We're now on qlog 0.12.0. Closing this under the assumption that that has addressed the issue.

@ansemjo
Copy link
Author

ansemjo commented Feb 7, 2024

I've checked my reproducer above and I don't have any corrupted qlog files anymore; all of them can be imported fine in qvis. However, the "blocked streams" from my original issue still occur.

@larseggert
Copy link
Collaborator

Thanks for the update. Reopening this.

@larseggert larseggert reopened this Feb 8, 2024
@ansemjo
Copy link
Author

ansemjo commented Feb 8, 2024

Thanks. I'm not sure this here is the right issue for it though? I've held off on reporting it to Mozilla directly because I didn't have the time for it back when I found it and I've personally only tested it with this one quic-go server.

This particular issue was about the broken qlog files - and those seem to be fixed.

@larseggert
Copy link
Collaborator

Well, this repo is a Mozilla repo :-) I'll leave it up to you if you want to close this and reopen it when you have confirmed it is our issue.

@ansemjo
Copy link
Author

ansemjo commented Apr 9, 2024

You're right. I was thinking of the Bugzilla issue tracker though. :P

Sorry for staying silent on this for so long. I just updated the dependencies in my reproducer and retried with Firefox 124.0.1 on Linux. The issue still occurs, albeit much less frequently! Here are two fresh qlog files from Firefox and quic-go, where the unidirectional stream is stuck after sending 22 blobs or about 2 MB:

I guess, in order to further narrow it down to an actual Firefox issue:

  • try a standalone neqo-client against the webtransport-go server
  • try to rebuild the reproducer with neqo-server and connect with firefox
  • try any other implementation of WebTransport servers?

It might also be entirely unrelated to the QUIC code and instead be an issue with AsyncIterators? At least the simple while (true) loop in the reproducer surely can't have much room for errors, can it?

const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = value;
  received += chunk.length;
  console.log(...prefix, "CHUNK", received, chunk);
};

Could you maybe try to reproduce the issue on your end first?

@ansemjo ansemjo changed the title Corrupted qlog on a stuck WebTransport stream Occasionally stuck WebTransport streams Apr 9, 2024
@larseggert
Copy link
Collaborator

Could you maybe try to reproduce the issue on your end first?

Am trying to. Am following the steps in your reproducer. What is xdg-open https://localhost:7443/ supposed to do, that doesn't seem a command I have available after having followed the steps until then?

@ansemjo
Copy link
Author

ansemjo commented Apr 10, 2024

Could you maybe try to reproduce the issue on your end first?

Am trying to. Am following the steps in your reproducer. What is xdg-open https://localhost:7443/ supposed to do, that doesn't seem a command I have available after having followed the steps until then?

Ah, sorry, that is simply opening the URL in a web browser (Firefox in this case). On Linux with a Gnome Desktop that usually opens any compatible program, which can handle https:// links.

@larseggert
Copy link
Collaborator

Ah, got it. Is there any way to just test this with the neqo-client demo client? Would be much easier to debug.

@ansemjo
Copy link
Author

ansemjo commented Apr 10, 2024

That's among my TODOs to narrow it down; haven't tried that yet and I haven't worked with neqo directly either yet. You can see a client implementation in Go in the gist but as far as I remember, I couldn't reproduce it with a webtransport-go client yet. That's why in quic-go/webtransport-go#84 it was decided that it's probably not a webtransport-go issue.

@KershawChang
Copy link
Collaborator

I seem to able to reproduce this, but I am not entirely sure.
@ansemjo , what's the expected output of the browser console?
This is what I saw:
Screenshot 2024-04-10 at 2 04 38 PM

@ansemjo
Copy link
Author

ansemjo commented Apr 10, 2024

I seem to able to reproduce this, but I am not entirely sure. @ansemjo , what's the expected output of the browser console? This is what I saw: [...]

Your screenshot shows a success, actually. Meaning your WebTransport stream did not hang. You can see that because the logged lines contain the total number of bytes received, which should be 32 * 100 KiB = 3276800 at the end. The server window is a better indicator because the output should just stop after some blob without ever printing the done sending in the end.

In the latest commit (revision 12 in the gist) I added a SUCCESS message to the browser console output. When you don't see that, your stream is stuck. Edit: I realize I should be clearer with my messages. 🙃 So, just to be abundantly clear: when you do see a SUCCESS or 3276800 received bytes, you did not hit this issue. When that's missing and your stream is stuck, that's when you ... successfully reproduced the issue. Sorry, I need to fix that wording when I'm home.


I'm on another computer with Windows right now and reproduced it after a few tries. The binary was compiled with GOOS=windows go build -o reproducer.exe main.go under WSL2.

Screenshot 2024-04-10 185928

@KershawChang
Copy link
Collaborator

Sorry, I have no luck to reproduce this at my side.
@ansemjo , could you try to follow the instructions here to record a log file?
Please set New log modules: to timestamp,sync,nsHttp:5,WebTransport:5,nsWebTransport:5,neqo_http3::*:5,neqo_transport::*:5 and select Logging to a file.
Thanks.

@ansemjo
Copy link
Author

ansemjo commented Apr 11, 2024

Sure!
Wow, that log is verbose. I used a Firefox Developer 125.0b6 (64-bit) window with a clean profile because I have many tabs open and they all appeared in the log otherwise.

And how it looked in the terminal:

2024/04/11 11:17:32 connection from 127.0.0.1:51676
2024/04/11 11:17:32 logging QUIC connection to "quic_323636_6662353932323963383361386630646530353533613538343261373637323563.qlog"
2024/04/11 11:17:33 write blob #00
2024/04/11 11:17:33 write blob #01
2024/04/11 11:17:33 write blob #02
2024/04/11 11:17:33 write blob #03
2024/04/11 11:17:33 write blob #04
2024/04/11 11:17:33 write blob #05
2024/04/11 11:17:33 write blob #06
2024/04/11 11:17:33 write blob #07
2024/04/11 11:17:33 write blob #08
2024/04/11 11:17:33 write blob #09
2024/04/11 11:17:33 write blob #10
2024/04/11 11:17:33 write blob #11
2024/04/11 11:17:33 write blob #12
2024/04/11 11:17:33 write blob #13
2024/04/11 11:17:33 write blob #14
2024/04/11 11:17:33 write blob #15
2024/04/11 11:17:33 write blob #16
2024/04/11 11:17:33 write blob #17
2024/04/11 11:17:33 write blob #18
2024/04/11 11:17:33 write blob #19
2024/04/11 11:17:33 write blob #20

And finally, the browser console:
image

I hope that's helpful to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants