A test to trigger 'earliest > now' assertion #1491

Merged: 8 commits merged into mozilla:main from the h3_upload_test branch on May 16, 2024

Conversation

KershawChang (Collaborator)

This test demonstrates one possible case in which the assertion is triggered. Basically, this happens when the state of IdleTimeout is set to IdleTimeoutState::AckElicitingPacketSent and we haven't received any packets after that.

Note that in the expired function we set keep_alive to false, so the expiry function returns a value larger than now and we don't consider the idle timer expired. However, when expiry is called in next_delay, we set keep_alive to true, so the value returned from expiry is smaller than now.
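
To make the drift concrete, here is a minimal, self-contained sketch of the mechanism. The type, fields, and the "keep-alive fires at half the idle timeout" rule are illustrative assumptions, not the actual neqo-transport code; the point is only how asking expiry with different keep_alive flags can yield a deadline that is already in the past.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the idle-timeout bookkeeping; the names and the
// "keep-alive fires at half the idle timeout" rule are assumptions for
// illustration, not the actual neqo-transport implementation.
struct IdleSketch {
    timeout: Duration,
    last_ack_eliciting_sent: Instant,
}

impl IdleSketch {
    // The reported deadline depends on the keep_alive flag: with keep-alive
    // enabled, a shorter deadline is reported so a PING can be scheduled
    // before the peer's idle timer fires.
    fn expiry(&self, keep_alive: bool) -> Instant {
        let delay = if keep_alive { self.timeout / 2 } else { self.timeout };
        self.last_ack_eliciting_sent + delay
    }
}

fn main() {
    let start = Instant::now();
    let idle = IdleSketch {
        timeout: Duration::from_secs(30),
        last_ack_eliciting_sent: start,
    };

    // Pretend 20 seconds pass without receiving anything from the peer.
    let now = start + Duration::from_secs(20);

    // expired() asks with keep_alive == false: the 30 s deadline is still in
    // the future, so the connection is not considered idle...
    assert!(idle.expiry(false) > now);

    // ...but next_delay() asks with keep_alive == true and gets start + 15 s,
    // which is already in the past: exactly the `earliest > now` situation.
    assert!(idle.expiry(true) < now);
}
```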

@martinthomson (Member) left a comment

Code comments only. Once we have a fix, we should land this test with that fix.

@martinthomson (Member)

Is the fix to propagate the keep_alive flag through expired?

@KershawChang (Collaborator, Author) commented Nov 14, 2023

> Is the fix to propagate the keep_alive flag through expired?

If I understand this fix correctly, doing this means that the connection will be closed when the keep-alive timer expires.
However, as this test expects, we should send a ping instead of closing the connection.
I assume we don't want to change our current behavior. If so, maybe the simplest fix is to ignore the idle_time when it's earlier than now?
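
A minimal sketch of that last suggestion, assuming a hypothetical helper rather than neqo's real API: instead of asserting that the deadline is in the future, a stale deadline is clamped to a zero delay so the caller simply acts immediately.

```rust
use std::time::{Duration, Instant};

// Hedged sketch of the "ignore the idle_time when it's earlier than now"
// idea. `next_delay_sketch` is a hypothetical helper, not neqo's actual API:
// a stale deadline is clamped to a zero delay instead of tripping an
// assertion, so the caller just sends the keep-alive PING immediately.
fn next_delay_sketch(now: Instant, idle_deadline: Instant) -> Duration {
    if idle_deadline <= now {
        Duration::ZERO // deadline already passed: act right away
    } else {
        idle_deadline - now
    }
}

fn main() {
    let start = Instant::now();
    let now = start + Duration::from_secs(10);

    // A deadline that drifted into the past no longer causes trouble.
    assert_eq!(next_delay_sketch(now, start + Duration::from_secs(5)), Duration::ZERO);
    // A future deadline is reported as the remaining wait time.
    assert_eq!(
        next_delay_sketch(now, now + Duration::from_secs(7)),
        Duration::from_secs(7)
    );
}
```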

@martinthomson (Member) left a comment

> If I understand this fix correctly, doing this means that the connection will be closed when the keep-alive timer expires.

No, just that we would compute the idle timeout based on whether a keep-alive is active. We'd still send a ping when that shorter time elapsed, but we wouldn't have the drift issue you refer to.

One thing that is reassuring about this situation is that it only occurs when the server goes AWOL, as this test demonstrates. That's also troubling because it's a crash, so it's good to see that you caught this one.

@KershawChang (Collaborator, Author)

I've moved the test to idle.rs, since it seems to be a better place for testing the idle timeout.
Note that this PR breaks some tests, so I'm not sure this is the change we want.

@martinthomson (Member)

Hmm, yeah, this doesn't look right. We should not be closing the connection at this point. I think that we need to look for another approach.

@martinthomson (Member)

Thinking on this more, I think that we probably need to consider tweaks to the expiry() function and how it works. What we want is to have it track when keep-alive packets were sent and either provide updated estimates for the next timeout based on that or simply switch to reporting the keep_alive == false timeout when a keep-alive has been sent.

My understanding is that we'll retransmit the keep-alive using the usual logic for lost packets (PTO, etc.), so there is no need to continue to report the keep-alive time once we've initiated the keep-alive.

So maybe the idle code needs to be told when the keep-alive is sent so that it can stop reporting the shorter timeout.
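
A rough sketch of that idea, with invented names (IdleSketch, keep_alive_outstanding, keep_alive_sent) rather than the actual neqo-transport types: once the idle code is told that a keep-alive PING went out, it stops reporting the shorter deadline and leaves retransmission to the normal loss-recovery (PTO) machinery.

```rust
use std::time::{Duration, Instant};

// Sketch of "tell the idle code when the keep-alive is sent". The type and
// method names are invented for illustration, not the actual neqo-transport
// types. Once a keep-alive PING is outstanding, the shorter deadline stops
// being reported; its retransmission is handled by normal loss recovery.
struct IdleSketch {
    timeout: Duration,
    last_activity: Instant,
    keep_alive_outstanding: bool,
}

impl IdleSketch {
    fn keep_alive_sent(&mut self) {
        self.keep_alive_outstanding = true;
    }

    fn packet_received(&mut self, now: Instant) {
        self.last_activity = now;
        self.keep_alive_outstanding = false;
    }

    fn expiry(&self, keep_alive: bool) -> Instant {
        // Report the shorter keep-alive deadline only while no keep-alive
        // PING is already in flight.
        let delay = if keep_alive && !self.keep_alive_outstanding {
            self.timeout / 2
        } else {
            self.timeout
        };
        self.last_activity + delay
    }
}

fn main() {
    let start = Instant::now();
    let mut idle = IdleSketch {
        timeout: Duration::from_secs(30),
        last_activity: start,
        keep_alive_outstanding: false,
    };

    // Before the keep-alive goes out, the shorter deadline is reported.
    assert_eq!(idle.expiry(true), start + Duration::from_secs(15));

    // After the keep-alive PING is sent, only the full idle timeout remains,
    // so the reported deadline can no longer drift behind `now`.
    idle.keep_alive_sent();
    assert_eq!(idle.expiry(true), start + Duration::from_secs(30));
}
```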

@KershawChang (Collaborator, Author)

> Thinking on this more, I think that we probably need to consider tweaks to the expiry() function and how it works. What we want is to have it track when keep-alive packets were sent and either provide updated estimates for the next timeout based on that or simply switch to reporting the keep_alive == false timeout when a keep-alive has been sent.
>
> My understanding is that we'll retransmit the keep-alive using the usual logic for lost packets (PTO, etc.), so there is no need to continue to report the keep-alive time once we've initiated the keep-alive.
>
> So maybe the idle code needs to be told when the keep-alive is sent so that it can stop reporting the shorter timeout.

Thanks. I'll think about this.

@larseggert linked an issue on Feb 1, 2024 that may be closed by this pull request
@larseggert (Collaborator)

@KershawChang do we have a way forward for this PR?

@larseggert (Collaborator)

Making this a draft PR until it becomes reviewable again.

@KershawChang (Collaborator, Author)

> Thinking on this more, I think that we probably need to consider tweaks to the expiry() function and how it works. What we want is to have it track when keep-alive packets were sent and either provide updated estimates for the next timeout based on that or simply switch to reporting the keep_alive == false timeout when a keep-alive has been sent.
>
> My understanding is that we'll retransmit the keep-alive using the usual logic for lost packets (PTO, etc.), so there is no need to continue to report the keep-alive time once we've initiated the keep-alive.
>
> So maybe the idle code needs to be told when the keep-alive is sent so that it can stop reporting the shorter timeout.

@martinthomson, I hope I understand this comment correctly. I think we can make expiry always return the idle timeout and introduce another function, keep_alive_timeout, that is only used for sending keep-alives.

Note that I didn't fix the test failures in this PR. I'll fix them if you think this is the right approach. Thanks.
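
For illustration, a sketch of the split described above, using hypothetical names and a made-up "half the idle timeout" rule (not the code that actually landed in the PR): expiry always reports the full idle timeout, while a separate keep_alive_timeout is consulted only for scheduling the PING and reports nothing once a keep-alive is already outstanding.

```rust
use std::time::{Duration, Instant};

// Sketch of the split: expiry() always reports the full idle timeout
// (deciding when the connection is dead), while keep_alive_timeout() is
// consulted only for scheduling the PING and returns None once a keep-alive
// is already outstanding. Names and the "half the idle timeout" rule are
// illustrative assumptions.
struct IdleSketch {
    timeout: Duration,
    last_activity: Instant,
    keep_alive_outstanding: bool,
}

impl IdleSketch {
    // Connection-death deadline; no longer depends on keep-alive state.
    fn expiry(&self) -> Instant {
        self.last_activity + self.timeout
    }

    // When to send a keep-alive PING, if one still needs to be scheduled.
    fn keep_alive_timeout(&self) -> Option<Instant> {
        if self.keep_alive_outstanding {
            None
        } else {
            Some(self.last_activity + self.timeout / 2)
        }
    }
}

fn main() {
    let start = Instant::now();
    let idle = IdleSketch {
        timeout: Duration::from_secs(30),
        last_activity: start,
        keep_alive_outstanding: true,
    };

    // The idle deadline is stable regardless of keep-alive state, and the
    // keep-alive timer simply disappears once the PING has been sent.
    assert_eq!(idle.expiry(), start + Duration::from_secs(30));
    assert!(idle.keep_alive_timeout().is_none());
}
```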

github-actions bot commented May 8, 2024

Failed Interop Tests: QUIC Interop Runner, client vs. server (all results)

Succeeded Interop Tests: QUIC Interop Runner, client vs. server

Unsupported Interop Tests: QUIC Interop Runner, client vs. server

github-actions bot commented May 8, 2024

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

@martinthomson (Member)

I'm seeing some build failures here that are probably worth looking into.

@KershawChang (Collaborator, Author)

> I'm seeing some build failures here that are probably worth looking into.

@martinthomson , I've fixed those test failures.
Could you take a look again? Thanks.

@larseggert (Collaborator) commented May 13, 2024

@KershawChang I merged this into #1875 locally, but the idle_timeout_crazy_rtt test still asserts on earliest > now. FWIW.

github-actions bot commented May 13, 2024

Benchmark results

Performance differences relative to a71e43d.

  • coalesce_acked_from_zero 1+1 entries
    time: [197.14 ns 197.61 ns 198.14 ns]
    change: [-0.3783% -0.0070% +0.3929%] (p = 0.97 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 3+1 entries
    time: [238.71 ns 239.35 ns 240.00 ns]
    change: [-0.8117% -0.4749% -0.1197%] (p = 0.01 < 0.05)
    Change within noise threshold.

  • coalesce_acked_from_zero 10+1 entries
    time: [237.64 ns 238.43 ns 239.41 ns]
    change: [-2.9261% -1.1705% +0.1530%] (p = 0.16 > 0.05)
    No change in performance detected.

  • coalesce_acked_from_zero 1000+1 entries
    time: [216.83 ns 216.95 ns 217.11 ns]
    change: [-8.3337% -2.7851% +0.5132%] (p = 0.42 > 0.05)
    No change in performance detected.

  • RxStreamOrderer::inbound_frame()
    time: [119.52 ms 119.57 ms 119.63 ms]
    change: [-0.8905% -0.6921% -0.5618%] (p = 0.00 < 0.05)
    Change within noise threshold.

  • transfer/Run multiple transfers with varying seeds
    time: [120.95 ms 121.24 ms 121.53 ms]
    thrpt: [32.913 MiB/s 32.991 MiB/s 33.071 MiB/s]
    change:
    time: [+0.5202% +0.8598% +1.2050%] (p = 0.00 < 0.05)
    thrpt: [-1.1907% -0.8524% -0.5175%]
    Change within noise threshold.

  • transfer/Run multiple transfers with the same seed
    time: [120.79 ms 121.05 ms 121.32 ms]
    thrpt: [32.971 MiB/s 33.043 MiB/s 33.116 MiB/s]
    change:
    time: [+0.4764% +0.7911% +1.1141%] (p = 0.00 < 0.05)
    thrpt: [-1.1019% -0.7849% -0.4741%]
    Change within noise threshold.

  • 1-conn/1-100mb-resp (aka. Download)/client
    time: [1.0832 s 1.1076 s 1.1362 s]
    thrpt: [88.012 MiB/s 90.284 MiB/s 92.323 MiB/s]
    change:
    time: [-2.5241% +1.2245% +5.0643%] (p = 0.56 > 0.05)
    thrpt: [-4.8202% -1.2097% +2.5894%]
    No change in performance detected.

  • 1-conn/10_000-parallel-1b-resp (aka. RPS)/client
    time: [388.67 ms 392.02 ms 395.34 ms]
    thrpt: [25.295 Kelem/s 25.509 Kelem/s 25.729 Kelem/s]
    change:
    time: [-1.4500% -0.3464% +0.8658%] (p = 0.54 > 0.05)
    thrpt: [-0.8584% +0.3476% +1.4714%]
    No change in performance detected.

  • 1-conn/1-1b-resp (aka. HPS)/client
    time: [41.820 ms 41.946 ms 42.094 ms]
    thrpt: [23.756 elem/s 23.840 elem/s 23.912 elem/s]
    change:
    time: [-0.2421% +0.1965% +0.6379%] (p = 0.39 > 0.05)
    thrpt: [-0.6339% -0.1961% +0.2426%]
    No change in performance detected.

Client/server transfer results

Transfer of 134217728 bytes over loopback.

| Client | Server | CC    | Pacing | Mean [ms]       | Min [ms] | Max [ms] | Relative |
|--------|--------|-------|--------|-----------------|----------|----------|----------|
| msquic | msquic |       |        | 877.5 ± 360.9   | 442.6    | 1413.8   | 1.00     |
| neqo   | msquic | reno  | on     | 1075.3 ± 262.4  | 800.8    | 1625.8   | 1.00     |
| neqo   | msquic | reno  |        | 1121.5 ± 285.2  | 777.8    | 1418.7   | 1.00     |
| neqo   | msquic | cubic | on     | 976.0 ± 241.0   | 751.9    | 1388.0   | 1.00     |
| neqo   | msquic | cubic |        | 956.8 ± 199.1   | 736.4    | 1285.5   | 1.00     |
| msquic | neqo   | reno  | on     | 4417.7 ± 228.7  | 4246.4   | 4845.0   | 1.00     |
| msquic | neqo   | reno  |        | 4151.5 ± 43.8   | 4069.3   | 4196.6   | 1.00     |
| msquic | neqo   | cubic | on     | 4022.3 ± 96.9   | 3974.3   | 4292.6   | 1.00     |
| msquic | neqo   | cubic |        | 4031.9 ± 190.2  | 3818.2   | 4512.9   | 1.00     |
| neqo   | neqo   | reno  | on     | 3299.3 ± 297.7  | 2976.8   | 3811.5   | 1.00     |
| neqo   | neqo   | reno  |        | 3314.4 ± 162.9  | 3074.6   | 3624.3   | 1.00     |
| neqo   | neqo   | cubic | on     | 3849.1 ± 139.4  | 3597.3   | 4143.1   | 1.00     |
| neqo   | neqo   | cubic |        | 3797.4 ± 186.9  | 3591.5   | 4295.8   | 1.00     |

⬇️ Download logs

@martinthomson (Member) left a comment

I'm happy with this. Want to flip out of draft state?

Think about the comment below, but I'll defer to your judgment on the resolution once you've worked through it.

@larseggert marked this pull request as ready for review on May 14, 2024 at 05:56

codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.20%. Comparing base (e44c472) to head (b0c176c).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1491      +/-   ##
==========================================
- Coverage   93.20%   93.20%   -0.01%     
==========================================
  Files         111      110       -1     
  Lines       36025    35763     -262     
==========================================
- Hits        33578    33333     -245     
+ Misses       2447     2430      -17     

☔ View full report in Codecov by Sentry.

@KershawChang (Collaborator, Author)

Please take another look at the keep_alive_with_ack_eliciting_packet_lost test.
Thanks.

@KershawChang added this pull request to the merge queue on May 16, 2024
Merged via the queue into mozilla:main with commit cb343da on May 16, 2024
52 of 53 checks passed
@KershawChang deleted the h3_upload_test branch on May 16, 2024 at 09:46
Development

Successfully merging this pull request may close these issues.

assertion failed: earliest > now
3 participants