Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Black hole detection false-positives #1855

Closed
Ralith opened this issue May 8, 2024 · 5 comments · Fixed by #1858
Closed

Black hole detection false-positives #1855

Ralith opened this issue May 8, 2024 · 5 comments · Fixed by #1858
Labels
bug Something isn't working

Comments

@Ralith
Copy link
Collaborator

Ralith commented May 8, 2024

Black hole detection triggers when the number of "suspicious loss bursts" since the last MTU-sized packet exceeds a threshold:

fn on_non_probe_acked(&mut self, current_mtu: u16, packet_number: u64, packet_bytes: u16) {
// Reset the black hole counter if a packet the size of the current MTU or larger
// has been acknowledged
if packet_bytes >= current_mtu
&& self
.largest_acked_mtu_sized_packet
.map_or(true, |pn| packet_number > pn)
{
self.suspicious_loss_bursts = 0;
self.largest_acked_mtu_sized_packet = Some(packet_number);
}
}

Due to approximations in packet assembly, we (almost?) never actually produce an exactly MTU-sized packet. As a result, any loss rate of packets larger than the minimum threshold to be suspicious will eventually trigger black hole detection, resulting in a spurious MTU reduction and loss of throughput.

We should relax the conditions for resetting the counter. For example, if we've confirmed delivery of a packet of a certain size, loss bursts containing packets of equal or smaller size shouldn't be considered suspicious any more.

@Ralith Ralith added the bug Something isn't working label May 8, 2024
@Ralith
Copy link
Collaborator Author

Ralith commented May 9, 2024

Design proposal:

A loss burst is said to be suspicious (possibly caused by an MTU reduction) if its smallest packet is larger than any more recently acknowledged packet. Packets are deemed lost when at most one packet threshold (typically set to 3) worth of later packets have been acknowledged. We therefore need to store only at most one packet threshold worth of packet sizes, those of the most recently acknowledged packets not preceding a loss burst, to judge that a future loss burst might be caused by an MTU reduction. Past loss bursts are forgotten immediately upon acknowledgement of a larger packet than their smallest. If the number of past loss bursts exceeds a threshold, then we report a black hole.

@djc
Copy link
Collaborator

djc commented May 9, 2024

Sounds okay. Should we run this by quicdev or something?

Edit: ah, I saw you already did. Should have checked first. 😄

@Ralith
Copy link
Collaborator Author

Ralith commented May 9, 2024

Some thoughts from that discussion:

  • Some paths don't have a stable MTU, e.g. ones that internally load-balance over heterogeneous paths at a lower level. These are very difficult to detect, perhaps best dealt with manually (complain to your network administrator and/or configure a maximum MTU).
  • Clamping MTU to the largest packet successfully delivered in some fixed window is safe, but prone to ratcheting downwards with application datagrams.
  • Probing regularly and/or when suspicious loss occurs helps provide a source of truth when application traffic isn't so obliging, but isn't free.

Still seems to me like the approach I sketched above is an attractive sweet spot.

@djc
Copy link
Collaborator

djc commented May 12, 2024

A loss burst is said to be suspicious (possibly caused by an MTU reduction) if its smallest packet is larger than any more recently acknowledged packet.

In reviewing #1858, I'm wondering if this holds? If the MTU reduces, why would we only consider losses to be suspicious only if the smallest packet is larger than any more recently acknowledged packet? The MTU could have decreased past whatever was recently acknowledged, right?

@Ralith
Copy link
Collaborator Author

Ralith commented May 13, 2024

When a packet is acknowledged, we know that the MTU is at least that packet's size. If a loss burst includes packets that are smaller than a packet that was transmitted later and which has been acknowledged, then we know at least some of the packets in that burst were not dropped for MTU reasons, so most likely the entire burst was lost due to congestion or link errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants