Black hole detection false-positives #1855

Ralith · 2024-05-08T22:15:17Z

Black hole detection triggers when the number of "suspicious loss bursts" since the last MTU-sized packet exceeds a threshold:

quinn/quinn-proto/src/connection/mtud.rs

Lines 398 to 409 in a35ad73

    
           fn on_non_probe_acked(&mut self, current_mtu: u16, packet_number: u64, packet_bytes: u16) { 
        
               // Reset the black hole counter if a packet the size of the current MTU or larger 
        
               // has been acknowledged 
        
               if packet_bytes >= current_mtu 
        
                   && self 
        
                       .largest_acked_mtu_sized_packet 
        
                       .map_or(true, |pn| packet_number > pn) 
        
               { 
        
                   self.suspicious_loss_bursts = 0; 
        
                   self.largest_acked_mtu_sized_packet = Some(packet_number); 
        
               } 
        
           }

Due to approximations in packet assembly, we (almost?) never actually produce an exactly MTU-sized packet. As a result, any loss rate of packets larger than the minimum threshold to be suspicious will eventually trigger black hole detection, resulting in a spurious MTU reduction and loss of throughput.

We should relax the conditions for resetting the counter. For example, if we've confirmed delivery of a packet of a certain size, loss bursts containing packets of equal or smaller size shouldn't be considered suspicious any more.

Ralith · 2024-05-09T03:36:12Z

Design proposal:

A loss burst is said to be suspicious (possibly caused by an MTU reduction) if its smallest packet is larger than any more recently acknowledged packet. Packets are deemed lost when at most one packet threshold (typically set to 3) worth of later packets have been acknowledged. We therefore need to store only at most one packet threshold worth of packet sizes, those of the most recently acknowledged packets not preceding a loss burst, to judge that a future loss burst might be caused by an MTU reduction. Past loss bursts are forgotten immediately upon acknowledgement of a larger packet than their smallest. If the number of past loss bursts exceeds a threshold, then we report a black hole.

djc · 2024-05-09T12:16:55Z

Sounds okay. Should we run this by quicdev or something?

Edit: ah, I saw you already did. Should have checked first. 😄

Ralith · 2024-05-09T18:03:30Z

Some thoughts from that discussion:

Some paths don't have a stable MTU, e.g. ones that internally load-balance over heterogeneous paths at a lower level. These are very difficult to detect, perhaps best dealt with manually (complain to your network administrator and/or configure a maximum MTU).
Clamping MTU to the largest packet successfully delivered in some fixed window is safe, but prone to ratcheting downwards with application datagrams.
Probing regularly and/or when suspicious loss occurs helps provide a source of truth when application traffic isn't so obliging, but isn't free.

Still seems to me like the approach I sketched above is an attractive sweet spot.

djc · 2024-05-12T09:36:51Z

A loss burst is said to be suspicious (possibly caused by an MTU reduction) if its smallest packet is larger than any more recently acknowledged packet.

In reviewing #1858, I'm wondering if this holds? If the MTU reduces, why would we only consider losses to be suspicious only if the smallest packet is larger than any more recently acknowledged packet? The MTU could have decreased past whatever was recently acknowledged, right?

Ralith · 2024-05-13T03:37:39Z

When a packet is acknowledged, we know that the MTU is at least that packet's size. If a loss burst includes packets that are smaller than a packet that was transmitted later and which has been acknowledged, then we know at least some of the packets in that burst were not dropped for MTU reasons, so most likely the entire burst was lost due to congestion or link errors.

Ralith added the bug Something isn't working label May 8, 2024

Ralith mentioned this issue May 10, 2024

Reduce black hole detection false positives #1858

Merged

3 tasks

Ralith closed this as completed in #1858 May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Black hole detection false-positives #1855

Black hole detection false-positives #1855

Ralith commented May 8, 2024

Ralith commented May 9, 2024 •

edited

djc commented May 9, 2024 •

edited

Ralith commented May 9, 2024

djc commented May 12, 2024

Ralith commented May 13, 2024

Black hole detection false-positives #1855

Black hole detection false-positives #1855

Comments

Ralith commented May 8, 2024

Ralith commented May 9, 2024 • edited

djc commented May 9, 2024 • edited

Ralith commented May 9, 2024

djc commented May 12, 2024

Ralith commented May 13, 2024

Ralith commented May 9, 2024 •

edited

djc commented May 9, 2024 •

edited