fix(proto): retry off-path NAT traversal probes and retire stale CIDs #524
dignifiedquire merged 14 commits into main
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/524/docs/noq/
Last updated: 2026-04-08T10:45:45Z
Performance Comparison Report
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5342.1 Mbps | 7898.0 Mbps | -32.4% | 94.5% / 106.0% |
| medium-concurrent | 5413.6 Mbps | 7842.3 Mbps | -31.0% | 91.5% / 96.8% |
| medium-single | 4146.8 Mbps | 4749.5 Mbps | -12.7% | 95.7% / 109.0% |
| small-concurrent | 3867.4 Mbps | 5327.1 Mbps | -27.4% | 96.9% / 109.0% |
| small-single | 3615.0 Mbps | 4799.4 Mbps | -24.7% | 92.9% / 109.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3086.7 Mbps | 3953.1 Mbps | -21.9% |
| lan | 782.4 Mbps | 810.4 Mbps | -3.5% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 25.7% slower on average
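The per-row Delta values in these tables appear consistent with a relative difference against upstream. A minimal sketch under that assumption follows; the report generator itself is not shown in this thread, and both function names are hypothetical. How the "slower on average" summary figure is aggregated is not stated, so it is not modelled here.

```rust
/// Relative throughput difference of noq against upstream, in percent.
/// Negative values mean noq is slower. (Assumed formula, not taken from the bot.)
fn delta_percent(noq_mbps: f64, upstream_mbps: f64) -> f64 {
    (noq_mbps - upstream_mbps) / upstream_mbps * 100.0
}

/// Rounds to one decimal place, matching the precision shown in the tables.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

For the large-single row above, `round1(delta_percent(5342.1, 7898.0))` gives -32.4, which matches the reported Delta.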
689cc33f78e3daa7bdb4f4a4e03f054b6e8be2b1 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5453.9 Mbps | 7820.3 Mbps | -30.3% | 93.6% / 98.1% |
| medium-concurrent | 5332.9 Mbps | 7668.2 Mbps | -30.5% | 93.9% / 98.5% |
| medium-single | 3740.4 Mbps | 4189.2 Mbps | -10.7% | 98.3% / 134.0% |
| small-concurrent | 3759.5 Mbps | 5143.5 Mbps | -26.9% | 98.7% / 138.0% |
| small-single | 3377.7 Mbps | 4461.2 Mbps | -24.3% | 86.9% / 96.5% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3035.4 Mbps | 3925.2 Mbps | -22.7% |
| lan | 782.4 Mbps | 800.1 Mbps | -2.2% |
| lossy | 69.8 Mbps | 69.9 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 25.0% slower on average
c200a40c22e0c3a3a8369676584d59aabf9278a4 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5434.2 Mbps | 8012.9 Mbps | -32.2% | 91.5% / 96.9% |
| medium-concurrent | 5368.7 Mbps | 7864.6 Mbps | -31.7% | 88.4% / 95.9% |
| medium-single | 3560.1 Mbps | 4544.2 Mbps | -21.7% | 99.5% / 188.0% |
| small-concurrent | 3874.1 Mbps | 5200.6 Mbps | -25.5% | 97.7% / 127.0% |
| small-single | 3340.8 Mbps | 4721.8 Mbps | -29.2% | 85.6% / 96.1% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3145.4 Mbps | 3665.7 Mbps | -14.2% |
| lan | 782.4 Mbps | 796.4 Mbps | -1.8% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 26.6% slower on average
2d783facd08ff1700b8ff62e17a917613c6faf80 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5533.1 Mbps | 7890.4 Mbps | -29.9% | 97.6% / 98.9% |
| medium-concurrent | 5479.5 Mbps | 7794.6 Mbps | -29.7% | 97.6% / 100.0% |
| medium-single | 4124.9 Mbps | 4676.2 Mbps | -11.8% | 95.8% / 98.3% |
| small-concurrent | 4002.0 Mbps | 5180.5 Mbps | -22.7% | 97.4% / 99.6% |
| small-single | 3614.8 Mbps | 4746.3 Mbps | -23.8% | 96.0% / 98.5% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 2863.6 Mbps | 3615.9 Mbps | -20.8% |
| lan | 777.9 Mbps | 796.5 Mbps | -2.3% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 23.8% slower on average
0ae9b27dd4324c33ca3894be802df6e26080922f - artifacts
No results available
e57489f044cf9adabb59fd21e99c32ba6e1366c9 - artifacts
No results available
ea0c0c430b08f848cef5ae8c36c94f49504c8462 - artifacts
No results available
dea8f18b812551a4f1ed25a37a638d4573694fcd - artifacts
No results available
4c4aabdf3322a533b905b730f5cd50064ffc9c6d - artifacts
No results available
34eb5667b5b3c528ec4581b2ac74eda68e3b07a9 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5501.2 Mbps | 8044.0 Mbps | -31.6% | 95.8% / 106.0% |
| medium-concurrent | 5426.1 Mbps | 7742.1 Mbps | -29.9% | 96.3% / 108.0% |
| medium-single | 3964.9 Mbps | 4749.3 Mbps | -16.5% | 92.2% / 106.0% |
| small-concurrent | 3813.8 Mbps | 5431.3 Mbps | -29.8% | 95.0% / 109.0% |
| small-single | 3491.5 Mbps | 4911.5 Mbps | -28.9% | 88.8% / 97.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3072.8 Mbps | 3911.2 Mbps | -21.4% |
| lan | 782.4 Mbps | 810.4 Mbps | -3.5% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 26.7% slower on average
0aa51b888852a0e3088e3e3ed3864c3f51ff4c25 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5362.8 Mbps | 7984.5 Mbps | -32.8% | 97.5% / 163.0% |
| medium-concurrent | 5456.5 Mbps | 7625.1 Mbps | -28.4% | 95.1% / 109.0% |
| medium-single | 3873.5 Mbps | 4189.2 Mbps | -7.5% | 90.7% / 98.0% |
| small-concurrent | 3954.2 Mbps | 5151.0 Mbps | -23.2% | 93.9% / 110.0% |
| small-single | 3633.4 Mbps | 4343.6 Mbps | -16.4% | 89.1% / 97.5% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3130.8 Mbps | 3687.3 Mbps | -15.1% |
| lan | 782.4 Mbps | 796.4 Mbps | -1.8% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 22.3% slower on average
890355a622787ba2b44ec9b6dcc510fb474d07d6 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5664.5 Mbps | 7749.1 Mbps | -26.9% | 94.2% / 100.0% |
| medium-concurrent | 5420.0 Mbps | 7775.9 Mbps | -30.3% | 93.3% / 99.3% |
| medium-single | 3740.2 Mbps | 4749.5 Mbps | -21.3% | 91.5% / 99.1% |
| small-concurrent | 3920.3 Mbps | 5385.5 Mbps | -27.2% | 95.0% / 124.0% |
| small-single | 3520.1 Mbps | 4715.6 Mbps | -25.4% | 92.3% / 102.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3030.3 Mbps | N/A | N/A |
| lan | 782.4 Mbps | N/A | N/A |
| lossy | 69.8 Mbps | N/A | N/A |
| wan | 83.8 Mbps | N/A | N/A |
Summary
noq is 26.7% slower on average
65807f55d5fbe67d601089914accdb17b983d9f6 - artifacts
No results available
6fcbf2eded950b6343c7764f16ae64e1ed40a225 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5656.1 Mbps | 7956.0 Mbps | -28.9% | 94.2% / 108.0% |
| medium-concurrent | 5353.7 Mbps | 7599.4 Mbps | -29.6% | 92.4% / 97.3% |
| medium-single | 4027.2 Mbps | 4469.6 Mbps | -9.9% | 98.6% / 162.0% |
| small-concurrent | 3877.0 Mbps | 5156.4 Mbps | -24.8% | 98.2% / 163.0% |
| small-single | 3542.3 Mbps | 4743.7 Mbps | -25.3% | 93.7% / 111.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | N/A | 4036.8 Mbps | N/A |
| lan | N/A | 810.4 Mbps | N/A |
| lossy | N/A | 55.9 Mbps | N/A |
| wan | N/A | 83.8 Mbps | N/A |
Summary
noq is 25.0% slower on average
58dec50b3a74384b3d19bb32a03d73ed13cf3dfa - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 6027.8 Mbps | 8019.3 Mbps | -24.8% | 97.3% / 98.8% |
| medium-concurrent | 6145.9 Mbps | 7589.2 Mbps | -19.0% | 97.2% / 100.0% |
| medium-single | 4124.4 Mbps | 4571.8 Mbps | -9.8% | 97.1% / 99.5% |
| small-concurrent | 3979.0 Mbps | 5257.2 Mbps | -24.3% | 96.9% / 99.4% |
| small-single | 3622.9 Mbps | 4746.4 Mbps | -23.7% | 96.6% / 98.5% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3142.6 Mbps | 4022.9 Mbps | -21.9% |
| lan | 782.5 Mbps | 810.4 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 20.4% slower on average
3cbd18a6de6737ee42bd140766378d88856e9266 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5389.2 Mbps | 7703.7 Mbps | -30.0% | 98.1% / 160.0% |
| medium-concurrent | 5349.9 Mbps | 7097.3 Mbps | -24.6% | 95.9% / 105.0% |
| medium-single | 4253.5 Mbps | 4361.7 Mbps | -2.5% | 90.4% / 98.5% |
| small-concurrent | 3817.8 Mbps | 5108.8 Mbps | -25.3% | 95.0% / 109.0% |
| small-single | 3514.3 Mbps | 4380.6 Mbps | -19.8% | 88.8% / 96.8% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3106.7 Mbps | N/A | N/A |
| lan | 782.4 Mbps | N/A | N/A |
| lossy | 69.9 Mbps | N/A | N/A |
| wan | 83.8 Mbps | N/A | N/A |
Summary
noq is 22.1% slower on average
fd9a7f5f09eb463690cd263173d467bc378add4a - artifacts
No results available
8e6d40577fd48e7b6e95c2761f991c7ae539124c - artifacts
No results available
a3375242bf9c7333e077eb7621125a9e14c529c4 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5390.7 Mbps | 7908.7 Mbps | -31.8% | 96.7% / 131.0% |
| medium-concurrent | 5334.0 Mbps | 7532.6 Mbps | -29.2% | 90.5% / 96.7% |
| medium-single | 3757.0 Mbps | 4745.6 Mbps | -20.8% | 91.8% / 101.0% |
| small-concurrent | 3865.7 Mbps | 5229.2 Mbps | -26.1% | 91.7% / 99.3% |
| small-single | 3371.7 Mbps | 4817.0 Mbps | -30.0% | 86.7% / 96.5% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3198.5 Mbps | 4022.8 Mbps | -20.5% |
| lan | 782.4 Mbps | 810.4 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 26.6% slower on average
c200a40 to 2d783fa
ea0c0c4 to dea8f18
0aa51b8 to 890355a
divagant-martian
left a comment
This is a partial review but I found things important enough for a partial review
@@ -1,6 +1,6 @@
use bytes::{BufMut, BytesMut};
use proptest::{prelude::*, prop_assert_ne};
another one to revert
(Removed myself from review because it is still in draft. Please request again once ready.)
Off-path probes were fire-and-forget: sent once per address per round with no retry. This broke simultaneous-open NAT traversal because the first probe is typically dropped (the peer's NAT mapping doesn't exist yet when the probe arrives).
Changes:
- Retry off-path probes up to 10 times (once per PTO firing)
- Track per-probe CID so retries reuse the same CID (RFC 9000 §9.5 compliant: same CID to same remote address)
- New OffPathProbeRetry connection timer drives retransmission
- On new NAT traversal round, retire CIDs from old round's failed probes to prevent CID exhaustion (#410)
Fixes #410
Relates to #376
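The retry behaviour in this commit message can be sketched roughly as follows. This is a simplified model, not the PR's code: `Probe`, `ProbeRetries`, `on_retry_timer`, `on_response` and `MAX_PROBE_ATTEMPTS` are hypothetical names, and the CID is reduced to a bare sequence number. It shows each timer firing resending every unanswered probe with the CID it was first sent with, until the attempt cap is hit.

```rust
use std::net::SocketAddr;

/// Assumed cap, taken from "up to 10 times" in the commit message.
const MAX_PROBE_ATTEMPTS: u8 = 10;

/// Hypothetical per-probe state: the CID is chosen once and reused on retries.
#[derive(Debug, Clone, PartialEq)]
struct Probe {
    remote: SocketAddr,
    cid_seq: u64,
    attempts: u8,
}

#[derive(Default)]
struct ProbeRetries {
    probes: Vec<Probe>,
}

impl ProbeRetries {
    /// Registers a probe target together with the CID reserved for it.
    fn add(&mut self, remote: SocketAddr, cid_seq: u64) {
        self.probes.push(Probe { remote, cid_seq, attempts: 0 });
    }

    /// Called when the retry timer fires: returns the (remote, CID) pairs to
    /// send now and counts one attempt for each of them.
    fn on_retry_timer(&mut self) -> Vec<(SocketAddr, u64)> {
        let mut to_send = Vec::new();
        for p in self.probes.iter_mut().filter(|p| p.attempts < MAX_PROBE_ATTEMPTS) {
            p.attempts += 1;
            to_send.push((p.remote, p.cid_seq));
        }
        to_send
    }

    /// A probe that got a response no longer needs retries.
    fn on_response(&mut self, remote: SocketAddr) {
        self.probes.retain(|p| p.remote != remote);
    }

    /// True while at least one probe still has attempts left.
    fn has_pending_retries(&self) -> bool {
        self.probes.iter().any(|p| p.attempts < MAX_PROBE_ATTEMPTS)
    }
}
```

In this model the timer would only be re-armed while `has_pending_retries()` returns true, so a round whose probes have all been answered or exhausted stops generating traffic.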
handle_reach_out may silently ignore frames (old round, unsupported IP family) without advancing the round. Compare current_round() before and after to avoid clearing valid ongoing probes.
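A minimal sketch of the before/after comparison described in this commit, assuming a toy `ServerState` with only a round counter and a list of in-flight probes. The struct layout, the `on_reach_out` wrapper and the choice of IPv6 as the "unsupported" family are illustrative assumptions; only `handle_reach_out` and `current_round` are names from the commit message.

```rust
use std::net::SocketAddr;

/// Hypothetical stand-in for the server-side NAT traversal state.
struct ServerState {
    round: u64,
    /// Probes already sent in the current round that may still be retried.
    in_flight: Vec<SocketAddr>,
}

impl ServerState {
    fn current_round(&self) -> u64 {
        self.round
    }

    /// Models the silent-ignore cases: frames from an old round and addresses
    /// of an unsupported family (IPv6, in this toy model).
    fn handle_reach_out(&mut self, frame_round: u64, remote: SocketAddr) {
        if frame_round < self.round || remote.is_ipv6() {
            return; // ignored: the round does not advance
        }
        self.round = frame_round;
    }
}

/// Clears in-flight probes only when the frame actually started a new round.
fn on_reach_out(state: &mut ServerState, frame_round: u64, remote: SocketAddr) {
    let round_before = state.current_round();
    state.handle_reach_out(frame_round, remote);
    if state.current_round() != round_before {
        state.in_flight.clear();
    }
}
```

Clearing unconditionally after every frame would drop valid probes whenever a stale or unsupported frame arrived, which is the failure this commit avoids.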
Move mark_as_sent after PacketBuilder completes so attempt count isn't incremented if packet build fails. Uses new borrow-free next_probe_info/mark_probe_sent API to avoid holding a mutable borrow across the packet build.
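The two-step API from this commit might look something like the sketch below. Only the names `next_probe_info` and `mark_probe_sent` come from the commit message; their signatures, the `ProbeInfo` and `Probe` types, the two queues and the `send_next` helper are assumptions for illustration. The packet build is modelled as a closure that returns false on failure.

```rust
use std::collections::VecDeque;
use std::net::SocketAddr;

/// Plain-data copy of what the packet builder needs (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct ProbeInfo {
    remote: SocketAddr,
    cid_seq: u64,
}

struct Probe {
    info: ProbeInfo,
    attempts: u8,
}

struct ServerState {
    to_send: VecDeque<Probe>, // queued for (re)transmission
    awaiting: Vec<Probe>,     // sent, waiting for a response or the retry timer
}

impl ServerState {
    /// Returns a copy of the next probe's data, so no borrow of `self`
    /// is held while the packet is being built.
    fn next_probe_info(&self) -> Option<ProbeInfo> {
        self.to_send.front().map(|p| p.info)
    }

    /// Counts the attempt, but only once the caller knows the packet was built.
    fn mark_probe_sent(&mut self, info: ProbeInfo) {
        if let Some(pos) = self.to_send.iter().position(|p| p.info == info) {
            if let Some(mut probe) = self.to_send.remove(pos) {
                probe.attempts += 1;
                self.awaiting.push(probe);
            }
        }
    }
}

/// Stand-in for the send path: `build_packet` returns false when the build fails.
fn send_next(state: &mut ServerState, build_packet: impl FnOnce(ProbeInfo) -> bool) -> bool {
    let Some(info) = state.next_probe_info() else {
        return false;
    };
    if !build_packet(info) {
        return false; // attempt count stays untouched
    }
    state.mark_probe_sent(info);
    true
}
```

Because `ProbeInfo` is `Copy`, the caller is free to borrow the rest of the connection mutably during the build, which is the point of the borrow-free split.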
…tion queue_retries now returns CIDs from probes that exceeded max attempts, enabling callers to retire them. Currently unused but plumbed for when CidQueue gains a retire-by-CID API.
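As a rough illustration of the return value described here, the sketch below re-queues probes that still have attempts left and hands back the CIDs of the ones that are given up on. The field names, the `MAX_ATTEMPTS` constant and the use of a bare `u64` sequence number for a CID are assumptions; the real `queue_retries` signature is not shown in this thread.

```rust
/// Assumed retry cap.
const MAX_ATTEMPTS: u8 = 10;

struct Probe {
    cid_seq: u64,
    attempts: u8,
}

struct ServerState {
    awaiting: Vec<Probe>, // sent, no response yet
    to_send: Vec<Probe>,  // queued for retransmission
}

impl ServerState {
    /// Moves probes with attempts left back to the send queue and returns the
    /// CID sequence numbers of probes that exceeded the cap, so that a caller
    /// could retire them.
    fn queue_retries(&mut self) -> Vec<u64> {
        let mut exhausted = Vec::new();
        for probe in self.awaiting.drain(..) {
            if probe.attempts >= MAX_ATTEMPTS {
                exhausted.push(probe.cid_seq);
            } else {
                self.to_send.push(probe);
            }
        }
        exhausted
    }
}
```

Returning the exhausted CIDs rather than retiring them inside the method keeps the probe state independent of the CID queue, which fits the note that the result is currently unused.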
65807f5 to 6fcbf2e
divagant-martian
left a comment
This seems ok on my end; we can work on the other changes later.
flub
left a comment
No serious objection, only some nits.
if let Ok(server_state) = self.n0_nat_traversal.server_side_mut()
    && server_state.has_pending_retries()
{
    let pto = self.pto(SpaceKind::Data, path_id);
Using the on-path PTO here is strange. There's no reason at all that this is a relevant duration; it's the equivalent of making this a fairly random duration.
Probably better off using the PTO-base of the configured initial RTT?
Changed. Can you please check that the initial one I am now using is calculated correctly?
{
    let pto = self.pto(SpaceKind::Data, path_id);
    self.timers.set(
        Timer::Conn(ConnTimer::OffPathProbeRetry),
Could we name this timer NatTraversalProbeRetry? Off-path is going to confuse me at some point, since not all off-path probes are for NAT traversal. But IIUC this timer is more specific than that.
let initial_pto = RttEstimator::new(self.config.initial_rtt).pto_base()
    + self.ack_frequency.max_ack_delay_for_pto();
You should not do the + max_ack_delay since the probes must be responded to immediately.
Suggested change:
- let initial_pto = RttEstimator::new(self.config.initial_rtt).pto_base()
-     + self.ack_frequency.max_ack_delay_for_pto();
+ let initial_pto = RttEstimator::new(self.config.initial_rtt).pto_base();
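For a sense of what this interval works out to, here is a small worked sketch of the PTO base before any RTT sample, following RFC 9002 (smoothed RTT starts at the initial RTT, RTT variance at half of it, and the `max_ack_delay` term is left out as suggested above). Whether `RttEstimator::pto_base()` in this codebase computes exactly this is an assumption; `initial_pto_base` and `TIMER_GRANULARITY` are illustrative names.

```rust
use std::time::Duration;

/// Timer granularity recommended by RFC 9002 (kGranularity).
const TIMER_GRANULARITY: Duration = Duration::from_millis(1);

/// PTO without the max_ack_delay term, for an estimator with no RTT samples:
/// smoothed_rtt = initial_rtt and rttvar = initial_rtt / 2.
fn initial_pto_base(initial_rtt: Duration) -> Duration {
    let smoothed_rtt = initial_rtt;
    let rttvar = initial_rtt / 2;
    smoothed_rtt + std::cmp::max(4 * rttvar, TIMER_GRANULARITY)
}
```

With RFC 9002's recommended initial RTT of 333 ms this gives 333 ms + 4 × 166.5 ms = 999 ms, so under these assumptions retries would be spaced roughly one second apart unless `initial_rtt` is configured differently.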
Description
Off-path NAT traversal probes were fire-and-forget: sent once per address per round with no retry. This broke simultaneous-open NAT traversal because the first probe is typically dropped (the peer's NAT mapping doesn't exist yet when the probe arrives).
Now probes are retransmitted up to 10 times at initial-RTT PTO intervals, with fresh CIDs reserved for each attempt.
Fixes #410, relates to #376
Changes
- ServerState tracks per-probe attempt count via ProbeState
- NatTraversalProbeRetry connection timer fires at initial PTO-base intervals to re-queue probes
References
- PICOQUIC_CHALLENGE_REPEAT_MAX (3) attempts
Notes