fix(proto): retry off-path NAT traversal probes and retire stale CIDs#524

Merged
dignifiedquire merged 14 commits into main from fix/off-path-probe-retry
Apr 8, 2026

@dignifiedquire
Contributor

@dignifiedquire dignifiedquire commented Mar 21, 2026

Description

Off-path NAT traversal probes were fire-and-forget: sent once per address per round with no retry. This broke simultaneous-open NAT traversal because the first probe is typically dropped (the peer's NAT mapping doesn't exist yet when the probe arrives).

Now probes are retransmitted up to 10 times at initial-RTT PTO intervals, with fresh CIDs reserved for each attempt.
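
To illustrate why a single probe is rarely enough, here is a toy timing model. It is an illustrative assumption, not the noq implementation: a probe is taken to pass the peer's NAT only if the peer has already sent its own first probe by the time ours arrives. The function name and all parameters are hypothetical.

```rust
/// Toy model: returns the 1-based attempt that first passes the peer's NAT,
/// or `None` if every attempt arrives before the peer has opened a mapping.
///
/// All parameters are illustrative and expressed in milliseconds.
fn first_successful_attempt(
    peer_first_send_ms: u64,
    one_way_delay_ms: u64,
    retry_interval_ms: u64,
    max_attempts: u32,
) -> Option<u32> {
    (0..max_attempts).find_map(|attempt| {
        let sent_at = u64::from(attempt) * retry_interval_ms;
        let arrives_at = sent_at + one_way_delay_ms;
        // The peer's NAT only admits our packet once the peer has sent
        // something toward us, which is what creates the mapping.
        (arrives_at >= peer_first_send_ms).then_some(attempt + 1)
    })
}
```

In this model, one attempt (the old fire-and-forget behaviour) is lost whenever the peer starts probing later than the one-way delay, while a bounded number of retries lets a later attempt get through.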

Fixes #410, relates to #376

Changes

  • ServerState tracks per-probe attempt count via ProbeState
  • New NatTraversalProbeRetry connection timer fires at initial PTO-base intervals to re-queue probes
  • Each probe reserves a fresh CID (no cross-path CID reuse)
  • On new round, stale off-path challenges are cleared
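
The bookkeeping listed above could look roughly like the following sketch. `ProbeTracker`, its methods, and the fields of `ProbeState` are hypothetical stand-ins; the real `ServerState` and `ProbeState` in noq-proto may be shaped differently, and CID reservation is left out entirely.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Maximum transmissions per probe (the PR description says 10).
const MAX_PROBE_ATTEMPTS: u8 = 10;

/// Hypothetical per-probe bookkeeping.
#[derive(Debug, Default)]
struct ProbeState {
    attempts: u8,
    queued: bool,
}

/// Hypothetical, simplified stand-in for the server-side probe state.
#[derive(Debug, Default)]
struct ProbeTracker {
    probes: HashMap<SocketAddr, ProbeState>,
}

impl ProbeTracker {
    /// Starts a new round: drop state from the old round, queue one probe per address.
    fn start_round(&mut self, addrs: &[SocketAddr]) {
        self.probes.clear();
        for addr in addrs {
            self.probes.insert(*addr, ProbeState { attempts: 0, queued: true });
        }
    }

    /// Addresses whose probe should be sent next.
    fn queued(&self) -> Vec<SocketAddr> {
        self.probes
            .iter()
            .filter(|(_, state)| state.queued)
            .map(|(addr, _)| *addr)
            .collect()
    }

    /// Records that a probe to `addr` was actually sent.
    fn mark_sent(&mut self, addr: SocketAddr) {
        if let Some(state) = self.probes.get_mut(&addr) {
            state.attempts += 1;
            state.queued = false;
        }
    }

    /// Retry timer fired: drop exhausted probes, re-queue the rest.
    fn on_retry_timer(&mut self) {
        self.probes.retain(|_, state| state.attempts < MAX_PROBE_ATTEMPTS);
        for state in self.probes.values_mut() {
            state.queued = true;
        }
    }

    /// True while the retry timer still needs to be armed.
    fn has_pending_retries(&self) -> bool {
        !self.probes.is_empty()
    }
}
```

Driving this with a send step followed by a timer firing, a probe that never gets a response is transmitted exactly `MAX_PROBE_ATTEMPTS` times before it is dropped.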

References

Notes

  • If no reserved CIDs are available, the probe is skipped (not sent with the active CID)
  • Retry timer uses PTO-base from initial RTT without max_ack_delay, since PATH_RESPONSE must be sent immediately
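
As a worked example of the retry interval, the sketch below computes a PTO base from an initial RTT using the formula in RFC 9002 (smoothed RTT plus the larger of four times the RTT variance and the timer granularity), with the variance initialised to half the initial RTT. The function names are hypothetical, and the assumption that noq's `RttEstimator::pto_base()` matches this formula is mine, not confirmed by the PR.

```rust
use std::time::Duration;

/// Timer granularity recommended by RFC 9002 (kGranularity).
const TIMER_GRANULARITY: Duration = Duration::from_millis(1);

/// PTO base before any RTT sample exists:
/// smoothed_rtt = initial_rtt, rttvar = initial_rtt / 2.
fn initial_pto_base(initial_rtt: Duration) -> Duration {
    let smoothed_rtt = initial_rtt;
    let rttvar = initial_rtt / 2;
    smoothed_rtt + std::cmp::max(rttvar * 4, TIMER_GRANULARITY)
}

/// Regular PTO, which additionally waits for the peer's maximum ACK delay.
/// Shown only for contrast with the probe retry interval.
fn pto_with_ack_delay(initial_rtt: Duration, max_ack_delay: Duration) -> Duration {
    initial_pto_base(initial_rtt) + max_ack_delay
}
```

With the RFC 9002 recommended initial RTT of 333 ms, the base works out to 333 ms + 4 × 166.5 ms = 999 ms. Leaving out `max_ack_delay` keeps the retry interval at that value, since a PATH_RESPONSE is not subject to delayed acknowledgement.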

@github-actions

github-actions Bot commented Mar 21, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/524/docs/noq/

Last updated: 2026-04-08T10:45:45Z

@n0bot n0bot Bot added this to iroh Mar 21, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh Mar 21, 2026
@github-actions

github-actions Bot commented Mar 21, 2026

Performance Comparison Report

3132a6b3687256af2ac68b6bb6ed5301fd68583e - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5342.1 Mbps 7898.0 Mbps -32.4% 94.5% / 106.0%
medium-concurrent 5413.6 Mbps 7842.3 Mbps -31.0% 91.5% / 96.8%
medium-single 4146.8 Mbps 4749.5 Mbps -12.7% 95.7% / 109.0%
small-concurrent 3867.4 Mbps 5327.1 Mbps -27.4% 96.9% / 109.0%
small-single 3615.0 Mbps 4799.4 Mbps -24.7% 92.9% / 109.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3086.7 Mbps 3953.1 Mbps -21.9%
lan 782.4 Mbps 810.4 Mbps -3.5%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 25.7% slower on average

---
689cc33f78e3daa7bdb4f4a4e03f054b6e8be2b1 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5453.9 Mbps 7820.3 Mbps -30.3% 93.6% / 98.1%
medium-concurrent 5332.9 Mbps 7668.2 Mbps -30.5% 93.9% / 98.5%
medium-single 3740.4 Mbps 4189.2 Mbps -10.7% 98.3% / 134.0%
small-concurrent 3759.5 Mbps 5143.5 Mbps -26.9% 98.7% / 138.0%
small-single 3377.7 Mbps 4461.2 Mbps -24.3% 86.9% / 96.5%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3035.4 Mbps 3925.2 Mbps -22.7%
lan 782.4 Mbps 800.1 Mbps -2.2%
lossy 69.8 Mbps 69.9 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 25.0% slower on average

---
c200a40c22e0c3a3a8369676584d59aabf9278a4 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5434.2 Mbps 8012.9 Mbps -32.2% 91.5% / 96.9%
medium-concurrent 5368.7 Mbps 7864.6 Mbps -31.7% 88.4% / 95.9%
medium-single 3560.1 Mbps 4544.2 Mbps -21.7% 99.5% / 188.0%
small-concurrent 3874.1 Mbps 5200.6 Mbps -25.5% 97.7% / 127.0%
small-single 3340.8 Mbps 4721.8 Mbps -29.2% 85.6% / 96.1%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3145.4 Mbps 3665.7 Mbps -14.2%
lan 782.4 Mbps 796.4 Mbps -1.8%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 26.6% slower on average

---
2d783facd08ff1700b8ff62e17a917613c6faf80 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5533.1 Mbps 7890.4 Mbps -29.9% 97.6% / 98.9%
medium-concurrent 5479.5 Mbps 7794.6 Mbps -29.7% 97.6% / 100.0%
medium-single 4124.9 Mbps 4676.2 Mbps -11.8% 95.8% / 98.3%
small-concurrent 4002.0 Mbps 5180.5 Mbps -22.7% 97.4% / 99.6%
small-single 3614.8 Mbps 4746.3 Mbps -23.8% 96.0% / 98.5%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 2863.6 Mbps 3615.9 Mbps -20.8%
lan 777.9 Mbps 796.5 Mbps -2.3%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 23.8% slower on average

---
0ae9b27dd4324c33ca3894be802df6e26080922f - artifacts

No results available

---
e57489f044cf9adabb59fd21e99c32ba6e1366c9 - artifacts

No results available

---
ea0c0c430b08f848cef5ae8c36c94f49504c8462 - artifacts

No results available

---
dea8f18b812551a4f1ed25a37a638d4573694fcd - artifacts

No results available

---
4c4aabdf3322a533b905b730f5cd50064ffc9c6d - artifacts

No results available

---
34eb5667b5b3c528ec4581b2ac74eda68e3b07a9 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5501.2 Mbps 8044.0 Mbps -31.6% 95.8% / 106.0%
medium-concurrent 5426.1 Mbps 7742.1 Mbps -29.9% 96.3% / 108.0%
medium-single 3964.9 Mbps 4749.3 Mbps -16.5% 92.2% / 106.0%
small-concurrent 3813.8 Mbps 5431.3 Mbps -29.8% 95.0% / 109.0%
small-single 3491.5 Mbps 4911.5 Mbps -28.9% 88.8% / 97.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3072.8 Mbps 3911.2 Mbps -21.4%
lan 782.4 Mbps 810.4 Mbps -3.5%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 26.7% slower on average

---
0aa51b888852a0e3088e3e3ed3864c3f51ff4c25 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5362.8 Mbps 7984.5 Mbps -32.8% 97.5% / 163.0%
medium-concurrent 5456.5 Mbps 7625.1 Mbps -28.4% 95.1% / 109.0%
medium-single 3873.5 Mbps 4189.2 Mbps -7.5% 90.7% / 98.0%
small-concurrent 3954.2 Mbps 5151.0 Mbps -23.2% 93.9% / 110.0%
small-single 3633.4 Mbps 4343.6 Mbps -16.4% 89.1% / 97.5%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3130.8 Mbps 3687.3 Mbps -15.1%
lan 782.4 Mbps 796.4 Mbps -1.8%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 22.3% slower on average

---
890355a622787ba2b44ec9b6dcc510fb474d07d6 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5664.5 Mbps 7749.1 Mbps -26.9% 94.2% / 100.0%
medium-concurrent 5420.0 Mbps 7775.9 Mbps -30.3% 93.3% / 99.3%
medium-single 3740.2 Mbps 4749.5 Mbps -21.3% 91.5% / 99.1%
small-concurrent 3920.3 Mbps 5385.5 Mbps -27.2% 95.0% / 124.0%
small-single 3520.1 Mbps 4715.6 Mbps -25.4% 92.3% / 102.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3030.3 Mbps N/A N/A
lan 782.4 Mbps N/A N/A
lossy 69.8 Mbps N/A N/A
wan 83.8 Mbps N/A N/A

Summary

noq is 26.7% slower on average

---
65807f55d5fbe67d601089914accdb17b983d9f6 - artifacts

No results available

---
6fcbf2eded950b6343c7764f16ae64e1ed40a225 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5656.1 Mbps 7956.0 Mbps -28.9% 94.2% / 108.0%
medium-concurrent 5353.7 Mbps 7599.4 Mbps -29.6% 92.4% / 97.3%
medium-single 4027.2 Mbps 4469.6 Mbps -9.9% 98.6% / 162.0%
small-concurrent 3877.0 Mbps 5156.4 Mbps -24.8% 98.2% / 163.0%
small-single 3542.3 Mbps 4743.7 Mbps -25.3% 93.7% / 111.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal N/A 4036.8 Mbps N/A
lan N/A 810.4 Mbps N/A
lossy N/A 55.9 Mbps N/A
wan N/A 83.8 Mbps N/A

Summary

noq is 25.0% slower on average

---
58dec50b3a74384b3d19bb32a03d73ed13cf3dfa - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 6027.8 Mbps 8019.3 Mbps -24.8% 97.3% / 98.8%
medium-concurrent 6145.9 Mbps 7589.2 Mbps -19.0% 97.2% / 100.0%
medium-single 4124.4 Mbps 4571.8 Mbps -9.8% 97.1% / 99.5%
small-concurrent 3979.0 Mbps 5257.2 Mbps -24.3% 96.9% / 99.4%
small-single 3622.9 Mbps 4746.4 Mbps -23.7% 96.6% / 98.5%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3142.6 Mbps 4022.9 Mbps -21.9%
lan 782.5 Mbps 810.4 Mbps -3.4%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 20.4% slower on average

---
3cbd18a6de6737ee42bd140766378d88856e9266 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5389.2 Mbps 7703.7 Mbps -30.0% 98.1% / 160.0%
medium-concurrent 5349.9 Mbps 7097.3 Mbps -24.6% 95.9% / 105.0%
medium-single 4253.5 Mbps 4361.7 Mbps -2.5% 90.4% / 98.5%
small-concurrent 3817.8 Mbps 5108.8 Mbps -25.3% 95.0% / 109.0%
small-single 3514.3 Mbps 4380.6 Mbps -19.8% 88.8% / 96.8%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3106.7 Mbps N/A N/A
lan 782.4 Mbps N/A N/A
lossy 69.9 Mbps N/A N/A
wan 83.8 Mbps N/A N/A

Summary

noq is 22.1% slower on average

---
fd9a7f5f09eb463690cd263173d467bc378add4a - artifacts

No results available

---
8e6d40577fd48e7b6e95c2761f991c7ae539124c - artifacts

No results available

---
a3375242bf9c7333e077eb7621125a9e14c529c4 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5390.7 Mbps 7908.7 Mbps -31.8% 96.7% / 131.0%
medium-concurrent 5334.0 Mbps 7532.6 Mbps -29.2% 90.5% / 96.7%
medium-single 3757.0 Mbps 4745.6 Mbps -20.8% 91.8% / 101.0%
small-concurrent 3865.7 Mbps 5229.2 Mbps -26.1% 91.7% / 99.3%
small-single 3371.7 Mbps 4817.0 Mbps -30.0% 86.7% / 96.5%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3198.5 Mbps 4022.8 Mbps -20.5%
lan 782.4 Mbps 810.4 Mbps -3.4%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 26.6% slower on average

@divagant-martian divagant-martian self-requested a review March 23, 2026 02:52
@divagant-martian divagant-martian moved this from 🚑 Needs Triage to 👀 In review in iroh Mar 23, 2026
@dignifiedquire dignifiedquire force-pushed the fix/off-path-probe-retry branch from c200a40 to 2d783fa Compare March 23, 2026 11:00
dignifiedquire added a commit that referenced this pull request Mar 23, 2026
@dignifiedquire dignifiedquire added this to the noq: iroh v0.98 milestone Mar 23, 2026
@dignifiedquire dignifiedquire force-pushed the fix/off-path-probe-retry branch 2 times, most recently from ea0c0c4 to dea8f18 Compare March 23, 2026 16:14
@flub flub self-requested a review March 25, 2026 15:07
@dignifiedquire dignifiedquire force-pushed the fix/off-path-probe-retry branch 2 times, most recently from 0aa51b8 to 890355a Compare March 25, 2026 17:59
Collaborator

@divagant-martian divagant-martian left a comment

This is a partial review, but I found issues important enough to flag now.

Comment thread noq-proto/src/connection/send_buffer.rs Outdated
Comment thread noq-proto/src/connection/paths.rs Outdated
Comment thread noq-proto/src/connection/timer.rs Outdated
Comment thread noq-proto/src/tests/encode_decode.rs Outdated
@@ -1,6 +1,6 @@
use bytes::{BufMut, BytesMut};
use proptest::{prelude::*, prop_assert_ne};

Collaborator

another one to revert

Contributor Author

done

Comment thread noq-proto/src/n0_nat_traversal.rs Outdated
Comment thread noq-proto/src/n0_nat_traversal.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
@github-project-automation github-project-automation Bot moved this from 👀 In review to 🏗 In progress in iroh Mar 26, 2026
@divagant-martian divagant-martian marked this pull request as draft March 26, 2026 18:49
@flub flub removed their request for review March 30, 2026 13:25
@flub
Collaborator

flub commented Mar 30, 2026

(removed myself from review because it is still in draft. please request again once ready)

Off-path probes were fire-and-forget: sent once per address per round
with no retry. This broke simultaneous-open NAT traversal because the
first probe is typically dropped (the peer's NAT mapping doesn't exist
yet when the probe arrives).

Changes:
- Retry off-path probes up to 10 times (once per PTO firing)
- Track per-probe CID so retries reuse the same CID (RFC 9000 §9.5
  compliant: same CID to same remote address)
- New OffPathProbeRetry connection timer drives retransmission
- On new NAT traversal round, retire CIDs from old round's failed
  probes to prevent CID exhaustion (#410)

Fixes #410
Relates to #376
handle_reach_out may silently ignore frames (old round, unsupported
IP family) without advancing the round. Compare current_round()
before and after to avoid clearing valid ongoing probes.
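
A minimal sketch of the round comparison described in this commit message. The types, fields, and the `on_reach_out` wrapper are hypothetical simplifications; only `handle_reach_out` and `current_round` are names taken from the message itself.

```rust
/// Hypothetical, simplified server-side NAT traversal state.
#[derive(Debug, Default)]
struct ServerState {
    round: u64,
}

/// Hypothetical contents of a reach-out frame.
struct ReachOut {
    round: u64,
    supported_family: bool,
}

impl ServerState {
    fn current_round(&self) -> u64 {
        self.round
    }

    /// May silently ignore the frame (old round, unsupported IP family)
    /// without advancing the round.
    fn handle_reach_out(&mut self, frame: &ReachOut) {
        if frame.supported_family && frame.round > self.round {
            self.round = frame.round;
        }
    }
}

/// Clears outstanding off-path challenges only if the frame really
/// started a new round. `challenges` stands in for the challenge tokens.
fn on_reach_out(state: &mut ServerState, challenges: &mut Vec<u64>, frame: &ReachOut) {
    let round_before = state.current_round();
    state.handle_reach_out(frame);
    if state.current_round() != round_before {
        challenges.clear();
    }
}
```

Comparing the round before and after the call avoids having `handle_reach_out` report whether it acted, at the cost of relying on the round number as the only signal.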
Move mark_as_sent after PacketBuilder completes so attempt count
isn't incremented if packet build fails. Uses new borrow-free
next_probe_info/mark_probe_sent API to avoid holding a mutable
borrow across the packet build.
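
The ordering described in this commit message can be sketched as follows. `next_probe_info` and `mark_probe_sent` are the names used in the message; their signatures, the `Probes` and `Entry` types, and the `build_packet` stand-in are assumptions made for illustration.

```rust
/// Hypothetical copyable snapshot of what one probe packet needs.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ProbeInfo {
    id: usize,
    challenge: u64,
}

#[derive(Debug)]
struct Entry {
    challenge: u64,
    attempts: u8,
    queued: bool,
}

#[derive(Debug, Default)]
struct Probes {
    entries: Vec<Entry>,
}

impl Probes {
    /// Returns a copy of the next queued probe, so no borrow of `self`
    /// is held while the packet is being built.
    fn next_probe_info(&self) -> Option<ProbeInfo> {
        let id = self.entries.iter().position(|e| e.queued)?;
        Some(ProbeInfo { id, challenge: self.entries[id].challenge })
    }

    /// Records one transmission attempt for the probe with this id.
    fn mark_probe_sent(&mut self, id: usize) {
        if let Some(entry) = self.entries.get_mut(id) {
            entry.attempts += 1;
            entry.queued = false;
        }
    }
}

/// Stand-in for packet building, which can fail (here: not enough space).
fn build_packet(info: ProbeInfo, buf_space: usize) -> Result<Vec<u8>, &'static str> {
    if buf_space < 8 {
        return Err("not enough space for the challenge");
    }
    Ok(info.challenge.to_be_bytes().to_vec())
}

fn send_next_probe(probes: &mut Probes, buf_space: usize) -> Option<Vec<u8>> {
    let info = probes.next_probe_info()?;
    let packet = build_packet(info, buf_space).ok()?;
    // Count the attempt only once the packet actually exists.
    probes.mark_probe_sent(info.id);
    Some(packet)
}
```

If the build fails, the probe stays queued with its attempt count unchanged, so a failed build does not use up one of the limited retries.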
…tion

queue_retries now returns CIDs from probes that exceeded max attempts,
enabling callers to retire them. Currently unused but plumbed for when
CidQueue gains a retire-by-CID API.
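
One way `queue_retries` could hand back the CIDs of exhausted probes is sketched below. The `Cid` alias, the `Probe` struct, and the free-function signature are hypothetical; only the function name and its described behaviour come from the commit message.

```rust
/// Maximum transmissions per probe (10 per the PR description).
const MAX_ATTEMPTS: u8 = 10;

/// Stand-in for a connection ID.
type Cid = [u8; 8];

#[derive(Debug)]
struct Probe {
    cid: Cid,
    attempts: u8,
    queued: bool,
}

/// Re-queues probes that still have attempts left and returns the CIDs of
/// probes that exceeded the limit, so the caller can retire them.
fn queue_retries(probes: &mut Vec<Probe>) -> Vec<Cid> {
    let mut exhausted = Vec::new();
    probes.retain_mut(|probe| {
        if probe.attempts >= MAX_ATTEMPTS {
            exhausted.push(probe.cid);
            false
        } else {
            probe.queued = true;
            true
        }
    });
    exhausted
}
```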
@dignifiedquire dignifiedquire force-pushed the fix/off-path-probe-retry branch from 65807f5 to 6fcbf2e Compare April 6, 2026 18:42
Collaborator

@divagant-martian divagant-martian left a comment

This seems OK on my end; we can work on the other changes later.

@dignifiedquire dignifiedquire marked this pull request as ready for review April 8, 2026 09:01
Collaborator

@flub flub left a comment

No serious objection, only some nits.

Comment thread noq-proto/src/connection/mod.rs Outdated
if let Ok(server_state) = self.n0_nat_traversal.server_side_mut()
    && server_state.has_pending_retries()
{
    let pto = self.pto(SpaceKind::Data, path_id);

Collaborator

Using the on-path PTO here is strange. There's no reason at all that this is a relevant duration. It's the equivalent of making this a fairly random duration.

Probably better off using the PTO-base of the configured initial RTT?

Contributor Author

Changed. Can you please check that the initial PTO I am now using is calculated correctly?

Comment thread noq-proto/src/connection/mod.rs Outdated
{
    let pto = self.pto(SpaceKind::Data, path_id);
    self.timers.set(
        Timer::Conn(ConnTimer::OffPathProbeRetry),

Collaborator

Could we name this timer NatTraversalProbeRetry? Off-path is going to confuse me at some point, since not all off-path probes are for NAT traversal. But IIUC this timer is more specific than that.

Contributor Author

done

Comment thread noq-proto/src/connection/mod.rs Outdated
Comment on lines +2123 to +2124
let initial_pto = RttEstimator::new(self.config.initial_rtt).pto_base()
    + self.ack_frequency.max_ack_delay_for_pto();

Collaborator

@flub flub Apr 8, 2026

You should not do the + max_ack_delay since the probes must be responded to immediately.

Suggested change
- let initial_pto = RttEstimator::new(self.config.initial_rtt).pto_base()
-     + self.ack_frequency.max_ack_delay_for_pto();
+ let initial_pto = RttEstimator::new(self.config.initial_rtt).pto_base();

@dignifiedquire dignifiedquire enabled auto-merge April 8, 2026 10:44
@dignifiedquire dignifiedquire disabled auto-merge April 8, 2026 10:44
@dignifiedquire dignifiedquire enabled auto-merge April 8, 2026 10:46
@dignifiedquire dignifiedquire added this pull request to the merge queue Apr 8, 2026
Merged via the queue into main with commit 7d60937 Apr 8, 2026
36 checks passed
@dignifiedquire dignifiedquire deleted the fix/off-path-probe-retry branch April 8, 2026 11:07
@github-project-automation github-project-automation Bot moved this from 🏗 In progress to ✅ Done in iroh Apr 8, 2026

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

apply timers to off-path path challenges

3 participants