
perf(crypto): CON-1650 Use Rayon in NIDKG #9007

Open
randombit wants to merge 8 commits into master from jack/crp-2550

Conversation

@randombit
Contributor

This substantially improves NIDKG performance. The following table of results is from the crypto benchmarks (so no network/consensus) on my devenv, which has 16 cores, similar to the hardware used in the Gen2 machines.

All times are for the 40-node, 40-dealer, high-threshold setting; the improvements for other settings are similar.

| Operation | master | rayon |
| --- | --- | --- |
| create_initial_dealing | 283 ms | 108 ms |
| create_reshare_dealing | 272 ms | 105 ms |
| verify_dealing | 310 ms | 86 ms |
| create_transcript | 152 ms | 14 ms |
| load_transcript | 1800 ms | 166 ms |
| retain_active_keys/0 | 3640 ms | 479 ms |
| retain_active_keys/1 | 3770 ms | 496 ms |
| retain_active_keys/5 | 3870 ms | 515 ms |
| retain_active_keys/15 | 4400 ms | 589 ms |
| retain_active_keys/30 | 5200 ms | 671 ms |

@github-actions github-actions bot added the perf label Feb 23, 2026
@randombit
Contributor Author

randombit commented Feb 23, 2026

Todo

  • There is a stability test for key update, but I think just one. I'd like to add several more to cover different scenarios, since the update process depends on the epochs in use.
  • Verify proofs in parallel
  • Investigate BSGS
  • Table lookup for honest dealer
  • Update CON-1444 doc

@randombit
Contributor Author

To set a specific number of CPUs:

    use rayon::ThreadPoolBuilder;

    // build_global() configures the process-wide Rayon pool; it returns
    // an error if the pool has already been initialized.
    let result = ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global();

I'm not sure what to do here; the default is the number of cores, but this may oversubscribe us a bit. On the other hand, there are points where the replica is essentially just blocked on us, and the faster the vault completes the work the better.

@randombit
Contributor Author

randombit commented Feb 23, 2026

I also checked the 40-dealer, 13-recipient NIDKG setting, since that's relevant for engines; the numbers look quite good to me.

crypto_nidkg_13_nodes_40_dealers_high/create_initial_dealing
                        time:   [59.307 ms 63.587 ms 69.395 ms]
crypto_nidkg_13_nodes_40_dealers_high/create_reshare_dealing
                        time:   [58.164 ms 63.475 ms 69.890 ms]
crypto_nidkg_13_nodes_40_dealers_high/verify_dealing
                        time:   [50.716 ms 50.977 ms 51.259 ms]
crypto_nidkg_13_nodes_40_dealers_high/create_transcript
                        time:   [7.0596 ms 7.1216 ms 7.1903 ms]
crypto_nidkg_13_nodes_40_dealers_high/load_transcript
                        time:   [117.18 ms 118.20 ms 119.22 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/0
                        time:   [338.97 ms 342.47 ms 345.78 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/1
                        time:   [355.97 ms 358.14 ms 360.47 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/5
                        time:   [370.10 ms 373.88 ms 377.47 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/15
                        time:   [429.25 ms 431.44 ms 433.05 ms]

@randombit randombit changed the title perf(crypto): CRP-2550 Use Rayon in NIDKG perf(crypto): CON-1650 Use Rayon in NIDKG Feb 24, 2026
@randombit randombit marked this pull request as ready for review March 9, 2026 13:28
@randombit randombit requested a review from a team as a code owner March 9, 2026 13:28
@basvandijk
Collaborator

basvandijk commented Mar 10, 2026

I noticed //rs/tests/nns:nns_token_balance_test was flaky on this PR:

$ bazel run //ci/githubstats:query -- last --flaky //rs/tests/nns:nns_token_balance_test
...
Downloading logs to: /ic/logs/nns_token_balance_test/2026-03-09T22:11:14
...
╒════╤═════════════════════════╤════════════╤══════════╤═══════════════╤═══════╤══════════╤══════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │   duration │ status   │ branch        │ PR    │ commit   │ buildbuddy   │ errors per attempt                                                                            │
╞════╪═════════════════════════╪════════════╪══════════╪═══════════════╪═══════╪══════════╪══════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Mon 2026-03-09 17:24:03 │       2:37 │ FLAKY    │ jack/crp-2550 │ #9007 │ fe35a78  │ logs         │ 1: setup: called `Result::unwrap()` on an `Err` value: Func="await_status_is_healthy of ht... │
│    │                         │            │          │               │       │          │              │    test: Failed to find SetupResult attribute after setup. Cancelling test function.          │
├────┼─────────────────────────┼────────────┼──────────┼───────────────┼───────┼──────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Mon 2026-03-09 13:59:50 │       2:26 │ FLAKY    │ jack/crp-2550 │ #9007 │ c0ea430  │ logs         │ 1: setup: called `Result::unwrap()` on an `Err` value: Func="await_status_is_healthy of ht... │
│    │                         │            │          │               │       │          │              │    test: Failed to find SetupResult attribute after setup. Cancelling test function.          │
╘════╧═════════════════════════╧════════════╧══════════╧═══════════════╧═══════╧══════════╧══════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════╛

So I asked Claude Opus 4.6 about it:

Prompt

The //rs/tests/nns:nns_token_balance_test as implemented in rs/tests/nns/nns_token_balance_test.rs has a simple setup() function which sets up an IC testnet consisting of a single node NNS subnet and a single node application subnet:

    InternetComputer::new()
        .add_fast_single_node_subnet(SubnetType::System)
        .add_fast_single_node_subnet(SubnetType::Application)
        .setup_and_start(&env)
        .expect("failed to setup IC under test");

However, as observed in some flaky runs of this test, the application node doesn't become healthy because it can't fetch an NNS delegation from the NNS node: its own firewall blocks incoming packets from the NNS node.

An example of this can be found in the following log of a flaky run: logs/nns_token_balance_test/2026-03-09T22:11:14/2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6/1/FAILED.log. Specifically, in the log of the application node, logs/nns_token_balance_test/2026-03-09T22:11:14/2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6/1/ic_logs/2602:fb2b:110:10:50a2:78ff:fe84:6fdc.log, you can see it successfully establishes a TCP connection to the NNS node at IPv6 2602:fb2b:110:10:502f:52ff:fe34:668a but then fails to establish a TLS stream to that same node because its own firewall blocks it:

2026-03-09 17:23:44.428001 Establishing TCP connection to g4zag-dfsbp-vyukk-bqngo-f53lb-es6h4-czfup-kdnu2-qighy-gcx32-fae @ [2602:fb2b:110:10:502f:52ff:fe34:668a]:8080
2026-03-09 17:23:44.428001 Establishing TLS stream to g4zag-dfsbp-vyukk-bqngo-f53lb-es6h4-czfup-kdnu2-qighy-gcx32-fae. Tcp stream: TcpStream { addr: [2602:fb2b:110:10:50a2:78ff:fe84:6fdc]:58544, peer: [2602:fb2b:110:10:502f:52ff:fe34:668a]:8080, fd: 39 }
2026-03-09 17:23:47.248279 Drop - default policy: IN=enp1s0 OUT= MAC=52:a2:78:84:6f:dc:52:2f:52:34:66:8a:86:dd SRC=2602:fb2b:0110:0010:502f:52ff:fe34:668a DST=2602:fb2b:0110:0010:50a2:78ff:fe84:6fdc LEN=205 TC=0 HOPLIMIT=64 FLOWLBL=3982 PROTO=TCP SPT=8080 DPT=60338 WINDOW=16 RES=0x00 ACK PSH URGP=0

Why is the firewall of the application node dropping packets from other nodes in the testnet?

Root Cause Analysis: Application Node Firewall Drops in nns_token_balance_test

Test: //rs/tests/nns:nns_token_balance_test
Date: 2026-03-09
Run: 2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6

Summary

The application node's firewall drops packets from the NNS node, preventing the app node from fetching an NNS delegation and causing the test to fail. The drops are a symptom, not the root cause. The true cause is the NNS node's CSP vault being overloaded during DKG/consensus initialization, which prevents TLS handshakes from completing.

Environment

| Role | IPv6 Address |
| --- | --- |
| NNS node (System subnet) | 2602:fb2b:110:10:502f:52ff:fe34:668a |
| APP node (Application subnet) | 2602:fb2b:110:10:50a2:78ff:fe84:6fdc |

Ports involved: 8080 (HTTP handler), 2497 (XNet), 4100 (transport)

Timeline

| Time | Event |
| --- | --- |
| 17:15:31 | Orchestrator writes firewall rules: both nodes correctly whitelisted on ports 22, 2497, 4100, 8080 (TCP) and 4100 (UDP) |
| 17:15:31–17:20:34 | ~5 minutes of normal operation, no drops |
| 17:20:34.878 | First NNS TLS handshake timeout (CspServerEd25519Signer: TransientInternalError: "the request exceeded its deadline") |
| 17:20:34.879 | First APP firewall drop logged, 0.5 ms after the NNS error |
| 17:20:34–17:23:24 | Drops repeat every ~5 seconds on ports 2497 and 8080, each correlated sub-millisecond with an NNS TLS error |

Mechanism

APP node                          NNS node
   |                                 |
   |------- TLS ClientHello -------->|  (conntrack: NEW → ESTABLISHED)
   |                                 |  NNS CSP vault busy with DKG signing
   |                                 |  ...deadline exceeded after ~5s...
   |<-- [timeout] client gives up    |
   |  sends RST, conntrack entry     |
   |  destroyed                      |
   |                                 |
   |                                 |  CSP vault finally returns error
   |<------- TLS error response -----|  arrives ~0.5 ms later
   |                                 |
   |  No conntrack entry found       |
   |  → ct state = NEW               |
   |  → No rule accepts NEW TCP      |
   |    to ephemeral dest port       |
   |  → DEFAULT POLICY: DROP         |
   |  → logged as "Drop - default    |
   |    policy"                      |

  1. The APP node opens a TLS connection to the NNS node (ports 2497 or 8080).
  2. The NNS node's CspServerEd25519Signer is saturated by DKG/consensus work and cannot complete the TLS handshake within the deadline (~5 seconds).
  3. The APP node's TLS client times out and sends a TCP RST, destroying the conntrack entry.
  4. ~0.5 ms later, the NNS node's error response packet arrives at the APP node.
  5. With no conntrack entry, nftables classifies it as ct state new.
  6. No rule accepts new inbound TCP on random ephemeral destination ports → dropped by the default drop policy.
  7. The drop is logged with prefix "Drop - default policy: ".
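
Steps 5 and 6 depend on the shape of the node's ruleset. A minimal nftables sketch of that shape (an assumption for illustration, not the actual orchestrator-generated rules) looks like:

    # Replies on a tracked connection match "established" and are accepted.
    # Once the client's RST destroys the conntrack entry, a late packet from
    # the peer is classified as "new", matches no accept rule for its
    # ephemeral destination port, and falls through to the default drop.
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        tcp dport { 22, 2497, 4100, 8080 } accept
        log prefix "Drop - default policy: "
    }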

Evidence: Sub-millisecond Timing Correlation

Every NNS TLS error matches an APP firewall drop within < 1 ms:

| NNS TLS Error | APP Firewall Drop | Delta |
| --- | --- | --- |
| 17:20:34.878498 | 17:20:34.879075 | +0.6 ms |
| 17:20:39.922932 | 17:20:39.922915 | −0.0 ms |
| 17:20:44.937072 | 17:20:44.936623 | −0.4 ms |
| 17:20:49.944747 | 17:20:49.944717 | −0.0 ms |
| 17:20:54.964523 | 17:20:54.964530 | +0.0 ms |
| 17:20:59.976485 | 17:20:59.976427 | −0.1 ms |
| 17:21:04.987832 | 17:21:04.987796 | −0.0 ms |
| 17:21:09.997946 | 17:21:09.998127 | +0.2 ms |
| 17:21:15.011556 | 17:21:15.011534 | −0.0 ms |
| 17:21:20.019024 | 17:21:20.019048 | +0.0 ms |

Firewall Configuration (Verified Correct)

The orchestrator correctly applied the following whitelisting rules at 17:15:31:

# IPv6 node whitelisting
tcp dport { 22, 2497, 4100, 8080 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
udp dport { 4100 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept

The dropped packets have destination ports that differ from the whitelisted ports (they are ephemeral ports from closed connections), confirming the firewall rules are not at fault.

Root Cause

The NNS node's CspServerEd25519Signer is overloaded during initial DKG/consensus bootstrapping, unable to service TLS handshake signing requests within their deadline. This prevents the APP node from establishing any TLS connection to the NNS node, blocking NNS delegation fetches and causing the test to fail.

The firewall drops are a harmless secondary effect of stale TCP response packets arriving after connections have already been reset.

Potential Fixes

  1. Increase CSP vault signing deadline during the bootstrap phase to tolerate DKG load.
  2. Prioritize TLS handshake signing requests over DKG signing in the CSP vault.
  3. Add more resilient retry logic in the APP node's NNS delegation fetcher to tolerate prolonged TLS failures.
  4. Ensure test waits for NNS node readiness (post-DKG stabilization) before starting the application subnet.
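
Fix 3 could be sketched as a generic retry wrapper. This is an illustrative helper, not the delegation fetcher's real API; the `retry_with_backoff` name and its parameters are hypothetical:

```rust
use std::time::Duration;

// Hypothetical retry helper with exponential backoff, in the spirit of
// fix 3: keep retrying the operation instead of giving up on the first
// prolonged TLS failure.
fn retry_with_backoff<T, E>(
    mut attempts: u32,
    mut delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error.
            Err(e) if attempts <= 1 => return Err(e),
            Err(_) => {
                attempts -= 1;
                std::thread::sleep(delay);
                delay = delay.saturating_mul(2); // exponential backoff
            }
        }
    }
}
```

The fetch closure would wrap the actual delegation request; with a generous attempt budget, transient vault overload on the NNS side becomes a delay rather than a test failure.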

@randombit
Contributor Author

Thanks @basvandijk, definitely an issue. I think none of the solutions Claude lays out really work, since they are more or less specific to this one test, while resource starvation in the vault could apply in all sorts of scenarios. As a starting point, I'll configure Rayon's thread pool to use only a portion of the available core count.

@basvandijk
Collaborator

Sounds good. Also note that I observed some other tests flaking on this PR as well. Just making sure we're not introducing a new source of flakiness.

