On setting the number of CPUs: I'm not sure what to do here. The default is the number of cores, but this may oversubscribe us a bit. OTOH there are points where the replica is pretty much just blocked on us, and the faster the vault completes the work the better. |
|
I also checked 40-dealer / 13-recipient NIDKG, since that's relevant for engines; the numbers look quite good to me.
|
I noticed this too, so I asked Claude Opus 4.6 about it.

Prompt: However, as observed in some flaky runs of this test, the application node doesn't become healthy because it can't fetch an NNS delegation from the NNS node: its own firewall blocks incoming packets from the NNS node. An example of this can be found in the following log of a flaky run. Why is the firewall of the application node dropping packets from other nodes in the testnet?

Root Cause Analysis: Application Node Firewall Drops
|
| Role | IPv6 Address |
|---|---|
| NNS node (System subnet) | 2602:fb2b:110:10:502f:52ff:fe34:668a |
| APP node (Application subnet) | 2602:fb2b:110:10:50a2:78ff:fe84:6fdc |
Ports involved: 8080 (HTTP handler), 2497 (XNet), 4100 (transport)
Timeline
| Time | Event |
|---|---|
| 17:15:31 | Orchestrator writes firewall rules — both nodes correctly whitelisted on ports 22, 2497, 4100, 8080 (TCP) and 4100 (UDP) |
| 17:15:31–17:20:34 | ~5 minutes of normal operation, no drops |
| 17:20:34.878 | First NNS TLS handshake timeout (CspServerEd25519Signer: TransientInternalError: "the request exceeded its deadline") |
| 17:20:34.879 | First APP firewall drop logged — 0.5 ms after the NNS error |
| 17:20:34–17:23:24 | Drops repeat every ~5 seconds on ports 2497 and 8080, each correlated sub-millisecond with a NNS TLS error |
Mechanism
```
APP node                              NNS node
   |                                     |
   |-------- TLS ClientHello ---------->|   (conntrack: NEW → ESTABLISHED)
   |                                     |   NNS CSP vault busy with DKG signing
   |                                     |   ...deadline exceeded after ~5 s...
   |<- [timeout] client gives up,        |
   |   sends RST, conntrack entry        |
   |   destroyed                         |
   |                                     |
   |                                     |   CSP vault finally returns error
   |<------- TLS error response --------|   arrives ~0.5 ms later
   |                                     |
   |   No conntrack entry found          |
   |   → ct state = NEW                  |
   |   → no rule accepts NEW TCP         |
   |     to ephemeral dest port          |
   |   → DEFAULT POLICY: DROP            |
   |   → logged as "Drop - default       |
   |     policy"                         |
```
- The APP node opens a TLS connection to the NNS node (port 2497 or 8080).
- The NNS node's `CspServerEd25519Signer` is saturated by DKG/consensus work and cannot complete the TLS handshake within the deadline (~5 seconds).
- The APP node's TLS client times out and sends a TCP RST, destroying the conntrack entry.
- ~0.5 ms later, the NNS node's error response packet arrives at the APP node.
- With no conntrack entry, nftables classifies it as `ct state new`.
- No rule accepts new inbound TCP on random ephemeral destination ports, so the packet is dropped by the default `drop` policy.
- The drop is logged with the prefix `"Drop - default policy: "`.
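The sequence above can be condensed into a toy decision function. This is a sketch for illustration only; the real behavior lives in the nftables ruleset, not in application code, and the names here are made up:

```rust
/// Toy model of the nftables decision described above (illustrative only):
/// a packet matching an existing conntrack entry is accepted; otherwise it
/// is classified `ct state new`, and new inbound TCP to a non-whitelisted
/// (ephemeral) destination port falls through to the default-drop policy.
#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,
    Drop,
}

fn classify(has_conntrack_entry: bool, dport_whitelisted: bool) -> Verdict {
    if has_conntrack_entry {
        return Verdict::Accept; // ct state established/related
    }
    if dport_whitelisted {
        return Verdict::Accept; // explicit accept rule covers new connections
    }
    Verdict::Drop // logged as "Drop - default policy"
}

fn main() {
    // Late error-response packet after the client's RST: no conntrack entry,
    // ephemeral destination port → dropped.
    println!("{:?}", classify(false, false));
}
```

The key point is the middle row: the whitelisting rules are correct, but they only match the listed service ports, so a stale packet aimed at an ephemeral client port has no rule to accept it.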
Evidence: Sub-millisecond Timing Correlation
Every NNS TLS error matches an APP firewall drop within < 1 ms:
| NNS TLS Error | APP Firewall Drop | Delta |
|---|---|---|
| 17:20:34.878498 | 17:20:34.879075 | +0.6 ms |
| 17:20:39.922932 | 17:20:39.922915 | −0.0 ms |
| 17:20:44.937072 | 17:20:44.936623 | −0.4 ms |
| 17:20:49.944747 | 17:20:49.944717 | −0.0 ms |
| 17:20:54.964523 | 17:20:54.964530 | +0.0 ms |
| 17:20:59.976485 | 17:20:59.976427 | −0.1 ms |
| 17:21:04.987832 | 17:21:04.987796 | −0.0 ms |
| 17:21:09.997946 | 17:21:09.998127 | +0.2 ms |
| 17:21:15.011556 | 17:21:15.011534 | −0.0 ms |
| 17:21:20.019024 | 17:21:20.019048 | +0.0 ms |
Firewall Configuration (Verified Correct)
The orchestrator correctly applied the following whitelisting rules at 17:15:31:
```
# IPv6 node whitelisting
tcp dport { 22, 2497, 4100, 8080 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
udp dport { 4100 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
```
The dropped packets have destination ports that differ from the whitelisted ports (they are ephemeral ports from closed connections), confirming the firewall rules are not at fault.
Root Cause
The NNS node's CspServerEd25519Signer is overloaded during initial DKG/consensus bootstrapping, unable to service TLS handshake signing requests within their deadline. This prevents the APP node from establishing any TLS connection to the NNS node, blocking NNS delegation fetches and causing the test to fail.
The firewall drops are a harmless secondary effect of stale TCP response packets arriving after connections have already been reset.
Potential Fixes
- Increase CSP vault signing deadline during the bootstrap phase to tolerate DKG load.
- Prioritize TLS handshake signing requests over DKG signing in the CSP vault.
- Add more resilient retry logic in the APP node's NNS delegation fetcher to tolerate prolonged TLS failures.
- Ensure test waits for NNS node readiness (post-DKG stabilization) before starting the application subnet.
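For the "more resilient retry logic" option, a minimal sketch of an exponential backoff schedule is shown below. The function name and all constants are assumptions for illustration, not the orchestrator's actual delegation-fetcher code:

```rust
use std::time::Duration;

/// Illustrative backoff schedule for a delegation fetcher: exponential
/// growth from 500 ms, capped at 30 s, so the fetcher keeps retrying
/// through a multi-minute window of TLS failures instead of giving up.
/// (Hypothetical helper; names and numbers are not from the real code.)
fn retry_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    // Clamp the exponent so the shift cannot overflow, then cap the delay.
    let ms = base_ms.saturating_mul(1u64 << attempt.min(10));
    Duration::from_millis(ms.min(30_000))
}

fn main() {
    for attempt in 0..6 {
        println!("attempt {attempt}: {:?}", retry_delay(attempt));
    }
}
```

Capping the delay matters here: the observed failure window lasted several minutes, so an uncapped exponential schedule would space retries too far apart to recover promptly once the vault frees up.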
|
Thanks @basvandijk, definitely an issue. I think none of the solutions Claude lays out really work, since they are more or less specific to this one test, while resource starvation in the vault could probably apply in all sorts of scenarios. I guess as a starting point I'll configure Rayon's thread pool to use only a portion of the available core count. |
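A minimal sketch of what "a portion of the available core count" could look like. The helper name and the half-the-cores ratio are assumptions for illustration; the computed value would then be handed to the pool builder (e.g. Rayon's `ThreadPoolBuilder::num_threads`):

```rust
use std::thread;

/// Hypothetical helper (not the actual vault code): reserve only a portion
/// of the cores for the vault's worker pool, so latency-sensitive replica
/// work (e.g. TLS handshake signing) is not starved during DKG load.
fn capped_thread_count(cores: usize) -> usize {
    // Assumption: use half the cores, but always keep at least one thread.
    (cores / 2).max(1)
}

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("vault worker threads: {}", capped_thread_count(cores));
}
```

On a 16-core Gen2-like machine this would give the vault 8 worker threads, leaving headroom for the rest of the replica.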
|
Sounds good. Also note that I observed some other tests flaking on this PR as well; just making sure that we're not introducing a new source of flakiness. |
This substantially improves NIDKG performance. The following table of results is from the crypto benchmarks (so no network/consensus) on my devenv, which has 16 cores, similar to the Gen2 machines.
All times are for the 40-node, 40-dealer, high-threshold setting; the improvements for other settings are similar.