On setting the number of CPUs: I'm not sure what to do here. The default is the number of cores, but this may oversubscribe us a bit. OTOH there are points where the replica is pretty much just blocked on us, and the faster the vault completes the work the better. |
|
I also checked 40-dealer / 13-recipient NIDKG, since that's relevant for engines; the numbers look quite good to me.
|
I noticed this too, so I asked Claude Opus 4.6 about it.

Prompt: However, as observed in some flaky runs of this test, the application node doesn't become healthy because it can't fetch an NNS delegation from the NNS node: its own firewall blocks incoming packets from the NNS node. An example of this can be found in the following log of a flaky run. Why is the firewall of the application node dropping packets from other nodes in the testnet?

Root Cause Analysis: Application Node Firewall Drops
|
| Role | IPv6 Address |
|---|---|
| NNS node (System subnet) | 2602:fb2b:110:10:502f:52ff:fe34:668a |
| APP node (Application subnet) | 2602:fb2b:110:10:50a2:78ff:fe84:6fdc |
Ports involved: 8080 (HTTP handler), 2497 (XNet), 4100 (transport)
Timeline
| Time | Event |
|---|---|
| 17:15:31 | Orchestrator writes firewall rules — both nodes correctly whitelisted on ports 22, 2497, 4100, 8080 (TCP) and 4100 (UDP) |
| 17:15:31–17:20:34 | ~5 minutes of normal operation, no drops |
| 17:20:34.878 | First NNS TLS handshake timeout (CspServerEd25519Signer: TransientInternalError: "the request exceeded its deadline") |
| 17:20:34.879 | First APP firewall drop logged — 0.5 ms after the NNS error |
| 17:20:34–17:23:24 | Drops repeat every ~5 seconds on ports 2497 and 8080, each correlated sub-millisecond with a NNS TLS error |
Mechanism
```
APP node                              NNS node
   |                                     |
   |-------- TLS ClientHello ---------->|   (conntrack: NEW → ESTABLISHED)
   |                                     |   NNS CSP vault busy with DKG signing
   |                                     |   ...deadline exceeded after ~5 s...
   |<- [timeout] client gives up,        |
   |   sends RST, conntrack entry        |
   |   destroyed                         |
   |                                     |
   |                                     |   CSP vault finally returns error
   |<------- TLS error response --------|   arrives ~0.5 ms later
   |                                     |
   |   No conntrack entry found          |
   |   → ct state = NEW                  |
   |   → no rule accepts NEW TCP         |
   |     to ephemeral dest port          |
   |   → DEFAULT POLICY: DROP            |
   |   → logged as "Drop - default       |
   |     policy"                         |
```
- The APP node opens a TLS connection to the NNS node (port 2497 or 8080).
- The NNS node's `CspServerEd25519Signer` is saturated by DKG/consensus work and cannot complete the TLS handshake within the deadline (~5 seconds).
- The APP node's TLS client times out and sends a TCP RST, destroying the conntrack entry.
- ~0.5 ms later, the NNS node's error response packet arrives at the APP node.
- With no conntrack entry, nftables classifies it as `ct state new`.
- No rule accepts new inbound TCP on random ephemeral destination ports, so the packet is dropped by the default `drop` policy.
- The drop is logged with the prefix `"Drop - default policy: "`.
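The sequence above can be condensed into a toy decision function. This is a sketch for illustration only; the real behavior lives in the nftables ruleset, not in application code, and the names here are made up:

```rust
/// Toy model of the nftables decision described above (illustrative only):
/// a packet matching an existing conntrack entry is accepted; otherwise it
/// is classified `ct state new`, and new inbound TCP to a non-whitelisted
/// (ephemeral) destination port falls through to the default-drop policy.
#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,
    Drop,
}

fn classify(has_conntrack_entry: bool, dport_whitelisted: bool) -> Verdict {
    if has_conntrack_entry {
        return Verdict::Accept; // ct state established/related
    }
    if dport_whitelisted {
        return Verdict::Accept; // explicit accept rule covers new connections
    }
    Verdict::Drop // logged as "Drop - default policy"
}

fn main() {
    // Late error-response packet after the client's RST: no conntrack entry,
    // ephemeral destination port → dropped.
    println!("{:?}", classify(false, false));
}
```

The key point is the middle row: the whitelisting rules are correct, but they only match the listed service ports, so a stale packet aimed at an ephemeral client port has no rule to accept it.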
Evidence: Sub-millisecond Timing Correlation
Every NNS TLS error matches an APP firewall drop within < 1 ms:
| NNS TLS Error | APP Firewall Drop | Delta |
|---|---|---|
| 17:20:34.878498 | 17:20:34.879075 | +0.6 ms |
| 17:20:39.922932 | 17:20:39.922915 | −0.0 ms |
| 17:20:44.937072 | 17:20:44.936623 | −0.4 ms |
| 17:20:49.944747 | 17:20:49.944717 | −0.0 ms |
| 17:20:54.964523 | 17:20:54.964530 | +0.0 ms |
| 17:20:59.976485 | 17:20:59.976427 | −0.1 ms |
| 17:21:04.987832 | 17:21:04.987796 | −0.0 ms |
| 17:21:09.997946 | 17:21:09.998127 | +0.2 ms |
| 17:21:15.011556 | 17:21:15.011534 | −0.0 ms |
| 17:21:20.019024 | 17:21:20.019048 | +0.0 ms |
Firewall Configuration (Verified Correct)
The orchestrator correctly applied the following whitelisting rules at 17:15:31:
```
# IPv6 node whitelisting
tcp dport { 22, 2497, 4100, 8080 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
udp dport { 4100 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
```
The dropped packets have destination ports that differ from the whitelisted ports (they are ephemeral ports from closed connections), confirming the firewall rules are not at fault.
Root Cause
The NNS node's CspServerEd25519Signer is overloaded during initial DKG/consensus bootstrapping, unable to service TLS handshake signing requests within their deadline. This prevents the APP node from establishing any TLS connection to the NNS node, blocking NNS delegation fetches and causing the test to fail.
The firewall drops are a harmless secondary effect of stale TCP response packets arriving after connections have already been reset.
Potential Fixes
- Increase CSP vault signing deadline during the bootstrap phase to tolerate DKG load.
- Prioritize TLS handshake signing requests over DKG signing in the CSP vault.
- Add more resilient retry logic in the APP node's NNS delegation fetcher to tolerate prolonged TLS failures.
- Ensure test waits for NNS node readiness (post-DKG stabilization) before starting the application subnet.
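For the "more resilient retry logic" option, a minimal sketch of an exponential backoff schedule is shown below. The function name and all constants are assumptions for illustration, not the orchestrator's actual delegation-fetcher code:

```rust
use std::time::Duration;

/// Illustrative backoff schedule for a delegation fetcher: exponential
/// growth from 500 ms, capped at 30 s, so the fetcher keeps retrying
/// through a multi-minute window of TLS failures instead of giving up.
/// (Hypothetical helper; names and numbers are not from the real code.)
fn retry_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    // Clamp the exponent so the shift cannot overflow, then cap the delay.
    let ms = base_ms.saturating_mul(1u64 << attempt.min(10));
    Duration::from_millis(ms.min(30_000))
}

fn main() {
    for attempt in 0..6 {
        println!("attempt {attempt}: {:?}", retry_delay(attempt));
    }
}
```

Capping the delay matters here: the observed failure window lasted several minutes, so an uncapped exponential schedule would space retries too far apart to recover promptly once the vault frees up.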
|
Thanks @basvandijk, definitely an issue. I think none of the solutions Claude lays out really work, since they are more or less specific to this one test, while resource starvation in the vault could probably apply in all sorts of scenarios. I guess as a starting point I'll configure Rayon's thread pool to use only a portion of the available core count. |
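A minimal sketch of what "a portion of the available core count" could look like. The helper name and the half-the-cores ratio are assumptions for illustration; the computed value would then be handed to the pool builder (e.g. Rayon's `ThreadPoolBuilder::num_threads`):

```rust
use std::thread;

/// Hypothetical helper (not the actual vault code): reserve only a portion
/// of the cores for the vault's worker pool, so latency-sensitive replica
/// work (e.g. TLS handshake signing) is not starved during DKG load.
fn capped_thread_count(cores: usize) -> usize {
    // Assumption: use half the cores, but always keep at least one thread.
    (cores / 2).max(1)
}

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("vault worker threads: {}", capped_thread_count(cores));
}
```

On a 16-core Gen2-like machine this would give the vault 8 worker threads, leaving headroom for the rest of the replica.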
|
Sounds good. Also note that I observed some other tests flaking on this PR as well; just making sure that we're not introducing a new source of flakiness. |
This substantially improves NIDKG performance. The following table of results is from the crypto benchmarks (so no network/consensus) on my devenv, which has 16 cores, similar to the Gen2 machines.
All times are for the 40-node, 40-dealer, high-threshold setting; the improvements for other settings are similar.