Sync Issue 2 #476

@n13

Here's my analysis of this log. This is a completely different failure mode from the previous logs -- and it confirms your hypothesis.

Timeline:

| Time | Event | Block | Rate (bps = blocks/s) |
|---|---|---|---|
| 17:52:48 | Fresh sync starts from genesis, 1 peer (Qmbct) | #21 | - |
| 17:52 - 17:53:57 | Fast sync through small blocks | #335 | 3-8 bps |
| 17:54:02 | Hits large tx-flood blocks | #338 | 0.3-0.7 bps |
| 17:56:17 | Through the fat region, speeds up | #417 | 1.9-9.8 bps |
| 18:01:43 | Stalls completely at a huge block | #1558 | 0.0 bps |
| 18:02:02 | First Network(Timeout) - peer disconnected | #1558 | stuck |
| 18:03:03 | Backoff clears, new request sent | #1558 | stuck |
| 18:03:23 | Second Network(Timeout) (20s later) | #1558 | stuck |
| 18:04:23 | Third disconnect | #1558 | stuck |
| 18:04:26 | "Potential long-range attack" (stale mined block) | #1558 | - |
| 18:04 - 18:10 | ~6 minutes dead: 0 peers, 0 progress | #1558 | - |
| 18:10:11 | Qmbct reconnects, Refused in 77ms | #1558 | - |
| 18:10:24 | QmQ4A connects, crawls at ~1 block/5 seconds | #1559 | ~0.1 bps |
| 18:11:26 | QmQ4A Network(Timeout) too | #1562 | stuck |
| 18:11 onwards | Endless Refused/Timeout cycle across all peers | #1562 | 0 |
| 20:26:42 | User kills node - still stuck at #1562, 0 peers | #1562 | dead |

The node sat stuck for over 2 hours and never recovered.

Root cause -- confirmed: large block + 20s timeout:

The proof is in the download bandwidth numbers. While stuck at #1558:

18:01:48 -- 0.0 bps, download 87.8 kiB/s
18:01:53 -- 0.0 bps, download 200.2 kiB/s
18:01:58 -- 0.0 bps, download 172.7 kiB/s
18:02:02 -- Network(Timeout)
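
To make the "data is arriving but nothing is imported" pattern easy to check on other logs, here is a minimal Python sketch. It assumes lines in the summarized form quoted above (time, `bps`, `download ... kiB/s`), which is not the raw node log format; the regex and the helper names `parse_samples` and `stalled_but_downloading` are made up for illustration.

```python
import re

# Sample lines copied from the summary above (NOT the raw node log format).
SAMPLES = """\
18:01:48 -- 0.0 bps, download 87.8 kiB/s
18:01:53 -- 0.0 bps, download 200.2 kiB/s
18:01:58 -- 0.0 bps, download 172.7 kiB/s
"""

# Assumed line shape: "HH:MM:SS -- <blocks/s> bps, download <rate> kiB/s"
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) -- ([\d.]+) bps, download ([\d.]+) kiB/s$"
)


def parse_samples(text):
    """Return (time, blocks_per_sec, download_kib_per_sec) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append(
                (match.group(1), float(match.group(2)), float(match.group(3)))
            )
    return rows


def stalled_but_downloading(rows, min_kib_per_sec=1.0):
    """True if every sample shows zero import progress while data arrives."""
    return bool(rows) and all(
        bps == 0.0 and dl >= min_kib_per_sec for _, bps, dl in rows
    )


rows = parse_samples(SAMPLES)
mean_dl = sum(dl for _, _, dl in rows) / len(rows)
print(f"samples={len(rows)} mean_download={mean_dl:.1f} kiB/s "
      f"stalled={stalled_but_downloading(rows)}")
```

On the three samples above this reports a mean download rate of about 153.6 kiB/s with zero block progress, which is the signature of a transfer that is running but not completing.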

Data IS flowing (roughly 90-200 kiB/s), but the block never finishes downloading, so it is never imported. The block request times out after 20 seconds. At 200 kiB/s, at most ~4 MB can be transferred in 20 seconds, but the block is likely larger (the tx-flood blocks had 538 extrinsics and hit the 5 MiB block weight limit). The peer is also bandwidth-constrained, since it is serving its own traffic.
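
The arithmetic behind that estimate, as a small Python sketch. It assumes the block is about 5 MiB (the actual size of #1558 is not in the log) and ignores protocol overhead; the function names are illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB
TIMEOUT_SEC = 20  # request timeout observed in the log


def max_transfer_bytes(rate_kib_per_sec, timeout_sec):
    """Most data that can arrive before the request times out."""
    return rate_kib_per_sec * KIB * timeout_sec


def min_rate_kib_per_sec(block_bytes, timeout_sec):
    """Sustained rate needed to finish a block within the timeout."""
    return block_bytes / timeout_sec / KIB


# Download rates sampled while stuck at #1558.
for rate in (87.8, 172.7, 200.2):
    mib = max_transfer_bytes(rate, TIMEOUT_SEC) / MIB
    print(f"{rate:6.1f} kiB/s -> at most {mib:.2f} MiB in {TIMEOUT_SEC}s")

# Assumption: a block of about 5 MiB.
needed = min_rate_kib_per_sec(5 * MIB, TIMEOUT_SEC)
print(f"5 MiB block needs >= {needed:.0f} kiB/s sustained")
```

Under these assumptions a 5 MiB block needs 256 kiB/s sustained for the full 20 seconds, which is above the best rate sampled during the stall (200.2 kiB/s, about 3.9 MiB per timeout window).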

When QmQ4A finally manages to serve block #1559 (took ~5s), that block only just gets through. But the subsequent blocks are equally large, and the timeouts resume.

This is NOT the same as the previous fork-loop problem. This node:

  • Never forked (it's syncing from genesis)
  • Never mined on its own chain (the "Potential long-range attack" at 18:04:26 was a stale block proposal from before the sync stall)
  • Has zero transaction propagation overhead

The sole problem is: the sync/2 request-response protocol's 20-second timeout is too short for large blocks over a bandwidth-limited link. One timeout triggers the disconnect-backoff cascade, and the node never recovers because every retry requests the same large block and hits the same timeout.
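
A toy model of why retrying does not help, in Python. This is not the sync/2 implementation. It assumes each retry restarts the transfer from zero, reuses the three download rates sampled during the stall as stand-ins for successive attempts, and assumes a block of about 5 MiB. `scaled_timeout` and its 64 kiB/s floor are hypothetical, shown only to contrast a fixed timeout with one that grows with block size.

```python
KIB = 1024
MIB = 1024 * KIB


def first_successful_attempt(block_bytes, rates_kib_per_sec, timeout_sec):
    """Return the 1-based attempt that finishes in time, or None.

    Toy assumption: every attempt restarts the transfer from zero.
    """
    for attempt, rate in enumerate(rates_kib_per_sec, start=1):
        if block_bytes / (rate * KIB) <= timeout_sec:
            return attempt
    return None


def scaled_timeout(block_bytes, floor_rate_kib_per_sec, base_sec=20):
    """Hypothetical policy: allow the block to finish at a floor rate."""
    return max(base_sec, block_bytes / (floor_rate_kib_per_sec * KIB))


OBSERVED_RATES = [87.8, 200.2, 172.7]  # kiB/s, sampled while stuck at #1558
BLOCK_BYTES = 5 * MIB                  # assumption, real size unknown

fixed = first_successful_attempt(BLOCK_BYTES, OBSERVED_RATES, 20)
timeout = scaled_timeout(BLOCK_BYTES, floor_rate_kib_per_sec=64)
scaled = first_successful_attempt(BLOCK_BYTES, OBSERVED_RATES, timeout)

print(f"fixed 20s timeout: first success = {fixed}")
print(f"scaled {timeout:.0f}s timeout: first success = {scaled}")
```

In this model the fixed 20-second timeout fails on every attempt, while a timeout sized for the block succeeds on the first one. Whether the right fix is a longer timeout, a size-aware timeout, or smaller responses is a separate question.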

Full log:
log9.txt
