Here's my analysis of this log. This is a completely different failure mode from the previous logs -- and it confirms your hypothesis.
Timeline:
| Time | Event | Block | Rate |
|---|---|---|---|
| 17:52:48 | Fresh sync starts from genesis, 1 peer (Qmbct) | #21 | - |
| 17:52 - 17:53:57 | Fast sync through small blocks | #335 | 3-8 bps |
| 17:54:02 | Hits large tx-flood blocks | #338 | 0.3-0.7 bps |
| 17:56:17 | Through the fat region, speeds up | #417 | 1.9-9.8 bps |
| 18:01:43 | Stalls completely at a huge block | #1558 | 0.0 bps |
| 18:02:02 | First `Network(Timeout)` - peer disconnected | #1558 | stuck |
| 18:03:03 | Backoff clears, new request sent | #1558 | stuck |
| 18:03:23 | Second `Network(Timeout)` (20s later) | #1558 | stuck |
| 18:04:23 | Third disconnect | #1558 | stuck |
| 18:04:26 | "Potential long-range attack" (stale mined block) | #1558 | - |
| 18:04 - 18:10 | ~6 minutes dead: 0 peers, 0 progress | #1558 | - |
| 18:10:11 | Qmbct reconnects, `Refused` in 77ms | #1558 | - |
| 18:10:24 | QmQ4A connects, crawls at ~1 block/5 seconds | #1559 | ~0.1 bps |
| 18:11:26 | QmQ4A `Network(Timeout)` too | #1562 | stuck |
| 18:11 onwards | Endless `Refused`/`Timeout` cycle across all peers | #1562 | 0 |
| 20:26:42 | User kills node - still stuck at #1562, 0 peers | #1562 | dead |
The node sat stuck for over 2 hours and never recovered.
Root cause -- confirmed: large block + 20s timeout:
The proof is in the download bandwidth numbers. While stuck at #1558:
```
18:01:48 -- 0.0 bps, download 87.8 kiB/s
18:01:53 -- 0.0 bps, download 200.2 kiB/s
18:01:58 -- 0.0 bps, download 172.7 kiB/s
18:02:02 -- Network(Timeout)
```
Data IS flowing (100-200 kiB/s) but the block doesn't finish importing. The block request times out at 20 seconds. At 200 kiB/s, the max transferable in 20 seconds is ~4 MB, but the block is likely larger (the tx-flood blocks had 538 extrinsics and hit the 5 MiB block weight limit). The peer is also bandwidth-constrained serving its own traffic.
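The arithmetic behind that ~4 MB ceiling can be sanity-checked directly. A minimal sketch, using the observed download rates from the log and the assumed 20-second `sync/2` timeout (the rate numbers are from this log; nothing here is a measured protocol limit):

```rust
// Back-of-envelope check: how many bytes can cross the link before the
// assumed 20 s sync/2 request timeout fires, at the rates seen in the log?
fn max_transfer_bytes(rate_kib_per_s: f64, timeout_s: f64) -> f64 {
    rate_kib_per_s * 1024.0 * timeout_s
}

fn main() {
    let timeout_s = 20.0; // sync/2 request-response timeout
    for rate in [87.8, 172.7, 200.2] {
        let max_mib = max_transfer_bytes(rate, timeout_s) / (1024.0 * 1024.0);
        println!("{rate:>6.1} kiB/s -> at most {max_mib:.1} MiB in {timeout_s} s");
    }
    // Even at the best observed rate (200.2 kiB/s) the ceiling is ~3.9 MiB,
    // below a ~5 MiB tx-flood block, so the request can never complete in time.
}
```

At the peak observed 200.2 kiB/s, the ceiling is about 3.9 MiB per 20-second window, so any block at or near the 5 MiB limit is mathematically unservable before the timeout.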
When QmQ4A finally manages to serve block #1559 (took ~5s), that block barely made it. But subsequent blocks are equally large, and the timeouts resume.
This is NOT the same as the previous fork-loop problem. This node:
- Never forked (it's syncing from genesis)
- Never mined on its own chain (the "Potential long-range attack" at 18:04:26 was a stale block proposal from before the sync stall)
- Has zero transaction propagation overhead
The sole problem is that the sync/2 request-response protocol's 20-second timeout is too short for large blocks over a bandwidth-limited link. One timeout triggers the disconnect-backoff cascade, and the node never recovers because the same oversized block stalls it on every retry.
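Why retries can never fix this is easy to model. A toy sketch (the fixed 20 s timeout is from the log analysis; the ~60 s backoff, block sizes, and link rate are assumed round numbers for illustration, not values read from the node's source):

```rust
// Toy model of the stall: each attempt re-requests the same block, the
// request has a fixed timeout, and a timeout costs a disconnect + backoff.
const TIMEOUT_S: f64 = 20.0; // sync/2 request timeout (per the analysis)
const BACKOFF_S: f64 = 60.0; // assumed reconnect backoff, for illustration

/// Some(total wall-clock seconds) if the block syncs within `max_attempts`,
/// None if every attempt times out.
fn try_sync(block_bytes: f64, link_bytes_per_s: f64, max_attempts: u32) -> Option<f64> {
    let mut elapsed = 0.0;
    for _ in 0..max_attempts {
        let transfer_s = block_bytes / link_bytes_per_s;
        if transfer_s <= TIMEOUT_S {
            return Some(elapsed + transfer_s);
        }
        // Request times out, peer is disconnected, backoff before retrying
        // the exact same block over the exact same link.
        elapsed += TIMEOUT_S + BACKOFF_S;
    }
    None
}

fn main() {
    let link = 200.0 * 1024.0; // ~200 kiB/s, the best observed download rate
    // A 1 MiB block clears in one attempt (~5 s)...
    println!("{:?}", try_sync(1.0 * 1024.0 * 1024.0, link, 5));
    // ...but a ~5 MiB tx-flood block needs ~25.6 s > 20 s, so every attempt
    // times out. More attempts only burn more wall-clock time.
    println!("{:?}", try_sync(5.0 * 1024.0 * 1024.0, link, 5));
}
```

Since nothing changes between attempts, the loop is deterministic: either the block fits in one timeout window or it never will, which matches the two-hour stall at #1562.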
Full log:
log9.txt