This document explains how the raft package tests in SevenDB are structured, how the deterministic clock works, and how to write new fast, race-stable tests without wall-clock sleeps.
See also:
- `RAFT_ARCHITECTURE.md` for a full architectural deep dive (elections, WAL, transport, pruning).
- Eliminate flakiness from timing races (leader election, heartbeats, snapshots)
- Reduce test wall time (no real `time.Sleep`) to speed up CI
- Provide clear patterns for: replication, failover, restart, pruning, WAL shadow parity, and error semantics
- Ensure WAL migration features (dual write, validator, pruning, strict sync) remain covered
`ShardRaftNode` accepts an optional simulated clock via the unexported field `clk`, injected using `RaftConfig.TestDeterministicClock`.
We expose a helper in raft_test_helpers.go:
newDeterministicNode(t, cfg, startTime) *ShardRaftNode
This creates a normal etcd raft node but replaces its internal tick loop scheduling with a SimulatedClock from internal/harness/clock.
Instead of sleeping, tests call:
advanceAll(nodes, 10*time.Millisecond)
This increments the simulated time for any node that has a deterministic clock; each tick triggers raft's internal heartbeat/election processing just as a real ticker would.
Choose step sizes (e.g. 5–20ms) small enough to satisfy election timeouts (configured by HeartbeatMillis, ElectionTimeoutMillis) while still running quickly.
Old pattern:
waitUntil(t, 5*time.Second, 50*time.Millisecond, func() bool { _,_,ok := findLeader(nodes); return ok }, "leader")
New deterministic pattern:
for i := 0; i < 500; i++ { if _,_,ok := findLeader(nodes); ok { break } ; advanceAll(nodes, 10*time.Millisecond) }
if _,_,ok := findLeader(nodes); !ok { t.Fatalf("no leader") }
This avoids real elapsed time and finishes as soon as leadership is established.
Helper:
proposeOnLeader(t, nodes, bucket, payload)
After proposals, instead of sleeping, iterate advancing simulated time until Status().LastAppliedIndex reaches the target across all nodes.
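One way to express that apply-wait as a predicate, assuming only the `Status().LastAppliedIndex` field described above (the `node` type here is a minimal stand-in for `*ShardRaftNode`):

```go
package main

// nodeStatus mirrors the one field this sketch needs from Status(); the real
// status struct exposes more (snapshot index, leader hint, ...).
type nodeStatus struct{ LastAppliedIndex uint64 }

type node struct{ applied uint64 }

func (n *node) Status() nodeStatus { return nodeStatus{LastAppliedIndex: n.applied} }

// allApplied is the loop predicate: true once every node has caught up to the
// index returned by proposeOnLeader.
func allApplied(nodes []*node, target uint64) bool {
	for _, n := range nodes {
		if n.Status().LastAppliedIndex < target {
			return false
		}
	}
	return true
}

func main() {
	nodes := []*node{{applied: 5}, {applied: 3}}
	if allApplied(nodes, 4) {
		panic("second node has not applied index 4 yet")
	}
	nodes[1].applied = 4
	if !allApplied(nodes, 4) {
		panic("all nodes reached index 4")
	}
}
```

In a real test this predicate sits inside the advancement loop: keep calling `advanceAll` until `allApplied(nodes, rIdx)` holds or the step budget runs out.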
Trigger snapshots by lowering config.Config.RaftSnapshotThresholdEntries (or adjusting per-node snapshotThreshold when safe in tests). Then loop advancing time until LastSnapshotIndex > 0 or PrunedThroughIndex >= snapshotIndex.
Shadow WAL (new protobuf-based log) is dual-written alongside the legacy in-memory raft storage. Tests that exercise failover or restart call:
n.ValidateShadowTail()
to assert the CRC/index parity ring matches recent writes.
Advanced tests can supply a mock or in-memory WAL by setting RaftConfig.WALFactory to a function returning a test implementation of the WAL interface. This allows assertions on:
- Append ordering (`AppendEntry` call sequence / indices / kinds)
- HardState change filtering (no duplicate `AppendEntry` for unchanged HardState)
- Prune behavior (`PruneThrough` receives expected watermark after snapshot)
Pattern:
cfg.WALFactory = func(cfg RaftConfig) (WAL, error) { return newMockWAL(t), nil }
Ensure dual-write remains enabled if you want parity checks (set `EnableWALShadow=true`), or disable it to isolate the new WAL path's performance characteristics.
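A minimal recording mock, of the kind returned from `WALFactory`, might look like the sketch below. The `AppendEntry` and `PruneThrough` signatures are assumptions for illustration; align them with the actual WAL interface in the raft package, which also has methods (open/close, read paths) a real mock must satisfy:

```go
package main

import "sync"

// mockWAL records calls so tests can assert on ordering and watermarks.
type mockWAL struct {
	mu            sync.Mutex
	appended      []uint64 // indices passed to AppendEntry, in call order
	prunedThrough uint64   // last watermark passed to PruneThrough
}

// AppendEntry records the index of each appended entry (signature assumed).
func (w *mockWAL) AppendEntry(index uint64, kind string, data []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.appended = append(w.appended, index)
	return nil
}

// PruneThrough records the prune watermark (signature assumed).
func (w *mockWAL) PruneThrough(index uint64) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.prunedThrough = index
	return nil
}

func main() {
	w := &mockWAL{}
	for i := uint64(1); i <= 3; i++ {
		_ = w.AppendEntry(i, "entry", nil)
	}
	_ = w.PruneThrough(2)
	if len(w.appended) != 3 || w.prunedThrough != 2 {
		panic("mock did not record calls")
	}
}
```

After the scenario runs, assert on `w.appended` (ordering, no duplicates for unchanged HardState) and `w.prunedThrough` (expected post-snapshot watermark).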
Use the built-in abrupt termination helper:
leader.Crash()
This mimics an ungraceful process death:
- Does NOT flush or close WALs (unlike `Close()`)
- Stops background goroutines (`tickLoop`, `readyLoop`)
- Marks the node closed via both `closed` (under mutex) and `closedAtomic` for lock-free fast-path guards
Updated failover pattern:
- Detect leader: `lid, lnode := findLeader(...)`.
- Call `lnode.Crash()`.
- Advance simulated time until a new leader is elected.
- (Optional) Restart the crashed identity by constructing a NEW node with the same `NodeID` & `DataDir` and reattaching the transport.
- Advance time until its `Status().LastAppliedIndex` catches up with the current leader.
Legacy placeholder replacement is no longer required and should be removed from new tests.
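The failover pattern above can be modeled on a toy cluster to show the loop shape. Everything here is a stand-in (the `toyCluster` type and its election rule are illustrative); real tests use `findLeader`, `Crash`, and `advanceAll` from `raft_test_helpers.go`:

```go
package main

// toyNode and toyCluster model just enough state to illustrate the
// crash-then-advance-until-new-leader loop.
type toyNode struct {
	id      uint64
	crashed bool
	leader  bool
}

type toyCluster struct {
	nodes          []*toyNode
	ticksToElect   int // virtual ticks until survivors elect a new leader
	ticksSinceLoss int
}

func (c *toyCluster) findLeader() (uint64, *toyNode, bool) {
	for _, n := range c.nodes {
		if !n.crashed && n.leader {
			return n.id, n, true
		}
	}
	return 0, nil, false
}

func (c *toyCluster) crash(n *toyNode) {
	n.crashed = true
	n.leader = false
	c.ticksSinceLoss = 0
}

// advance models one advanceAll step: after enough leaderless ticks, a
// surviving node wins the election.
func (c *toyCluster) advance() {
	if _, _, ok := c.findLeader(); ok {
		return
	}
	c.ticksSinceLoss++
	if c.ticksSinceLoss >= c.ticksToElect {
		for _, n := range c.nodes {
			if !n.crashed {
				n.leader = true
				return
			}
		}
	}
}

func main() {
	c := &toyCluster{nodes: []*toyNode{{id: 1, leader: true}, {id: 2}, {id: 3}}, ticksToElect: 10}
	_, old, _ := c.findLeader()
	c.crash(old) // step 2: Crash() the leader
	// step 3: advance simulated time until a new leader is elected
	var newID uint64
	for steps := 0; steps < 500; steps++ {
		if id, _, ok := c.findLeader(); ok {
			newID = id
			break
		}
		c.advance()
	}
	if newID == 0 || newID == old.id {
		panic("expected a different, live leader")
	}
}
```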
When restarting a node, its prior WAL shadow directory is reused so that:
- HardState is replayed (commit/term restored)
- Entries are fed back into raft `MemoryStorage` via WAL primary read (if enabled)
- Validator ring is seeded
Tests assert indices (LastAppliedIndex, LastSnapshotIndex) do not regress post-restart.
The raft layer assigns monotonically increasing per-bucket commit indices. Tests alternate bucket IDs during proposals and validate each sequence is contiguous starting at 1 without gaps.
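The contiguity invariant is easy to check mechanically; a minimal sketch of the per-bucket assertion a test might run:

```go
package main

// contiguousFromOne reports whether seq is exactly 1, 2, 3, ..., len(seq):
// the invariant asserted for each bucket's commit index sequence.
func contiguousFromOne(seq []uint64) bool {
	for i, v := range seq {
		if v != uint64(i)+1 {
			return false
		}
	}
	return true
}

func main() {
	perBucket := map[string][]uint64{
		"a": {1, 2, 3},
		"b": {1, 2},
	}
	for bucket, seq := range perBucket {
		if !contiguousFromOne(seq) {
			panic("gap in commit indices for bucket " + bucket)
		}
	}
	if contiguousFromOne([]uint64{1, 3}) {
		panic("a gap (missing 2) must be detected")
	}
}
```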
Follower proposals must return a typed NotLeaderError struct containing the current leader's ID (or empty while unknown). Tests assert error type + hint correctness.
| File | Purpose | Deterministic Integration |
|---|---|---|
| `raft_wal_shadow_test.go` | WAL shadow parity, leader failover, restart | Converted: uses deterministic clock for multi-node failover scenario |
| `raft_cluster_test.go` | Core replication, snapshot, restart, pruning, bucket isolation, error semantics | All tests converted to deterministic advancement |
| `raft_wal_prune_test.go` | Snapshot-driven WAL segment pruning & recovery cleanup | Converted (single-node leader wait + time advancement) |
| `raft_wal_primary_read_test.go` | WAL primary read seeding storage (static write then replay) | Purely file-based; does not need deterministic time |
| `raft_deterministic_test.go` | Manual mode validation (no background goroutines) | Separately exercises manual tick processing |
| `raft_test_helpers.go` | Helper utilities (`closeAll`, `newDeterministicNode`, `advanceAll`) | Hosts deterministic tooling |
- Configure global test knobs as needed:
config.Config = &config.DiceDBConfig{}
config.Config.RaftSnapshotThresholdEntries = 5
- Create nodes with deterministic clock:
start := time.Unix(0,0)
for i := 1; i <= N; i++ {
cfg := RaftConfig{ShardID: sid, NodeID: fmt.Sprintf("%d", i), Peers: peerSpecs, DataDir: dir(i), Engine: "etcd", ForwardProposals: true}
n := newDeterministicNode(t, cfg, start)
transport.attach(uint64(i), n); n.SetTransport(transport)
nodes = append(nodes, n)
}
- Drive election:
for steps:=0; steps<500; steps++ { if _,_,ok := findLeader(nodes); ok { break }; advanceAll(nodes, 10*time.Millisecond) }
- Propose and confirm apply:
_, rIdx := proposeOnLeader(t, nodes, "b", []byte("val"))
for steps:=0; steps<300 && nodes[0].Status().LastAppliedIndex < rIdx; steps++ { advanceAll(nodes, 10*time.Millisecond) }
- Assertions; prefer invariants on indices / snapshot progression over sleeps.
- Cleanup via `closeAll(t, nodes)` to ensure WAL directory hygiene.
- Election: ~ (ElectionTimeoutMillis / step) * small multiplier (we used 500 * 10ms = 5s virtual upper bound)
- Apply loops: Bound by a few hundred iterations usually sufficient; bail early once condition met.
- Snapshot/prune: 1000 iterations * 10ms (10s virtual) should be generous; adjust if thresholds increase.
Because advancement is virtual, a larger bound does not slow real time proportionally—only loop CPU + raft processing cost matters.
- Do not mix real `time.Sleep` with simulated advancement inside the same logical waiting loop; it reintroduces nondeterminism.
- Always re-detect the leader just before proposing if leadership changes are expected (use `proposeOnLeader`).
- After a crash simulation, ensure the placeholder node isn't used for proposals (we tag it with `shardID:"_dead"`).
- When pruning tests depend on snapshots, ensure you actually cross the configured snapshot threshold; check `LastSnapshotIndex`.
- After restart, compare indices (applied, snapshot) against the pre-shutdown baseline to detect regressions.
Future enhancements may include:
- Enriching WAL validator comparisons beyond (index, crc) to include namespace/bucket/opcode/sequence.
- Metrics around mismatch counts and fsync latency for durability mode comparisons.
- Helper to wait on predicate with automatic advancement (wrapper around common pattern extracted once stabilized).
Run with:
go test -race ./internal/raft -count=1
Deterministic tests dramatically reduce flaky race reports caused by goroutine timing windows during election and snapshotting.
Manual mode (Manual: true) disables background tick & Ready loops entirely; tests explicitly call:
n.ManualTick()
for n.ManualProcessReady() {}
Use this when you need total control or to unit-test state machine transitions in isolation. Deterministic clock mode keeps the standard background loops but replaces wall-clock tick scheduling.
closeAll asserts only expected segment (seg-*.wal) and sidecar index (seg-*.wal.idx) files remain. Any temporary or .deleted remnants cause test failure, surfacing incomplete prune or rotation cleanup.
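The hygiene check reduces to matching surviving file names against the two allowed patterns; a sketch using `filepath.Match`:

```go
package main

import "path/filepath"

// isExpectedWALFile reports whether name matches one of the only two
// patterns allowed to survive closeAll: seg-*.wal and seg-*.wal.idx.
func isExpectedWALFile(name string) bool {
	for _, pat := range []string{"seg-*.wal", "seg-*.wal.idx"} {
		if ok, _ := filepath.Match(pat, name); ok {
			return true
		}
	}
	return false
}

func main() {
	good := []string{"seg-000001.wal", "seg-000001.wal.idx"}
	bad := []string{"seg-000001.wal.deleted", "seg-000001.wal.tmp"}
	for _, n := range good {
		if !isExpectedWALFile(n) {
			panic("expected file rejected: " + n)
		}
	}
	for _, n := range bad {
		if isExpectedWALFile(n) {
			panic("leftover file accepted: " + n)
		}
	}
}
```

A real helper would walk the WAL directory and fail the test on the first name this predicate rejects, surfacing incomplete prune or rotation cleanup.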
- Pure file reconstruction tests (e.g., `raft_wal_primary_read_test.go`) where timing is irrelevant can remain simple.
- Network / integration layers not yet adapted to simulated time should isolate their sleeps behind helper abstractions if later conversion is desired.
- Decide if it depends on timing (election/snapshot). If yes, use deterministic clock.
- Compose scenario steps using proposals, crash/restart patterns, and index assertions.
- Validate WAL parity if the shadow is involved (`ValidateShadowTail`).
- Ensure final cleanup via `closeAll`.
Q: Why keep a waitUntil helper?
A: Some legacy tests still rely on it; new tests should prefer explicit advancement loops. We may refactor waitUntil to accept an optional node slice for automatic advancement.
Q: Does advancing simulated time spin CPU heavily?
A: Each advancement triggers raft tick processing; loop bounds are modest (hundreds) and far cheaper than multi-second real sleeps.
Q: Can deterministic and real nodes mix?
A: Avoid mixing in the same test; either all nodes share simulated time or all use wall clock, to retain consistent timing behavior.
Recent changes added atomic leader hint caching and a fast closed check. Tests relying on leadership transitions should:
- Prefer `node.LeaderID()` (which uses the atomic fallback) rather than calling `rn.Status()` directly.
- Avoid asserting the leader immediately after a crash; allow several election ticks (advance loop) so `lastKnownLeader` updates through Ready processing.
- When validating proposal abort semantics post-crash, assert `ErrProposalAborted` (waiter channel closed) rather than generic errors.
| Scenario | Use | Reason |
|---|---|---|
| Graceful cluster scale down | `Close()` | Ensures waiters notified & WAL flushed. |
| Abrupt failover / durability test | `Crash()` | Emulates power loss; tests WAL recovery code path. |
| Race detector validation (resource leak) | `Close()` | Deterministic teardown avoids false positives. |
Last updated: 2025-10-07
This section explains how to run a small RAFT cluster across two laptops on a LAN to validate elections, replication, and failover. You can do a minimal 2-node test (no failover) or a recommended 3-node test (fault tolerant) by running two nodes on one laptop and one on the other.
- Two machines on the same network, reachable by IP (for example 192.168.1.10 and 192.168.1.20).
- Open firewall for RAFT gRPC ports you choose (default 7090; second node on same host can use 7091).
- Build the server binary on each machine:
make build

Peers list (use on both machines): `1@192.168.1.10:7090`, `2@192.168.1.20:7090`.
Laptop A (node 1):
./sevendb \
--raft-enabled=true \
--raft-engine=etcd \
--raft-node-id=1 \
--raft-nodes 1@192.168.1.10:7090 \
--raft-nodes 2@192.168.1.20:7090 \
--raft-listen-addr=0.0.0.0:7090 \
--raft-advertise-addr=192.168.1.10:7090 \
--num-shards=1 \
--port=7379 \
--raft-persistent-dir=/tmp/sevendb-node1/raftdata \
--wal-dir=/tmp/sevendb-node1/logs \
--log-level=debug

Laptop B (node 2):
./sevendb \
--raft-enabled=true \
--raft-engine=etcd \
--raft-node-id=2 \
--raft-nodes 1@192.168.1.10:7090 \
--raft-nodes 2@192.168.1.20:7090 \
--raft-listen-addr=0.0.0.0:7090 \
--raft-advertise-addr=192.168.1.20:7090 \
--num-shards=1 \
--port=7379 \
--raft-persistent-dir=/tmp/sevendb-node2/raftdata \
--wal-dir=/tmp/sevendb-node2/logs \
--log-level=debug

Note: a 2-node cluster cannot make progress if either node is down (majority of 2 is 2). This is fine for connectivity and replication smoke tests, but not for failover testing.
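The quorum arithmetic behind this note: an n-node raft cluster needs floor(n/2) + 1 votes to elect a leader or commit. A tiny illustration:

```go
package main

// majority returns the quorum size for an n-node raft cluster.
func majority(n int) int { return n/2 + 1 }

func main() {
	// A 2-node cluster needs both nodes up (quorum 2 of 2): one failure halts it.
	if majority(2) != 2 {
		panic("2-node quorum must be 2")
	}
	// A 3-node cluster keeps a quorum of 2 after losing one node.
	if majority(3) != 2 {
		panic("3-node quorum must be 2")
	}
	if majority(5) != 3 {
		panic("5-node quorum must be 3")
	}
}
```

This is why the 3-node layout below is the smallest configuration that can demonstrate failover.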
Run two nodes on Laptop A and one node on Laptop B. Use a 3-peer set:
- 1@192.168.1.10:7090
- 2@192.168.1.20:7090
- 3@192.168.1.10:7091
Laptop A, node 1:
./sevendb \
--raft-enabled=true \
--raft-engine=etcd \
--raft-node-id=1 \
--raft-nodes 1@192.168.1.10:7090 \
--raft-nodes 2@192.168.1.20:7090 \
--raft-nodes 3@192.168.1.10:7091 \
--raft-listen-addr=0.0.0.0:7090 \
--raft-advertise-addr=192.168.1.10:7090 \
--num-shards=1 \
--port=7379 \
--raft-persistent-dir=/tmp/sevendb-node1/raftdata \
--wal-dir=/tmp/sevendb-node1/logs \
--log-level=debug

Laptop B, node 2:
./sevendb \
--raft-enabled=true \
--raft-engine=etcd \
--raft-node-id=2 \
--raft-nodes 1@192.168.1.10:7090 \
--raft-nodes 2@192.168.1.20:7090 \
--raft-nodes 3@192.168.1.10:7091 \
--raft-listen-addr=0.0.0.0:7090 \
--raft-advertise-addr=192.168.1.20:7090 \
--num-shards=1 \
--port=7379 \
--raft-persistent-dir=/tmp/sevendb-node2/raftdata \
--wal-dir=/tmp/sevendb-node2/logs \
--log-level=debug

Laptop A, node 3 (second process on Laptop A):
./sevendb \
--raft-enabled=true \
--raft-engine=etcd \
--raft-node-id=3 \
--raft-nodes 1@192.168.1.10:7090 \
--raft-nodes 2@192.168.1.20:7090 \
--raft-nodes 3@192.168.1.10:7091 \
--raft-listen-addr=0.0.0.0:7091 \
--raft-advertise-addr=192.168.1.10:7091 \
--num-shards=1 \
--port=7380 \
--raft-persistent-dir=/tmp/sevendb-node3/raftdata \
--wal-dir=/tmp/sevendb-node3/logs \
--log-level=debug

This setup tolerates one node failure (majority of 3 is 2). Kill the leader process and observe re-election on the remaining nodes.
Add these flags to each process to route RAFT appends through the unified WAL in addition to legacy paths:
--enable-wal \
--wal-variant=forge

- Leader election: check the periodic status JSON. By default it's written under the metadata directory (or set `--status-file-path=/tmp/sevendb-node1/status.json`). Look for `leader_id`, `is_leader`, `last_applied_index`.
- Writes replicate: run a few writes on the leader (e.g., using `redis-cli` against the leader's app port) and watch `last_applied_index` increase on all nodes.
- Failover (3-node): kill the leader process; another node should show `is_leader=true` soon after, and commits should continue.
- WAL hygiene: inspect `--wal-dir`; after snapshots, old `seg-*.wal` files are pruned and there are no lingering `*.wal.deleted` markers.
- Use unique directories per node for `--raft-persistent-dir` and `--wal-dir`. Don't share the same folder across processes.
- Ensure RAFT ports are reachable between machines (e.g., `nc -vz 192.168.1.20 7090`).
- Two-node clusters cannot make progress if one node is down; use three nodes to test failover.
- Transport is unauthenticated by default; keep tests on a trusted network.