
feat(sync-service): Scale SQLite connection pool to 0 #3908

Open
magnetised wants to merge 7 commits into main from magnetised/scale-sqlite-pool

Conversation

@magnetised
Contributor

@magnetised magnetised commented Feb 25, 2026

NimblePool allows for lazy initialization and scale down of SQLite connections. This PR adds support for that.

The scaling is disabled for exclusive_mode as an exclusive_mode db could be in-memory, and closing that would be catastrophic.

I've added the number of active connections to the exported data. It will be interesting to see what goes on there, both in the larger instances and in the many less-active ones.

This disables sqlite statistics collection by default as the usefulness of the exported numbers is low without memory statistics enabled, and enabling memory stats is potentially causing issues in some deployments.
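The lazy initialization and scale-down described above can be sketched with NimblePool's `lazy: true` and `worker_idle_timeout` options, removing idle workers from the `handle_ping/2` callback. This is an illustrative sketch only, not the PR's actual code: the module name, the `Exqlite` calls, and the timeout value are assumptions.

```elixir
defmodule SqlitePoolSketch do
  @behaviour NimblePool

  # Hypothetical sketch: workers start on demand and idle
  # connections are closed, letting the pool scale to zero.
  def start_link(opts) do
    NimblePool.start_link(
      worker: {__MODULE__, Keyword.fetch!(opts, :db_path)},
      pool_size: Keyword.get(opts, :pool_size, 4),
      # don't open any SQLite connection until first checkout
      lazy: true,
      # idle workers receive a handle_ping/2 call after this long
      worker_idle_timeout: :timer.minutes(5)
    )
  end

  @impl NimblePool
  def init_worker(db_path) do
    {:ok, conn} = Exqlite.Sqlite3.open(db_path)
    {:ok, conn, db_path}
  end

  @impl NimblePool
  def handle_checkout(:checkout, _from, conn, pool_state) do
    {:ok, conn, conn, pool_state}
  end

  @impl NimblePool
  def handle_ping(conn, _pool_state) do
    # Close the idle connection so the pool scales down. A real
    # implementation must skip this in exclusive_mode, where an
    # in-memory database would lose all its data on close.
    :ok = Exqlite.Sqlite3.close(conn)
    {:remove, :idle}
  end
end
```

Returning `{:remove, :idle}` from `handle_ping/2` is what lets the pool shrink back toward zero between bursts of activity.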

@codecov

codecov bot commented Feb 25, 2026

❌ 6 Tests Failed:

Tests completed: 2480 | Failed: 6 | Passed: 2474 | Skipped: 1
View the top 3 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough
Stack Traces | 0s run time
11) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:501
     ** (EXIT from #PID<0.15226.0>) killed
Elixir.Electric.ShapeCacheTest::test after restart waits until publication filters are restored
Stack Traces | 0.0151s run time
6) test after restart waits until publication filters are restored (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1098
     Assertion with == failed
     code:  assert shape_handle1 == shape_handle2
     left:  "95846100-1772555936253232"
     right: "95846100-1772555936259083"
     stacktrace:
       test/electric/shape_cache_test.exs:1119: (test)
Elixir.Electric.ShapeCacheTest::test after restart restores shapes with subqueries and their materializers when backup missing
Stack Traces | 0.0185s run time
8) test after restart restores shapes with subqueries and their materializers when backup missing (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1172
     match (=) failed
     The following variables were pinned:
       dep_handle = "86050472-1772555935591404"
       shape_handle = "50617496-1772555935591771"
     code:  assert [{^dep_handle, _}, {^shape_handle, _}] = ShapeCache.list_shapes(ctx.stack_id)
     left:  [{^dep_handle, _}, {^shape_handle, _}]
     right: []
     stacktrace:
       test/electric/shape_cache_test.exs:1189: (test)
Elixir.Electric.ShapeCacheTest::test start_consumer_for_handle/2 starts a consumer plus dependencies
Stack Traces | 0.0207s run time
10) test start_consumer_for_handle/2 starts a consumer plus dependencies (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1245
     match (=) failed
     The following variables were pinned:
       dep_handle = "86050472-1772555935437322"
       shape_handle = "50617496-1772555935437552"
     code:  assert [{^dep_handle, _}, {^shape_handle, _}] = ShapeCache.list_shapes(stack_id)
     left:  [{^dep_handle, _}, {^shape_handle, _}]
     right: []
     stacktrace:
       test/electric/shape_cache_test.exs:1261: (test)
Elixir.Electric.ShapeCacheTest::test after restart restores shapes with subqueries and their materializers
Stack Traces | 0.0395s run time
19) test after restart restores shapes with subqueries and their materializers (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1151
     match (=) failed
     The following variables were pinned:
       dep_handle = "86050472-1772555929845785"
       shape_handle = "50617496-1772555929846224"
     code:  assert [{^dep_handle, _}, {^shape_handle, _}] = ShapeCache.list_shapes(ctx.stack_id)
     left:  [{^dep_handle, _}, {^shape_handle, _}]
     right: []
     stacktrace:
       test/electric/shape_cache_test.exs:1169: (test)
Elixir.Electric.ShapeCacheTest::test after restart restores latest offset
Stack Traces | 0.0396s run time
18) test after restart restores latest offset (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1123
     ** (MatchError) no match of right hand side value:

         {:error, :unknown}

     code: :started = ShapeCache.await_snapshot_start(shape_handle, ctx.stack_id)
     stacktrace:
       test/electric/shape_cache_test.exs:1145: (test)
Elixir.Electric.ShapeCacheTest::test after restart restores shape_handles
Stack Traces | 0.0965s run time
32) test after restart restores shape_handles (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1089
     Assertion with == failed
     code:  assert shape_handle1 == shape_handle2
     left:  "95846100-1772555927435806"
     right: "95846100-1772555927481081"
     stacktrace:
       test/electric/shape_cache_test.exs:1095: (test)
Elixir.Electric.Replication.ShapeLogCollector.RequestBatcherTest::test batching behavior add is invalidated if removal occurs before update is submitted
Stack Traces | 0.406s run time
4) test batching behavior add is invalidated if removal occurs before update is submitted (Electric.Replication.ShapeLogCollector.RequestBatcherTest)
     .../replication/shape_log_collector/request_batcher_test.exs:100
     Assertion failed, no matching message after 400ms
     The following variables were pinned:
       expected_msg = {:processor_called, %{}, MapSet.new(["handle-2"])}
     Showing 5 of 5 messages in the mailbox
     code: assert_receive ^expected_msg
     mailbox:
       pattern: ^expected_msg
       value:   {:processor_called, %{"handle-2" => "shape-2"}, MapSet.new([])}

       pattern: ^expected_msg
       value:   {#Reference<0.0.1398915.2673388554.2761228289.195239>, :ok}

       pattern: ^expected_msg
       value:   {:DOWN, #Reference<0.0.1398915.2673388554.2761228289.195239>, :process, #PID<0.10940.0>, :normal}

       pattern: ^expected_msg
       value:   {#Reference<0.0.1398915.2673388554.2761228290.181916>, :ok}

       pattern: ^expected_msg
       value:   {:DOWN, #Reference<0.0.1398915.2673388554.2761228290.181916>, :process, #PID<0.10943.0>, :normal}
     stacktrace:
       .../replication/shape_log_collector/request_batcher_test.exs:122: (test)


@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 3 times, most recently from 72269af to 0c0a256 on February 25, 2026 14:52
@magnetised magnetised marked this pull request as ready for review February 25, 2026 15:08
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 0c0a256 to 023caa8 on February 25, 2026 15:13
@alco alco added the claude label Feb 25, 2026
@claude

claude bot commented Feb 25, 2026

Claude Code Review

Summary

Round 13: Two new commits since round 12 — df9e53b63 (style cleanup in connection.ex) and d31116a51 (supervisor strategy change + WriteBuffer trap_exit removal + test). The supervisor change is the meaningful one: the ShapeDb supervisor now uses :one_for_all, max_restarts: 0, ensuring any crash in the subtree brings the whole supervisor down rather than leaving split state between SQLite, WriteBuffer ETS tables, and ShapeStatus ETS caches. One item remains from the previous round.

What's Working Well

  • Supervisor strategy rationale is sound — The comment in supervisor.ex:74-77 correctly explains why :one_for_one is unsafe given split state. Using :one_for_all, max_restarts: 0 is the right call: prefer a clean restart over a partially-recovered inconsistent state.
  • Shutdown ordering is safe — Children are terminated in reverse-start order: Task → WriteBuffer → Write pool → Read pool → Migrator → Statistics → PoolRegistry. WriteBuffer's terminate/2 flushes pending writes while the Write NimblePool is still alive. The ordering protects the best-effort flush.
  • trap_exit removal is correct — WriteBuffer does not link to any external processes (no spawn_link, Task.async, etc.). Supervisors send :shutdown to children, which GenServer always handles by calling terminate/2, regardless of the trap_exit flag. Removing it is clean.
  • Test is solid — The new test "crashing WriteBuffer restarts entire supervision tree" directly validates the invariant with monitors on both WriteBuffer and the supervisor, confirming the crash propagates as intended.
  • Style fixes in df9e53b63 — Moving mode = ... out of the with expression and removing the redundant if(...) parens are both correct — with should only be used with pattern-matchable clauses.
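The supervision shape the review describes can be sketched as follows. Child names here are illustrative stand-ins matching the shutdown order quoted above, not the PR's actual module names.

```elixir
# Hypothetical child ordering: children terminate in reverse-start
# order, so WriteBuffer's terminate/2 can still flush into the
# write pool during shutdown.
children = [
  PoolRegistry,
  Statistics,
  Migrator,
  ReadPool,
  WritePool,
  WriteBuffer,
  SnapshotTask
]

Supervisor.init(children,
  strategy: :one_for_all,
  # Never restart children in place: any crash escalates to the
  # parent, which restarts the whole subtree from a clean slate
  # instead of leaving SQLite and the ETS caches in split state.
  max_restarts: 0
)
```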

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

1. Logger.notice("SQLite statistics disabled") on every default startup (carried from round 9)

packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/statistics.ex:144

Still present. Default config (enable_stats?: false) emits this notice on every startup via the Statistics.initialize → handle_cast(:initialize) → read_stats(%{enable_stats?: false}) path. The default-off state of a feature is not noteworthy at :notice level.

Noting that the author has indicated they would prefer to move forward without this change — leaving it here for Alco's judgement.


Suggestions (Nice to Have)

2. Blacksmith 25 shell integration-test failures — still unresolved

The 2026-03-02 CI run still shows 25 shell test failures. No new CI runs visible since the last push. Worth a quick sanity-check that these are pre-existing flakiness on main before merging, not something introduced by this branch.

3. connections field in %__MODULE__{} struct is a placeholder (carried from round 12, observation only)

statistics.ex:26 — The struct stores connections: 0 but the live value lives in state.connections and is merged at call time. Fine as-is.

4. Missing @impl GenServer on handle_info/handle_cast clauses (carried from round 3, author declined)

statistics.ex:86,104,109,134 — Noted for completeness.

5. Task.async links the task — abnormal exit crashes Statistics (carried from round 5, author declined)

statistics.ex:166 — Noted for completeness.


Issue Conformance

No linked issue — same as all previous rounds. PR description is accurate and self-contained.

Previous Review Status

  • DONE: terminate_worker/3 — Fixed round 2.
  • DONE: use_pool? naming — Fixed round 2.
  • DONE: Fragile sleep in tests — Fixed round 2.
  • DONE: shrink_memory/1 dead code — Removed round 3.
  • DONE: Bare :error CaseClauseError risk — Fixed round 3.
  • DONE: CI does not scale the pool in exclusive mode — Fixed via handle_ping/2.
  • DONE: Statistics GenServer blocks on checkout — Fixed via Task.async.
  • DONE: Missing catch-all handle_info/2 — Fixed round 5.
  • DONE: CI test "statistics only exports disk usage by default" — Green.
  • DONE: Stats collection wakes pool preventing scale-to-0 — Resolved (stats disabled by default).
  • DONE: Logger.notice("Shape database file: ...") on every reconnect — Fixed (moved to migrator.ex).
  • DONE: Alco inline connection.ex suggestions — Applied.
  • DONE: handle_ping double-close/counter-negative bug — Fixed.
  • DONE: Style cleanup in connection.ex (if() parens, mode out of with) — Fixed in df9e53b.
  • DONE: one_for_one supervisor strategy left split state inconsistent — Fixed via one_for_all, max_restarts: 0 in d31116a.
  • OPEN: Logger.notice("SQLite statistics disabled") on default startup — Still present at statistics.ex:144. Author has indicated preference to proceed as-is.
  • OPEN: Blacksmith 25 shell integration-test failures — Status unconfirmed.
  • DECLINED: Task.async crash propagation — Author declined.
  • DECLINED: Missing @impl GenServer on handle_info/handle_cast clauses — Author declined.
  • DEFERRED: PRAGMA page_size each cycle — Fine to defer.

Review iteration: 13 | 2026-03-03

@alco
Member

alco commented Feb 26, 2026

@magnetised I've looked through the issues Claude flagged up; all of them look legit. Please ping me when ready for another review round.

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 023caa8 to 90029c5 on February 26, 2026 12:29

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 90029c5 to f8fa293 on February 26, 2026 13:43


@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from df9efb8 to 292b144 on February 26, 2026 15:40
@magnetised
Contributor Author

@alco think we can ignore claude's suggestions. ready for another look pls

@magnetised magnetised requested a review from alco February 26, 2026 16:11
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from 9772e20 to 3d410a6 on March 2, 2026 10:00

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from 812f743 to e95c5ae on March 3, 2026 11:59
Member

@alco alco left a comment

Looking good!

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 5 times, most recently from 524f5f8 to b316a68 on March 3, 2026 15:20
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from b316a68 to d31116a on March 3, 2026 16:17
so stats retrieval is not blocked by concurrent reading of stats
and always use a pooled connection for the retrieval
because now we're maybe shutting the write connection when idle, this
will log all the time
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from d31116a to 872c8bd on March 3, 2026 16:37