Guard against SQLite corruption and empty analytics cache by robelkin · Pull Request #210 · wesm/msgvault

robelkin · 2026-03-20T15:31:25Z

Summary

Increase SQLite busy_timeout from 5s to 30s — prevents timeout-induced corruption on large (27GB+) databases where lock contention during batch inserts can exceed 5 seconds
Set synchronous=NORMAL explicitly — safe with WAL mode, avoids I/O bottlenecks that compound lock contention
Add WAL checkpointing on Store.Close() and after each sync completion — folds WAL back into the main database, preventing accumulation across sessions and stale WAL entries
Fix build-cache writing _last_sync.json before verifying Parquet files — a failed export would write the state file with the current max message ID, causing future incremental builds to skip, leaving the cache permanently empty
Log count query errors in build-cache instead of silently swallowing them
Add SQLite integrity check (PRAGMA integrity_check) to the verify command with --skip-db-check opt-out

Context

Discovered two issues on a production 27GB database:

Analytics cache was permanently empty because _last_sync.json was written unconditionally even when the Parquet export produced zero files
SQLite database had 100+ integrity errors (invalid page numbers, out-of-order rowids, btreeInitPage errors) with a small 1.1MB WAL — consistent with WAL never being checkpointed and busy_timeout being too short for the database size

Test plan

All existing tests pass (make test)
Verify msgvault verify <email> runs integrity check and reports results
Verify msgvault build-cache with a corrupted/empty source DB does not write _last_sync.json
Verify msgvault build-cache --full-rebuild works correctly after the guard prevents state file write

🤖 Generated with Claude Code

roborev-ci · 2026-03-20T15:33:48Z

roborev: Combined Review (`3089eb9`)

Summary Verdict: The PR introduces valuable database integrity checks, WAL checkpointing, and cache-export guarding, but requires a fix to ensure proper error propagation on cache build failures.

Medium Severity

Location: cmd/msgvault/cmd/build_cache.go:42 6
- Problem: When the post-export read_parquet(...) count fails, or when a non-empty source database produces zero exported rows, buildCache now returns a nil error after printing a warning. This leaves the analytics cache empty or unusable while higher-level callers still
  observe a successful build, meaning scheduled/automated flows will not fail fast or retry.
- Fix: Return a non-nil error (or an explicit failure status that propagates to a non-zero exit) while still skipping sync-state persistence.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci · 2026-03-21T11:19:32Z

roborev: Combined Review (`0f165b7`)

Summary Verdict: The PR introduces solid database resilience and safety improvements, but contains two medium-severity issues regarding error handling during
database verification and WAL checkpointing.

Medium

cmd/msgvault/cmd/verify.go:87
- Problem: When runIntegrityCheck reports corruption, the command only prints the errors and continues with the rest of verification. That means verify can exit
  successfully even though the database is already known-bad, and later checks may produce misleading results on a corrupt store.
- Fix: Return a non-nil error after printing the integrity-check findings so the command fails fast on corruption, or at minimum propagate a failing exit status at the end.

internal/store/store.go:90

Problem: CheckpointWAL uses Exec("PRAGMA wal_checkpoint(TRUNCATE)") and treats nil error as success, but SQLite reports checkpoint status in the pragma result row. A busy or
incomplete checkpoint can therefore be silently accepted, leaving recent writes only in the WAL while callers assume the main DB file has been flushed.
Fix: Execute the pragma with QueryRow/Query and inspect the returned status columns, returning an error when SQLite reports a busy or incomplete checkpoint.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci · 2026-03-22T23:26:49Z

roborev: Combined Review (`ccd07b9`)

Verdict: The PR improves SQLite robustness and cache export safety, but introduces logical flaws in the cache validation and execution order of the integrity checks.

Medium

cmd/ msgvault/cmd/build_cache.go (around the new exportedCount guard)
The new exportedCount > 0 check only proves that some Parquet rows exist somewhere under messages/**/*.parquet; it does not prove this build produced them. If stale Parquet from a prior run
is still present, a failed current export can still pass this guard and incorrectly persist the new sync state.
Fix: Export into a fresh temp/output directory and atomically replace the old one on success, or validate against the expected row count for this run instead of checking for any existing Parquet rows.
cmd/msgvault/cmd/verify.go (around the new integrity-check block after Gmail client setup)
The code comments describe the integrity check as offline and independent of Gmail, but it runs only after OAuth/client initialization. If credential loading or Gmail client setup fails first
, database corruption is never checked or reported.
Fix: Move the SQLite integrity check to run immediately after opening the store, before any Gmail/OAuth setup.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- Increase busy_timeout from 5s to 30s for large databases - Set synchronous=NORMAL explicitly (safe with WAL mode) - Add WAL checkpoint on Store.Close() and after sync completion - Fix build-cache writing _last_sync.json before verifying Parquet files exist (prevented cache from ever rebuilding after a failed export) - Log count query errors in build-cache instead of silently swallowing - Add SQLite integrity check to verify command (PRAGMA integrity_check) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The buildCache function now returns a non-nil error when the database has messages but the Parquet export produces zero rows, so callers (CLI, serve scheduler, TUI auto-build) can fail fast or retry instead of silently treating an empty cache as success. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

verify: track corruption state across all checks and return a non-zero exit code at the end when integrity errors are found. CheckpointWAL: read the (busy, log, checkpointed) result columns from PRAGMA wal_checkpoint instead of discarding them via Exec, so callers get an honest error when the checkpoint is incomplete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Explicitly prohibit asking for commit permission and reframe committing as non-destructive to counter the system prompt's general caution about git operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The early return for a missing Gmail source bypassed the final dbCorrupt check, causing verify to exit 0 despite finding integrity errors. Check dbCorrupt before returning nil in that branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

roborev-ci · 2026-03-23T16:37:15Z

roborev: Combined Review (`e488f07`)

Verdict: The PR introduces valuable database integrity checks and cache validation, but the new safeguards contain logic flaws that limit their effectiveness.

Medium

Location: cmd/msgvault/cmd/verify.go (~

- Problem: The new SQLite integrity check is described and commented as an offline preflight, but it still runs only after OAuth credentials are loaded and a Gmail token source/client are created. If credentials are missing/invalid, verify exits before the DB check ever runs, so corruption is
  not reported in one of the main recovery scenarios.
- Fix: Move runIntegrityCheck(s) to immediately after opening the store, before any Gmail credential or client setup, and only do Gmail-dependent work afterward.

Location: cmd/msgvault/cmd/ build_cache.go:423
- Problem: The new export guard counts all rows under messages/**/*.parquet, not the rows/files produced by this build. On an incremental rebuild with stale parquet already present, a failed export can still yield a positive exportedCount, allowing the
  sync state to advance while the cache remains outdated.
- Fix: Validate output from this run specifically, for example by exporting into a fresh temp directory and swapping it in on success, or by tracking the newly written files/count instead of globbing the entire existing cache tree.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm · 2026-03-23T17:48:36Z

Acknowledged. Merging

robelkin-rational-partners and others added 5 commits March 23, 2026 10:55

Reinforce auto-commit instruction in CLAUDE.md

09a177c

Explicitly prohibit asking for commit permission and reframe committing as non-destructive to counter the system prompt's general caution about git operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wesm force-pushed the fix/db-integrity-guards branch from ccd07b9 to e488f07 Compare March 23, 2026 16:34

wesm merged commit 754f459 into wesm:main Mar 23, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guard against SQLite corruption and empty analytics cache#210

Guard against SQLite corruption and empty analytics cache#210
wesm merged 5 commits intowesm:mainfrom
robelkin:fix/db-integrity-guards

robelkin commented Mar 20, 2026

Uh oh!

roborev-ci bot commented Mar 20, 2026

Uh oh!

roborev-ci bot commented Mar 21, 2026

Uh oh!

roborev-ci bot commented Mar 22, 2026

Uh oh!

roborev-ci bot commented Mar 23, 2026

Uh oh!

wesm commented Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

robelkin commented Mar 20, 2026

Summary

Context

Test plan

Uh oh!

roborev-ci bot commented Mar 20, 2026

roborev: Combined Review (3089eb9)

Medium Severity

Uh oh!

roborev-ci bot commented Mar 21, 2026

roborev: Combined Review (0f165b7)

Medium

Uh oh!

roborev-ci bot commented Mar 22, 2026

roborev: Combined Review (ccd07b9)

Medium

Uh oh!

roborev-ci bot commented Mar 23, 2026

roborev: Combined Review (e488f07)

Medium

Uh oh!

wesm commented Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

roborev: Combined Review (`3089eb9`)

roborev: Combined Review (`0f165b7`)

roborev: Combined Review (`ccd07b9`)

roborev: Combined Review (`e488f07`)