Skip to content

Guard against SQLite corruption and empty analytics cache#210

Merged
wesm merged 5 commits intowesm:mainfrom
robelkin:fix/db-integrity-guards
Mar 23, 2026
Merged

Guard against SQLite corruption and empty analytics cache#210
wesm merged 5 commits intowesm:mainfrom
robelkin:fix/db-integrity-guards

Conversation

@robelkin
Copy link
Contributor

Summary

  • Increase SQLite busy_timeout from 5s to 30s — prevents timeout-induced corruption on large (27GB+) databases where lock contention during batch inserts can exceed 5 seconds
  • Set synchronous=NORMAL explicitly — safe with WAL mode, avoids I/O bottlenecks that compound lock contention
  • Add WAL checkpointing on Store.Close() and after each sync completion — folds WAL back into the main database, preventing accumulation across sessions and stale WAL entries
  • Fix build-cache writing _last_sync.json before verifying Parquet files — a failed export would write the state file with the current max message ID, causing future incremental builds to skip, leaving the cache permanently empty
  • Log count query errors in build-cache instead of silently swallowing them
  • Add SQLite integrity check (PRAGMA integrity_check) to the verify command with --skip-db-check opt-out

Context

Discovered two issues on a production 27GB database:

  1. Analytics cache was permanently empty because _last_sync.json was written unconditionally even when the Parquet export produced zero files
  2. SQLite database had 100+ integrity errors (invalid page numbers, out-of-order rowids, btreeInitPage errors) with a small 1.1MB WAL — consistent with WAL never being checkpointed and busy_timeout being too short for the database size

Test plan

  • All existing tests pass (make test)
  • Verify msgvault verify <email> runs integrity check and reports results
  • Verify msgvault build-cache with a corrupted/empty source DB does not write _last_sync.json
  • Verify msgvault build-cache --full-rebuild works correctly after the guard prevents state file write

🤖 Generated with Claude Code

@roborev-ci
Copy link

roborev-ci bot commented Mar 20, 2026

roborev: Combined Review (3089eb9)

Summary Verdict: The PR introduces valuable database integrity checks, WAL checkpointing, and cache-export guarding, but requires a fix to ensure proper error propagation on cache build failures.

Medium Severity

  • Location: cmd/msgvault/cmd/build_cache.go:42 6
    • Problem: When the post-export read_parquet(...) count fails, or when a non-empty source database produces zero exported rows, buildCache now returns a nil error after printing a warning. This leaves the analytics cache empty or unusable while higher-level callers still
      observe a successful build, meaning scheduled/automated flows will not fail fast or retry.
    • Fix: Return a non-nil error (or an explicit failure status that propagates to a non-zero exit) while still skipping sync-state persistence.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Mar 21, 2026

roborev: Combined Review (0f165b7)

Summary Verdict: The PR introduces solid database resilience and safety improvements, but contains two medium-severity issues regarding error handling during
database verification and WAL checkpointing.

Medium

  • cmd/msgvault/cmd/verify.go:87

    • Problem: When runIntegrityCheck reports corruption, the command only prints the errors and continues with the rest of verification. That means verify can exit
      successfully even though the database is already known-bad, and later checks may produce misleading results on a corrupt store.
    • Fix: Return a non-nil error after printing the integrity-check findings so the command fails fast on corruption, or at minimum propagate a failing exit status at the end.

internal/store/store.go:90

  • Problem: CheckpointWAL uses Exec("PRAGMA wal_checkpoint(TRUNCATE)") and treats nil error as success, but SQLite reports checkpoint status in the pragma result row. A busy or
    incomplete checkpoint can therefore be silently accepted, leaving recent writes only in the WAL while callers assume the main DB file has been flushed.
  • Fix: Execute the pragma with QueryRow/Query and inspect the returned status columns, returning an error when SQLite reports a busy or incomplete checkpoint.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Mar 22, 2026

roborev: Combined Review (ccd07b9)

Verdict: The PR improves SQLite robustness and cache export safety, but introduces logical flaws in the cache validation and execution order of the integrity checks.

Medium

  • cmd/ msgvault/cmd/build_cache.go (around the new exportedCount guard)
    The new exportedCount > 0 check only proves that some Parquet rows exist somewhere under messages/**/*.parquet; it does not prove this build produced them. If stale Parquet from a prior run
    is still present, a failed current export can still pass this guard and incorrectly persist the new sync state.
    Fix: Export into a fresh temp/output directory and atomically replace the old one on success, or validate against the expected row count for this run instead of checking for any existing Parquet rows.

  • cmd/msgvault/cmd/verify.go (around the new integrity-check block after Gmail client setup)
    The code comments describe the integrity check as offline and independent of Gmail, but it runs only after OAuth/client initialization. If credential loading or Gmail client setup fails first
    , database corruption is never checked or reported.
    Fix: Move the SQLite integrity check to run immediately after opening the store, before any Gmail/OAuth setup.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- Increase busy_timeout from 5s to 30s for large databases
- Set synchronous=NORMAL explicitly (safe with WAL mode)
- Add WAL checkpoint on Store.Close() and after sync completion
- Fix build-cache writing _last_sync.json before verifying Parquet
  files exist (prevented cache from ever rebuilding after a failed export)
- Log count query errors in build-cache instead of silently swallowing
- Add SQLite integrity check to verify command (PRAGMA integrity_check)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The buildCache function now returns a non-nil error when the database
has messages but the Parquet export produces zero rows, so callers
(CLI, serve scheduler, TUI auto-build) can fail fast or retry instead
of silently treating an empty cache as success.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
verify: track corruption state across all checks and return a non-zero
exit code at the end when integrity errors are found.

CheckpointWAL: read the (busy, log, checkpointed) result columns from
PRAGMA wal_checkpoint instead of discarding them via Exec, so callers
get an honest error when the checkpoint is incomplete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explicitly prohibit asking for commit permission and reframe
committing as non-destructive to counter the system prompt's
general caution about git operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The early return for a missing Gmail source bypassed the final
dbCorrupt check, causing verify to exit 0 despite finding integrity
errors. Check dbCorrupt before returning nil in that branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wesm wesm force-pushed the fix/db-integrity-guards branch from ccd07b9 to e488f07 Compare March 23, 2026 16:34
@roborev-ci
Copy link

roborev-ci bot commented Mar 23, 2026

roborev: Combined Review (e488f07)

Verdict: The PR introduces valuable database integrity checks and cache validation, but the new safeguards contain logic flaws that limit their effectiveness.

Medium

  • Location: cmd/msgvault/cmd/verify.go (~
    • Problem: The new SQLite integrity check is described and commented as an offline preflight, but it still runs only after OAuth credentials are loaded and a Gmail token source/client are created. If credentials are missing/invalid, verify exits before the DB check ever runs, so corruption is
      not reported in one of the main recovery scenarios.
    • Fix: Move runIntegrityCheck(s) to immediately after opening the store, before any Gmail credential or client setup, and only do Gmail-dependent work afterward.
  • Location: cmd/msgvault/cmd/ build_cache.go:423
    • Problem: The new export guard counts all rows under messages/**/*.parquet, not the rows/files produced by this build. On an incremental rebuild with stale parquet already present, a failed export can still yield a positive exportedCount, allowing the
      sync state to advance while the cache remains outdated.
    • Fix: Validate output from this run specifically, for example by exporting into a fresh temp directory and swapping it in on success, or by tracking the newly written files/count instead of globbing the entire existing cache tree.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Copy link
Owner

wesm commented Mar 23, 2026

Acknowledged. Merging

@wesm wesm merged commit 754f459 into wesm:main Mar 23, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants