
fix: include vectors in backup and fix Float32Array serialization#145

Open
baizenghu wants to merge 2 commits into CortexReach:main from baizenghu:feat/backup-include-vectors

Conversation

@baizenghu

Problem

The memory_backup tool produces JSONL files with empty vector arrays ("vector": []). When restoring from backup, all memories must be re-embedded, which is slow and costly — especially for large memory stores.

Root Cause

LanceDB returns vectors as Float32Array (a typed array), but the backup code used Array.isArray() to check vectors before serialization. Since Array.isArray(new Float32Array(...)) returns false, vectors silently fell through to the empty array default.
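The mismatch is easy to reproduce in isolation:

```typescript
// Float32Array is a TypedArray, not an Array subclass, so the
// Array.isArray check rejects it and the vector falls through to [].
const vec = new Float32Array([0.1, 0.2, 0.3]);

console.log(Array.isArray(vec));     // false
console.log(Array.from(vec).length); // 3 — Array.from accepts any iterable
```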

Solution

1. Add includeVectors option to store.list() (src/store.ts)

  • New optional parameter includeVectors = false (backward compatible)
  • When true, adds "vector" to the LanceDB select columns
  • Converts Float32Array to a plain array via Array.from() for proper JSON serialization

2. Include vectors in backup with metadata header (index.ts)

  • Backup now calls store.list(..., true) to include vectors
  • The first line of the JSONL file is a metadata header with the embedding model name, dimensions, and timestamp
  • Enables restore to skip re-embedding when the model matches
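A minimal sketch of the two pieces above. The names `toJsonVector` and `buildBackupJsonl` are illustrative, not the PR's actual identifiers; the `_meta` fields follow the commit message's description (version, model, dimensions, timestamp):

```typescript
// Step 1: convert whatever LanceDB hands back into a JSON-safe number[].
// LanceDB returns Float32Array, so Array.from (not a spread guarded by
// Array.isArray) is the safe conversion; null/undefined vectors map to [].
function toJsonVector(raw: unknown, includeVectors: boolean): number[] {
  return includeVectors && raw ? Array.from(raw as Iterable<number>) : [];
}

// Step 2: the first JSONL line is a _meta header; every later line is a memory.
interface BackupRow {
  id: string;
  text: string;
  vector: unknown; // Float32Array as returned by LanceDB
}

function buildBackupJsonl(rows: BackupRow[], model: string, dimensions: number): string {
  const metaLine = JSON.stringify({
    _meta: { version: 1, model, dimensions, timestamp: new Date().toISOString() },
  });
  const dataLines = rows.map((r) =>
    JSON.stringify({ id: r.id, text: r.text, vector: toJsonVector(r.vector, true) }),
  );
  return [metaLine, ...dataLines].join("\n") + "\n";
}
```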

Bug Fix Details

// Before (broken): Array.isArray(new Float32Array(...)) === false
vector: Array.isArray(row.vector) ? [...row.vector] : []

// After (fixed): Array.from works with any iterable
vector: includeVectors && row.vector
  ? Array.from(row.vector as Iterable<number>)
  : []

Testing

Verified backup now produces correct vector data (non-empty arrays with proper float values). Restore with matching embedding model skips re-embedding entirely.

baizenghu and others added 2 commits March 9, 2026 23:52
…up JSONL

Backup JSONL now contains vector data so restores work without re-embedding
(critical for intranet environments where embedding API may be unavailable).
Backup files include a _meta header line with model/dimensions/timestamp.
list() gains an optional includeVectors parameter (default false) to keep
existing callers unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LanceDB returns Float32Array (not plain Array), so Array.isArray()
returns false. Use Array.from() to correctly convert typed arrays
when includeVectors is true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AliceLJY
Collaborator

Review: fix: include vectors in backup and fix Float32Array serialization

Verdict: Fix-then-merge — the same missing test file issue blocks this one too.


✅ What's working

  • The two-commit approach is self-aware: commit 1 introduces the feature with an Array.isArray bug, and commit 2 immediately fixes it with Array.from. Good catch
  • Array.from(row.vector as Iterable<number>) correctly handles LanceDB's Float32Array return type — Array.isArray(Float32Array) is false, but Array.from works on any iterable ✅
  • row.vector truthy guard prevents null/undefined crash if an entry has no vector ✅
  • includeVectors = false default keeps all existing list() callers completely unaffected ✅
  • _meta header design (with version, model, dimensions, timestamp) is forward-thinking — future restore code can distinguish memory lines from the metadata line via "_meta" in parsedLine
  • The motivation is real: intranet environments can't call embedding APIs for restore — shipping vectors in backup is the right call
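A hypothetical restore-side parse of that layout (restore is not part of this PR, and `parseBackupJsonl` is an illustrative name) would split the header from the memory lines exactly as the `"_meta" in parsedLine` bullet suggests:

```typescript
// Hypothetical restore parse: separate the _meta header from memory lines
// and decide whether the stored vectors can be reused.
function parseBackupJsonl(jsonl: string, currentModel: string) {
  const parsed = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  const meta = parsed.find((p) => "_meta" in p)?._meta;
  const memories = parsed.filter((p) => !("_meta" in p));
  // Vectors from the backup are reusable only when the embedding model matches.
  return { meta, memories, reuseVectors: meta?.model === currentModel };
}
```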

🔴 Blocking

No test file. includeVectors path has zero test coverage. There's no test that verifies:

  1. Array.from(new Float32Array([...])) produces a non-empty number[] (regression test for the commit-1 bug)
  2. list(..., false) still returns vector: [] (backward compat regression)
  3. row.vector = null returns vector: [] without crashing

Suggested test/backup-vector-serialization.test.mjs:

// Simulate Float32Array as LanceDB would return
const fa = new Float32Array([0.1, 0.2, 0.3]);
assert.ok(Array.isArray(Array.from(fa)));          // Array.from works
assert.ok(!Array.isArray(fa));                     // sanity: raw Float32Array is not Array
assert.equal(Array.from(fa).length, 3);

// Null guard — mirrors `includeVectors && row.vector ? ... : []`
const row = { vector: null };
const nullResult = row.vector ? Array.from(row.vector) : [];
assert.deepEqual(nullResult, []);

Wire it into package.json test script (same issue as #148, #141, #144).

🟡 Suggested before merge

Backup size can grow dramatically without warning. At 1024 dimensions (float32), each memory entry adds ~4 KB of vector data. 1000 entries → backup grows from ~200 KB to ~4 MB; at the 10,000-entry cap → ~40 MB per file, 7 rotations → ~280 MB total. The success log already prints the entry count — adding a size estimate would help operators catch unexpected growth:

const backupBytes = Buffer.byteLength(metaLine + "\n" + dataLines.join("\n") + "\n");
api.logger.debug(`backup: ${allMemories.length} entries, ${(backupBytes/1024/1024).toFixed(1)} MB → ${backupFile}`);

⚪ Non-blocking

  • _meta format change is undocumented — old backups (no _meta line) and new backups coexist silently. When restore functionality is added, it'll need to handle both. A note in CHANGELOG or README would prevent future confusion
  • The Float32Array → number[] conversion introduces float64 round-trip (e.g., 0.1 in float32 becomes 0.10000000149011612). This is expected and harmless for cosine similarity
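The round-trip in the last bullet is easy to see directly:

```typescript
// 0.1 has no exact float32 representation; reading the stored value back
// as a float64 number exposes the nearest float32, not 0.1 itself.
const stored = new Float32Array([0.1])[0];

console.log(stored);                        // 0.10000000149011612
console.log(Math.abs(stored - 0.1) < 1e-7); // true — negligible for cosine similarity
```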

Clean fix for a real offline-restore problem. The blocking issue is solely the missing test coverage — the logic itself is correct.

@win4r
Collaborator

win4r commented Mar 10, 2026

@claude

@claude

claude bot commented Mar 10, 2026

Claude Code is working…

I'll analyze this and get back to you.


@win4r
Collaborator

win4r commented Mar 10, 2026

@codex review

