splits the job into one per batch #41

Merged
techiejd merged 12 commits into main from multiple_jobs_for_multiple_batches
Feb 25, 2026

Conversation

@techiejd
Owner

No description provided.

@techiejd techiejd self-assigned this Feb 20, 2026
@techiejd techiejd added the enhancement New feature or request label Feb 20, 2026
@techiejd techiejd linked an issue Feb 20, 2026 that may be closed by this pull request
techiejd and others added 11 commits February 21, 2026 23:06
- Remove 30s waitUntil delay from per-batch task re-queue (was causing
  test timeouts since the original code had no such delay)
- Add failedChunkData JSON field to batch collection so per-batch tasks
  can store chunk-level failure data independently
- Aggregate failedChunkData from batch records in finalizeRunIfComplete()
  instead of relying on in-memory accumulation from the old single-task flow
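
A minimal sketch of the aggregation idea described above, not the plugin's actual code. The PR only states that batch records carry a `failedChunkData` JSON field; the `FailedChunk` and `BatchRecord` shapes and the `aggregateFailedChunkData` name are assumptions made for illustration.

```typescript
// Hypothetical shapes: only the failedChunkData field name comes from the PR.
type FailedChunk = { documentId: string; chunkIndex: number };
type BatchRecord = { id: string; failedChunkData?: FailedChunk[] | null };

// Collect chunk-level failures from persisted batch records rather than
// from state accumulated in memory by a single long-running task.
function aggregateFailedChunkData(batches: BatchRecord[]): FailedChunk[] {
  const failed: FailedChunk[] = [];
  for (const batch of batches) {
    // Batches without failures may leave the field unset or null.
    if (batch.failedChunkData) {
      failed.push(...batch.failedChunkData);
    }
  }
  return failed;
}
```

Because each per-batch task writes its own failures to its own record, the finalize step can likely run in any task without depending on what earlier tasks held in memory.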

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rker architecture

Splits prepare-bulk-embedding into coordinator + per-collection workers.
Each worker processes one page of one collection, queuing a continuation
job before processing to ensure crash safety. Default batchLimit is 1000
when not explicitly set.
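
The coordinator/worker flow can be sketched with an in-memory queue. This is an illustration under stated assumptions, not the plugin's implementation: `WorkerJob`, `ProcessedPage`, and `runBulkPrepare` are hypothetical names, and a plain array stands in for the real job queue. Only the default of 1000 and the "queue the continuation before processing" ordering come from the commit message.

```typescript
// Hypothetical shapes; the plugin's real task inputs are not shown in this PR.
type WorkerJob = { collection: string; page: number };
type ProcessedPage = WorkerJob & { count: number };

const DEFAULT_BATCH_LIMIT = 1000; // default stated in the commit message

function runBulkPrepare(
  docsByCollection: Record<string, string[]>,
  batchLimit: number = DEFAULT_BATCH_LIMIT,
): ProcessedPage[] {
  if (batchLimit < 1) throw new Error("batchLimit must be at least 1");

  const queue: WorkerJob[] = []; // in-memory stand-in for the job queue
  const processed: ProcessedPage[] = [];

  // Coordinator: one initial worker job per collection.
  for (const collection of Object.keys(docsByCollection)) {
    queue.push({ collection, page: 1 });
  }

  // Worker loop: each job handles exactly one page of one collection.
  let job = queue.shift();
  while (job !== undefined) {
    const docs = docsByCollection[job.collection] ?? [];
    const start = (job.page - 1) * batchLimit;
    const pageDocs = docs.slice(start, start + batchLimit);

    // Queue the continuation before processing, so a crash while
    // processing this page cannot lose the remaining pages.
    if (start + batchLimit < docs.length) {
      queue.push({ collection: job.collection, page: job.page + 1 });
    }

    processed.push({ ...job, count: pageDocs.length });
    job = queue.shift();
  }
  return processed;
}
```

With 2500 documents in one collection and a `batchLimit` of 1000, this sketch runs three worker jobs for that collection (pages 1, 2, and 3).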

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The second test was creating a separate Payload instance sharing the same
DB and job queues, causing two crons to compete for jobs. This led to
double-execution and mock state inconsistency (expected 4 to be 2).
Now both tests use the single beforeAll instance with cleanup between.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Every test file that creates a Payload instance now calls
payload.destroy() in afterAll (or try/finally for in-test instances).
This stops background cron jobs from accumulating across tests, which
was causing heap exhaustion in CI.
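
For the in-test case, the try/finally pattern might look like the sketch below. `withInstance` and the `Destroyable` type are hypothetical helpers introduced for illustration; the only thing taken from the commit message is that the instance exposes `destroy()` and that it should run even when the test body throws.

```typescript
// Hypothetical minimal interface: anything with an async destroy() method.
type Destroyable = { destroy: () => Promise<void> };

// Create an instance, run the test body, and always destroy the instance,
// so background work (such as cron jobs) does not outlive the test.
async function withInstance<T extends Destroyable, R>(
  create: () => Promise<T>,
  use: (instance: T) => Promise<R>,
): Promise<R> {
  const instance = await create();
  try {
    return await use(instance);
  } finally {
    // Runs on success and on failure alike.
    await instance.destroy();
  }
}
```

For instances shared across a file, the same cleanup would sit in `afterAll` instead, as the commit message describes.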

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --max-old-space-size=8192 to test:int NODE_OPTIONS (cross-env was
  overriding the CI env var, so the heap limit never took effect)
- Fix polling.spec.ts queueSpy assertions: coordinator/worker adds an
  extra queue call, so poll-or-complete-single-batch is now call 3 and 4
  instead of 2 and 3
- Add extensive [vectorize-debug] console.log throughout task handlers
  (coordinator, worker, poll-single, finalize, streamAndBatchDocs) to
  diagnose any remaining CI hangs
- Remove redundant NODE_OPTIONS from CI workflow (now in the script)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ncrementally

Remove the backward-compatible fan-out task since the per-batch architecture
hasn't been released yet. Refactor finalizeRunIfComplete to aggregate batch
counts incrementally during pagination instead of collecting all batch objects
into memory.
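
A rough sketch of incremental aggregation during pagination, assuming a page-based fetch that reports whether another page exists. `BatchStatus`, `BatchPage`, and `countBatchesIncrementally` are hypothetical names, the status values are assumptions, and the fetch is synchronous here only to keep the example small (a real database call would be asynchronous).

```typescript
// Hypothetical status values and page shape, for illustration only.
type BatchStatus = "queued" | "processing" | "succeeded" | "failed";
type BatchPage = { docs: Array<{ status: BatchStatus }>; hasNextPage: boolean };

// Fold each page into running counters and discard it, instead of
// collecting every batch object before counting.
function countBatchesIncrementally(
  fetchPage: (page: number) => BatchPage,
): Record<BatchStatus, number> {
  const counts: Record<BatchStatus, number> = {
    queued: 0,
    processing: 0,
    succeeded: 0,
    failed: 0,
  };
  let page = 1;
  for (;;) {
    const result = fetchPage(page);
    for (const batch of result.docs) {
      counts[batch.status] += 1;
    }
    if (!result.hasNextPage) break;
    page += 1;
  }
  return counts;
}
```

Memory use then depends on the page size rather than on the total number of batches in the run.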

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Bump version 0.5.4 → 0.5.5
- Add 0.5.5 entry to CHANGELOG.md (coordinator/worker, batchLimit, per-batch polling)
- Document batchLimit in README CollectionVectorizeOption section
- Remove all diagnostic console.log statements from bulkEmbedAll.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@techiejd techiejd merged commit 82ac6a1 into main Feb 25, 2026
2 checks passed
techiejd added a commit that referenced this pull request Feb 26, 2026
* adds should embed (#38)

* adds should embed

* Ups version to get ready for release

* splits the job into one per batch (#41)

* splits the job into one per batch

* fix: remove waitUntil delay and persist failedChunkData on batch records

- Remove 30s waitUntil delay from per-batch task re-queue (was causing
  test timeouts since the original code had no such delay)
- Add failedChunkData JSON field to batch collection so per-batch tasks
  can store chunk-level failure data independently
- Aggregate failedChunkData from batch records in finalizeRunIfComplete()
  instead of relying on in-memory accumulation from the old single-task flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add batchLimit to CollectionVectorizeOption with coordinator/worker architecture

Splits prepare-bulk-embedding into coordinator + per-collection workers.
Each worker processes one page of one collection, queuing a continuation
job before processing to ensure crash safety. Default batchLimit is 1000
when not explicitly set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rewrite batchLimit test 2 to reuse same Payload instance

The second test was creating a separate Payload instance sharing the same
DB and job queues, causing two crons to compete for jobs. This led to
double-execution and mock state inconsistency (expected 4 to be 2).
Now both tests use the single beforeAll instance with cleanup between.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add payload.destroy() in afterAll to prevent OOM from leaked crons

Every test file that creates a Payload instance now calls
payload.destroy() in afterAll (or try/finally for in-test instances).
This stops background cron jobs from accumulating across tests, which
was causing heap exhaustion in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Trying to not destroy our heap

* Runs tests in parallel now that each test gets its own db

* WIP

* fix: fix OOM, polling test assertions, and add diagnostic logging

- Add --max-old-space-size=8192 to test:int NODE_OPTIONS (cross-env was
  overriding the CI env var, so the heap limit never took effect)
- Fix polling.spec.ts queueSpy assertions: coordinator/worker adds an
  extra queue call, so poll-or-complete-single-batch is now call 3 and 4
  instead of 2 and 3
- Add extensive [vectorize-debug] console.log throughout task handlers
  (coordinator, worker, poll-single, finalize, streamAndBatchDocs) to
  diagnose any remaining CI hangs
- Remove redundant NODE_OPTIONS from CI workflow (now in the script)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove poll-or-complete-bulk-embedding task and aggregate incrementally

Remove the backward-compatible fan-out task since the per-batch architecture
hasn't been released yet. Refactor finalizeRunIfComplete to aggregate batch
counts incrementally during pagination instead of collecting all batch objects
into memory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump to 0.5.5, update changelog, remove debug logging

- Bump version 0.5.4 → 0.5.5
- Add 0.5.5 entry to CHANGELOG.md (coordinator/worker, batchLimit, per-batch polling)
- Document batchLimit in README CollectionVectorizeOption section
- Remove all diagnostic console.log statements from bulkEmbedAll.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adds upgrade note

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version to 0.6.0-beta.5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve 4 CI test failures from merge

- chunkers.spec.ts: remove getPayload() call that crashes on dummy db,
  pass SanitizedConfig directly to chunkRichText
- batchLimit.spec.ts: add missing dbAdapter (createMockAdapter) required
  by split_db_adapter architecture
- extensionFieldsVectorSearch.spec.ts: pass adapter as second arg to
  createVectorSearchHandlers (new signature from split_db_adapter)
- versionBump.spec.ts: destroy payload0 before creating payload1 to
  prevent cron worker race condition between two instances

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Cleans a nit double line

* Undoes a weird test fix done by the bot

* fix: harden versionBump test with sequential steps and queue isolation

- Use test.step() to enforce sequential execution of each phase
- Add separate realtimeQueueName per payload instance to prevent
  cron worker cross-talk on the default queue
- Use dynamic Date.now() keys to avoid cached state interference
- Increase waitForBulkJobs timeout to 30s for CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent waitForBulkJobs from returning prematurely

waitForBulkJobs could return early in the coordinator/worker fan-out
pattern when there's a brief window with 0 pending jobs between job
transitions. Now it also checks the bulk embeddings run status — only
returns when both no pending jobs exist AND no runs are in queued/running
state.
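
The completion check described above can be sketched as a pure predicate. `isBulkWorkSettled` and the `RunStatus` type are hypothetical; the queued and running states come from the commit message, while the other status values are assumptions. The polling loop around it is omitted.

```typescript
// "queued" and "running" are named in the commit message; the terminal
// states listed here are assumptions for illustration.
type RunStatus = "queued" | "running" | "succeeded" | "failed";

// True only when both conditions hold: no pending jobs AND no run that is
// still queued or running. Checking jobs alone can report "done" during
// the brief gap between one job finishing and the next being queued.
function isBulkWorkSettled(pendingJobCount: number, runStatuses: RunStatus[]): boolean {
  const hasActiveRun = runStatuses.some(
    (status) => status === "queued" || status === "running",
  );
  return pendingJobCount === 0 && !hasActiveRun;
}
```

For example, `isBulkWorkSettled(0, ["running"])` is false: the queue is momentarily empty, but the run has not finished.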

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove test.step() — not available in Vitest

test.step() is a Playwright API, not Vitest. Reverted to flat
sequential code with phase comments for readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rewrite versionBump test with single Payload instance

Instead of creating two Payload instances (which caused cron cross-talk,
timeout, and queue isolation issues on CI), use one instance and mutate
the knowledgePools config version between bulk embed runs. Tests the
same code path (versionMismatch in streamAndBatchDocs) without the
multi-instance fragility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
techiejd added a commit that referenced this pull request Mar 5, 2026
Resolve conflicts by keeping our branch's versions, which are the
superset (adapter architecture + shouldEmbed + batchLimit already
integrated). Main's #38 and #41 implemented the same features
without the adapter layer.
techiejd added a commit that referenced this pull request Mar 5, 2026
* WIP

WIP

WIP

WIP

Uses mock adapter

WIP

WIP

WIP

WIP

WIP

WIP

* fix: ignore node_modules everywhere

* Adds split_db_adapter to CI run

* feat(cf-adapter): add Cloudflare Vectorize adapter (#28)

* feat(cf-adapter): add Cloudflare Vectorize adapter

* feat(cf-adapter): enhance Cloudflare Vectorize integration with config-based bindings and add tests

* feat(cf-adapter): refactor Cloudflare Vectorize integration to use config-based bindings and update tests

* chore: update pnpm-lock.yaml

* Preparing for automated publishes. This one beta will be done by hand, but hopefully the rest will be done automatically

* Bumps version since we added deleteEmbeddings. Also runs tsc so that we can be sure the whole project compiles

* Adds the type check to ci

* fixes type check

* removes silly double checking on split_db_adapter for push

* Adds root pnpm workspace

* feat(cf-adapter): update query parameters and method for deleting embeddings in Cloudflare Vectorize integration (#31)

* Bumps version to release

* Better typings (#34)

* Adds better id tracking for deletion and does only one search instead of many for querying (#35)

* Deduplicate shared logic across plugin and adapter packages (#36)

* Deduplicate shared logic across plugin and adapter packages

Extract repeated production patterns (chunk validation, delete embeddings,
task slug constants) into shared utilities exported from the root plugin.
Consolidate test helpers via vitest path aliases so adapter tests import
from the canonical root dev/ copies. Remove CF adapter dead test code
(unused utils, constants, helpers). Fix chunkRichText join bug in CF
adapter tests (was joining child nodes without spaces).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cf adapter limitation acknowledgement and more DRY

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Removes dead code (#37)

* Bumps version for rollout

* Merge main (#40)

* adds should embed (#38)

* adds should embed - merged

* Merge main into split_db_adapter (beta.5) (#42)


* chore: update beta version

* Remove Cloudflare adapter to unblock main branch merge

Split the CF adapter work out so the core DbAdapter API and pg adapter
can be merged to main independently. The CF adapter will continue in
a separate branch.

* docs: call for help adding more database adapters

* chore: bump version to 0.7.0

---------

Co-authored-by: dejan-velimirovic-calendly <dejan.velimirovic@calendly.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor Bulk Embedding Job Creation to Use Per-Batch Jobs

2 participants