Support Distributed Execution for BranchContext #1

Description

@congwang-mk

BranchContext is single-node. To scale parallelism (BestOfN with n=100, wide BeamSearch), we need distributed task dispatch. Most AI agents are lightweight processes making LLM API calls, and they already run as containers in Kubernetes. There is no need for Ray, Modal, or a custom gRPC layer: plain K8s Jobs are enough.

Deployment Model

COORDINATOR (pod or bare metal) WORKER PODS (K8s Jobs)
┌──────────────────────────┐ ┌──────────────────────────┐
│ BranchFS FUSE daemon │ RWX PVC │ No BranchFS daemon │
│ │ ─────────────→ │ /mnt/branchfs/ (PVC) │
│ Pattern (Speculate, etc) │ │ read/write only │
│ workspace.branch() │ │ │
│ → ioctl create │ │ branching-worker: │
│ K8s: ConfigMap(task) │ │ read task from ConfigMap│
│ K8s: create Job ───────┼── K8s API ───→ │ fn(*args) │
│ K8s: watch completion │ │ print result to stdout │
│ ← pod stdout (JSON) │ │ exit │
│ b.commit() or abort() │ │ │
│ → ioctl │ │ │
└──────────────────────────┘ └──────────────────────────┘
│ │
└──── ReadWriteMany PVC (EFS/Filestore/CephFS/NFS) ────┘

Storage-agnostic via PVC: The shared filesystem is a K8s PersistentVolumeClaim with ReadWriteMany access mode. The admin provisions it with whatever backend their cluster supports: AWS EFS, GCP Filestore, Azure Files, CephFS (Rook), or plain NFS. The executor only needs the PVC name, not storage-specific details.
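The executor can assemble the worker Job spec directly with the kubernetes Python client. The sketch below is illustrative: the worker image, the `branching-worker` entrypoint, and the `worker_job()` helper are assumptions, and only the PVC name and ConfigMap name are wired in.

```python
from kubernetes import client

def worker_job(job_name: str, image: str, pvc_name: str, configmap_name: str) -> client.V1Job:
    # Worker container: no BranchFS daemon, just the shared PVC and the task file.
    container = client.V1Container(
        name="branching-worker",
        image=image,                                   # hypothetical worker image
        command=["branching-worker"],                  # hypothetical entrypoint
        volume_mounts=[
            client.V1VolumeMount(name="branchfs", mount_path="/mnt/branchfs"),
            client.V1VolumeMount(name="task", mount_path="/etc/branching", read_only=True),
        ],
    )
    pod_spec = client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        volumes=[
            # The only storage detail needed here is the claim name; EFS/Filestore/
            # CephFS/NFS is whatever the admin bound behind the RWX PVC.
            client.V1Volume(
                name="branchfs",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name=pvc_name),
            ),
            client.V1Volume(
                name="task",
                config_map=client.V1ConfigMapVolumeSource(name=configmap_name),
            ),
        ],
    )
    return client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=0,        # fail fast; the pattern on the coordinator decides what to retry
        ),
    )
```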

Coordinator owns all branch decisions: Workers never call create/commit/abort (those are ioctls local to
the BranchFS FUSE daemon). Workers only read/write files in the branch directory via the shared PVC.
The pattern thread on the coordinator receives the worker's return value and decides commit/abort. This
is already how the code works — run_in_process() returns a value, the pattern acts on it.
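For concreteness, a BestOfN-style pattern on the coordinator might look roughly like this. The `executor.map()` dispatcher, the `score()` callable, and the `b.path` attribute are assumptions made for the sketch, not existing APIs:

```python
def best_of_n(workspace, fn, args, n, score, executor):
    """Sketch only: `executor`, `score`, and `b.path` are assumed interfaces."""
    branches = [workspace.branch() for _ in range(n)]      # ioctl create, local to the coordinator
    # One K8s Job per branch; each worker sees its branch directory via the shared PVC.
    results = executor.map(fn, [(b.path, *args) for b in branches])
    best_branch, best_score = None, float("-inf")
    for b, result in zip(branches, results):
        s = score(result)                                  # result came back as JSON via pod stdout
        if s > best_score:
            best_branch, best_score = b, s
    for b in branches:
        if b is best_branch:
            b.commit()                                     # ioctl, coordinator-local
        else:
            b.abort()                                      # ioctl, coordinator-local
    return best_branch, best_score
```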

Task input via ConfigMap: The serialized callable (cloudpickle) is stored in a K8s ConfigMap and mounted
into the worker pod at /etc/branching/task.pkl. The branch directory stays clean — no ._task.pkl
pollution. ConfigMaps are etcd-backed K8s objects with RBAC, lifecycle management, and garbage
collection via ownerReferences (auto-deleted when the owning Job is deleted). Typical agent callables (<1 KB) fit well within the 1 MiB ConfigMap size limit.
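A sketch of that packaging with the kubernetes Python client; the `task_configmap()` helper and the `(fn, args)` payload layout are assumptions, and `binary_data` values must be base64-encoded strings:

```python
import base64
import cloudpickle
from kubernetes import client

def task_configmap(name: str, fn, args, owner_job: client.V1Job) -> client.V1ConfigMap:
    # Serialize the callable and its arguments; typical agent callables are tiny.
    payload = cloudpickle.dumps((fn, args))
    return client.V1ConfigMap(
        metadata=client.V1ObjectMeta(
            name=name,
            # Owned by the Job, so deleting the Job garbage-collects the ConfigMap.
            # (One possible ordering: create the Job first so its uid is known; the
            # kubelet retries the volume mount until the ConfigMap shows up.)
            owner_references=[client.V1OwnerReference(
                api_version="batch/v1",
                kind="Job",
                name=owner_job.metadata.name,
                uid=owner_job.metadata.uid,
            )],
        ),
        # Mounted into the worker pod as /etc/branching/task.pkl.
        binary_data={"task.pkl": base64.b64encode(payload).decode("ascii")},
    )
```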

Result passing via pod stdout: the worker prints its result as JSON to stdout, and the coordinator reads it via the K8s read_namespaced_pod_log() API after the Job completes. No filesystem IPC for results and no polling: the K8s watch API is event-driven.
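Roughly, the coordinator's collection path could look like the sketch below: watch the Job until it reports success, then take the last stdout line of the Job's pod as the JSON result. The `collect_result()` helper is hypothetical.

```python
import json
from kubernetes import client, watch

def collect_result(job_name: str, namespace: str = "default"):
    batch, core = client.BatchV1Api(), client.CoreV1Api()
    # Event-driven: block on the Job watch instead of polling its status.
    w = watch.Watch()
    for event in w.stream(batch.list_namespaced_job, namespace=namespace,
                          field_selector=f"metadata.name={job_name}"):
        status = event["object"].status
        if status.succeeded:
            w.stop()
            break
        if status.failed:
            w.stop()
            raise RuntimeError(f"Job {job_name} failed")
    # The Job controller labels its pods with job-name=<job>; read that pod's log.
    pod = core.list_namespaced_pod(namespace,
                                   label_selector=f"job-name={job_name}").items[0]
    log = core.read_namespaced_pod_log(pod.metadata.name, namespace)
    return json.loads(log.strip().splitlines()[-1])     # last stdout line is the JSON result
```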

Data flow through K8s API objects (all etcd-backed):

Task callable → ConfigMap (etcd write)
Task dispatch → Job (etcd write)
Completion → Job watch (etcd watch — event-driven, no polling)
Result → Pod logs (k8s API read)
Cleanup → ownerRef (ConfigMap auto-deleted with Job)
Cancellation → Job delete (etcd delete → pod killed)
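The worker half of this flow is small. A hypothetical branching-worker entrypoint (the name comes from the diagram; the `(fn, args)` pickle layout matches the ConfigMap sketch above) would be roughly:

```python
import json
import sys
import cloudpickle

def main() -> int:
    # Task arrives via the ConfigMap mount, not via the branch directory.
    with open("/etc/branching/task.pkl", "rb") as f:
        fn, args = cloudpickle.load(f)
    # The branch's files live under /mnt/branchfs/ on the shared PVC; the worker
    # only reads and writes there and never touches branch lifecycle ioctls.
    result = fn(*args)
    print(json.dumps(result))   # assumes a JSON-serializable result; coordinator reads the pod log
    return 0

if __name__ == "__main__":
    sys.exit(main())
```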
