
feat(decopilot): durable agent loop with crash recovery and resume #2807

Merged

tlgimenes merged 10 commits into main from tlgimenes/agent-durability on Mar 23, 2026


Conversation


@tlgimenes tlgimenes commented Mar 22, 2026

What is this contribution about?

Agent runs previously lived entirely in-memory (RunRegistry Map), meaning a deploy, pod restart, or OOM kill terminated every active run with no way to resume. This PR persists run ownership and config to the database, enabling automatic crash recovery and manual resume.

Key changes:

  • DB migration (047-durable-agent-runs): Adds run_owner_pod, run_config (JSONB), and run_started_at columns to threads table with a partial index for orphan lookups
  • State machine: New RESUME command and RUN_RESUMED event in the decider/projector/reactor pipeline
  • Storage methods: claimOrphanedRun (atomic CAS), listOrphanedRuns (dead-pod detection via stale run_started_at), orphanRunsByPod (graceful shutdown)
  • Registry: stopAll() now orphans runs in DB before clearing in-memory state; recoverOrphanedRuns() auto-resumes automation runs on startup (concurrency cap of 5)
  • Resume endpoint: POST /:org/decopilot/resume/:threadId with ownership validation, schema drift protection, and model permission re-check
  • Crash recovery: On startup, orphaned automation runs are automatically detected and resumed with audit trail system messages
  • Save-every-step on resume: Reduces message loss window by saving on every STEP_COMPLETED instead of every 5th during resumed runs
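The atomic claim described above is a compare-and-set on run_owner_pod. A minimal in-memory sketch of that logic (the real code runs it as a single conditional SQL UPDATE; the row shape and function signature here are illustrative, not the actual implementation):

```typescript
// In-memory sketch of the claimOrphanedRun CAS. Real code performs this
// as one conditional SQL UPDATE against the threads table.
interface ThreadRow {
  id: string;
  status: "in_progress" | "completed" | "failed";
  run_owner_pod: string | null;
}

// A claim succeeds only for in-progress runs that are unowned (graceful
// shutdown nulls run_owner_pod) or owned by a pod known to be dead.
function claimOrphanedRun(
  row: ThreadRow,
  claimingPod: string,
  deadPod?: string,
): boolean {
  const orphaned =
    row.run_owner_pod === null ||
    (deadPod !== undefined && row.run_owner_pod === deadPod);
  if (row.status !== "in_progress" || !orphaned) return false;
  row.run_owner_pod = claimingPod; // the "set" half of compare-and-set
  return true;
}
```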

How to Test

  1. Run bun run --cwd=apps/mesh migrate to apply the new migration
  2. Start a Decopilot agent run, then kill the server process
  3. Restart the server — orphaned automation runs should auto-resume within 10 seconds
  4. For interactive runs, call POST /:org/decopilot/resume/:threadId to manually resume
  5. Run bun test — 1248 pass, 0 new failures
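The save-every-step behavior from the change list above comes down to a cadence check; a sketch under assumed names (SAVE_INTERVAL and shouldPersistStep are illustrative, not the actual identifiers):

```typescript
// Sketch of the persistence cadence: normal runs save every 5th completed
// step; resumed runs save on every STEP_COMPLETED to shrink the
// message-loss window. Names are illustrative.
const SAVE_INTERVAL = 5;

function shouldPersistStep(stepIndex: number, isResumedRun: boolean): boolean {
  if (isResumedRun) return true; // resume path: persist every step
  return stepIndex % SAVE_INTERVAL === 0; // normal path: every 5th step
}
```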

Migration Notes

  • New migration 047-durable-agent-runs adds three nullable columns to threads table — no data backfill needed
  • Partial index idx_threads_orphaned_runs on (status, run_owner_pod) WHERE status = 'in_progress' for efficient orphan lookups
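Based on the notes above, the migration's DDL would look roughly like the following (illustrative only; the actual migration file in apps/mesh may differ in layout and naming):

```typescript
// Illustrative DDL matching the migration notes: three nullable columns,
// no backfill, plus a partial index so orphan lookups only scan
// in-progress runs.
const up = `
  ALTER TABLE threads
    ADD COLUMN run_owner_pod  text,
    ADD COLUMN run_config     jsonb,
    ADD COLUMN run_started_at timestamptz;

  CREATE INDEX idx_threads_orphaned_runs
    ON threads (status, run_owner_pod)
    WHERE status = 'in_progress';
`;
```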

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • No breaking changes
  • Comprehensive test coverage for new state machine transitions, storage methods, registry behavior, and schema validation

Summary by cubic

Make Decopilot runs durable with crash recovery, pod‑death handoff, and transparent resume by persisting run ownership/config, adding DB CAS on start, and per‑pod NATS KV heartbeats. Runs auto‑resume on startup and when a pod dies; the client auto‑attaches and resumes without manual steps.

  • New Features

    • DB/storage: added run_owner_pod, run_config (JSONB), run_started_at and partial index; CAS helpers claimRunStart, claimOrphanedRun (NULL or stale owner), listOrphanedRuns/listOrphanedRunsByPod, orphanRunsByPod.
    • Engine/registry: new RESUME command and RUN_RESUMED event; RUN_STARTED uses CAS (persisting run_config, run_owner_pod, run_started_at); reactor clears run columns on terminal states; RunRegistry.stopAll() first orphans in DB, then aborts; recoverOrphanedRuns() resumes both automation and interactive runs (cap 5); start conflicts return 409.
    • Pod death recovery: per‑pod NatsPodHeartbeat detects dead pods and handlePodDeath claims/resumes their runs; stable POD_ID from POD_NAME (or a UUID).
    • Resume: GET /:org/decopilot/attach/:threadId replays if running locally, or CAS‑claims orphans and resumes using stored run_config; validates schema and current model permissions (403 on deny); previous POST resume endpoint removed; on resume, messages save on every step to reduce loss.
    • UI: show “Resuming task...” in the assistant placeholder while attach resumes; removed the “Run in progress” label in chat input.
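The pod-death detection described above can be sketched as a staleness check over heartbeat timestamps (a plain Map stands in for the NATS KV bucket, and the 45-second window comes from the commit messages in this thread; the real NatsPodHeartbeat uses a KV watcher rather than polling):

```typescript
// Sketch of dead-pod detection via heartbeat timestamps. The real
// implementation watches a NATS KV bucket; here a Map of
// podId -> last heartbeat (epoch ms) stands in.
const HEARTBEAT_TTL_MS = 45_000;

function findDeadPods(
  lastSeen: Map<string, number>,
  nowMs: number,
): string[] {
  const dead: string[] = [];
  for (const [podId, seenAt] of lastSeen) {
    if (nowMs - seenAt > HEARTBEAT_TTL_MS) dead.push(podId);
  }
  return dead;
}
```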
  • Migration

    • Run apps/mesh migration 050-durable-agent-runs.
    • No backfill; creates partial index idx_threads_run_owner for efficient orphan lookups.

Written for commit 8981c5a. Summary will update on new commits.

@github-actions

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction | Action
👍 | Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions

github-actions bot commented Mar 22, 2026

Release Options

Should a new version be published when this PR is merged?

React with an emoji to vote on the release type:

Reaction | Type | Next Version
👍 | Prerelease | 2.190.6-alpha.1
🎉 | Patch | 2.190.6
❤️ | Minor | 2.191.0
🚀 | Major | 3.0.0

Current version: 2.190.5

Deployment

  • Deploy to production (triggers ArgoCD sync after Docker image is published)


@cubic-dev-ai cubic-dev-ai bot left a comment


9 issues found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/routes/decopilot/run-reactor.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-reactor.ts:120">
P1: Clear the ghost-run metadata in the same conditional update. As written, a new run that starts between these two queries can have its fresh ownership/config fields wiped out.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/automation-context.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/automation-context.ts:69">
P1: Load and pass custom-role permissions into this background auth client. With empty headers, resumed/manual automation runs for non-built-in roles will fail authorization.</violation>
</file>

<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:793">
P1: Track and clear this delayed recovery timer during HMR/shutdown. Otherwise a previous `createApp` instance can still fire orphan recovery after its Decopilot resources were already cleaned up.

(Based on your team's feedback about tracking long-lived createApp resources during HMR/shutdown cleanup.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/run-registry.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-registry.ts:133">
P1: If orphaning the DB rows fails, this method still clears local state and strands the run as `in_progress` with the old owner.</violation>

<violation number="2" location="apps/mesh/src/api/routes/decopilot/run-registry.ts:153">
P2: This startup recovery only processes one page of orphaned runs, so anything after the first 100 is never auto-resumed.</violation>
</file>

<file name="apps/mesh/src/core/pod-identity.ts">

<violation number="1" location="apps/mesh/src/core/pod-identity.ts:5">
P2: `HOSTNAME` is being treated as a pod id in every environment, so non-Kubernetes processes can share the same `POD_ID` instead of getting a per-process UUID.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/stream-core.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/stream-core.ts:195">
P1: Dispatch the resume state before the async setup or explicitly release the DB claim on setup failures. A failed resume attempt can otherwise strand the thread as claimed-but-not-running.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/run-config.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-config.ts:14">
P2: Validate persisted `capabilities` and `limits` with the same typed shape as the request schema; the current `z.record(..., z.unknown())` lets malformed run configs bypass resume-time schema checks.</violation>
</file>

<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:451">
P1: Allow `claimOrphanedRun()` to reclaim stale dead-pod runs. With the current null-only CAS, threads found via stale `run_started_at` can never be resumed after an ungraceful pod crash.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@tlgimenes tlgimenes force-pushed the tlgimenes/agent-durability branch from 1beda3e to 3725764 Compare March 22, 2026 23:55

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 5 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:456">
P1: Using `run_started_at` as the stale-claim cutoff can steal still-running jobs after 15 minutes.</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:451">
P1: This startup recovery query now skips runs stranded with a stale `run_owner_pod`, so abnormal pod deaths leave some threads stuck in `in_progress` forever.

(Based on your team's feedback about avoiding stale-timestamp ownership stealing.) [FEEDBACK_USED]</violation>
</file>


@tlgimenes tlgimenes force-pushed the tlgimenes/agent-durability branch from 96adaac to 8aa9eb9 Compare March 23, 2026 13:39

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).


<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:311">
P2: Don't turn a model-permission failure into `204 No Content`; the reconnect path will treat it like an empty attach and the resume failure becomes silent.</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:460">
P1: Do not claim runs that already have a non-null `run_owner_pod`; this lets another pod steal an active run during attach/resume.

(Based on your team's feedback about using only `run_owner_pod IS NULL` for claim/ownership CAS logic.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:368">
P0: Don't reclaim runs with a non-null `run_owner_pod` here; this can steal a still-running job from another healthy pod and start a second resume.

(Based on your team's feedback about using only `run_owner_pod IS NULL` for run-claim CAS logic.) [FEEDBACK_USED]</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 5 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:455">
P0: This claim now steals runs from other pods instead of only claiming true orphans.

(Based on your team's feedback about using only `run_owner_pod IS NULL` for run ownership CAS.) [FEEDBACK_USED]</violation>

<violation number="2" location="apps/mesh/src/storage/threads.ts:468">
P0: This orphan query now includes runs owned by other healthy pods, so recovery can resume the same thread concurrently on multiple replicas.

(Based on your team's feedback about using only `run_owner_pod IS NULL` to detect orphaned runs.) [FEEDBACK_USED]</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


6 issues found across 9 files (changes from recent commits).


<file name="apps/mesh/src/api/routes/decopilot/run-reactor.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-reactor.ts:89">
P1: Throwing here leaves a phantom local `running` entry when the DB claim is lost.</violation>
</file>

<file name="apps/mesh/src/nats/pod-heartbeat.ts">

<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:52">
P2: Await the initial heartbeat write so startup cannot continue without registering the pod key.

(Based on your team's feedback about treating JetStream startup paths as fail-fast dependencies.) [FEEDBACK_USED]</violation>

<violation number="2" location="apps/mesh/src/nats/pod-heartbeat.ts:130">
P2: Also close the active `kv.watch()` iterator in `stop()`. Aborting the local signal alone leaves the watcher pending until another KV event arrives.

(Based on your team's feedback about cleaning up long-lived resources during app recreation/shutdown.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:336">
P1: Stopping the heartbeat before aborting/orphaning local runs lets another pod resume the same thread while this pod is still running it.</violation>
</file>

<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:488">
P2: Dead-pod recovery only fetches the first 100 orphaned runs, leaving additional runs unrecovered.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/run-registry.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-registry.ts:204">
P1: Broadcasting `CANCEL` here can turn a graceful handoff into a permanent `failed` run. Other pods see the heartbeat delete before the old pod has orphaned its DB rows, so the old pod still handles the cancel and persists `status = 'failed'` before this recovery path can claim it.</violation>
</file>


tlgimenes and others added 10 commits March 23, 2026 15:02
Persist agent run ownership and config to DB so runs survive pod restarts,
deploys, and OOM kills. Orphaned runs are automatically resumed on startup
with a concurrency cap of 5. Users can also manually resume via a new
POST /:org/decopilot/resume/:threadId endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server crashes shouldn't punish users — every in-progress run (both
automation and interactive) is now auto-resumed on startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Load custom-role permissions in automation context factory
- Allow claimOrphanedRun to reclaim stale dead-pod runs (15min threshold)
- Use POD_NAME env var instead of HOSTNAME to avoid collisions in non-K8s
- Add typed validation for capabilities/limits in persisted run config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 15-minute threshold could steal still-running jobs from other pods.
Revert to null-only CAS — graceful shutdown already nulls run_owner_pod.
SIGKILL/OOM recovery should use a separate long-interval reaper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused POST /resume endpoint and move orphan detection into
GET /attach/:threadId so the client auto-resumes crashed runs without
any client-side changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crashed pods never clear run_owner_pod, so orphan detection failed for
hard crashes. Add claimStaleRun CAS for non-null stale pods alongside
the existing null-pod path.

UI: remove "Run in progress" label from input, show "Resuming task..."
(animated, like Thinking) instead of "No response was generated" while
the attach endpoint resumes the orphaned run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n path

Returning 204 made model-permission failures silent. Now throws 403
consistent with the /stream endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate method

Remove claimStaleRun — instead widen claimOrphanedRun CAS to match both
NULL and stale (different-pod) run_owner_pod. Also widen
listOrphanedRuns to find crash-orphaned runs (non-null owner != current
pod) so startup recovery handles them.

This avoids the race where a separate claimStaleRun in the attach path
could steal a run from a healthy pod in multi-pod deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents concurrent execution via claimRunStart CAS and detects pod death
within 45s via NATS KV heartbeat watcher for automatic orphan recovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main added 049-remove-org-admin-projects; rename ours to avoid collision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tlgimenes tlgimenes force-pushed the tlgimenes/agent-durability branch from 037e716 to 8981c5a Compare March 23, 2026 18:03
@tlgimenes tlgimenes merged commit 9600edc into main Mar 23, 2026
14 checks passed
@tlgimenes tlgimenes deleted the tlgimenes/agent-durability branch March 23, 2026 18:15