
feat(decopilot): durable agent loop with crash recovery and resume #2807

Merged

tlgimenes merged 10 commits into main from tlgimenes/agent-durability on Mar 23, 2026


Conversation


@tlgimenes tlgimenes commented Mar 22, 2026

What is this contribution about?

Agent runs previously lived entirely in-memory (RunRegistry Map), meaning a deploy, pod restart, or OOM kill terminated every active run with no way to resume. This PR persists run ownership and config to the database, enabling automatic crash recovery and manual resume.

Key changes:

  • DB migration (047-durable-agent-runs): Adds run_owner_pod, run_config (JSONB), and run_started_at columns to threads table with a partial index for orphan lookups
  • State machine: New RESUME command and RUN_RESUMED event in the decider/projector/reactor pipeline
  • Storage methods: claimOrphanedRun (atomic CAS), listOrphanedRuns (dead-pod detection via stale run_started_at), orphanRunsByPod (graceful shutdown)
  • Registry: stopAll() now orphans runs in DB before clearing in-memory state; recoverOrphanedRuns() auto-resumes automation runs on startup (concurrency cap of 5)
  • Resume endpoint: POST /:org/decopilot/resume/:threadId with ownership validation, schema drift protection, and model permission re-check
  • Crash recovery: On startup, orphaned automation runs are automatically detected and resumed with audit trail system messages
  • Save-every-step on resume: Reduces message loss window by saving on every STEP_COMPLETED instead of every 5th during resumed runs
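The atomic claim described above is a compare-and-set on run_owner_pod. A minimal in-memory sketch of that logic (the real code runs it as a single conditional SQL UPDATE; the row shape and function signature here are illustrative, not the actual implementation):

```typescript
// In-memory sketch of the claimOrphanedRun CAS. Real code performs this
// as one conditional SQL UPDATE against the threads table.
interface ThreadRow {
  id: string;
  status: "in_progress" | "completed" | "failed";
  run_owner_pod: string | null;
}

// A claim succeeds only for in-progress runs that are unowned (graceful
// shutdown nulls run_owner_pod) or owned by a pod known to be dead.
function claimOrphanedRun(
  row: ThreadRow,
  claimingPod: string,
  deadPod?: string,
): boolean {
  const orphaned =
    row.run_owner_pod === null ||
    (deadPod !== undefined && row.run_owner_pod === deadPod);
  if (row.status !== "in_progress" || !orphaned) return false;
  row.run_owner_pod = claimingPod; // the "set" half of compare-and-set
  return true;
}
```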

How to Test

  1. Run bun run --cwd=apps/mesh migrate to apply the new migration
  2. Start a Decopilot agent run, then kill the server process
  3. Restart the server — orphaned automation runs should auto-resume within 10 seconds
  4. For interactive runs, call POST /:org/decopilot/resume/:threadId to manually resume
  5. Run bun test — 1248 pass, 0 new failures
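The save-every-step behavior from the change list above comes down to a cadence check; a sketch under assumed names (SAVE_INTERVAL and shouldPersistStep are illustrative, not the actual identifiers):

```typescript
// Sketch of the persistence cadence: normal runs save every 5th completed
// step; resumed runs save on every STEP_COMPLETED to shrink the
// message-loss window. Names are illustrative.
const SAVE_INTERVAL = 5;

function shouldPersistStep(stepIndex: number, isResumedRun: boolean): boolean {
  if (isResumedRun) return true; // resume path: persist every step
  return stepIndex % SAVE_INTERVAL === 0; // normal path: every 5th step
}
```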

Migration Notes

  • New migration 047-durable-agent-runs adds three nullable columns to threads table — no data backfill needed
  • Partial index idx_threads_orphaned_runs on (status, run_owner_pod) WHERE status = 'in_progress' for efficient orphan lookups
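Based on the notes above, the migration's DDL would look roughly like the following (illustrative only; the actual migration file in apps/mesh may differ in layout and naming):

```typescript
// Illustrative DDL matching the migration notes: three nullable columns,
// no backfill, plus a partial index so orphan lookups only scan
// in-progress runs.
const up = `
  ALTER TABLE threads
    ADD COLUMN run_owner_pod  text,
    ADD COLUMN run_config     jsonb,
    ADD COLUMN run_started_at timestamptz;

  CREATE INDEX idx_threads_orphaned_runs
    ON threads (status, run_owner_pod)
    WHERE status = 'in_progress';
`;
```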

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • No breaking changes
  • Comprehensive test coverage for new state machine transitions, storage methods, registry behavior, and schema validation

Summary by cubic

Make Decopilot runs durable with crash recovery, pod‑death handoff, and transparent resume by persisting run ownership/config, adding DB CAS on start, and per‑pod NATS KV heartbeats. Runs auto‑resume on startup and when a pod dies; the client auto‑attaches and resumes without manual steps.

  • New Features

    • DB/storage: added run_owner_pod, run_config (JSONB), run_started_at and partial index; CAS helpers claimRunStart, claimOrphanedRun (NULL or stale owner), listOrphanedRuns/listOrphanedRunsByPod, orphanRunsByPod.
    • Engine/registry: new RESUME command and RUN_RESUMED event; RUN_STARTED uses CAS (persisting run_config, run_owner_pod, run_started_at); reactor clears run columns on terminal states; RunRegistry.stopAll() first orphans in DB, then aborts; recoverOrphanedRuns() resumes both automation and interactive runs (cap 5); start conflicts return 409.
    • Pod death recovery: per‑pod NatsPodHeartbeat detects dead pods and handlePodDeath claims/resumes their runs; stable POD_ID from POD_NAME (or a UUID).
    • Resume: GET /:org/decopilot/attach/:threadId replays if running locally, or CAS‑claims orphans and resumes using stored run_config; validates schema and current model permissions (403 on deny); previous POST resume endpoint removed; on resume, messages save on every step to reduce loss.
    • UI: show “Resuming task...” in the assistant placeholder while attach resumes; removed the “Run in progress” label in chat input.
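The pod-death detection described above can be sketched as a staleness check over heartbeat timestamps (a plain Map stands in for the NATS KV bucket, and the 45-second window comes from the commit messages in this thread; the real NatsPodHeartbeat uses a KV watcher rather than polling):

```typescript
// Sketch of dead-pod detection via heartbeat timestamps. The real
// implementation watches a NATS KV bucket; here a Map of
// podId -> last heartbeat (epoch ms) stands in.
const HEARTBEAT_TTL_MS = 45_000;

function findDeadPods(
  lastSeen: Map<string, number>,
  nowMs: number,
): string[] {
  const dead: string[] = [];
  for (const [podId, seenAt] of lastSeen) {
    if (nowMs - seenAt > HEARTBEAT_TTL_MS) dead.push(podId);
  }
  return dead;
}
```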
  • Migration

    • Run apps/mesh migration 050-durable-agent-runs.
    • No backfill; creates partial index idx_threads_run_owner for efficient orphan lookups.

Written for commit 8981c5a. Summary will update on new commits.

@github-actions

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction | Action
👍 | Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions

github-actions bot commented Mar 22, 2026

Release Options

Should a new version be published when this PR is merged?

React with an emoji to vote on the release type:

Reaction | Type | Next Version
👍 | Prerelease | 2.190.6-alpha.1
🎉 | Patch | 2.190.6
❤️ | Minor | 2.191.0
🚀 | Major | 3.0.0

Current version: 2.190.5

Deployment

  • Deploy to production (triggers ArgoCD sync after Docker image is published)


@cubic-dev-ai cubic-dev-ai bot left a comment


9 issues found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/routes/decopilot/run-reactor.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-reactor.ts:120">
P1: Clear the ghost-run metadata in the same conditional update. As written, a new run that starts between these two queries can have its fresh ownership/config fields wiped out.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/automation-context.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/automation-context.ts:69">
P1: Load and pass custom-role permissions into this background auth client. With empty headers, resumed/manual automation runs for non-built-in roles will fail authorization.</violation>
</file>

<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:793">
P1: Track and clear this delayed recovery timer during HMR/shutdown. Otherwise a previous `createApp` instance can still fire orphan recovery after its Decopilot resources were already cleaned up.

(Based on your team's feedback about tracking long-lived createApp resources during HMR/shutdown cleanup.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/run-registry.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-registry.ts:133">
P1: If orphaning the DB rows fails, this method still clears local state and strands the run as `in_progress` with the old owner.</violation>

<violation number="2" location="apps/mesh/src/api/routes/decopilot/run-registry.ts:153">
P2: This startup recovery only processes one page of orphaned runs, so anything after the first 100 is never auto-resumed.</violation>
</file>

<file name="apps/mesh/src/core/pod-identity.ts">

<violation number="1" location="apps/mesh/src/core/pod-identity.ts:5">
P2: `HOSTNAME` is being treated as a pod id in every environment, so non-Kubernetes processes can share the same `POD_ID` instead of getting a per-process UUID.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/stream-core.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/stream-core.ts:195">
P1: Dispatch the resume state before the async setup or explicitly release the DB claim on setup failures. A failed resume attempt can otherwise strand the thread as claimed-but-not-running.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/run-config.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-config.ts:14">
P2: Validate persisted `capabilities` and `limits` with the same typed shape as the request schema; the current `z.record(..., z.unknown())` lets malformed run configs bypass resume-time schema checks.</violation>
</file>

<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:451">
P1: Allow `claimOrphanedRun()` to reclaim stale dead-pod runs. With the current null-only CAS, threads found via stale `run_started_at` can never be resumed after an ungraceful pod crash.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@tlgimenes tlgimenes force-pushed the tlgimenes/agent-durability branch from 1beda3e to 3725764 Compare March 22, 2026 23:55

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 5 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:456">
P1: Using `run_started_at` as the stale-claim cutoff can steal still-running jobs after 15 minutes.</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:451">
P1: This startup recovery query now skips runs stranded with a stale `run_owner_pod`, so abnormal pod deaths leave some threads stuck in `in_progress` forever.

(Based on your team's feedback about avoiding stale-timestamp ownership stealing.) [FEEDBACK_USED]</violation>
</file>


@tlgimenes tlgimenes force-pushed the tlgimenes/agent-durability branch from 96adaac to 8aa9eb9 Compare March 23, 2026 13:39

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).


<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:311">
P2: Don't turn a model-permission failure into `204 No Content`; the reconnect path will treat it like an empty attach and the resume failure becomes silent.</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:460">
P1: Do not claim runs that already have a non-null `run_owner_pod`; this lets another pod steal an active run during attach/resume.

(Based on your team's feedback about using only `run_owner_pod IS NULL` for claim/ownership CAS logic.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/routes.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/routes.ts:368">
P0: Don't reclaim runs with a non-null `run_owner_pod` here; this can steal a still-running job from another healthy pod and start a second resume.

(Based on your team's feedback about using only `run_owner_pod IS NULL` for run-claim CAS logic.) [FEEDBACK_USED]</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 5 files (changes from recent commits).


<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:455">
P0: This claim now steals runs from other pods instead of only claiming true orphans.

(Based on your team's feedback about using only `run_owner_pod IS NULL` for run ownership CAS.) [FEEDBACK_USED]</violation>

<violation number="2" location="apps/mesh/src/storage/threads.ts:468">
P0: This orphan query now includes runs owned by other healthy pods, so recovery can resume the same thread concurrently on multiple replicas.

(Based on your team's feedback about using only `run_owner_pod IS NULL` to detect orphaned runs.) [FEEDBACK_USED]</violation>
</file>



@cubic-dev-ai cubic-dev-ai bot left a comment


6 issues found across 9 files (changes from recent commits).


<file name="apps/mesh/src/api/routes/decopilot/run-reactor.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-reactor.ts:89">
P1: Throwing here leaves a phantom local `running` entry when the DB claim is lost.</violation>
</file>

<file name="apps/mesh/src/nats/pod-heartbeat.ts">

<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:52">
P2: Await the initial heartbeat write so startup cannot continue without registering the pod key.

(Based on your team's feedback about treating JetStream startup paths as fail-fast dependencies.) [FEEDBACK_USED]</violation>

<violation number="2" location="apps/mesh/src/nats/pod-heartbeat.ts:130">
P2: Also close the active `kv.watch()` iterator in `stop()`. Aborting the local signal alone leaves the watcher pending until another KV event arrives.

(Based on your team's feedback about cleaning up long-lived resources during app recreation/shutdown.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:336">
P1: Stopping the heartbeat before aborting/orphaning local runs lets another pod resume the same thread while this pod is still running it.</violation>
</file>

<file name="apps/mesh/src/storage/threads.ts">

<violation number="1" location="apps/mesh/src/storage/threads.ts:488">
P2: Dead-pod recovery only fetches the first 100 orphaned runs, leaving additional runs unrecovered.</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/run-registry.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/run-registry.ts:204">
P1: Broadcasting `CANCEL` here can turn a graceful handoff into a permanent `failed` run. Other pods see the heartbeat delete before the old pod has orphaned its DB rows, so the old pod still handles the cancel and persists `status = 'failed'` before this recovery path can claim it.</violation>
</file>


tlgimenes and others added 10 commits March 23, 2026 15:02
Persist agent run ownership and config to DB so runs survive pod restarts,
deploys, and OOM kills. Orphaned runs are automatically resumed on startup
with a concurrency cap of 5. Users can also manually resume via a new
POST /:org/decopilot/resume/:threadId endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Server crashes shouldn't punish users — every in-progress run (both
automation and interactive) is now auto-resumed on startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Load custom-role permissions in automation context factory
- Allow claimOrphanedRun to reclaim stale dead-pod runs (15min threshold)
- Use POD_NAME env var instead of HOSTNAME to avoid collisions in non-K8s
- Add typed validation for capabilities/limits in persisted run config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 15-minute threshold could steal still-running jobs from other pods.
Revert to null-only CAS — graceful shutdown already nulls run_owner_pod.
SIGKILL/OOM recovery should use a separate long-interval reaper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused POST /resume endpoint and move orphan detection into
GET /attach/:threadId so the client auto-resumes crashed runs without
any client-side changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crashed pods never clear run_owner_pod, so orphan detection failed for
hard crashes. Add claimStaleRun CAS for non-null stale pods alongside
the existing null-pod path.

UI: remove "Run in progress" label from input, show "Resuming task..."
(animated, like Thinking) instead of "No response was generated" while
the attach endpoint resumes the orphaned run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n path

Returning 204 made model-permission failures silent. Now throws 403
consistent with the /stream endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate method

Remove claimStaleRun — instead widen claimOrphanedRun CAS to match both
NULL and stale (different-pod) run_owner_pod. Also widen
listOrphanedRuns to find crash-orphaned runs (non-null owner != current
pod) so startup recovery handles them.

This avoids the race where a separate claimStaleRun in the attach path
could steal a run from a healthy pod in multi-pod deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents concurrent execution via claimRunStart CAS and detects pod death
within 45s via NATS KV heartbeat watcher for automatic orphan recovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main added 049-remove-org-admin-projects; rename ours to avoid collision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tlgimenes tlgimenes force-pushed the tlgimenes/agent-durability branch from 037e716 to 8981c5a Compare March 23, 2026 18:03
@tlgimenes tlgimenes merged commit 9600edc into main Mar 23, 2026
14 checks passed
@tlgimenes tlgimenes deleted the tlgimenes/agent-durability branch March 23, 2026 18:15