
fix(gastown): rollback agent to idle on dispatch container start failure#2000

Closed
jrf0110 wants to merge 2 commits into convoy/fix-reconciler-p0-p1-bug-fixes-from-audi/f071bd6c/head from convoy/fix-reconciler-p0-p1-bug-fixes-from-audi/f071bd6c/gt/maple/62e5c90a

Conversation


@jrf0110 jrf0110 commented Apr 4, 2026

Summary

Previously, when dispatch_agent's async side effect failed (container start failure), the agent was left in working status with no container running, creating a 90-second dead zone until heartbeat timeout detection kicked in.

Now the .catch() handler in the dispatch_agent action rolls the agent back to idle via agentOps.updateAgentStatus() so the reconciler can retry dispatch on the next tick, per spec §5.4. The bead stays in_progress — no transition needed.

  • Updated comment to document the rollback behavior
  • Added "rolling back to idle" to the warning log message for observability

Verification

  • Typecheck passes
  • Reconciler-related tests pass (2 pre-existing failures in unrelated client.test.ts)
  • Manual review confirms the SQL write in .catch() is safe — it's part of the promise chain awaited by Promise.allSettled in Phase 2 of the alarm loop
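The safety claim in the last bullet can be checked in isolation. A minimal sketch (names like `dispatch` and `alarmPhase2` are illustrative, not the real gastown code): a rejection handler attached with `.catch()` is part of the promise chain, so `Promise.allSettled` does not settle until the rollback work inside the handler has completed.

```typescript
// Sketch: a failing dispatch whose .catch() rollback is awaited as part
// of the Promise.allSettled batch in "Phase 2".
async function dispatch(order: string[]): Promise<void> {
  order.push("dispatch");
  throw new Error("container start failed");
}

async function alarmPhase2(): Promise<string[]> {
  const order: string[] = [];
  const work = dispatch(order).catch(async () => {
    // Simulated async rollback write; allSettled awaits this as well,
    // because the handler's promise is the one in the batch.
    await new Promise((resolve) => setTimeout(resolve, 10));
    order.push("rollback");
  });
  await Promise.allSettled([work]);
  order.push("settled");
  return order;
}

alarmPhase2().then((order) => console.log(order.join(" -> ")));
// prints "dispatch -> rollback -> settled"
```

The key point is that the batch holds the post-`.catch()` promise, not the raw dispatch promise, so the rollback write cannot race past the end of the phase.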

Visual Changes

N/A

Reviewer Notes

  • The idle agent remains hooked to the in_progress bead. The reconciler handles this state: reconcileBeads will see an unassigned in_progress bead with an idle agent and re-dispatch.
  • There is an empty "WIP: container eviction save" commit (no file changes) — harmless artifact from the polecat's workflow.
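The re-dispatch condition in the first note can be sketched as a predicate. This is an assumed shape, not the real `reconcileBeads` implementation: an in_progress bead whose hooked agent is back in idle signals that an earlier dispatch failed and was rolled back.

```typescript
// Hypothetical types; the real reconciler's schema may differ.
type AgentStatus = "idle" | "working";
interface Agent { id: string; status: AgentStatus; }
interface Bead { id: string; status: "in_progress" | "done"; agentId: string | null; }

function needsRedispatch(bead: Bead, agents: Map<string, Agent>): boolean {
  if (bead.status !== "in_progress" || bead.agentId === null) return false;
  const agent = agents.get(bead.agentId);
  // An idle agent still hooked to an in_progress bead means the earlier
  // dispatch failed and was rolled back; dispatch again this tick.
  return agent !== undefined && agent.status === "idle";
}
```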

John Fawcett added 2 commits April 4, 2026 15:01
When dispatch_agent's async side effect fails (container start fails),
the agent was left in 'working' status with no container running,
creating a 90-second dead zone until heartbeat timeout detection.

Now the .catch() handler rolls the agent back to 'idle' so the
reconciler can retry dispatch on the next tick, per spec §5.4.
// Best-effort dispatch. If the container start fails, roll the
// agent back to 'idle' so the reconciler can retry on the next
// tick. The bead stays 'in_progress' — no transition needed.
await ctx.dispatchAgent(capturedAgentId, beadId, rigId).catch(err => {

WARNING: This only rolls back rejected dispatches

ctx.dispatchAgent() resolves to false on the normal startup failure paths in scheduling.dispatchAgent() (for example, when startAgentInContainer() returns false or the rig lookup fails). In those cases this .catch() never runs, so the agent still stays in working and the 90-second dead zone remains. Handle a falsy return value here as well if the goal is to roll the agent back to idle on failed container starts.
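One possible shape of the fix the warning asks for, as a hedged sketch (the interfaces and the `dispatchWithRollback` wrapper are assumptions, not the real actions.ts API): normalize both failure modes, rejection and a resolved `false`, then roll back in one place.

```typescript
// Hypothetical interfaces standing in for the real ctx/agentOps.
interface Ctx {
  dispatchAgent(agentId: string, beadId: string, rigId: string): Promise<boolean>;
}
interface AgentOps {
  updateAgentStatus(agentId: string, status: "idle" | "working"): Promise<void>;
}

async function dispatchWithRollback(
  ctx: Ctx,
  agentOps: AgentOps,
  agentId: string,
  beadId: string,
  rigId: string,
): Promise<boolean> {
  // Collapse both failure modes into `false`: a rejected promise and a
  // dispatch that resolved to false are treated the same way.
  const ok = await ctx.dispatchAgent(agentId, beadId, rigId).catch(() => false);
  if (!ok) {
    console.warn(`dispatch failed for agent ${agentId}; rolling back to idle`);
    await agentOps.updateAgentStatus(agentId, "idle");
  }
  return ok;
}
```

Folding the `.catch()` into a boolean keeps a single rollback path, so a later change to the rollback logic cannot diverge between the two failure modes.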


kilo-code-bot bot commented Apr 4, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0


Issue Details

WARNING

File: cloudflare-gastown/src/dos/town/actions.ts (line 564)
Issue: Rollback only runs on rejected dispatches, so dispatchAgent() failures that resolve to false still leave the agent stuck in working.
Other Observations (not in diff)

N/A

Files Reviewed (1 files)
  • cloudflare-gastown/src/dos/town/actions.ts - 1 issue

Reviewed by gpt-5.4-20260305 · 284,225 tokens

@kilo-code-bot kilo-code-bot bot closed this Apr 5, 2026
@kilo-code-bot kilo-code-bot bot deleted the convoy/fix-reconciler-p0-p1-bug-fixes-from-audi/f071bd6c/gt/maple/62e5c90a branch April 5, 2026 00:30
