fix(gastown): dispatch circuit breaker — per-bead attempt cap, exponential backoff, town-level breaker (#1653) by jrf0110 · Pull Request #1921 · Kilo-Org/cloud

jrf0110 · 2026-04-02T13:31:48Z

Summary

Per-bead dispatch tracking: Added dispatch_attempts and last_dispatch_attempt_at columns to the beads table. Removed the dispatch_attempts = 0 reset in hookBead() that was the root cause of infinite retry loops. Both actions.ts and scheduling.ts now increment the bead counter on dispatch. Lowered MAX_DISPATCH_ATTEMPTS from 20 → 5 with exponential backoff (2min → 5min → 10min → 30min).
Town-level circuit breaker: Added checkDispatchCircuitBreaker() in reconciler.ts that counts total dispatch attempts across all beads in a 30-min window. If >20, all dispatch rules are skipped and a mayor notification is emitted. Auto-resets when the window expires.
Error logging: Added label field to agent.dispatch_failed analytics events for both the started===false path and the catch path.
Stale bead reset + triage RESTART cap: Reconciler Rule 3 transitions beads at max attempts to failed instead of open. Triage RESTART checks bead.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS before allowing restart.
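
The attempt cap and backoff schedule described above can be sketched as follows. This is a minimal illustration, not the PR's code: `backoffMsFor` and `canDispatch` are hypothetical names, and the mapping of attempt count to delay is an assumption based only on the schedule quoted in the summary (the real `getDispatchBackoffMs()` may differ).

```typescript
// Hypothetical sketch of a per-bead attempt cap with a fixed backoff schedule.
// Values come from the PR summary; function names and indexing are assumptions.
const MAX_DISPATCH_ATTEMPTS = 5;
const MINUTE_MS = 60_000;

// Delay to wait after the Nth attempt (N = 1..4): 2min, 5min, 10min, 30min.
const BACKOFF_SCHEDULE_MS: readonly number[] = [2, 5, 10, 30].map((m) => m * MINUTE_MS);

function backoffMsFor(attempts: number): number {
  if (attempts <= 0) return 0; // never dispatched: no delay
  const idx = Math.min(attempts, BACKOFF_SCHEDULE_MS.length) - 1;
  return BACKOFF_SCHEDULE_MS[idx] ?? 0;
}

function canDispatch(attempts: number, lastAttemptAtMs: number | null, nowMs: number): boolean {
  if (attempts >= MAX_DISPATCH_ATTEMPTS) return false; // cap reached: bead should fail
  if (lastAttemptAtMs === null) return true; // first dispatch
  return nowMs - lastAttemptAtMs >= backoffMsFor(attempts);
}
```

Because the counter lives on the bead and is never reset, a bead that keeps failing eventually hits the cap instead of cycling forever.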

Fixes #1653.

Verification

oxfmt --list-different . — 0 warnings, 0 errors
tsgo --noEmit on cloudflare-gastown and dependencies — 0 errors
Lint — 0 warnings, 0 errors
Pre-push hooks (format, lint, typecheck) — all pass

Visual Changes

N/A

Reviewer Notes

The circuit breaker query sums lifetime dispatch_attempts for beads with last_dispatch_attempt_at in the 30-min window. This is conservative — a bead dispatched 25 minutes ago with 5 attempts still contributes to the sum. False positive trips (pausing dispatches when not strictly needed) are safer than false negatives, and the window auto-resets as timestamps age out.
checkDispatchCircuitBreaker() is called twice in reconcileReviewQueue (once at the top, once via the early return). This is a minor redundancy but acceptable given the fast DO SQLite reads.
DISPATCH_COOLDOWN_MS is still exported (used elsewhere) but no longer imported by reconciler.ts, which now uses getDispatchBackoffMs() exclusively.
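
To make the first note concrete, here is a small in-memory sketch of the windowed sum. It assumes a simplified `BeadRow` shape and a plain array in place of the DO SQLite query; `checkBreaker` is a hypothetical stand-in for `checkDispatchCircuitBreaker()`, and only the 30-minute window, the `>20` threshold, and the "return 0 when not tripped" contract are taken from the PR text.

```typescript
// Hypothetical in-memory model of the town-level breaker query.
interface BeadRow {
  id: string;
  dispatchAttempts: number; // lifetime counter, never reset
  lastDispatchAttemptAtMs: number | null;
}

const BREAKER_WINDOW_MS = 30 * 60_000;
const BREAKER_THRESHOLD = 20;

// Sums lifetime attempts for beads whose last attempt falls inside the window.
// Returns the total if it exceeds the threshold, otherwise 0.
function checkBreaker(beads: readonly BeadRow[], nowMs: number): number {
  const cutoff = nowMs - BREAKER_WINDOW_MS;
  let total = 0;
  for (const bead of beads) {
    if (bead.lastDispatchAttemptAtMs !== null && bead.lastDispatchAttemptAtMs > cutoff) {
      total += bead.dispatchAttempts;
    }
  }
  return total > BREAKER_THRESHOLD ? total : 0;
}
```

Under this model a bead last dispatched 25 minutes ago with 5 lifetime attempts contributes all 5 to the sum, which is the conservative behaviour the note describes.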

…ntial backoff, town-level breaker, error logging (#1653)

Fixes GitHub issue #1653: no circuit breaker on dispatch failures causing infinite retry loops.

Changes:

1. Per-bead dispatch tracking (Fix 1):
   - Add dispatch_attempts + last_dispatch_attempt_at columns to beads table
   - Stop resetting dispatch_attempts in hookBead (root cause of the loop)
   - Increment bead.dispatch_attempts in both actions.ts and scheduling.ts
   - Lower MAX_DISPATCH_ATTEMPTS from 20 to 5
   - Add exponential backoff: 2min → 5min → 10min → 30min

2. Town-level circuit breaker (Fix 2):
   - Count total dispatch attempts across all beads in a 30-min window
   - If >20, skip all dispatch_agent actions and notify mayor
   - Auto-resets when the window expires

3. Error logging in dispatch_failed events (Fix 3):
   - Add reason field to agent.dispatch_failed analytics events
   - 'container returned false' for the started===false path
   - Error message for the catch path

4. Stale bead reset + triage RESTART honoring dispatch cap (Fix 4):
   - Rule 3 transitions beads at max attempts to 'failed' instead of 'open'
   - Triage RESTART checks bead.dispatch_attempts before allowing restart

kilo-code-bot · 2026-04-02T13:39:21Z

cloudflare-gastown/src/dos/town/actions.ts

+      // Track dispatch attempts on the bead itself (not just the agent).
+      // The bead counter is never reset by hookBead, preventing the
+      // infinite retry loop (#1653).
+      query(


WARNING: Dispatch attempts are counted twice per dispatch

dispatch_agent already hands off to ctx.dispatchAgent(), and Town.do routes that into scheduling.dispatchAgent(), which increments the bead counter again. With this extra update, every dispatch consumes two attempts, so the new 5-attempt cap and the town-level breaker will fire roughly twice as early as intended.
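
The arithmetic behind "roughly twice as early" can be checked with a short simulation. This is an illustrative sketch only: `dispatchesUntilCap` is a hypothetical helper, and it assumes the cap is enforced as `attempts >= MAX_DISPATCH_ATTEMPTS`, as the summary's triage check suggests.

```typescript
// Hypothetical simulation: how many real dispatches fit under the cap
// when each dispatch increments the bead counter once versus twice.
const ATTEMPT_CAP = 5;

function dispatchesUntilCap(incrementsPerDispatch: number): number {
  if (incrementsPerDispatch <= 0) {
    throw new RangeError("incrementsPerDispatch must be positive");
  }
  let attempts = 0;
  let dispatches = 0;
  while (attempts < ATTEMPT_CAP) {
    attempts += incrementsPerDispatch; // counter bump(s) for one dispatch
    dispatches += 1;
  }
  return dispatches;
}
```

With a single increment the bead gets 5 dispatches before the cap; with the double increment described here the counter goes 2, 4, 6 and the cap is reached after 3.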

kilo-code-bot · 2026-04-02T13:39:21Z

cloudflare-gastown/src/dos/town/reconciler.ts

+ * in the recent window. If it exceeds the threshold, returns the failure count.
+ * Returns 0 if the circuit breaker is not tripped.
+ */
+function checkDispatchCircuitBreaker(sql: SqlStorage): number {


WARNING: The circuit breaker opens on successful dispatch volume too

This helper sums beads.dispatch_attempts for every bead with a recent dispatch timestamp. Because that counter is incremented before startAgentInContainer() and never reset on success, a town that successfully starts 21 beads in 30 minutes will also trip the breaker and pause scheduling. If this is meant to react to failures, it needs failure-specific state instead of the generic attempt counter.
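
One way to read "failure-specific state" is to record a timestamp only when a dispatch fails and count those inside the window. The sketch below assumes that design; `recentFailureCount` and `isBreakerOpen` are hypothetical names, and the failure log is a plain array rather than anything in the PR's schema.

```typescript
// Hypothetical failure-only breaker: successful starts never touch this log.
const FAILURE_WINDOW_MS = 30 * 60_000;
const FAILURE_THRESHOLD = 20;

// Counts failures recorded strictly inside the trailing window.
function recentFailureCount(failureTimestampsMs: readonly number[], nowMs: number): number {
  const cutoff = nowMs - FAILURE_WINDOW_MS;
  return failureTimestampsMs.filter((t) => t > cutoff).length;
}

function isBreakerOpen(failureTimestampsMs: readonly number[], nowMs: number): boolean {
  return recentFailureCount(failureTimestampsMs, nowMs) > FAILURE_THRESHOLD;
}
```

Under this model a town that successfully starts 21 beads in 30 minutes records no failures, so the breaker stays closed.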

kilo-code-bot · 2026-04-02T13:39:21Z

cloudflare-gastown/src/dos/Town.do.ts

@@ -1555,6 +1555,25 @@ export class TownDO extends DurableObject<Env> {
      switch (action) {
        case 'RESTART':
        case 'RESTART_WITH_BACKOFF': {


WARNING: RESTART_WITH_BACKOFF no longer applies the new backoff policy

The reconciler now gates redispatch off beads.last_dispatch_attempt_at, but this restart path still only changes agent state before handing control back to the scheduler. If the bead's last dispatch timestamp is already stale, the next reconcile can redispatch immediately, so the manual restart bypasses the intended cooldown.
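
A possible shape for a fix, sketched under assumptions: if the restart path stamps the bead's last-dispatch timestamp, the reconciler's existing gate would hold the bead for the backoff period. `BeadState`, `restartWithBackoff`, and `isCoolingDown` are hypothetical names, and the PR does not say this is how the issue would be addressed.

```typescript
// Hypothetical sketch: make a manual restart respect the bead-level cooldown.
interface BeadState {
  dispatchAttempts: number;
  lastDispatchAttemptAtMs: number | null;
}

// Stamp "now" so the next reconcile sees a fresh timestamp and waits.
function restartWithBackoff(bead: BeadState, nowMs: number): BeadState {
  return { ...bead, lastDispatchAttemptAtMs: nowMs };
}

// Mirrors the reconciler-style gate: true while the backoff has not elapsed.
function isCoolingDown(bead: BeadState, nowMs: number, backoffMs: number): boolean {
  if (bead.lastDispatchAttemptAtMs === null) return false;
  return nowMs - bead.lastDispatchAttemptAtMs < backoffMs;
}
```

Without the stamp, a bead whose timestamp is already stale passes the gate immediately, which is the bypass described in this comment.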

kilo-code-bot · 2026-04-02T13:39:21Z

cloudflare-gastown/src/dos/town/reconciler.ts

+      `${LOG} circuit breaker OPEN: ${cbFailures} dispatch attempts in last 30min (threshold=${CIRCUIT_BREAKER_FAILURE_THRESHOLD}). Skipping all dispatches.`
+    );
+    actions.push({
+      type: 'notify_mayor',


WARNING: Opening the breaker does not actually notify the mayor

This action goes through notify_mayor, but applyAction() still handles that action by only writing a log line. In the failure mode where all dispatches are paused, the operator-facing alert path here is effectively a no-op.

kilo-code-bot · 2026-04-02T13:39:43Z

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	4
SUGGESTION	0

Issue Details

WARNING

File	Line	Issue
`cloudflare-gastown/src/dos/town/actions.ts`	523	Bead dispatch attempts are incremented twice per dispatch, so the 5-attempt cap and breaker will trigger too early.
`cloudflare-gastown/src/dos/town/reconciler.ts`	77	The town-level breaker counts all recent dispatch attempts, including successful starts, so normal throughput can pause scheduling.
`cloudflare-gastown/src/dos/Town.do.ts`	1557	`RESTART_WITH_BACKOFF` does not update bead-level backoff state, allowing an immediate redispatch.
`cloudflare-gastown/src/dos/town/reconciler.ts`	507	The breaker emits `notify_mayor`, but that action currently only logs and never alerts the mayor.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

None.

Files Reviewed (7 files)

cloudflare-gastown/src/db/tables/beads.table.ts - 0 issues
cloudflare-gastown/src/dos/Town.do.ts - 1 issue
cloudflare-gastown/src/dos/town/actions.ts - 1 issue
cloudflare-gastown/src/dos/town/agents.ts - 0 issues
cloudflare-gastown/src/dos/town/beads.ts - 0 issues
cloudflare-gastown/src/dos/town/reconciler.ts - 2 issues
cloudflare-gastown/src/dos/town/scheduling.ts - 0 issues


Reviewed by gpt-5.4-2026-03-05 · 1,423,658 tokens

jrf0110 · 2026-04-02T14:45:15Z

Closing — will re-approach this work separately.

kilo-code-bot bot reviewed Apr 2, 2026


jrf0110 closed this Apr 2, 2026

jrf0110 wants to merge 1 commit into main from gt/toast/452c52e4
