fix(gastown): dispatch circuit breaker — per-bead attempt cap, exponential backoff, town-level breaker (#1653)#1921

Closed
jrf0110 wants to merge 1 commit into main from
gt/toast/452c52e4

Conversation

@jrf0110 jrf0110 commented Apr 2, 2026

Summary

  • Per-bead dispatch tracking: Added dispatch_attempts and last_dispatch_attempt_at columns to the beads table. Removed the dispatch_attempts = 0 reset in hookBead() that was the root cause of infinite retry loops. Both actions.ts and scheduling.ts now increment the bead counter on dispatch. Lowered MAX_DISPATCH_ATTEMPTS from 20 to 5 and added exponential backoff between attempts (2min → 5min → 10min → 30min).
  • Town-level circuit breaker: Added checkDispatchCircuitBreaker() in reconciler.ts that counts total dispatch attempts across all beads in a 30-min window. If >20, all dispatch rules are skipped and a mayor notification is emitted. Auto-resets when the window expires.
  • Error logging: Added label field to agent.dispatch_failed analytics events for both the started===false path and the catch path.
  • Stale bead reset + triage RESTART cap: Reconciler Rule 3 transitions beads at max attempts to failed instead of open. Triage RESTART checks bead.dispatch_attempts >= MAX_DISPATCH_ATTEMPTS before allowing restart.
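
The per-bead cap and backoff schedule from the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: getDispatchBackoffMs() and MAX_DISPATCH_ATTEMPTS are named in this PR, but the function body, the mapping from attempt count to delay, and the canDispatch helper are assumptions.

```typescript
// Hypothetical sketch; the PR's real getDispatchBackoffMs() is not shown in this thread.
const MINUTE_MS = 60_000;
const MAX_DISPATCH_ATTEMPTS = 5;

// Delay before the next attempt, indexed by attempts already made (1..4).
// Assumed mapping of the 2min -> 5min -> 10min -> 30min schedule.
const BACKOFF_SCHEDULE_MS: number[] = [2, 5, 10, 30].map((m) => m * MINUTE_MS);

function getDispatchBackoffMs(attempts: number): number {
  if (attempts <= 0) return 0; // never dispatched: no wait required
  // Clamp to the last entry so later attempts keep the longest delay.
  const idx = Math.min(attempts, BACKOFF_SCHEDULE_MS.length) - 1;
  return BACKOFF_SCHEDULE_MS[idx] ?? 0;
}

// Hypothetical gate combining the attempt cap with the backoff delay.
function canDispatch(
  attempts: number,
  lastAttemptAtMs: number | null,
  nowMs: number,
): boolean {
  if (attempts >= MAX_DISPATCH_ATTEMPTS) return false; // cap reached
  if (lastAttemptAtMs === null) return true; // first dispatch
  return nowMs - lastAttemptAtMs >= getDispatchBackoffMs(attempts);
}
```

Under these assumptions, a bead with one prior attempt becomes eligible again two minutes after that attempt, and a bead at five attempts is never redispatched by this gate.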

Fixes #1653.

Verification

  • oxfmt --list-different . — 0 warnings, 0 errors
  • tsgo --noEmit on cloudflare-gastown and dependencies — 0 errors
  • Lint — 0 warnings, 0 errors
  • Pre-push hooks (format, lint, typecheck) — all pass

Visual Changes

N/A

Reviewer Notes

  • The circuit breaker query sums lifetime dispatch_attempts for beads with last_dispatch_attempt_at in the 30-min window. This is conservative: a bead last dispatched 25 minutes ago with 5 attempts contributes all 5 to the sum, even if some of those attempts happened before the window began. False-positive trips (pausing dispatches when not strictly needed) are safer than false negatives, and the window auto-resets as timestamps age out.
  • checkDispatchCircuitBreaker() is called twice in reconcileReviewQueue (once at the top, once via the early return). This is a minor redundancy but acceptable given the fast DO SQLite reads.
  • DISPATCH_COOLDOWN_MS is still exported (used elsewhere) but no longer imported by reconciler.ts, which now uses getDispatchBackoffMs() exclusively.
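
To make the first note concrete, here is a hedged in-memory sketch of the sum it describes. The real checkDispatchCircuitBreaker() reads from DO SQLite; the BeadRow shape, the epoch-millisecond timestamps, the CIRCUIT_BREAKER_WINDOW_MS name, and the function name sumRecentDispatchAttempts are assumptions made for illustration.

```typescript
// Hypothetical row shape; timestamps are assumed to be epoch milliseconds.
interface BeadRow {
  dispatch_attempts: number;
  last_dispatch_attempt_at: number | null;
}

const CIRCUIT_BREAKER_WINDOW_MS = 30 * 60_000; // assumed name for the 30-min window
const CIRCUIT_BREAKER_FAILURE_THRESHOLD = 20;

// Sums lifetime attempts of every bead whose last attempt falls in the window.
// Returns the total if it exceeds the threshold, otherwise 0 (breaker closed).
function sumRecentDispatchAttempts(beads: BeadRow[], nowMs: number): number {
  const cutoff = nowMs - CIRCUIT_BREAKER_WINDOW_MS;
  let total = 0;
  for (const bead of beads) {
    const last = bead.last_dispatch_attempt_at;
    if (last !== null && last >= cutoff) {
      total += bead.dispatch_attempts; // lifetime count, not per-window
    }
  }
  return total > CIRCUIT_BREAKER_FAILURE_THRESHOLD ? total : 0;
}
```

With this sketch, five beads that each sit at 5 attempts and were last dispatched 25 minutes ago sum to 25 and trip the breaker. Once their timestamps are older than 30 minutes, they drop out of the sum.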

…ntial backoff, town-level breaker, error logging (#1653)

Fixes GitHub issue #1653 — no circuit breaker on dispatch failures causing
infinite retry loops.

Changes:

1. Per-bead dispatch tracking (Fix 1):
   - Add dispatch_attempts + last_dispatch_attempt_at columns to beads table
   - Stop resetting dispatch_attempts in hookBead (root cause of the loop)
   - Increment bead.dispatch_attempts in both actions.ts and scheduling.ts
   - Lower MAX_DISPATCH_ATTEMPTS from 20 to 5
   - Add exponential backoff: 2min → 5min → 10min → 30min

2. Town-level circuit breaker (Fix 2):
   - Count total dispatch attempts across all beads in a 30-min window
   - If >20, skip all dispatch_agent actions and notify mayor
   - Auto-resets when the window expires

3. Error logging in dispatch_failed events (Fix 3):
   - Add reason field to agent.dispatch_failed analytics events
   - 'container returned false' for the started===false path
   - Error message for the catch path

4. Stale bead reset + triage RESTART honoring dispatch cap (Fix 4):
   - Rule 3 transitions beads at max attempts to 'failed' instead of 'open'
   - Triage RESTART checks bead.dispatch_attempts before allowing restart

// Track dispatch attempts on the bead itself (not just the agent).
// The bead counter is never reset by hookBead, preventing the
// infinite retry loop (#1653).
query(

WARNING: Dispatch attempts are counted twice per dispatch

dispatch_agent already hands off to ctx.dispatchAgent(), and Town.do routes that into scheduling.dispatchAgent(), which increments the bead counter again. With this extra update, every dispatch consumes two attempts, so the new 5-attempt cap and the town-level breaker will fire roughly twice as early as intended.

* in the recent window. If it exceeds the threshold, returns the failure count.
* Returns 0 if the circuit breaker is not tripped.
*/
function checkDispatchCircuitBreaker(sql: SqlStorage): number {

WARNING: The circuit breaker opens on successful dispatch volume too

This helper sums beads.dispatch_attempts for every bead with a recent dispatch timestamp. Because that counter is incremented before startAgentInContainer() and never reset on success, a town that successfully starts 21 beads in 30 minutes will also trip the breaker and pause scheduling. If this is meant to react to failures, it needs failure-specific state instead of the generic attempt counter.
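
One possible shape for the failure-specific state suggested above is a log of failure timestamps that successful starts never touch. The function names, the in-memory array, and the constant names are hypothetical; nothing in this sketch is taken from the PR.

```typescript
// Hypothetical sketch: track failures only, so successful dispatch
// volume cannot open the breaker.
const FAILURE_WINDOW_MS = 30 * 60_000;
const FAILURE_THRESHOLD = 20;

// Append a failure timestamp (epoch ms). Call only when a dispatch fails.
function recordDispatchFailure(failureLog: number[], nowMs: number): void {
  failureLog.push(nowMs);
}

// Counts failures inside the window. Returns the count if it exceeds
// the threshold, otherwise 0 (breaker closed).
function checkFailureBreaker(failureLog: number[], nowMs: number): number {
  const cutoff = nowMs - FAILURE_WINDOW_MS;
  const recent = failureLog.filter((t) => t >= cutoff).length;
  return recent > FAILURE_THRESHOLD ? recent : 0;
}
```

In a Durable Object the log would more plausibly live in a SQLite table than in an array; the array only keeps the example self-contained.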

@@ -1555,6 +1555,25 @@ export class TownDO extends DurableObject<Env> {
switch (action) {
case 'RESTART':
case 'RESTART_WITH_BACKOFF': {

WARNING: RESTART_WITH_BACKOFF no longer applies the new backoff policy

The reconciler now gates redispatch off beads.last_dispatch_attempt_at, but this restart path still only changes agent state before handing control back to the scheduler. If the bead's last dispatch timestamp is already stale, the next reconcile can redispatch immediately, so the manual restart bypasses the intended cooldown.
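
A hedged sketch of one way the restart path could respect the cooldown: stamp the bead's last dispatch time when the restart is applied, so the reconciler's gate sees a fresh timestamp. The BeadBackoffState shape, both function names, and passing the backoff as a parameter are assumptions; the actual Town.do.ts handler is not shown in this thread.

```typescript
// Hypothetical bead state; timestamps are assumed to be epoch milliseconds.
interface BeadBackoffState {
  dispatch_attempts: number;
  last_dispatch_attempt_at: number | null;
}

// Hypothetical: on RESTART_WITH_BACKOFF, refresh the timestamp
// without consuming a dispatch attempt.
function stampRestartBackoff(
  bead: BeadBackoffState,
  nowMs: number,
): BeadBackoffState {
  return { ...bead, last_dispatch_attempt_at: nowMs };
}

// Hypothetical reconciler gate: true while the bead must still wait.
function isCoolingDown(
  bead: BeadBackoffState,
  nowMs: number,
  backoffMs: number,
): boolean {
  if (bead.last_dispatch_attempt_at === null) return false;
  return nowMs - bead.last_dispatch_attempt_at < backoffMs;
}
```

Whether a manual restart should also count toward dispatch_attempts is a separate policy question that this sketch leaves open.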

`${LOG} circuit breaker OPEN: ${cbFailures} dispatch attempts in last 30min (threshold=${CIRCUIT_BREAKER_FAILURE_THRESHOLD}). Skipping all dispatches.`
);
actions.push({
type: 'notify_mayor',

WARNING: Opening the breaker does not actually notify the mayor

This action goes through notify_mayor, but applyAction() still handles that action by only writing a log line. In the failure mode where all dispatches are paused, the operator-facing alert path here is effectively a no-op.

kilo-code-bot bot commented Apr 2, 2026

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 4 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/town/actions.ts | 523 | Bead dispatch attempts are incremented twice per dispatch, so the 5-attempt cap and breaker will trigger too early. |
| cloudflare-gastown/src/dos/town/reconciler.ts | 77 | The town-level breaker counts all recent dispatch attempts, including successful starts, so normal throughput can pause scheduling. |
| cloudflare-gastown/src/dos/Town.do.ts | 1557 | RESTART_WITH_BACKOFF does not update bead-level backoff state, allowing an immediate redispatch. |
| cloudflare-gastown/src/dos/town/reconciler.ts | 507 | The breaker emits notify_mayor, but that action currently only logs and never alerts the mayor. |

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

None.

Files Reviewed (7 files)
  • cloudflare-gastown/src/db/tables/beads.table.ts - 0 issues
  • cloudflare-gastown/src/dos/Town.do.ts - 1 issue
  • cloudflare-gastown/src/dos/town/actions.ts - 1 issue
  • cloudflare-gastown/src/dos/town/agents.ts - 0 issues
  • cloudflare-gastown/src/dos/town/beads.ts - 0 issues
  • cloudflare-gastown/src/dos/town/reconciler.ts - 2 issues
  • cloudflare-gastown/src/dos/town/scheduling.ts - 0 issues


Reviewed by gpt-5.4-2026-03-05 · 1,423,658 tokens

jrf0110 commented Apr 2, 2026

Closing — will re-approach this work separately.

@jrf0110 jrf0110 closed this Apr 2, 2026
Development

Successfully merging this pull request may close these issues.

fix(gastown): No circuit breaker on dispatch failures — dead container causes 70h runaway loop (+ spend)
