fix(gastown): dispatch circuit breaker — per-bead attempt cap, exponential backoff, town-level breaker (#1653)#1921
fix(gastown): dispatch circuit breaker — per-bead attempt cap, exponential backoff, town-level breaker (#1653)#1921
Conversation
…ntial backoff, town-level breaker, error logging (#1653) Fixes GitHub issue #1653 — no circuit breaker on dispatch failures causing infinite retry loops. Changes: 1. Per-bead dispatch tracking (Fix 1): - Add dispatch_attempts + last_dispatch_attempt_at columns to beads table - Stop resetting dispatch_attempts in hookBead (root cause of the loop) - Increment bead.dispatch_attempts in both actions.ts and scheduling.ts - Lower MAX_DISPATCH_ATTEMPTS from 20 to 5 - Add exponential backoff: 2min → 5min → 10min → 30min 2. Town-level circuit breaker (Fix 2): - Count total dispatch attempts across all beads in a 30-min window - If >20, skip all dispatch_agent actions and notify mayor - Auto-resets when the window expires 3. Error logging in dispatch_failed events (Fix 3): - Add reason field to agent.dispatch_failed analytics events - 'container returned false' for the started===false path - Error message for the catch path 4. Stale bead reset + triage RESTART honoring dispatch cap (Fix 4): - Rule 3 transitions beads at max attempts to 'failed' instead of 'open' - Triage RESTART checks bead.dispatch_attempts before allowing restart
| // Track dispatch attempts on the bead itself (not just the agent). | ||
| // The bead counter is never reset by hookBead, preventing the | ||
| // infinite retry loop (#1653). | ||
| query( |
There was a problem hiding this comment.
WARNING: Dispatch attempts are counted twice per dispatch
dispatch_agent already hands off to ctx.dispatchAgent(), and Town.do routes that into scheduling.dispatchAgent(), which increments the bead counter again. With this extra update, every dispatch consumes two attempts, so the new 5-attempt cap and the town-level breaker will fire roughly twice as early as intended.
| * in the recent window. If it exceeds the threshold, returns the failure count. | ||
| * Returns 0 if the circuit breaker is not tripped. | ||
| */ | ||
| function checkDispatchCircuitBreaker(sql: SqlStorage): number { |
There was a problem hiding this comment.
WARNING: The circuit breaker opens on successful dispatch volume too
This helper sums beads.dispatch_attempts for every bead with a recent dispatch timestamp. Because that counter is incremented before startAgentInContainer() and never reset on success, a town that successfully starts 21 beads in 30 minutes will also trip the breaker and pause scheduling. If this is meant to react to failures, it needs failure-specific state instead of the generic attempt counter.
| @@ -1555,6 +1555,25 @@ export class TownDO extends DurableObject<Env> { | |||
| switch (action) { | |||
| case 'RESTART': | |||
| case 'RESTART_WITH_BACKOFF': { | |||
There was a problem hiding this comment.
WARNING: RESTART_WITH_BACKOFF no longer applies the new backoff policy
The reconciler now gates redispatch off beads.last_dispatch_attempt_at, but this restart path still only changes agent state before handing control back to the scheduler. If the bead's last dispatch timestamp is already stale, the next reconcile can redispatch immediately, so the manual restart bypasses the intended cooldown.
| `${LOG} circuit breaker OPEN: ${cbFailures} dispatch attempts in last 30min (threshold=${CIRCUIT_BREAKER_FAILURE_THRESHOLD}). Skipping all dispatches.` | ||
| ); | ||
| actions.push({ | ||
| type: 'notify_mayor', |
There was a problem hiding this comment.
WARNING: Opening the breaker does not actually notify the mayor
This action goes through notify_mayor, but applyAction() still handles that action by only writing a log line. In the failure mode where all dispatches are paused, the operator-facing alert path here is effectively a no-op.
Code Review SummaryStatus: 4 Issues Found | Recommendation: Address before merge Overview
Issue Details (click to expand)WARNING
Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments: None. Files Reviewed (7 files)
Fix these issues in Kilo Cloud Reviewed by gpt-5.4-2026-03-05 · 1,423,658 tokens |
|
Closing — will re-approach this work separately. |
Summary
dispatch_attemptsandlast_dispatch_attempt_atcolumns to the beads table. Removed thedispatch_attempts = 0reset inhookBead()that was the root cause of infinite retry loops. Bothactions.tsandscheduling.tsnow increment the bead counter on dispatch. LoweredMAX_DISPATCH_ATTEMPTSfrom 20 → 5 with exponential backoff (2min → 5min → 10min → 30min).checkDispatchCircuitBreaker()inreconciler.tsthat counts total dispatch attempts across all beads in a 30-min window. If >20, all dispatch rules are skipped and a mayor notification is emitted. Auto-resets when the window expires.labelfield toagent.dispatch_failedanalytics events for both thestarted===falsepath and the catch path.failedinstead ofopen. Triage RESTART checksbead.dispatch_attempts >= MAX_DISPATCH_ATTEMPTSbefore allowing restart.Fixes #1653.
Verification
oxfmt --list-different .— 0 warnings, 0 errorstsgo --noEmiton cloudflare-gastown and dependencies — 0 errorsVisual Changes
N/A
Reviewer Notes
dispatch_attemptsfor beads withlast_dispatch_attempt_atin the 30-min window. This is conservative — a bead dispatched 25 minutes ago with 5 attempts still contributes to the sum. False positive trips (pausing dispatches when not strictly needed) are safer than false negatives, and the window auto-resets as timestamps age out.checkDispatchCircuitBreaker()is called twice inreconcileReviewQueue(once at the top, once via the early return). This is a minor redundancy but acceptable given the fast DO SQLite reads.DISPATCH_COOLDOWN_MSis still exported (used elsewhere) but no longer imported byreconciler.ts, which now usesgetDispatchBackoffMs()exclusively.