fix(gastown): Agent GC deleteAgent() bypasses terminal state guard — re-opens closed/failed beads #1988

@jrf0110

Bug

When the reconciler garbage-collects idle agents (>24h, no hook), deleteAgent() re-opens ALL beads assigned to that agent — including beads that are closed or failed. This bypasses the terminal state guard in updateBeadStatus() and mass-resurrects completed work.

Incident

Town 98172328 was idle. At 18:02:55 UTC on Apr 3, the reconciler GC'd 6 stale polecats. deleteAgent() re-opened 32 beads that were already closed/failed. The reconciler then tried to dispatch 30 agents to these zombie beads, causing a 60-second wall-clock spike. Every dispatch failed, triggering a cascade of dispatch failures, container evictions, and re-opening cycles. The town had to be manually recovered by having the Mayor close everything.

Root Cause

In agents.ts:204-216, deleteAgent() uses a raw SQL UPDATE that sets status = 'open' on every bead assigned to the deleted agent, with no terminal state check:

export function deleteAgent(sql, agentId) {
  query(sql, `
    UPDATE beads
    SET assignee_agent_bead_id = NULL,
        status = 'open',         -- BUG: bypasses the terminal state guard
        updated_at = ?
    WHERE assignee_agent_bead_id = ?
  `, [now(), agentId]);
  deleteBead(sql, agentId);
}

The terminal state guard in updateBeadStatus() (beads.ts:278-287) correctly blocks closed -> open transitions. But deleteAgent() bypasses this by writing raw SQL.
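
For readers without beads.ts open, the kind of check being bypassed can be sketched roughly as follows. This is a hypothetical illustration, not the code at beads.ts:278-287: the `Bead` shape, the `guardedStatusUpdate` name, and the treatment of 'failed' as terminal alongside 'closed' are assumptions drawn from how this issue uses the terms.

```typescript
// Hypothetical sketch of a terminal state guard. Names and shapes are
// assumptions for illustration; they are not taken from beads.ts.
interface Bead {
  id: string;
  status: string; // e.g. "open", "closed", "failed"
  assignee: string | null;
}

// Statuses this issue treats as terminal.
const TERMINAL_STATUSES: ReadonlySet<string> = new Set(["closed", "failed"]);

// Applies the transition and returns true, or returns false when the bead
// is already terminal and the caller asks for a different status.
function guardedStatusUpdate(bead: Bead, next: string): boolean {
  if (TERMINAL_STATUSES.has(bead.status) && bead.status !== next) {
    return false; // blocked: terminal beads stay terminal
  }
  bead.status = next;
  return true;
}
```

A raw UPDATE never passes through a function like this, which is why the guard offers no protection against deleteAgent().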

Fix

Exclude terminal beads from the status reset. Clear the assignee on terminal beads without changing their status:

export function deleteAgent(sql, agentId) {
  // Re-open non-terminal beads so they can be reassigned
  query(sql, `
    UPDATE beads
    SET assignee_agent_bead_id = NULL,
        status = 'open',
        updated_at = ?
    WHERE assignee_agent_bead_id = ?
      AND status NOT IN ('closed', 'failed')
  `, [now(), agentId]);

  // Clear assignee on terminal beads without changing status
  query(sql, `
    UPDATE beads
    SET assignee_agent_bead_id = NULL
    WHERE assignee_agent_bead_id = ?
      AND status IN ('closed', 'failed')
  `, [agentId]);

  deleteBead(sql, agentId);
}

Broader Concern

This is the same class of bug identified in audit #1986 (finding B5): raw SQL mutations that bypass updateBeadStatus(). Every direct UPDATE beads SET status = ... in the codebase should be audited. The close_sibling_mrs and close_convoy action handlers were fixed in an earlier PR, but deleteAgent() was missed.
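
As a starting point for that audit, a heuristic scan over source text could flag UPDATE statements on beads whose SET clause assigns status. The sketch below is an assumption-laden helper, not existing tooling: `findRawStatusUpdates` is a hypothetical name, and the regex only catches SQL written literally in one statement. It will miss dynamically built queries, so manual review is still needed.

```typescript
// Heuristic: an UPDATE on `beads` whose SET clause (the part before any
// WHERE) assigns `status`. Stops at `;` or a backtick so a match cannot
// run past the end of the statement or template literal.
const RAW_STATUS_UPDATE =
  /UPDATE\s+beads\s+SET\b(?:(?!\bWHERE\b)[^;`])*?\bstatus\s*=/;

// Returns the 1-based line number of each matching UPDATE keyword.
function findRawStatusUpdates(source: string): number[] {
  // Build a fresh global regex per call so lastIndex state is never shared.
  const pattern = new RegExp(RAW_STATUS_UPDATE.source, "gi");
  const lines: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    lines.push(source.slice(0, match.index).split("\n").length);
  }
  return lines;
}
```

Run over each file's contents, the returned line numbers give a review list. Each hit then needs a human decision on whether it should go through updateBeadStatus() or carry its own terminal state filter.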

Files

  • src/dos/town/agents.ts:204-216 — deleteAgent() (the bug)
  • src/dos/town/beads.ts:278-287 — terminal state guard (bypassed)
  • src/dos/town/reconciler.ts:1773-1781 — GC rule that triggers delete_agent actions

Impact

Critical — mass-resurrects completed beads, causing agents to be dispatched to already-done work (wasting credits), 60+ second alarm tick spikes, container eviction cascades, and towns becoming unusable until manually recovered.

Acceptance Criteria

  • deleteAgent() excludes closed/failed beads from status reset
  • Terminal beads get assignee cleared without status change
  • Audit all other raw UPDATE beads SET status queries for the same gap
  • Add a regression test: GC an agent with closed beads, verify beads stay closed
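
The regression test in the last criterion could take roughly the following shape. This is only a sketch under stated assumptions: it models the beads table as an in-memory array instead of running the real SQL, `deleteAgentModel` and `checkGcKeepsTerminalBeads` are hypothetical names, the `in_progress` status is invented for the example, and treating the agent as a row removed by id stands in for deleteBead(). A real test would run deleteAgent() against the actual storage layer.

```typescript
// In-memory stand-in for a row in the beads table. Column names follow
// the issue; the row shape itself is an assumption.
interface BeadRow {
  id: string;
  status: string;
  assignee_agent_bead_id: string | null;
  updated_at: number;
}

const TERMINAL = new Set(["closed", "failed"]);

// Mirrors the two UPDATE statements of the proposed fix, plus the delete.
function deleteAgentModel(beads: BeadRow[], agentId: string, now: number): BeadRow[] {
  return beads
    .filter((bead) => bead.id !== agentId) // stands in for deleteBead(sql, agentId)
    .map((bead) => {
      if (bead.assignee_agent_bead_id !== agentId) return bead;
      if (TERMINAL.has(bead.status)) {
        // Terminal bead: clear the assignee, leave status untouched.
        return { ...bead, assignee_agent_bead_id: null };
      }
      // Non-terminal bead: re-open so it can be reassigned.
      return { ...bead, assignee_agent_bead_id: null, status: "open", updated_at: now };
    });
}

// Returns a list of failure messages; an empty list means the check passed.
function checkGcKeepsTerminalBeads(): string[] {
  const before: BeadRow[] = [
    { id: "agent-1", status: "open", assignee_agent_bead_id: null, updated_at: 1 },
    { id: "bead-closed", status: "closed", assignee_agent_bead_id: "agent-1", updated_at: 1 },
    { id: "bead-failed", status: "failed", assignee_agent_bead_id: "agent-1", updated_at: 1 },
    { id: "bead-active", status: "in_progress", assignee_agent_bead_id: "agent-1", updated_at: 1 },
  ];
  const after = deleteAgentModel(before, "agent-1", 2);
  const expected: Record<string, string> = {
    "bead-closed": "closed",
    "bead-failed": "failed",
    "bead-active": "open",
  };
  const failures: string[] = [];
  if (after.some((bead) => bead.id === "agent-1")) failures.push("agent row was not deleted");
  for (const bead of after) {
    if (bead.status !== expected[bead.id]) {
      failures.push(`${bead.id}: expected ${expected[bead.id]}, got ${bead.status}`);
    }
    if (bead.assignee_agent_bead_id !== null) failures.push(`${bead.id}: assignee not cleared`);
  }
  return failures;
}
```

Dropping the terminal branch from `deleteAgentModel` reproduces the bug in this model, and the check then reports the closed and failed beads as re-opened.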


Labels: P0 (Blocks soft launch), bug (Something isn't working), gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
