
audit(gastown): Reconciler code quality audit — bugs, gaps, and hardening recommendations #1986

@jrf0110

Reconciler Code Quality Audit

Exhaustive line-by-line audit of the Gastown reconciler (reconciler.ts, actions.ts, events.ts) and supporting modules (agents.ts, beads.ts, review-queue.ts, Town.do.ts), compared against reconciliation-spec.md.

Files audited:

  • src/dos/town/reconciler.ts (2130 lines)
  • src/dos/town/actions.ts (974 lines)
  • src/dos/town/events.ts (213 lines)
  • src/dos/town/agents.ts (578 lines)
  • src/dos/town/beads.ts (950 lines)
  • src/dos/town/review-queue.ts (1453+ lines)
  • src/dos/Town.do.ts (alarm method, ~300 lines)
  • test/integration/reconciler.test.ts (489 lines)
  • docs/gt/reconciliation-spec.md (1035 lines)

1. Bugs Found

B1: SQL interpolation in circuit breaker query — potential injection / broken query

File: reconciler.ts:63
Impact: HIGH — could break the circuit breaker or cause SQL injection

WHERE ... > strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-${CIRCUIT_BREAKER_WINDOW_MINUTES} minutes')

CIRCUIT_BREAKER_WINDOW_MINUTES (30) is interpolated directly into the SQL string via a JavaScript template literal, not via a parameterized query argument. While this constant is hardcoded (not user-supplied), this violates the project convention ("use parameterized queries") and creates a maintenance hazard — if this constant were ever derived from user input, it would be a SQL injection vector.

Fix: Use a parameterized approach: compute the cutoff timestamp in JS (new Date(Date.now() - CIRCUIT_BREAKER_WINDOW_MINUTES * 60_000).toISOString()) and pass it as a query parameter.
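A minimal sketch of the parameterized approach (the table and column names below are placeholders, and the commented-out `sql.exec` call stands in for the Durable Object SqlStorage API):

```typescript
// Compute the window cutoff in JS and bind it as a query parameter
// instead of interpolating the constant into the SQL string.
const CIRCUIT_BREAKER_WINDOW_MINUTES = 30;

function circuitBreakerCutoff(nowMs: number = Date.now()): string {
  return new Date(nowMs - CIRCUIT_BREAKER_WINDOW_MINUTES * 60_000).toISOString();
}

// Illustrative query shape (table/column names are assumptions):
// sql.exec(
//   "SELECT COUNT(*) AS n FROM dispatch_attempts WHERE attempted_at > ?",
//   circuitBreakerCutoff(),
// );
```

Since the stored timestamps are ISO 8601 (see E4), a lexicographic comparison against a JS-computed ISO cutoff is equivalent to the `strftime` arithmetic.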

B2: dispatch_agent side effect doesn't roll back agent status on failure

File: actions.ts:550-561
Impact: HIGH — agent stuck in 'working' with no container running

The spec (§5.4) says: "If [dispatch] not started: set agent back to idle. Bead stays in_progress. Reconciler will retry on next tick." But the implementation only catches the error and logs it — it does NOT set the agent back to idle:

return async () => {
  await ctx.dispatchAgent(capturedAgentId, beadId, rigId).catch(err => {
    console.warn(...); // No rollback!
  });
};

The comment above claims "the reconciler detects the mismatch on the next tick (idle agent hooked to in_progress bead) and retries dispatch" — but the agent is NOT idle; it is 'working', and Rule 2 only fires for status = 'idle' agents. reconcileAgents would eventually catch this via heartbeat loss, but that imposes a 90-second delay on every failed dispatch.

Fix: In the .catch() handler, call agentOps.updateAgentStatus(sql, capturedAgentId, 'idle').
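A hedged sketch of the rollback, with in-memory stubs standing in for `agentOps.updateAgentStatus` and `ctx.dispatchAgent`:

```typescript
type AgentStatus = 'idle' | 'working';
const agentStatuses = new Map<string, AgentStatus>();

// Stub for agentOps.updateAgentStatus(sql, agentId, status)
function updateAgentStatus(agentId: string, status: AgentStatus): void {
  agentStatuses.set(agentId, status);
}

// Stub for ctx.dispatchAgent — simulates a failed container start
async function dispatchAgent(_agentId: string, _beadId: string): Promise<void> {
  throw new Error('container start failed');
}

function makeDispatchSideEffect(agentId: string, beadId: string) {
  updateAgentStatus(agentId, 'working'); // synchronous SQL phase
  return async () => {
    try {
      await dispatchAgent(agentId, beadId);
    } catch (err) {
      console.warn(`dispatch failed for agent ${agentId}`, err);
      // Rollback: Rule 2 (idle agent hooked to in_progress bead) retries next tick
      updateAgentStatus(agentId, 'idle');
    }
  };
}
```

With the rollback in place, the comment quoted above becomes accurate: the agent really is idle on the next tick.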

B3: merge_pr side effect writes SQL inside async closure

File: actions.ts:780-798, 812-831, 838-856
Impact: MEDIUM — violates phase separation (spec §1.5: "SQL mutations are synchronous and atomic")

The merge_pr action's deferred async function writes SQL (UPDATE beads, UPDATE review_metadata) in multiple places:

  • Line 783-789: Clears auto_merge_pending when fresh feedback found
  • Line 815-830: Clears auto_merge_pending on merge failure
  • Line 838-856: Clears on catch

These SQL writes happen AFTER Promise.allSettled() in the alarm loop (Phase 2/4), potentially racing with the next alarm tick's Phase 0 event drain. If the alarm fires quickly, a concurrent drain could read stale auto_merge_pending state.

Fix: Extract the "clear auto_merge_pending" into a synchronous pre-check, or emit events from the async handler instead of direct SQL writes. The same issue exists in poll_pr (line 646-654: SQL writes inside async context).

B4: poll_pr calls checkPRFeedback twice per tick

File: actions.ts:608, 663
Impact: LOW — wasted API calls, doubled latency

When both auto_resolve_pr_feedback and auto_merge_delay_minutes are configured, checkPRFeedback is called once at line 608 and again at line 663. Each call is a GitHub API request.

Fix: Cache the result of the first call and reuse it.
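One way to sketch the cache (checkPRFeedback is stubbed here; in the real action it issues a GitHub API request, and the result type is illustrative):

```typescript
type PRFeedback = { hasFreshFeedback: boolean };

let apiCalls = 0;
async function checkPRFeedback(_prUrl: string): Promise<PRFeedback> {
  apiCalls++; // stub standing in for one GitHub API request
  return { hasFreshFeedback: false };
}

async function pollPr(prUrl: string): Promise<void> {
  // Per-invocation memoization: both paths share a single fetch.
  let cached: Promise<PRFeedback> | undefined;
  const getFeedback = () => (cached ??= checkPRFeedback(prUrl));

  // auto_resolve_pr_feedback path
  await getFeedback();
  // auto_merge_delay_minutes path — reuses the cached promise
  await getFeedback();
}
```

Caching the promise (rather than the resolved value) also collapses concurrent callers onto one in-flight request.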

B5: reconcileReviewQueue double-evaluates circuit breaker

File: reconciler.ts:1024 vs reconciler.ts:614
Impact: LOW — unnecessary query per tick

checkDispatchCircuitBreaker(sql) is called in reconcileBeads (line 614) and again independently in reconcileReviewQueue (line 1021). The circuit breaker does a COUNT query on every call. These run in the same synchronous tick.

Fix: Pass the circuit breaker result from reconcile() into each sub-function, or hoist it to reconcile() and pass it down.
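A sketch of the hoisted shape (all three functions are stubs; the real checkDispatchCircuitBreaker runs a COUNT query):

```typescript
let breakerQueries = 0;

// Stub: real version runs a COUNT over recent dispatch attempts
function checkDispatchCircuitBreaker(): boolean {
  breakerQueries++;
  return false; // false = breaker closed, dispatches allowed
}

function reconcileBeads(breakerOpen: boolean): void {
  if (breakerOpen) return; // skip dispatch rules when tripped
}

function reconcileReviewQueue(breakerOpen: boolean): void {
  if (breakerOpen) return;
}

function reconcile(): void {
  const breakerOpen = checkDispatchCircuitBreaker(); // evaluated once per tick
  reconcileBeads(breakerOpen);
  reconcileReviewQueue(breakerOpen);
}
```

Since the whole tick is synchronous (spec §1.5), one evaluation per tick cannot go stale between sub-functions.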

B6: Rule 4 (orphaned PR review) only checks in_progress, but spec says open too

File: reconciler.ts:1118-1134
Impact: MEDIUM — spec says Rule 4 applies to status in ('open', 'in_progress') for PR-strategy beads

The spec (§5.3 reconcileReviewQueue Rule 4) says:

for each bead where status in ('open', 'in_progress') and pr_url is not null ...

But the implementation only checks mr.status === 'in_progress' (line 1119). An open MR bead with a pr_url that is stale >30min would not be caught. This could happen if a PR URL is set on an MR bead but the bead is never transitioned to in_progress (e.g., via a race in setReviewPrUrl).

Fix: Change the guard to also handle mr.status === 'open'.


2. Missing Edge Cases

E1: No poison event protection

File: Town.do.ts:3463-3477
Impact: HIGH — a single malformed event blocks the entire reconciler queue

The alarm loop catches applyEvent errors but just skips to the next event, leaving the poison event unprocessed. The comment says "Mark it processed anyway after 3 consecutive failures" but this is NOT implemented. On every tick, the poison event is drained again and fails again, consuming resources forever.

Fix: Track failure count per event (e.g., in a retry_count column or in-memory map). After N failures, mark the event as processed with an error log / Sentry alert.
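An in-memory sketch of the retry tracking (a persisted retry_count column would survive DO restarts; the caller is assumed to mark the event processed and alert when this returns true):

```typescript
const MAX_EVENT_FAILURES = 3;
const eventFailures = new Map<string, number>();

/**
 * Record an applyEvent failure. Returns true when the event should be
 * marked processed anyway (poison event: give up and alert).
 */
function recordEventFailure(eventId: string): boolean {
  const failures = (eventFailures.get(eventId) ?? 0) + 1;
  eventFailures.set(eventId, failures);
  if (failures >= MAX_EVENT_FAILURES) {
    console.error(`poison event ${eventId}: skipping after ${failures} failures`);
    eventFailures.delete(eventId);
    return true; // caller marks the event processed + alerts Sentry
  }
  return false;
}
```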

E2: bead_cancelled event doesn't stop a working agent's container

File: reconciler.ts:259-289
Impact: MEDIUM — agent continues consuming credits after bead is cancelled

When bead_cancelled is applied, it unhooks the agent but doesn't emit a stop_agent side effect. The agent's container keeps running until the heartbeat times out (90s) or the reconciler's stale-heartbeat check fires. During that window, the agent consumes compute resources working on a cancelled bead.

Fix: After unhooking agents in bead_cancelled, also insert a stop_agent action (or defer a stop side effect).

E3: No handling of agent_done + agent_completed in same drain batch

File: reconciler.ts:201-228
Impact: MEDIUM — order-dependent behavior within a batch

If both agent_done and agent_completed events arrive for the same agent in the same drain batch, the outcome depends on insertion order. Normally agent_done runs first (gt_done fires before process exit), but if agent_completed runs first:

  • Non-refinery: agentCompleted unhooks the agent → agentDone finds no hook → logs warning and returns (for polecats). The MR is never created.
  • This is currently handled for refineries (line 507-547 in review-queue.ts), but not for polecats.

Fix: Document this as an accepted trade-off, or add recovery logic to agentDone for polecats similar to the refinery recovery path.

E4: container_status event with not_found uses JS timestamp comparison, not SQLite

File: reconciler.ts:324-326
Impact: LOW — potential timezone mismatch (though unlikely in practice since both use ISO 8601)

The 3-minute grace period is computed in JS (Date.now() - new Date(agent.last_activity_at).getTime()). This is correct as long as last_activity_at is stored as ISO 8601. The touchAgent function uses new Date().toISOString(), so this is consistent.

No fix needed, but worth noting for consistency.

E5: pr_feedback_detected event uses beadOps.createBead directly in applyEvent

File: reconciler.ts:392-407
Impact: LOW — spec violation: events should only update state, not create new entities

The spec says "Events represent facts that have occurred; applying them updates state to reflect those facts." Creating a new bead is a side effect, not a state update. This blurs the boundary between event application and reconciliation.

Fix: Consider moving feedback bead creation to a reconciler rule instead of the event handler.

E6: Convoy progress double-counting path

File: reconciler.ts:1440-1470 and beads.ts:330-469
Impact: LOW — redundant but not harmful

reconcileConvoys recounts closed beads and emits update_convoy_progress. But beads.updateBeadStatus also calls updateConvoyProgress inline when a bead reaches a terminal state. Both paths update the same counter. The reconciler path is a safety net, which is good, but the dual-write means the counter is updated twice on terminal transitions.

No fix needed — redundancy is acceptable for safety.


3. Race Conditions

R1: dispatch_agent action creates agent + hooks + sets working — all non-atomic with container start

File: actions.ts:522-561
Impact: HIGH — window where agent is 'working' but container hasn't started

Between the synchronous SQL mutations (agent → working, bead → in_progress) and the async container dispatch, there's a window where:

  1. Agent is 'working' + bead is 'in_progress'
  2. Container start fails
  3. Agent stays 'working' (see B2 above) until heartbeat timeout

During this window, the reconciler won't retry dispatch because the agent appears healthy (status = 'working'). Combined with B2, this creates a 90-second dead zone per failed dispatch.

Fix: See B2. Additionally, consider checking container health in the pre-dispatch phase rather than relying solely on post-dispatch heartbeat monitoring.

R2: set_convoy_ready_to_land and create_landing_mr in same tick

File: reconciler.ts:1508-1584
Impact: LOW — landing MR creation delayed by one tick

When a convoy first becomes ready to land, the reconciler emits set_convoy_ready_to_land (line 1510-1513). The ready_to_land flag is checked on line 1516 to decide whether to create the landing MR. But since set_convoy_ready_to_land is an action that hasn't been applied yet (it's in the same action batch), the check at line 1516 reads stale state from the database.

Result: the landing MR is created on the NEXT tick, not the current one. This is a one-tick delay (5 seconds), not a bug.

Fix: Either accept the one-tick delay (harmless) or restructure to emit both actions in the same check (when closed_count >= total_count and no landing MR exists, emit both).

R3: Concurrent agentDone and agentCompleted for refineries

File: review-queue.ts:681-782
Impact: MEDIUM — already mitigated but fragile

The agentCompleted handler deliberately does NOT unhook refineries (lines 691-703) to avoid racing with agentDone. This is correct. However, the guard at line 763-778 that sets the refinery to idle has a subtle race: if agentDone runs between agentCompleted's read and write, the WHERE NOT (status = 'working' AND current_hook_bead_id IS NOT NULL) guard prevents clobbering.

This is well-handled. No fix needed.

R4: poll_pr touches updated_at on every tick, defeating staleness checks

File: actions.ts:580-588
Impact: LOW — intentional, but creates a coupling

The poll_pr action synchronously updates updated_at on the MR bead to prevent the 30-minute orphaned PR timeout (Rule 4) from firing while the PR is actively being polled. This is correct but means Rule 4 can NEVER fire as long as polling continues every 5 seconds.

If polling fails silently (e.g., GitHub rate limit returns 200 with an error body that isn't detected), the MR bead will appear non-stale forever.

Fix: Consider a separate last_poll_at field on review_metadata instead of overloading updated_at.


4. Spec Divergences

S1: transition_bead.from is nullable in implementation, required in spec

File: actions.ts:31 vs spec §4.2
Impact: LOW — the implementation is more lenient

The spec defines from: BeadStatus (required), but the implementation uses from: z.string().nullable(). Several call sites pass from: null (e.g., reconciler.ts:794, 1335), meaning the from-state guard is skipped entirely. The applyAction for transition_bead (actions.ts:309-319) doesn't check from at all — it always applies the transition.

Fix: Either enforce the from-state guard as the spec requires, or update the spec to document that from is informational.

S2: Spec says GUPP warn at 30 min, implementation uses 15 min

File: reconciler.ts:1722 vs spec §5.3 (GUPP_WARN_MS = 30 min)
Impact: LOW — intentional tightening

The spec says GUPP warn threshold is 30 minutes. The implementation uses 15 minutes (line 1722: elapsed > 15 * 60_000). The comment says "Tighter warn threshold (15min vs old 30min) using SDK activity." This is an intentional deviation.

Fix: Update the spec to reflect the 15-minute threshold.

S3: Spec says agent_completed closes/fails bead, implementation doesn't for completed status

File: review-queue.ts:704-749 vs spec §5.2

The spec says:

if agent.current_hook_bead_id:
  status = event.payload.status == 'completed' ? 'closed' : 'failed'
  transitionBead(sql, agent.current_hook_bead_id, status)

But the implementation (lines 718-749) does NOT close the bead when status = 'completed' and the bead is still in_progress. Instead, it leaves the bead as in_progress and lets reconciler Rule 3 recover it after 5 minutes. This is intentional (documented in comments) — gt_done is the canonical completion path for polecats.

Fix: Update spec to match implementation.

S4: reconcileBeads Rule 3 timeout is 5 min, spec says 2 min

File: reconciler.ts:103 vs spec §5.3 Rule 3 (stale(updated_at, 2 minutes))
Impact: MEDIUM — longer recovery time than spec

STALE_IN_PROGRESS_TIMEOUT_MS is 5 minutes. The spec says 2 minutes. The code comment explains: "Must be longer than AGENT_IDLE_TIMEOUT_MS (2 min) + one alarm tick (5s) to avoid racing with the idle-timer → agentCompleted → reconciler flow."

Fix: Update spec to 5 minutes with the rationale.

S5: Spec event type list is incomplete

File: spec §3.1 vs reconciler.ts event handlers
Impact: LOW — spec is stale

The implementation handles additional event types not in the spec:

  • container_eviction (line 359-364)
  • pr_feedback_detected (line 372-408)
  • pr_auto_merge (line 410-434)

Fix: Update spec with the new event types.

S6: refineryCodeReview option is threaded through but unused [RETRACTED]

File: reconciler.ts:448, 455, 1018

Correction (verified against source at 30dfbde): This finding is inaccurate. The refineryCodeReview option IS actively used. At reconciler.ts:1141, it is read: const refineryCodeReview = opts?.refineryCodeReview ?? true. At lines 1143-1172, when !refineryCodeReview, open MR beads with PR URLs are fast-tracked to in_progress (bypassing refinery dispatch). At line 1176, it gates Rules 5-6 (refinery dispatch logic). This is functional, not dead code.


5. Code Quality Issues

C1: Repeated AgentRow query pattern (5+ copies)

File: reconciler.ts:478-493, 537-553, 740-757, 1179-1196, 1274-1291, 1370-1387, 1611-1630, 1709-1726
Impact: Maintainability

The same SELECT + JOIN + parse pattern for agent rows is duplicated ~8 times. Each copy selects slightly different columns but uses the same AgentRow schema. A helper function like queryWorkingAgents(sql, filters) would reduce duplication.

C2: hasExistingPrFeedbackBead / hasExistingFeedbackBead — similar but not identical

File: reconciler.ts:1881-1898 and actions.ts:939-956

Correction (verified against source at 30dfbde): These functions are similar but NOT identical. The reconciler version (hasExistingPrFeedbackBead at line 1881) queries bd.bead_id = ? (finds feedback beads that block the MR bead). The actions version (hasExistingFeedbackBead at line 939) uses the same direction but has a slightly different name and lives in a different module. The SQL is semantically equivalent in this case, but they are named differently (hasExistingPrFeedbackBead vs hasExistingFeedbackBead). Still worth deduplicating, but the original claim of "identical implementations" overstates it.

Fix: Consolidate into a shared helper and export from one module.

C3: Magic numbers for timeouts

File: reconciler.ts:324 (180-second grace), reconciler.ts:513 (90_000 ms heartbeat), reconciler.ts:1677 (15 * 60_000 GUPP warn)
Impact: Readability

Some timeouts are named constants (e.g., STUCK_REVIEW_TIMEOUT_MS), others are inline magic numbers. The 180-second cold start grace (line 325), 90-second heartbeat timeout (line 513), and 15-minute GUPP warn (line 1677) should be named constants for clarity.
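The inline values could be lifted into named constants alongside the existing ones (the names below are suggestions; the values are copied from the audit text above):

```typescript
// Named equivalents of the inline timeouts flagged above.
const COLD_START_GRACE_MS = 180 * 1_000;   // container_status not_found grace
const HEARTBEAT_TIMEOUT_MS = 90 * 1_000;   // working-agent heartbeat loss
const GUPP_WARN_MS = 15 * 60_000;          // GUPP warn tier (see S2)
```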

C4: now() helper duplicated in 5 files

File: actions.ts:290-292, agents.ts:51-53, beads.ts:60-62, events.ts:22-24, review-queue.ts:36-38

Every module defines its own now() function. This should be a shared utility.


6. Missing Tests

T1: No test for circuit breaker

Impact: HIGH — a broken circuit breaker could either block all dispatches or allow infinite retries

The circuit breaker (checkDispatchCircuitBreaker) has no test coverage. Tests should verify:

  • Breaker opens when threshold is exceeded
  • Breaker allows dispatches below threshold
  • Breaker emits notify_mayor action

T2: No test for GUPP (reconcileGUPP)

Impact: HIGH — GUPP bugs cause stuck agents consuming credits

No tests for any GUPP tier (warn, escalate, force_stop). Tests should verify:

  • 15-min idle → warn nudge
  • 1-hour → escalate + triage request
  • 2-hour → force stop + triage request
  • Draining skips GUPP
  • hasRecentNudge dedup works correctly

T3: No test for event idempotency

Impact: HIGH — duplicate events could cause double transitions

No test verifies what happens when the same event is applied twice. Key scenarios:

  • Double agent_done for the same agent
  • Double pr_status_changed (merged) for the same MR
  • bead_cancelled for an already-cancelled bead
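A reduced sketch of the property such tests would pin down (the state shape and handler are stand-ins, not the real applyEvent):

```typescript
type AgentState = { status: 'working' | 'idle'; hookBeadId: string | null };

// Stand-in for the agent_done handler: unhook + idle, but a no-op
// when the agent is already unhooked (the duplicate-event case).
function applyAgentDone(agent: AgentState): AgentState {
  if (agent.hookBeadId === null) return agent; // duplicate: nothing to do
  return { status: 'idle', hookBeadId: null };
}
```

An idempotency test applies the handler twice and asserts the second application changes nothing.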

T4: No test for reconcileReviewQueue Rules 2, 3, 4, and 7

Impact: HIGH — stuck MR recovery is the most common failure mode

Only Rule 5 (refinery dispatch) and Rule 6 (dispatch limits) have test coverage. Missing:

  • Rule 2: Stuck review recovery (no working agent, no PR, >30 min)
  • Rule 3: Abandoned MR recovery (no agent hooked, >2 min)
  • Rule 4: Orphaned PR review (>30 min, no working agent)
  • Rule 7: Working refinery with terminal MR → stop

T5: No test for convoy landing MR creation

Impact: MEDIUM — broken convoy landing blocks all review-then-land convoys

reconcileConvoys has tests for auto-close (review-and-merge) and for confirming that review-then-land convoys don't auto-close, but no tests for:

  • Landing MR creation when ready_to_land is set
  • Landing MR merged → convoy closes
  • Failed landing MR handling

T6: No test for applyEvent edge cases

Impact: MEDIUM — no coverage for:

  • pr_feedback_detected → feedback bead creation
  • pr_auto_merge → auto_merge_pending metadata
  • container_status with not_found during cold start grace period
  • bead_cancelled with hooked agents

T7: No unit tests for reconciler rules in isolation

Impact: MEDIUM — all tests are integration tests

The integration tests are valuable but slow and can't easily test timing-dependent scenarios (staleness thresholds, cooldown backoff). Pure unit tests with a mock SqlStorage could test individual rules with controlled timestamps.

T8: No test for invariant checker

Impact: LOW — invariant checker is observability-only (no auto-correction)

checkInvariants is never tested. Should verify all 5 invariant checks detect violations.


7. Hardening Recommendations

H1: Implement poison event protection (priority: P0)

Add a retry_count column to town_events or track retries in-memory. After 3 failures, mark the event processed and alert via Sentry. This prevents a single bad event from blocking all reconciliation.

H2: Fix dispatch rollback (priority: P0)

In dispatch_agent's async side effect catch handler (actions.ts:555), set the agent back to 'idle' on failure. This eliminates the 90-second dead zone described in B2/R1.

H3: Add from-state guards to transition_bead (priority: P1)

The spec requires checking action.from against the current bead status. The implementation skips this check. Without it, concurrent rule firings can produce unexpected transitions (e.g., a bead goes open→in_progress→open→in_progress in rapid succession). Add a guard that logs a warning and skips the transition when action.from doesn't match.
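A hedged sketch of the guard (the action shape follows S1 above; returning the unchanged status to signal a skip is an assumption about the surrounding code):

```typescript
type TransitionBead = { from: string | null; to: string };

// Apply a transition only when the from-state still matches; a null
// `from` keeps today's lenient behavior (informational only).
function applyTransitionBead(current: string, action: TransitionBead): string {
  if (action.from !== null && action.from !== current) {
    console.warn(`transition_bead skipped: expected '${action.from}', bead is '${current}'`);
    return current;
  }
  return action.to;
}
```

This also resolves S1: existing `from: null` call sites keep working, while callers that supply `from` get the compare-and-swap semantics the spec requires.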

H4: Add last_poll_at to review_metadata (priority: P1)

Stop overloading updated_at for poll liveness (see R4). A dedicated last_poll_at field on review_metadata would decouple poll freshness from general bead update semantics, making staleness checks more reliable.

H5: Add auto-recovery for invariant violations (priority: P1)

Currently, checkInvariants only logs violations. Add auto-corrective actions:

  • Invariant 7 (working + no hook): transition agent to idle
  • Invariant 6 (multi-hooked bead): unhook all but the most recent agent
  • Invariant 3 (multiple in_progress MRs per rig): fail all but the oldest

H6: Hoist circuit breaker to reconcile() (priority: P2)

Call checkDispatchCircuitBreaker once in reconcile() and pass the result to reconcileBeads and reconcileReviewQueue. Saves one COUNT query per tick.

H7: Add bead_cancelled → stop_agent (priority: P2)

When a bead is cancelled while an agent is working on it, stop the agent's container immediately instead of waiting for heartbeat timeout. See E2.

H8: Parameterize the circuit breaker SQL (priority: P2)

Replace the template literal interpolation of CIRCUIT_BREAKER_WINDOW_MINUTES with a computed ISO timestamp and parameterized query. See B1.

H9: Extract shared helpers (priority: P3)

  • Deduplicate now() into util/time.util.ts
  • Consolidate hasExistingPrFeedbackBead / hasExistingFeedbackBead between reconciler.ts and actions.ts
  • Extract repeated AgentRow query patterns into helper functions

H10: Add comprehensive reconciler unit tests (priority: P1)

Create test/unit/reconciler.test.ts with controlled state setup (direct SQL inserts) to test:

  • Each reconcile rule in isolation with boundary timestamps
  • Circuit breaker open/close thresholds
  • GUPP tier progression
  • Cooldown backoff calculations
  • Idempotency of each rule (running twice produces same result)

Labels: P1 (Should fix before soft launch), gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
