audit(gastown): Reconciler code quality audit — bugs, gaps, and hardening recommendations #1986
Reconciler Code Quality Audit
Exhaustive line-by-line audit of the Gastown reconciler (reconciler.ts, actions.ts, events.ts) and supporting modules (agents.ts, beads.ts, review-queue.ts, Town.do.ts), compared against reconciliation-spec.md.
Files audited:
- src/dos/town/reconciler.ts (2130 lines)
- src/dos/town/actions.ts (974 lines)
- src/dos/town/events.ts (213 lines)
- src/dos/town/agents.ts (578 lines)
- src/dos/town/beads.ts (950 lines)
- src/dos/town/review-queue.ts (1453+ lines)
- src/dos/Town.do.ts (alarm method, ~300 lines)
- test/integration/reconciler.test.ts (489 lines)
- docs/gt/reconciliation-spec.md (1035 lines)
1. Bugs Found
B1: SQL interpolation in circuit breaker query — potential injection / broken query
File: reconciler.ts:63
Impact: HIGH — could break the circuit breaker or cause SQL injection
WHERE ... > strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-${CIRCUIT_BREAKER_WINDOW_MINUTES} minutes')

CIRCUIT_BREAKER_WINDOW_MINUTES (30) is interpolated directly into the SQL string via a JavaScript template literal, not via a parameterized query argument. While this constant is hardcoded (not user-supplied), this violates the project convention ("use parameterized queries") and creates a maintenance hazard — if this constant were ever derived from user input, it would be a SQL injection vector.
Fix: Use a parameterized approach: compute the cutoff timestamp in JS (new Date(Date.now() - CIRCUIT_BREAKER_WINDOW_MINUTES * 60_000).toISOString()) and pass it as a query parameter.
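A minimal sketch of that fix, assuming the surrounding query binds parameters positionally (the `sql.exec` usage shown in the comment is an assumption, not taken from the codebase):

```typescript
const CIRCUIT_BREAKER_WINDOW_MINUTES = 30;

// Compute the cutoff in JS so the SQL string contains no interpolation.
function circuitBreakerCutoff(nowMs: number = Date.now()): string {
  // ISO 8601 output compares correctly against timestamps stored via
  // toISOString()/strftime('%Y-%m-%dT%H:%M:%fZ', ...).
  return new Date(nowMs - CIRCUIT_BREAKER_WINDOW_MINUTES * 60_000).toISOString();
}

// Hypothetical usage:
// sql.exec(
//   `SELECT COUNT(*) AS n FROM town_actions
//      WHERE type = 'dispatch_agent' AND created_at > ?`,
//   circuitBreakerCutoff(),
// );
```

Accepting `nowMs` as a parameter also makes the cutoff trivially unit-testable with a fixed clock.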
B2: dispatch_agent side effect doesn't roll back agent status on failure
File: actions.ts:550-561
Impact: HIGH — agent stuck in 'working' with no container running
The spec (§5.4) says: "If [dispatch] not started: set agent back to idle. Bead stays in_progress. Reconciler will retry on next tick." But the implementation only catches the error and logs it — it does NOT set the agent back to idle:
return async () => {
await ctx.dispatchAgent(capturedAgentId, beadId, rigId).catch(err => {
console.warn(...); // No rollback!
});
};

The comment above says "the reconciler detects the mismatch on the next tick (idle agent hooked to in_progress bead) and retries dispatch" — but the agent is NOT idle, it's 'working'. The reconciler's Rule 2 only fires for status = 'idle' agents. The reconciler's reconcileAgents would eventually catch this via heartbeat loss (90s), but that's a 90-second delay for every failed dispatch.
Fix: In the .catch() handler, call agentOps.updateAgentStatus(sql, capturedAgentId, 'idle').
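A self-contained sketch of the rollback, with hypothetical stand-ins for `ctx.dispatchAgent` and `agentOps.updateAgentStatus` (the real signatures may differ):

```typescript
type Status = "idle" | "working";

interface Deps {
  dispatchAgent(agentId: string): Promise<void>;
  updateAgentStatus(agentId: string, status: Status): void;
}

// Deferred side effect: on dispatch failure, roll the agent back to 'idle'
// so Rule 2 can retry on the next tick instead of waiting out the 90s
// heartbeat timeout.
function makeDispatchEffect(deps: Deps, agentId: string) {
  return async () => {
    try {
      await deps.dispatchAgent(agentId);
    } catch (err) {
      console.warn(`dispatch failed for ${agentId}:`, err);
      deps.updateAgentStatus(agentId, "idle"); // rollback
    }
  };
}
```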
B3: merge_pr side effect writes SQL inside async closure
File: actions.ts:780-798, 812-831, 838-856
Impact: MEDIUM — violates phase separation (spec §1.5: "SQL mutations are synchronous and atomic")
The merge_pr action's deferred async function writes SQL (UPDATE beads, UPDATE review_metadata) in multiple places:
- Lines 783-789: clears auto_merge_pending when fresh feedback is found
- Lines 815-830: clears auto_merge_pending on merge failure
- Lines 838-856: clears auto_merge_pending in the catch handler
These SQL writes happen AFTER Promise.allSettled() in the alarm loop (Phase 2/4), potentially racing with the next alarm tick's Phase 0 event drain. If the alarm fires quickly, a concurrent drain could read stale auto_merge_pending state.
Fix: Extract the "clear auto_merge_pending" into a synchronous pre-check, or emit events from the async handler instead of direct SQL writes. The same issue exists in poll_pr (line 646-654: SQL writes inside async context).
B4: poll_pr calls checkPRFeedback twice per tick
File: actions.ts:608, 663
Impact: LOW — wasted API calls, doubled latency
When both auto_resolve_pr_feedback and auto_merge_delay_minutes are configured, checkPRFeedback is called once at line 608 and again at line 663. Each call is a GitHub API request.
Fix: Cache the result of the first call and reuse it.
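One way to share the result without restructuring the handler is a per-tick memoizer around the (hypothetical) checkPRFeedback call; the helper below is illustrative, not from the codebase:

```typescript
// Wrap a zero-arg async function so the underlying call happens at most
// once; both the auto-resolve and auto-merge paths then share one
// GitHub API request per tick.
function memoizeOnce<T>(fn: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => (cached ??= fn());
}

// Hypothetical usage inside the poll_pr side effect:
// const getFeedback = memoizeOnce(() => checkPRFeedback(prUrl));
// const feedback1 = await getFeedback(); // API call
// const feedback2 = await getFeedback(); // cached
```

Caching the promise (rather than the resolved value) also deduplicates concurrent callers.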
B5: reconcileReviewQueue double-evaluates circuit breaker
File: reconciler.ts:1024 vs reconciler.ts:614
Impact: LOW — unnecessary query per tick
checkDispatchCircuitBreaker(sql) is called in reconcileBeads (line 614) and again independently in reconcileReviewQueue (line 1021). The circuit breaker does a COUNT query on every call. These run in the same synchronous tick.
Fix: Pass the circuit breaker result from reconcile() into each sub-function, or hoist it to reconcile() and pass it down.
B6: Rule 4 (orphaned PR review) only checks in_progress, but spec says open too
File: reconciler.ts:1118-1134
Impact: MEDIUM — spec says Rule 4 applies to status in ('open', 'in_progress') for PR-strategy beads
The spec (§5.3 reconcileReviewQueue Rule 4) says:
for each bead where status in ('open', 'in_progress') and pr_url is not null ...
But the implementation only checks mr.status === 'in_progress' (line 1119). An open MR bead with a pr_url that is stale >30min would not be caught. This could happen if a PR URL is set on an MR bead but the bead is never transitioned to in_progress (e.g., via a race in setReviewPrUrl).
Fix: Change the guard to also handle mr.status === 'open'.
2. Missing Edge Cases
E1: No poison event protection
File: Town.do.ts:3463-3477
Impact: HIGH — a single malformed event blocks the entire reconciler queue
The alarm loop catches applyEvent errors but just skips to the next event, leaving the poison event unprocessed. The comment says "Mark it processed anyway after 3 consecutive failures" but this is NOT implemented. On every tick, the poison event is drained again and fails again, consuming resources forever.
Fix: Track failure count per event (e.g., in a retry_count column or in-memory map). After N failures, mark the event as processed with an error log / Sentry alert.
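A sketch of the in-memory variant (the retry_count column would be the durable alternative); class and constant names are illustrative:

```typescript
const MAX_EVENT_RETRIES = 3;

class PoisonTracker {
  private failures = new Map<string, number>();

  // Returns true when the event has failed too often and should be
  // marked processed (with an error log / Sentry alert) instead of retried.
  recordFailure(eventId: string): boolean {
    const n = (this.failures.get(eventId) ?? 0) + 1;
    this.failures.set(eventId, n);
    return n >= MAX_EVENT_RETRIES;
  }

  // Clear the counter once the event applies cleanly.
  recordSuccess(eventId: string): void {
    this.failures.delete(eventId);
  }
}
```

Note that an in-memory map resets when the Durable Object restarts, so a persistent retry_count column is the safer choice if restarts are frequent.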
E2: bead_cancelled event doesn't stop a working agent's container
File: reconciler.ts:259-289
Impact: MEDIUM — agent continues consuming credits after bead is cancelled
When bead_cancelled is applied, it unhooks the agent but doesn't emit a stop_agent side effect. The agent's container keeps running until the heartbeat times out (90s) or the reconciler's stale-heartbeat check fires. During that window, the agent consumes compute resources working on a cancelled bead.
Fix: After unhooking agents in bead_cancelled, also insert a stop_agent action (or defer a stop side effect).
E3: No handling of agent_done + agent_completed in same drain batch
File: reconciler.ts:201-228
Impact: MEDIUM — order-dependent behavior within a batch
If both agent_done and agent_completed events arrive for the same agent in the same drain batch, the outcome depends on insertion order. Normally agent_done runs first (gt_done fires before process exit), but if agent_completed runs first:
- Non-refinery: agentCompleted unhooks the agent → agentDone finds no hook → logs warning and returns (for polecats). The MR is never created.
- This is currently handled for refineries (line 507-547 in review-queue.ts), but not for polecats.
Fix: Document this as an accepted trade-off, or add recovery logic to agentDone for polecats similar to the refinery recovery path.
E4: container_status event with not_found uses JS timestamp comparison, not SQLite
File: reconciler.ts:324-326
Impact: LOW — potential timezone mismatch (though unlikely in practice since both use ISO 8601)
The 3-minute grace period is computed in JS (Date.now() - new Date(agent.last_activity_at).getTime()). This is correct as long as last_activity_at is stored as ISO 8601. The touchAgent function uses new Date().toISOString(), so this is consistent.
No fix needed, but worth noting for consistency.
E5: pr_feedback_detected event uses beadOps.createBead directly in applyEvent
File: reconciler.ts:392-407
Impact: LOW — spec violation: events should only update state, not create new entities
The spec says "Events represent facts that have occurred; applying them updates state to reflect those facts." Creating a new bead is a side effect, not a state update. This blurs the boundary between event application and reconciliation.
Fix: Consider moving feedback bead creation to a reconciler rule instead of the event handler.
E6: Convoy progress double-counting path
File: reconciler.ts:1440-1470 and beads.ts:330-469
Impact: LOW — redundant but not harmful
reconcileConvoys recounts closed beads and emits update_convoy_progress. But beads.updateBeadStatus also calls updateConvoyProgress inline when a bead reaches a terminal state. Both paths update the same counter. The reconciler path is a safety net, which is good, but the dual-write means the counter is updated twice on terminal transitions.
No fix needed — redundancy is acceptable for safety.
3. Race Conditions
R1: dispatch_agent action creates agent + hooks + sets working — all non-atomic with container start
File: actions.ts:522-561
Impact: HIGH — window where agent is 'working' but container hasn't started
Between the synchronous SQL mutations (agent → working, bead → in_progress) and the async container dispatch, there's a window where:
- Agent is 'working' + bead is 'in_progress'
- Container start fails
- Agent stays 'working' (see B2 above) until heartbeat timeout
During this window, the reconciler won't retry dispatch because the agent appears healthy (status = 'working'). Combined with B2, this creates a 90-second dead zone per failed dispatch.
Fix: See B2. Additionally, consider checking container health in the pre-dispatch phase rather than relying solely on post-dispatch heartbeat monitoring.
R2: set_convoy_ready_to_land and create_landing_mr in same tick
File: reconciler.ts:1508-1584
Impact: LOW — landing MR creation delayed by one tick
When a convoy first becomes ready to land, the reconciler emits set_convoy_ready_to_land (line 1510-1513). The ready_to_land flag is checked on line 1516 to decide whether to create the landing MR. But since set_convoy_ready_to_land is an action that hasn't been applied yet (it's in the same action batch), the check at line 1516 reads stale state from the database.
Result: the landing MR is created on the NEXT tick, not the current one. This is a one-tick delay (5 seconds), not a bug.
Fix: Either accept the one-tick delay (harmless) or restructure to emit both actions in the same check (when closed_count >= total_count and no landing MR exists, emit both).
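The restructured check could look like the sketch below: derive "done" once and use it directly, rather than re-reading the not-yet-persisted ready_to_land flag. The Convoy/Action shapes are assumptions based on the audit text:

```typescript
interface Convoy {
  id: string;
  closedCount: number;
  totalCount: number;
  readyToLand: boolean;
  landingMrId: string | null;
}

type Action =
  | { type: "set_convoy_ready_to_land"; convoyId: string }
  | { type: "create_landing_mr"; convoyId: string };

function convoyLandingActions(c: Convoy): Action[] {
  const actions: Action[] = [];
  const done = c.totalCount > 0 && c.closedCount >= c.totalCount;
  if (!done) return actions;
  if (!c.readyToLand) {
    actions.push({ type: "set_convoy_ready_to_land", convoyId: c.id });
  }
  // Decide from `done` directly, not from the stale readyToLand flag,
  // so both actions can be emitted in the same tick.
  if (c.landingMrId === null) {
    actions.push({ type: "create_landing_mr", convoyId: c.id });
  }
  return actions;
}
```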
R3: Concurrent agentDone and agentCompleted for refineries
File: review-queue.ts:681-782
Impact: MEDIUM — already mitigated but fragile
The agentCompleted handler deliberately does NOT unhook refineries (lines 691-703) to avoid racing with agentDone. This is correct. However, the guard at line 763-778 that sets the refinery to idle has a subtle race: if agentDone runs between agentCompleted's read and write, the WHERE NOT (status = 'working' AND current_hook_bead_id IS NOT NULL) guard prevents clobbering.
This is well-handled. No fix needed.
R4: poll_pr touches updated_at on every tick, defeating staleness checks
File: actions.ts:580-588
Impact: LOW — intentional, but creates a coupling
The poll_pr action synchronously updates updated_at on the MR bead to prevent the 30-minute orphaned PR timeout (Rule 4) from firing while the PR is actively being polled. This is correct but means Rule 4 can NEVER fire as long as polling continues every 5 seconds.
If polling fails silently (e.g., GitHub rate limit returns 200 with an error body that isn't detected), the MR bead will appear non-stale forever.
Fix: Consider a separate last_poll_at field on review_metadata instead of overloading updated_at.
4. Spec Divergences
S1: transition_bead.from is nullable in implementation, required in spec
File: actions.ts:31 vs spec §4.2
Impact: LOW — the implementation is more lenient
The spec defines from: BeadStatus (required), but the implementation uses from: z.string().nullable(). Several call sites pass from: null (e.g., reconciler.ts:794, 1335), meaning the from-state guard is skipped entirely. The applyAction for transition_bead (actions.ts:309-319) doesn't check from at all — it always applies the transition.
Fix: Either enforce the from-state guard as the spec requires, or update the spec to document that from is informational.
S2: Spec says GUPP warn at 30 min, implementation uses 15 min
File: reconciler.ts:1722 vs spec §5.3 (GUPP_WARN_MS = 30 min)
Impact: LOW — intentional tightening
The spec says GUPP warn threshold is 30 minutes. The implementation uses 15 minutes (line 1722: elapsed > 15 * 60_000). The comment says "Tighter warn threshold (15min vs old 30min) using SDK activity." This is an intentional deviation.
Fix: Update the spec to reflect the 15-minute threshold.
S3: Spec says agent_completed closes/fails bead, implementation doesn't for completed status
File: review-queue.ts:704-749 vs spec §5.2
The spec says:
if agent.current_hook_bead_id:
status = event.payload.status == 'completed' ? 'closed' : 'failed'
transitionBead(sql, agent.current_hook_bead_id, status)
But the implementation (lines 718-749) does NOT close the bead when status = 'completed' and the bead is still in_progress. Instead, it leaves the bead as in_progress and lets reconciler Rule 3 recover it after 5 minutes. This is intentional (documented in comments) — gt_done is the canonical completion path for polecats.
Fix: Update spec to match implementation.
S4: reconcileBeads Rule 3 timeout is 5 min, spec says 2 min
File: reconciler.ts:103 vs spec §5.3 Rule 3 (stale(updated_at, 2 minutes))
Impact: MEDIUM — longer recovery time than spec
STALE_IN_PROGRESS_TIMEOUT_MS is 5 minutes. The spec says 2 minutes. The code comment explains: "Must be longer than AGENT_IDLE_TIMEOUT_MS (2 min) + one alarm tick (5s) to avoid racing with the idle-timer → agentCompleted → reconciler flow."
Fix: Update spec to 5 minutes with the rationale.
S5: Spec event type list is incomplete
File: spec §3.1 vs reconciler.ts event handlers
Impact: LOW — spec is stale
The implementation handles additional event types not in the spec:
- container_eviction (line 359-364)
- pr_feedback_detected (line 372-408)
- pr_auto_merge (line 410-434)
Fix: Update spec with the new event types.
S6: refineryCodeReview option is threaded through but unused [RETRACTED]
File: reconciler.ts:448, 455, 1018
Correction (verified against source at 30dfbde): This finding is inaccurate. The refineryCodeReview option IS actively used. At reconciler.ts:1141, it is read: const refineryCodeReview = opts?.refineryCodeReview ?? true. At lines 1143-1172, when !refineryCodeReview, open MR beads with PR URLs are fast-tracked to in_progress (bypassing refinery dispatch). At line 1176, it gates Rules 5-6 (refinery dispatch logic). This is functional, not dead code.
5. Code Quality Issues
C1: Repeated AgentRow query pattern (5+ copies)
File: reconciler.ts:478-493, 537-553, 740-757, 1179-1196, 1274-1291, 1370-1387, 1611-1630, 1709-1726
Impact: Maintainability
The same SELECT + JOIN + parse pattern for agent rows is duplicated ~8 times. Each copy selects slightly different columns but uses the same AgentRow schema. A helper function like queryWorkingAgents(sql, filters) would reduce duplication.
C2: hasExistingPrFeedbackBead / hasExistingFeedbackBead — similar but not identical
File: reconciler.ts:1881-1898 and actions.ts:939-956
Correction (verified against source at 30dfbde): These functions are similar but NOT identical. The reconciler version (hasExistingPrFeedbackBead at line 1881) queries bd.bead_id = ? (finds feedback beads that block the MR bead). The actions version (hasExistingFeedbackBead at line 939) uses the same direction but has a slightly different name and lives in a different module. The SQL is semantically equivalent in this case, but they are named differently (hasExistingPrFeedbackBead vs hasExistingFeedbackBead). Still worth deduplicating, but the original claim of "identical implementations" overstates it.
Fix: Consolidate into a shared helper and export from one module.
C3: Magic numbers for timeouts
File: reconciler.ts:324 (180 seconds grace), reconciler.ts:1677 (15 * 60_000), reconciler.ts:513 (90_000)
Impact: Readability
Some timeouts are named constants (e.g., STUCK_REVIEW_TIMEOUT_MS), others are inline magic numbers. The 180-second cold start grace (line 325), 90-second heartbeat timeout (line 513), and 15-minute GUPP warn (line 1677) should be named constants for clarity.
C4: now() helper duplicated in 5 files
File: actions.ts:290-292, agents.ts:51-53, beads.ts:60-62, events.ts:22-24, review-queue.ts:36-38
Every module defines its own now() function. This should be a shared utility.
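The shared utility is a one-liner; the module path below is a suggestion, not an existing file:

```typescript
// util/time.util.ts (suggested location)
// Single source of truth for "current time as ISO 8601 string",
// matching the format all five modules already produce.
function now(): string {
  return new Date().toISOString();
}
```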
6. Missing Tests
T1: No test for circuit breaker
Impact: HIGH — a broken circuit breaker could either block all dispatches or allow infinite retries
The circuit breaker (checkDispatchCircuitBreaker) has no test coverage. Tests should verify:
- Breaker opens when threshold is exceeded
- Breaker allows dispatches below threshold
- Breaker emits a notify_mayor action
T2: No test for GUPP (reconcileGUPP)
Impact: HIGH — GUPP bugs cause stuck agents consuming credits
No tests for any GUPP tier (warn, escalate, force_stop). Tests should verify:
- 15-min idle → warn nudge
- 1-hour → escalate + triage request
- 2-hour → force stop + triage request
- Draining skips GUPP
- hasRecentNudged dedup works correctly
T3: No test for event idempotency
Impact: HIGH — duplicate events could cause double transitions
No test verifies what happens when the same event is applied twice. Key scenarios:
- Double agent_done for the same agent
- Double pr_status_changed (merged) for the same MR
- bead_cancelled for an already-cancelled bead
T4: No test for reconcileReviewQueue Rules 2, 3, 4, 6, 7
Impact: HIGH — stuck MR recovery is the most common failure mode
Only Rule 5 (refinery dispatch) and Rule 6 (dispatch limits) have test coverage. Missing:
- Rule 2: Stuck review recovery (no working agent, no PR, >30 min)
- Rule 3: Abandoned MR recovery (no agent hooked, >2 min)
- Rule 4: Orphaned PR review (>30 min, no working agent)
- Rule 7: Working refinery with terminal MR → stop
T5: No test for convoy landing MR creation
Impact: MEDIUM — broken convoy landing blocks all review-then-land convoys
reconcileConvoys has a test for auto-close (review-and-merge) and verifying review-then-land doesn't auto-close, but no test for:
- Landing MR creation when ready_to_land is set
- Landing MR merged → convoy closes
- Failed landing MR handling
T6: No test for applyEvent edge cases
Impact: MEDIUM — no coverage for:
- pr_feedback_detected → feedback bead creation
- pr_auto_merge → auto_merge_pending metadata
- container_status with not_found during cold start grace period
- bead_cancelled with hooked agents
T7: No unit tests for reconciler rules in isolation
Impact: MEDIUM — all tests are integration tests
The integration tests are valuable but slow and can't easily test timing-dependent scenarios (staleness thresholds, cooldown backoff). Pure unit tests with a mock SqlStorage could test individual rules with controlled timestamps.
T8: No test for invariant checker
Impact: LOW — invariant checker is observability-only (no auto-correction)
checkInvariants is never tested. Should verify all 5 invariant checks detect violations.
7. Hardening Recommendations
H1: Implement poison event protection (priority: P0)
Add a retry_count column to town_events or track retries in-memory. After 3 failures, mark the event processed and alert via Sentry. This prevents a single bad event from blocking all reconciliation.
H2: Fix dispatch rollback (priority: P0)
In dispatch_agent's async side effect catch handler (actions.ts:555), set the agent back to 'idle' on failure. This eliminates the 90-second dead zone described in B2/R1.
H3: Add from-state guards to transition_bead (priority: P1)
The spec requires checking action.from against the current bead status. The implementation skips this check. Without it, concurrent rule firings can produce unexpected transitions (e.g., a bead goes open→in_progress→open→in_progress in rapid succession). Add a guard that logs a warning and skips the transition when action.from doesn't match.
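A minimal sketch of the guard; the BeadStatus values are inferred from statuses mentioned elsewhere in this audit, and the function name is mine:

```typescript
type BeadStatus = "open" | "in_progress" | "closed" | "failed" | "cancelled";

// Returns true when the transition should be applied. A null `from`
// remains informational (skips the guard), matching current call sites
// that pass from: null.
function shouldApplyTransition(
  current: BeadStatus,
  from: BeadStatus | null,
): boolean {
  if (from === null) return true;
  if (current !== from) {
    console.warn(
      `transition_bead skipped: expected from='${from}', found '${current}'`,
    );
    return false;
  }
  return true;
}
```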
H4: Add last_poll_at to review_metadata (priority: P1)
Stop overloading updated_at for poll liveness (see R4). A dedicated last_poll_at field on review_metadata would decouple poll freshness from general bead update semantics, making staleness checks more reliable.
H5: Add auto-recovery for invariant violations (priority: P1)
Currently, checkInvariants only logs violations. Add auto-corrective actions:
- Invariant 7 (working + no hook): transition agent to idle
- Invariant 6 (multi-hooked bead): unhook all but the most recent agent
- Invariant 3 (multiple in_progress MRs per rig): fail all but the oldest
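The three corrective policies above could be expressed as a pure mapping from violations to actions; the violation and action shapes below are hypothetical, and the "most recent" / "oldest" ordering assumptions are noted in comments:

```typescript
type Violation =
  | { invariant: 7; agentId: string }                     // working agent, no hook
  | { invariant: 6; beadId: string; agentIds: string[] }  // multi-hooked bead
  | { invariant: 3; rigId: string; mrIds: string[] };     // multiple in_progress MRs

type Fix =
  | { type: "set_agent_idle"; agentId: string }
  | { type: "unhook_agents"; agentIds: string[] }
  | { type: "fail_mrs"; mrIds: string[] };

function correctiveActions(v: Violation): Fix[] {
  switch (v.invariant) {
    case 7:
      return [{ type: "set_agent_idle", agentId: v.agentId }];
    case 6:
      // Assumes agentIds is ordered oldest-first: keep the most recent
      // hook (last element), unhook the rest.
      return [{ type: "unhook_agents", agentIds: v.agentIds.slice(0, -1) }];
    case 3:
      // Assumes mrIds is ordered oldest-first: keep the oldest, fail the rest.
      return [{ type: "fail_mrs", mrIds: v.mrIds.slice(1) }];
  }
}
```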
H6: Hoist circuit breaker to reconcile() (priority: P2)
Call checkDispatchCircuitBreaker once in reconcile() and pass the result to reconcileBeads and reconcileReviewQueue. Saves one COUNT query per tick.
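Structurally, the hoist is just evaluate-once-and-thread-down; the signatures below are assumptions, not the actual ones:

```typescript
interface BreakerState {
  open: boolean;
  recentFailures: number;
}

// Evaluate the circuit breaker once per tick and pass the result to
// both sub-reconcilers, saving one COUNT query per tick.
function reconcileTick(
  checkBreaker: () => BreakerState,
  reconcileBeads: (b: BreakerState) => void,
  reconcileReviewQueue: (b: BreakerState) => void,
): void {
  const breaker = checkBreaker(); // single COUNT query
  reconcileBeads(breaker);
  reconcileReviewQueue(breaker);
}
```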
H7: Add bead_cancelled → stop_agent (priority: P2)
When a bead is cancelled while an agent is working on it, stop the agent's container immediately instead of waiting for heartbeat timeout. See E2.
H8: Parameterize the circuit breaker SQL (priority: P2)
Replace the template literal interpolation of CIRCUIT_BREAKER_WINDOW_MINUTES with a computed ISO timestamp and parameterized query. See B1.
H9: Extract shared helpers (priority: P3)
- Deduplicate now() into util/time.util.ts
- Consolidate hasExistingPrFeedbackBead / hasExistingFeedbackBead between reconciler.ts and actions.ts
- Extract repeated AgentRow query patterns into helper functions
H10: Add comprehensive reconciler unit tests (priority: P1)
Create test/unit/reconciler.test.ts with controlled state setup (direct SQL inserts) to test:
- Each reconcile rule in isolation with boundary timestamps
- Circuit breaker open/close thresholds
- GUPP tier progression
- Cooldown backoff calculations
- Idempotency of each rule (running twice produces same result)