fix(reconciler): P0+P1 bug fixes — poison events, Rule 4, stop_agent on cancel#2001
Closed
fix(reconciler): P0+P1 bug fixes — poison events, Rule 4, stop_agent on cancel#2001
Conversation
- docs/local-debug-testing.md: comprehensive guide covering debug endpoints, container monitoring, drain testing, wrangler management, drain architecture, and common issues - scripts/test-drain.sh: automated end-to-end drain test that sends work, waits for an agent, triggers graceful stop, monitors the drain, and prints PASS/FAIL with diagnostics
…#1979) fix(gastown): also refresh container token when mayor is alive (waiting) Refine the refreshContainerToken guard to check for an alive mayor (working/stalled/waiting) in addition to hasActiveWork(). A waiting mayor is alive in the container but hasActiveWork() returns false — without this fix the 8h container token can expire, stranding GT tool calls when the user sends the next message via sendMayorMessage (which reuses the container without calling ensureContainerToken). Co-authored-by: John Fawcett <john@kilcoode.ai>
* feat(gastown): skip review queue for gt:pr-fixup beads (#1983) * feat(gastown): skip review queue for gt:pr-fixup beads in agentDone() * style: fix formatting in local-debug-testing.md --------- Co-authored-by: John Fawcett <john@kilcoode.ai> * feat(gastown): thread labels parameter through gt_sling tool chain Add optional labels[] parameter to gt_sling so the mayor can tag beads at creation time. The parameter flows through: mayor-tools.ts → client.ts → mayor-tools.handler.ts → Town.do.ts slingBead() → createBead() createBead() already accepts labels, so no further changes needed. * feat(gastown): add pr_fixup_context to PrimeContext and buildPrimeContext Add pr_fixup_context field to PrimeContext type and populate it in buildPrimeContext when the hooked bead has the gt:pr-fixup label, mirroring the existing rework_context pattern. Extracts pr_url, branch, and target_branch from bead metadata. --------- Co-authored-by: John Fawcett <john@kilcoode.ai>
…dev libraries (#1978) * feat(container): expand apt-get block with build tools, ripgrep, and dev libraries Add comprehensive package list to both Dockerfile and Dockerfile.dev: - Build toolchain: build-essential, autoconf - Search tools: ripgrep, jq - Network: wget, gnupg, unzip - Compression: bzip2, zstd - SSL/crypto: libssl-dev, libffi-dev - Database client libs: libdb-dev, libgdbm-dev, libgdbm6 - Python build deps: libbz2-dev, liblzma-dev, libncurses5-dev, libreadline-dev, zlib1g-dev - Ruby build deps: libyaml-dev - Image processing: libvips-dev - Browser/rendering: libgbm1 - Native addon support: libc++1, libgmp-dev - Timezone data: tzdata Closes #1976 * fix(container): remove shell comments from inside backslash-continued apt-get install Shell comments inside backslash-continued lines break the apt-get install command — the # starts a comment that swallows the rest of the logical line, dropping all subsequent packages. Move category info to a block comment above the RUN statement instead. --------- Co-authored-by: John Fawcett <john@kilcoode.ai>
…_review toggle (#1586) * feat(gastown): auto-resolve PR feedback and auto-merge with grace period Implements the core infrastructure for automatically addressing unresolved review comments and failing CI checks on PRs, plus configurable auto-merge with a grace period timer. Closes #1002 * fix(gastown): address PR review feedback — pagination safety, merge failure recovery, repo validation - GraphQL reviewThreads query now includes pageInfo.hasNextPage; if there are more than 100 threads, conservatively treats PR as having unresolved comments to prevent auto-merging with unchecked feedback - merge_pr action clears auto_merge_pending flag on failure (405/409/error) and resets the auto_merge_ready_since timer so normal polling resumes - reconcileReviewQueue now emits poll_pr alongside merge_pr so PR status changes (merged/closed by human) are still observed during merge attempts - merge_pr validates the PR URL's owner/repo against the rig's git_url before calling the merge API, preventing merges targeting unrelated repos * fix(gastown): check-runs pagination and fresh feedback check before merge - check-runs API now uses per_page=100 and compares total_count to detect unpaginated runs; if more runs exist than returned, allChecksPass is conservatively set to false - merge_pr re-checks feedback (comments + CI) immediately before calling the merge API; if state regressed since the last poll, clears auto_merge_pending and resets the timer * fix(gastown): conservative hasFailingChecks for unpaginated check-runs, fix CI failures - hasFailingChecks is now conservatively true when total_count > returned runs, so PRs with >100 check-runs correctly trigger feedback handling - Replace `as` casts with Zod safeParse for GitHub API responses (checkPRFeedback) to satisfy oxlint no-unnecessary-type-assertion - Fix test constant expressions flagged by TS2871 and oxlint - Run oxfmt on changed files * fix(gastown): separate hasUncheckedRuns from hasFailingChecks in PR feedback hasFailingChecks now only reflects actual failures from inspected runs. A new hasUncheckedRuns field tracks when the check-runs response was paginated. allChecksPass is still conservatively false when runs are unchecked (blocking auto-merge), but the feedback bead is no longer dispatched for phantom CI failures on repos with >100 passing checks. * style: auto-format local-debug-testing.md * feat(gastown): polecat creates PRs, refinery reviews via GitHub, code_review toggle - Polecat creates PRs directly when merge_strategy is 'pr', passing pr_url to gt_done. Removes a full agent dispatch cycle from the flow. - Refinery reviews existing PRs via gh pr review, adding inline comments that go through the same auto-resolve feedback loop as human reviews. - New refinery.code_review toggle (default: true) lets users disable refinery code review when they have an external review bot. - Fix: allChecksPass treats 0 check-runs as passing for repos without CI. - Fix: reconciler Rule 7 stops refineries hooked to terminal MR beads, preventing post-merge review comments. - Convoy-aware: review-then-land intermediate beads skip PR creation. * fix(gastown): include hasUncheckedRuns in pr_feedback_detected condition When check-runs are paginated (>100), the first page may all pass but later pages may have failures. Previously only hasFailingChecks was checked, which only reflects the first page. Now hasUncheckedRuns also triggers feedback detection so the auto-resolve path can fix CI. * fix(gastown): include auto_merge columns in REVIEW_JOIN query REVIEW_JOIN was missing auto_merge_ready_since and last_feedback_check_at from the SELECT, causing MergeRequestBeadRecord parse failures on pr_status_changed events. * fix(gastown): refinery uses gh pr comment instead of --approve The bot account that created the PR cannot approve its own PRs. The refinery now posts a comment when the review passes instead of using gh pr review --approve. * fix(gastown): set parent_bead_id on rework and PR feedback beads Rework beads (from refinery review) and PR feedback beads (from poll_pr comment detection) now set parent_bead_id to the MR bead that triggered them. This links them into the bead chain so the relationship is visible in the UI and queryable via parent traversal. * fix(gastown): skip orphaned-review timeout when code_review disabled, fix inline comment syntax - Rule 4 orphaned-review timeout now skips when refineryCodeReview is false — poll_pr keeps MR beads alive, no refinery expected. - Removed invalid 'gh pr review --comment path:line' syntax from refinery prompt; consolidated on the GitHub API example which correctly creates inline review threads.
… auto-merge on toggle - Also check combined commit status API alongside check-runs; repos using legacy status checks are no longer missed. - Propagate hasUncheckedRuns through pr_feedback_detected event so the feedback bead title and prompt reflect the actual reason. - Gate auto-merge timer on refineryConfig.auto_merge in addition to auto_merge_delay_minutes to respect the toggle.
…, stuck MR beads (#1632), session leak (#1341) * fix(gastown): preserve org billing context across model changes (#1756) Store organizationId as a standalone env var (GASTOWN_ORGANIZATION_ID) that survives KILO_CONFIG_CONTENT rebuilds. Three defense layers: 1. Set GASTOWN_ORGANIZATION_ID on /agents/start and read it first in extractOrganizationId() before falling back to config extraction. 2. Pass organizationId in the PATCH /model request body so the container receives it even if X-Town-Config parsing fails. 3. Include organization_id in buildContainerConfig and sync it to process.env in the PATCH handler's X-Town-Config processing. * fix(gastown): stop false-positive invariant violations spamming logs every 5s Invariant 7: Exclude mayors from the 'working agent must have a hook' check — mayors are always working and intentionally hookless. Invariant 5: Widen valid convoy bead states to include in_progress and in_review, which are legitimate transient states while child beads are being worked on. Only flag truly unexpected states. Fixes #1364 * fix(gastown): prevent MR beads stuck in_progress when github_token missing (#1632) - Fall back to github_cli_pat then platform integration token in checkPRStatus - Track consecutive null poll results; fail MR bead after 10 nulls - Rate-limit PR polling to once per minute per MR bead via last_poll_at metadata - Store failureReason and failureMessage in bead metadata on permanent failure * fix: format and lint errors in reconciler.ts - Add explicit 'unknown' type annotation to lastPollAt to fix no-unsafe-assignment lint error - Auto-format ternary expression per oxfmt rules * fix(gastown): prevent duplicate session leak when restarting a starting agent Thread an AbortController through the startAgent() startup sequence so that when a restart is requested for an agent still in 'starting' status, the in-flight startup is cancelled before session.create() can produce an orphaned session. - Add startupAbortController field to ManagedAgent type - In startAgent: abort existing startup controller when agent is 'starting' - Check signal.aborted after each async step (ensureSDKServer, session.create, before session.prompt) - On abort: decrement sessionCount, remove agent entry, clean up SDK instance if no sessions remain - In stopAgent: also abort startupAbortController for 'starting' agents - Introduce StartupAbortedError to distinguish abort from real failures Fixes #1341 * fix(gastown): address review comments — reset poll_null_count on all non-null polls, abort orphaned session on startup abort, guard agents.delete with identity check * fix(gastown): store session ID before abort check to prevent orphaned session leak After session.create() resolves, parse and store the session ID on the agent object BEFORE checking signal.aborted. This ensures the catch block has agent.sessionId available to call session.abort(), closing the race window where an abort during session.create() could leak an orphaned session. Also fixes format-check CI by running oxfmt.
Add retry_count tracking to town_events table so that events which repeatedly fail applyEvent are marked as poisoned after 3 attempts instead of blocking the entire reconciler queue forever. - Add retry_count column to town_events schema with migration - Increment retry_count on each applyEvent failure - Skip and mark events as processed once they exceed MAX_EVENT_RETRIES - Log + Sentry alert when an event is poisoned with full context Closes #1986 items E1, H1
…s with stale pr_url The reconcileReviewQueue Rule 4 (orphaned PR review) only checked mr.status === 'in_progress', missing open beads with a pr_url that have been stale >30min. The from field already uses mr.status dynamically, so the transition correctly records the actual status.
…g agent When bead_cancelled is applied, the handler unhooks agents but did not stop their containers, leaving them running for up to 90s until the heartbeat timeout. Now applyEvent returns Action[] so that bead_cancelled can emit stop_agent side effects for agents that were working/stalled. All three callers in Town.do.ts (alarm loop, debug replay, dry run) collect and merge event-emitted actions into the reconciler action list.
B4: Cache the checkPRFeedback result in poll_pr so both auto_resolve and auto_merge blocks reuse a single GitHub API call instead of making two identical requests per tick. B5: Hoist checkDispatchCircuitBreaker from reconcileBeads and reconcileReviewQueue into the top-level reconcile() function, passing the result down as parameters. Saves one SQL COUNT query per tick.
Contributor
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Files Reviewed (2 files)
Reviewed by gpt-5.4-20260305 · 317,902 tokens |
630d4f9 to
c2e87af
Compare
added 2 commits
April 4, 2026 16:19
Replace direct SQL writes inside async closures in merge_pr and poll_pr action handlers with event emissions. These writes previously raced with the next alarm tick's Phase 0 event drain. New event types: - auto_merge_cleared: clears auto_merge_pending + resets timer - poll_nonnull_result: resets consecutive null counter - poll_null_result: increments null counter, fails bead at threshold - poll_feedback_checked: updates last_feedback_check_at + chains pr_feedback_detected event - poll_auto_merge_update: manages auto_merge_ready_since timer + chains pr_auto_merge event All state mutations now happen synchronously in applyEvent during Phase 0, matching the spec's phase separation model.
When the 'from' field is non-null, verify the bead's current status matches before applying the transition. This prevents concurrent rule firings from producing unexpected state transitions. When 'from' is null, existing behavior is preserved (no guard). Refs: #1986 (H3, S1)
When dispatch_agent's async side effect fails (container start fails), roll the agent back to 'idle' instead of leaving it stuck in 'working'. This eliminates the 90-second dead zone waiting for heartbeat timeout detection. The bead stays in_progress so the reconciler retries on the next tick (§5.4). Addresses #1986 items B2, R1, H2.
* fix(reconciler): parameterize circuit breaker SQL query Replace strftime template literal interpolation with a JS-computed cutoff timestamp passed as a bound SQL parameter. Same 30-minute window, no behavioral change. * WIP: container eviction save --------- Co-authored-by: John Fawcett <john@kilcoode.ai>
… and Rule 4 scope - dispatch_agent: handle resolved-false from dispatchAgent() in addition to thrown errors, so the agent rolls back to 'idle' in both failure modes - reconciler Rule 4: revert to only matching in_progress beads, since open beads with a pr_url are legitimately queued and not orphaned
Resolved 9 merge conflicts between the PR branch and main: - mayor-tools.ts: use args.metadata directly (main's pattern) - control-server.ts: include else branch to delete env var (main's addition) - town-events.table.ts: remove B3 deferred event types (main reverted to direct SQL writes) - Town.do.ts: use resolveGitHubToken helper, keep eventActions merging - actions.ts: use direct SQL writes in async handlers (main's pattern), keep hasExistingFeedbackBead guard - reconciler.ts: remove unused B3 event handlers, keep circuit breaker hoisting - polecat-system.prompt.ts: add PR Fixup Workflow section from main - TownSettingsPageClient.tsx: use main's flex-1 layout pattern - e2e-pr-feedback-testing.md: include main's new sections
Contributor
Author
|
Closing: convoy went off the rails. Will re-create from scratch. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
retry_countcolumn totown_eventstable with idempotent migration. Events that failapplyEventare retried up to 3 times before being marked as poison (processed) to unblock the queue. Sentry alerts fire for poison events. Prevents a single bad event from blocking all event processing forever.reconcileReviewQueueRule 4 (orphaned PR review) now handles bothin_progressandopenMR beads with apr_url. Previously only checkedin_progress, missing cases where a refinery created a PR but the bead stayedopenand stale >30min.applyEventreturn type fromvoidtoAction[]so thebead_cancelledhandler can emitstop_agentactions for agents inworking/stalledstate. The agent is unhooked first (existing behavior), then astop_agentaction triggers container teardown immediately instead of waiting 90s for the heartbeat timeout. All three callers inTown.do.ts(alarm loop, debug replay, dry run) collect and merge event-emitted actions into the reconciler action list.Verification
incrementRetryCountusesUPDATE ... RETURNING,markPoisonedsetsprocessed_atto remove from drain queue!mr.pr_urlvsmr.pr_url)stop_agentaction type is well-defined inactions.tsand the agent ID (row.bead_id) is the correct PK fromagent_metadataascasts, no!operators, no secrets, no build artifactsclient.test.ts)Visual Changes
N/A
Reviewer Notes
markPoisonedfunction is functionally identical tomarkProcessed(both setprocessed_at). The semantic distinction is in the name only; theretry_countcolumn serves as the actual indicator of a poison event. This is acceptable.insertEventfunction doesn't includeretry_count, relying on the DDLdefault 0— correct behavior.