feat(container): graceful container eviction via SIGTERM drain#1825
Open
feat(container): graceful container eviction via SIGTERM drain#1825
Conversation
Contributor
Code Review SummaryStatus: 3 Issues Found | Recommendation: Address before merge Overview
Fix these issues in Kilo Cloud Issue Details (click to expand)WARNING
Other Observations (not in diff)None. Files Reviewed (20 files)
Reviewed by gpt-5.4-20260305 · 1,654,801 tokens |
…1797) * feat(gastown): add container_eviction event type and /container-eviction endpoint Add foundation for Tier 1.5 graceful container eviction: - Add container_eviction to TownEventType enum - Add POST /api/towns/:townId/container-eviction route with container JWT auth for the container to signal SIGTERM receipt - Add recordContainerEviction() RPC on TownDO that inserts the event and sets a draining flag in DO KV storage - Add isDraining() RPC for reconciler consumption - Clear draining flag on agent heartbeat (container alive after restart) - Handle container_eviction in reconciler applyEvent (no-op, audit only) * feat(gastown): skip dispatch_agent actions when town is draining Add draining flag check to reconciler dispatch rules for Tier 1.5 graceful container eviction. When the town is in draining state (container eviction in progress), all dispatch_agent emissions are suppressed with a log message. Non-dispatch rules (agent health, convoy progress, GUPP, GC) continue to function normally during drain. Guards added to: - reconcileBeads Rule 1 (unassigned open beads) - reconcileBeads Rule 2 (idle agents with hooks) - reconcileReviewQueue Rule 5 (refinery dispatch for open MRs) - reconcileReviewQueue Rule 6 (idle refinery re-dispatch) * style: apply oxfmt formatting to reconciler * fix(gastown): replace heartbeat drain clearing with nonce-based /container-ready acknowledgment Heartbeats from the old (draining) container could race with the eviction signal and prematurely clear the drain flag, re-enabling dispatch into a container that was shutting down. Changes: - recordContainerEviction() now generates a drain nonce and returns it - New acknowledgeContainerReady(nonce) method validates the nonce before clearing drain - Heartbeat responses include the drainNonce when draining, so the replacement container can call /container-ready - New POST /container-ready worker endpoint - Drain auto-expires after 15 minutes as a safety net - Container heartbeat module detects drainNonce in responses and calls /container-ready to clear drain on the replacement container * fix(gastown): fix drain nonce race and idle container readiness Two issues fixed: 1. Heartbeat nonce TOCTOU race: touchAgentHeartbeat() now returns the drain nonce atomically in a single DO call instead of requiring a separate getDrainNonce() RPC. This prevents an in-flight heartbeat from the old container from observing a nonce generated between two separate calls. 2. Idle container readiness: ensureContainerReady() now passes X-Drain-Nonce and X-Town-Id headers to the container health check when draining. The container's /health endpoint reads these and calls /container-ready, handling the case where a replacement container has no running agents and the per-agent heartbeat loop has nothing to iterate. Also adds GET /drain-status endpoint for debugging. * fix(gastown): delay drain nonce handoff via health check until old container exits ensureContainerReady() talks to whichever container is currently serving this town. During drain, that is still the old container. Passing X-Drain-Nonce immediately would let the old container clear drain before the replacement is up. Now the nonce is only passed via health check headers after 11 minutes (beyond the 10-min drainAll wait + exit), ensuring the old container has exited before the handoff. * fix(gastown): skip ensureContainerReady early-return when draining ensureContainerReady() bails out early when there is no active work and the rig is not recently configured. This prevented the drain nonce handoff from ever reaching the replacement container in idle towns. Now the draining flag bypasses that early-return so the health check (with X-Drain-Nonce headers) always fires.
…iner eviction (#1795) * feat(container): add drainAll() to process-manager for graceful container eviction Implements 4-phase drain sequence: 1. Notify TownDO via POST /container-eviction 2. Nudge running polecat/refinery agents via sendMessage() 3. Poll up to 10 min waiting for agents to finish 4. Force-save stragglers with WIP git commit + push via Bun.spawn * fix(tracking): only resolve signup product when callbackPath was explicitly provided (#1793) * fix(tracking): only resolve signup product when callbackPath was explicitly provided * style: apply oxfmt formatting to after-sign-in route * fix(drain): address PR review comments on drainAll() - Phase 1: wrap TownDO fetch in 10s abortable timeout so a hung endpoint cannot stall the entire drain sequence - Phase 2: clear idle timers before sending eviction nudge so a pre-existing timer cannot race and flip the agent to exited - Phase 4: abort SDK sessions before force-saving so git operations don't race with a still-running agent's worktree writes * fix(container): clear idle timers and abort sessions before Phase 4 git save Phase 4 (force-save stragglers) now runs in two sub-steps: 1. Freeze: cancel idle timers, abort event subscriptions, abort SDK sessions, and mark agents as exited. This prevents the normal completion path (idle timer -> onExit -> bead completion) from racing with the WIP snapshot. 2. Snapshot: git add/commit/push each worktree after all sessions are frozen, avoiding .git/index.lock collisions. Also adds AbortSignal.timeout(10s) to Phase 1 TownDO notification and clearIdleTimer before Phase 2 nudge messages. * fix(container): skip WIP snapshot for agents whose freeze failed Build the 4b snapshot list from successfully frozen agents only. If session.abort() throws in 4a, the agent is excluded from snapshotting to avoid racing git against a still-active session. * fix(container): set upstream on eviction push for first-time branches Use git push --set-upstream origin HEAD so branches without a configured upstream (e.g. freshly created agent worktrees) can push the WIP eviction snapshot instead of failing silently. * fix(container): skip git push for lightweight workspaces with no origin remote Phase 4b of drainAll() assumed all workspaces have an 'origin' remote. Lightweight workspaces (mayor/triage) are initialized with 'git init' and never add a remote, causing 'fatal: origin does not appear to be a git repository' during eviction. Check for the origin remote via 'git remote get-url origin' before attempting to push. When no remote exists, commit locally only and log a warning. --------- Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
SIGTERM now runs stopHeartbeat → drainAll → stopAll → exit(0) to allow in-flight agent work to finish before the container shuts down. SIGINT retains the immediate shutdown path for local dev.
…lifecycle - Add 3 TownDO draining lifecycle tests in town-container.test.ts: - recordContainerEviction() sets draining flag - touchAgentHeartbeat() clears draining flag after eviction - draining flag persists across re-initialization - Add 1 reconciler applyEvent test in reconciler.test.ts: - container_eviction event is processed without error (audit-only no-op)
1eddb78 to
09e5b32
Compare
…ry view joins (#1865) The query was using microdollar_usage_view which joins 13+ tables via LEFT JOINs, but only needs columns from microdollar_usage and microdollar_usage_metadata. Query now directly joins only these two tables. Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
* feat(admin): add safety identifier backfill panel * feat(admin): show Vercel Downstream Safety Identifier in user admin UI * refactor(admin): merge safety identifier backfill into single API call * refactor(admin): single query for safety identifier backfill * fmt * Delete obsolete script --------- Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com> Co-authored-by: Christiaan Arnoldus <christiaan.arnoldus@outlook.com>
remove beta label
…f/gt/refinery/5352c37e' into gt/toast/72c0c3a1
…Zod validation - Replace incorrect test that asserted touchAgentHeartbeat() clears drain flag (it doesn't) with tests for acknowledgeContainerReady() which is the actual mechanism that clears drain state - Add test for nonce mismatch case (wrong nonce keeps draining=true) - Add test verifying touchAgentHeartbeat returns drainNonce but doesn't clear draining flag, and only acknowledgeContainerReady clears it - Replace manual 'as' casts in handleContainerReady with Zod schema validation per project conventions
| const { data: counts, isLoading } = useQuery<SafetyIdentifierCountsResponse>({ | ||
| queryKey: ['safety-identifier-counts'], | ||
| queryFn: async () => { | ||
| const res = await fetch('/admin/api/safety-identifiers'); |
Contributor
There was a problem hiding this comment.
WARNING: Failed count requests look like 0 missing
This query always parses the response body without checking res.ok. If the admin session expires or the endpoint returns a 500, React Query receives an error payload as if it were SafetyIdentifierCountsResponse, so the page shows 0 missing and keeps the backfill button enabled instead of surfacing the failure.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements Tier 1.5: Graceful Container Eviction (closes #1236). All 4 convoy beads are now included and rebased onto current
main:Bead 1 — Foundation: Adds
container_evictionevent type to the town-events table and thePOST /container-evictionendpoint (JWT-authenticated), which sets a persisted draining flag on TownDO. Includes integration tests for the draining lifecycle.Bead 2 — Reconciler suppression: The reconciler accepts a
drainingflag and skipsdispatch_agent/ review queue actions while the town is draining, preventing new work from being assigned to a dying container.Bead 3 — Process manager drain sequence: Adds
drainAll()to process-manager, which orchestrates a 4-phase sequence: (1) notify TownDO to set drain, (2) nudge agents to commit and push, (3) poll up to 10 minutes for completion, (4) freeze and force-save stragglers via WIP commits.Bead 4 — SIGTERM handler integration: SIGTERM now runs
stopHeartbeat → drainAll → stopAll → exit(0). SIGINT retains the immediate shutdown path for dev.Key architectural additions:
_draining,_drainNonce,_drainStartedAt) with nonce-based handoff to prevent stale heartbeats from prematurely clearing drainPOST /container-eviction,POST /container-ready,GET /drain-statusdrainNonceatomically fromtouchAgentHeartbeatto avoid TOCTOU racesheartbeat.tsmodule for container-level drain polling and acknowledgmentVerification
main— diff shows only convoy changes (11 files, ~693 additions)Visual Changes
N/A
Reviewer Notes
ascasts intown-eviction.handler.tsare used defensively for body parsing; a zod schema would be cleaner but heavier for a single optional field--no-verifyon git push in Phase 4b is intentional — hooks could block the emergency WIP save during eviction