
feat(container): graceful container eviction via SIGTERM drain#1825

Open
jrf0110 wants to merge 10 commits into main from
convoy/tier-1-5-graceful-container-eviction-sig/591efcef/head

Conversation

@jrf0110
Contributor

@jrf0110 commented Apr 1, 2026

Summary

Implements Tier 1.5: Graceful Container Eviction (closes #1236). All 4 convoy beads are now included and rebased onto current main:

Bead 1 — Foundation: Adds container_eviction event type to the town-events table and the POST /container-eviction endpoint (JWT-authenticated), which sets a persisted draining flag on TownDO. Includes integration tests for the draining lifecycle.

Bead 2 — Reconciler suppression: The reconciler accepts a draining flag and skips dispatch_agent / review queue actions while the town is draining, preventing new work from being assigned to a dying container.

Bead 3 — Process manager drain sequence: Adds drainAll() to process-manager, which orchestrates a 4-phase sequence: (1) notify TownDO to set drain, (2) nudge agents to commit and push, (3) poll up to 10 minutes for completion, (4) freeze and force-save stragglers via WIP commits.

Bead 4 — SIGTERM handler integration: SIGTERM now runs stopHeartbeat → drainAll → stopAll → exit(0). SIGINT retains the immediate shutdown path for dev.
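The Bead 4 shutdown ordering can be sketched as follows. This is a minimal sketch of the sequencing only; the three stubs stand in for the real process-manager functions, and logging replaces their actual work:

```typescript
// Sketch of the Bead 4 signal wiring. The stubs stand in for the real
// process-manager functions; only the ordering is the point here.
const log: string[] = [];
const stopHeartbeat = async () => { log.push("stopHeartbeat"); };
const drainAll = async () => { log.push("drainAll"); }; // may take up to ~10 min in the real code
const stopAll = async () => { log.push("stopAll"); };

async function onSigterm(): Promise<void> {
  await stopHeartbeat(); // stop reporting liveness first so no new work is dispatched
  await drainAll();      // let in-flight agent work finish (or force-save it)
  await stopAll();       // tear down remaining processes
  // process.exit(0) in the real handler; omitted here so the sketch is testable
}

// SIGINT keeps the immediate shutdown path for local dev:
// process.on("SIGTERM", () => { void onSigterm().then(() => process.exit(0)); });
// process.on("SIGINT", () => process.exit(0));
```

Stopping the heartbeat before draining matters: once the container stops reporting liveness, the worker side will not route new work to it while the drain runs.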

Key architectural additions:

  • TownDO gains persisted drain state (_draining, _drainNonce, _drainStartedAt) with nonce-based handoff to prevent stale heartbeats from prematurely clearing drain
  • Three new authenticated endpoints: POST /container-eviction, POST /container-ready, GET /drain-status
  • Heartbeat response includes drainNonce atomically from touchAgentHeartbeat to avoid TOCTOU races
  • heartbeat.ts module for container-level drain polling and acknowledgment
  • 15-minute auto-clear timeout in the alarm handler as a safety net
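The nonce handoff in the first bullet can be sketched as a small state machine. Hypothetical names and in-memory state; the real TownDO persists this in Durable Object storage:

```typescript
// Sketch of nonce-based drain state. In the real TownDO this lives in
// DO storage; here it is plain in-memory state for illustration.
interface DrainState {
  draining: boolean;
  drainNonce: string | null;
  drainStartedAt: number | null;
}

const state: DrainState = { draining: false, drainNonce: null, drainStartedAt: null };

// Called from POST /container-eviction: start draining and mint a nonce.
function recordContainerEviction(now: number): string {
  const nonce = Math.random().toString(36).slice(2); // crypto.randomUUID() would fit in real code
  state.draining = true;
  state.drainNonce = nonce;
  state.drainStartedAt = now;
  return nonce;
}

// Called from POST /container-ready: only the matching nonce clears drain,
// so a stale heartbeat from the old container cannot end the drain early.
function acknowledgeContainerReady(nonce: string): { cleared: boolean } {
  if (!state.draining || nonce !== state.drainNonce) return { cleared: false };
  state.draining = false;
  state.drainNonce = null;
  state.drainStartedAt = null;
  return { cleared: true };
}
```

The point of the nonce is that only the replacement container, which learned the nonce from a fresh heartbeat response, can clear the drain; anything holding a stale or missing nonce gets `cleared: false`.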

Verification

  • Typecheck passes
  • Lint passes
  • Format passes
  • Branch rebased onto current main — diff shows only convoy changes (11 files, ~693 additions)

Visual Changes

N/A

Reviewer Notes

  • The as casts in town-eviction.handler.ts are used defensively for body parsing; a zod schema would be cleaner but heavier for a single optional field
  • --no-verify on git push in Phase 4b is intentional — hooks could block the emergency WIP save during eviction
  • Bead 1's initial commit was skipped during rebase because its content was fully superseded by the evolved code in beads 2-4; the test commit from bead 1 is preserved

@kilo-code-bot
Contributor

@kilo-code-bot commented Apr 1, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

  • CRITICAL: 0
  • WARNING: 3
  • SUGGESTION: 0


Issue Details

WARNING

  • cloudflare-gastown/container/src/heartbeat.ts (line 73): containerReadyAcknowledged is still set on any 2xx response, even when /container-ready returns cleared: false for a nonce mismatch, which can leave the town permanently draining until timeout.
  • cloudflare-gastown/container/src/heartbeat.ts (line 98): Idle replacement containers still return before polling drain state, so towns with no running agents stay draining until the 11-minute health-check handoff instead of clearing promptly.
  • src/app/admin/components/SafetyIdentifiersBackfill.tsx (line 25): The count fetch ignores non-2xx responses, so auth or server failures can render as 0 missing and let admins continue running the backfill from a broken page.
Other Observations (not in diff)

None.

Files Reviewed (20 files)
  • cloudflare-gastown/container/src/control-server.ts - 0 issues
  • cloudflare-gastown/container/src/heartbeat.ts - 2 issues
  • cloudflare-gastown/container/src/process-manager.ts - 0 issues
  • cloudflare-gastown/src/db/tables/town-events.table.ts - 0 issues
  • cloudflare-gastown/src/dos/Town.do.ts - 0 issues
  • cloudflare-gastown/src/dos/town/reconciler.ts - 0 issues
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-agents.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/town-eviction.handler.ts - 0 issues
  • cloudflare-gastown/test/integration/reconciler.test.ts - 0 issues
  • cloudflare-gastown/test/integration/town-container.test.ts - 0 issues
  • src/app/(app)/claw/components/ClawHeader.tsx - 0 issues
  • src/app/admin/api/safety-identifiers/route.ts - 0 issues
  • src/app/admin/components/AppSidebar.tsx - 0 issues
  • src/app/admin/components/SafetyIdentifiersBackfill.tsx - 1 issue
  • src/app/admin/components/UserAdmin/UserAdminAccountInfo.tsx - 0 issues
  • src/app/admin/safety-identifiers/page.tsx - 0 issues
  • src/lib/providers/gateway-error-rate.ts - 0 issues
  • src/lib/user.test.ts - 0 issues
  • src/scripts/openrouter/backfill-safety-identifier.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 1,654,801 tokens

jrf0110 and others added 5 commits April 1, 2026 18:29
…1797)

* feat(gastown): add container_eviction event type and /container-eviction endpoint

Add foundation for Tier 1.5 graceful container eviction:

- Add container_eviction to TownEventType enum
- Add POST /api/towns/:townId/container-eviction route with container
  JWT auth for the container to signal SIGTERM receipt
- Add recordContainerEviction() RPC on TownDO that inserts the event
  and sets a draining flag in DO KV storage
- Add isDraining() RPC for reconciler consumption
- Clear draining flag on agent heartbeat (container alive after restart)
- Handle container_eviction in reconciler applyEvent (no-op, audit only)

* feat(gastown): skip dispatch_agent actions when town is draining

Add draining flag check to reconciler dispatch rules for Tier 1.5
graceful container eviction. When the town is in draining state
(container eviction in progress), all dispatch_agent emissions are
suppressed with a log message. Non-dispatch rules (agent health,
convoy progress, GUPP, GC) continue to function normally during drain.

Guards added to:
- reconcileBeads Rule 1 (unassigned open beads)
- reconcileBeads Rule 2 (idle agents with hooks)
- reconcileReviewQueue Rule 5 (refinery dispatch for open MRs)
- reconcileReviewQueue Rule 6 (idle refinery re-dispatch)

* style: apply oxfmt formatting to reconciler

* fix(gastown): replace heartbeat drain clearing with nonce-based /container-ready acknowledgment

Heartbeats from the old (draining) container could race with the
eviction signal and prematurely clear the drain flag, re-enabling
dispatch into a container that was shutting down.

Changes:
- recordContainerEviction() now generates a drain nonce and returns it
- New acknowledgeContainerReady(nonce) method validates the nonce
  before clearing drain
- Heartbeat responses include the drainNonce when draining, so the
  replacement container can call /container-ready
- New POST /container-ready worker endpoint
- Drain auto-expires after 15 minutes as a safety net
- Container heartbeat module detects drainNonce in responses and
  calls /container-ready to clear drain on the replacement container

* fix(gastown): fix drain nonce race and idle container readiness

Two issues fixed:

1. Heartbeat nonce TOCTOU race: touchAgentHeartbeat() now returns
   the drain nonce atomically in a single DO call instead of
   requiring a separate getDrainNonce() RPC. This prevents an
   in-flight heartbeat from the old container from observing a nonce
   generated between two separate calls.

2. Idle container readiness: ensureContainerReady() now passes
   X-Drain-Nonce and X-Town-Id headers to the container health check
   when draining. The container's /health endpoint reads these and
   calls /container-ready, handling the case where a replacement
   container has no running agents and the per-agent heartbeat loop
   has nothing to iterate.

Also adds GET /drain-status endpoint for debugging.

* fix(gastown): delay drain nonce handoff via health check until old container exits

ensureContainerReady() talks to whichever container is currently
serving this town. During drain, that is still the old container.
Passing X-Drain-Nonce immediately would let the old container
clear drain before the replacement is up.

Now the nonce is only passed via health check headers after 11
minutes (beyond the 10-min drainAll wait + exit), ensuring the
old container has exited before the handoff.

* fix(gastown): skip ensureContainerReady early-return when draining

ensureContainerReady() bails out early when there is no active
work and the rig is not recently configured. This prevented the
drain nonce handoff from ever reaching the replacement container
in idle towns. Now the draining flag bypasses that early-return
so the health check (with X-Drain-Nonce headers) always fires.
…iner eviction (#1795)

* feat(container): add drainAll() to process-manager for graceful container eviction

Implements 4-phase drain sequence:
1. Notify TownDO via POST /container-eviction
2. Nudge running polecat/refinery agents via sendMessage()
3. Poll up to 10 min waiting for agents to finish
4. Force-save stragglers with WIP git commit + push via Bun.spawn
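Phase 3 above is essentially a bounded poll. A minimal sketch, with a hypothetical `agentsStillRunning` callback standing in for the real process-manager check and the 10-minute budget parameterized:

```typescript
// Sketch of Phase 3: poll until all agents are done or the budget expires.
// `agentsStillRunning` stands in for the real process-manager check.
async function pollUntilDrained(
  agentsStillRunning: () => number,
  budgetMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    if (agentsStillRunning() === 0) return true; // everyone finished on their own
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // fall through to Phase 4: freeze + WIP snapshot
}
```

Returning a boolean rather than throwing keeps the drain sequence linear: a timeout is not an error, just the signal to move on to the force-save phase.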

* fix(tracking): only resolve signup product when callbackPath was explicitly provided (#1793)

* fix(tracking): only resolve signup product when callbackPath was explicitly provided

* style: apply oxfmt formatting to after-sign-in route

* fix(drain): address PR review comments on drainAll()

- Phase 1: wrap TownDO fetch in 10s abortable timeout so a hung
  endpoint cannot stall the entire drain sequence
- Phase 2: clear idle timers before sending eviction nudge so a
  pre-existing timer cannot race and flip the agent to exited
- Phase 4: abort SDK sessions before force-saving so git operations
  don't race with a still-running agent's worktree writes
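The Phase 1 guard can be sketched as a fetch bounded by AbortSignal.timeout. A sketch with an injectable fetch for testability; the URL and return convention are illustrative, not the real endpoint wiring:

```typescript
// Sketch of the Phase 1 guard: a hung /container-eviction endpoint cannot
// stall the drain, because the request is abandoned after 10 seconds.
async function notifyTownDO(
  url: string,
  fetchImpl: typeof fetch = fetch,
): Promise<boolean> {
  try {
    const res = await fetchImpl(url, {
      method: "POST",
      signal: AbortSignal.timeout(10_000),
    });
    return res.ok;
  } catch {
    return false; // timed out or unreachable: continue the drain sequence anyway
  }
}
```

Swallowing the failure is deliberate here: even if the TownDO cannot be notified, the container still wants to nudge agents and save work before it is killed.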

* fix(container): clear idle timers and abort sessions before Phase 4 git save

Phase 4 (force-save stragglers) now runs in two sub-steps:
1. Freeze: cancel idle timers, abort event subscriptions, abort SDK
   sessions, and mark agents as exited. This prevents the normal
   completion path (idle timer -> onExit -> bead completion) from
   racing with the WIP snapshot.
2. Snapshot: git add/commit/push each worktree after all sessions
   are frozen, avoiding .git/index.lock collisions.

Also adds AbortSignal.timeout(10s) to Phase 1 TownDO notification
and clearIdleTimer before Phase 2 nudge messages.

* fix(container): skip WIP snapshot for agents whose freeze failed

Build the 4b snapshot list from successfully frozen agents only.
If session.abort() throws in 4a, the agent is excluded from
snapshotting to avoid racing git against a still-active session.
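The freeze-then-snapshot ordering, including excluding agents whose freeze failed, can be sketched as follows. The `Agent` interface is hypothetical; the real code cancels idle timers, aborts SDK sessions, and runs git via Bun.spawn:

```typescript
// Sketch of Phase 4: freeze every agent first, then snapshot only the
// agents that froze cleanly, so git never races a live session.
interface Agent {
  id: string;
  freeze(): Promise<void>;   // cancel idle timers, abort session, mark exited
  snapshot(): Promise<void>; // git add/commit/push the worktree (WIP commit)
}

async function forceSaveStragglers(agents: Agent[]): Promise<string[]> {
  const frozen: Agent[] = [];
  for (const agent of agents) {
    try {
      await agent.freeze();
      frozen.push(agent);
    } catch {
      // freeze failed: skip snapshotting to avoid racing an active session
    }
  }
  const saved: string[] = [];
  for (const agent of frozen) {
    await agent.snapshot();
    saved.push(agent.id);
  }
  return saved;
}
```

Two separate loops, not one, is the load-bearing choice: every session is frozen before any git command runs, which is what avoids the .git/index.lock collisions described above.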

* fix(container): set upstream on eviction push for first-time branches

Use git push --set-upstream origin HEAD so branches without a
configured upstream (e.g. freshly created agent worktrees) can
push the WIP eviction snapshot instead of failing silently.

* fix(container): skip git push for lightweight workspaces with no origin remote

Phase 4b of drainAll() assumed all workspaces have an 'origin' remote.
Lightweight workspaces (mayor/triage) are initialized with 'git init'
and never add a remote, causing 'fatal: origin does not appear to be
a git repository' during eviction.

Check for the origin remote via 'git remote get-url origin' before
attempting to push. When no remote exists, commit locally only and
log a warning.
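The push guard described above can be sketched as follows. A sketch only: the real code shells out via Bun.spawn, modeled here with an injected command runner, and the commit message is illustrative:

```typescript
// Sketch of the Phase 4b push guard: only push when an 'origin' remote
// exists; lightweight workspaces (git init, no remote) commit locally only.
type Run = (cmd: string[]) => Promise<{ exitCode: number }>;

async function commitAndMaybePush(run: Run): Promise<"pushed" | "local-only"> {
  await run(["git", "add", "-A"]);
  await run(["git", "commit", "-m", "WIP: eviction snapshot", "--no-verify"]);
  const remote = await run(["git", "remote", "get-url", "origin"]);
  if (remote.exitCode !== 0) {
    // no origin remote: keep the WIP commit local and log a warning
    return "local-only";
  }
  await run(["git", "push", "--set-upstream", "origin", "HEAD", "--no-verify"]);
  return "pushed";
}
```

`git remote get-url origin` exits non-zero when the remote is absent, which makes it a cheap existence probe before attempting the push.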

---------

Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
SIGTERM now runs stopHeartbeat → drainAll → stopAll → exit(0) to allow
in-flight agent work to finish before the container shuts down.
SIGINT retains the immediate shutdown path for local dev.
…lifecycle

- Add 3 TownDO draining lifecycle tests in town-container.test.ts:
  - recordContainerEviction() sets draining flag
  - touchAgentHeartbeat() clears draining flag after eviction
  - draining flag persists across re-initialization
- Add 1 reconciler applyEvent test in reconciler.test.ts:
  - container_eviction event is processed without error (audit-only no-op)
@kilo-code-bot force-pushed the convoy/tier-1-5-graceful-container-eviction-sig/591efcef/head branch from 1eddb78 to 09e5b32 on April 1, 2026 18:31
kilo-code-bot and others added 5 commits April 1, 2026 21:58
…ry view joins (#1865)

The query was using microdollar_usage_view which joins 13+ tables via LEFT JOINs,
but only needs columns from microdollar_usage and microdollar_usage_metadata.
Query now directly joins only these two tables.

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
* feat(admin): add safety identifier backfill panel

* feat(admin): show Vercel Downstream Safety Identifier in user admin UI

* refactor(admin): merge safety identifier backfill into single API call

* refactor(admin): single query for safety identifier backfill

* fmt

* Delete obsolete script

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Christiaan Arnoldus <christiaan.arnoldus@outlook.com>
…f/gt/refinery/5352c37e' into gt/toast/72c0c3a1
…Zod validation

- Replace incorrect test that asserted touchAgentHeartbeat() clears drain
  flag (it doesn't) with tests for acknowledgeContainerReady() which is
  the actual mechanism that clears drain state
- Add test for nonce mismatch case (wrong nonce keeps draining=true)
- Add test verifying touchAgentHeartbeat returns drainNonce but doesn't
  clear draining flag, and only acknowledgeContainerReady clears it
- Replace manual 'as' casts in handleContainerReady with Zod schema
  validation per project conventions
```typescript
const { data: counts, isLoading } = useQuery<SafetyIdentifierCountsResponse>({
  queryKey: ['safety-identifier-counts'],
  queryFn: async () => {
    const res = await fetch('/admin/api/safety-identifiers');
```

WARNING: Failed count requests look like 0 missing

This query always parses the response body without checking res.ok. If the admin session expires or the endpoint returns a 500, React Query receives an error payload as if it were SafetyIdentifierCountsResponse, so the page shows 0 missing and keeps the backfill button enabled instead of surfacing the failure.
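A guarded version of this query's fetch would reject on non-2xx so React Query surfaces the error instead of a fake zero. A sketch with an injectable fetch; the `SafetyIdentifierCountsResponse` shape shown is assumed, not taken from the diff:

```typescript
// Sketch of the fix: throw on non-2xx responses so React Query reports
// an error instead of rendering a failure payload as "0 missing".
interface SafetyIdentifierCountsResponse {
  missing: number; // assumed field name for illustration
}

async function fetchCounts(
  fetchImpl: typeof fetch = fetch,
): Promise<SafetyIdentifierCountsResponse> {
  const res = await fetchImpl("/admin/api/safety-identifiers");
  if (!res.ok) {
    throw new Error(`safety-identifier counts failed: ${res.status}`);
  }
  return (await res.json()) as SafetyIdentifierCountsResponse;
}
```

With a throwing `queryFn`, `useQuery` moves to its error state, so the page can disable the backfill button rather than showing 0 missing.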



Development

Successfully merging this pull request may close these issues.

Persist agent conversation across container restarts via AgentDO event reconstruction
