
feat(container): graceful container eviction via SIGTERM drain#1825

Open
jrf0110 wants to merge 10 commits into main from
convoy/tier-1-5-graceful-container-eviction-sig/591efcef/head

Conversation

@jrf0110
Contributor

@jrf0110 commented Apr 1, 2026

Summary

Implements Tier 1.5: Graceful Container Eviction (closes #1236). All 4 convoy beads are now included and rebased onto current main:

Bead 1 — Foundation: Adds container_eviction event type to the town-events table and the POST /container-eviction endpoint (JWT-authenticated), which sets a persisted draining flag on TownDO. Includes integration tests for the draining lifecycle.

Bead 2 — Reconciler suppression: The reconciler accepts a draining flag and skips dispatch_agent / review queue actions while the town is draining, preventing new work from being assigned to a dying container.

Bead 3 — Process manager drain sequence: Adds drainAll() to process-manager, which orchestrates a 4-phase sequence: (1) notify TownDO to set drain, (2) nudge agents to commit and push, (3) poll up to 10 minutes for completion, (4) freeze and force-save stragglers via WIP commits.

Bead 4 — SIGTERM handler integration: SIGTERM now runs stopHeartbeat → drainAll → stopAll → exit(0). SIGINT retains the immediate shutdown path for dev.
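The Bead 4 shutdown ordering can be sketched as follows. This is a minimal sketch of the sequencing only; the three stubs stand in for the real process-manager functions, and logging replaces their actual work:

```typescript
// Sketch of the Bead 4 signal wiring. The stubs stand in for the real
// process-manager functions; only the ordering is the point here.
const log: string[] = [];
const stopHeartbeat = async () => { log.push("stopHeartbeat"); };
const drainAll = async () => { log.push("drainAll"); }; // may take up to ~10 min in the real code
const stopAll = async () => { log.push("stopAll"); };

async function onSigterm(): Promise<void> {
  await stopHeartbeat(); // stop reporting liveness first so no new work is dispatched
  await drainAll();      // let in-flight agent work finish (or force-save it)
  await stopAll();       // tear down remaining processes
  // process.exit(0) in the real handler; omitted here so the sketch is testable
}

// SIGINT keeps the immediate shutdown path for local dev:
// process.on("SIGTERM", () => { void onSigterm().then(() => process.exit(0)); });
// process.on("SIGINT", () => process.exit(0));
```

Stopping the heartbeat before draining matters: once the container stops reporting liveness, the worker side will not route new work to it while the drain runs.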

Key architectural additions:

  • TownDO gains persisted drain state (_draining, _drainNonce, _drainStartedAt) with nonce-based handoff to prevent stale heartbeats from prematurely clearing drain
  • Three new authenticated endpoints: POST /container-eviction, POST /container-ready, GET /drain-status
  • Heartbeat response includes drainNonce atomically from touchAgentHeartbeat to avoid TOCTOU races
  • heartbeat.ts module for container-level drain polling and acknowledgment
  • 15-minute auto-clear timeout in the alarm handler as a safety net
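The nonce handoff in the first bullet can be sketched as a small state machine. Hypothetical names and in-memory state; the real TownDO persists this in Durable Object storage:

```typescript
// Sketch of nonce-based drain state. In the real TownDO this lives in
// DO storage; here it is plain in-memory state for illustration.
interface DrainState {
  draining: boolean;
  drainNonce: string | null;
  drainStartedAt: number | null;
}

const state: DrainState = { draining: false, drainNonce: null, drainStartedAt: null };

// Called from POST /container-eviction: start draining and mint a nonce.
function recordContainerEviction(now: number): string {
  const nonce = Math.random().toString(36).slice(2); // crypto.randomUUID() would fit in real code
  state.draining = true;
  state.drainNonce = nonce;
  state.drainStartedAt = now;
  return nonce;
}

// Called from POST /container-ready: only the matching nonce clears drain,
// so a stale heartbeat from the old container cannot end the drain early.
function acknowledgeContainerReady(nonce: string): { cleared: boolean } {
  if (!state.draining || nonce !== state.drainNonce) return { cleared: false };
  state.draining = false;
  state.drainNonce = null;
  state.drainStartedAt = null;
  return { cleared: true };
}
```

The point of the nonce is that only the replacement container, which learned the nonce from a fresh heartbeat response, can clear the drain; anything holding a stale or missing nonce gets `cleared: false`.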

Verification

  • Typecheck passes
  • Lint passes
  • Format passes
  • Branch rebased onto current main — diff shows only convoy changes (11 files, ~693 additions)

Visual Changes

N/A

Reviewer Notes

  • The as casts in town-eviction.handler.ts are used defensively for body parsing; a zod schema would be cleaner but heavier for a single optional field
  • --no-verify on git push in Phase 4b is intentional — hooks could block the emergency WIP save during eviction
  • Bead 1's initial commit was skipped during rebase because its content was fully superseded by the evolved code in beads 2-4; the test commit from bead 1 is preserved

@kilo-code-bot
Contributor

@kilo-code-bot commented Apr 1, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

  • CRITICAL: 0
  • WARNING: 3
  • SUGGESTION: 0


Issue Details

WARNING

  • cloudflare-gastown/container/src/heartbeat.ts (line 73): containerReadyAcknowledged is still set on any 2xx response, even when /container-ready returns cleared: false for a nonce mismatch, which can leave the town permanently draining until timeout.
  • cloudflare-gastown/container/src/heartbeat.ts (line 98): Idle replacement containers still return before polling drain state, so towns with no running agents stay draining until the 11-minute health-check handoff instead of clearing promptly.
  • src/app/admin/components/SafetyIdentifiersBackfill.tsx (line 25): The count fetch ignores non-2xx responses, so auth or server failures can render as 0 missing and let admins continue running the backfill from a broken page.
Other Observations (not in diff)

None.

Files Reviewed (20 files)
  • cloudflare-gastown/container/src/control-server.ts - 0 issues
  • cloudflare-gastown/container/src/heartbeat.ts - 2 issues
  • cloudflare-gastown/container/src/process-manager.ts - 0 issues
  • cloudflare-gastown/src/db/tables/town-events.table.ts - 0 issues
  • cloudflare-gastown/src/dos/Town.do.ts - 0 issues
  • cloudflare-gastown/src/dos/town/reconciler.ts - 0 issues
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues
  • cloudflare-gastown/src/handlers/rig-agents.handler.ts - 0 issues
  • cloudflare-gastown/src/handlers/town-eviction.handler.ts - 0 issues
  • cloudflare-gastown/test/integration/reconciler.test.ts - 0 issues
  • cloudflare-gastown/test/integration/town-container.test.ts - 0 issues
  • src/app/(app)/claw/components/ClawHeader.tsx - 0 issues
  • src/app/admin/api/safety-identifiers/route.ts - 0 issues
  • src/app/admin/components/AppSidebar.tsx - 0 issues
  • src/app/admin/components/SafetyIdentifiersBackfill.tsx - 1 issue
  • src/app/admin/components/UserAdmin/UserAdminAccountInfo.tsx - 0 issues
  • src/app/admin/safety-identifiers/page.tsx - 0 issues
  • src/lib/providers/gateway-error-rate.ts - 0 issues
  • src/lib/user.test.ts - 0 issues
  • src/scripts/openrouter/backfill-safety-identifier.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 1,654,801 tokens

jrf0110 and others added 5 commits April 1, 2026 18:29
…1797)

* feat(gastown): add container_eviction event type and /container-eviction endpoint

Add foundation for Tier 1.5 graceful container eviction:

- Add container_eviction to TownEventType enum
- Add POST /api/towns/:townId/container-eviction route with container
  JWT auth for the container to signal SIGTERM receipt
- Add recordContainerEviction() RPC on TownDO that inserts the event
  and sets a draining flag in DO KV storage
- Add isDraining() RPC for reconciler consumption
- Clear draining flag on agent heartbeat (container alive after restart)
- Handle container_eviction in reconciler applyEvent (no-op, audit only)

* feat(gastown): skip dispatch_agent actions when town is draining

Add draining flag check to reconciler dispatch rules for Tier 1.5
graceful container eviction. When the town is in draining state
(container eviction in progress), all dispatch_agent emissions are
suppressed with a log message. Non-dispatch rules (agent health,
convoy progress, GUPP, GC) continue to function normally during drain.

Guards added to:
- reconcileBeads Rule 1 (unassigned open beads)
- reconcileBeads Rule 2 (idle agents with hooks)
- reconcileReviewQueue Rule 5 (refinery dispatch for open MRs)
- reconcileReviewQueue Rule 6 (idle refinery re-dispatch)

* style: apply oxfmt formatting to reconciler

* fix(gastown): replace heartbeat drain clearing with nonce-based /container-ready acknowledgment

Heartbeats from the old (draining) container could race with the
eviction signal and prematurely clear the drain flag, re-enabling
dispatch into a container that was shutting down.

Changes:
- recordContainerEviction() now generates a drain nonce and returns it
- New acknowledgeContainerReady(nonce) method validates the nonce
  before clearing drain
- Heartbeat responses include the drainNonce when draining, so the
  replacement container can call /container-ready
- New POST /container-ready worker endpoint
- Drain auto-expires after 15 minutes as a safety net
- Container heartbeat module detects drainNonce in responses and
  calls /container-ready to clear drain on the replacement container

* fix(gastown): fix drain nonce race and idle container readiness

Two issues fixed:

1. Heartbeat nonce TOCTOU race: touchAgentHeartbeat() now returns
   the drain nonce atomically in a single DO call instead of
   requiring a separate getDrainNonce() RPC. This prevents an
   in-flight heartbeat from the old container from observing a nonce
   generated between two separate calls.

2. Idle container readiness: ensureContainerReady() now passes
   X-Drain-Nonce and X-Town-Id headers to the container health check
   when draining. The container's /health endpoint reads these and
   calls /container-ready, handling the case where a replacement
   container has no running agents and the per-agent heartbeat loop
   has nothing to iterate.

Also adds GET /drain-status endpoint for debugging.

* fix(gastown): delay drain nonce handoff via health check until old container exits

ensureContainerReady() talks to whichever container is currently
serving this town. During drain, that is still the old container.
Passing X-Drain-Nonce immediately would let the old container
clear drain before the replacement is up.

Now the nonce is only passed via health check headers after 11
minutes (beyond the 10-min drainAll wait + exit), ensuring the
old container has exited before the handoff.

* fix(gastown): skip ensureContainerReady early-return when draining

ensureContainerReady() bails out early when there is no active
work and the rig is not recently configured. This prevented the
drain nonce handoff from ever reaching the replacement container
in idle towns. Now the draining flag bypasses that early-return
so the health check (with X-Drain-Nonce headers) always fires.
…iner eviction (#1795)

* feat(container): add drainAll() to process-manager for graceful container eviction

Implements 4-phase drain sequence:
1. Notify TownDO via POST /container-eviction
2. Nudge running polecat/refinery agents via sendMessage()
3. Poll up to 10 min waiting for agents to finish
4. Force-save stragglers with WIP git commit + push via Bun.spawn
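Phase 3 above is essentially a bounded poll. A minimal sketch, with a hypothetical `agentsStillRunning` callback standing in for the real process-manager check and the 10-minute budget parameterized:

```typescript
// Sketch of Phase 3: poll until all agents are done or the budget expires.
// `agentsStillRunning` stands in for the real process-manager check.
async function pollUntilDrained(
  agentsStillRunning: () => number,
  budgetMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    if (agentsStillRunning() === 0) return true; // everyone finished on their own
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // fall through to Phase 4: freeze + WIP snapshot
}
```

Returning a boolean rather than throwing keeps the drain sequence linear: a timeout is not an error, just the signal to move on to the force-save phase.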

* fix(tracking): only resolve signup product when callbackPath was explicitly provided (#1793)

* fix(tracking): only resolve signup product when callbackPath was explicitly provided

* style: apply oxfmt formatting to after-sign-in route

* fix(drain): address PR review comments on drainAll()

- Phase 1: wrap TownDO fetch in 10s abortable timeout so a hung
  endpoint cannot stall the entire drain sequence
- Phase 2: clear idle timers before sending eviction nudge so a
  pre-existing timer cannot race and flip the agent to exited
- Phase 4: abort SDK sessions before force-saving so git operations
  don't race with a still-running agent's worktree writes
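The Phase 1 guard can be sketched as a fetch bounded by AbortSignal.timeout. A sketch with an injectable fetch for testability; the URL and return convention are illustrative, not the real endpoint wiring:

```typescript
// Sketch of the Phase 1 guard: a hung /container-eviction endpoint cannot
// stall the drain, because the request is abandoned after 10 seconds.
async function notifyTownDO(
  url: string,
  fetchImpl: typeof fetch = fetch,
): Promise<boolean> {
  try {
    const res = await fetchImpl(url, {
      method: "POST",
      signal: AbortSignal.timeout(10_000),
    });
    return res.ok;
  } catch {
    return false; // timed out or unreachable: continue the drain sequence anyway
  }
}
```

Swallowing the failure is deliberate here: even if the TownDO cannot be notified, the container still wants to nudge agents and save work before it is killed.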

* fix(container): clear idle timers and abort sessions before Phase 4 git save

Phase 4 (force-save stragglers) now runs in two sub-steps:
1. Freeze: cancel idle timers, abort event subscriptions, abort SDK
   sessions, and mark agents as exited. This prevents the normal
   completion path (idle timer -> onExit -> bead completion) from
   racing with the WIP snapshot.
2. Snapshot: git add/commit/push each worktree after all sessions
   are frozen, avoiding .git/index.lock collisions.

Also adds AbortSignal.timeout(10s) to Phase 1 TownDO notification
and clearIdleTimer before Phase 2 nudge messages.

* fix(container): skip WIP snapshot for agents whose freeze failed

Build the 4b snapshot list from successfully frozen agents only.
If session.abort() throws in 4a, the agent is excluded from
snapshotting to avoid racing git against a still-active session.
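The freeze-then-snapshot ordering, including excluding agents whose freeze failed, can be sketched as follows. The `Agent` interface is hypothetical; the real code cancels idle timers, aborts SDK sessions, and runs git via Bun.spawn:

```typescript
// Sketch of Phase 4: freeze every agent first, then snapshot only the
// agents that froze cleanly, so git never races a live session.
interface Agent {
  id: string;
  freeze(): Promise<void>;   // cancel idle timers, abort session, mark exited
  snapshot(): Promise<void>; // git add/commit/push the worktree (WIP commit)
}

async function forceSaveStragglers(agents: Agent[]): Promise<string[]> {
  const frozen: Agent[] = [];
  for (const agent of agents) {
    try {
      await agent.freeze();
      frozen.push(agent);
    } catch {
      // freeze failed: skip snapshotting to avoid racing an active session
    }
  }
  const saved: string[] = [];
  for (const agent of frozen) {
    await agent.snapshot();
    saved.push(agent.id);
  }
  return saved;
}
```

Two separate loops, not one, is the load-bearing choice: every session is frozen before any git command runs, which is what avoids the .git/index.lock collisions described above.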

* fix(container): set upstream on eviction push for first-time branches

Use git push --set-upstream origin HEAD so branches without a
configured upstream (e.g. freshly created agent worktrees) can
push the WIP eviction snapshot instead of failing silently.

* fix(container): skip git push for lightweight workspaces with no origin remote

Phase 4b of drainAll() assumed all workspaces have an 'origin' remote.
Lightweight workspaces (mayor/triage) are initialized with 'git init'
and never add a remote, causing 'fatal: origin does not appear to be
a git repository' during eviction.

Check for the origin remote via 'git remote get-url origin' before
attempting to push. When no remote exists, commit locally only and
log a warning.
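The push guard described above can be sketched as follows. A sketch only: the real code shells out via Bun.spawn, modeled here with an injected command runner, and the commit message is illustrative:

```typescript
// Sketch of the Phase 4b push guard: only push when an 'origin' remote
// exists; lightweight workspaces (git init, no remote) commit locally only.
type Run = (cmd: string[]) => Promise<{ exitCode: number }>;

async function commitAndMaybePush(run: Run): Promise<"pushed" | "local-only"> {
  await run(["git", "add", "-A"]);
  await run(["git", "commit", "-m", "WIP: eviction snapshot", "--no-verify"]);
  const remote = await run(["git", "remote", "get-url", "origin"]);
  if (remote.exitCode !== 0) {
    // no origin remote: keep the WIP commit local and log a warning
    return "local-only";
  }
  await run(["git", "push", "--set-upstream", "origin", "HEAD", "--no-verify"]);
  return "pushed";
}
```

`git remote get-url origin` exits non-zero when the remote is absent, which makes it a cheap existence probe before attempting the push.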

---------

Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
SIGTERM now runs stopHeartbeat → drainAll → stopAll → exit(0) to allow
in-flight agent work to finish before the container shuts down.
SIGINT retains the immediate shutdown path for local dev.
…lifecycle

- Add 3 TownDO draining lifecycle tests in town-container.test.ts:
  - recordContainerEviction() sets draining flag
  - touchAgentHeartbeat() clears draining flag after eviction
  - draining flag persists across re-initialization
- Add 1 reconciler applyEvent test in reconciler.test.ts:
  - container_eviction event is processed without error (audit-only no-op)
@kilo-code-bot force-pushed the convoy/tier-1-5-graceful-container-eviction-sig/591efcef/head branch from 1eddb78 to 09e5b32 on April 1, 2026 18:31
kilo-code-bot and others added 5 commits April 1, 2026 21:58
…ry view joins (#1865)

The query was using microdollar_usage_view which joins 13+ tables via LEFT JOINs,
but only needs columns from microdollar_usage and microdollar_usage_metadata.
Query now directly joins only these two tables.

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
* feat(admin): add safety identifier backfill panel

* feat(admin): show Vercel Downstream Safety Identifier in user admin UI

* refactor(admin): merge safety identifier backfill into single API call

* refactor(admin): single query for safety identifier backfill

* fmt

* Delete obsolete script

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Christiaan Arnoldus <christiaan.arnoldus@outlook.com>
…f/gt/refinery/5352c37e' into gt/toast/72c0c3a1
…Zod validation

- Replace incorrect test that asserted touchAgentHeartbeat() clears drain
  flag (it doesn't) with tests for acknowledgeContainerReady() which is
  the actual mechanism that clears drain state
- Add test for nonce mismatch case (wrong nonce keeps draining=true)
- Add test verifying touchAgentHeartbeat returns drainNonce but doesn't
  clear draining flag, and only acknowledgeContainerReady clears it
- Replace manual 'as' casts in handleContainerReady with Zod schema
  validation per project conventions
```typescript
const { data: counts, isLoading } = useQuery<SafetyIdentifierCountsResponse>({
  queryKey: ['safety-identifier-counts'],
  queryFn: async () => {
    const res = await fetch('/admin/api/safety-identifiers');
```

WARNING: Failed count requests look like 0 missing

This query always parses the response body without checking res.ok. If the admin session expires or the endpoint returns a 500, React Query receives an error payload as if it were SafetyIdentifierCountsResponse, so the page shows 0 missing and keeps the backfill button enabled instead of surfacing the failure.
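A guarded version of this query's fetch would reject on non-2xx so React Query surfaces the error instead of a fake zero. A sketch with an injectable fetch; the `SafetyIdentifierCountsResponse` shape shown is assumed, not taken from the diff:

```typescript
// Sketch of the fix: throw on non-2xx responses so React Query reports
// an error instead of rendering a failure payload as "0 missing".
interface SafetyIdentifierCountsResponse {
  missing: number; // assumed field name for illustration
}

async function fetchCounts(
  fetchImpl: typeof fetch = fetch,
): Promise<SafetyIdentifierCountsResponse> {
  const res = await fetchImpl("/admin/api/safety-identifiers");
  if (!res.ok) {
    throw new Error(`safety-identifier counts failed: ${res.status}`);
  }
  return (await res.json()) as SafetyIdentifierCountsResponse;
}
```

With a throwing `queryFn`, `useQuery` moves to its error state, so the page can disable the backfill button rather than showing 0 missing.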



Development

Successfully merging this pull request may close these issues.

Persist agent conversation across container restarts via AgentDO event reconstruction
