feat(kiloclaw): add stop recovery admin tooling #1993

Draft
pandemicsyn wants to merge 2 commits into main from florian/chore/recover-stopped

pandemicsyn commented Apr 3, 2026

Summary

Add unexpected-stop recovery support to KiloClaw and surface that lifecycle in the admin and user UI.

This changes the worker to move an unexpectedly stopped Fly machine into a dedicated recovering state, attempt a one-shot recovery by relocating onto a replacement volume and machine, retain the old volume only when snapshots exist, and clean that retained volume up later. Recovery is now triggered on the first observed Fly stopped state while the DO still believes the instance is running; Fly created no longer participates in this path.
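For reviewers, the trigger condition described above can be sketched as a small predicate. This is an illustrative sketch only: the type names, the subset of Fly states listed, and the function name `shouldStartRecovery` are assumptions and are not taken from the actual code in reconcile.ts or recovery.ts.

```typescript
// Hypothetical types: the real worker may model these states differently.
type FlyMachineState = "created" | "started" | "stopped" | "destroyed";
type InstanceStatus = "running" | "stopped" | "recovering";

// Returns true when an observed Fly state should start unexpected-stop recovery.
function shouldStartRecovery(
  doStatus: InstanceStatus,
  flyState: FlyMachineState,
): boolean {
  // Only a Fly `stopped` observation counts; `created` no longer participates.
  if (flyState !== "stopped") return false;
  // The DO must still believe the instance is running; otherwise the stop was
  // expected, or a recovery is already in progress.
  return doStatus === "running";
}
```

Because the predicate returns false once the DO is in `recovering`, a repeated `stopped` observation would not start a second attempt, which matches the one-shot behavior described in this summary.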

Old volumes without snapshots are deleted immediately after a successful cutover; old volumes with snapshots are retained for 7 days and then cleaned up automatically, with an admin override to delete them early. This PR also adds admin-facing visibility and cleanup controls for the retained recovery volume, plus user-facing action gating so that recovery is treated as a busy machine lifecycle state.
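The retention rule can be expressed as a pure decision function, which may make the policy easier to check in review. This is a minimal sketch under stated assumptions: `planOldVolume`, `isCleanupDue`, and the `OldVolumeDisposition` type are hypothetical names, and timestamps are assumed to be epoch milliseconds.

```typescript
// 7-day retention window from the PR description, in milliseconds.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical result type for what to do with the old (source) volume.
type OldVolumeDisposition =
  | { action: "delete_now" }
  | { action: "retain"; deleteAfter: number };

// Decide the fate of the old volume after a successful cutover.
function planOldVolume(
  snapshotCount: number,
  cutoverAt: number,
): OldVolumeDisposition {
  // No snapshots: nothing worth keeping, delete immediately.
  if (snapshotCount === 0) return { action: "delete_now" };
  // Snapshots exist: retain, and record when automatic cleanup becomes due.
  return { action: "retain", deleteAfter: cutoverAt + RETENTION_MS };
}

// A retained volume is cleaned up when the window has elapsed,
// or earlier if an admin explicitly requests deletion.
function isCleanupDue(
  deleteAfter: number,
  now: number,
  adminOverride = false,
): boolean {
  return adminOverride || now >= deleteAfter;
}
```

Keeping the decision separate from the Fly API calls would let the alarm loop and the admin action share one rule, though the PR may structure this differently.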

Architecturally, the unexpected-stop recovery flow is now extracted out of the instance DO class into a dedicated kiloclaw-instance/recovery.ts module, while index.ts remains the dispatcher/orchestrator. The recovery path was also hardened so timeout and failure cleanup is shared, alarms do not race active recovery by deleting the pending recovery volume, and retained-volume cleanup verifies sandbox ownership before force-destroying any attached machine.
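The ownership check before force-destroying an attached machine could look roughly like the guard below. It is a sketch, not the code in recovery.ts: the `AttachedMachine` shape and the function name `canForceDestroy` are hypothetical, and it assumes the sandbox ID is stored in machine metadata under the `kiloclaw_sandbox_id` key mentioned in the reviewer notes.

```typescript
// Hypothetical shape of the machine record returned for an attached volume.
interface AttachedMachine {
  id: string;
  // Assumption: ownership is recorded in machine metadata.
  metadata: Record<string, string | undefined>;
}

// Returns true only when the machine provably belongs to the expected sandbox.
function canForceDestroy(
  machine: AttachedMachine,
  expectedSandboxId: string,
): boolean {
  // Refuse when we do not know which sandbox we are cleaning up for.
  if (expectedSandboxId === "") return false;
  // A missing or mismatched ID means the machine may belong to someone else.
  return machine.metadata["kiloclaw_sandbox_id"] === expectedSandboxId;
}
```

The guard fails closed: a machine with no ownership metadata is never destroyed, so a retained volume that was attached elsewhere would be left for manual inspection.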

This also adds explicit Analytics Engine lifecycle events for the new path: recovery started, recovery succeeded, and recovery failed.
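One way to keep the three event names consistent across call sites is a single constant map. The event strings below are the ones listed in the reviewer notes; the `RECOVERY_EVENTS` map, the `RecoveryPhase` type, and the `recoveryEventName` helper are hypothetical and only illustrate the idea.

```typescript
// Event names as listed in this PR; the map itself is illustrative.
const RECOVERY_EVENTS = {
  started: "instance.unexpected_stop_recovery_started",
  succeeded: "instance.unexpected_stop_recovery_succeeded",
  failed: "instance.unexpected_stop_recovery_failed",
} as const;

type RecoveryPhase = keyof typeof RECOVERY_EVENTS;

// Look up the Analytics Engine event name for a recovery phase.
function recoveryEventName(phase: RecoveryPhase): string {
  return RECOVERY_EVENTS[phase];
}
```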

Verification

  • pnpm --filter kiloclaw test — passed (50 files, 1168 tests)
  • pnpm --filter kiloclaw typecheck — passed
  • pnpm --filter kiloclaw lint — passed
  • pnpm --filter kiloclaw format:check — passed
  • pnpm typecheck — passed
  • pnpm lint — passed
  • pnpm format:check — passed
  • pnpm test — failed in the local environment because the DB-backed Jest suites could not connect to the test database (cleanupDbForTest / src/lib/drizzle.ts)
  • git push origin florian/chore/recover-stopped — passed repo hooks (format:check, affected package typecheck, lint)
  • Additional user-provided verification details

Visual Changes

| Before | After |
| --- | --- |
| Admin instance detail had no dedicated unexpected-stop recovery section or retained-volume cleanup action | Add screenshot: admin instance detail showing the new “Unexpected Stop Recovery” card and retained-volume cleanup control |
| Claw UI did not represent recovering as a distinct busy state | Add screenshot: user-facing claw header/status badge and disabled controls while the instance is recovering |

Reviewer Notes

  • The worker logic change is split across reconcile.ts, index.ts, fly-machines.ts, and the new recovery.ts; recovery.ts is the best place to review the new lifecycle end-to-end.
  • New AE lifecycle events are emitted for instance.unexpected_stop_recovery_started, instance.unexpected_stop_recovery_succeeded, and instance.unexpected_stop_recovery_failed.
  • Unexpected-stop recovery is triggered only for Fly stopped while the DO still thinks the instance is running; Fly created is no longer treated as an unexpected-stop signal.
  • Old recovery source volumes are deleted immediately when they have no snapshots; if snapshots exist, the old volume is retained for 7 days, then deleted automatically by the alarm loop unless an admin deletes it earlier.
  • Retained recovery-volume cleanup verifies kiloclaw_sandbox_id before force-destroying an attached machine.
