feat(kiloclaw): add stop recovery admin tooling by pandemicsyn · Pull Request #1993 · Kilo-Org/cloud

pandemicsyn · 2026-04-03T22:18:15Z

Summary

Add unexpected-stop recovery support to KiloClaw and surface that lifecycle in the admin and user UI.

This changes the worker to move an unexpectedly stopped Fly machine into a dedicated `recovering` state and attempt a one-shot recovery by relocating onto a replacement volume and machine. The old volume is retained only when snapshots exist, and it is cleaned up later. Recovery is now triggered on the first observed Fly `stopped` state while the Durable Object (DO) still believes the instance is running; the Fly `created` state no longer participates in this path.
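
The trigger condition described above can be sketched as a small predicate. This is an illustration only: the type names, the listed status values, and the function name are hypothetical and are not taken from the PR's code; only the rule itself (Fly `stopped` while the DO believes the instance is running) comes from the description.

```typescript
// Hypothetical types: the real DO status and Fly state unions are not shown in this PR text.
type DoStatus = "running" | "stopped" | "recovering";
type FlyMachineState = "created" | "started" | "stopped";

// Returns true only for the combination the description names:
// Fly reports `stopped` while the DO still records the instance as running.
// Fly `created` never triggers recovery.
function shouldStartUnexpectedStopRecovery(
  doStatus: DoStatus,
  flyState: FlyMachineState,
): boolean {
  return doStatus === "running" && flyState === "stopped";
}
```

Keeping the check as a pure function would make the "first observed stopped state" rule easy to unit test without a Fly or DO stub.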

Old volumes without snapshots are deleted immediately after a successful cutover. Old volumes with snapshots are retained for 7 days and then cleaned up automatically, with an admin override to delete them early. The PR also adds admin-facing visibility and cleanup controls for the retained recovery volume, plus user-facing action gating so that recovery is treated as a busy machine lifecycle state.
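
A minimal sketch of the retention rule above, assuming the decision is made once at cutover and re-checked by the alarm loop. All names, the result shape, and the use of epoch milliseconds are assumptions for illustration; only the 7-day window, the snapshot condition, and the admin override come from the description.

```typescript
// 7 days in milliseconds, per the PR description.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical result shape: what to do with the old volume after a successful cutover.
type OldVolumeDisposition =
  | { action: "delete_now" }
  | { action: "retain"; deleteAfter: number };

// No snapshots: delete immediately. Snapshots exist: retain until cutover + 7 days.
function planOldVolume(snapshotCount: number, cutoverAt: number): OldVolumeDisposition {
  if (snapshotCount === 0) {
    return { action: "delete_now" };
  }
  return { action: "retain", deleteAfter: cutoverAt + RETENTION_MS };
}

// The retained volume is due once the deadline has passed,
// or at any time when an admin requests early deletion.
function isRetainedVolumeDue(deleteAfter: number, now: number, adminOverride = false): boolean {
  return adminOverride || now >= deleteAfter;
}
```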

Architecturally, the unexpected-stop recovery flow is now extracted out of the instance DO class into a dedicated kiloclaw-instance/recovery.ts module, while index.ts remains the dispatcher/orchestrator. The recovery path was also hardened so timeout and failure cleanup is shared, alarms do not race active recovery by deleting the pending recovery volume, and retained-volume cleanup verifies sandbox ownership before force-destroying any attached machine.

This also adds explicit Analytics Engine lifecycle events for the new path: recovery started, recovery succeeded, and recovery failed.
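
For illustration, emitting these events might look like the sketch below. The event names are the ones listed in the Reviewer Notes; everything else is an assumption. `DataPointSink` is a hypothetical interface modeled on the shape of the Workers Analytics Engine `writeDataPoint` call, and the choice of what goes into `indexes`, `blobs`, and `doubles` is invented here, not taken from the PR.

```typescript
// Event names as listed in the PR's Reviewer Notes.
type RecoveryEvent =
  | "instance.unexpected_stop_recovery_started"
  | "instance.unexpected_stop_recovery_succeeded"
  | "instance.unexpected_stop_recovery_failed";

// Hypothetical minimal sink, shaped like an Analytics Engine dataset binding.
interface DataPointSink {
  writeDataPoint(point: { indexes?: string[]; blobs?: string[]; doubles?: number[] }): void;
}

// Assumed layout: sandbox id as the index, event name as the first blob,
// elapsed recovery time (ms) as the first double.
function emitRecoveryEvent(
  sink: DataPointSink,
  event: RecoveryEvent,
  sandboxId: string,
  durationMs = 0,
): void {
  sink.writeDataPoint({
    indexes: [sandboxId],
    blobs: [event],
    doubles: [durationMs],
  });
}
```

Typing the event name as a string-literal union means a misspelled event fails at compile time rather than silently creating a new series.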

Verification

pnpm --filter kiloclaw test — passed (50 files, 1168 tests)
pnpm --filter kiloclaw typecheck — passed
pnpm --filter kiloclaw lint — passed
pnpm --filter kiloclaw format:check — passed
pnpm typecheck — passed
pnpm lint — passed
pnpm format:check — passed
pnpm test — failed in the local environment because the DB-backed Jest suites could not connect to the test database (cleanupDbForTest / src/lib/drizzle.ts)
git push origin florian/chore/recover-stopped — passed repo hooks (format:check, affected package typecheck, lint)

Visual Changes

| Before | After |
| --- | --- |
| Admin instance detail had no dedicated unexpected-stop recovery section or retained-volume cleanup action | Add screenshot: admin instance detail showing the new “Unexpected Stop Recovery” card and retained-volume cleanup control |
| Claw UI did not represent `recovering` as a distinct busy state | Add screenshot: user-facing claw header/status badge and disabled controls while the instance is recovering |

Reviewer Notes

The worker logic change is split across reconcile.ts, index.ts, fly-machines.ts, and the new recovery.ts; recovery.ts is the best place to review the new lifecycle end-to-end.
New AE lifecycle events are emitted for instance.unexpected_stop_recovery_started, instance.unexpected_stop_recovery_succeeded, and instance.unexpected_stop_recovery_failed.
Unexpected-stop recovery is triggered only when Fly reports `stopped` while the DO still thinks the instance is running; Fly `created` is no longer treated as an unexpected-stop signal.
Old recovery source volumes are deleted immediately when they have no snapshots; if snapshots exist, the old volume is retained for 7 days, then deleted automatically by the alarm loop unless an admin deletes it earlier.
Retained recovery-volume cleanup verifies kiloclaw_sandbox_id before force-destroying an attached machine.
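
The ownership check in the last note could be sketched roughly as follows. This assumes `kiloclaw_sandbox_id` is stored as a key in the attached machine's metadata; that placement, the `AttachedMachine` shape, and the function name are hypothetical and not confirmed by the PR text.

```typescript
// Hypothetical view of a machine still attached to the retained volume.
interface AttachedMachine {
  id: string;
  metadata?: Record<string, string>;
}

// Allow a force-destroy only when the machine is tagged with the sandbox id
// that owns the retained volume. Missing metadata or a missing tag is a refusal.
function mayForceDestroy(machine: AttachedMachine, expectedSandboxId: string): boolean {
  return machine.metadata?.kiloclaw_sandbox_id === expectedSandboxId;
}
```

Treating an absent tag as a mismatch errs on the side of leaving an unknown machine alone rather than destroying it.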

pandemicsyn added 2 commits April 3, 2026 17:13

feat(kiloclaw): add stop recovery admin tooling

5f86bba

fix(kiloclaw): trigger recovery on first stopped alarm

494724c

pandemicsyn wants to merge 2 commits into main from florian/chore/recover-stopped
