feat(kiloclaw): add stop recovery admin tooling#1993
Draft
pandemicsyn wants to merge 2 commits into main
Summary
Add unexpected-stop recovery support to KiloClaw and surface that lifecycle in the admin and user UI.
This changes the worker to move an unexpectedly stopped Fly machine into a dedicated `recovering` state, attempt a one-shot recovery by relocating onto a replacement volume and machine, retain the old volume only when snapshots exist, and clean that retained volume up later. Recovery is now triggered on the first observed Fly `stopped` state while the DO still believes the instance is `running`; Fly `created` no longer participates in this path.
Old volumes without snapshots are deleted immediately after a successful cutover; old volumes with snapshots are retained for 7 days and then cleaned up automatically, with an admin override to delete them early. It also adds admin-facing visibility and cleanup controls for the retained recovery volume, plus user-facing action gating so recovery is treated as a busy machine lifecycle state.
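The trigger condition and the old-volume policy described above can be sketched as two small pure functions. This is an illustrative sketch only: the type names, function names, and the `deleteAfter` field are hypothetical and are not taken from the PR; only the `running`/`recovering` and `stopped`/`created` states, the snapshot rule, and the 7-day retention come from the summary.

```typescript
// Hypothetical types for illustration; the real DO status and Fly state
// unions in the codebase may contain more members.
type DoStatus = "running" | "recovering" | "stopped";
type FlyState = "created" | "started" | "stopped";

// 7-day retention window stated in the summary.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Recovery starts only when Fly reports `stopped` while the DO still
// believes the instance is `running`. Fly `created` never triggers it.
function shouldStartRecovery(doStatus: DoStatus, flyState: FlyState): boolean {
  return doStatus === "running" && flyState === "stopped";
}

type OldVolumePlan =
  | { action: "delete_now" }
  | { action: "retain"; deleteAfter: number };

// After a successful cutover: delete the old volume immediately when it
// has no snapshots, otherwise retain it until the window elapses.
function planOldVolume(hasSnapshots: boolean, cutoverAtMs: number): OldVolumePlan {
  return hasSnapshots
    ? { action: "retain", deleteAfter: cutoverAtMs + RETENTION_MS }
    : { action: "delete_now" };
}
```

Keeping both decisions as pure functions would make them easy to unit test without touching the Fly API; whether the PR structures them this way is not stated.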
Architecturally, the unexpected-stop recovery flow is now extracted out of the instance DO class into a dedicated `kiloclaw-instance/recovery.ts` module, while `index.ts` remains the dispatcher/orchestrator. The recovery path was also hardened so timeout and failure cleanup is shared, alarms do not race active recovery by deleting the pending recovery volume, and retained-volume cleanup verifies sandbox ownership before force-destroying any attached machine.
This also adds explicit Analytics Engine lifecycle events for the new path: recovery started, recovery succeeded, and recovery failed.
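The ownership check before a force-destroy might look roughly like the guard below. This is a sketch under assumptions: the `AttachedMachine` shape and the `canForceDestroy` name are hypothetical, and treating `kiloclaw_sandbox_id` as a machine metadata key is a guess based on the reviewer notes, not something the PR text confirms.

```typescript
// Hypothetical shape of a machine still attached to a retained volume.
// Assumption: the sandbox id is stored under the `kiloclaw_sandbox_id`
// metadata key on the machine.
interface AttachedMachine {
  id: string;
  metadata: Record<string, string | undefined>;
}

// Allow a force-destroy only when the attached machine is tagged with the
// sandbox id that owns the retained volume. A missing or mismatched tag
// means the machine may belong to someone else, so cleanup must not touch it.
function canForceDestroy(machine: AttachedMachine, expectedSandboxId: string): boolean {
  return machine.metadata["kiloclaw_sandbox_id"] === expectedSandboxId;
}
```

Failing closed on a missing tag is the conservative choice here, since destroying a machine that belongs to another sandbox is much harder to undo than a delayed cleanup.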
Verification
- `pnpm --filter kiloclaw test`: passed (50 files, 1168 tests)
- `pnpm --filter kiloclaw typecheck`: passed
- `pnpm --filter kiloclaw lint`: passed
- `pnpm --filter kiloclaw format:check`: passed
- `pnpm typecheck`: passed
- `pnpm lint`: passed
- `pnpm format:check`: passed
- `pnpm test`: failed in the local environment because the DB-backed Jest suites could not connect to the test database (`cleanupDbForTest` / `src/lib/drizzle.ts`)
- `git push origin florian/chore/recover-stopped`: passed repo hooks (`format:check`, affected package typecheck, lint)

Visual Changes
- `recovering` as a distinct busy state

Reviewer Notes
- `reconcile.ts`, `index.ts`, `fly-machines.ts`, and the new `recovery.ts`; `recovery.ts` is the best place to review the new lifecycle end-to-end.
- Analytics Engine events: `instance.unexpected_stop_recovery_started`, `instance.unexpected_stop_recovery_succeeded`, and `instance.unexpected_stop_recovery_failed`.
- Recovery triggers on Fly `stopped` while the DO still thinks the instance is `running`; Fly `created` is no longer treated as an unexpected-stop signal.
- Retained-volume cleanup verifies `kiloclaw_sandbox_id` before force-destroying an attached machine.
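
For reviewers checking event names at call sites, the three lifecycle events named in these notes could be modeled as a closed set, for example as below. The constant, type, and function names are hypothetical; only the three event name strings come from the PR.

```typescript
// The three event names from the PR, keyed by a hypothetical outcome label.
const RECOVERY_EVENTS = {
  started: "instance.unexpected_stop_recovery_started",
  succeeded: "instance.unexpected_stop_recovery_succeeded",
  failed: "instance.unexpected_stop_recovery_failed",
} as const;

type RecoveryOutcome = keyof typeof RECOVERY_EVENTS;
type RecoveryEventName = (typeof RECOVERY_EVENTS)[RecoveryOutcome];

// Map an outcome to its event name; the compiler rejects unknown outcomes.
function recoveryEventName(outcome: RecoveryOutcome): RecoveryEventName {
  return RECOVERY_EVENTS[outcome];
}
```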