feat(bootstrap): resume gateway from existing state and persist SSH handshake secret#488
Open
…andshake secret

Add a resume code path to gateway start so existing Docker volume state (k3s, etcd, sandboxes, secrets) is reused instead of requiring a full destroy/recreate cycle. When the container is gone but the volume remains (e.g. Docker restart), the CLI automatically creates a new container with the existing volume and reconciles PKI and secrets.

Move the SSH handshake HMAC secret from ephemeral generation in the cluster entrypoint (regenerated on every container start) to a Kubernetes Secret that persists in etcd on the Docker volume. This ensures sandbox SSH sessions survive container restarts.

Key changes:
- Add DeployOptions.resume flag with resume branch in deploy flow
- Add cleanup_gateway_container for volume-preserving failure cleanup
- Auto-resume in gateway_admin_deploy (stopped/volume-only states)
- Auto-bootstrap tries resume first, falls back to recreate
- Add unless-stopped Docker restart policy to gateway container
- Reconcile SSH handshake secret as K8s Secret alongside TLS PKI
- Update Helm chart to read secret via secretKeyRef
- Add SSH handshake secret to cluster health check

Closes #487
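For reviewers, the "reuse if present, generate if missing" reconciliation described above can be sketched roughly as follows. This is a minimal illustration only: `SecretStore`, `MemoryStore`, and `reconcile_secret` are hypothetical names, the store is an in-memory map rather than the Kubernetes API, and the generator is passed in as a closure so the sketch stays free of external crates.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the cluster's secret storage (not the real API).
pub trait SecretStore {
    fn get(&self, name: &str) -> Option<String>;
    fn put(&mut self, name: &str, value: String);
}

/// In-memory store used only for illustration.
#[derive(Default)]
pub struct MemoryStore {
    secrets: HashMap<String, String>,
}

impl SecretStore for MemoryStore {
    fn get(&self, name: &str) -> Option<String> {
        self.secrets.get(name).cloned()
    }

    fn put(&mut self, name: &str, value: String) {
        self.secrets.insert(name.to_string(), value);
    }
}

/// Reuse the stored secret if one exists; otherwise generate and persist one.
/// Returns the secret and a flag telling whether it was newly generated.
pub fn reconcile_secret<S, G>(store: &mut S, name: &str, generate: G) -> (String, bool)
where
    S: SecretStore,
    G: FnOnce() -> String,
{
    if let Some(existing) = store.get(name) {
        return (existing, false);
    }
    let fresh = generate();
    store.put(name, fresh.clone());
    (fresh, true)
}
```

Calling `reconcile_secret` twice against the same store returns the first value both times, which is the property that lets SSH sessions survive a container restart.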
johntmyers
previously approved these changes
Mar 19, 2026
On resume after container kill, ensure_network destroys and recreates the Docker network with a new ID. The stopped container still referenced the old network ID, causing 'network not found' on start. Fix by reconciling the container's network attachment in ensure_container.

Also, reconcile_pki was attempting to load K8s secrets before k3s had booted, failing transiently and regenerating PKI unnecessarily. This triggered a server rollout restart, causing TLS errors. Fix by waiting for the openshell namespace before attempting to read existing secrets.

Add gRPC readiness check to gateway_admin_deploy so the CLI waits for the server to accept connections before declaring the gateway ready.

Add e2e test covering container kill, stale network, sandbox persistence, and sandbox create after resume.
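Both waits mentioned above (for the openshell namespace and for the gRPC server) follow a poll-until-ready shape. A minimal sketch of that shape, assuming a fixed polling interval and a caller-supplied probe; `wait_until` is a hypothetical helper, not a function from this PR:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `probe` every `interval` until it returns true or `timeout` elapses.
/// Returns true if the probe succeeded before the deadline.
pub fn wait_until<F>(mut probe: F, interval: Duration, timeout: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

The point of gating on a probe is that a transient "not ready yet" result is no longer mistaken for "secret missing", which is what caused the unnecessary PKI regeneration.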
The wait_for_healthy helper checked for 'healthy', 'running', or '✓' but openshell status outputs 'Connected'. All five gateway_resume tests were failing because the health check never matched.
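A sketch of the kind of matcher involved, assuming the fix adds 'Connected' to the accepted markers (the commit message does not say whether the old markers were kept). `looks_healthy` and `HEALTHY_MARKERS` are hypothetical names:

```rust
/// Markers treated as healthy in this sketch. "Connected" is the string the
/// commit message says `openshell status` prints; the others are the markers
/// the helper previously checked for.
const HEALTHY_MARKERS: [&str; 4] = ["Connected", "healthy", "running", "✓"];

/// Returns true if the status output contains any accepted marker.
pub fn looks_healthy(status_output: &str) -> bool {
    HEALTHY_MARKERS
        .iter()
        .any(|marker| status_output.contains(marker))
}
```

Plain substring matching is loose (for example, "unhealthy" contains "healthy"), so a stricter check may be preferable if the status format is stable.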
…ternally

The deploy flow now auto-detects whether to resume by checking for existing gateway state inside deploy_gateway_with_logs. Callers no longer need to compute and pass a resume flag. The explicit gateway start path still short-circuits for already-running gateways to avoid redundant work.
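The decision described here can be pictured as a small state table. The sketch below is an assumption-laden illustration: `GatewayState`, `DeployAction`, and `plan_deploy` are hypothetical names, and the handling of the "nothing exists yet" case is a guess rather than something stated in the PR.

```rust
/// Hypothetical summary of what exists on the Docker host.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatewayState {
    Running,
    Stopped,
    VolumeOnly,
    Absent,
}

/// Hypothetical outcome of the deploy decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeployAction {
    AlreadyRunning,
    Resume,
    Recreate,
    FreshDeploy,
}

/// Pick an action from the detected state and the `--recreate` flag.
pub fn plan_deploy(state: GatewayState, recreate: bool) -> DeployAction {
    match (state, recreate) {
        // Assumption: with no prior state there is nothing to resume or destroy.
        (GatewayState::Absent, _) => DeployAction::FreshDeploy,
        // `--recreate` still destroys existing state.
        (_, true) => DeployAction::Recreate,
        // Already running: short-circuit to avoid redundant work.
        (GatewayState::Running, false) => DeployAction::AlreadyRunning,
        // Stopped container or volume-only: reuse the existing volume.
        (GatewayState::Stopped | GatewayState::VolumeOnly, false) => DeployAction::Resume,
    }
}
```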
The gateway returns HTTP 412 (Precondition Failed) when the sandbox pod exists but hasn't reached Ready phase yet. This is a transient state after allocation. Instead of failing immediately, retry with exponential backoff (1s to 8s) for up to 60 seconds.
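To make the retry timing concrete, here is a sketch of a delay schedule that doubles from 1s, caps at 8s, and stops once the summed waits would exceed the 60s budget. `backoff_schedule` is a hypothetical helper, and counting only sleep time (not request time) against the budget is an assumption.

```rust
use std::time::Duration;

/// Delays doubling from `initial` up to `cap`, stopping once the summed
/// waits would exceed `budget`. Request time is not counted (assumption).
pub fn backoff_schedule(initial: Duration, cap: Duration, budget: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    // A zero initial delay would never consume the budget; return nothing.
    if initial.is_zero() {
        return delays;
    }
    let mut next = initial;
    let mut total = Duration::ZERO;
    while total + next <= budget {
        delays.push(next);
        total += next;
        next = (next * 2).min(cap);
    }
    delays
}
```

With 1s, 8s, and 60s as inputs, this yields waits of 1, 2, 4, then six waits of 8 seconds (55 seconds in total), so a sandbox pod that becomes Ready late in the window is still picked up.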
- Remove duplicate Duration import and use unqualified Duration in ssh.rs
- Prefix unused default_image parameter with underscore in sandbox/mod.rs
- Make SecretResolver pub to match its use in pub function signature
Summary
Add gateway resume from existing Docker volume state and persist the SSH handshake HMAC secret as a Kubernetes Secret, so `openshell gateway start` recovers gracefully after Docker restarts without losing sandboxes or breaking SSH sessions.
Related Issue
Closes #487
Changes
Gateway Resume
- Add `DeployOptions.resume` flag with a resume branch in `deploy_gateway_with_logs` that falls through to idempotent `ensure_*` calls instead of erroring or destroying
- `gateway_admin_deploy` auto-resumes for stopped/volume-only states; already-running returns immediately; `--recreate` still destroys
- Auto-bootstrap (`sandbox create`) tries resume first, falls back to recreate on failure (logged at `warn`)
- Add `cleanup_gateway_container` for volume-preserving cleanup on resume failure
- Add `unless-stopped` Docker restart policy so the container auto-restarts on Docker daemon restart
SSH Handshake Secret Persistence
- Add `reconcile_ssh_handshake_secret` in bootstrap: checks if the K8s secret exists, reuses it if present, generates a new one if missing (same pattern as TLS PKI reconciliation)
- Helm chart reads `OPENSHELL_SSH_HANDSHAKE_SECRET` via `secretKeyRef` instead of a plain value
- Remove ephemeral secret generation from `cluster-entrypoint.sh`
- Remove `sshHandshakeSecret` from HelmChart CR values; add `sshHandshakeSecretName` to `values.yaml`
- Update `cluster-deploy-fast.sh` to create the K8s secret directly via kubectl
Testing
- `mise run pre-commit` passes (format, lint, license headers)
- `cargo test --package openshell-bootstrap --package openshell-cli`: all 163 tests pass
- E2E tests (`mise run e2e`): requires running cluster; these changes affect sandbox lifecycle and should be validated with a running gateway
Checklist