Description
Problem Statement
When Docker restarts (or the cluster container is lost for any reason), running openshell gateway start offers only two options:
- Destroy everything and recreate — loses all sandboxes, k3s state, etcd data
- Do nothing — pretends the gateway exists but nothing is actually running
There is no path to resume from the existing Docker volume state. Users lose all their sandboxes on every Docker restart.
Additionally, the SSH handshake HMAC secret is regenerated on every container start (in cluster-entrypoint.sh) and injected via sed into the HelmChart CR. This means every container restart invalidates existing sandbox SSH sessions, even though the sandboxes themselves may still be running.
Proposed Design
Gateway Resume
When openshell gateway start is called and existing state is detected (Docker volume exists, but container is stopped or gone), the gateway should automatically resume from that state instead of prompting for destroy/recreate.
New behavior by state:
| State | Current behavior | New behavior |
|---|---|---|
| Running | Prompt "Destroy and recreate?" | Return immediately ("already running") |
| Stopped container | Prompt "Destroy and recreate?" | Auto-resume |
| Volume only (no container) | Prompt "Destroy and recreate?" | Auto-resume |
| `--recreate` flag | Destroy + redeploy | No change |
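The table above can be read as a small decision function. The following is a minimal sketch of that mapping, not the actual bootstrap code: the `GatewayState` and `StartAction` enums, the `plan_start` function, and the `Absent` state (nothing exists yet, so a fresh deploy is assumed) are all hypothetical names introduced for illustration.

```rust
/// Observed state of the gateway's Docker resources.
/// Hypothetical type; `Absent` is an assumed extra state not listed in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GatewayState {
    Running,
    StoppedContainer,
    VolumeOnly,
    Absent,
}

/// What `gateway start` should do next. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartAction {
    AlreadyRunning,
    Resume,
    Recreate,
    FreshDeploy,
}

/// Maps (state, --recreate) to an action, following the "New behavior" column.
fn plan_start(state: GatewayState, recreate: bool) -> StartAction {
    match (state, recreate) {
        // Nothing to resume or destroy: deploy from scratch (assumption).
        (GatewayState::Absent, _) => StartAction::FreshDeploy,
        // --recreate keeps its current meaning: destroy + redeploy.
        (_, true) => StartAction::Recreate,
        (GatewayState::Running, false) => StartAction::AlreadyRunning,
        (GatewayState::StoppedContainer | GatewayState::VolumeOnly, false) => {
            StartAction::Resume
        }
    }
}
```

Keeping the decision in a pure function like this would make the state table directly unit-testable, independent of Docker.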
Implementation approach:
- Add a `resume: bool` flag to `DeployOptions` in `openshell-bootstrap`
- When resuming, skip the `destroy_gateway_resources` call and fall through to the existing idempotent `ensure_*` calls (`ensure_network`, `ensure_volume`, `ensure_container`, `start_container`)
- The existing `clean_stale_nodes` handles stale k3s node entries from prior containers
- The existing `reconcile_pki` reuses valid TLS secrets from the volume's k3s state
- On resume failure, only clean up the container/network; preserve the volume so the user can retry
- The auto-bootstrap path (`sandbox create`) should try resume first, falling back to recreate if resume fails
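The last two bullets describe the failure handling for the auto-bootstrap path: try resume, clean up only the container/network if it fails, then fall back to recreate. A rough sketch of that control flow follows. The `GatewayOps` trait, `Outcome` enum, `resume_then_recreate` function, and `FakeGateway` test double are hypothetical and stand in for the real bootstrap functions, whose signatures are not shown in this issue.

```rust
/// Result of the auto-bootstrap attempt. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Resumed,
    Recreated,
}

/// Hypothetical abstraction over the bootstrap operations; not the real API.
trait GatewayOps {
    /// Attempts to bring the gateway back up from the existing volume.
    fn resume(&mut self) -> Result<(), String>;
    /// Removes the container and network only; the volume is left untouched.
    fn cleanup_container_and_network(&mut self);
    /// Destroys existing resources and redeploys from scratch.
    fn recreate(&mut self) -> Result<(), String>;
}

/// Resume first; on failure clean up container/network, then recreate.
fn resume_then_recreate<G: GatewayOps>(ops: &mut G) -> Result<Outcome, String> {
    match ops.resume() {
        Ok(()) => Ok(Outcome::Resumed),
        Err(resume_err) => {
            ops.cleanup_container_and_network();
            match ops.recreate() {
                Ok(()) => Ok(Outcome::Recreated),
                Err(recreate_err) => Err(format!(
                    "resume failed: {}; recreate failed: {}",
                    resume_err, recreate_err
                )),
            }
        }
    }
}

/// In-memory fake that records the order of calls, for illustration only.
struct FakeGateway {
    resume_ok: bool,
    calls: Vec<&'static str>,
}

impl GatewayOps for FakeGateway {
    fn resume(&mut self) -> Result<(), String> {
        self.calls.push("resume");
        if self.resume_ok {
            Ok(())
        } else {
            Err("container failed to start".to_string())
        }
    }

    fn cleanup_container_and_network(&mut self) {
        self.calls.push("cleanup");
    }

    fn recreate(&mut self) -> Result<(), String> {
        self.calls.push("recreate");
        Ok(())
    }
}
```

For the interactive `gateway start` path, the same shape would apply without the recreate step: return the resume error after cleanup so the user can retry against the preserved volume.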
Docker restart policy:
- Add `unless-stopped` restart policy to the container's `HostConfig` so Docker automatically restarts the container when the daemon comes back
- `gateway stop` explicitly stops the container, which overrides the `unless-stopped` policy
SSH Handshake Secret Persistence
Move the SSH handshake secret from ephemeral generation in cluster-entrypoint.sh to a proper Kubernetes Secret that persists in etcd (on the Docker volume):
- Create a K8s secret `openshell-ssh-handshake` with a `secret` key containing the hex-encoded HMAC key
- The bootstrap Rust code reconciles this secret the same way it reconciles TLS PKI: check if it exists, reuse if valid, generate new if missing
- The Helm chart StatefulSet references the secret via `secretKeyRef` instead of a plain value
- Remove the sed-based injection from `cluster-entrypoint.sh` and the `__SSH_HANDSHAKE_SECRET__` placeholder from the HelmChart CR
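The "check if it exists, reuse if valid, generate new if missing" reconciliation can be sketched as below. This is an illustration under stated assumptions, not the proposed `reconcile_ssh_handshake_secret` itself: the cluster is replaced by an in-memory `SecretStore` map, key generation is passed in as a closure, and "valid" is assumed to mean a 64-character hex string (a 32-byte key), which the issue does not specify.

```rust
use std::collections::HashMap;

const SSH_HANDSHAKE_SECRET_NAME: &str = "openshell-ssh-handshake";
const SECRET_KEY: &str = "secret";

/// Stand-in for the cluster's secrets: secret name -> (data key -> value).
/// Hypothetical; the real code would talk to the Kubernetes API.
type SecretStore = HashMap<String, HashMap<String, String>>;

/// Assumed validity rule: 32 bytes, hex-encoded (64 hex characters).
fn is_valid_hex_key(value: &str) -> bool {
    value.len() == 64 && value.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Returns the handshake key and whether a new one had to be written.
fn reconcile_handshake_secret(
    store: &mut SecretStore,
    generate: impl FnOnce() -> String,
) -> (String, bool) {
    // Reuse the stored key when it exists and passes validation.
    if let Some(existing) = store
        .get(SSH_HANDSHAKE_SECRET_NAME)
        .and_then(|data| data.get(SECRET_KEY))
    {
        if is_valid_hex_key(existing) {
            return (existing.clone(), false);
        }
    }

    // Missing or invalid: generate a fresh key and persist it.
    let fresh = generate();
    store
        .entry(SSH_HANDSHAKE_SECRET_NAME.to_string())
        .or_default()
        .insert(SECRET_KEY.to_string(), fresh.clone());
    (fresh, true)
}
```

Because the second call finds a valid key and returns it unchanged, repeated container starts would keep handing the same secret to the server and sandboxes, which is the property the proposal relies on.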
Files involved:
| File | Change |
|---|---|
| `crates/openshell-bootstrap/src/constants.rs` | New `SSH_HANDSHAKE_SECRET_NAME` constant |
| `crates/openshell-bootstrap/src/docker.rs` | Restart policy in `ensure_container`, `cleanup_gateway_container` function |
| `crates/openshell-bootstrap/src/lib.rs` | `DeployOptions.resume`, resume branch in `deploy_gateway_with_logs`, `reconcile_ssh_handshake_secret` |
| `crates/openshell-cli/src/run.rs` | `gateway_admin_deploy` auto-resume logic |
| `crates/openshell-cli/src/bootstrap.rs` | Auto-bootstrap resume-first with recreate fallback |
| `deploy/docker/cluster-entrypoint.sh` | Remove SSH secret generation/sed injection |
| `deploy/docker/cluster-healthcheck.sh` | Add SSH handshake secret health check |
| `deploy/helm/openshell/templates/statefulset.yaml` | `secretKeyRef` instead of plain value |
| `deploy/helm/openshell/values.yaml` | `sshHandshakeSecretName` instead of `sshHandshakeSecret` |
| `deploy/kube/manifests/openshell-helmchart.yaml` | Remove `sshHandshakeSecret` placeholder |
| `tasks/scripts/cluster-deploy-fast.sh` | Use K8s secret instead of Helm values |
Alternatives Considered
- Only add restart policy, no resume logic: handles the common Docker restart case but doesn't cover `gateway stop` then `gateway start`, manual `docker rm`, or OOM kills.
- Prompt for resume vs recreate: adds friction to the common case. Since `--recreate` exists for explicit clean starts, auto-resume is better UX.
- Store SSH handshake secret on host filesystem: would survive container restarts but creates a separate state management problem. K8s secrets are already the pattern used for TLS PKI and live in etcd on the persistent volume.
Agent Investigation
Traced the full code path for openshell gateway start:
- `check_existing_gateway` (docker.rs:1030) detects volume/container state
- `deploy_gateway_with_logs` (lib.rs:293) had a binary "destroy or error" check, with no resume path
- `ensure_volume` (docker.rs:383) is already idempotent (no-op if volume exists)
- `ensure_container` (docker.rs:446) handles "no container" by creating one with existing volume
- `clean_stale_nodes` (runtime.rs:376) removes stale k3s NotReady nodes
- `reconcile_pki` (lib.rs:798) reuses existing TLS secrets from the volume
- `cluster-entrypoint.sh` copies fresh charts/manifests on every start and k3s handles data directory resume
SSH handshake secret traced through 7 stages: generation in entrypoint, sed into HelmChart CR, Helm values, StatefulSet env var, server startup, sandbox pod injection, sandbox SSH verification. The secret was regenerated on every container start, breaking existing SSH sessions.
Definition of Done
- `openshell gateway start` auto-resumes from existing volume state (no prompt)
- `openshell gateway start --recreate` still destroys and rebuilds from scratch
- Container has `unless-stopped` restart policy
- SSH handshake secret persists as K8s secret `openshell-ssh-handshake`
- SSH handshake secret survives container restart (sandbox SSH sessions remain valid)
- Auto-bootstrap (`sandbox create`) tries resume first, falls back to recreate
- Resume failure preserves the Docker volume (no data loss on transient errors)
- Health check verifies SSH handshake secret exists
- `cluster-deploy-fast.sh` creates/reuses K8s secret instead of Helm values
- All existing tests pass