feat: gateway resume from existing state and persistent SSH handshake secret #487

@drew

Description

Problem Statement

When Docker restarts (or the cluster container is lost for any reason), running openshell gateway start offers only two options:

  1. Destroy everything and recreate — loses all sandboxes, k3s state, etcd data
  2. Do nothing — pretends the gateway exists but nothing is actually running

There is no path to resume from the existing Docker volume state. Users lose all their sandboxes on every Docker restart.

Additionally, the SSH handshake HMAC secret is regenerated on every container start (in cluster-entrypoint.sh) and injected via sed into the HelmChart CR. This means every container restart invalidates existing sandbox SSH sessions, even though the sandboxes themselves may still be running.

Proposed Design

Gateway Resume

When openshell gateway start is called and existing state is detected (Docker volume exists, but container is stopped or gone), the gateway should automatically resume from that state instead of prompting for destroy/recreate.

New behavior by state:

| State | Current behavior | New behavior |
| --- | --- | --- |
| Running | Prompt "Destroy and recreate?" | Return immediately ("already running") |
| Stopped container | Prompt "Destroy and recreate?" | Auto-resume |
| Volume only (no container) | Prompt "Destroy and recreate?" | Auto-resume |
| --recreate flag | Destroy + redeploy | No change |
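
The state table can be read as a small decision function. The following is a minimal sketch, not the actual openshell-bootstrap code: the GatewayState and StartAction enums and the plan_start function are hypothetical names, and the Absent case (no volume, no container) is an assumption, since the table does not cover it.

```rust
/// Gateway state as detected from Docker (hypothetical model for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatewayState {
    Running,
    StoppedContainer,
    VolumeOnly,
    /// Assumption: nothing exists yet; the table above does not list this case.
    Absent,
}

/// What `gateway start` should do next (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartAction {
    AlreadyRunning,
    Resume,
    Recreate,
    FreshDeploy,
}

/// Maps (detected state, --recreate flag) to an action, following the
/// "New behavior" column of the table.
pub fn plan_start(state: GatewayState, recreate: bool) -> StartAction {
    match (state, recreate) {
        // Nothing to destroy or resume: deploy from scratch (assumption).
        (GatewayState::Absent, _) => StartAction::FreshDeploy,
        // --recreate keeps today's destroy + redeploy behavior.
        (_, true) => StartAction::Recreate,
        (GatewayState::Running, false) => StartAction::AlreadyRunning,
        (GatewayState::StoppedContainer | GatewayState::VolumeOnly, false) => {
            StartAction::Resume
        }
    }
}
```

Keeping the decision in a pure function like this makes the no-prompt behavior easy to unit test without a Docker daemon.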

Implementation approach:

  • Add a resume: bool flag to DeployOptions in openshell-bootstrap
  • When resuming, skip the destroy_gateway_resources call and fall through to the existing idempotent ensure_* calls (ensure_network, ensure_volume, ensure_container, start_container)
  • The existing clean_stale_nodes handles stale k3s node entries from prior containers
  • The existing reconcile_pki reuses valid TLS secrets from the volume's k3s state
  • On resume failure, only clean up the container/network — preserve the volume so the user can retry
  • The auto-bootstrap path (sandbox create) should try resume first, falling back to recreate if resume fails
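
The resume-first, recreate-on-failure flow for the auto-bootstrap path could look roughly like the sketch below. It is written against closures rather than the real deploy functions: Outcome, resume_then_recreate, and both closure parameters are hypothetical, and the assumption is that the cleanup step removes only the container and network, never the volume.

```rust
/// Which path ended up bringing the gateway up (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Resumed,
    Recreated,
}

/// Tries a resume deploy first and falls back to a recreate deploy.
///
/// `deploy(resume)` stands in for a deploy call with the `resume` flag set
/// or cleared. `cleanup_container_and_network` stands in for a cleanup that
/// leaves the Docker volume untouched (assumption).
pub fn resume_then_recreate(
    mut deploy: impl FnMut(bool) -> Result<(), String>,
    mut cleanup_container_and_network: impl FnMut(),
) -> Result<Outcome, String> {
    match deploy(true) {
        Ok(()) => Ok(Outcome::Resumed),
        Err(resume_err) => {
            // Drop the half-started container and network, keep the volume.
            cleanup_container_and_network();
            deploy(false)
                .map(|()| Outcome::Recreated)
                .map_err(|recreate_err| {
                    format!(
                        "resume failed ({}); recreate failed ({})",
                        resume_err, recreate_err
                    )
                })
        }
    }
}
```

For the interactive gateway start path, the proposal stops after the cleanup step so the user can retry; only the auto-bootstrap path continues to the recreate fallback.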

Docker restart policy:

  • Add unless-stopped restart policy to the container's HostConfig so Docker automatically restarts the container when the daemon comes back
  • gateway stop explicitly stops the container; under unless-stopped, Docker does not restart a manually stopped container, even after the daemon restarts

SSH Handshake Secret Persistence

Move the SSH handshake secret from ephemeral generation in cluster-entrypoint.sh to a proper Kubernetes Secret that persists in etcd (on the Docker volume):

  • Create a K8s secret openshell-ssh-handshake with a data key named "secret" containing the hex-encoded HMAC key
  • The bootstrap Rust code reconciles this secret the same way it reconciles TLS PKI — check if it exists, reuse if valid, generate new if missing
  • The Helm chart StatefulSet references the secret via secretKeyRef instead of a plain value
  • Remove the sed-based injection from cluster-entrypoint.sh and the __SSH_HANDSHAKE_SECRET__ placeholder from the HelmChart CR
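
The "reuse if valid, generate if missing" rule can be sketched independently of the Kubernetes client. Everything below is illustrative: Reconciled, is_valid_hex_key, to_hex, and reconcile_handshake_secret are hypothetical names, the validity check (non-empty, even-length hex) is an assumption, and treating an invalid stored value the same as a missing one is also an assumption. The random bytes are supplied by the caller so the sketch does not depend on a particular RNG crate.

```rust
/// Result of reconciling the stored secret value (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Reconciled {
    /// The value already in the K8s secret was kept.
    Reused(String),
    /// A new hex-encoded key was produced and must be written back.
    Generated(String),
}

/// Assumed validity rule: non-empty, even-length, hex digits only.
pub fn is_valid_hex_key(s: &str) -> bool {
    !s.is_empty() && s.len() % 2 == 0 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Lowercase hex encoding of raw key bytes.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// `existing` is the current value read from the secret, if any.
/// `generate` returns fresh random key bytes and is only called when needed.
pub fn reconcile_handshake_secret(
    existing: Option<&str>,
    generate: impl FnOnce() -> Vec<u8>,
) -> Reconciled {
    match existing {
        Some(value) if is_valid_hex_key(value) => Reconciled::Reused(value.to_string()),
        // Missing, or present but malformed (assumption): generate a new key.
        _ => Reconciled::Generated(to_hex(&generate())),
    }
}
```

Because generate is only invoked on the fallback branch, a healthy volume never sees a new key, which is what keeps existing sandbox SSH sessions valid across restarts.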

Files involved:

| File | Change |
| --- | --- |
| crates/openshell-bootstrap/src/constants.rs | New SSH_HANDSHAKE_SECRET_NAME constant |
| crates/openshell-bootstrap/src/docker.rs | Restart policy in ensure_container, cleanup_gateway_container function |
| crates/openshell-bootstrap/src/lib.rs | DeployOptions.resume, resume branch in deploy_gateway_with_logs, reconcile_ssh_handshake_secret |
| crates/openshell-cli/src/run.rs | gateway_admin_deploy auto-resume logic |
| crates/openshell-cli/src/bootstrap.rs | Auto-bootstrap resume-first with recreate fallback |
| deploy/docker/cluster-entrypoint.sh | Remove SSH secret generation/sed injection |
| deploy/docker/cluster-healthcheck.sh | Add SSH handshake secret health check |
| deploy/helm/openshell/templates/statefulset.yaml | secretKeyRef instead of plain value |
| deploy/helm/openshell/values.yaml | sshHandshakeSecretName instead of sshHandshakeSecret |
| deploy/kube/manifests/openshell-helmchart.yaml | Remove sshHandshakeSecret placeholder |
| tasks/scripts/cluster-deploy-fast.sh | Use K8s secret instead of Helm values |

Alternatives Considered

  1. Only add restart policy, no resume logic — Handles the common Docker restart case but doesn't cover gateway stop then gateway start, manual docker rm, or OOM kills.

  2. Prompt for resume vs recreate — Adds friction to the common case. Since --recreate exists for explicit clean starts, auto-resume is better UX.

  3. Store SSH handshake secret on host filesystem — Would survive container restarts but creates a separate state management problem. K8s secrets are already the pattern used for TLS PKI and live in etcd on the persistent volume.

Agent Investigation

Traced the full code path for openshell gateway start:

  • check_existing_gateway (docker.rs:1030) detects volume/container state
  • deploy_gateway_with_logs (lib.rs:293) had a binary "destroy or error" check — no resume path
  • ensure_volume (docker.rs:383) is already idempotent (no-op if volume exists)
  • ensure_container (docker.rs:446) handles "no container" by creating one with existing volume
  • clean_stale_nodes (runtime.rs:376) removes stale k3s NotReady nodes
  • reconcile_pki (lib.rs:798) reuses existing TLS secrets from the volume
  • cluster-entrypoint.sh copies fresh charts/manifests on every start and k3s handles data directory resume

SSH handshake secret traced through 7 stages: generation in entrypoint, sed into HelmChart CR, Helm values, StatefulSet env var, server startup, sandbox pod injection, sandbox SSH verification. The secret was regenerated on every container start, breaking existing SSH sessions.

Definition of Done

  • openshell gateway start auto-resumes from existing volume state (no prompt)
  • openshell gateway start --recreate still destroys and rebuilds from scratch
  • Container has unless-stopped restart policy
  • SSH handshake secret persists as K8s secret openshell-ssh-handshake
  • SSH handshake secret survives container restart (sandbox SSH sessions remain valid)
  • Auto-bootstrap (sandbox create) tries resume first, falls back to recreate
  • Resume failure preserves the Docker volume (no data loss on transient errors)
  • Health check verifies SSH handshake secret exists
  • cluster-deploy-fast.sh creates/reuses K8s secret instead of Helm values
  • All existing tests pass

Labels: area:cluster (Related to running OpenShell on k3s/docker); area:gateway (Gateway server and control-plane work)
