
feat(cluster): add optional backup controller container #46

Open

dennisklein wants to merge 4 commits into next from backup-controller

Conversation

@dennisklein
Member

Adds an optional second controller container spawned alongside the primary when a controller node spec sets backupController: true. The backup comes up idle — slurmctld is intentionally not enabled on it — so it can stand in as the passive half of an active/passive slurmctld setup.

```yaml
kind: Cluster
name: dev
nodes:
  - role: controller
    backupController: true
  - role: worker
    count: 2
```

Related: #28 — that issue asks for an unmanaged primary controller. The backup controller approach is complementary: the primary keeps running normally and the backup container is the unmanaged one.

Changes

  • pkg/config: new backupController field on controller node spec, rejected on non-controller roles
  • pkg/cluster: clones the controller RunConfig in NodeRunConfigs when set, reusing every field except ShortName (controller-backup) and ContainerNumber — identical resources, volumes, caps, devices, security opts
  • pkg/cluster/create: enableSlurm skips the backup container so slurmctld never starts there
  • pkg/cluster/status: roleServices returns no Slurm services for the backup, so sind status doesn't probe a service that isn't meant to run
  • Docs updated under docs/content/configuration and docs/content/architecture
  • Drive-by: removed the unused exported EnableSlurmServices helper (dead code since the create orchestrator landed); the package-private enableSlurm is the sole production path
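The RunConfig clone described above can be sketched roughly like this. The type and function names here are illustrative stand-ins, not the actual pkg/cluster API; only ControllerBackupShortName and the "copy everything except ShortName and ContainerNumber" rule come from the PR:

```go
package main

import "fmt"

// RunConfig is a hypothetical, trimmed-down stand-in for the real
// pkg/cluster run config (which also carries caps, devices, security opts).
type RunConfig struct {
	ShortName       string
	ContainerNumber int
	CPUs            int
	Volumes         []string
}

const ControllerBackupShortName = "controller-backup"

// cloneForBackup mirrors the described behavior: copy every field of the
// primary controller's config, then override only ShortName and
// ContainerNumber. Note the struct copy is shallow, so slice fields like
// Volumes share their backing arrays with the primary.
func cloneForBackup(primary RunConfig, nextNumber int) RunConfig {
	backup := primary
	backup.ShortName = ControllerBackupShortName
	backup.ContainerNumber = nextNumber
	return backup
}

func main() {
	primary := RunConfig{ShortName: "controller", ContainerNumber: 0, CPUs: 2, Volumes: []string{"/shared"}}
	backup := cloneForBackup(primary, 1)
	fmt.Println(backup.ShortName, backup.ContainerNumber, backup.CPUs, backup.Volumes)
}
```

A plain struct assignment is all the "clone" needs here, which is why the backup container ends up with identical resources, volumes, caps, and devices by construction.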

EnableSlurmServices was added before the create orchestrator and has
been dead code ever since. The only production path that enables Slurm
services is the package-private enableSlurm in create.go, which is
richer: it runs per-node enablement concurrently through an errgroup,
integrates with the readiness probes, and hooks into the monitor
event watcher.

Keeping the two implementations in parallel doubled the maintenance
surface — any change to service enablement (like honoring the backup
controller) had to be written twice. Drop the exported helper and its
tests; enableSlurm is the single source of truth.
- add BackupController bool on Node decoded from backupController YAML key
- reject backupController on non-controller roles in Validate
- extend parse and validate tests to cover the new field
- add ControllerBackupShortName constant for the controller-backup
  hostname
- clone the controller RunConfig in NodeRunConfigs when
  n.BackupController is true, reusing every field except ShortName
  and ContainerNumber so the backup container has identical resources,
  volumes, caps, and devices
- skip the backup controller in enableSlurm so slurmctld is not
  started at create time; the container comes up idle, available for
  manual debug runs or active/passive experiments
- teach roleServices to return no Slurm services for the backup
  controller and thread the short name through GetNodeHealth via
  ContainerPrefix so sind status does not probe slurmctld on a
  container where it is intentionally not running
- cover run-config expansion and the create-path skip with unit tests
- node-definitions: add backupController row to the parameters table
  and a dedicated section covering the use cases, naming, DNS, and the
  fact that worker discovery and slurm.conf generation still reference
  the primary controller
- cluster-config: list the new validation rule that backupController
  is controller-only
- docker-resources: add the controller-backup row to the per-cluster
  resource naming table
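The validation rule and the status-side change from the commits above can be sketched together. Node, validateNode, and roleServices here are hypothetical stand-ins for the real pkg/config and pkg/cluster/status code:

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for the pkg/config node spec.
type Node struct {
	Role             string
	BackupController bool
}

// validateNode mirrors the rule above: backupController is rejected
// on non-controller roles.
func validateNode(n Node) error {
	if n.BackupController && n.Role != "controller" {
		return errors.New("backupController is only valid on controller nodes")
	}
	return nil
}

// roleServices mirrors the status change: the backup controller reports
// no Slurm services, so status never probes slurmctld on a container
// where it is intentionally not running.
func roleServices(role, shortName string) []string {
	if shortName == "controller-backup" {
		return nil // intentionally idle: nothing to probe
	}
	switch role {
	case "controller":
		return []string{"slurmctld"}
	case "worker":
		return []string{"slurmd"}
	}
	return nil
}

func main() {
	fmt.Println(validateNode(Node{Role: "worker", BackupController: true}))
	fmt.Println(roleServices("controller", "controller-backup"))
	fmt.Println(roleServices("controller", "controller"))
}
```

Keying the status decision on the container short name rather than the role is what makes threading ControllerBackupShortName through GetNodeHealth via ContainerPrefix necessary.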
