
feat(cluster): add optional backup controller container #46

Open

dennisklein wants to merge 4 commits into next from backup-controller

Conversation

@dennisklein
Member

Adds an optional second controller container spawned alongside the primary when a controller node spec sets backupController: true. The backup comes up idle — slurmctld is intentionally not enabled on it — so it can stand in as the passive half of an active/passive slurmctld setup.

```yaml
kind: Cluster
name: dev
nodes:
  - role: controller
    backupController: true
  - role: worker
    count: 2
```

Related: #28 — that issue asks for an unmanaged primary controller. The backup controller approach is complementary: the primary keeps running normally and the backup container is the unmanaged one.

Changes

  • pkg/config: new backupController field on controller node spec, rejected on non-controller roles
  • pkg/cluster: clones the controller RunConfig in NodeRunConfigs when set, reusing every field except ShortName (controller-backup) and ContainerNumber — identical resources, volumes, caps, devices, security opts
  • pkg/cluster/create: enableSlurm skips the backup container so slurmctld never starts there
  • pkg/cluster/status: roleServices returns no Slurm services for the backup, so sind status doesn't probe a service that isn't meant to run
  • Docs updated under docs/content/configuration and docs/content/architecture
  • Drive-by: removed the unused exported EnableSlurmServices helper (dead code since the create orchestrator landed); the package-private enableSlurm is the sole production path
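The RunConfig clone described above can be sketched roughly like this. The type and function names here are illustrative stand-ins, not the actual pkg/cluster API; only ControllerBackupShortName and the "copy everything except ShortName and ContainerNumber" rule come from the PR:

```go
package main

import "fmt"

// RunConfig is a hypothetical, trimmed-down stand-in for the real
// pkg/cluster run config (which also carries caps, devices, security opts).
type RunConfig struct {
	ShortName       string
	ContainerNumber int
	CPUs            int
	Volumes         []string
}

const ControllerBackupShortName = "controller-backup"

// cloneForBackup mirrors the described behavior: copy every field of the
// primary controller's config, then override only ShortName and
// ContainerNumber. Note the struct copy is shallow, so slice fields like
// Volumes share their backing arrays with the primary.
func cloneForBackup(primary RunConfig, nextNumber int) RunConfig {
	backup := primary
	backup.ShortName = ControllerBackupShortName
	backup.ContainerNumber = nextNumber
	return backup
}

func main() {
	primary := RunConfig{ShortName: "controller", ContainerNumber: 0, CPUs: 2, Volumes: []string{"/shared"}}
	backup := cloneForBackup(primary, 1)
	fmt.Println(backup.ShortName, backup.ContainerNumber, backup.CPUs, backup.Volumes)
}
```

A plain struct assignment is all the "clone" needs here, which is why the backup container ends up with identical resources, volumes, caps, and devices by construction.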

EnableSlurmServices was added before the create orchestrator and has
been dead code ever since. The only production path that enables Slurm
services is the package-private enableSlurm in create.go, which is
richer: it runs per-node enablement concurrently through an errgroup,
integrates with the readiness probes, and hooks into the monitor
event watcher.

Keeping the two implementations in parallel doubled the maintenance
surface — any change to service enablement (like honoring the backup
controller) had to be written twice. Drop the exported helper and its
tests; enableSlurm is the single source of truth.
- add BackupController bool on Node decoded from backupController YAML key
- reject backupController on non-controller roles in Validate
- extend parse and validate tests to cover the new field
- add ControllerBackupShortName constant for the controller-backup
  hostname
- clone the controller RunConfig in NodeRunConfigs when
  n.BackupController is true, reusing every field except ShortName
  and ContainerNumber so the backup container has identical resources,
  volumes, caps, and devices
- skip the backup controller in enableSlurm so slurmctld is not
  started at create time; the container comes up idle, available for
  manual debug runs or active/passive experiments
- teach roleServices to return no Slurm services for the backup
  controller and thread the short name through GetNodeHealth via
  ContainerPrefix so sind status does not probe slurmctld on a
  container where it is intentionally not running
- cover run-config expansion and the create-path skip with unit tests
- node-definitions: add backupController row to the parameters table
  and a dedicated section covering the use cases, naming, DNS, and the
  fact that worker discovery and slurm.conf generation still reference
  the primary controller
- cluster-config: list the new validation rule that backupController
  is controller-only
- docker-resources: add the controller-backup row to the per-cluster
  resource naming table
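The validation rule and the status-side change from the commits above can be sketched together. Node, validateNode, and roleServices here are hypothetical stand-ins for the real pkg/config and pkg/cluster/status code:

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for the pkg/config node spec.
type Node struct {
	Role             string
	BackupController bool
}

// validateNode mirrors the rule above: backupController is rejected
// on non-controller roles.
func validateNode(n Node) error {
	if n.BackupController && n.Role != "controller" {
		return errors.New("backupController is only valid on controller nodes")
	}
	return nil
}

// roleServices mirrors the status change: the backup controller reports
// no Slurm services, so status never probes slurmctld on a container
// where it is intentionally not running.
func roleServices(role, shortName string) []string {
	if shortName == "controller-backup" {
		return nil // intentionally idle: nothing to probe
	}
	switch role {
	case "controller":
		return []string{"slurmctld"}
	case "worker":
		return []string{"slurmd"}
	}
	return nil
}

func main() {
	fmt.Println(validateNode(Node{Role: "worker", BackupController: true}))
	fmt.Println(roleServices("controller", "controller-backup"))
	fmt.Println(roleServices("controller", "controller"))
}
```

Keying the status decision on the container short name rather than the role is what makes threading ControllerBackupShortName through GetNodeHealth via ContainerPrefix necessary.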
