feat(cluster): add optional backup controller container #46
Open
dennisklein wants to merge 4 commits into next
Conversation
EnableSlurmServices was added before the create orchestrator and has been dead code ever since. The only production path that enables Slurm services is the package-private enableSlurm in create.go, which is richer: it runs per-node enablement concurrently through an errgroup, integrates with the readiness probes, and hooks into the monitor event watcher. Keeping the two implementations in parallel doubled the maintenance surface — any change to service enablement (like honoring the backup controller) had to be written twice. Drop the exported helper and its tests; enableSlurm is the single source of truth.
- add BackupController bool on Node decoded from the backupController YAML key
- reject backupController on non-controller roles in Validate
- extend parse and validate tests to cover the new field
- add ControllerBackupShortName constant for the controller-backup hostname
- clone the controller RunConfig in NodeRunConfigs when n.BackupController is true, reusing every field except ShortName and ContainerNumber so the backup container has identical resources, volumes, caps, and devices
- skip the backup controller in enableSlurm so slurmctld is not started at create time; the container comes up idle, available for manual debug runs or active/passive experiments
- teach roleServices to return no Slurm services for the backup controller and thread the short name through GetNodeHealth via ContainerPrefix so sind status does not probe slurmctld on a container where it is intentionally not running
- cover run-config expansion and the create-path skip with unit tests
- node-definitions: add backupController row to the parameters table and a dedicated section covering the use cases, naming, DNS, and the fact that worker discovery and slurm.conf generation still reference the primary controller
- cluster-config: list the new validation rule that backupController is controller-only
- docker-resources: add the controller-backup row to the per-cluster resource naming table
Adds an optional second controller container spawned alongside the primary when a controller node spec sets backupController: true. The backup comes up idle — slurmctld is intentionally not enabled on it — so it can stand in as the passive half of an active/passive slurmctld setup.

Related: #28 — that issue asks for an un-managed primary controller. The backup controller approach is complementary: the primary keeps running normally and the backup container is the unmanaged one.
Changes
- pkg/config: new backupController field on the controller node spec, rejected on non-controller roles
- pkg/cluster: clones the controller RunConfig in NodeRunConfigs when set, reusing every field except ShortName (controller-backup) and ContainerNumber — identical resources, volumes, caps, devices, security opts
- pkg/cluster/create: enableSlurm skips the backup container so slurmctld never starts there
- pkg/cluster/status: roleServices returns no Slurm services for the backup, so sind status doesn't probe a service that isn't meant to run
- docs/content/configuration and docs/content/architecture: documentation updates for the new field
- removes the exported EnableSlurmServices helper (dead code since the create orchestrator landed); the package-private enableSlurm is the sole production path