Add "Install Slurm" documentation#72

Open
lunamorrow wants to merge 18 commits into OpenCHAMI:main from lunamorrow:lunamorrow/install-slurm-documentation

Conversation

@lunamorrow

Pull Request Template

Thank you for your contribution! Please ensure the following before submitting:

Checklist

  • My code follows the style guidelines of this project
  • I have added/updated comments where needed
  • I have added tests that prove my fix is effective or my feature works
  • I have run make test (or equivalent) locally and all tests pass
  • DCO Sign-off: All commits are signed off (git commit -s) with my real name and email
  • REUSE Compliance:
    • Each new/modified source file has SPDX copyright and license headers
    • Any non-commentable files include a <filename>.license sidecar
    • All referenced licenses are present in the LICENSES/ directory

Description

Contributing to the "Install Slurm" documentation under the OpenCHAMI guides.

Any feedback or suggestions about making the documentation broad enough for general purpose or to fit in well with the existing documentation are appreciated.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

For more info, see Contributing Guidelines.

…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I am in the process of adjusting the documentation to align better with the format of the OpenCHAMI tutorial to enable ease of use for OpenCHAMI users. I will be making some more commits to adjust and fine-tune the documentation, and I would appreciate feedback/suggestions as this is my first time contributing to this project.

@davidallendj
Contributor

At a glance, this looks great! I'm going to try to take time to run through this today if I get a chance.

… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into if there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.

…ck from David.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I have been notified of a security vulnerability affecting versions 0.5 to 0.5.17 of Munge. I will update the documentation next week to pin the Munge installation to version 0.5.18 or later, so we can ensure anyone following the guide isn't installing a vulnerable version of Munge.

More info: https://nvd.nist.gov/vuln/detail/CVE-2026-25506
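A sketch of how such a pin might look on a dnf-based system (the version-comparison spec and available versions are assumptions; check what your repositories actually provide):

```shell
# List the Munge versions your repositories offer
dnf list --showduplicates munge

# Install only a fixed version (0.5.18 or later); dnf accepts a
# version-comparison package spec, but verify against your repo contents
sudo dnf install 'munge >= 0.5.18'

# Confirm the installed version is outside the vulnerable 0.5-0.5.17 range
munge --version
```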

@alexlovelltroy
Member

I really appreciate all the work you're doing here. Really shaping up nicely!

@davidallendj
Contributor

Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into if there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.

Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!

Run a test job as the user 'testuser':

```bash
srun hostname
```

@davidallendj davidallendj Feb 13, 2026


Okay, I think this is the last issue for me. Whenever I try to run this, I'm getting this error about accounts and partitions. Did I miss a step?

```
[testuser@openchami-dev ~]$ srun hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
```

Here's what I'm seeing in journalctl for slurmctld.

```
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _job_create: invalid account or partition for user 1002, account '(null)', and partition 'main'
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
```

And here's the sinfo too.

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      1   idle de01
```
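For reference, this error usually means the user has no association in the Slurm accounting database. One possible fix is a sketch like the following (the account name `testacct` and the use of slurmdbd accounting are assumptions for illustration, not steps from the guide):

```shell
# Create an account in the accounting database (-i skips the confirmation prompt)
sudo sacctmgr -i add account testacct Description="test account" Organization="test"

# Associate the test user with that account so srun can allocate resources
sudo sacctmgr -i add user testuser Account=testacct

# Verify the association now exists
sacctmgr show assoc format=cluster,account,user
```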

Author


Hmm, strange. I've only had that issue when I switch user and immediately try to run a job, but it looks like you're running srun hostname as the user you created with Slurm privileges. I will look into this and get back to you.

Contributor


I tried this again after running the commands above, and now I get this instead.

```
[testuser@openchami-dev ~]$ srun hostname
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01
ERROR: ld.so: object '/software/r9/xalt/3.0.1/$LIB/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
```

I do see the hostname as expected, but I'm not sure whether the other warnings/errors are really all that important here. If not, then I'd say we're pretty much done with the PR and it should be ready to merge.

Author


Fantastic! That looks to be working now, which is great. The two slurmstepd errors are expected, as LDAP is not configured. I haven't seen the LD_PRELOAD error before, but it could be due to your cluster having xalt installed or on a path somewhere. It isn't something that should be installed by my directions, and I can't see it on my cluster. It doesn't seem to be causing issues with srun, though, so I am not worried.

Hopefully the issues you had previously were a one-off. I was trying to replicate them, but have been having some issues with my test cluster that were preventing me, unfortunately.
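For anyone hitting the same slurmstepd warnings, one way to silence them without configuring LDAP would be a sketch like this (assuming a local `testuser` account already exists on the compute node and home directories are not NFS-mounted):

```shell
# Create the missing home directory on the compute node so slurmstepd
# can chdir into it instead of falling back to /tmp
sudo mkdir -p /home/testuser
sudo chown testuser:testuser /home/testuser
```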

@lunamorrow
Author

I really appreciate all the work you're doing here. Really shaping up nicely!

Thanks Alex :)

Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!

No problem and sounds great! Thanks David :)

…ecurity vulnerabilities with versions 0.5-0.5.17

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

lunamorrow commented Feb 20, 2026

I've set up a new test cluster to run through the documentation again and have made some minor tweaks which should address some of the issues found with slurmdbd and slurmctld. Additionally, the Munge install has been pinned to version 0.5.18 to address the security concerns.

@davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices more, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.

@davidallendj
Contributor

@davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices more, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.

I think this is a good start, so we can add that later if you want, but that's up to you. I'm going to go ahead and approve.

Contributor

@davidallendj davidallendj left a comment


Tested and can verify that it works 🚀

@lunamorrow
Author

I think this is a good start, so we can add that later if you want, but that's up to you. I'm going to go ahead and approve.

Thanks David, that sounds great to me! I've got some things coming up at work, so I can pivot back to expand it later. I also want to flesh out some other documentation components once I have figured them out, if the OpenCHAMI dev team would be interested (e.g. K8s, serving images with NFS, etc.).

…in a few places

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@synackd
Contributor

synackd commented Feb 24, 2026

(e.g. K8s, serving images with NFS, etc.)

All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help as we've been quite busy. 🙂

@lunamorrow
Author

All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help as we've been quite busy. 🙂

Fantastic! I haven't gotten around to implementing these yet, but once I do I can start pushing up documentation :)

@lunamorrow lunamorrow requested a review from synackd February 25, 2026 02:06
@lunamorrow
Author

Hi @synackd, could I please have your approval on this PR to merge? I'd love to have it added to the code base. Thank you!

@synackd
Contributor

synackd commented Mar 3, 2026

Yep, will take a look at it today.

Contributor

@synackd synackd left a comment


Thank you for all of the work you put into writing these docs! I went through this and tested it using a VM head/compute setup as documented in the tutorial.

I have requested changes, most of which are alterations to make it easier for the reader to copy and paste, and to make commands more general so the guide doesn't assume too much about the environment (e.g. the tutorial setup).

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?
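As an illustration of the persistent approach being suggested, a hypothetical cloud-init fragment might ship the Slurm config at boot. Everything in this snippet (paths, values, and the use of the `write_files`/`runcmd` modules) is an assumption for illustration, not content from the guide:

```yaml
# Hypothetical cloud-init user-data: write slurm.conf at boot so the
# compute-node config survives reimages and reboots
write_files:
  - path: /etc/slurm/slurm.conf
    owner: root:root
    permissions: '0644'
    content: |
      ClusterName=demo
      SlurmctldHost=demo
      # ...remainder of the cluster's slurm.conf...
runcmd:
  - [systemctl, enable, --now, slurmd]
```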

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.


```bash
sudo systemctl start slurmdbd
sudo systemctl start slurmctld
```
Contributor


I'm getting these errors on the head node (which is a VM) when starting Slurm here:

```
slurmctld[22796]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
...
slurmctld[22796]: slurmctld: error: Configured MailProg is invalid
slurmctld[22796]: slurmctld: slurmctld version 24.05.5 started on cluster demo
slurmctld[22796]: slurmctld: error: This host (head/head) not a valid controller
```
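A quick way to check for the `(head/head) not a valid controller` symptom is to compare the host's short name against the controller name in the config (a sketch; the config path `/etc/slurm/slurm.conf` is an assumption and may differ per install):

```shell
# slurmctld requires this host's short name to match SlurmctldHost
hostname -s                                    # what the daemon sees
grep '^SlurmctldHost' /etc/slurm/slurm.conf    # what the config expects

# If they differ, align one with the other, then restart the controller
sudo systemctl restart slurmctld
```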

@lunamorrow
Author

Thank you for all of the work you put into writing these docs! I went through this and tested it using a VM head/compute setup as documented in the tutorial.

I have requested changes, most of which are alterations to make it easier for the reader to copy and paste, and to make commands more general so the guide doesn't assume too much about the environment (e.g. the tutorial setup).

Thank you Devon! I will start going through and applying your suggested changes now.

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?

I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.

Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.

…erence to the 'Install Slurm' guide

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t and the image config to reduce the number of commands needing to be run on the compute node. We are waiting on feedback from David and Alex before potentially implementing a more persistent Slurm configuration on the compute node/s.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

@synackd Thank you for taking the time to go over the documentation. I have applied all of your feedback and suggestions, except for updating cloud-init and the image config to reduce how many commands are run on the compute node. I am going to go back over my workflow on my test cluster and figure out the issue with the slurmctld service.

Contributor

@synackd synackd left a comment


This looks great! Thank you for the updated changes. There were just a few very small stylistic changes I requested. Other than that, it looks solid!

I still need to run through this again, but my test machine is currently down. Once it's back up, I can resume testing.

@synackd
Contributor

synackd commented Mar 4, 2026

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?

I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.

I think that, if they don't interject, it would be fine to have this be merged in and have it updated later. Just so users have something to go off of. 🙂 Would you be willing to be responsible for updating these docs with the cloud-init/image configs once you get them figured out? If you need help, we'd be more than willing to. Just reach out in one of our channels, e.g. Slack. 😉

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.

Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.

I'll poke around once I run through it again. Otherwise, I'll wait for your response.

…evon

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?

I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.

I think that, if they don't interject, it would be fine to have this be merged in and have it updated later. Just so users have something to go off of. 🙂 Would you be willing to be responsible for updating these docs with the cloud-init/image configs once you get them figured out? If you need help, we'd be more than willing to. Just reach out in one of our channels, e.g. Slack. 😉

Thanks Devon, that sounds good to me. Yes I am happy to update these docs once I figure out the cloud-init and image configs. I'll see if I can figure them out myself using the docs, but if I need a hand I will reach out. I've just joined the Slack channel which will help!

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.

Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.

I'll poke around once I run through it again. Otherwise, I'll wait for your response.

Sounds good. I'm currently getting my test system up and running again as well, so I'm hoping to investigate this today. I'll keep you posted if I find anything; otherwise, if you can replicate the issue and grab some logs and config files for me, that would be great to help pin down the cause.

… in the working directory '/opt/workdir' (as desired) and not the user's home directory

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…r' in the slurm-local.repo file

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…f slurm RPMs in '/opt/workdir'

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I'm getting these errors on the head node (which is a VM) when starting Slurm here:

```
slurmctld[22796]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
...
slurmctld[22796]: slurmctld: error: Configured MailProg is invalid
slurmctld[22796]: slurmctld: slurmctld version 24.05.5 started on cluster demo
slurmctld[22796]: slurmctld: error: This host (head/head) not a valid controller
```

@synackd I have just finished running through the process again from scratch and I was not able to replicate this error. If you have this error again when you run through the guide, let me know and we can troubleshoot it.
