Add "Install Slurm" documentation#72

Open
lunamorrow wants to merge 18 commits into OpenCHAMI:main from lunamorrow:lunamorrow/install-slurm-documentation

Conversation

@lunamorrow

Pull Request Template

Thank you for your contribution! Please ensure the following before submitting:

Checklist

  • My code follows the style guidelines of this project
  • I have added/updated comments where needed
  • I have added tests that prove my fix is effective or my feature works
  • I have run make test (or equivalent) locally and all tests pass
  • DCO Sign-off: All commits are signed off (git commit -s) with my real name and email
  • REUSE Compliance:
    • Each new/modified source file has SPDX copyright and license headers
    • Any non-commentable files include a <filename>.license sidecar
    • All referenced licenses are present in the LICENSES/ directory

Description

Contributing to the "Install Slurm" documentation under the OpenCHAMI guides.

Any feedback or suggestions about making the documentation broad enough for general purpose or to fit in well with the existing documentation are appreciated.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

For more info, see Contributing Guidelines.

…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I am in the process of adjusting the documentation to align better with the format of the OpenCHAMI tutorial to enable ease of use for OpenCHAMI users. I will be making some more commits to adjust and fine-tune the documentation, and I would appreciate feedback/suggestions as this is my first time contributing to this project.

@davidallendj
Contributor

At a glance, this looks great! I'm going to try to take time to run through this today if I get a chance.

… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into if there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.

…ck from David.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I have been notified of a security vulnerability affecting versions 0.5 to 0.5.17 of Munge. I will update the documentation next week to pin the Munge installation to version 0.5.18 or later, so we can ensure anyone following the guide isn't installing a vulnerable version of Munge.

More info: https://nvd.nist.gov/vuln/detail/CVE-2026-25506
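A sketch of how such a pin might look on a dnf-based system (the version-comparison spec and available versions are assumptions; check what your repositories actually provide):

```shell
# List the Munge versions your repositories offer
dnf list --showduplicates munge

# Install only a fixed version (0.5.18 or later); dnf accepts a
# version-comparison package spec, but verify against your repo contents
sudo dnf install 'munge >= 0.5.18'

# Confirm the installed version is outside the vulnerable 0.5-0.5.17 range
munge --version
```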

@alexlovelltroy
Member

I really appreciate all the work you're doing here. Really shaping up nicely!

@davidallendj
Contributor

Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into if there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations that are provided with code blocks, add more 'expected output' code blocks and provide support for cloud or bare-metal deployment variations over the next few days.

Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!

Run a test job as the user 'testuser':

```bash
srun hostname
```

@davidallendj davidallendj Feb 13, 2026


Okay, I think this is the last issue for me. Whenever I try to run this, I'm getting this error about accounts and partitions. Did I miss a step?

```
[testuser@openchami-dev ~]$ srun hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
```

Here's what I'm seeing in journalctl for slurmctld.

```
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _job_create: invalid account or partition for user 1002, account '(null)', and partition 'main'
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
```

And here's the sinfo too.

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      1   idle de01
```
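For reference, this error usually means the user has no association in the Slurm accounting database. One possible fix is a sketch like the following (the account name `testacct` and the use of slurmdbd accounting are assumptions for illustration, not steps from the guide):

```shell
# Create an account in the accounting database (-i skips the confirmation prompt)
sudo sacctmgr -i add account testacct Description="test account" Organization="test"

# Associate the test user with that account so srun can allocate resources
sudo sacctmgr -i add user testuser Account=testacct

# Verify the association now exists
sacctmgr show assoc format=cluster,account,user
```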

Author


Hmm, strange. I've only had that issue when I switch user and immediately try to run a job, but it looks like you're running srun hostname as the user you created with Slurm privileges. I will look into this and get back to you.

Contributor


I tried this again after running the commands above, and now I get this instead.

```
[testuser@openchami-dev ~]$ srun hostname
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01
ERROR: ld.so: object '/software/r9/xalt/3.0.1/$LIB/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
```

I do see the hostname as expected, but I'm not sure whether the other warnings/errors are really all that important here. If not, then I'd say we're pretty much done with the PR and it should be ready to merge.

Author


Fantastic! That looks to be working now, which is great. The two slurmstepd errors are expected, as LDAP is not configured. I haven't seen the LD_PRELOAD error before, but it could be due to your cluster having xalt installed or on a path somewhere. It isn't something that should be installed by my directions, and I can't see it on my cluster. It doesn't seem to be causing issues with srun, though, so I am not worried.

Hopefully the issues you had previously were a one-off. I was trying to replicate them, but have been having some issues with my test cluster that were preventing me, unfortunately.
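For anyone hitting the same slurmstepd warnings, one way to silence them without configuring LDAP would be a sketch like this (assuming a local `testuser` account already exists on the compute node and home directories are not NFS-mounted):

```shell
# Create the missing home directory on the compute node so slurmstepd
# can chdir into it instead of falling back to /tmp
sudo mkdir -p /home/testuser
sudo chown testuser:testuser /home/testuser
```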

@lunamorrow
Author

I really appreciate all the work you're doing here. Really shaping up nicely!

Thanks Alex :)

Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!

No problem and sounds great! Thanks David :)

…ecurity vulnerabilities with versions 0.5-0.5.17

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

lunamorrow commented Feb 20, 2026

I've set up a new test cluster to run through the documentation again and have made some minor tweaks which should address some of the issues found with slurmdbd and slurmctld. Additionally, the Munge install has been pinned to version 0.5.18 to address the security concerns.

@davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices more, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.

@davidallendj
Contributor

@davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices more, but if that isn't needed (or we want to do that at a later time) then I am happy with the state of the documentation to merge now if you are.

I think this is a good start, so we can add that later if you want, but that's up to you. I'm going to go ahead and approve.

Contributor

@davidallendj davidallendj left a comment


Tested and can verify that it works 🚀

@lunamorrow
Author

I think this is a good start, so we can add that later if you want, but that's up to you. I'm going to go ahead and approve.

Thanks David, that sounds great to me! I've got some things coming up at work, so I can pivot back to expand it later. I also want to flesh out some other documentation components once I have figured them out, if the OpenCHAMI dev team would be interested (e.g. K8s, serving images with NFS, etc.).

…in a few places

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@synackd
Contributor

synackd commented Feb 24, 2026

(e.g. K8s, serving images with NFS, etc.)

All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help as we've been quite busy. 🙂

@lunamorrow
Author

All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help as we've been quite busy. 🙂

Fantastic! I haven't gotten around to implementing these yet, but once I do I can start pushing up documentation :)

@lunamorrow lunamorrow requested a review from synackd February 25, 2026 02:06
@lunamorrow
Author

Hi @synackd, could I please have your approval on this PR to merge? I'd love to have it added to the code base. Thank you!

@synackd
Contributor

synackd commented Mar 3, 2026

Yep, will take a look at it today.

Contributor

@synackd synackd left a comment


Thank you for all of the work you put into writing these docs! I went through this and tested it using a VM head/compute setup as documented in the tutorial.

I have requested changes, most of which are alterations to make it easier for the reader to copy and paste, and to make commands more general so the guide doesn't assume too much about the environment (e.g. the tutorial setup).

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?
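As an illustration of the persistent approach being suggested, a hypothetical cloud-init fragment might ship the Slurm config at boot. Everything in this snippet (paths, values, and the use of the `write_files`/`runcmd` modules) is an assumption for illustration, not content from the guide:

```yaml
# Hypothetical cloud-init user-data: write slurm.conf at boot so the
# compute-node config survives reimages and reboots
write_files:
  - path: /etc/slurm/slurm.conf
    owner: root:root
    permissions: '0644'
    content: |
      ClusterName=demo
      SlurmctldHost=demo
      # ...remainder of the cluster's slurm.conf...
runcmd:
  - [systemctl, enable, --now, slurmd]
```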

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.


```bash
sudo systemctl start slurmdbd
sudo systemctl start slurmctld
```
Contributor


I'm getting these errors on the head node (which is a VM) when starting Slurm here:

```
slurmctld[22796]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
...
slurmctld[22796]: slurmctld: error: Configured MailProg is invalid
slurmctld[22796]: slurmctld: slurmctld version 24.05.5 started on cluster demo
slurmctld[22796]: slurmctld: error: This host (head/head) not a valid controller
```
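A quick way to check for the `(head/head) not a valid controller` symptom is to compare the host's short name against the controller name in the config (a sketch; the config path `/etc/slurm/slurm.conf` is an assumption and may differ per install):

```shell
# slurmctld requires this host's short name to match SlurmctldHost
hostname -s                                    # what the daemon sees
grep '^SlurmctldHost' /etc/slurm/slurm.conf    # what the config expects

# If they differ, align one with the other, then restart the controller
sudo systemctl restart slurmctld
```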

@lunamorrow
Author

Thank you for all of the work you put into writing these docs! I went through this and tested it using a VM head/compute setup as documented in the tutorial.

I have requested changes, most of which are alterations to make it easier for the reader to copy and paste, and to make commands more general so the guide doesn't assume too much about the environment (e.g. the tutorial setup).

Thank you Devon! I will start going through and applying your suggested changes now.

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?

I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.

Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.

…erence to the 'Install Slurm' guide

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t and the image config to reduce the number of commands needing to be run on the compute node. We are waiting on feedback from David and Alex before potentially implementing a more persistent Slurm configuration on the compute node/s.

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

@synackd Thank you for taking the time to go over the documentation. I have applied all of your feedback and suggestions, except for updating cloud-init and the image config to reduce how many commands are run on the compute node. I am going to go back over my workflow on my test cluster and figure out the issue with the slurmctld service.

Contributor

@synackd synackd left a comment


This looks great! Thank you for the updated changes. There were just a few very small stylistic changes I requested. Other than that, it looks solid!

I still need to run through this again, but my test machine is currently down. Once it's back up, I can resume testing.

@synackd
Contributor

synackd commented Mar 4, 2026

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?

I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.

I think that, if they don't interject, it would be fine to have this be merged in and have it updated later. Just so users have something to go off of. 🙂 Would you be willing to be responsible for updating these docs with the cloud-init/image configs once you get them figured out? If you need help, we'd be more than willing to. Just reach out in one of our channels, e.g. Slack. 😉

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.

Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.

I'll poke around once I run through it again. Otherwise, I'll wait for your response.

…evon

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?

I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.

I think that, if they don't interject, it would be fine to have this be merged in and have it updated later. Just so users have something to go off of. 🙂 Would you be willing to be responsible for updating these docs with the cloud-init/image configs once you get them figured out? If you need help, we'd be more than willing to. Just reach out in one of our channels, e.g. Slack. 😉

Thanks Devon, that sounds good to me. Yes I am happy to update these docs once I figure out the cloud-init and image configs. I'll see if I can figure them out myself using the docs, but if I need a hand I will reach out. I've just joined the Slack channel which will help!

Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.

Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.

I'll poke around once I run through it again. Otherwise, I'll wait for your response.

Sounds good. I'm currently getting my test system up and running again as well, so I'm hoping to investigate this today. I'll keep you posted if I find anything; otherwise, if you can replicate the issue and grab some logs and config files for me, that would be great to help pin down the cause.

… in the working directory '/opt/workdir' (as desired) and not the user's home directory

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…r' in the slurm-local.repo file

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…f slurm RPMs in '/opt/workdir'

Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@lunamorrow
Author

I'm getting these errors on the head node (which is a VM) when starting Slurm here:

```
slurmctld[22796]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
...
slurmctld[22796]: slurmctld: error: Configured MailProg is invalid
slurmctld[22796]: slurmctld: slurmctld version 24.05.5 started on cluster demo
slurmctld[22796]: slurmctld: error: This host (head/head) not a valid controller
```

@synackd I have just finished running through the process again from scratch and I was not able to replicate this error. If you have this error again when you run through the guide, let me know and we can troubleshoot it.
