Conversation
…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general-purpose use. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I am in the process of adjusting the documentation to align better with the format of the OpenCHAMI tutorial, to make it easier for OpenCHAMI users to follow. I will be making some more commits to adjust and fine-tune the documentation, and I would appreciate feedback/suggestions, as this is my first time contributing to this project.

At a glance, this looks great! I'm going to try to take time to run through this today if I get a chance.
… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
Thank you for your suggestions/fixes @davidallendj, I have applied all of them except for simplifying the MariaDB configuration process. I will look into whether there is a more hands-off approach to configuring MariaDB that we can implement instead. I will also expand the comments/explanations provided with code blocks, add more 'expected output' code blocks, and provide support for cloud or bare-metal deployment variations over the next few days.
…ck from David. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I have been notified of a security vulnerability affecting munge versions 0.5-0.5.17. I will update the documentation next week to pin the munge installation to >= 0.5.18, so we can ensure anyone following the guide isn't installing a vulnerable version of munge.
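A version check along these lines could help readers confirm they are outside the vulnerable range. This is a hedged sketch: the `installed` value is hard-coded for illustration, and on a real node it would come from `munged --version`.

```shell
# Illustrative guard against the vulnerable munge range (0.5.0-0.5.17).
# 'installed' is hard-coded here; on a real node, parse it from `munged --version`.
required="0.5.18"
installed="0.5.17"

# sort -V orders version strings; if the required version sorts first (or they
# are equal), the installed version is new enough.
if [ "$(printf '%s\n' "$required" "$installed" | sort -V | head -n1)" = "$required" ]; then
  echo "munge $installed is OK (>= $required)"
else
  echo "munge $installed is in the vulnerable range; pin the install to >= $required"
fi
```

On dnf-based systems, pinning could look like `dnf install munge-0.5.18` if that build is available in the configured repos (a hypothetical package spec; adjust to your distro).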
I really appreciate all the work you're doing here. It's really shaping up nicely!

Thanks for making all these changes! Just a few more changes/adjustments and we should be able to merge soon!
Run a test job as the user 'testuser':

```bash
srun hostname
```
Okay, I think this is the last issue for me. Whenever I try to run this, I'm getting this error about accounts and partitions. Did I miss a step?

```bash
[testuser@openchami-dev ~]$ srun hostname
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
```

Here's what I'm seeing for journalctl slurmctld:

```bash
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _job_create: invalid account or partition for user 1002, account '(null)', and partition 'main'
Feb 13 18:53:48 openchami-dev.novalocal slurmctld[1412948]: slurmctld: _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
```

And here's the sinfo too:

```bash
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      1   idle de01
```
Hmmm strange. I've only had that issue when I switch user and immediately try to run a job, but it looks like you are running srun hostname as the user you've created with Slurm privileges. I will look into this and get back to you.
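For readers who hit the same "Invalid account or account/partition" error: it typically means slurmdbd has no association for the user. A hedged sketch of the registration, written to a script first in this guide's cat-heredoc style; the cluster 'demo' and user 'testuser' come from this thread, while the account name 'general' is hypothetical.

```shell
# Write the association setup to a script (copy-paste safe), then run it on
# the head node with: sudo bash add-testuser-assoc.sh
cat > add-testuser-assoc.sh << 'EOF'
#!/bin/bash
# Register the cluster, an account, and the user's association in slurmdbd.
# -i answers "yes" to sacctmgr's confirmation prompts.
sacctmgr -i add cluster demo
sacctmgr -i add account general Description="general usage" Organization=demo
sacctmgr -i add user testuser Account=general
EOF
```

After running it, `sacctmgr show associations` should list testuser under the new account.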
I tried this again after running the commands above and I got this instead now.

```bash
[testuser@openchami-dev ~]$ srun hostname
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/home/testuser': No such file or directory: going to /tmp instead
de01
ERROR: ld.so: object '/software/r9/xalt/3.0.1/$LIB/libxalt_init.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
```

I do see the hostname as expected, but I'm not sure if the other warnings/errors are really all that important here. If not, then I'd say we're pretty much done with the PR and it should be ready to merge.
Fantastic! That looks to be working now, which is great. The two slurmstepd errors are expected, as LDAP is not configured. I haven't seen the LD_PRELOAD error before, but it could be due to your cluster having xalt installed or on a path somewhere? It isn't something that should be installed by my directions, and I can't see it on my cluster. It doesn't seem to be causing issues with srun, though, so I am not worried.
Hopefully the issues you had previously were a one-off. I was trying to replicate them, but have been having some issues with my test cluster that were preventing me, unfortunately.
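Since the chdir warnings come from the missing home directory, a workaround sketch follows; it assumes the example user 'testuser' from this thread, and on a production setup LDAP or pam_mkhomedir would normally handle this instead.

```shell
# Create the missing home directory on the compute node so srun doesn't fall
# back to /tmp. 'testuser' is this guide's example user.
user="testuser"
if id "$user" >/dev/null 2>&1; then
  sudo mkdir -p "/home/$user"
  sudo chown "$user:$user" "/home/$user"
else
  echo "user $user does not exist on this node; create it first"
fi
```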
Thanks Alex :)
No problem and sounds great! Thanks David :)
…ecurity vulnerabilities with versions 0.5-0.5.17 Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I've set up a new test cluster to run through the documentation again and have made some minor tweaks which should address some of the issues found with slurmdbd and slurmctld. Additionally, the munge install has been pinned to version 0.5.18 to address the security concerns. @davidallendj I was originally planning to add more comments/explanations to explain the workflow and certain choices further, but if that isn't needed (or we want to do it at a later time) then I am happy with the state of the documentation to merge now if you are.

I think this is a good start, so we can add that later if you want, but that's up to you. I'm going to go ahead and approve.
davidallendj
left a comment
Tested and can verify that it works 🚀
Thanks David, that sounds great to me! I've got some things coming up at work, so I can pivot back to expand it later. I also want to flesh out some other documentation components once I have figured them out, if the OpenCHAMI dev team would be interested (e.g. K8s, serving images with NFS, etc.).
…in a few places Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
All of the above! I've been intending to add these to the handbook at some point as well, but I think we could use some help, as we've been quite busy. 🙂

Fantastic! I haven't gotten around to implementing these yet, but once I do I can start pushing up documentation :)

Hi @synackd, could I please have your approval on this PR to merge? I'd love to have it added to the code base. Thank you!

Yep, will take a look at it today.
synackd
left a comment
Thank you for all of the work you put into writing these docs! I went through this and tested it using a VM head/compute setup as documented in the tutorial.
I have requested changes, most of which are alterations to make it easier for the reader to copy and paste, and to make commands more general so they don't assume too much about the environment (e.g. the tutorial's).
I also noticed that a lot of the configuration of Slurm within the compute node happens by executing commands on a running node. However, as an OpenCHAMI guide, it would be useful to know how to make these configurations in the image config itself and/or cloud-init so that they become persistent across nodes and on reboot. Depending on others' opinions (@davidallendj, @alexlovelltroy), we can keep the current, ephemeral config and file a new PR for the persistence since this is already pretty big. Thoughts?
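As a sketch of what such persistence might look like, a cloud-init user-data fragment could carry the Slurm configuration onto each booted node. Everything below (file name, paths, values) is illustrative, not part of the guide; the fragment is written with the guide's cat-heredoc style so it can be pasted safely.

```shell
# Hypothetical cloud-init fragment that would persist slurm.conf across nodes
# and reboots, and start slurmd on boot. All names/values are placeholders.
cat > 99-slurm-cloud-init.yaml << 'EOF'
#cloud-config
write_files:
  - path: /etc/slurm/slurm.conf
    permissions: '0644'
    content: |
      ClusterName=demo
      SlurmctldHost=demo
runcmd:
  - [ systemctl, enable, --now, slurmd ]
EOF
```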
Finally, I ran into an error when starting slurmctld on the head node using the configuration in this guide.
```bash
sudo systemctl start slurmdbd
sudo systemctl start slurmctld
```
I'm getting these errors on the head node (which is a VM) when starting Slurm here:

```bash
slurmctld[22796]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
...
slurmctld[22796]: slurmctld: error: Configured MailProg is invalid
slurmctld[22796]: slurmctld: slurmctld version 24.05.5 started on cluster demo
slurmctld[22796]: slurmctld: error: This host (head/head) not a valid controller
```
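For readers hitting the last error: "not a valid controller" usually means the controller's short hostname does not match the SlurmctldHost entry in slurm.conf. A hedged check follows; the config path is the common default and may differ on your system.

```shell
# Compare this host's short name with the SlurmctldHost entry in slurm.conf.
# /etc/slurm/slurm.conf is the common default path; adjust if yours differs.
short_host="$(hostname -s)"
conf="${SLURM_CONF:-/etc/slurm/slurm.conf}"
configured="$(sed -n 's/^SlurmctldHost=//p' "$conf" 2>/dev/null | head -n1)"
if [ "$short_host" = "$configured" ]; then
  echo "OK: SlurmctldHost matches this host ($short_host)"
else
  echo "Mismatch: this host is '$short_host' but slurm.conf has '$configured'"
fi
```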
Thank you Devon! I will start going through and applying your reviews now.
I agree, and I think this would be a great idea. This is something I haven't explored yet as I am still figuring out all of OpenCHAMI's capabilities, but is something I would like to implement long term as I am planning to make an OpenCHAMI and Slurm IaC solution in the next month or two. I'll wait to hear David and Alex's thoughts.
Sorry about that. This is an error that has cropped up a bit for me, but I am unsure why the Slurm controller is trying to use the hostname 'head' instead of 'demo' in your case. Once I update the documentation I'll go through my workflow again to identify where this hostname is coming from and fix it.
…erence to the 'Install Slurm' guide Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t and the image config to reduce the number of commands needing to be run on the compute node. We are waiting on feedback from David and Alex before potentially implementing a more persistent Slurm configuration on the compute node/s. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@synackd Thank you for taking the time to go over the documentation. I have applied all of your feedback and suggestions, except for updating cloud-init and the image config to reduce how many commands are run on the compute node. I am going to go back over my workflow on my test cluster and figure out the issue with the slurmctld service.
synackd
left a comment
This looks great! Thank you for the updated changes. I requested just a few very small stylistic changes; other than that, it looks solid!
I still need to run through this again, but my test machine is currently down. Once it's back up, I can resume testing.
I think that, if they don't interject, it would be fine to have this merged in and updated later, just so users have something to go off of. 🙂 Would you be willing to be responsible for updating these docs with the cloud-init/image configs once you get them figured out? If you need help, we'd be more than willing to assist. Just reach out in one of our channels, e.g. Slack. 😉
I'll poke around once I run through it again. Otherwise, I'll wait for your response.
…evon Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
Thanks Devon, that sounds good to me. Yes, I am happy to update these docs once I figure out the cloud-init and image configs. I'll see if I can figure them out myself using the docs, but if I need a hand I will reach out. I've just joined the Slack channel, which will help!
Sounds good. I'm currently getting my test system up and running again as well, so I'm hoping to investigate this today. I'll keep you posted if I find anything; otherwise, if you can replicate the issue and grab some logs and config files for me, that would be a great help in pinning down the cause too.
… in the working directory '/opt/workdir' (as desired) and not the user's home directory Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…r' in the slurm-local.repo file Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…f slurm RPMs in '/opt/workdir' Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
@synackd I have just finished running through the process again from scratch and I was not able to replicate this error. If you have this error again when you run through the guide, let me know and we can troubleshoot it. |
Pull Request Template
Thank you for your contribution! Please ensure the following before submitting:
Checklist
- [ ] I have run `make test` (or equivalent) locally and all tests pass
- [ ] My commits are signed (`git commit -s`) with my real name and email
- [ ] License information is included via a `<filename>.license` sidecar file or in the `LICENSES/` directory

Description
Contributing to the "Install Slurm" documentation under the OpenCHAMI guides.
Any feedback or suggestions about making the documentation broad enough for general purpose or to fit in well with the existing documentation are appreciated.
Type of Change
For more info, see Contributing Guidelines.