feat: add Docker Compose SLURM environment for local job testing #73
**`.github/workflows/test-slurm-jobs.yml`** (new file, +127 lines)

```yaml
name: Test SLURM Jobs

on:
  push:
    branches: [ main, develop ]
    paths:
      - 'scripts/job_*.sh'
      - 'research/**/job_*.sh'
      - 'docker/slurm/**'
      - '.github/workflows/test-slurm-jobs.yml'
  pull_request:
    branches: [ main, develop ]
    paths:
      - 'scripts/job_*.sh'
      - 'research/**/job_*.sh'
      - 'docker/slurm/**'
      - '.github/workflows/test-slurm-jobs.yml'
  workflow_dispatch:

jobs:
  test-slurm-environment:
    runs-on: ubuntu-latest
    permissions:
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build SLURM Docker images
        run: |
          docker-compose -f docker-compose.slurm.yml build

      - name: Start SLURM cluster
        run: |
          docker-compose -f docker-compose.slurm.yml up -d
          # Wait for SLURM to be ready
          sleep 30

      - name: Check SLURM cluster status
        run: |
          docker exec ami-ml-slurmctld sinfo
          docker exec ami-ml-slurmctld scontrol show nodes

      - name: Test basic SLURM job submission
        run: |
          # Submit test job
          docker exec ami-ml-slurmctld bash -c "cd /workspace && sbatch docker/slurm/examples/job_hello.sh"

          # Wait for job to complete
          sleep 10

          # Check job status
          docker exec ami-ml-slurmctld squeue

          # Display job output
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat hello_slurm_*.out || echo 'Job output not found yet'"

      - name: Test environment setup job
        run: |
          # Submit environment test job (simplified version that doesn't require network)
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat > /tmp/test_simple.sh << 'EOF'
          #!/bin/bash
          #SBATCH --job-name=test_simple
          #SBATCH --output=test_simple_%j.out
          #SBATCH --ntasks=1
          #SBATCH --time=00:05:00
          #SBATCH --mem=2G
          #SBATCH --cpus-per-task=1
          #SBATCH --partition=main

          echo \"Testing basic environment...\"
          echo \"Job ID: \$SLURM_JOB_ID\"
          echo \"Working directory: \$(pwd)\"
          echo \"Python version:\"
          python3 --version
          echo \"Conda available:\"
          which conda
          conda --version
          echo \"Poetry available:\"
          which poetry || echo \"Poetry not in PATH\"
          poetry --version || echo \"Poetry command failed\"
          echo \"Workspace contents:\"
          ls -la /workspace/ | head -20
          echo \"Test completed successfully!\"
          EOF
          "
          docker exec ami-ml-slurmctld chmod +x /tmp/test_simple.sh

          JOB_ID=$(docker exec ami-ml-slurmctld bash -c "sbatch /tmp/test_simple.sh" | grep -oP '\d+')
          echo "Submitted job ID: $JOB_ID"

          # Wait for job to complete (with timeout)
          timeout=60
          elapsed=0
          while [ $elapsed -lt $timeout ]; do
            status=$(docker exec ami-ml-slurmctld squeue -j $JOB_ID -h -o "%T" 2>/dev/null || echo "COMPLETED")
            if [ "$status" = "COMPLETED" ] || [ -z "$status" ]; then
              echo "Job $JOB_ID completed"
              break
            fi
            echo "Job $JOB_ID status: $status (waiting...)"
            sleep 5
            elapsed=$((elapsed + 5))
          done

          # Display job output
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat test_simple_*.out"

      - name: Collect SLURM logs on failure
        if: failure()
        run: |
          echo "=== SLURM Controller Logs ==="
          docker-compose -f docker-compose.slurm.yml logs slurm-controller
          echo "=== SLURM Compute Node Logs ==="
          docker-compose -f docker-compose.slurm.yml logs slurm-compute
          echo "=== All job outputs ==="
          docker exec ami-ml-slurmctld bash -c "cd /workspace && ls -la *.out 2>/dev/null || echo 'No job outputs found'"
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat *.out 2>/dev/null || echo 'No job outputs to display'"

      - name: Stop SLURM cluster
        if: always()
        run: |
          docker-compose -f docker-compose.slurm.yml down -v
```
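One fragile spot in the workflow above is the fixed `sleep 30` before the cluster is queried. A more robust alternative, sketched below as a hypothetical replacement under the assumption that `sinfo` begins reporting node states once `slurmctld` and `slurmd` have registered with each other, is to poll until a node shows up idle:

```bash
# Hypothetical readiness poll to replace the fixed "sleep 30":
# query node state via sinfo until the compute node reports idle, up to ~60s.
for attempt in $(seq 1 30); do
  state=$(docker exec ami-ml-slurmctld sinfo -h -o "%T" 2>/dev/null || true)
  if echo "$state" | grep -q "idle"; then
    echo "SLURM cluster ready after $attempt attempt(s)"
    break
  fi
  echo "Waiting for SLURM cluster... (state: ${state:-unavailable})"
  sleep 2
done
```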
**`README.md`** (updated)

@@ -58,3 +58,27 @@ Alternatively, one can run the scripts without activating poetry's shell:

```bash
poetry run python <script>
```

## Testing SLURM Jobs Locally

A Docker Compose environment is available for testing SLURM job scripts locally before submitting them to DRAC/Compute Canada HPC clusters. It simulates a minimal SLURM environment with a controller and a compute node.

See [docker/slurm/README.md](docker/slurm/README.md) for detailed instructions on:
- Starting the SLURM environment
- Submitting and monitoring jobs (see the sketch after the quick start below)
- Adapting real job scripts for local testing
- Troubleshooting common issues

Quick start:
```bash
# Build and start the SLURM cluster
docker-compose -f docker-compose.slurm.yml up -d
```

**Suggested change** (review suggestion):
```diff
-docker-compose -f docker-compose.slurm.yml up -d
+docker compose -f docker-compose.slurm.yml up -d
```
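Once the containers are up, jobs are submitted from inside the controller container. A short usage sketch, reusing the container name and example job path that appear in the CI workflow above:

```bash
# Submit the example job from inside the controller container
docker exec ami-ml-slurmctld bash -c "cd /workspace && sbatch docker/slurm/examples/job_hello.sh"

# Monitor the queue, then read the job's output file once it completes
docker exec ami-ml-slurmctld squeue
docker exec ami-ml-slurmctld bash -c "cd /workspace && cat hello_slurm_*.out"
```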
**`docker-compose.slurm.yml`** (new file, +61 lines)

```yaml
services:
  slurm-controller:
    build:
      context: .
      dockerfile: docker/slurm/Dockerfile
    image: ami-ml-slurm:latest
    hostname: slurmctld
    container_name: ami-ml-slurmctld
    privileged: true
    networks:
      - slurm
    volumes:
      - ./:/workspace
      - slurm-etc:/etc/slurm
      - slurm-var:/var/spool/slurm
      - slurm-log:/var/log/slurm
    environment:
      - SLURM_ROLE=controller
    command: ["slurmctld"]
    ports:
      - "6817:6817"
      - "6818:6818"

  slurm-compute:
    build:
      context: .
      dockerfile: docker/slurm/Dockerfile
    image: ami-ml-slurm:latest
    hostname: c1
    container_name: ami-ml-c1
    privileged: true
    networks:
      - slurm
    volumes:
      - ./:/workspace
      - slurm-etc:/etc/slurm
      - slurm-var:/var/spool/slurm
      - slurm-log:/var/log/slurm
    environment:
      - SLURM_ROLE=compute
    command: ["slurmd"]
    depends_on:
      - slurm-controller
    # Uncomment the deploy section below to enable GPU support
    # Requires NVIDIA Docker runtime to be installed
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
```

**Suggested change** (review suggestion on the GPU comment block):
```diff
-    # Uncomment the deploy section below to enable GPU support
-    # Requires NVIDIA Docker runtime to be installed
-    # deploy:
-    #   resources:
-    #     reservations:
-    #       devices:
-    #         - driver: nvidia
-    #           count: all
-    #           capabilities: [gpu]
+    # Uncomment the line below to enable GPU support with Docker Compose v2
+    # Requires NVIDIA Container Toolkit / runtime to be installed and configured
+    # Run with: docker compose -f docker-compose.slurm.yml up
+    # gpus: all
```
**`docker/slurm/Dockerfile`** (new file, +60 lines)

```dockerfile
FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

# Install basic dependencies and SLURM from Ubuntu packages
RUN apt-get update && apt-get install -y \
    wget \
    curl \
    gcc \
    make \
    build-essential \
    munge \
    slurm-wlm \
    slurm-wlm-basic-plugins \
    slurmd \
    slurmctld \
    python3 \
    python3-pip \
    python3-dev \
    git \
    vim \
    sudo \
    supervisor \
    && rm -rf /var/lib/apt/lists/*

# Create required directories
RUN mkdir -p /var/spool/slurm/ctld \
    /var/spool/slurm/d \
    /var/log/slurm \
    /etc/slurm \
    /run/munge

# Install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && \
    bash /tmp/miniconda.sh -b -p /opt/miniconda3 && \
    rm /tmp/miniconda.sh

ENV PATH="/opt/miniconda3/bin:${PATH}"

# Install Poetry
RUN pip3 install poetry

# Copy SLURM configuration files
COPY docker/slurm/slurm.conf /etc/slurm/slurm.conf
COPY docker/slurm/cgroup.conf /etc/slurm/cgroup.conf
# COPY docker/slurm/gres.conf /etc/slurm/gres.conf  # Uncomment for GPU support

# Set proper permissions
RUN chown -R slurm:slurm /var/spool/slurm /var/log/slurm && \
    chown -R munge:munge /etc/munge /var/log/munge /run/munge && \
    chmod 700 /etc/munge /var/log/munge /run/munge

# Copy entrypoint script
COPY docker/slurm/entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/entrypoint.sh

WORKDIR /workspace

ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
CMD ["slurmctld"]
```
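For faster iteration on the image itself, it can also be built and smoke-tested outside Compose. A sketch using the tag and build context defined in `docker-compose.slurm.yml` (the `--entrypoint` override is an assumption needed because the image sets `entrypoint.sh` as its entrypoint):

```bash
# Build the shared SLURM image directly, with the same tag Compose uses
docker build -f docker/slurm/Dockerfile -t ami-ml-slurm:latest .

# Smoke-test that the SLURM binaries and Python tooling landed in the image
docker run --rm --entrypoint bash ami-ml-slurm:latest \
  -c "sinfo --version && python3 --version && conda --version && poetry --version"
```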
**Review comment:**

The workflow uses the legacy `docker-compose` binary. On GitHub-hosted runners (and many local setups) only the Compose v2 plugin (`docker compose`) is guaranteed to be present; `docker-compose` may be missing. Consider switching these commands to `docker compose ...` for more reliable CI execution.
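One hedged way to act on this review comment, while still supporting machines that only ship the legacy standalone binary, is to resolve the Compose command once and reuse it. A sketch:

```bash
# Prefer the Compose v2 plugin; fall back to the legacy standalone binary
if docker compose version >/dev/null 2>&1; then
  COMPOSE="docker compose"
else
  COMPOSE="docker-compose"
fi

$COMPOSE -f docker-compose.slurm.yml up -d
```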