127 changes: 127 additions & 0 deletions .github/workflows/test-slurm-jobs.yml
@@ -0,0 +1,127 @@
name: Test SLURM Jobs

on:
  push:
    branches: [ main, develop ]
    paths:
      - 'scripts/job_*.sh'
      - 'research/**/job_*.sh'
      - 'docker/slurm/**'
      - '.github/workflows/test-slurm-jobs.yml'
  pull_request:
    branches: [ main, develop ]
    paths:
      - 'scripts/job_*.sh'
      - 'research/**/job_*.sh'
      - 'docker/slurm/**'
      - '.github/workflows/test-slurm-jobs.yml'
  workflow_dispatch:

jobs:
  test-slurm-environment:
    runs-on: ubuntu-latest
    permissions:
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build SLURM Docker images
        run: |
          docker-compose -f docker-compose.slurm.yml build

      - name: Start SLURM cluster
        run: |
          docker-compose -f docker-compose.slurm.yml up -d
Comment on lines +35 to +39

Copilot AI Feb 11, 2026

The workflow uses the legacy docker-compose binary. On GitHub-hosted runners (and many local setups) only the Compose v2 plugin (docker compose) is guaranteed to be present; docker-compose may be missing. Consider switching these commands to docker compose ... for more reliable CI execution.

Suggested change:
-          docker-compose -f docker-compose.slurm.yml build
-      - name: Start SLURM cluster
-        run: |
-          docker-compose -f docker-compose.slurm.yml up -d
+          docker compose -f docker-compose.slurm.yml build
+      - name: Start SLURM cluster
+        run: |
+          docker compose -f docker-compose.slurm.yml up -d

          # Wait for SLURM to be ready
          sleep 30

      - name: Check SLURM cluster status
        run: |
          docker exec ami-ml-slurmctld sinfo
          docker exec ami-ml-slurmctld scontrol show nodes

      - name: Test basic SLURM job submission
        run: |
          # Submit test job
          docker exec ami-ml-slurmctld bash -c "cd /workspace && sbatch docker/slurm/examples/job_hello.sh"

          # Wait for job to complete
          sleep 10

          # Check job status
          docker exec ami-ml-slurmctld squeue

          # Display job output
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat hello_slurm_*.out || echo 'Job output not found yet'"
Comment on lines +48 to +60

Copilot AI Feb 11, 2026

The "Test basic SLURM job submission" step doesn't assert that the job_hello job actually reaches a terminal state (COMPLETED) before proceeding; it just sleeps and prints squeue. This can let failures slip through (e.g., jobs stuck PENDING/RUNNING). Consider capturing the submitted job id, polling squeue/sacct with a timeout, and failing the step if the job doesn't complete successfully.

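A minimal sketch of that approach (not part of this PR; it assumes job accounting via sacct works in the test cluster — otherwise squeue/scontrol polling would be needed instead):

```bash
# Hypothetical hardening of the job_hello step: capture the job ID with
# --parsable, then poll sacct until the job reaches a terminal state.
JOB_ID=$(docker exec ami-ml-slurmctld bash -c \
  "cd /workspace && sbatch --parsable docker/slurm/examples/job_hello.sh")
echo "Submitted job ID: $JOB_ID"

for _ in $(seq 1 24); do  # poll every 5 s, up to ~2 minutes
  state=$(docker exec ami-ml-slurmctld sacct -j "$JOB_ID" -n -X -o State | tr -d ' ')
  case "$state" in
    COMPLETED) echo "Job $JOB_ID completed"; exit 0 ;;
    FAILED|CANCELLED*|TIMEOUT|NODE_FAIL) echo "Job $JOB_ID ended in state: $state"; exit 1 ;;
    *) echo "Job $JOB_ID state: ${state:-UNKNOWN} (waiting...)"; sleep 5 ;;
  esac
done
echo "Job $JOB_ID did not reach a terminal state before the timeout"
exit 1
```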

      - name: Test environment setup job
        run: |
          # Submit environment test job (simplified version that doesn't require network)
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat > /tmp/test_simple.sh << 'EOF'
          #!/bin/bash
          #SBATCH --job-name=test_simple
          #SBATCH --output=test_simple_%j.out
          #SBATCH --ntasks=1
          #SBATCH --time=00:05:00
          #SBATCH --mem=2G
          #SBATCH --cpus-per-task=1
          #SBATCH --partition=main

          echo \"Testing basic environment...\"
          echo \"Job ID: \$SLURM_JOB_ID\"
          echo \"Working directory: \$(pwd)\"
          echo \"Python version:\"
          python3 --version
          echo \"Conda available:\"
          which conda
          conda --version
          echo \"Poetry available:\"
          which poetry || echo \"Poetry not in PATH\"
          poetry --version || echo \"Poetry command failed\"
          echo \"Workspace contents:\"
          ls -la /workspace/ | head -20
          echo \"Test completed successfully!\"
          EOF
          "
          docker exec ami-ml-slurmctld chmod +x /tmp/test_simple.sh

          JOB_ID=$(docker exec ami-ml-slurmctld bash -c "sbatch /tmp/test_simple.sh" | grep -oP '\d+')
          echo "Submitted job ID: $JOB_ID"

          # Wait for job to complete (with timeout)
          timeout=60
          elapsed=0
          while [ $elapsed -lt $timeout ]; do
            status=$(docker exec ami-ml-slurmctld squeue -j $JOB_ID -h -o "%T" 2>/dev/null || echo "COMPLETED")
            if [ "$status" = "COMPLETED" ] || [ -z "$status" ]; then
              echo "Job $JOB_ID completed"
              break
            fi
            echo "Job $JOB_ID status: $status (waiting...)"
            sleep 5
            elapsed=$((elapsed + 5))
          done

          # Display job output
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat test_simple_*.out"

Comment on lines +96 to +112

Copilot AI Feb 11, 2026

In the polling loop, if the job never completes within the timeout, the script exits the loop and continues without failing explicitly. Add a check after the loop to fail the step when elapsed >= timeout (and/or when the final state is not COMPLETED), so CI deterministically reports a failure instead of hanging or passing with partial output.

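One possible follow-up (a sketch only, reusing the step's existing $elapsed, $timeout, $status and $JOB_ID variables; the scancel cleanup is an added assumption):

```bash
# Hypothetical check after the existing while-loop: fail the step if the
# job never reached a terminal state within the timeout.
if [ "$elapsed" -ge "$timeout" ] && [ -n "$status" ] && [ "$status" != "COMPLETED" ]; then
  echo "Job $JOB_ID still in state $status after ${timeout}s; failing the step"
  docker exec ami-ml-slurmctld scancel "$JOB_ID" || true  # clean up the stuck job
  exit 1
fi
```
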
      - name: Collect SLURM logs on failure
        if: failure()
        run: |
          echo "=== SLURM Controller Logs ==="
          docker-compose -f docker-compose.slurm.yml logs slurm-controller
          echo "=== SLURM Compute Node Logs ==="
          docker-compose -f docker-compose.slurm.yml logs slurm-compute
          echo "=== All job outputs ==="
          docker exec ami-ml-slurmctld bash -c "cd /workspace && ls -la *.out 2>/dev/null || echo 'No job outputs found'"
          docker exec ami-ml-slurmctld bash -c "cd /workspace && cat *.out 2>/dev/null || echo 'No job outputs to display'"

      - name: Stop SLURM cluster
        if: always()
        run: |
          docker-compose -f docker-compose.slurm.yml down -v
24 changes: 24 additions & 0 deletions README.md
@@ -58,3 +58,27 @@ Alternatively, one can run the scripts without activating poetry's shell:
```bash
poetry run python <script>
```

## Testing SLURM Jobs Locally

A Docker Compose environment is available for testing SLURM job scripts locally before submitting to DRAC/Compute Canada HPC clusters. This simulates a minimal SLURM environment with a controller and compute node.

See [docker/slurm/README.md](docker/slurm/README.md) for detailed instructions on:
- Starting the SLURM environment
- Submitting and monitoring jobs
- Adapting real job scripts for local testing (see the example sketch after the quick start below)
- Troubleshooting common issues

Quick start:
```bash
# Build and start the SLURM cluster
docker-compose -f docker-compose.slurm.yml up -d

Copilot AI Feb 11, 2026

The docs use the legacy docker-compose binary. Consider using the Compose v2 plugin syntax (docker compose -f docker-compose.slurm.yml ...) to match modern Docker installs where docker-compose may not be available.

Suggested change:
-docker-compose -f docker-compose.slurm.yml up -d
+docker compose -f docker-compose.slurm.yml up -d


# Access the controller to submit jobs
docker exec -it ami-ml-slurmctld bash

# Inside the container
sinfo # Check cluster status
sbatch docker/slurm/examples/job_hello.sh # Submit a test job
squeue # Check job queue
```
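
For illustration only (not part of this change), adapting a typical cluster job script for the local environment mostly means dropping cluster-specific directives; the DRAC-style lines below are hypothetical examples:

```bash
#!/bin/bash
# Hypothetical cluster-only directives, disabled for local testing:
##SBATCH --account=def-someuser       # accounting is not configured locally
##SBATCH --gres=gpu:1                 # only relevant if GPU passthrough is enabled
# module load python/3.10             # no Lmod modules inside the container

# Directives that work in the local test cluster:
#SBATCH --job-name=my_local_test
#SBATCH --output=my_local_test_%j.out
#SBATCH --partition=main              # partition used by the example jobs
#SBATCH --time=00:10:00
#SBATCH --mem=2G

cd /workspace                         # the repository is mounted here in the containers
python3 --version
```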
61 changes: 61 additions & 0 deletions docker-compose.slurm.yml
@@ -0,0 +1,61 @@
services:
  slurm-controller:
    build:
      context: .
      dockerfile: docker/slurm/Dockerfile
    image: ami-ml-slurm:latest
    hostname: slurmctld
    container_name: ami-ml-slurmctld
    privileged: true
    networks:
      - slurm
    volumes:
      - ./:/workspace
      - slurm-etc:/etc/slurm
      - slurm-var:/var/spool/slurm
      - slurm-log:/var/log/slurm
    environment:
      - SLURM_ROLE=controller
    command: ["slurmctld"]
    ports:
      - "6817:6817"
      - "6818:6818"

  slurm-compute:
    build:
      context: .
      dockerfile: docker/slurm/Dockerfile
    image: ami-ml-slurm:latest
    hostname: c1
    container_name: ami-ml-c1
    privileged: true
    networks:
      - slurm
    volumes:
      - ./:/workspace
      - slurm-etc:/etc/slurm
      - slurm-var:/var/spool/slurm
      - slurm-log:/var/log/slurm
    environment:
      - SLURM_ROLE=compute
    command: ["slurmd"]
    depends_on:
      - slurm-controller
    # Uncomment the deploy section below to enable GPU support
    # Requires NVIDIA Docker runtime to be installed
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

Comment on lines +44 to +53

Copilot AI Feb 11, 2026

The commented GPU passthrough example uses the deploy: section. For docker compose up (non-Swarm), deploy is typically ignored, so uncommenting this may not actually provide GPUs to the container. Prefer the Compose v2 GPU mechanism (e.g., gpus: all / device requests) and document the exact command/runtime requirement.

Suggested change:
-    # Uncomment the deploy section below to enable GPU support
-    # Requires NVIDIA Docker runtime to be installed
-    # deploy:
-    #   resources:
-    #     reservations:
-    #       devices:
-    #         - driver: nvidia
-    #           count: all
-    #           capabilities: [gpu]
+    # Uncomment the line below to enable GPU support with Docker Compose v2
+    # Requires NVIDIA Container Toolkit / runtime to be installed and configured
+    # Run with: docker compose -f docker-compose.slurm.yml up
+    # gpus: all

networks:
  slurm:
    driver: bridge

volumes:
  slurm-etc:
  slurm-var:
  slurm-log:
60 changes: 60 additions & 0 deletions docker/slurm/Dockerfile
@@ -0,0 +1,60 @@
FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive

# Install basic dependencies and SLURM from Ubuntu packages
RUN apt-get update && apt-get install -y \
    wget \
    curl \
    gcc \
    make \
    build-essential \
    munge \
    slurm-wlm \
    slurm-wlm-basic-plugins \
    slurmd \
    slurmctld \
    python3 \
    python3-pip \
    python3-dev \
    git \
    vim \
    sudo \
    supervisor \
    && rm -rf /var/lib/apt/lists/*

# Create required directories
RUN mkdir -p /var/spool/slurm/ctld \
    /var/spool/slurm/d \
    /var/log/slurm \
    /etc/slurm \
    /run/munge

# Install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && \
    bash /tmp/miniconda.sh -b -p /opt/miniconda3 && \
    rm /tmp/miniconda.sh

ENV PATH="/opt/miniconda3/bin:${PATH}"

# Install Poetry
RUN pip3 install poetry

# Copy SLURM configuration files
COPY docker/slurm/slurm.conf /etc/slurm/slurm.conf
COPY docker/slurm/cgroup.conf /etc/slurm/cgroup.conf
# COPY docker/slurm/gres.conf /etc/slurm/gres.conf # Uncomment for GPU support

# Set proper permissions
RUN chown -R slurm:slurm /var/spool/slurm /var/log/slurm && \
    chown -R munge:munge /etc/munge /var/log/munge /run/munge && \
    chmod 700 /etc/munge /var/log/munge /run/munge

# Copy entrypoint script
COPY docker/slurm/entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/entrypoint.sh

WORKDIR /workspace

ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
CMD ["slurmctld"]