26 commits
71d0cc1
Added pixi env and documentation for Kaiju
Mar 11, 2026
d22d37b
Fix pixi env check and use PATH instead of PIXI_ENVIRONMENT
Mar 11, 2026
7c3fdcb
Added CC=mpicc
Mar 11, 2026
f36b5b7
Modified PETSc MPI detection
Mar 11, 2026
2b2abaa
Fix LD_LIBRARY_PATH for spack OpenMPI
Mar 11, 2026
6498cea
Add MMG_INSTALL_PRIVATE_HEADERS=ON for PARMMG (kaiju only)
Mar 11, 2026
188d1ed
Disable SCOTCH in MMG build to fix PARMMG configure on Kaiju
Mar 11, 2026
f26da68
Update kaiju cluster setup docs with shared install and troubleshooting
Mar 11, 2026
322e916
Added info re. shared installation in Kaiju cluster
Mar 12, 2026
04fca69
Added link to admin-repo
Mar 12, 2026
a189fcc
Merge branch 'underworldcode:development' into development
jcgraciosa Mar 17, 2026
2c60aaa
Added pixi env for Gadi baremetal install
Mar 17, 2026
47332c9
Merge branch 'development' of https://github.com/jcgraciosa/underworl…
Mar 17, 2026
23791c0
Added build-petsc-gadi script
Mar 17, 2026
330560d
Specified MPI_DIR
Mar 17, 2026
566ecc9
Unset conda/pixi compiler variables that interfere with mpicc
Mar 17, 2026
98d84f6
Removed explicit setting of MPI_DIR
Mar 17, 2026
9637ed9
Added ignoreLinkOutput
Mar 17, 2026
6952d42
Added missing libraries: libucc and libnl_3
Mar 17, 2026
4e09ec6
Reordering PATH
Mar 17, 2026
edacae7
Added symlink and setting LD_LIBRARY_PATH before configure runs
Mar 18, 2026
201acdb
fix linking
Mar 18, 2026
c82aa58
Added OMPI_FCFLAGS
Mar 18, 2026
1d888c3
Added symlink to scratch for petsc; download fblaslapack
Mar 18, 2026
e81feff
Updated env vars for build
Mar 18, 2026
f11e375
Added patchelf to reorder h5py RPATH after source build
Mar 18, 2026
302 changes: 302 additions & 0 deletions docs/developer/guides/kaiju-cluster-setup.md
@@ -0,0 +1,302 @@
# Kaiju Cluster Setup

This guide covers installing and running Underworld3 on the **Kaiju** cluster — a Rocky Linux 8.10 HPC system using Spack for module management and Slurm for job scheduling.

Python packages are managed by **pixi** (the same tool used for local development). MPI-dependent packages — `mpi4py`, PETSc + AMR tools, `petsc4py`, and `h5py` — are built from source against Spack's OpenMPI, the same MPI stack that Slurm uses to launch parallel jobs, so every MPI-linked component shares one interconnect-aware library.

---

## Hardware Overview

| Resource | Specification |
|----------|--------------|
| Head node | 1× Intel Xeon Silver 4210R, 40 CPUs @ 2.4 GHz |
| Compute nodes | 8× Intel Xeon Gold 6230R, 104 CPUs @ 2.1 GHz each |
| Shared storage | `/opt/cluster` via NFS (cluster-wide) |
| Scheduler | Slurm with Munge authentication |

---

## Why pixi + spack?

Pixi manages the Python environment consistently with the developer's local machine (same `pixi.toml`, same package versions). Spack provides the cluster's OpenMPI, which is what Slurm uses for inter-node communication.

The key constraint is that **anything linked against MPI must use the same MPI as Slurm**. This means `mpi4py`, `h5py`, PETSc, and `petsc4py` are built from source against Spack's OpenMPI — not from conda-forge (which bundles MPICH).

```
pixi kaiju env → Python 3.12, sympy, scipy, pint, pydantic, ... (conda-forge, no MPI)
spack → openmpi@4.1.6 (cluster MPI)
source build → mpi4py, PETSc+AMR+petsc4py, h5py (linked to spack MPI)
```
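
This constraint can be sanity-checked from Python: `mpi4py` exposes the MPI library banner via `MPI.Get_library_version()`, which should name Open MPI rather than MPICH. A sketch — the `check_mpi` helper is hypothetical, added here only so the check reads clearly:

```bash
# Hypothetical helper: classify the banner reported by mpi4py's
# MPI.Get_library_version()
check_mpi() {
  case "$1" in
    "Open MPI"*) echo "ok: Open MPI" ;;
    *)           echo "mismatch: $1" ;;
  esac
}

# On the cluster this should report "ok: Open MPI"
check_mpi "$(python3 -c 'from mpi4py import MPI; print(MPI.Get_library_version())' 2>/dev/null || echo 'mpi4py not importable')"
```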

---

## Prerequisites

Spack must have OpenMPI available:

```bash
spack find openmpi
# openmpi@4.1.6
```

Pixi must be installed in your user space (no root needed):

```bash
# Check if already installed
pixi --version

# Install if missing
curl -fsSL https://pixi.sh/install.sh | bash
```

---

## Installation

Use the install script at `uw3_install_kaiju_amr.sh` from the [kaiju-admin-notes](https://github.com/jcgraciosa/kaiju-admin-notes) repo.

### Step 1: Edit configuration

Open the script and set the variables at the top:

```bash
SPACK_MPI_VERSION="openmpi@4.1.6" # Spack MPI module to load
INSTALL_PATH="${HOME}/uw3-installation" # Root directory for everything
UW3_BRANCH="development" # UW3 git branch
```

### Step 2: Run the full install

```bash
source uw3_install_kaiju_amr.sh install
```

This runs the following steps in order:

| Step | Function | Time |
|------|----------|------|
| Install pixi | `setup_pixi` | ~1 min |
| Clone Underworld3 | `clone_uw3` | ~1 min |
| Install pixi kaiju env | `install_pixi_env` | ~3 min |
| Build mpi4py from source | `install_mpi4py` | ~2 min |
| Build PETSc + AMR tools | `install_petsc` | ~1 hour |
| Build MPI-enabled h5py | `install_h5py` | ~2 min |
| Install Underworld3 | `install_uw3` | ~2 min |
| Verify | `verify_install` | ~1 min |

You can also run individual steps after sourcing:

```bash
source uw3_install_kaiju_amr.sh
install_petsc # run just one step
```

### What PETSc builds

PETSc is compiled from source (`petsc-custom/build-petsc-kaiju.sh`) with:

- **AMR tools**: mmg, parmmg, pragmatic, eigen, bison
- **Solvers**: mumps, scalapack, slepc
- **Partitioners**: metis, parmetis, ptscotch
- **MPI**: Spack's OpenMPI (`--with-mpi-dir`)
- **HDF5**: downloaded and built with MPI support
- **BLAS/LAPACK**: fblaslapack (Rocky Linux 8 has no guaranteed system BLAS)
- **cmake**: downloaded (not in Spack)
- **petsc4py**: built during configure (`--with-petsc4py=1`)
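
Taken together, the list above corresponds roughly to a configure line like the following sketch. The option names follow PETSc's standard `--download-<pkg>` convention; the authoritative flags live in `petsc-custom/build-petsc-kaiju.sh`, and the `MPI_DIR` default below is a placeholder for the Spack OpenMPI prefix:

```bash
# Sketch only -- see petsc-custom/build-petsc-kaiju.sh for the real invocation
PETSC_CONFIGURE_OPTS=(
  --with-mpi-dir="${MPI_DIR:-/path/to/spack/openmpi}"  # placeholder prefix
  --download-fblaslapack --download-hdf5
  --download-mumps --download-scalapack --download-slepc
  --download-metis --download-parmetis --download-ptscotch
  --download-mmg --download-parmmg --download-pragmatic
  --download-eigen --download-bison --download-cmake
  --with-petsc4py=1
)
printf '%s\n' "${PETSC_CONFIGURE_OPTS[@]}"  # inspect, then pass to ./configure
```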

---

## Activating the Environment

In every new session (interactive or job), source the install script:

```bash
source ~/install_scripts/uw3_install_kaiju_amr.sh
```

This:
1. Loads `spack openmpi@4.1.6`
2. Activates the pixi `kaiju` environment via `pixi shell-hook`
3. Sets `PETSC_DIR`, `PETSC_ARCH`, and `PYTHONPATH` for petsc4py
4. Sets `PMIX_MCA_psec=native` and `OMPI_MCA_btl_tcp_if_include=eno1`

```{note}
`pixi shell-hook` is used instead of `pixi shell` because it activates the environment in the current shell without spawning a new one. This is required for Slurm batch jobs.
```
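
Conceptually, sourcing the script performs something like the sequence below. This is a sketch: the `PETSC_ARCH` value and the pixi environment flag are illustrative, and the `command -v` guards are only there so the snippet is copy-paste safe off the cluster; the install script is authoritative.

```bash
# Illustrative activation sequence -- placeholder values, see
# uw3_install_kaiju_amr.sh for the real thing
command -v spack >/dev/null && spack load openmpi@4.1.6
command -v pixi  >/dev/null && eval "$(pixi shell-hook -e kaiju)"

export PETSC_DIR="$HOME/uw3-installation/underworld3/petsc-custom/petsc"
export PETSC_ARCH="arch-linux-c-opt"   # placeholder arch name
export PYTHONPATH="$PETSC_DIR/$PETSC_ARCH/lib${PYTHONPATH:+:$PYTHONPATH}"

export PMIX_MCA_psec=native
export OMPI_MCA_btl_tcp_if_include=eno1
```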

---

## Running with Slurm

Two job script templates are available in the [kaiju-admin-notes](https://github.com/jcgraciosa/kaiju-admin-notes) repo:

| Script | Use when |
|--------|----------|
| `uw3_slurm_job.sh` | Per-user install (sources `uw3_install_kaiju_amr.sh`) |
| `uw3_slurm_job_shared.sh` | Shared install (`module load underworld3/...`) |

### Submitting a job

```bash
sbatch uw3_slurm_job.sh # per-user install
sbatch uw3_slurm_job_shared.sh # shared install
```

Monitor progress:

```bash
squeue -u $USER
tail -f uw3_<jobid>.out
```

### The `srun` invocation

`--mpi=pmix` is **required** on Kaiju (Spack has `pmix@5.0.3`):

```bash
srun --mpi=pmix python3 my_model.py
```

### Scaling examples

```bash
# 1 node, 30 ranks
sbatch --nodes=1 --ntasks-per-node=30 uw3_slurm_job.sh

# 4 nodes, 120 ranks
sbatch --nodes=4 --ntasks-per-node=30 uw3_slurm_job.sh
```
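
Putting the pieces together, a per-user job script looks roughly like the sketch below. The resource numbers and `my_model.py` are illustrative, and the full templates live in the kaiju-admin-notes repo; the heredoc and `bash -n` wrapper are only there so the sketch can be generated and syntax-checked locally before submitting.

```bash
# Generate a minimal starting-point job script (sketch; edit before use)
cat > uw3_job_example.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=uw3
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=30
#SBATCH --output=uw3_%j.out

# Per-user install: set up the environment on the compute node
source ~/install_scripts/uw3_install_kaiju_amr.sh

# --mpi=pmix is required on Kaiju
srun --mpi=pmix python3 my_model.py
EOF

bash -n uw3_job_example.sh && echo "syntax ok"  # then: sbatch uw3_job_example.sh
```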

---

## Shared Installation (Admin)

A system-wide installation can be deployed to `/opt/cluster/software/underworld3/` so all users access it via Environment Modules:

```bash
module load underworld3/development-12Mar26
```

Run as an admin with write access to `/opt/cluster/software`:

```bash
source uw3_install_kaiju_shared.sh install
```

This script is identical to the per-user script except:
- `INSTALL_PATH=/opt/cluster/software`
- Adds `fix_permissions()` — sets world-readable permissions after install
- Adds `install_modulefile()` — copies the TCL modulefile with a date-stamped name to `/opt/cluster/modulefiles/underworld3/`

The modulefile (`modulefiles/underworld3/development.tcl`) hardcodes the spack OpenMPI and pixi env paths. If spack is rebuilt (hash changes), update `mpi_root` in the modulefile.

### Slurm job script (shared install)

Users with the shared install should use `uw3_slurm_job_shared.sh`:

```bash
# Edit UW3_MODULE and SCRIPT at the top, then:
sbatch uw3_slurm_job_shared.sh
```

The key difference from the per-user job script is environment setup:

```bash
# Shared install: load module
module load underworld3/development-12Mar26

# Per-user install: source install script
source ~/install_scripts/uw3_install_kaiju_amr.sh
```

---

## Troubleshooting

### `import underworld3` fails on compute nodes

Environment set up only in your login shell may not propagate to compute nodes. Source the install script inside the job script itself so all paths are set where the job actually runs; the `uw3_slurm_job.sh` template does this correctly.

### h5py HDF5 version mismatch

h5py must be built against the same HDF5 that the PETSc build installed. If you see HDF5 version-mismatch errors, rebuild it:

```bash
source uw3_install_kaiju_amr.sh
install_h5py
```

### PETSc needs rebuilding after Spack module update

PETSc links against Spack's OpenMPI at build time. If `openmpi@4.1.6` is reinstalled or updated, rebuild PETSc:

```bash
source uw3_install_kaiju_amr.sh
rm -rf ~/uw3-installation/underworld3/petsc-custom/petsc
install_petsc
install_h5py
```

### h5py replaces source-built mpi4py

`pip install h5py` without `--no-deps` silently replaces the source-built mpi4py (spack OpenMPI) with a pre-built wheel linked to a different MPI. Always use `--no-deps` when installing h5py. The install script handles this correctly.

If mpi4py was accidentally replaced, rebuild it from source:
```bash
source uw3_install_kaiju_amr.sh
pip install --no-binary :all: --no-cache-dir --force-reinstall "mpi4py>=4,<5"
```

Verify it links to spack OpenMPI:
```bash
ldd "$(python3 -c "import mpi4py, os; print(os.path.dirname(mpi4py.__file__))")"/MPI*.so | grep libmpi
# Should show: libmpi.so.40 => /opt/cluster/spack/.../openmpi-4.1.6-.../lib/libmpi.so.40
```

### numpy ABI mismatch after h5py install

If numpy is upgraded after petsc4py is compiled, `import petsc4py` fails with:
```
ValueError: numpy.dtype size changed, may indicate binary incompatibility.
```

Fix: restore the numpy version used during the PETSc build, then rebuild h5py:
```bash
pip install --force-reinstall "numpy==1.26.4"
CC=mpicc HDF5_MPI="ON" HDF5_DIR="${PETSC_DIR}/${PETSC_ARCH}" \
pip install --no-binary=h5py --no-cache-dir --force-reinstall --no-deps h5py
```

### PARMMG configure failure (pixi ld + spack transitive deps)

pixi's conda linker (`ld` 14.x) requires transitive shared library dependencies to be explicitly linked. `libmmg.so` built with SCOTCH support causes PARMMG's `MMG_WORKS` link test to fail because `libscotch.so` is not explicitly passed. This is fixed in `petsc-custom/build-petsc-kaiju.sh` by building MMG without SCOTCH (`-DUSE_SCOTCH=OFF`). PARMMG uses ptscotch separately for parallel partitioning, which is unaffected.
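
For reference, the MMG-side flags this fix implies look like the following sketch (collected from this guide; the authoritative values are in `petsc-custom/build-petsc-kaiju.sh`):

```bash
# MMG cmake flags used on Kaiju (sketch): SCOTCH off to satisfy PARMMG's
# MMG_WORKS link test, private headers installed for the PARMMG build
MMG_CMAKE_FLAGS="-DUSE_SCOTCH=OFF -DMMG_INSTALL_PRIVATE_HEADERS=ON"
echo "$MMG_CMAKE_FLAGS"   # pass to cmake in the MMG build directory
```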

### Checking what's installed

```bash
source uw3_install_kaiju_amr.sh
verify_install
```

---

## Rebuilding Underworld3 after source changes

After pulling new UW3 code:

```bash
source uw3_install_kaiju_amr.sh
cd ~/uw3-installation/underworld3
git pull
pip install -e .
```

---

## Related

- [Development Setup](development-setup.md) — local development with pixi
- [Branching Strategy](branching-strategy.md) — git workflow
- [Parallel Computing](../../advanced/parallel-computing.md) — writing parallel-safe UW3 code
1 change: 1 addition & 0 deletions docs/developer/index.md
@@ -114,6 +114,7 @@ guides/SPELLING_CONVENTION
guides/version-management
guides/branching-strategy
guides/BINDER_CONTAINER_SETUP
guides/kaiju-cluster-setup
```

```{toctree}