A Device-Side Execution Model for Multi-GPU Task Graphs

A.k.a. Mustard: MUlti-gpu Scheduling of TAsk gRaphs on the Device

Mustard is a device-side execution model for static task graphs. It moves runtime functionality onto the GPU to minimize runtime overhead, and it transforms a single-GPU CUDA graph so that it can run across multiple GPUs without code changes and without learning the API of a runtime system. Dependencies and load are tracked on the device, which removes the need for in-kernel synchronisation and adds little overhead. Memory management, data transfers, and task allocation all happen on the device.
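To make the idea of tracking dependencies and load concrete, here is a minimal host-side sketch in Python. It is not Mustard's device code: the function name `run_task_graph`, the task names, and the "least-loaded GPU" rule are illustrative assumptions. It only shows the general counter-based technique, in which a task becomes ready once the count of its unfinished predecessors reaches zero.

```python
from collections import deque


def run_task_graph(deps, num_gpus):
    """Simulate counter-based scheduling of a static task graph.

    deps maps each task name to the list of tasks it depends on.
    Returns the (task, gpu) pairs in the order they were dispatched.
    Hypothetical illustration only; Mustard does this on the device.
    """
    # Number of unfinished predecessors for every task.
    pending = {task: len(preds) for task, preds in deps.items()}
    # Reverse edges: which tasks wait on a given task.
    successors = {task: [] for task in deps}
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    ready = deque(task for task, count in pending.items() if count == 0)
    load = [0] * num_gpus  # cumulative number of tasks assigned to each GPU
    order = []
    while ready:
        task = ready.popleft()
        gpu = load.index(min(load))  # least-loaded GPU, lowest index on ties
        load[gpu] += 1
        order.append((task, gpu))
        # Completing a task decrements the counters of its successors.
        for succ in successors[task]:
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(succ)
    if len(order) != len(deps):
        raise ValueError("task graph contains a cycle")
    return order


if __name__ == "__main__":
    # One step of a tiled LU factorization, with made-up task names.
    graph = {
        "getrf": [],
        "trsm_row": ["getrf"],
        "trsm_col": ["getrf"],
        "gemm": ["trsm_row", "trsm_col"],
    }
    for task, gpu in run_task_graph(graph, num_gpus=2):
        print(f"{task} -> GPU {gpu}")
```

The two `trsm` tasks become ready together once `getrf` finishes, so they can land on different GPUs, while `gemm` waits until both counters have been decremented.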

Paper: Mustard – ACM ICS '25

Key Results

In a multi-node scenario with 64 GPUs, Mustard achieves an average 5.83× speedup over the linear algebra library SLATE.
On a single node, compared to the best-performing baseline, Mustard delivers an average 1.66× speedup for LU and 1.29× for Cholesky.

Single-node performance (8 GPUs, N = 24 000)

Multi-node performance (8 × 8 GPUs)

Dependencies

Dependency	Minimum Version	Notes
C++ compiler	C++17 support	GCC ≥ 11 recommended
CUDA Toolkit	12.3
NVSHMEM	2.7.0	InfiniBand GPUDirect Async support*
OpenMPI	4.1.4	Or any MPI implementation supported by NVSHMEM
CMake	3.23

* GPUDirect Async is required for multi-node execution.

Optional (baselines)

Dependency	Purpose
StarPU ≥ 1.3	StarPU LU / Cholesky baselines
SLATE	SLATE LU / Cholesky baselines
Intel MKL	Required if your SLATE installation was built against MKL

Plotting scripts

Package	Install
Python ≥ 3.8	—
seaborn	`pip install seaborn`
pandas	`pip install pandas`
matplotlib	`pip install matplotlib`

Or install all at once:

pip install seaborn pandas matplotlib

Building

1. Environment

Make sure the CUDA toolkit, NVSHMEM, and MPI are visible. A typical setup:

export CUDA_HOME=/usr/local/cuda-12.3
export NVSHMEM_HOME=/path/to/nvshmem-2.10
export PATH="$CUDA_HOME/bin:$NVSHMEM_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH"
export NVSHMEM_BOOTSTRAP=MPI          # required for multi-PE runs
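Before configuring, it can help to confirm that the variables above are actually set in the current shell. The following is a small sketch, not part of the repository; the helper name `missing_env` is hypothetical, and the list of variables is taken from the exports shown above.

```python
import os

# Variables exported in the environment setup above.
REQUIRED = ("CUDA_HOME", "NVSHMEM_HOME", "NVSHMEM_BOOTSTRAP")


def missing_env(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```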

For baselines, also set:

# StarPU (for StarPU baselines)
export STARPU_DIR=/path/to/starpu              # StarPU install prefix
export PKG_CONFIG_PATH="$STARPU_DIR/lib/pkgconfig:$PKG_CONFIG_PATH"

# SLATE (for SLATE baselines)
export SLATE_DIR=/path/to/slate                # SLATE install prefix
export LD_LIBRARY_PATH="$SLATE_DIR/lib:$LD_LIBRARY_PATH"

# MKL (only if SLATE was built against MKL)
export MKLROOT=/opt/intel/oneapi/2024.1        # or wherever MKL is installed
export LD_LIBRARY_PATH="$MKLROOT/lib:$LD_LIBRARY_PATH"

2. Configure & compile

mkdir -p build && cd build

cmake .. \
    -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc \
    -DNVSHMEM_DIR=$NVSHMEM_HOME/lib/cmake/nvshmem \
    -DMUSTARD_CUDA_ARCHITECTURES=80        # adjust to your GPU (e.g. 80 for A100, 90 for H100)

make -j$(nproc)

This produces three executables in the build/ directory:

Binary	Description
`lu_mustard`	LU decomposition (single-node, 1–N GPUs)
`cholesky_mustard`	Cholesky decomposition (single-node, 1–N GPUs)
`p_lu_mustard`	Partitioned LU (multi-node)

CMake options

Option	Default	Description
`MUSTARD_CUDA_ARCHITECTURES`	`native`	CUDA SM architectures (e.g. `80`, `80;90`)
`MUSTARD_CUDA_MIN_VERSION`	`12.3`	Minimum required CUDA version
`MUSTARD_BUILD_BASELINES`	`OFF`	Build baselines (cuSOLVER-Mg, StarPU, SLATE)

3. Building baselines (optional)

Pass -DMUSTARD_BUILD_BASELINES=ON and point CMake to the baseline dependencies. Each baseline is skipped gracefully if its dependency is not found, so you can build whichever subset you have installed.

cmake .. \
    -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc \
    -DNVSHMEM_DIR=$NVSHMEM_HOME/lib/cmake/nvshmem \
    -DMUSTARD_CUDA_ARCHITECTURES=80 \
    -DMUSTARD_BUILD_BASELINES=ON \
    -DCMAKE_PREFIX_PATH="$SLATE_DIR"       # for SLATE + blaspp + lapackpp

make -j$(nproc)

StarPU is discovered via pkg-config (set PKG_CONFIG_PATH as shown above). SLATE is discovered via CMake's find_package; if SLATE was built against MKL, set MKLROOT so the linker can find the MKL shared libraries.

Baseline target	Dependency	Discovery
`cusolver_MgGetrf_example`	CUDA Toolkit	Automatic
`cusolver_MgPotrf_example`	CUDA Toolkit	Automatic
`starpu_lu_example`	StarPU ≥ 1.3	`pkg-config` (`PKG_CONFIG_PATH`)
`starpu_cholesky_tile_tag`	StarPU ≥ 1.3	`pkg-config` (`PKG_CONFIG_PATH`)
`lu_slate`	SLATE	`CMAKE_PREFIX_PATH` / `slate_DIR`
`chol_slate`	SLATE	`CMAKE_PREFIX_PATH` / `slate_DIR`

Running

All Mustard executables initialise NVSHMEM, which is bootstrapped through MPI. The number of GPUs equals the number of MPI ranks (NVSHMEM PEs), so multi-GPU runs are started with mpirun.

Single GPU

./lu_mustard -n=600 -t=2 --tiled --verify
./cholesky_mustard -n=600 -t=2 --subgraph --verify

Multiple GPUs (single node)

mpirun -np 4 ./lu_mustard -n=6000 -t=10 --subgraph -r=5
mpirun -np 8 ./cholesky_mustard -n=24000 -t=8 --tiled -r=10

Multiple nodes (partitioned LU)

mpirun -np 64 ./p_lu_mustard -n=48000 -t=64 -r=5 --verify
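The launch commands above all follow one pattern: the rank count passed to `mpirun -np` is the GPU count, and a single-GPU run needs no `mpirun` prefix. For scripted sweeps, that pattern could be captured in a helper like the one below. This is a sketch only; `mustard_cmd` is a hypothetical name and not part of the repository, and it uses only flags that appear in this README.

```python
import shlex


def mustard_cmd(binary, gpus, n, tiles, mode=None, runs=1, verify=False):
    """Build the argument list for one Mustard run (hypothetical helper)."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    if mode not in (None, "tiled", "subgraph"):
        raise ValueError("mode must be None, 'tiled' or 'subgraph'")
    # One MPI rank per GPU; a single GPU is launched directly.
    cmd = [] if gpus == 1 else ["mpirun", "-np", str(gpus)]
    cmd += [binary, f"-n={n}", f"-t={tiles}"]
    if mode:
        cmd.append(f"--{mode}")
    if runs != 1:
        cmd.append(f"-r={runs}")
    if verify:
        cmd.append("--verify")
    return cmd


if __name__ == "__main__":
    # Print a sweep over GPU counts for the single-node LU binary.
    for gpus in (1, 2, 4, 8):
        cmd = mustard_cmd("./lu_mustard", gpus, n=24000, tiles=8,
                          mode="tiled", runs=10)
        print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.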

CLI reference

Run any executable with --help to see all options:

$ ./lu_mustard --help
Usage: ./lu_mustard [options]

  LU decomposition on one or more GPUs using CUDA graphs.
  The number of GPUs is determined by the number of NVSHMEM PEs (MPI ranks).

  Mode (pick one; default is single-kernel if none given):
    --tiled              Tiled execution (one graph per tile step)
    --subgraph           Sub-graph insertion execution

  Common options:
    -n, -N=<int>         Matrix dimension N                       [default: 15]
    -t, -T=<int>         Number of tiles (N must be divisible)    [default: 5]
    --sm, --smLimit=<int> SM limit per kernel (1-108)             [default: 20]
    --ws, --workspace=<int> cuBLAS workspace in kB (1-1048576)   [default: 256]
    -r, --runs=<int>     Number of timing runs                    [default: 1]
    -v, --verbose        Enable verbose output
    --verify             Verify result correctness
    --dot                Dump execution graph in DOT format

  Examples:
    ./lu_mustard -n=600 -t=2 --tiled --verify
    mpirun -np 4 ./lu_mustard -n=6000 -t=10 --subgraph -r=5

p_lu_mustard additionally accepts -p, -P=<int> to select which PE's graph to print (default 0, use -1 to disable).
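The help text notes that N must be divisible by the number of tiles. A quick pre-flight check such as the following can catch an invalid pair before a long run is queued. It is a sketch under one assumption that the help text does not spell out: that `-t` counts tiles along one matrix dimension, so the tile edge is N / T. The name `tile_size` is hypothetical.

```python
def tile_size(n, tiles):
    """Return the tile edge length for an n x n matrix, or raise if invalid.

    Assumes `tiles` is the tile count along one dimension (an assumption,
    not confirmed by the help text).
    """
    if n <= 0 or tiles <= 0:
        raise ValueError("n and tiles must be positive")
    if n % tiles != 0:
        raise ValueError(f"n={n} is not divisible by tiles={tiles}")
    return n // tiles


if __name__ == "__main__":
    # The (n, t) pairs used in the examples of this README.
    for n, t in [(600, 2), (6000, 10), (24000, 8), (48000, 64)]:
        print(f"-n={n} -t={t}: tile edge {tile_size(n, t)}")
```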

Project Structure

mustard/
├── CMakeLists.txt              # Top-level build
├── include/                    # Shared headers
│   ├── cli.h                   # CLI parsing & help messages
│   ├── gen.h                   # Matrix generation (CPU & GPU)
│   ├── verify.h                # LU / Cholesky verification
│   ├── nvshmem_kernels.h       # NVSHMEM put/get device kernels
│   ├── gpu_debug.h             # Device-side debug utilities
│   └── utils.h                 # CUDA error checking, timing
├── mustard/                    # Core library & executables
│   ├── mustard.h               # TiledGraphCreator, CUDA graph helpers
│   ├── lu_mustard.cu           # Single-node LU
│   ├── cholesky_mustard.cu     # Single-node Cholesky
│   └── lu_mustard_p.cu         # Partitioned multi-node LU
├── baselines/                  # Baseline implementations
│   ├── cusolver_Mg/            # cuSOLVER multi-GPU (LU & Cholesky)
│   ├── starpu/                 # StarPU task-based runtime
│   │   ├── lu/                 # StarPU LU kernel
│   │   └── cholesky/           # StarPU Cholesky kernel
│   └── slate/                  # SLATE distributed dense LA
└── scripts/                    # Plotting & benchmarking scripts
    ├── calc.py                 # Bar-chart FLOPS analysis
    ├── plot.py                 # Line-plot generator
    ├── lu_all.sh               # LU benchmark runner
    ├── chol_all.sh             # Cholesky benchmark runner
    ├── figures/                # Pre-generated result figures
    └── results/                # Raw result data

Plotting scripts

Generate performance bar charts:

# Run from scripts/results/
cd scripts/results
python ../calc.py -m lu -s 24000
python ../calc.py -m chol -s 24000

Generate line plots:

# Run from scripts/
cd scripts
python plot.py -m lu -s 24000 -f png -l      # PNG with legend
python plot.py -m chol -s 48000 -f pdf        # PDF

Options for both scripts: -m {lu,chol}, -s <size>, -l (legend), -t (title). plot.py also accepts -f {pdf,png,jpg} and -show (display interactively).
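For readers who want to sanity-check a FLOPS figure by hand, the sketch below converts a measured runtime into TFLOP/s using the conventional leading-order operation counts for dense factorizations: 2N³/3 for LU and N³/3 for Cholesky. Whether calc.py uses exactly these counts is an assumption, and the function names and the 1-second runtime are hypothetical placeholders, not measured results.

```python
def factorization_flops(n, kind):
    """Leading-order floating-point operation count for an n x n factorization."""
    if kind == "lu":
        return 2 * n**3 / 3
    if kind == "chol":
        return n**3 / 3
    raise ValueError("kind must be 'lu' or 'chol'")


def tflops(n, kind, seconds):
    """Achieved TFLOP/s for a factorization that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return factorization_flops(n, kind) / seconds / 1e12


if __name__ == "__main__":
    # Placeholder runtime of 1 s, chosen only to make the arithmetic visible.
    for kind in ("lu", "chol"):
        print(f"{kind}: {tflops(24000, kind, 1.0):.3f} TFLOP/s at N = 24000")
```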

Acknowledgement

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 949587).
