|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +NeMo Run is a tool for configuring, executing, and managing ML experiments across various computing environments. Its three core pillars are: |
| 8 | +1. **Configuration** - Python-native config using Google's Fiddle library |
| 9 | +2. **Execution** - Running tasks on local machines, SLURM clusters, Docker, and cloud platforms (SkyPilot, DGX Cloud, Lepton) |
| 10 | +3. **Management** - Tracking experiment metadata locally in `NEMORUN_HOME` (default: `~/.nemo_run`) |
| 11 | + |
| 12 | +## Commands |
| 13 | + |
| 14 | +```bash |
| 15 | +# Install for development |
| 16 | +uv sync --extra skypilot |
| 17 | + |
| 18 | +# Run tests (slow tests are skipped by default) |
| 19 | +uv run -- pytest test/ |
| 20 | + |
| 21 | +# Run a single test |
| 22 | +uv run -- pytest test/test_config.py::TestClass::test_method |
| 23 | + |
| 24 | +# Run including slow tests |
| 25 | +uv run -- pytest -m "" test/ |
| 26 | + |
| 27 | +# Lint |
| 28 | +uv run --group lint -- ruff check |
| 29 | + |
| 30 | +# Format |
| 31 | +uv run --group lint -- ruff format |
| 32 | + |
| 33 | +# Run with coverage |
| 34 | +uv run -- coverage run --branch --source=nemo_run -a -m pytest test/ |
| 35 | +uv run -- coverage report -m |
| 36 | +``` |
| 37 | + |
| 38 | +Line length is 100 (configured in `pyproject.toml` under `[tool.ruff]`). |
| 39 | + |
| 40 | +## Architecture |
| 41 | + |
| 42 | +### Core Abstractions |
| 43 | + |
| 44 | +**`Config[T]` / `Partial[T]`** (`nemo_run/config.py`): Built on Fiddle. `Config` instantiates the target directly when built; `Partial` creates a `functools.partial`. `Script` wraps shell commands. These are the primary user-facing types. |
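The difference between the two build behaviors can be illustrated with a standard-library-only sketch. `ToyConfig` and `ToyPartial` below are hypothetical stand-ins, not NeMo Run or Fiddle classes; they only mirror the semantics described above (call the target now versus return a `functools.partial`).

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToyConfig:
    """Hypothetical stand-in for Config: building calls the target immediately."""

    target: Callable[..., Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def build(self) -> Any:
        return self.target(**self.kwargs)


@dataclass
class ToyPartial:
    """Hypothetical stand-in for Partial: building defers the call."""

    target: Callable[..., Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def build(self) -> functools.partial:
        return functools.partial(self.target, **self.kwargs)


def train(lr: float, epochs: int = 1) -> str:
    # Illustrative target function, not part of the repository.
    return f"lr={lr}, epochs={epochs}"


built = ToyConfig(train, {"lr": 0.1}).build()      # train() has already run
deferred = ToyPartial(train, {"lr": 0.1}).build()  # nothing has run yet
later = deferred(epochs=3)                         # runs now, with one more argument
```

In this sketch `built` is already the return value of `train`, while `deferred` is a callable that can still accept further arguments, which is the property that makes a partial suitable as a task to run later.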
| 45 | + |
| 46 | +**`Executor`** (`nemo_run/core/execution/base.py`): Abstract base for all execution environments. Key fields: `packager`, `launcher`, `env_vars`, `retries`. Implementations: |
| 47 | +- `LocalExecutor` - direct local execution |
| 48 | +- `DockerExecutor` - via Docker |
| 49 | +- `SlurmExecutor` - HPC via SLURM + SSH tunnel |
| 50 | +- `SkypilotExecutor` / `SkypilotJobsExecutor` - multi-cloud via SkyPilot |
| 51 | +- `DGXCloudExecutor` - NVIDIA DGX Cloud |
| 52 | +- `LeptonExecutor` - Lepton AI |
| 53 | + |
| 54 | +**`Experiment`** (`nemo_run/run/experiment.py`): Context manager that groups multiple tasks/jobs and handles parallel execution, log syncing, state tracking, and plugin hooks. Uses TorchX (`torchx>=0.7.0`) as the distributed execution backend. |
| 55 | + |
| 56 | +**`Packager`** (`nemo_run/core/packaging/`): Strategies to bundle code for remote execution: |
| 57 | +- `GitArchivePackager` - packages via `git archive` |
| 58 | +- `PatternPackager` - file glob patterns |
| 59 | +- `HybridPackager` - combines strategies |
| 60 | + |
| 61 | +**`Launcher`** (`nemo_run/core/execution/launcher.py`): Controls how tasks are launched within an executor. Options: `Torchrun`, `FaultTolerance` (NVIDIA), `SlurmRay`, `SlurmTemplate`. |
| 62 | + |
| 63 | +**Tunnels** (`nemo_run/core/tunnel/`): `SSHTunnel` for remote cluster access with rsync for file syncing. |
| 64 | + |
| 65 | +### Data Flow |
| 66 | + |
| 67 | +1. User defines a function/class and wraps it in `run.Config` or `run.Partial` |
| 68 | +2. An `Executor` is configured (with `Packager` + optional `Launcher`) |
| 69 | +3. The task is executed with `run.run(task, executor)` or inside a `run.Experiment` |
| 70 | +4. TorchX schedulers (registered as entry points in `pyproject.toml`) dispatch work |
| 71 | +5. Metadata is stored in `NEMORUN_HOME` (default: `~/.nemo_run`) for experiment tracking |
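The metadata location in step 5 follows the `NEMORUN_HOME` setting from the overview, which defaults to `~/.nemo_run`. A minimal sketch of that lookup is shown below. The helper name `resolve_nemorun_home` is hypothetical, and how NeMo Run performs this resolution internally is an assumption, not something this file states.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_nemorun_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper: return NEMORUN_HOME if set, else ~/.nemo_run."""
    env = os.environ if env is None else env
    # expanduser() turns a leading "~" into the current user's home directory.
    return Path(env.get("NEMORUN_HOME", "~/.nemo_run")).expanduser()


# With an explicit override, the given directory is used as-is.
overridden = resolve_nemorun_home({"NEMORUN_HOME": "/tmp/nemo_run_test"})

# Without the variable, the documented default applies.
default = resolve_nemorun_home({})
```

Passing the environment as a parameter keeps the sketch easy to exercise in tests without mutating `os.environ`.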
| 72 | + |
| 73 | +### CLI |
| 74 | + |
| 75 | +Entry points `nemorun` / `nemo` (via Typer) provide experiment management and configuration inspection. The CLI uses lazy imports (`nemo_run/cli/lazy.py`) for fast startup. It is extensible via the `nemo_run.cli.entrypoints` namespace. |
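To make the lazy-import idea concrete, here is a generic sketch of the technique: a proxy object that imports a module only on first attribute access. This is not the contents of `nemo_run/cli/lazy.py`; the `LazyModule` class is a hypothetical illustration built only on `importlib`.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Hypothetical proxy that defers importing a module until it is used."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def __getattr__(self, attr: str) -> Any:
        # Only reached for attributes not found on the proxy itself,
        # so _name and _module never trigger this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


lazy_json = LazyModule("json")            # no import has happened yet
encoded = lazy_json.dumps({"a": 1})       # first access triggers the import
```

A CLI benefits from this pattern because commands such as `--help` can start without paying the import cost of heavy dependencies that the invoked command never touches.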
| 76 | + |
| 77 | +### Serialization |
| 78 | + |
| 79 | +Configurations can be serialized to YAML (`nemo_run/core/serialization/yaml.py`) or compressed JSON (`zlib_json.py`) for persistence. |
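As a rough illustration of what "compressed JSON" can look like, the sketch below round-trips a plain dictionary through `json`, `zlib`, and `base64`. The function names and the base64 step are assumptions made for the example; the actual format used by `zlib_json.py` is not described in this file.

```python
import base64
import json
import zlib
from typing import Any


def dumps_compressed(obj: Any) -> str:
    """Serialize to JSON, compress with zlib, and encode as ASCII text."""
    raw = json.dumps(obj, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def loads_compressed(text: str) -> Any:
    """Reverse dumps_compressed: decode, decompress, and parse JSON."""
    raw = zlib.decompress(base64.b64decode(text))
    return json.loads(raw.decode("utf-8"))


# Illustrative payload, not a real NeMo Run configuration.
config = {"model": {"hidden_size": 1024, "layers": 24}, "lr": 0.0003}
restored = loads_compressed(dumps_compressed(config))
```

The base64 layer keeps the compressed bytes safe to store in text files or pass through environment variables and command-line arguments.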
| 80 | + |
| 81 | +### Plugin System |
| 82 | + |
| 83 | +`ExperimentPlugin` (`nemo_run/run/plugin.py`) provides hooks into the experiment lifecycle. |
| 84 | + |
| 85 | +## Key Files |
| 86 | + |
| 87 | +- `nemo_run/api.py` - all public exports |
| 88 | +- `nemo_run/config.py` - `Config`, `Partial`, `Script` classes |
| 89 | +- `nemo_run/run/experiment.py` - `Experiment` context manager |
| 90 | +- `nemo_run/core/execution/base.py` - `Executor` base class |
| 91 | +- `nemo_run/core/execution/slurm.py` - most complex executor (SLURM + SSH) |
| 92 | +- `test/conftest.py` - shared fixtures |
| 93 | + |
| 94 | +## Testing Notes |
| 95 | + |
| 96 | +- Tests marked `slow` are skipped by default (`addopts = -m "not slow"` in `pyproject.toml`) |
| 97 | +- `INCLUDE_WORKSPACE_FILE` env var controls workspace-related test behavior |
| 98 | +- The test directory is added to `PYTHONPATH` via the `add_test_to_pythonpath` fixture |